RAMP: Retrieval and Attribute-Marking Enhanced Prompting for Attribute-Controlled Translation

Attribute-controlled translation (ACT) is a subtask of machine translation that involves controlling stylistic or linguistic attributes (like formality and gender) of translation outputs. While ACT has garnered attention in recent years due to its usefulness in real-world applications, progress in the task is currently limited by dataset availability, since most prior approaches rely on supervised methods. To address this limitation, we propose Retrieval and Attribute-Marking enhanced Prompting (RAMP), which leverages large multilingual language models to perform ACT in few-shot and zero-shot settings. RAMP improves generation accuracy over the standard prompting approach by (1) incorporating a semantic similarity retrieval component for selecting similar in-context examples, and (2) marking in-context examples with attribute annotations. Our comprehensive experiments show that RAMP is a viable approach in both zero-shot and few-shot settings.


Introduction
Text style transfer (TST) is a task that aims to control stylistic attributes of an input text without affecting its semantic content (Jin et al., 2022). Research in TST has largely focused on English, thanks to the availability of large monolingual English datasets covering stylistic attributes like formality and simplicity (Rao and Tetreault 2018; Zhu et al. 2010, inter alia). In recent years, however, multilingual and cross-lingual applications of TST have seen a steady gain in popularity (Briakou et al., 2021; Garcia et al., 2021; Krishna et al., 2022). A notable instance of cross-lingual TST is attribute-controlled translation (ACT), in which attribute conditioning is performed alongside machine translation (MT) to ensure that translations are not only correct but match user-specified preferences, such as formality/honorifics (Sennrich et al., 2016; Niu et al., 2017; Michel and Neubig, 2018; Niu and Carpuat, 2020; Nadejde et al., 2022; Wang et al., 2022), gender (Rabinovich et al., 2017; Vanmassenhove et al., 2018; Saunders and Byrne, 2020), and length (Lakew et al., 2019; Schioppa et al., 2021).
ACT is especially important for sectors like customer service and business communication, where stylistic differences can have an impact on user perception (e.g., misgendering customers or speaking to them in an inappropriately informal tone can be offensive or disconcerting). Table 1 gives examples of ACT for formality and gender. Most prior work on ACT relies on a supervised adaptation component that conditions the generative model on the selected attribute. However, few annotated ACT datasets are available, and they generally cover only a limited set of languages and attributes. Thus, enabling few-shot or zero-shot ACT would facilitate applying attribute control to less-resourced attributes and languages.
In this paper, we introduce a new approach for ACT: Retrieval and Attribute-Marking enhanced Prompting (RAMP). Recent studies have shown that large language models (LLMs) can perform MT out of the box using the prompting paradigm (Brown et al., 2020; Lin et al., 2022; Chowdhery et al., 2022). We build on this, prompting LLMs to perform attribute-controlled MT through two innovations: (1) semantic similarity retrieval of in-context examples, and (2) explicit attribute marking. Recent works adopting the prompting paradigm for text style transfer have mainly focused on the generalization capabilities of large English-centric LMs for zero-shot style transfer using previously unseen style descriptions (Suzgun et al., 2022; Reif et al., 2022). However, prior work on other NLP tasks has shown that cross-lingual prompting of multilingual LLMs can be effective (Zhao and Schütze, 2021; Zhou et al., 2022; Huang et al., 2022). As such, we leverage multilingual LLMs and extend their ACT capabilities cross-lingually to languages not covered by the in-context examples, thus enabling zero-shot ACT.

Preliminaries
Attribute-Controlled Translation ACT takes two inputs, a sentence x and a desired target attribute a ∈ A (with A being the space of attributes), and outputs a translation y that complies with the specified attribute. It can be formulated as a function f : (x, a) → y. In our experiments, we use attribute values provided by the COCOA-MT formality translation dataset and the MT-GENEVAL gender translation dataset, i.e., A = {formal, informal} or {female, male} (see Section 5 for ethical considerations).

Prompting In the prompting paradigm for decoder-only LLMs, inputs are given as decoding prefixes to the model, usually combined with natural language instructions for output generation. In style-controlled translation, we formulate the prompt for target language l and attribute a using the text "Here is a sentence: {x} Here is its {l} translation written in a {a} style:" to produce the output y. In the few-shot setting, we provide a sequence of k labeled in-context examples before the unlabeled input, which can be formulated as a function f : {(x1, l1, a, y1), ..., (xk, lk, a, yk), (xk+1, lk+1, a)} → yk+1.
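The prompt formulation above can be sketched as a small helper. The template string is taken from the paper; the function name and the (source, target) structure of the in-context examples are illustrative assumptions, not part of the released code.

```python
# Minimal sketch of k-shot ACT prompt construction (names are assumptions).

TEMPLATE = ("Here is a sentence: {x} "
            "Here is its {l} translation written in a {a} style:")

def build_prompt(x, lang, attr, examples=()):
    """Render a k-shot ACT prompt: each labeled in-context example is the
    filled template followed by its reference translation; the final,
    unlabeled input ends at the cue for the model to generate y."""
    parts = [TEMPLATE.format(x=src, l=lang, a=attr) + " " + tgt
             for src, tgt in examples]
    parts.append(TEMPLATE.format(x=x, l=lang, a=attr))
    return "\n".join(parts)
```

With `examples=()` this reduces to the zero-shot prompt; passing k labeled pairs yields the few-shot formulation f : {(x1, l1, a, y1), ..., (xk+1, lk+1, a)} → yk+1.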

Our Approach: RAMP
RAMP builds on the success of the prompting paradigm on few-shot generation tasks such as monolingual text style transfer (Reif et al., 2022) and MT (Garcia and Firat, 2022; Agrawal et al., 2022), creating more informative prompts through similarity retrieval and attribute marking. See Figure 1 for an illustration of RAMP.

Similarity Retrieval
In standard prompting, in-context examples are sampled randomly from the pool of labeled examples D_A. In RAMP, we select examples based on their similarity with the input text. We first embed both the input text and the source texts of D_A using all-MiniLM-L6-v2 (Wang et al., 2020). Then, the top-k most similar examples are retrieved for the input text based on cosine similarity. These are then used in descending order of similarity as the in-context examples in the inference prompt. As demonstrated in Figure 1, the in-context example "You will always be welcome here." has the highest similarity to the test example "You're welcome.", so it is prompted first.

Attribute Marking

Following recent work (2022), we include for every in-context example an additional sentence directly after the target sentence that specifies which text spans convey the desired attribute (e.g., "The translated sentence conveys a formal style by using words such as 'Vous'."). In our experiments, we use the gold attribute spans included in the CoCoA-MT and MT-GenEval datasets. In Section 4 we suggest possibilities for automatically deriving attribute spans when gold training labels are not available.
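The retrieval step can be sketched as follows. To keep the sketch self-contained, a bag-of-words counter stands in for the sentence embedder; RAMP itself uses all-MiniLM-L6-v2 sentence embeddings. The pool's `"source"` field name is an assumption for illustration.

```python
import math
import re
from collections import Counter

def embed(text):
    # Stand-in bag-of-words embedding; RAMP instead embeds texts with the
    # all-MiniLM-L6-v2 sentence encoder.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c * v[w] for w, c in u.items())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve_top_k(query, pool, k):
    """Return the k labeled examples whose source text is most similar to
    the query, in descending order of similarity, which is the order in
    which RAMP places them in the prompt."""
    q = embed(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, embed(ex["source"])),
                    reverse=True)
    return ranked[:k]
```

In the Figure 1 example, "You will always be welcome here." would rank first for the query "You're welcome." and therefore appear first in the prompt.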

Cross-Lingual Prompting
The similarity retrieval component of RAMP requires a large pool D_A from which to find appropriate in-context examples for prompting. Low-resource attributes or language pairs may have insufficient or no annotated data from which to retrieve such examples. To mitigate this issue, we introduce cross-lingual prompting, in which the target side of the in-context examples differs from the desired target language of the translation task. As demonstrated in Figure 1, we study whether the system can leverage examples in one language (e.g., attribute indicators in Spanish) to produce the same attribute in another (e.g., French). Two main features of RAMP allow us to perform cross-lingual prompting: (1) the use of multilingual LLMs, and (2) the example retrieval step, which is done on the source language only.
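The leave-one-out pool construction used for this zero-shot setting can be sketched as below. The `"target_lang"` field name is an illustrative assumption; the key point is that filtering happens on the target side while retrieval still operates on the shared (English) source side.

```python
def cross_lingual_pool(pool, target_lang):
    """Leave-one-out example pool for cross-lingual prompting: keep only
    labeled examples whose target side is NOT in the desired translation
    language. Because similarity retrieval operates on the source
    (English) side only, these examples remain retrievable."""
    return [ex for ex in pool if ex["target_lang"] != target_lang]
```

For instance, when translating into French, the pool would contain only examples with Spanish, Italian, etc. target sides, and the prompt template would still request a French translation.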

Datasets
We experiment on two multilingual ACT datasets:

• COCOA-MT (Nadejde et al., 2022) covers formality-controlled translation in the conversation domain. Source sentences are underspecified for formality, and references require formality markings (formal or informal).

• MT-GENEVAL (Currey et al., 2022) covers gendered translation in the Wikipedia domain. We use the contextual subset, in which sentences are gender-ambiguous in the source while the reference requires gender marking. We do not use the disambiguating sentences, instead explicitly controlling target gender.

Both datasets have gold annotations for attribute-marked target spans, and both cover translation from English into multiple diverse target languages. We list their target languages in Table 2.

Large Language Models (LLMs)
We select three massively multilingual decoder-only LLMs for the prompting experiments: XGLM (Lin et al., 2022), BLOOM (BigScience, 2022), and GPT-NEOX (Black et al., 2022). The selected models span three orders of magnitude in number of parameters and differ in the languages that they cover (see Table 2). Appendix D motivates our choice of models in more detail. GPT-3 is not included because it is not freely accessible and is not intended for multilingual use cases.

Baseline
Attribute tagging is a standard method for ACT, so we include a baseline following the approach and configuration used by Nadejde et al. (2022): a transformer MT model (Vaswani et al., 2017) pre-trained on public parallel data and further finetuned on contrastive training pairs with attribute tags (from either COCOA-MT or MT-GENEVAL).We refer to this as adapted MT.

Evaluation Metrics
We measure translation quality with BLEU (Papineni et al., 2002) and COMET (Rei et al., 2020). For attribute accuracy, we use both (1) the lexical matching metrics provided with COCOA-MT and MT-GENEVAL (Lexical-Accuracy) and (2) sentence encoders trained on contrastive examples (Sentential-Accuracy). For (2), we train multilingual classifiers on top of the mDeBERTa-v3 encoder (He et al., 2021). High-performance pretrained classifiers have been shown to produce attribute accuracy estimates closer to human judgments for style transfer (Lai et al., 2022). Table 3 presents the accuracy of the classification models on the test sets of their respective datasets, averaged over all languages. Unlike lexical accuracy, the multilingual attribute classifier does not penalize text generated in incorrect languages. Thus, in cross-lingual prompting experiments, we include a language detection step so that generated sentences not in the requested target language are considered incorrect.
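The language-gated sentential accuracy described above can be sketched as follows. Here `classify_attr` and `detect_lang` are stand-ins for the trained mDeBERTa-v3 attribute classifier and a language-identification model; their names and signatures are assumptions for illustration.

```python
def sentential_accuracy(outputs, target_lang, target_attr,
                        classify_attr, detect_lang):
    """Sentential attribute accuracy with a language-detection gate: a
    generated sentence counts as correct only if it is in the requested
    target language AND the attribute classifier assigns it the target
    attribute. This mirrors the gating used in the cross-lingual
    prompting experiments."""
    if not outputs:
        return 0.0
    correct = sum(1 for y in outputs
                  if detect_lang(y) == target_lang
                  and classify_attr(y) == target_attr)
    return correct / len(outputs)
```

A toy usage: with two outputs of which only one is both in the target language and formal, the metric returns 0.5.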

Results: Same-Language Prompting
We first evaluate the effectiveness of RAMP for formality- and gender-controlled translation where the language pair used for the in-context examples is the same as the one used in the prompt candidate (e.g., EN→ES formality-controlled translation using EN→ES in-context examples). We test XGLM 7.5B and BLOOM 175B with 16 in-context examples on both tasks. Table 4 presents our results alongside the adapted MT baseline. The base model uses in-context examples that are sampled randomly from the pool of labeled examples. We also include an ablation that adds attribute marking only on top of base, without similarity retrieval (+mark).
Using just attribute marking consistently improves attribute accuracy of the generated text, but it leads to degradation of COMET on COCOA-MT.The complete RAMP with similarity retrieval not only compensates for the COMET degradation but also improves quality and attribute metrics across the board, especially for the high-capacity BLOOM 175B model.
Adapted MT outperforms BLOOM 175B on MT-GENEVAL in all metrics, but underperforms it on COCOA-MT. This suggests that fine-grained comparisons between LLMs and standard MT systems are challenging, as the two may differ in domain coverage. BLOOM 175B consistently outperforms XGLM 7.5B in both generic translation quality and attribute control accuracy, so we proceed with BLOOM 175B in the cross-lingual prompting setting.

Results: Cross-Lingual Prompting
We have demonstrated the effectiveness of selecting similar same-language examples to build the prompt, echoing contemporary work (Liu et al., 2022; Agrawal et al., 2022). In this section, we evaluate the cross-lingual prompting option, i.e., retrieving in-context examples from target languages other than the desired language of translation. We test this zero-shot setting using the leave-one-out strategy, and results of tested language pairs are averaged. Table 4 presents our results using BLOOM 175B. On both test sets, compared to the baseline, we observe improved attribute accuracy and comparable or better generic translation quality when using RAMP with cross-lingual prompting.
We do observe translation quality degradation with RAMP on some target languages of COCOA-MT, e.g., ES. Manual analysis shows that repeated inaccurate retrieval results can lead to hallucinations. For example, RAMP retrieves multiple sentences containing "million" for the input "If you got it why not? He is worth over 20 billion dollars after all.". This results in the mistranslation of "billion" to "millonario" (millionaire): "Si lo tienes, ¿por qué no? Es millonario después de todo.". We give detailed examples in Appendix H.
Conclusion

We introduced RAMP, a new in-context learning approach that leverages attribute annotations and similar same-language or cross-lingual examples for better prompting quality. We demonstrated its effectiveness with multilingual LLMs for both formality-controlled and gender-controlled translation. We use gold annotations for attribute marking, leaving unsupervised automatic attribute span extraction as future work.

Limitations
• We currently rely on gold annotations for attribute marking, which are not always available depending on the dataset. However, RAMP could easily be extended to unsupervised settings through LLM feature attribution (Sarti et al., 2023), i.e., extracting the salient tokens driving the attribute prediction. This approach builds upon recent techniques in unsupervised language generation metrics (Fomicheva et al., 2021, 2022; Leiter et al., 2022). We leave an empirical evaluation of its effectiveness to future work.
• Besides the choice of in-context examples, prompting is also sensitive to their ordering (Lu et al., 2022) and the design of the template (Jiang et al., 2020).We refrain from tuning example orders and templates to avoid introducing too many variables.
• Multilingual LLMs perform MT competitively out of the box for languages seen during their pre-training. However, we noticed that BLOOM 175B produces better EN-IT translations than XGLM 7.5B even though IT is not listed as a training language of BLOOM. This could be due to typological similarity between Italian and the Romance languages included in BLOOM training. We leave experiments on unseen languages as future work.
• Multilingual LLMs like the ones used in this paper require larger GPU resources for inference than standard bilingual MT systems.
• One test set we use (MT-GENEVAL) provides only two gender values (female and male), but we do not intend to imply that other genders do not exist.

A Prompt Templates
Formality-Controlled Translation

Here is a sentence: {x} Here is its {l} translation written in a {a} style: {y} The translated sentence conveys a {a} style by using words such as '{w1}', '{w2}'.

Gender-Controlled Translation
Here is a sentence: {x} Here is its {l} translation in which the person is {a}: {y} In the translation, the {a} gender of the person is made explicit by words such as '{w1}', '{w2}'.

We finetune the MDEBERTA-V3-BASE model (https://huggingface.co/microsoft/mdeberta-v3-base) on the contrastive examples in the respective training sets to get the attribute classifiers. We finetune each classifier for 2 epochs with a batch size of 8, a learning rate of 2e-5, 500 warm-up steps, and a max sequence length of 256, saving a checkpoint every 500 steps. We do not do hyperparameter tuning, and thus a validation set is not used.

D Selection of Large Language Models
XGLM (Lin et al., 2022) is a 7.5B-parameter model trained on a balanced corpus containing 30 languages (excluding NL).It was shown to outperform much larger models such as GPT-3 on tasks related to machine translation and cross-lingual language understanding.We select it due to its broad linguistic coverage and its manageable size.
BLOOM (BigScience, 2022) is a model available in multiple sizes, trained on a curated corpus spanning 46 natural languages (and 13 programming languages). However, many of the test set languages are not part of its pre-training corpus (see Table 2). We evaluate two variants of the model (7.1B and 175B parameters) to assess how it is affected by a massive scaling in model parameters. The larger variant has a parameter count comparable to that of GPT-3, and it is presently the largest publicly available multilingual LLM.

GPT-NEOX (Black et al., 2022) is a 20B-parameter model trained on The Pile (Gao et al., 2021), a large English-centric corpus covering a broad range of domains. While the model saw mainly English data during pre-training and as such is not intended for multilingual usage, it exhibits interesting generalization performance for many of our target languages.

E Preliminary Evaluation of Same-Language Prompting
We conduct preliminary evaluations aimed at reducing the number of experimental settings. We perform formality-controlled translation using COCOA-MT, and evaluate LLMs by varying the number of in-context examples (i.e., 4, 8, 16, or 32, selected based on the feasible context length). Figure 2 presents results averaged across all four languages seen by BLOOM during its pre-training. Observations:

• RAMP generally outperforms base prompting (i.e., random in-context examples and no attribute marking) across most LLMs and example settings for both BLEU and formality accuracy.
• BLEU and formality accuracy improve with increased model size and with the number of examples, until this number reaches 16.
Based on these results, we move forward with the XGLM 7.5B and BLOOM 175B models and 16 examples.

F Detailed Scores of Aggregated Results
• Table 5: Detailed scores of same-language prompting on COCOA-MT (preliminary evaluation).

• Table 6: Decomposed results of same-language prompting on COCOA-MT (full evaluation).

• Table 7: Decomposed results of same-language prompting on MT-GENEVAL (full evaluation).

• Table 8: Decomposed results of cross-lingual prompting on COCOA-MT.
Early truncation leads to slightly lower scores in Table 5 than in Table 4.

G Amended Details of Cross-Lingual Prompting
We test the zero-shot setting using the leave-one-out strategy, i.e., retrieving in-context examples from target languages other than the language under evaluation.

H Error Analysis of Cross-Lingual Prompting
Table 10 shows two examples where RAMP performs significantly worse than the base model in terms of COMET. In the first example, having multiple in-context examples containing "million" led the model to mistranslate "billion" as "million".
In the second example, we observe that color-related in-context examples led the model to produce hallucinated output about clothing colors. Repeated misleading in-context examples are less often observed on MT-GENEVAL and in the same-language setting because (1) COCOA-MT translates the same set of English sentences into different languages, while MT-GENEVAL collects English sentences independently; and (2) there are no duplicated source (English) sentences for each language. (Therefore, if RAMP retrieves duplicated English sentences as in Table 10, their reference translations are guaranteed to be in different languages.)


Figure 1: An example of RAMP using 2 in-context examples. (Left) The input sentence is embedded by a sentence similarity model, and the top-k most similar labeled examples are retrieved from a pool of training data to build the prompt context. (Right) Labeled cross-lingual examples are used to fill in the English prompt template, which is then provided to the LLM to generate the output.
Details of Dataset Splits and Pre-Trained Attribute Classifiers

We use the original train/test split provided by the COCOA-MT dataset. Each split contains the telephony and topical_chat domains; we use the topical_chat domain in our experiments. MT-GENEVAL contains a dev and a test split, and we use the dev split as training data for the classification model and prompting experiments.

Figure 2: BLEU and sentential formality accuracy of prompt outputs on the COCOA-MT test set for different numbers of in-context examples. Confidence intervals are obtained for the base setting by sampling in-context examples using 3 seeds.

Table 1: Examples of attribute triplets from COCOA-MT and MT-GENEVAL. Attribute markers in the attribute-controlled translations are underlined.

Table 2: Target languages in the test sets and languages seen by LLMs in pre-training. We report results on languages seen by both LLMs. Language codes are defined in Appendix B.

Table 3: Dataset statistics. We report the number of triplets in the train/test splits aggregated across all languages, and the classification accuracy of the classifiers on the test split.

Table 4: BLEU, COMET, Lexical- and Sentential-Accuracy of selected LLMs using 16 same-language in-context examples on two tasks, alongside adapted MT models. Scores are aggregated across seen languages (w.r.t. BLOOM pre-training) and both attributes for each task. Decomposed results are included in Tables 6 and 7.

Table 5: Detailed scores of same-language prompting on COCOA-MT (preliminary evaluation). Numbers in the header represent the number of in-context examples used for prompting, including zero-shot prompting (0). Scores are averaged across the two available formality values (formal, informal) and languages (ES, FR, HI, PT).

Table 6: Decomposed results of same-language prompting on COCOA-MT (full evaluation).

Table 7: Decomposed results of same-language prompting on MT-GENEVAL (full evaluation).

Table 8: Decomposed results of cross-lingual prompting on COCOA-MT.