CoCoA-MT: A Dataset and Benchmark for Contrastive Controlled MT with Application to Formality

The machine translation (MT) task is typically formulated as that of returning a single translation for an input segment. However, in many cases, multiple different translations are valid and the appropriate translation may depend on the intended target audience, characteristics of the speaker, or even the relationship between speakers. Specific problems arise when dealing with honorifics, particularly translating from English into languages with formality markers. For example, the sentence "Are you sure?" can be translated in German as "Sind Sie sich sicher?" (formal register) or "Bist du dir sicher?" (informal). Using the wrong or inconsistent tone may be perceived as inappropriate or jarring for users of certain cultures and demographics. This work addresses the problem of learning to control target language attributes, in this case formality, from a small amount of labeled contrastive data. We introduce an annotated dataset (CoCoA-MT) and an associated evaluation metric for training and evaluating formality-controlled MT models for six diverse target languages. We show that we can train formality-controlled models by fine-tuning on labeled contrastive data, achieving high accuracy (82% in-domain and 73% out-of-domain) while maintaining overall quality.


Introduction
The quality of neural machine translation (NMT) models has been improving over the years and is approaching that of human translation. With fewer glaring accuracy or fluency errors, it is important to address other aspects of translation quality, such as tone and style, in order to generate context-appropriate translations and improve the end-user experience with MT systems. In particular, for spoken language and certain text domains (customer service, business, gaming chat), problems arise when translating from English into languages that have multiple formality levels expressed through honorifics or grammatical register. Taking the example from Table 1, the phrase 'Could you?' can have two equally correct German translations: 'Könnten Sie?' for the formal register and 'Könntest du?' for informal. This problem has been addressed previously with custom models trained on data with consistent formality (Viswanathan et al., 2019), or through side constraints to control politeness or formality (Sennrich et al., 2016a; Niu et al., 2018; Feely et al., 2019; Schioppa et al., 2021). Most prior research has been tailored to individual languages and has labeled large amounts of data using word lists or morphological analysers.
In this work we look at formality across multiple languages and frame formality control as a transfer learning problem, by leveraging a generic NMT system and a small amount of manually labeled data to obtain MT systems that are controllable for formality. Our main contributions are threefold. First, we release a novel multilingual and multidomain benchmark for Contrastive Controlled MT (CoCoA-MT) consisting of contrastive translations with phrase-level annotations of formality and grammatical gender in six diverse language pairs: English (EN) → French (FR), German (DE), Hindi (HI), Italian (IT), Japanese (JA), and Spanish (ES). Second, to accompany the CoCoA-MT dataset, we introduce a reference-based automatic metric with high precision at distinguishing formal from informal system hypotheses. Third, we propose training formality-controlled models using transfer learning on contrastive labeled data. Our method is effective across six language pairs and robust across several datasets. We show that transfer learning using CoCoA-MT is complementary to automatically labeled data, while cost-effective compared to non-contrastive curated data.
We release the CoCoA-MT dataset, together with Sockeye 3 1 baseline models and evaluation scripts. 2 These resources were also available to participants of the IWSLT 2022 (Anastasopoulos et al., 2022) shared task on Formality Control for Spoken Language Translation. 3

CoCoA-MT Dataset
We first introduce CoCoA-MT, our Contrastive Controlled MT by AWS AI dataset, which enables evaluation and training of formality-controlled models.

Source Data
The EN source data comes from three domains/modalities: Topical-Chat 4 (Gopalakrishnan et al., 2019), as well as new Telephony and Call Center data. 5 Topical-Chat consists of text-based conversations about various topics, such as fashion, books, sports, and music. The Telephony domain contains transcribed spoken general conversations, unrestricted for topic. The Call Center data is also transcribed spoken data, where the conversations come from simulated customer support scenarios.
We use these three datasets to extract subsets containing utterances that are relevant to the formality control task. The subsets are designed to ensure coverage of diverse phenomena related to formality (honorifics or grammatical register) in the target languages. Specifically, we first selected segments (without the conversational context) having between 7 and 40 words and containing second-person pronouns (relevant for all target languages) and first-person pronouns (relevant for honorifics in JA). Through regular expressions, we ensured that the selected data contained the relevant pronouns in various positions (subject, object, object of preposition). Second, we created a list of common EN verbs and used them in data selection in order to ensure lexical diversity of verbs and verb forms. Third, the automatically selected segments were further filtered or corrected by native English-speaking annotators who were asked to remove stock phrases (e.g. thank you), ensure that at least one addressee or speaker is referenced, and clean disfluencies from the speech data.

1 https://github.com/awslabs/sockeye
2 The full data, including train/test splits, will be released at https://github.com/amazon-research/contrastive-controlled-mt/ under a CDLA-Sharing-1.0 license.
3 https://iwslt.org/2022/formality
4 http://github.com/alexa/Topical-Chat/
5 The Telephony and Call Center data is part of a larger conversational dataset that is currently a work in progress.
The selected source segments were then further filtered after the translation and phrase-level annotation steps described in the next section.
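The selection heuristic described above can be sketched as a simple length and pronoun filter; the exact pronoun lists and regular expressions below are illustrative assumptions, not the authors' released code.

```python
import re

# Hypothetical re-implementation of the candidate selection step:
# keep segments of 7-40 words containing a second-person pronoun
# (relevant for all target languages) or a first-person pronoun
# (relevant for honorifics in Japanese).
SECOND_PERSON = re.compile(r"\b(you|your|yours|yourself|yourselves)\b", re.IGNORECASE)
FIRST_PERSON = re.compile(r"\b(i|me|my|mine|myself)\b", re.IGNORECASE)

def is_candidate(segment: str) -> bool:
    """Return True if a segment passes the length and pronoun filters."""
    n_words = len(segment.split())
    if not 7 <= n_words <= 40:
        return False
    return bool(SECOND_PERSON.search(segment) or FIRST_PERSON.search(segment))
```

In the paper, candidates surviving this automatic filter were still reviewed by annotators, so a sketch like this only produces the initial pool.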

Translations and Annotations
For each source segment, we collected one reference translation for each level of formality (formal and informal). For JA, where more than two formality levels are possible, informal was mapped to kudaketa, formal to teineigo, and respectful to sonkeigo and/or kenjougo. 6 We discarded segments if translators did not provide a translation for each formality level, because we considered these segments not relevant for the formality control task. Table 1 provides examples for EN→DE and EN→JA. Annotators also provided phrase-level annotations of formality markers in the target segments in order to facilitate evaluation and analysis (shown in bold in Table 1).
Reference translations were created by professional translators who were native speakers of the specified language and geographic variant. 7 Formal translations were created from scratch as the canonical form, and informal translations were post-edited from the formal translations to ensure that there were no spurious differences between formal and informal references. Translators were instructed to generate natural translations that preserve the meaning and tone of the original sentence while addressing formality with minimal required changes. Such changes included swapping pronouns, editing verb forms, and additional lexical changes to obtain natural-sounding translations. We report dataset statistics in the next section and the full instructions given to translators in Appendix D.

Dataset Statistics
For each language pair, we release test data for all three domains (Topical-Chat, Telephony, and Call Center), and training data for Topical-Chat and Telephony. All segments in the test data have distinct formal/informal references, while the training data contains some segments with identical references for both formality levels. Table 2 reports the number of training and test segments for each language pair, as well as the overlap (measured as BLEU) between informal and formal references in the test set. Note that EN-JA has more training data because we include both first-person and second-person formality segments. The similarity between formal and informal translations is lowest for EN-JA and highest for EN-HI, confirming that Hindi and Japanese are the two extremes with respect to the degree of formality marking among these six languages.

In Table 3 we report corpus-level statistics on the variety of phenomena represented in the formal training set, including the number of unique and total phrases and tokens labeled for formality in the reference translation. Additionally, we report the fraction of tokens labeled for formality that are either verbs or pronominals. To compute the part-of-speech for each token, for Hindi we utilized stanza (Qi et al., 2020). For the other target languages, we utilized spaCy 8 and the respective large language models. For Japanese, a significant number of tokens were nouns or adjectives (7%), which was not true for the other target languages (on average 2%).
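Given part-of-speech tags for the formality-marked tokens (from stanza or spaCy, as in the paper), the verb/pronominal fraction reported in Table 3 reduces to a simple count. The set of Universal POS tags treated as "verbs or pronominals" here ({VERB, AUX, PRON}) is our assumption:

```python
def verb_pronoun_fraction(tagged_tokens):
    """tagged_tokens: list of (token, UPOS) pairs for the tokens labeled
    for formality in one corpus; returns the fraction of those tokens
    tagged as verbs or pronominal forms."""
    if not tagged_tokens:
        return 0.0
    hits = sum(1 for _, pos in tagged_tokens if pos in {"VERB", "AUX", "PRON"})
    return hits / len(tagged_tokens)
```

For example, for the German phrase "Sie sind", both tokens count toward the fraction, while a formality-marked noun (as is common in Japanese) would not.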

Formality Evaluation
In this section, we present a manual analysis of formality expressed in the outputs of two generic commercial systems for inputs sampled from CoCoA-MT. Next, we propose and evaluate a reference-based automatic metric which we will later use to evaluate formality-controlled models.

Manual Analysis of Commercial Systems
General-purpose commercial MT systems are trained on web-scale parallel and monolingual data with different formality levels. To understand how these systems behave with respect to formality, we analyzed two commercial MT systems on 300 random samples from CoCoA-MT. For each target language and each system, two professional translators were asked to label the translations according to the formality markers present in the output: formal, informal, neutral, or other. The label "neutral" was used for output that can be considered both formal or informal (impersonal-passive or plural forms), while "other" was used to label inconsistent formality or incorrectly omitted formality markers. 9 Table 4 reports the distribution of labels for the two systems and the inter-annotator agreement measured by Krippendorff's alpha (Hayes and Krippendorff, 2007). Agreement is high at 0.91 on average across languages. The distribution of formality in the outputs varies widely across languages for both systems. Both systems exhibit cases of inconsistent formality, with over 20% of segments labeled as "other" for Japanese. Overall, systems A and B are surprisingly similar in their behaviour, with significant differences in only two languages: system B is more formal than system A for Japanese (73.8% vs 29.0%); system A outputs more neutral forms than system B for Italian (14.4% vs 2.7%).

9 We give examples in the appendix in Table 14.

Table 5: Precision and recall of automatic segment-level classification of system outputs as formal or informal.
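The cited agreement statistic, Krippendorff's alpha for nominal labels, can be computed from a coincidence matrix. The following is a compact sketch of the standard formulation for complete data; the paper's exact computation (via Hayes and Krippendorff's tooling or otherwise) is not specified:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.
    units: one list of labels per item (the labels given by that item's coders)."""
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # items coded by a single annotator are not pairable
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)
    n = sum(coincidences.values())
    marginals = Counter()
    for (a, _), w in coincidences.items():
        marginals[a] += w
    d_o = sum(w for (a, b), w in coincidences.items() if a != b)  # observed disagreement
    d_e = sum(marginals[a] * marginals[b]
              for a, b in permutations(marginals, 2)) / (n - 1)    # expected disagreement
    if d_e == 0:
        return 1.0  # degenerate case: only one label occurs in the whole sample
    return 1.0 - d_o / d_e
```

Two annotators agreeing on every item yields alpha = 1.0, while partial agreement (as in the per-language figures averaging 0.91) yields a value between 0 and 1.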
Automatic Evaluation

To evaluate formality-controlled models, we propose a reference-based corpus-level automatic accuracy metric. Given a system hypothesis, we automatically label it as formal or informal: formal if the hypothesis contains a) any of the formality-marking phrases annotated in the formal reference and b) none of the phrases annotated in the informal reference. We reverse the conditions to assign an informal label. Note that some hypotheses may not fall into either category.
Following segment-level assignments, we compute a corpus-level Matched-Accuracy (M-Acc) metric as the percentage of outputs that match the desired formality level, out of all the instances classified automatically as either formal or informal (hence matched). We use the notation M-Acc (F)/(I) to denote this score when the desired formality level is formal/informal, respectively. We could not reliably classify neutral and other examples automatically and as such we did not include these labels when computing accuracy. Algorithm 1 formally describes the implementation of the reference-based automatic Matched-Accuracy metric.
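A minimal sketch of the segment-level classification and corpus-level M-Acc described above, assuming simple substring matching of the annotated phrases (the released evaluation scripts may match more carefully, e.g. at word boundaries):

```python
def classify_hypothesis(hyp, formal_phrases, informal_phrases):
    """Label a hypothesis as 'formal' or 'informal' from the phrase-level
    reference annotations, or None when neither condition applies
    (such segments are left unmatched)."""
    has_formal = any(p in hyp for p in formal_phrases)
    has_informal = any(p in hyp for p in informal_phrases)
    if has_formal and not has_informal:
        return "formal"
    if has_informal and not has_formal:
        return "informal"
    return None

def matched_accuracy(hyps, formal_ann, informal_ann, desired):
    """Corpus-level M-Acc: the share of matched outputs (those classified
    formal or informal) that carry the desired formality level."""
    labels = [classify_hypothesis(h, f, i)
              for h, f, i in zip(hyps, formal_ann, informal_ann)]
    matched = [lab for lab in labels if lab is not None]
    return sum(1 for lab in matched if lab == desired) / len(matched) if matched else 0.0
```

With the Table 1 example, a system that always outputs "Könnten Sie?" scores M-Acc (F) = 1.0 and M-Acc (I) = 0.0.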
To validate the M-Acc metric, we compare the predictions for formal and informal against the true labels given to outputs of system A and system B (described above). We report the breakdown of precision and recall for the two labels for each language in Table 5. The reference-based segment-level classification algorithm achieves a macro-average of 0.90 precision and 0.64 recall across languages.

Transfer Learning for Formality Control
We approach formality-controlled NMT as a transfer learning problem, where we fine-tune a generic pre-trained MT model on labeled contrastive translation pairs from the CoCoA-MT dataset. For each source segment we create two labeled training data points: one for each contrastive reference translation (formal and informal). We use a special token with a randomly initialized embedding for the formality label which we attach to the beginning of the source segment.
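The construction of labeled data points can be sketched as follows. The literal token strings `<formal>` and `<informal>` are our assumption; the paper only specifies a special source-prefix token whose embedding is randomly initialized:

```python
# Sketch of building contrastive training pairs with a source-side
# formality control token (token strings are illustrative assumptions).
FORMAL_TAG, INFORMAL_TAG = "<formal>", "<informal>"

def make_contrastive_pairs(source, formal_ref, informal_ref):
    """One source segment yields two labeled data points,
    one per contrastive reference translation."""
    return [
        (f"{FORMAL_TAG} {source}", formal_ref),
        (f"{INFORMAL_TAG} {source}", informal_ref),
    ]
```

At inference time, the desired formality level is then selected simply by prefixing the input with the corresponding tag.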
To leverage the small amount of labeled data while maintaining the overall quality of the generic pre-trained MT model, we first up-sample the labeled data by concatenating multiple copies. 10 Next, we augment the labeled data with an equal amount of unlabeled data sampled randomly from the generic training set. Finally, we fine-tune on the combined labeled and unlabeled data for one epoch.

NMT models were built using the Transformer-base architecture (Vaswani et al., 2017), but with 20 encoder layers and 2 decoder layers as recommended by Domhan et al. (2020), and SSRU decoder layers for faster decoding.
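The up-sample-and-mix recipe above can be sketched as a small data-preparation function; the default factor of 5 follows the EN-DE tuning reported later, and the random seed handling is our addition for reproducibility:

```python
import random

def build_finetuning_data(labeled, generic, upsample=5, seed=0):
    """Up-sample the labeled contrastive pairs by concatenating copies,
    then mix in an equal amount of unlabeled pairs sampled from the
    generic training set, and shuffle."""
    upsampled = labeled * upsample
    rng = random.Random(seed)
    sampled_generic = rng.sample(generic, min(len(upsampled), len(generic)))
    mixture = upsampled + sampled_generic
    rng.shuffle(mixture)
    return mixture
```

Fine-tuning then runs for a single epoch over this mixture, so the up-sampling factor directly controls how many times the model sees each labeled example.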
We report the complete lists of pre-processing and training arguments in Appendix C.
In Section 6.2 we compare training on CoCoA-MT data with other sources of labeled data. In Section 6.3 we perform additional evaluations on existing (non-contrastive) test sets for which a single formality level is naturally appropriate: forum discussions (informal) and customer support conversations (formal). This is a common scenario, requiring consistent translations that are appropriate for the domain and target audience.

CoCoA-MT Performance
To maximize the effectiveness of transfer learning with the small amount of curated labeled data, we first experiment with up-sampling the contrastive labeled data for EN-DE. Figure 1 shows accuracy on the CoCoA-MT test sets for different up-sampling factors. We report both formal and informal M-Acc values, obtained by setting the desired formality level to formal/informal and evaluating against formal/informal references respectively. As previously described, the training data covers the Telephony and Topical-Chat domains, but not the Call Center domain. For this reason, Telephony and Topical-Chat results represent in-domain performance, while Call Center results represent out-of-domain performance (a distinction also used in Table 6). BLEU scores are reported as a measure of generic quality: in this setting, translations are generated without any formality control.
Results show that by increasing the up-sampling factor (up to 5x), accuracy improves up to 80% on the combined test sets, while generic quality is fairly stable (small degradation of up to -0.6 BLEU). To avoid over-fitting on the labeled data, we fix the up-sampling factor to five for all language pairs throughout the rest of the paper. 12 When comparing the learning curves for the three domains, we find that Telephony and Topical-Chat show similar trends, with high accuracy for both formal and informal, while on Call Center, the out-of-domain setting, the gap between formal and informal accuracy remains large (ca. 50 points).

Table 6 reports results on all language pairs. On the in-domain test set, accuracy averaged across formal and informal ranges from 69.1% for EN-FR to 98.4% for EN-IT, with generally high accuracy of over 70% across languages for both formal and informal. On the out-of-domain set, accuracy across languages is high for formal (91.4%) but low for informal (55.4%). Accuracy for informal is particularly low on this domain for target languages where the generic models have a strong bias toward formal: DE, ES, FR, and HI. For these languages, we find this setting adversarial for generating informal outputs, as the test set is out-of-domain and at the same time the generic training data biases the models toward formal. We leave for future work exploration of whether increasing data size can overcome this bias, as seems to be the case for EN-JA where informal accuracy is 92.9%.
From these results we conclude that transfer learning with as little as 400 to 1,000 labeled contrastive examples is effective for formality control on in-domain data and can generalize to out-of-domain data, while generic quality is maintained. 13 Finally, a manual investigation of the outputs reveals that formality-controlled models appear to transfer knowledge from the generic training data to generalize to other aspects of formality, beyond grammatical register. We observe examples of changes in lexical choice, punctuation, or syntactic structure, even when such variations are not present in the labeled data for that target language.

13 We observe a side effect on EN-FR where generic quality improves by more than 2 BLEU points on MuST-C. We attribute this to an adaptation effect, as both the CoCoA-MT training set and the MuST-C test set cover spoken language, which is less represented in web-crawled parallel data.

Source: what are your thoughts on the goatees some of the players grow?
Target: ¿qué piensas de las barbas de chivo que se dejan crecer algunos jugadores?

We compare models trained on the rule-based labeled data against two models: one trained on the contrastive CoCoA-MT data and another trained on non-contrastive CoCoA-MT data, with twice as many source segments. 16 For comparability, we keep the total number of data points constant across all conditions. However, as additional rule-based labeled data is easy to obtain and may improve results, we test two settings: 800 data points up-sampled 5x (same as the other models), as well as 4000 unique data points.
Results are shown in Table 8. Fine-tuning on noisy rule-based labeled data results in lower average accuracy across all language pairs and significantly worse performance on EN-DE and EN-JA. On FR, DE, and ES, results shift to better informal accuracy with a trade-off in formal performance. For EN-JA the rule-based data is not effective for either formal or informal control. Increasing the diversity of the rule-based data by using more unique source segments does not lead to significant improvements. However, given the complementary performance observed, combining the two labeled datasets is a promising future work direction.

15 CoCoA-MT could be used to train a formality classifier that can annotate more data. We leave this to future work.
16 For EN-JA we did not have additional annotated data for the non-contrastive setting.
The non-contrastive use of CoCoA-MT leads to accuracy improvements of 5.8 points for EN-DE and 1 point for EN-FR. This suggests that improving coverage by sourcing and annotating additional training data is beneficial. However, contrastive data is more efficient to create, as swapping formality levels is done through post-editing.

Human Evaluation on Held-Out Domains
We conduct human evaluation of accuracy and generic quality of formality-controlled models on non-contrastive data from two held-out domains. The first domain comprises noisy comments on Reddit forums from the MTNT dataset (Michel and Neubig, 2018b) and the second domain comprises task-based (customer service) dialog turns from the Taskmaster dataset (Byrne et al., 2019; Farajian et al., 2020). 17 For the human evaluation we select source segments that have at least one second-person pronoun and set the formality level to informal for the MTNT data and formal for the Taskmaster data, matching the typical formality level used for each domain. Translators were instructed to rate the quality of translations on a scale of 1 (poor) to 6 (perfect) and to mark whether the translation matches the desired formality level. We did not include Hindi as we believed translators would have difficulties with this task given the low level of generic quality (10 BLEU on newstest).

In Table 10, we report the accuracy and quality scores 18 for the formality-controlled models as well as the improvements over the generic baseline models. Human evaluation results confirm that our formality-controlled models can generalize to unseen domains. Their accuracy is generally high (at or above 70%) and better than the baseline across languages for both Formal and Informal (with the exception of Formal for French and Japanese). At the same time, generic quality is retained or even slightly improved in some cases (up to 6.9% for French and 6.4% for Japanese on Taskmaster) compared to the generic baseline.

17 The dialog topics are: ordering pizza, creating auto repair appointments, setting up ride service, ordering movie tickets, ordering coffee drinks, and making restaurant reservations. We use the first 35 dialogues included in the WMT 2020 Chat Translation shared task. https://github.com/Unbabel/BConTrasT

Table 10: Human evaluation of formality-controlled models on held-out domains. Formality is set to Informal on MTNT and to Formal on Taskmaster. We report absolute accuracy difference (bl_∆) and relative quality gain (bl_%) between the controlled and baseline models.

Gender-Specific Translations
While creating the CoCoA-MT formality-controlled dataset, we observed that for target languages with grammatical gender (all except JA), some reference translations require gender to be expressed in the target even though it is ambiguous in the source. 19 Table 11 shows one such sentence from the EN-ES training set.
In fact, this is similar to formality: a grammatical distinction must be made in the target language, even though the source is under-specified with respect to gender.

Source: Did you play with Legos growing up?
Feminine: ¿De pequeña jugaba con piezas de Lego?
Masculine: ¿De pequeño jugaba con piezas de Lego?

Effect on Gender Translation Accuracy

In Section 6, for segments with gendered translations, we selected a single gender (in that case, masculine) to use consistently in all training and evaluation data.
Here, we perform an initial evaluation of the effect of gender-specific formality-controlled data on gender translation accuracy using WinoMT (Stanovsky et al., 2019) on EN-ES. 21 We compare the baseline (without formality control) to separate models trained using masculine (msc-trg; same as in Table 6) and feminine (fem-trg) target data. These results, along with formality and quality metrics, are shown in Table 13. Using only masculine target sentences causes a drop in feminine F1, whereas feminine target segments improve feminine F1 without harming masculine F1. For easy comparison with Section 6, we report formality matched accuracy with respect to the masculine-reference test set, which explains the slight drop in formality accuracy for fem-trg. These results show that gender-specific translations are useful for maintaining gender translation accuracy when creating formality-controlled models.

Table 13: WinoMT, formality, and BLEU scores on English→Spanish models trained without formality control (base), and with grammatically masculine and feminine target data.
We release the gender-specific translations to open up opportunities to explore the best use of this data. The data could also be used for gender control given user-specified preferences for gender in translation (similar to the formality control explored here). We leave these possibilities for future work.

Related Work
Controlling politeness for NMT was first tackled by Sennrich et al. (2016a) for EN-DE translation. They appended side constraints to the source text to indicate the preference of verbs or T-V pronoun choices (Brown and Gilman, 1960) in the output. 22 A similar approach was applied to control the presence of honorific verb forms in EN-JA MT by Feely et al. (2019). Viswanathan et al. (2019) controlled T-V pronoun choices of EN-ES/FR/Czech translations by adapting generic models with T-V distinct data. They collected politeness parallel data using heuristics. For the task of FR-EN formality-sensitive MT (Niu et al., 2017), translation and EN formality transfer were trained jointly in a multi-task setting (Niu et al., 2018; Niu and Carpuat, 2020). They assumed cross-lingual formality parallel data is not available and leveraged monolingual formality data instead (Rao and Tetreault, 2018).

Conclusions
This work addresses the problem of controlling MT output when translating into languages that make formality distinctions through honorifics or grammatical register. To train and evaluate formality-controlled MT models, we introduce CoCoA-MT, a novel multilingual and multidomain benchmark, together with a reference-based automatic metric. Our experiments show that formality-controlled MT models can be trained effectively with transfer learning on labeled contrastive translation pairs from CoCoA-MT, achieving high targeted accuracy and retaining generic translation quality. We release the CoCoA-MT dataset to enable future work on controlling multiple features (formality and grammatical gender) simultaneously.

Ethical Considerations
As part of this paper, we created and are releasing formality-controlled contrastive parallel data from English into French, German, Hindi, Italian, Japanese, and Spanish. The translations and annotations were created by professional translators who were recruited by a language service provider and were compensated according to industry standards. The translations are based on existing English corpora which are not user-generated. Before creating the translations, we obtained approval for our use case from the creators of the existing artifacts.
As part of our formality-controlled dataset, we noticed that translations often required the gender of the speaker or the addressee to be specified, even when the English source was gender-neutral. As a result, for each such case, we include grammatically feminine and grammatically masculine reference translations. We hope that this will open up opportunities for future work in avoiding gender bias when controlling for politeness, and even in improving translations by customizing to the user's desired gender, 23 in a similar way to how we customize for the desired formality in this paper. In creating gender-specific reference translations, we limit the differences to words that are grammatically gendered in the target languages, rather than stereotypical or other differences. It is important to note that while this paper addresses grammatical gender in translation, it does not use human subjects, infer or predict gender, or otherwise use gender as a variable.
We would like to emphasize that the work on gender in this paper is very much a work in progress. We provide this dataset as an initial contribution; we will continue to improve on this work and this data, and we hope other groups also use and expand on it. Most notably, so far we have only produced translations for two genders. In the future, we plan on expanding the reference translations to more genders, in consultation with native speakers of the target languages and other stakeholders. We also would like to analyze gender bias in formality-controlled models, as well as create models that can control for multiple features (e.g., formality and grammatical gender) simultaneously.

B Additional Results
We report additional results for increasing the up-sampling factor (up to 8x) for EN-JA in Figure 2. On this larger labeled dataset, a higher up-sampling factor can improve accuracy up to 94% across domains, significantly increasing the out-of-domain (Call Center) accuracy while generic quality remains stable. The up-sampling factor can be tuned further for each language to achieve the optimal trade-off between accuracy and generic quality.

EN: You know what I'm saying. You want them to teach you something new.
DE: Du weißt, was ich meine. Sie möchten, dass sie Ihnen etwas Neues beibringen.
Label OTHER: "Du weißt" (informal) vs. "Sie möchten" and "Ihnen" (formal)

EN: So I will need an early check-in and if you have a airport shuttle, that will be great.
IT: Quindi avrò bisogno di un check-in anticipato e se si dispone di una navetta aeroportuale, sarà fantastico.
Label NEUTRAL: "si dispone" (impersonal)

Figure 2: Accuracy on the CoCoA-MT test sets and generic quality scores (BLEU) for EN-JA for an increasing amount of labeled data (through up-sampling up to 8x). The generic baseline scores correspond to 0 on the x-axis. Each source sentence in the CoCoA-MT dataset corresponds to two data points, one for each formality level. For computing the BLEU scores we translate the generic test set (IWSLT) without controlling formality.

C Experimental Setup
All training and development data was tokenized using the Sacremoses tokenizer. 24 Words were segmented using BPE (Sennrich et al., 2016b) with 32K operations. Source and target subwords shared the same vocabulary. Training segments longer than 95 tokens were removed.
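The length filter mentioned above can be sketched as follows. We read the 95-token limit as applying to either side of a sentence pair; whitespace splitting here stands in for the tokenization-plus-BPE pipeline the paper actually counts tokens with:

```python
def filter_long_segments(pairs, max_len=95):
    """Drop training sentence pairs where either the source or the target
    exceeds max_len tokens (token counts are post-BPE in the paper;
    whitespace split is a simplifying stand-in)."""
    return [(src, trg) for src, trg in pairs
            if len(src.split()) <= max_len and len(trg.split()) <= max_len]
```

In practice this step would run after Sacremoses tokenization and BPE segmentation, immediately before training data is sharded.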
The source embeddings, target embeddings, and the output layer's weight matrix are tied (Press and Wolf, 2017). Training is done on 8 GPUs with Sockeye 2's large batch training. It has an effective batch size of 327,680 tokens, a learning rate of 0.00113 with 2000 warmup steps and a reduce rate of 0.9, a checkpoint interval of 125 steps, and learning rate reduction after 8 checkpoints without improvement. After an extended plateau of 60 checkpoints, the 8 checkpoints with the lowest validation perplexity are averaged to produce the final model parameters.
Fine-tuning is done on 4 GPUs with an effective batch size of 8,192 tokens, a learning rate of 0.0002, and only one epoch, as per Hasler et al. (2021).
Parameters for standard training:

D Instructions for Creating Formality-Specific References
In this section, we reproduce the instructions given to the translators for DE, ES, FR, HI, and IT. Instructions for JA are similar but include some language-specific notes. We make minor edits for anonymity purposes. For brevity, we also remove example translations shown to the translators.
Overview This project is to create a test set whose content consists of short conversations or utterances taken from conversations. Many segments are taken out of context, but all of them are utterances said during a conversation. Sometimes you will understand the relationship between the speakers from the context, and sometimes you will not. With your translations, we are creating a very specific test set. We will use it to test the capability of an MT engine to produce a translation with the required formality of both speakers. In other words, imagine if we could ask an MT engine: now translate this conversation as if it is between two speakers, where the relationship between them is formal. To test how well it can do that, we will be using your translations (the golden set).
You will receive a source file that will consist of utterances that were initially part of a conversation; some segments will appear with the surrounding context utterances, and some will be taken out of the conversation. Each segment might consist of several sentences.
Terminology Formality marker: a (form of the) word(s) that indicates the tone of that utterance or relationship between speaker and addressee. Even if you take this word(s) out of context, by looking at it you will clearly know the tone/formality level of the conversation in which this word is used.
For example, • English: "you" (2nd person pronoun) has no formality marker in English (meaning, by looking at the word you cannot tell if the tone of addressing them is formal or informal).
• German: has formality markers in the 2nd person pronoun and corresponding verb forms: "du" (informal) versus "Sie" (formal). This means, just by looking at the pronoun "du" or the verb next to it, I know the tone is informal. So, I will mark "du bist" in DE with Formality tags.
• Spanish: has formality markers for the second person pronouns and their verb conjugations: "tú" (informal) and "usted" (formal). Since Spanish is a pro-drop language, verb conjugations may be the only indicator of this information.
• Italian: similarly to Spanish, Italian has formality markers for the second person pronouns and their verb conjugations: "tu" (informal) and "lei" (formal). Since Italian is a pro-drop language, verb conjugations may be the only indicator of this information.
• French: similarly to Spanish and Italian, French has formality markers for the second person pronouns and their verb conjugations: "tu" (informal) and "vous" (formal). French, however, is NOT a pro-drop language.
• Hindi: There are Formality markers for 2nd person in Hindi (meaning, I can address someone respectfully or in a casual way by changing the pronoun). In this case, I will mark the pronoun in Hindi with Formality tags.
For Japanese, translators were additionally provided with examples of formality levels (Table 15) and formality markers (Table 16).