TyDiP: A Dataset for Politeness Classification in Nine Typologically Diverse Languages

We study politeness phenomena in nine typologically diverse languages. Politeness is an important facet of communication and is sometimes argued to be culture-specific, yet existing computational linguistic studies are limited to English. We create TyDiP, a dataset containing three-way politeness annotations for 500 examples in each language, 4.5K examples in total. We evaluate how well multilingual models can identify politeness levels: they show fairly robust zero-shot transfer ability, yet fall significantly short of estimated human accuracy. We further study mapping the English politeness strategy lexicon into the nine languages via automatic translation and lexicon induction, analyzing whether each strategy's impact stays consistent across languages. Lastly, we empirically study the complicated relationship between formality and politeness through transfer experiments. We hope our dataset will support various research questions and applications, from evaluating multilingual models to constructing polite multilingual agents.


Introduction
Whether politeness phenomena and strategies are universal across languages has long been debated among sociologists and linguists. While Brown and Levinson (1978) claimed their universality, followup work (Korac-Kakabadse et al., 2001) argued that communication patterns can differ across cultures and other social constructs such as gender (Mills, 2003) and domain.
To contribute to the linguistic study of cross-cultural politeness, we collect politeness labels for nine typologically and culturally diverse languages: Hindi, Korean, Spanish, Tamil, French, Vietnamese, Russian, Afrikaans, and Hungarian. This language set covers five scripts and eight language families. We closely follow the seminal work of Danescu-Niculescu-Mizil et al. (2013), focusing on politeness exhibited in requests, as requests involve the speaker imposing on the listener and thus push speakers to employ various politeness techniques. To capture rich linguistic strategies that can be lost in translation (Lembersky et al., 2011), we collect sentences written natively in each target language. To minimize domain shift among languages, we collect examples in each language from its Wikipedia User Talk pages, where editors make requests about administrative and editorial decisions.
Crowdsourcing labels in low-resource languages is challenging. Thus, we carefully design an annotation process that includes a translation task to evaluate annotators' language proficiency and a model-in-the-loop qualification task that filters workers whose labels diverge from highly confident predictions of multilingual models. After this process, we observe high agreement among the annotators despite the subjectivity of the task. Interestingly, annotators agree with each other more when assigning politeness scores to requests in their native language than to requests in English, their second language.
Equipped with our new multilingual politeness dataset, we evaluate the zero-shot transfer ability of existing multilingual models in predicting politeness, a subjective and pragmatic language interpretation task. Pretrained language models (Conneau et al., 2020) fine-tuned on annotated English politeness data (Danescu-Niculescu-Mizil et al., 2013) show competitive performance on all languages, lending weight to the universality of politeness phenomena across languages. We also witness impressive zero-shot performance from a high-capacity pretrained language model (Brown et al., 2020). We observe degraded classification performance when we translate the target language into English (via the Google Translate API), suggesting that politeness might not be preserved by current machine translation models. Despite the simplicity of the classification task, we report a substantial difference between the estimated human accuracy and the best model accuracy (over 10% difference in accuracy in six out of nine languages).
Lastly, we provide two studies delving into politeness phenomena. We map the English politeness strategy lexicon into nine languages using automatic translation, lexicon alignment (Dou and Neubig, 2021), and large-scale corpora in the same domain. Despite the limitations of automatic lexicon mapping, we largely observe consistent correlations between each politeness strategy and politeness scores across the nine languages we study, with some interesting exceptions. We then compare the notion of politeness with formality, which has been studied in a multilingual setting (Briakou et al., 2021). Our empirical results support that the notions of politeness and formality cannot be used interchangeably. However, when we control for semantics, the politeness classifier judges the formal version of a sentence as more polite than its informal variant.
We release our annotated data and aligned politeness lexicon to support future work. Our dataset can support various end applications, such as building multilingual agents optimized for politeness (Silva et al., 2022), developing translation models that preserve politeness level (Fu et al., 2020), evaluating the impact of different pretraining corpora and modeling architectures on subjective tasks in a wide range of languages (Hu et al., 2020), understanding culture-specific politeness strategies, and many more.

TYDIP: Multilingual Politeness Dataset
Motivation Our goal is to construct high-quality multilingual evaluation data with native content, covering a wide range of languages, for the task of politeness prediction. Following prior work (Danescu-Niculescu-Mizil et al., 2013), we focus on identifying politeness in requests, where the speaker imposes on the listener. This scenario elicits diverse strategies from speakers to minimize the imposition of their requests or to apologize for the imposition (Lakoff, 1977). For each request text, we aim to collect a graded politeness score (between -3 and 3, in 0.5 increments).
Language Selection We chose Hindi, Korean, Spanish, Tamil, French, Vietnamese, Russian, Afrikaans, and Hungarian. Our criteria for selecting languages were (1) covering low-resource languages when possible, (2) languages with rich discussion on the Wikipedia editor forums, and (3) languages for which we could recruit native-speaker annotators on a crowdsourcing platform, Prolific.

Source Sentence Collection We source requests from Wikipedia User Talk pages in target-language Wikipedia dumps. Each request is part of a conversation between editors on Wikipedia. We follow the pre-processing steps of prior work (Danescu-Niculescu-Mizil et al., 2013), extracting each request as a sequence of two successive sentences where the second sentence ends with a question mark (?). We present one example here: "I'm somewhat puzzled by your recent edits on the Harper page, which have left two different sets of footnotes. Could you please explain your rationale for the change?"

Annotation Process
Collecting annotations for non-English data across a wide range of languages is non-trivial in all aspects, from source text collection and annotator recruiting to annotation validation. We describe our annotation process here and hope that our collection strategy can inform future multilingual data collection efforts for other tasks and domains.
Pre-processing We observe that a sizable portion of the requests is written in a language other than the target language. Thus, we filter out sentences not belonging to the target language using language identification with langdetect (Nakatani, 2010).
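This filtering step can be sketched as follows; the helper name `filter_by_language` and the stub detector are illustrative, and in practice `detect_fn` would be langdetect's `detect`:

```python
def filter_by_language(requests, target_lang, detect_fn):
    """Keep only requests whose detected language matches the target.

    detect_fn maps a text to an ISO language code; in practice this
    would be langdetect.detect, as in the paper's pre-processing.
    """
    kept = []
    for text in requests:
        try:
            if detect_fn(text) == target_lang:
                kept.append(text)
        except Exception:  # detectors can fail on very short or noisy text
            continue
    return kept

# Toy demo with a stub detector (ASCII-only text -> "en", else "ko").
stub_detect = lambda t: "en" if t.isascii() else "ko"
requests = ["Could you revert this edit?", "이 문서를 검토해 주시겠습니까?"]
korean_only = filter_by_language(requests, "ko", stub_detect)
```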
Table 1 shows data statistics, including the language distribution among these requests. We use the Polyglot tokenizer for preprocessing.
Annotator Recruiting We collect our annotations on a crowdsourcing platform, Prolific, which allows us to find workers based on their first language. Instead of developing separate guidelines for each language, we recruit bilingual annotators. We also filter by task approval rate (> 98%).
To annotators who meet these criteria, we administer a qualification process involving a translation task and the target task, which we describe below.

Table 1: Languages chosen for our study and their data statistics. We report the number of available requests in Wikipedia User Talk pages after the pre-processing step, the distribution of languages after language identification, and the average length in bytes for each request.
Target Task Qualification Inspired by the strong zero-shot transfer performance of multilingual models on a variety of tasks (Conneau et al., 2018; Wu and Dredze, 2019), we use a multilingual classifier trained on the existing English politeness dataset (Danescu-Niculescu-Mizil et al., 2013) to select sentences for the qualification task. For each language, we sample examples to which the classifier assigned a very high or very low politeness score. Language-proficient researchers verified the correctness of the model's predictions on a subset of four languages. While the model was not always correct, its highly confident predictions mostly were. These requests, paired with the predicted politeness labels, were used to filter crowdworkers.
Translation Qualification Task Inspired by prior work (Pavlick et al., 2014) which employed a translation task to assess the language proficiency of crowdworkers, we estimate language proficiency by evaluating translation skill. We present crowdworkers with a set of five requests (rated very polite or very impolite by the model) in the target language and ask them to translate each into English as well as to label a politeness score. We first compare the annotator's translation with the output of the Google Translate API. If the edit distance between their translation and the Google Translate output is very small, we remove them from the annotator pool, as they could be using that service. We also compute the distance between each worker's politeness scores and the model's predicted labels, and prune workers whose scores vary significantly from the model predictions.
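A minimal sketch of this qualification check, with illustrative thresholds (not the paper's actual values) and the stdlib `SequenceMatcher` similarity standing in for edit distance:

```python
from difflib import SequenceMatcher

def flag_worker(worker_translation, mt_output, worker_score, model_score,
                sim_threshold=0.95, score_threshold=2.0):
    """Illustrative qualification check; both thresholds are assumptions.

    A worker is flagged if their translation is near-identical to the
    MT output (suggesting they pasted it) or if their politeness score
    diverges too far from the model's confident prediction.
    """
    similarity = SequenceMatcher(None, worker_translation.lower(),
                                 mt_output.lower()).ratio()
    copied_mt = similarity >= sim_threshold
    divergent = abs(worker_score - model_score) > score_threshold
    return copied_mt or divergent

# A worker who copies MT verbatim is flagged; one who paraphrases is not.
mt = "Could you please check this article?"
flagged = flag_worker(mt, mt, worker_score=2.5, model_score=2.0)
ok = flag_worker("Would you mind checking this page?", mt, 2.5, 2.0)
```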
The qualification is not completely automatic: language-proficient researchers continuously provided sanity checks for four of the languages. Fifteen workers per language took our qualifier task, and after this filtering we ended up with 7 Afrikaans, 9 Spanish, 9 Hungarian, 10 Tamil, 10 Russian, 11 Hindi, 11 Korean, 11 French, and 11 Vietnamese workers.
Final Data Collection / Postprocessing The annotators labeled 5 English requests and 15 target-language requests per task. The annotation interface can be found in the appendix. We collect 3-way annotations for each request. Annotating 20 examples took approximately seven minutes, and annotators were paid $3 per task, translating to $25.43/hr.

Inter-annotator Agreement
Ensuring data quality is challenging, especially when we do not have in-house native speakers for all languages we study. Following prior work (Pavlick et al., 2014; Danescu-Niculescu-Mizil et al., 2013), we estimate annotation quality by comparing inter-annotator agreement against the agreement between randomly assigned labels drawn from the data distribution. As we study continuous rather than categorical values, we compute pairwise Spearman correlation to measure agreement instead of Cohen's Kappa.
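The pairwise-agreement measure and its random baseline can be sketched as follows on synthetic ratings; the helper name is our own:

```python
from itertools import combinations

import numpy as np
from scipy.stats import spearmanr

def mean_pairwise_spearman(scores):
    """Mean Spearman correlation over all annotator pairs.

    scores: (n_annotators, n_items) array of politeness ratings.
    """
    rhos = []
    for i, j in combinations(range(len(scores)), 2):
        rho, _ = spearmanr(scores[i], scores[j])
        rhos.append(rho)
    return float(np.mean(rhos))

rng = np.random.default_rng(0)
# Three annotators rating 20 requests, agreeing up to small noise.
base = rng.normal(size=20)
scores = np.stack([base + 0.1 * rng.normal(size=20) for _ in range(3)])
agreement = mean_pairwise_spearman(scores)

# Random-assignment baseline: shuffle each annotator's scores independently.
shuffled = np.stack([rng.permutation(row) for row in scores])
baseline = mean_pairwise_spearman(shuffled)
```

True annotations yield a high positive correlation, while the shuffled baseline hovers near zero, mirroring the comparison reported in Figure 2.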
As each annotator provided scores for both English sentences and sentences in their native language, we report both agreement numbers, split by language, in Table 2. We consistently observe a positive correlation among the annotators' scores.

Table 2: Pairwise correlation (mean and standard deviation in brackets) for each language's annotators, on English data and their native-language data.

Interestingly, we observe substantially higher agreement when annotators were labeling their own language compared to labeling English, across all nine languages. This suggests that interpreting politeness in a foreign language can be less precise and more variable than interpreting it in one's native language. As our main goal is collecting target-language annotations, this should not impact the quality of our dataset, which studies how native speakers perceive native content. We plot the averaged pairwise Spearman correlation of annotations and that of random assignments in Figure 2. In both English and their native languages, annotator correlation is substantially higher than the correlation from random label assignments, which hovers around zero as expected. In Appendix C, we report the correlation between the English politeness labels from the previous study and our annotations, as well as inter-annotator agreement per language.

Final Dataset
We collect three-way annotations for 500 randomly sampled requests in each language. We z-normalize each annotator's scores (mean zero, standard deviation one) and then average the scores of the three annotators to get a final score for each item, which ranges from -3 (very impolite) to +3 (very polite). We plot the final politeness distribution per language in Figure 1.
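The normalization-and-averaging step can be sketched as follows; the toy rating matrix is illustrative:

```python
import numpy as np

def aggregate_scores(raw):
    """raw: (n_annotators, n_items) matrix of raw politeness ratings.

    Each annotator's scores are z-normalized (mean 0, std 1) to remove
    individual rating biases, then averaged across annotators.
    """
    z = (raw - raw.mean(axis=1, keepdims=True)) / raw.std(axis=1, keepdims=True)
    return z.mean(axis=0)

# Three annotators rating four requests on the raw -3..3 scale.
raw = np.array([[3.0, 0.0, -3.0, 1.0],
                [2.5, 0.5, -2.0, 1.0],
                [3.0, -0.5, -2.5, 0.5]])
final = aggregate_scores(raw)
```

Per-annotator normalization means a consistently harsh or lenient rater does not shift the final scores.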
Examples of annotated sentences are in Appendix B.
We split these examples into four quartiles based on their politeness scores and keep only sentences from the top and bottom 25% of politeness scores (corresponding to the polite and impolite classes), following prior work (Danescu-Niculescu-Mizil et al., 2013; Aubakirova and Bansal, 2016). This yields a balanced binary politeness prediction task while reducing the number of examples by half. We refer to this dataset (containing half of the total TYDIP dataset) as the TYDIP evaluation dataset. At inference time, we translate the target-language requests into English using the Google Translate API (optional for the XLMR model, necessary for the RoBERTa model).
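The quartile-based binarization can be sketched as follows; the helper name is ours, and handling boundary ties with simple >=/<= comparisons is an assumption:

```python
import numpy as np

def make_binary_split(scores):
    """Keep only the top and bottom quartile by politeness score,
    labeling them polite (1) and impolite (0); the middle half is dropped."""
    q1, q3 = np.percentile(scores, [25, 75])
    labels = np.where(scores >= q3, 1, np.where(scores <= q1, 0, -1))
    keep = labels != -1
    return keep, labels[keep]

scores = np.array([-2.0, -1.5, -0.2, 0.1, 0.3, 1.4, 2.2, -0.9])
keep, labels = make_binary_split(scores)
```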
We use one large-scale language model, GPT3 (Brown et al., 2020) Davinci-002, in a zero-shot prompting setup with the following prompt: "Is this request polite? <input example>". We then compute the probabilities of the two options for the next token, "yes" and "no", which map to the "polite" and "impolite" labels respectively. Designing prompts for each language is non-trivial, so in this initial study we use the same English template for all languages.
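The yes/no decision rule can be sketched as follows; the stub logits stand in for a real GPT3 API call, and the leading-space token forms are an assumption:

```python
import math

def classify_politeness(next_token_logits):
    """Map next-token scores for ' yes' / ' no' to a politeness label.

    next_token_logits: dict from candidate token to unnormalized logit,
    as a language model would return after the prompt
    "Is this request polite? <input example>".
    """
    # Softmax over just the two candidate tokens.
    z = {t: math.exp(v) for t, v in next_token_logits.items()}
    p_yes = z[" yes"] / sum(z.values())
    return ("polite" if p_yes >= 0.5 else "impolite"), p_yes

# Stub logits standing in for a real model call (an assumption for the demo).
label, p_yes = classify_politeness({" yes": 2.1, " no": -0.4})
```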
Results Table 3 reports the model performances. Following a recent question answering benchmark (Clark et al., 2020), we aggregate scores only over the non-English languages to focus on transfer performance. Both finetuned language models (XLMR and RoBERTa) boast strong performance in English, reaching accuracy hovering around 90%. Even the zero-shot GPT3 model performs competitively, with an accuracy of 80.8%.
For the XLMR model, the results were fairly split on whether it is better to use automatically translated English input, matching the training data, or the target-language input as is. Using English text showed better performance in four languages (Hindi, Spanish, French, Russian), and using the target-language input was better in five (Korean, Tamil, Vietnamese, Afrikaans, and Hungarian). Using the target language yields slightly better performance overall, raising the question of whether automatic translation maintains the politeness level.
The large-scale language model GPT3, even when used in a zero-shot fashion without much prompt engineering (Gao et al., 2021), shows competitive performance, significantly outperforming the majority baseline. As with XLMR, using the target language as is showed better performance than using translated text (70.6 vs. 66.8 on average), and in seven out of nine languages.
Comparing performance across languages is tricky, as the annotation was done by different sets of annotators on different items for each language. To put these numbers in context, we compare estimated human performance with model performance in the next section. Would human agreement be lower on languages with weaker model performance?
Comparison with human agreement To compute a number comparable between annotators and models, we use our original 3-way annotated data before aggregating politeness scores. We treat one annotator's label as the human prediction and take the mean of the other two as the gold politeness score. We repeat this random sampling process 1,000 times for each example in the test set and plot the distribution of accuracy scores in Figure 3.
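The leave-one-annotator-out accuracy estimate can be sketched as follows; thresholding scores at zero to obtain binary labels is a simplifying assumption for this demo:

```python
import numpy as np

def estimate_human_accuracy(raw, n_resamples=1000, seed=0):
    """Leave-one-annotator-out accuracy on the binary polite/impolite task.

    raw: (3, n_items) matrix of annotator scores. For each item, one
    annotator is sampled as the 'human prediction' and the other two are
    averaged into a gold score; both are thresholded at 0 and compared.
    """
    rng = np.random.default_rng(seed)
    n_items = raw.shape[1]
    accs = []
    for _ in range(n_resamples):
        picks = rng.integers(0, 3, size=n_items)
        pred = raw[picks, np.arange(n_items)]
        others = (raw.sum(axis=0) - pred) / 2.0  # mean of the other two
        accs.append(np.mean((pred > 0) == (others > 0)))
    return np.array(accs)

# Toy data where all three annotators agree on the sign of every item.
raw = np.array([[1.0, -2.0, 0.5, -1.0],
                [2.0, -1.0, 1.5, -0.5],
                [1.5, -1.5, 1.0, -2.0]])
accs = estimate_human_accuracy(raw)
```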
Annotators show varying degrees of agreement: we notice particularly strong agreement in Korean and Hungarian, but overall agreement is strong, hovering around 90%. Interestingly, models significantly underperform in these languages with high human agreement, making the gap between human and model performance large. Six of the nine languages show a gap of at least 10%, and two a gap greater than 15%.

Building and Analyzing Politeness Strategies in Nine Languages
In this section, we develop a set of linguistic politeness strategies based on existing English strategies (Danescu-Niculescu-Mizil et al., 2013) and examine how well they explain politeness phenomena in the nine diverse languages we study. While politeness strategies are not necessary for building a high-performing classifier, they can help us understand politeness phenomena.
The original English study presents a list of politeness strategies along with each strategy's relation to the assigned politeness scores. They found many statistically significant correlations between politeness strategies and human perception: for example, words from the gratitude lexicon (appreciate) and counterfactual modals (could/would) correlate with being polite, while starting a sentence with a first-person pronoun correlates with being impolite.
Developing such a politeness lexicon for each language requires expert annotation, which can be infeasible for low-resource languages with few language-proficient researchers (Joshi et al., 2020). Thus, we aim to automatically derive politeness strategies for the other languages from the English ones. For this initial study, we focus on lexicon-based strategies (15 out of 20), excluding strategies that involve dependency parsing.

Mapping English Lexicon to Target Languages
To build a politeness lexicon in nine languages, we use two NLP tools: translation and word alignment.
We sample 5,000 Wikipedia editor requests per language that are not included in our annotated data. (For languages with fewer than 5K requests, Afrikaans and Hindi, we used all available data.) We first automatically translate each target-language sentence into English (with the Google Translate API) and then align the words in the translated English sentence to the words in the original sentence in the target language.
Aligning words in parallel corpora is a long-standing task in NLP. Traditionally, alignments are obtained as a byproduct of training statistical MT systems (Och and Ney, 2003; Dyer et al., 2013), yet this typically requires a large parallel corpus, which we lack for the nine languages we study. We instead use awesome-align (Dou and Neubig, 2021), an alignment method based on the similarity between token representations from a multilingual pretrained language model (mBERT (Devlin et al., 2019)).
For each word in the English politeness lexicon, we collect its aligned words in the target language. As the alignments map a sequence of words to a sequence of words, a single-word English lexicon entry is sometimes mapped to multiple words in the target language. For each English lexicon word, we keep up to the top five target-language word sequences as its matching lexicon entries. We show examples of the induced lexicon in Appendix E and the full lexicon in the repository.
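The lexicon-induction step over harvested alignment pairs can be sketched as follows; the toy French pairs are illustrative:

```python
from collections import Counter

def induce_lexicon(alignments, english_lexicon, top_k=5):
    """Collect target-language counterparts for each English lexicon word.

    alignments: (english_word, target_sequence) pairs harvested from
    word-aligned (translated, original) sentence pairs. For each English
    word, the top_k most frequent aligned target sequences are kept.
    """
    counts = {w: Counter() for w in english_lexicon}
    for en, tgt in alignments:
        if en in counts:
            counts[en][tgt] += 1
    return {w: [t for t, _ in c.most_common(top_k)] for w, c in counts.items()}

# Toy aligned pairs for the gratitude lexicon entry "thanks" (illustrative).
pairs = [("thanks", "merci"), ("thanks", "merci"), ("thanks", "merci bien"),
         ("please", "s'il vous plaît")]
lexicon = induce_lexicon(pairs, english_lexicon={"thanks"})
```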
As the automatically generated lexicon can be imprecise due to incorrect translations or alignments, we manually inspected the generated lexicon in the four languages for which we have language-proficient researchers. We found the alignments mostly reasonable, but erroneous and imprecise for words with multiple senses. Not every lexicon entry was mapped to target-language words either; we show the coverage statistics (the average percentage of lexicon words mapped to target-language words), which hover around 60-70%, at the bottom of Figure 4.
Analysis with Induced Lexicon Using the automatically induced lexicon, we analyze our multilingual politeness data, mirroring the analysis of Danescu-Niculescu-Mizil et al. (2013). We report the average politeness score of sentences exhibiting each strategy in Figure 4; the baseline value here is 0. We observe that the average politeness score for each strategy is fairly consistent across languages (e.g., the PLEASE strategy is positively correlated in all languages except Spanish). Diverging patterns can be errors in strategy mapping and need further investigation. Interestingly, in languages with lower model performance (Korean, Tamil), we observe more diverging patterns (e.g., indirect greetings have positive implications in these two languages while being mildly negative in English). In the appendix, we include the occurrence of different strategies in different politeness quartiles (polite or impolite subsets), which exhibits a similar pattern.

Transfer between Formality and Politeness
While we are not aware of computational linguistic studies of politeness covering multiple languages, prior work (Briakou et al., 2021; Rao and Tetreault, 2018) has explored formality in four languages (English, French, Italian, and Portuguese). In this section, we study the connection between formality and politeness. Would formally written sentences be perceived as more polite by our classifier?

Table 7: Analyzing politeness predictions on (informal, formal) sentence pairs. The left column reports the fraction of pairs for which the same politeness label is assigned to both sentences. The right column reports the fraction of pairs for which the classifier's probability of being polite is higher for the formal sentence than for its informal counterpart.
We use GYAFC (Rao and Tetreault, 2018) and X-FORMAL (Briakou et al., 2021), two datasets containing informal sentences from the L6 Yahoo Answers Corpus and four formal rewrites of each sentence (dataset statistics can be found in Appendix G).
In Table 4, we report zero-shot transfer results from the politeness classifier to formality classification. We use our best multilingual politeness classifier (XLMR-target) from Section 3, calibrating its threshold to account for the different distribution of positive and negative examples. Somewhat surprisingly, the classifier performs worse than the majority baseline. Table 5 shows the transfer in the reverse direction, i.e., from formality to politeness. We similarly finetune an XLMR model on the English training set from GYAFC (Rao and Tetreault, 2018) and evaluate it on the TYDIP evaluation dataset, using the target language as input. After threshold calibration, the model performs better than the majority baseline but substantially underperforms the in-domain results reported in Table 3. Does this mean formality and politeness are not linked? Upon inspection (see Table 6 for examples), we find that politeness predictions for the informal and formal rewrites of the same sentence often stay consistent. Looking into the model's predictions on (informal, formal) sentence pairs, we find that almost 80% of pairs in English receive the same politeness prediction for both sentences. The left column of Table 7 depicts this across four languages, suggesting that politeness could be further linked to the content, not just the style, of the writing.
In their original work, Rao and Tetreault (2018) report that commonly used techniques for making sentences formal include phrasal paraphrases, punctuation changes, expansions, contractions, capitalization, and normalization, which are fairly stylistic. Would such rewriting make sentences be perceived as more polite? We investigate this by looking further into (informal, formal) sentence pairs: for each version of the sentence in a pair, we compute its politeness probability (as assigned by the classifier) and report the percentage of pairs where the formal version was viewed as more polite than its informal counterpart. The right column of Table 7 presents these results: for about 70% of examples, such rewriting indeed made the sentence be perceived as more polite, though often not by enough to flip the politeness decision.
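The pairwise comparison can be sketched as follows, with stub classifier probabilities standing in for the real model:

```python
def fraction_formal_more_polite(pairs, polite_prob):
    """pairs: list of (informal, formal) sentence pairs; polite_prob maps
    a sentence to the classifier's probability of the 'polite' label."""
    wins = sum(polite_prob(formal) > polite_prob(informal)
               for informal, formal in pairs)
    return wins / len(pairs)

# Stub probabilities in place of a real classifier (an assumption).
probs = {"send it now": 0.3, "Could you please send it?": 0.8,
         "whats up": 0.4, "How are you doing?": 0.35}
pairs = [("send it now", "Could you please send it?"),
         ("whats up", "How are you doing?")]
frac = fraction_formal_more_polite(pairs, probs.get)
```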

Related Work
Politeness & Formality Danescu-Niculescu-Mizil et al. (2013) present the first quantitative, linguistic study of politeness, annotating two corpora: requests extracted from conversations between users on Wikipedia User Talk pages and user comments from Stackoverflow. Followup work explored interpreting neural networks' politeness predictions (Aubakirova and Bansal, 2016) and controllable text generation with a target politeness level (Sennrich et al., 2016; Niu and Bansal, 2018; Fu et al., 2020). While these works consider politeness phenomena in English, we expand the study to nine languages. A related concept to politeness is formality, studied in multiple prior works (Lahiri, 2016; Pavlick and Tetreault, 2016; Rao and Tetreault, 2018; Briakou et al., 2021).
Multilingual Models Recent progress in pretrained language models has brought better representations for a multitude of languages. Multilingual language models such as mBERT (Devlin et al., 2019) and XLMR (Conneau et al., 2020), based on the transformer architecture, are pretrained with the masked language modeling objective on large corpora (El-Kishky et al., 2020; Suarez et al., 2019) spanning over 100 languages. While the community also recognizes the varying quality of unlabeled data across languages (Caswell et al., 2022), such multilingual models provide improved representations for modeling low-resource languages. When finetuned on downstream task data in a single language, these models make reasonable predictions in multiple languages (Wu and Dredze, 2019). Multilingual models have also been evaluated in a prompting setup for tasks such as machine translation (Tan et al., 2022) and various multilingual NLU tasks (Zhao and Schütze, 2021; Lin et al., 2021; Winata et al., 2021).
Multilingual Benchmarks Despite recent progress in NLP resources and benchmarks, partially powered by affordable crowdsourcing (Snow et al., 2008), linguistic resources in low-resource languages are still severely limited compared to those in English (Joshi et al., 2020). Many existing datasets are translated from English data (Conneau et al., 2018; Longpre et al., 2021). While the translation approach to dataset construction has the advantage of ensuring similar data distributions across languages, data collected this way does not reflect the language usage of diverse populations, introducing translationese that can differ from purely native text (Lembersky et al., 2011). We provide resources for nine typologically diverse languages, capturing the subtle phenomenon of politeness.

Conclusion
We present TYDIP, a corpus of requests paired with their perceived politeness scores, spanning nine languages. We evaluate multiple multilingual models on zero-shot politeness prediction and find that they perform well without being trained on data from the same language, while not yet reaching human-level performance.

Limitations
Our dataset is moderately sized (250 examples per language in the evaluation portion, and 500 examples per language in total) and still covers a limited number of languages. We had intended to cover more languages (one example being Japanese), but this was hindered by the number of annotators we could recruit for each language.
The aligned politeness strategy lexicon (Section 4) relies on multiple automatic toolkits (machine translation system and word alignments), thus analysis should be interpreted with caution.

Ethical Considerations
The data we annotate comes from Wikipedia User Talk pages, which is an online forum for communication between editors on Wikipedia.This data spans nine different languages and contains speakers from different countries and demographics.The annotation is done by crowdworkers recruited from the online platform Prolific.These workers aren't restricted to a particular country.They are paid a wage of $25.43/hr which is higher than the average pay stipulated on the platform.We use this data to evaluate an existing model across multiple languages, and do not use it for training as such.

A Annotation UI
Figure 5 contains the user interface used for the final annotation process.

C Additional Inter Annotator Agreement Reports
Figure 7 shows the distribution of the pairwise correlation metric over different HITs for each language. Each subplot shows the distribution over the English and target-language parts of each HIT, as well as a baseline where the scores are shuffled before computing the correlation.
The correlations in the random baseline are close to 0 and the correlations on the annotations are significantly higher.The correlations on the English annotations do show more variance in their distribution.

D Politeness Score Statistics
Table 10 summarizes the distribution of scores across languages. All languages have a mean close to 0, with similarly shaped score distributions. The minimum and maximum scores vary somewhat across languages. Some languages, like Spanish, have a higher median score and a larger number of sentences with positive scores.

E Politeness Strategies
Table 11 gives some examples of the politeness strategy lexicon we obtained by our automated method.

F Politeness Strategy Distribution
Figure 8 shows the occurrence of strategies in sentences belonging to the least polite (1st quartile) and most polite (4th quartile) subsets of our data. Cells shaded light orange represent the baseline value of 0.25, and anything deviating from this appears in dark green or red. We can clearly see differences across the two quartiles for some of these strategies.

G Politeness to Formality Transfer
We use the XLMR classifier trained in Section 3 and evaluate it on a mix of informal and formal sentences (1:4 ratio) as a test set. These performance numbers are shown in Table 4. We report the classifier accuracy as well as a majority baseline. Since we have an imbalanced mix of sentences, we calibrate the classifier's threshold using the dev set: we take the probability at the 80th percentile of scores on the English dev set and use it as the threshold on the test sets.
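The percentile-based calibration can be sketched as follows; the helper name and the `positive_fraction` of 0.2 (i.e., the assumption that roughly the top 20% of dev-set probabilities should be labeled positive) are ours:

```python
import numpy as np

def calibrate_threshold(dev_probs, positive_fraction=0.2):
    """Pick a decision threshold so that the expected fraction of
    positive predictions matches an imbalanced label distribution.

    With positive_fraction=0.2, the threshold is the 80th percentile of
    dev-set politeness probabilities, so about 20% of inputs score above it.
    """
    return float(np.percentile(dev_probs, 100 * (1 - positive_fraction)))

# Simulated dev-set probabilities standing in for real classifier outputs.
rng = np.random.default_rng(1)
dev_probs = rng.uniform(size=1000)
threshold = calibrate_threshold(dev_probs)
preds = dev_probs >= threshold
```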

Figure 1: Distribution of final politeness scores per language, with mean and median highlighted.

Figure 2: Spearman correlation. The first and third plots show our annotated data in English and the target languages respectively; the second and fourth show correlations for random assignments, which hover around zero as expected.

Figure 6 compares the overall IRR metrics on our annotations with the IRR on the annotations released by Danescu-Niculescu-Mizil et al. (2013) on English request data. They release the 5-way annotations done on their data, as well as a single score for each sentence after averaging and normalization. We report two scores in Table 9: a correlation with the raw annotations and a correlation with the final aggregated scores.

Table 3: Results. The XLMR and RoBERTa models are finetuned on English politeness data from Danescu-Niculescu-Mizil et al. (2013), while the GPT3 model is prompted in a zero-shot fashion. When the Input Lang. column is "en", we use the Google Translate API to translate the target language into English. We randomly split the data from Danescu-Niculescu-Mizil et al. (2013) to yield 1,926 training and 251 evaluation examples in English. With this training dataset, we fine-tuned each model for five epochs with a batch size of 32 and a learning rate of 5e-6 on a Quadro RTX 6000 machine. We use the large variants of both models.

Table 4: Transfer from politeness to formality. Formality classification accuracy on the X-FORMAL dataset.

Figure 4: Induced politeness strategies and their relation to politeness scores in nine languages. We plot the average politeness score of the set of sentences containing each strategy. Here, the baseline value is 0. The number of strategies covered by the induced lexicon is also given for each language.

Table 5: Transfer from formality to politeness. Politeness classification accuracy on the TYDIP evaluation dataset.

Table 6: Example sentences with their formality labels and predicted politeness labels. The formality labels are from Rao and Tetreault (2018), and the politeness labels are assigned by our classifier.

Table 9: Correlation of our English annotations with the annotations of Danescu-Niculescu-Mizil et al. (2013).
Table 8 contains examples of requests in different languages and the politeness scores assigned to them.

Table 10: Statistics on final politeness scores.

Table 12: Statistics of the formality data used for evaluation (as test sets).