A Dataset and Baselines for Multilingual Reply Suggestion

Reply suggestion models help users process emails and chats faster. Previous work only studies English reply suggestion. Instead, we present MRS, a multilingual reply suggestion dataset with ten languages. MRS can be used to compare two families of models: 1) retrieval models that select the reply from a fixed set and 2) generation models that produce the reply from scratch. Therefore, MRS complements existing cross-lingual generalization benchmarks that focus on classification and sequence labeling tasks. We build a generation model and a retrieval model as baselines for MRS. The two models have different strengths in the monolingual setting, and they require different strategies to generalize across languages. MRS is publicly available at https://github.com/zhangmozhi/mrs.

Figure 1: An example of a reply suggestion system. The user can click on a suggestion for a quick reply.
focuses on English (Kannan et al., 2016; Henderson et al., 2017; Deb et al., 2019). To investigate reply suggestion for other languages with possibly limited data, we build a multilingual dataset, dubbed MRS (Multilingual Reply Suggestion). From publicly available Reddit threads, we extract message-reply pairs, response sets, and machine-translated examples in ten languages (Table 1).
One interesting aspect of the reply suggestion problem is that there are two modeling approaches. Some models follow the retrieval framework and select the reply from a predetermined response set (Henderson et al., 2017). Others follow the generation framework and generate the reply from scratch (Kannan et al., 2016). The two approaches have different advantages. Generation models are more powerful because they are not constrained by the response set. In comparison, retrieval models are easier to train and run faster, and a curated response set guarantees the coherence and the safety of the model output.
The two frameworks make reply suggestion an interesting task for studying cross-lingual generalization. Most cross-lingual generalization benchmarks use classification and sequence labeling tasks (Tjong Kim Sang, 2002; Nivre et al., 2016; Strassel and Tracey, 2016; Conneau et al., 2018; Schwenk and Li, 2018; Clark et al., 2020; Hu et al., 2020; Lewis et al., 2020b). In contrast, reply suggestion has two formulations that require different cross-lingual generalization strategies. While some recent work explores cross-lingual transfer learning in generation tasks, the tasks are extractive; i.e., the output often has significant overlap with the input. These tasks include news title generation, text summarization, and question generation (Chi et al., 2020; Liang et al., 2020; Scialom et al., 2020). Reply suggestion is more challenging because the reply often does not overlap with the message (Figure 1), so the model needs to address different cross-lingual generalization challenges (Section 5.2).
We build two baselines for MRS: a retrieval model and a generation model. We first compare the models in English, where we have abundant training data and human referees. We evaluate the models with both automatic metrics and human judgments. The two models have different strengths. The generation model has higher word overlap scores and is favored by humans on average, but inference is slower, and the output is sometimes contradictory or repetitive (Holtzman et al., 2020). In contrast, the retrieval model is faster and always produces coherent replies, but the replies are sometimes too generic or irrelevant due to the fixed response set.
Next, we test models in other languages. We compare different training settings and investigate two cross-lingual generalization methods: initializing with pre-trained multilingual models (Wu and Dredze, 2019; Conneau et al., 2020; Liang et al., 2020) and training on machine-translated data (Banea et al., 2008). Interestingly, the two models prefer different methods: multilingual pretraining works better for the retrieval model, while the generation model prefers machine translation.
In summary, we present MRS, a multilingual reply suggestion dataset. We use MRS to provide the first systematic comparison between generation and retrieval models for reply suggestion in both monolingual and multilingual settings. MRS is also a useful benchmark for future research in reply suggestion and cross-lingual generalization. The rest of the paper is organized as follows. Section 2 describes the data collection process for MRS. Section 3 introduces task formulations, experiment settings, and evaluation metrics. Section 4 describes the baseline generation and retrieval models. Section 5 presents our experiment results. Section 6 discusses how MRS can help future research.

Dataset Construction
To study reply suggestion in multiple languages, we build MRS, a dataset with message-reply pairs based on Reddit comments. The dataset is available at https://github.com/zhangmozhi/mrs.
We download Reddit comments between January 2010 and December 2019 from the Pushshift Reddit dataset (Baumgartner et al., 2020). We extract message-reply pairs from each thread by considering the parent comment as an input message and the response to the comment as the reference reply. We remove comments starting with [removed] or [deleted], which are deleted messages. We also skip comments with a rating of less than one, since they are likely to contain inappropriate content.
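The extraction step above can be sketched in a few lines. This is a minimal illustration rather than the pipeline used to build MRS: the record fields (`id`, `parent_id`, `body`, `score`) and the assumption that `parent_id` matches the parent comment's `id` directly are hypothetical simplifications of the raw dump format.

```python
from typing import Dict, Iterable, List, Tuple

DELETED_PREFIXES = ("[removed]", "[deleted]")


def extract_pairs(comments: Iterable[dict], min_score: int = 1) -> List[Tuple[str, str]]:
    """Pair each usable comment (the reply) with its usable parent (the message)."""
    # Keep only comments that pass both filters described in the text.
    usable: Dict[str, dict] = {}
    for comment in comments:
        if comment["body"].startswith(DELETED_PREFIXES):
            continue  # deleted or removed comment
        if comment["score"] < min_score:
            continue  # rating below one: likely inappropriate
        usable[comment["id"]] = comment

    pairs = []
    for comment in usable.values():
        # Hypothetical simplification: parent_id equals the parent's id.
        parent = usable.get(comment["parent_id"])
        if parent is not None:
            pairs.append((parent["body"], comment["body"]))
    return pairs
```

In this sketch a pair is dropped when either side fails a filter; whether the original pipeline treats messages and replies identically is an assumption.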
After extracting examples, we identify their languages with the fastText language detector (Joulin et al., 2016). For each example, we run the model on the concatenation of the message and the reply. We discard low-confidence examples where none of the languages has a score higher than 0.7. For the remaining examples, we use the highest-scoring label as the language.
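The thresholding rule above is simple enough to state as code. The sketch below assumes the detector's output has already been collected into a dictionary from language label to score; the function name and that input format are illustrative, not part of fastText.

```python
from typing import Dict, Optional


def assign_language(scores: Dict[str, float], threshold: float = 0.7) -> Optional[str]:
    """Return the highest-scoring language, or None for a low-confidence example."""
    if not scores:
        return None
    best = max(scores, key=scores.get)
    # Discard the example when no language scores higher than the threshold.
    return best if scores[best] > threshold else None
```

For instance, `assign_language({"es": 0.6, "pt": 0.4})` returns `None`, so that example would be discarded.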
We only use English data from 2018 because English data is abundant on Reddit. Non-English examples are much scarcer, so we use data from the last ten years. We select the top ten languages with at least 100K examples. We create three splits for each language: 80% of the examples for training, 10% for validation, and 10% for testing. Table 1 shows some dataset statistics. MRS is heavily biased towards English. We have more than 48 million English examples, but fewer than one million examples for half of the languages. This gap reflects a practical challenge for reply suggestion: we do not have enough data for most languages in the world. Nevertheless, we can use MRS to test models in different multilingual settings, including cross-lingual transfer learning, where we build non-English reply suggestion models from English data (Section 3.2).
We also build response sets and filter out toxic examples. We describe these steps next.

Response Set
We build a response set of 30K to 50K most frequent replies for each language, which are used in the retrieval model. We want the response set to cover generic responses, so we select replies that appear at least twenty times in the dataset. This simple criterion works well for English, but the set is too small for other languages. For non-English languages, we augment the response set by translating the English response set to other languages with Microsoft Translator. The non-English response set is sometimes smaller than the English set, because different English responses may have the same translation.
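The frequency criterion above amounts to counting replies and keeping the common ones. The sketch below is one way to do this; the whitespace normalization and the `max_size` cap are assumptions added for illustration, and the translation-based augmentation for non-English languages is not shown.

```python
from collections import Counter
from typing import Iterable, List


def build_response_set(replies: Iterable[str],
                       min_count: int = 20,
                       max_size: int = 50_000) -> List[str]:
    """Keep replies that appear at least min_count times, most frequent first."""
    # Assumption: replies are compared after stripping surrounding whitespace.
    counts = Counter(reply.strip() for reply in replies)
    frequent = [reply for reply, n in counts.most_common() if n >= min_count]
    return frequent[:max_size]
```

Deduplicating the translated English set in the same way would also explain why a non-English set can end up smaller than the English one.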

Filtering Toxic Examples
Exchanges on Reddit are sometimes uncivil, inappropriate, or even abusive (Massanari, 2017; Mohan et al., 2017). We try to filter out toxic content, as it is not desirable for reply suggestion systems.
We use two toxicity detection models. First, we use an in-house multilingual model. The model is initialized with multilingual BERT (Devlin et al., 2019, MBERT) and fine-tuned on a mixture of proprietary and public datasets with toxic and offensive language labels. The model outputs a score from zero to one, with a higher score corresponding to a higher level of toxicity. Second, we use Perspective API, a publicly available model. Perspective API has limited free access (one query per second), so we only use the API on the English validation set, test set, and response set. For other languages, we rely on our in-house model. We filter out a message-reply pair if its score is greater than 0.9 according to the in-house model or greater than 0.5 according to Perspective API (Gehman et al., 2020). About one percent of examples are filtered. After filtering the data, we manually validate three hundred random examples and do not find any toxic examples, which confirms that our filtering method has high recall.
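The filtering rule combines the two scores with a logical OR. A minimal sketch, assuming both scores are already computed and that the Perspective API score is simply absent (`None`) for data where the API was not used:

```python
from typing import Optional


def is_toxic(in_house_score: float,
             perspective_score: Optional[float] = None,
             in_house_threshold: float = 0.9,
             perspective_threshold: float = 0.5) -> bool:
    """Flag a message-reply pair if either model scores it above its threshold."""
    if in_house_score > in_house_threshold:
        return True
    # The second model is only consulted when its score is available.
    return perspective_score is not None and perspective_score > perspective_threshold
```

The function name and signature are hypothetical; only the two thresholds come from the text.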
While we hope the filtered dataset leads to better reply suggestion models, existing filtering methods are not perfect and can introduce other biases (Dixon et al., 2018; Sap et al., 2019; Hutchinson et al., 2020). Therefore, models trained on all MRS data may still have undesirable behavior. MRS is intended to be used as a benchmark for testing cross-lingual generalization of generation and retrieval models. The dataset should not be directly used in production systems. To use the dataset in practice, additional work is required to address other possible biases and toxic or inappropriate content that may exist in the data.

Experiment Settings
After presenting the dataset, we explain how we use MRS to compare reply suggestion models. We describe the two frameworks for reply suggestion, our experiment settings, and evaluation metrics.

Task Formulation
In reply suggestion, the input is a message x, and the output is one or more suggested replies y. In practice, reply suggestion systems can choose to not suggest any replies. This decision is usually made by a separate trigger model (Kannan et al., 2016). In this paper, we focus on reply generation, so we assume that the models always need to suggest a fixed number of replies. Reply suggestion can be formulated as either a retrieval problem or a generation problem.
Retrieval Model. A retrieval model selects the reply y from a fixed response set Y (Section 2.1).
Given an input message x, the model computes a relevance score $\Theta_{xy}$ for each candidate reply $y \in Y$. The model then selects the highest-scoring replies as suggestions; e.g., the top-1 reply is $\arg\max_{y \in Y} \Theta_{xy}$.
Generation Model. A generation model generates the reply y from scratch. Generation models usually follow the sequence-to-sequence framework (Sutskever et al., 2014, SEQ2SEQ), which generates y token by token. Given an input message $x = (x_1, x_2, \cdots, x_n)$ of n tokens, a SEQ2SEQ model estimates the probability of a reply $y = (y_1, y_2, \cdots, y_m)$ of m tokens as follows:

$$p(y \mid x) = \prod_{i=1}^{m} p(y_i \mid x, y_{<i}). \tag{1}$$

The model computes the probability of the next token $p(y_i \mid x, y_{<i})$ based on the input x and the first (i − 1) tokens of the output y. The model is trained to maximize the probability of reference replies in the training set. At test time, we find the top replies that approximately maximize (1) with beam search.

The two models have different strengths. The generation model is more flexible, but the retrieval model is faster (Henderson et al., 2017), and the output can be controlled by curating the response set (Kannan et al., 2016).
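To make the decoding step concrete, the following is a simplified beam search that scores a reply by the sum of next-token log-probabilities, i.e., the logarithm of the product in (1). It is a sketch under stated assumptions: `next_token_probs` is a hypothetical stand-in for the trained model, `EOS` is a hypothetical end-of-sequence token, and hypotheses that have not ended after `max_len` steps are simply dropped.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = "</s>"  # hypothetical end-of-sequence token

# Maps (message tokens, reply prefix) to a distribution over the next token.
NextTokenFn = Callable[[Tuple[str, ...], Tuple[str, ...]], Dict[str, float]]


def beam_search(x: Tuple[str, ...],
                next_token_probs: NextTokenFn,
                beam_size: int = 3,
                max_len: int = 20) -> List[Tuple[List[str], float]]:
    """Return up to beam_size finished replies with their log p(y | x)."""
    beams = [((), 0.0)]  # (reply prefix, accumulated log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, logp in beams:
            for token, p in next_token_probs(x, prefix).items():
                if p <= 0.0:
                    continue
                # Chain rule: log p(y | x) = sum_i log p(y_i | x, y_<i).
                candidates.append((prefix + (token,), logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, logp in candidates[:beam_size]:
            # Hypotheses ending in EOS are complete; the rest stay in the beam.
            (finished if prefix[-1] == EOS else beams).append((prefix, logp))
        if not beams:
            break
    finished.sort(key=lambda c: c[1], reverse=True)
    # Strip the EOS token before returning the replies.
    return [(list(prefix[:-1]), logp) for prefix, logp in finished[:beam_size]]
```

With `beam_size=3`, the returned list corresponds to the three suggestions shown to the user. Practical implementations usually add length normalization and batching, which are omitted here.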
We compare a retrieval model and a generation model as baselines for MRS. To our knowledge, we are the first to systematically compare the two models in both monolingual and multilingual settings. We explain our training settings and metrics next.

Training Settings
For each language in MRS, we train and compare models in four settings. Future work can experiment with other settings (discussed in Section 6).
Monolingual. Here, we simply train and test models in a single language. This setting simulates the scenario where we have adequate training data for the target language. Previous reply suggestion models were only studied in the English monolingual setting.
Zero-Shot. Next, we train models in a zero-shot cross-lingual setting. We train the model on the English training set and use the model on the test set for another language. This setting simulates the scenario where we want to build models for a low-resource language using our large English set.
To generalize across languages, we initialize the models with pre-trained multilingual models (details in Section 4). These models work well in other tasks (Wu and Dredze, 2019; Liang et al., 2020). We test if they also work for reply suggestion, as different tasks often prefer different multilingual representations (Zhang et al., 2020b).
Machine Translation (MT). Another strategy for cross-lingual generalization is to train on machine-translated data (Banea et al., 2008). We train models on nineteen million English training examples machine-translated to the target language with Microsoft Translator. We compare this setting against the zero-shot setting to contrast the two cross-lingual generalization strategies.
Multilingual. Finally, we build a multilingual model by jointly training on the five languages with the most training data: English, Spanish, German, Portuguese, and French. We oversample non-English training data to have the same number of training examples across all languages (Johnson et al., 2017). We make two comparisons: 1) for the five training languages, we compare against the monolingual setting to test whether fitting multiple languages in a single model hurts performance; and 2) for other languages, we compare against the zero-shot setting to check if adding more training languages helps cross-lingual generalization.
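The oversampling step in the multilingual setting can be illustrated as follows. The exact sampling scheme is not specified in the text, so this sketch assumes the simplest one: each smaller language is padded with examples drawn with replacement until it matches the largest language.

```python
import random
from typing import Dict, List, Sequence


def oversample(datasets: Dict[str, Sequence], seed: int = 0) -> Dict[str, List]:
    """Pad every language to the size of the largest one by resampling."""
    rng = random.Random(seed)
    target = max(len(data) for data in datasets.values())
    balanced = {}
    for lang, data in datasets.items():
        data = list(data)  # assumed non-empty for every language
        # Draw the missing examples with replacement from the same language.
        extra = [rng.choice(data) for _ in range(target - len(data))]
        balanced[lang] = data + extra
    return balanced
```

The largest language (English in MRS) is left unchanged, while the others repeat some examples within an epoch.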

Evaluation Metrics
The goal of reply suggestion is to save user typing time, so the ideal metrics are click-through rate (CTR), how often the user chooses a suggested reply, and time reduction, how much time is saved by clicking the suggestion instead of typing. However, these metrics require deploying the model to test on real users, which is not feasible at full-scale while writing this paper. Instead, we focus on automated offline metrics that can guide research and model development before deploying production systems. Specifically, we evaluate models using a test set of message-reply pairs.
To identify a good metric, we compare several metrics in a pilot study by deploying an English system. We collect millions of user interactions and measure Pearson's correlation between CTR and automated offline metrics. The next paragraph lists the metrics. Based on the study, we recommend weighted ROUGE F1 ensemble (ROUGE in tables), which has the highest correlation with CTR. For the retrieval model, we follow previous work and consider mean reciprocal rank (Kannan et al., 2016, MRR) and precision at one (Henderson et al., 2017). These metrics test if the model can retrieve the reference response from a random set of responses. Alternatively, we compute MRR and precision on a subset of examples where the reference reply is in the response set so that we can directly measure the rank of the reference response in the response set. This set also allows us to compute MRR for individual responses, so we can compute macro-MRR, the average MRR over each response in the set. Higher macro-MRR can indicate diversity but has a worse correlation than computing MRR over the entire test set. For the generation model, we consider model perplexity (Adiwardana et al., 2020). Finally, we consider two word overlap scores, BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), which can be used for both retrieval and generation models.
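The ranking metrics mentioned above can be computed as in the following sketch. Each ranking is a list of candidate replies ordered from most to least relevant, and each reference is the reply the user actually sent; the function names are illustrative.

```python
from collections import defaultdict
from typing import List, Sequence


def reciprocal_rank(ranked: Sequence[str], reference: str) -> float:
    """1 / rank of the reference reply, or 0 if it was not retrieved."""
    for rank, candidate in enumerate(ranked, start=1):
        if candidate == reference:
            return 1.0 / rank
    return 0.0


def mrr(rankings: List[Sequence[str]], references: List[str]) -> float:
    """Mean reciprocal rank over all test examples."""
    scores = [reciprocal_rank(r, ref) for r, ref in zip(rankings, references)]
    return sum(scores) / len(scores)


def precision_at_one(rankings: List[Sequence[str]], references: List[str]) -> float:
    """Fraction of examples whose top-ranked candidate is the reference."""
    hits = [bool(r) and r[0] == ref for r, ref in zip(rankings, references)]
    return sum(hits) / len(hits)


def macro_mrr(rankings: List[Sequence[str]], references: List[str]) -> float:
    """Average of per-response MRR, so rare responses count as much as common ones."""
    per_response = defaultdict(list)
    for ranked, ref in zip(rankings, references):
        per_response[ref].append(reciprocal_rank(ranked, ref))
    return sum(sum(v) / len(v) for v in per_response.values()) / len(per_response)
```

Because macro-MRR weights every response equally, a model that only ranks the most frequent replies well scores lower on it than on plain MRR.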
Our pilot study shows that ROUGE has the best correlation. However, individual ROUGE F1 scores (ROUGE-1/2/3) are sensitive to small changes in sequence lengths (more so because our responses are generally short). Therefore, we use a weighted average of the three scores. This weighted score leads to the highest correlation with CTR. Intuitively, the weights balance the differences in the average magnitude of each metric and thus reduce variance on short responses.

Popular reply suggestion systems (such as Gmail and Outlook) suggest three replies for each message, while the user only selects one. To simulate this setting, we predict three replies for each message. For the retrieval model, we use the three highest-scoring replies from the response set. For the generation model, we use the top three results from beam search. Out of the three replies, we only use the reply with the highest ROUGE compared to the reference reply when computing the final metrics; i.e., the model only has to provide one "correct" reply to have a full score.
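The best-of-three scoring described above can be sketched as follows. Two things here are assumptions: tokenization is plain whitespace splitting, and the ensemble weights are passed in as a parameter because their values are not given in this text. The equal weights in the usage note are placeholders only.

```python
from collections import Counter
from typing import Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate: str, reference: str, n: int) -> float:
    """F1 of clipped n-gram overlap between candidate and reference."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def weighted_rouge(candidate: str, reference: str, weights: Sequence[float]) -> float:
    """Weighted combination of ROUGE-1, ROUGE-2, ... F1 (one weight per order)."""
    return sum(w * rouge_n_f1(candidate, reference, n)
               for n, w in enumerate(weights, start=1))


def best_of_three(suggestions: Sequence[str], reference: str,
                  weights: Sequence[float]) -> float:
    """Score a set of suggestions by its best-matching reply."""
    return max(weighted_rouge(s, reference, weights) for s in suggestions)
```

For example, `best_of_three(["thanks", "where did they say that", "ok"], "where did they say that", (1/3, 1/3, 1/3))` gives full credit because one of the three suggestions matches the reference exactly.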
We compare models primarily with ROUGE, since the metric has the best correlation in the pilot study. Nevertheless, word overlap scores have known limitations (Liu et al., 2016), as there are different ways to reply to a message. We encourage future research to investigate other metrics to understand different aspects of the model.
As examples, we also report two diversity scores: the proportion of distinct unigrams (Dist-1) and bigrams (Dist-2) in the generated replies (Li et al., 2016). While ROUGE measures the relevance of the replies, higher diversity can also increase CTR (Deb et al., 2019). We can improve the diversity of the three replies with diversity-promoting decoding (Li et al., 2016; Vijayakumar et al., 2018; Zhang et al., 2018) or latent variable models (Deb et al., 2019), but we leave this direction to future work.
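The Dist-n scores are the ratio of distinct n-grams to total n-grams across the suggested replies. A minimal sketch, assuming whitespace tokenization:

```python
from typing import Iterable


def distinct_n(replies: Iterable[str], n: int) -> float:
    """Proportion of distinct n-grams among all n-grams in the replies."""
    total = 0
    unique = set()
    for reply in replies:
        tokens = reply.split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0
```

Repeated replies lower the score: for `["i am so sorry", "i am so sorry", "what did you order"]`, Dist-1 is 8/12 and Dist-2 is 6/9.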
For our English monolingual experiments, we also complement automatic metrics with human judgments (Human in Figure 2). For each example, we display the input message and sets of three suggested replies from both generation and retrieval models to three human annotators (crowd workers). We then ask the annotators to select the set with more responses that they prefer to send as a reply. We leave evaluations for other languages to future work due to resource limitations.

Baseline Models
This section introduces the two baseline models: a retrieval model and a generation model.

Retrieval Model
Figure 2: Generation vs. retrieval model on English. Both automatic metrics and human referees prefer the generation model. The human score measures how often the human referee prefers the replies from one model over the other. The human scores do not add up to one, because the referee can choose a tie.

Table 2: Example replies suggested by English models. In the top two examples, the generation model produces more relevant replies than the retrieval model. In the bottom two examples, the generation model produces contradictory and repetitive replies, while the retrieval model selects appropriate replies from the response set.

Message: I think I want to play it so it better be available for PS4.
Generation: I'm sure it'll be available for PS4 as well.
Retrieval: I'd play it.

Message: Very nice! What model is this? Obviously new Softail/Dyna, but which?
Generation: It's a Softail, but I'm not sure what model it is.
Retrieval: Delta.

Message: Interesting proposal. Especially since they just announced Mixon is going to have a scope.
Generation: I'm pretty sure it's going to have a scope, but I don't think it's going to have a scope.
Retrieval: Where did they say that?

Message: Oh the stickers! I envy you, Lush didn't sent me any with my body lotion order.
Generation: I'm so sorry. I'm so sorry. I'm so sorry.
Retrieval: What did you order?

For the retrieval model, we use the architecture from Henderson et al. (2017), except that we replace the feedforward network encoders with Transformers (Vaswani et al., 2017). Given an input message x and candidate reply y, two Transformer encoders $\Phi_x$ and $\Phi_y$ map the message and the reply to two vectors $\Phi_x(x)$ and $\Phi_y(y)$. The relevance score $\Theta_{xy}$ between the message x and the reply y is the dot product of the two vectors:

$$\Theta_{xy} = \Phi_x(x)^\top \Phi_y(y).$$

Henderson et al. (2017) also add a language model score to encourage more frequent replies. We do not use a language model score, for simplicity.

We train the model with the symmetric loss from Deb et al. (2019), computed over a batch of n training messages $\{x_i\}_{i=1}^{n}$ and their corresponding replies $\{y_i\}_{i=1}^{n}$. In a regular softmax loss, the denominator only sums over one variable. The denominator in the symmetric loss sums over both variables to encourage bidirectional compatibility: the message should be predictive of the reply, and the reply should be predictive of the message. This encourages the model to select responses specific to the message, similar to the Maximum Mutual Information objective from Li et al. (2016).
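The dot-product scoring and the idea of a denominator that sums over both messages and replies can be illustrated with a small pure-Python sketch. The exact normalization below is an assumption: it is one plausible in-batch form in which the matched pair is counted once, and it may differ in detail from the loss of Deb et al. (2019). The encoder outputs are represented as plain lists of floats.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def relevance_scores(message_vecs: Sequence[Vector],
                     reply_vecs: Sequence[Vector]) -> List[List[float]]:
    """theta[i][j] is the dot product of message i and reply j."""
    return [[dot(m, r) for r in reply_vecs] for m in message_vecs]


def symmetric_objective(theta: Sequence[Sequence[float]]) -> float:
    """Assumed symmetric in-batch objective (to be maximized).

    For pair i, the numerator is exp(theta[i][i]); the denominator sums
    exp-scores over all replies for message i (row i) and over all
    messages for reply i (column i), counting the matched pair once.
    """
    n = len(theta)
    shift = max(max(row) for row in theta)  # stability; ratios are unchanged
    e = [[math.exp(t - shift) for t in row] for row in theta]
    total = 0.0
    for i in range(n):
        row_sum = sum(e[i][j] for j in range(n))
        col_sum = sum(e[j][i] for j in range(n))
        total += e[i][i] / (row_sum + col_sum - e[i][i])
    return total
```

A batch in which each message scores highest with its own reply yields a larger objective than a batch in which the pairs are mismatched, which is the behavior training is meant to encourage.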
The two encoders $\Phi_x$ and $\Phi_y$ are initialized with MBERT (Devlin et al., 2019), a Transformer with 110 million parameters pre-trained on multilingual corpora. Initializing with MBERT allows the model to generalize across languages (Wu and Dredze, 2019). In Appendix A, we experiment with another pre-trained multilingual Transformer, XLM-R (Conneau et al., 2020). We use the "base" version with 270 million parameters.

Generation Model
For the generation model, we follow the SEQ2SEQ architecture (Section 3.1). We use a Transformer encoder to read the input x, and another Transformer decoder to estimate $p(y_i \mid x, y_{<i})$ in (1).
We cannot initialize the generation model with MBERT or XLM-R, because the model also has a decoder. Instead, we use Unicoder-XDAE (Liang et al., 2020), a pre-trained multilingual SEQ2SEQ model, which can generalize across languages in extractive generation tasks such as news title generation and question generation. We test if Unicoder-XDAE also generalizes in the more challenging reply suggestion task. There are other generation models we can use, which we discuss as future work in Section 6.

Training Details
We train the retrieval model using the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1e-6, default β, and a batch size of 256. For the monolingual and zero-shot settings, we use twenty epochs for English and fifty epochs for other languages. We use ten epochs for the MT and multilingual settings. The first 1% of training steps are warmup steps. During training, we freeze the embedding layers and the bottom two Transformer layers of both encoders, which preserves multilingual knowledge from the pre-trained model and improves cross-lingual transfer learning (Wu and Dredze, 2019). All hyperparameters are manually tuned on the English validation set.

We use almost the same hyperparameters as Liang et al. (2020) to train generation models. Specifically, we use the Adam optimizer with an initial learning rate of 1e-5, default β, and a batch size of 1024. For the monolingual and zero-shot settings, we use four epochs for English and 5000 steps for other languages (equivalent to two to nine epochs depending on the language). We use one epoch for the MT setting and 40,000 steps for the multilingual setting. The first 20% of training steps are warmup steps. We freeze the embedding layer during training for faster training.
All models are trained with eight Tesla V100 GPUs. It takes about an hour to train the generation

Results and Discussion
We experiment with the two baselines from Section 4 on MRS. We first compare the models in English, where we have enough training data and human referees. We then build models for other languages and compare training settings listed in Section 3.2.

Results on English
Figure 2 compares the generation and retrieval models in the English monolingual setting. The generation model not only has a higher relevance (ROUGE) score but also generates more diverse replies (higher Dist scores). For English, we also ask three human referees to compare the model outputs on a subset of 500 test examples. Again, the referees prefer the generation model more often than the retrieval model (Figure 2).
We look at some generated responses to understand the models qualitatively. In the top two examples in Table 2, the generation model produces replies highly specific to the input message. In contrast, the retrieval model fails to find a relevant reply, because the response set does not cover these topics. This explains why the generation model has much higher ROUGE and distinct n-gram scores than the retrieval model. However, the expressiveness comes at the cost of a lack of control over the generated replies. The generation model sometimes produces incoherent replies that are repetitive and/or contradictory, as shown in the bottom two examples of Table 2. For the retrieval model, we can easily avoid these problems by curating the fixed response set. These degenerative behaviors are observed in other text generation tasks and can be mitigated by changing training and decoding objectives (Holtzman et al., 2020; Welleck et al., 2020). We leave these directions for future research.

Results on Other Languages
After comparing English models, we experiment on other languages using the settings from Section 3.2.
Retrieval Model. Table 3 shows results for the retrieval model when initialized with MBERT. The retrieval model can generalize fairly well across languages, as the ROUGE in the zero-shot setting is often close to the monolingual setting. This result confirms that initializing with MBERT is an effective strategy for cross-lingual generalization. Training on MT data is usually worse than training in the zero-shot setting. This is possibly because the MT system may create artifacts that do not appear in organic data (Artetxe et al., 2020). For the multilingual model, the training language ROUGE scores are lower than monolingual training (gray cells in Table 3). However, multilingual training sometimes leads to better ROUGE on unseen languages compared to transferring from only English (zero-shot). Previous work observes similar results on other tasks, where multilingual training hurts training languages but helps generalization to unseen languages (Johnson et al., 2017; Conneau et al., 2020). Finally, Appendix A shows similar results when initializing with XLM-R (Conneau et al., 2020).
Generation Model. Table 4 shows results for the generation model. In the monolingual setting, the generation model has higher scores than the retrieval model on most languages, consistent with the English result (Figure 2). However, unlike the retrieval model, the generation model fails to generalize across languages in the zero-shot setting, despite using Unicoder-XDAE for initialization. We do not show zero-shot results in Table 4, because the ROUGE scores are close to zero for non-English languages. After training on English data, the model always produces English replies, regardless of the input language; i.e., the generation model "forgets" multilingual knowledge acquired during pre-training (Kirkpatrick et al., 2017). This result is surprising because Unicoder-XDAE works in the zero-shot setting for other generation tasks (Liang et al., 2020), which suggests that reply suggestion poses unique challenges for cross-lingual transfer learning. Interestingly, the multilingual model can generalize to unseen languages; perhaps training on multiple languages regularizes the model to produce replies in the input language. Overall, the best method to generalize the generation model across languages is to use machine-translated data.

Future Work
MRS opens up opportunities for future research. Our experiments use four training settings (Section 3.2), but there are many other settings to explore. For example, we can use other combinations of training languages, which may work better for some target languages (Ammar et al., 2016; Cotterell and Heigold, 2017; Ahmad et al., 2019; Lin et al., 2019; Zhang et al., 2020a). We are also interested in training on both organic data and MT data; i.e., mixing the zero-shot and MT settings.
We can also compare other models on MRS. For the English monolingual setting, we can initialize the generation model with state-of-the-art language models (Radford et al., 2019; Brown et al., 2020; Zhang et al., 2020c). For cross-lingual settings, we can initialize the generation model with several recent pre-trained multilingual SEQ2SEQ models (Chi et al., 2020, 2021; Tran et al., 2020; Lewis et al., 2020a; Xue et al., 2020). For retrieval models, we can experiment with other multilingual encoders that use different pre-training tasks (Artetxe and Schwenk, 2019; Chidambaram et al., 2019; Reimers and Gurevych, 2020; Feng et al., 2020).
Another idea is to combine the two models. Given an input message, we first use a generation model to create a set of candidate replies. We then use a retrieval model to compute relevance scores and rerank these candidates. Reranking the output of a generation model helps other natural language processing tasks (Shen et al., 2004; Collins and Koo, 2005; Ge and Mooney, 2006), and previous work uses a similar idea for chatbots (Qiu et al., 2017).
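The generate-then-rerank idea can be sketched as a small pipeline. Both `generate` and `score` are hypothetical callables standing in for a trained generation model and a trained retrieval scorer; the sketch only fixes how the two are wired together.

```python
from typing import Callable, List


def generate_then_rerank(message: str,
                         generate: Callable[[str], List[str]],
                         score: Callable[[str, str], float],
                         k: int = 3) -> List[str]:
    """Generate candidate replies, then keep the k with the highest relevance."""
    # Drop duplicate candidates while preserving the generator's order.
    candidates = list(dict.fromkeys(generate(message)))
    # sorted() is stable, so ties keep the generator's original ranking.
    ranked = sorted(candidates, key=lambda reply: score(message, reply), reverse=True)
    return ranked[:k]
```

In this arrangement the generation model supplies message-specific candidates, and the retrieval scorer decides which of them are shown.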
Our experiments show that reply suggestion poses unique challenges for cross-lingual generalization, especially for the generation model. Future work can study methods to improve cross-lingual generalization. Some examples include applying adversarial learning (Chen et al., 2018, 2019; Huang et al., 2019), using adapters (Pfeiffer et al., 2020), adaptive transfer, mixing pre-training and fine-tuning (Phang et al., 2020), and bringing a human in the loop (Yuan et al., 2020).

Conclusion
We present MRS, a multilingual dataset for reply suggestion. We compare a generation and a retrieval baseline on MRS. The two models have different strengths in the English monolingual setting and require different strategies to transfer across languages. MRS provides a benchmark for future research in both reply suggestion and cross-lingual transfer learning.

Ethical Considerations
Data Collection. No human annotators are involved while creating MRS. The examples and response sets of MRS come from publicly available Reddit dumps from Pushshift, which are used in more than a hundred peer-reviewed publications (Baumgartner et al., 2020).
Privacy. Examples in MRS do not have the username and are from publicly available data. Therefore, we do not anticipate any privacy issues. In the pilot study (Section 3.3), we measure the correlation of user CTR with different evaluation metrics. To protect user privacy, we only collect aggregated statistics (CTR) and use no other information.
Potential Biased and Toxic Content. Despite our best effort to filter toxic contents (Section 2.2), the dataset may not be perfectly cleansed and may have other biases that are typical in open forums (Massanari, 2017;Mohan et al., 2017). Users should be aware of these issues. We will continue to improve the quality of the dataset.
Intended Use of MRS. Because of the possible biases and inappropriateness in the data, MRS should not be directly used to build production systems (as mentioned in Section 2.2). The main use of MRS is to test cross-lingual generalization for text retrieval and generation models, and researchers should be aware of possible ethical issues of Reddit data before using MRS.

Table 5: Results for the retrieval model initialized with XLM-R (Conneau et al., 2020). The settings are described in Section 3.2. Gray cells indicate when the model is trained on the target language training set. White cells indicate cross-lingual settings where the target language training set is not used for training. For each language, we boldface the best ROUGE scores in cross-lingual settings (white cells). We observe similar trends as with MBERT (Table 3).