Probing Multilingual Language Models for Discourse

Pre-trained multilingual language models have become an important building block in multilingual Natural Language Processing. In the present paper, we investigate a range of such models to find out how well they transfer discourse-level knowledge across languages. This is done with a systematic evaluation on a broader set of discourse-level tasks than has previously been assembled. We find that the XLM-RoBERTa family of models consistently shows the best performance, by simultaneously being good monolingual models and degrading relatively little in a zero-shot setting. Our results also indicate that model distillation may hurt the cross-lingual transfer ability of sentence representations, while language dissimilarity at most has a modest effect. We hope that our test suite, covering 5 tasks with a total of 22 languages in 10 distinct families, will serve as a useful evaluation platform for multilingual performance at and beyond the sentence level.


Introduction
Large-scale pre-trained neural language models have become immensely popular in the natural language processing (NLP) community in recent years (Devlin et al., 2019; Peters et al., 2018). When used as contextual sentence encoders, these models have led to remarkable improvements in performance for a wide range of downstream tasks (Qiu et al., 2020). In addition, multilingual versions of these models (Devlin et al., 2019; Conneau and Lample, 2019) have been successful in transferring knowledge across languages by providing language-independent sentence encodings.
The general usefulness of pre-trained language models has been convincingly demonstrated thanks to persistent creation and application of evaluation datasets by the NLP community. Discourse-level analysis is particularly interesting to study, given that many of the currently available models are trained with relatively short contexts such as pairs of adjacent sentences. Wang et al. (2019) use a diverse set of natural language understanding (NLU) tasks to investigate the generality of the sentence representations produced by different language models. Hu et al. (2020) use a broader set of tasks from across the NLP field to investigate the ability of multilingual models to transfer various types of knowledge across language boundaries.
Our goal in this paper is to systematically evaluate multilingual performance on NLU tasks, particularly at the discourse level. This combines two of the most challenging aspects of representation learning: multilinguality and discourse-level analysis. A few datasets have been used for this purpose before, most prominently the XNLI evaluation set (Conneau et al., 2018) for Natural Language Inference (NLI), and recently also XQuAD (Artetxe et al., 2020) and MLQA (Lewis et al., 2020) for Question Answering (QA). We substantially increase the breadth of our evaluation by adding three additional tasks:
1. Penn Discourse TreeBank (PDTB)-style implicit discourse relation classification on annotated TED talk subtitles in seven languages (Section 3.1.1)
2. Rhetorical Structure Theory (RST)-style discourse relation classification with a custom set consisting of treebanks in six non-English languages (Section 3.1.2)
3. Stance detection with a custom dataset in five languages (Section 3.1.3)
We investigate the cross-lingual generalization capabilities of seven multilingual sentence encoders of considerably varying sizes through their cross-lingual zero-shot performance, which, in this context, refers to the evaluation scheme where sentence encoders are tested on languages that they were not exposed to during training. The compiled test suite consists of five tasks, covering 22 different languages in total.
We specifically focus on the zero-shot transfer scenario, where a sufficient amount of annotated data for fine-tuning a pre-trained language model is assumed to be available in only one language. We believe that this is the most realistic scenario for a great number of languages; therefore, zero-shot performance is the most direct way of assessing cross-lingual usefulness on a large scale.
Our contributions are as follows: (i) we provide a detailed analysis of a wide range of sentence encoders on a large number of probing tasks, several of which have not previously been used with multilingual sentence encoders despite their relevance; (ii) we provide suitably pre-processed versions of these datasets to be used as a multilingual benchmark for future work, with strong baselines provided by our evaluation; (iii) we show that zero-shot performance on discourse-level tasks is not correlated with any kind of language similarity and is hard to predict; (iv) we show that knowledge distillation may selectively destroy multilingual transfer ability in a way that harms zero-shot transfer but is not visible in evaluations where the models are trained and evaluated on the same language.

Background
The standard way of training a multilingual language model is on large non-parallel multilingual corpora, e.g. Wikipedia articles, where the models are not provided with any explicit mapping across languages, which renders the cross-lingual performance of such models puzzling. Pires et al. (2019) and Wu and Dredze (2019) are the earliest studies to explore this puzzle by trying to uncover the factors that give multilingual BERT (henceforth, mBERT) its cross-lingual capabilities. Pires et al. (2019) perform a number of probing tasks and hypothesize that the sentence pieces shared across languages give mBERT its generalization ability by forcing the other pieces to be mapped into the same space. Similarly, Wu and Dredze (2019) evaluate the performance of mBERT on five tasks and report that while mBERT shows strong zero-shot performance, it also retains language-specific information in each layer. Chen et al. (2019a) propose a benchmark to evaluate sentence encoders specifically on discourse-level tasks. The proposed benchmark consists of discourse relation classification and a number of custom tasks, such as finding the correct position of a randomly moved sentence in a paragraph or determining whether a given paragraph is coherent. The benchmark is confined to English and hence only targets monolingual English models.
Two very recent studies, XTREME (Hu et al., 2020) and XGLUE (Liang et al., 2020), constitute the first studies of the cross-lingual generalization abilities of pre-trained language models via their zero-shot performance. The tasks in both studies largely overlap: XTREME serves as a cross-lingual benchmark consisting of well-known datasets, e.g. XNLI and XQuAD. On the other hand, while covering most of the XTREME tasks, XGLUE offers new datasets which either focus on the relation between a pair of inputs, such as web page-query matching, or on text generation via question/news title generation. In addition to mBERT and certain XLM and XLM-R versions, XTREME includes MMTE (Arivazhagan et al., 2019), whereas XGLUE evaluates Unicoder (Huang et al., 2019) among its baselines.

Cross-lingual Discourse-level Evaluation
In discourse research, sentences/clauses are not understood in isolation but in relation to one another. The semantic interactions between these units are usually regarded as the backbone of coherence in various prominent discourse theories, including that underlying the Penn Discourse TreeBank (PDTB) (Prasad et al., 2007), and Rhetorical Structure Theory (RST) (Mann and Thompson, 1988) used in the RST Discourse Treebank (Carlson and Marcu, 2001). Modelling such interactions requires an understanding that goes beyond the sentence level and, from this point of view, determining any kind of relation between sentences/clauses can be associated with discourse.
Although paraphrase detection and natural language inference may not strike one as discourse-level tasks at first glance, they both deal with semantic relations between sentences. Tonelli and Cabrio (2012) show that textual entailment is, in fact, a subclass of the Restatement relations of the PDTB framework, whereas Nie et al. (2019) report an increase in discourse relation classification accuracy when NLI is used as an intermediate fine-tuning task. In a similar vein, a stance towards a judgement, Favor or Against, can be seen as CONTINGENCY:Cause:reason and COMPARISON:Contrast in PDTB, or Explanation and Antithesis in RST, respectively. Therefore, these NLU tasks can be seen as special subsets of discourse relation classification; only a model with a good understanding beyond individual sentences can be expected to solve them. Finally, since question answering requires discourse-level understanding, we believe that classifying it as a discourse-level task should be uncontroversial.

Tasks & Datasets
In this section, we present our task suite and the datasets used for training and zero-shot evaluation. For the sake of clarity, we name each task after the dataset used for training.

Implicit Discourse Relation Classification (PDTB)
Implicit discourse relations hold between adjacent sentence pairs that are not explicitly signaled with a connective such as because or however. Implicit discourse relation classification is the task of determining the sense conveyed by such adjacent sentences, which readers can easily infer. Classifying implicit relations constitutes the most challenging step of shallow discourse parsing (Xue et al., 2016). Training is performed on PDTB-3, where sections 2-20 and 0-1 are used for training and development, respectively. The zero-shot evaluation is performed on the TED-MDB corpus (Zeyrek et al., 2019), a PDTB-style annotated parallel corpus consisting of six TED talk transcripts, and on a recent Chinese annotation effort on TED talk transcripts which, however, are mostly not parallel to TED-MDB (Long et al., 2020). Due to the small size of the test sets, we confine ourselves to the top-level senses (Contingency, Comparison, Expansion, Temporal), which is also the most common setting for this task. Despite the limited size of TED-MDB, zero-shot transfer is possible and yields meaningful results, as shown by Kurfalı and Ostling (2019). In total, seven languages are evaluated in this task: English, German, Lithuanian, Portuguese, Polish, Russian and Chinese.
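As a minimal illustration of the preprocessing for this task, the reduction of hierarchical PDTB senses to the four top-level classes can be sketched as follows (the sense-string format and helper names here are illustrative, not the actual PDTB-3 release layout):

```python
# Sketch: reduce hierarchical PDTB sense labels to the four top-level
# classes and build sentence-pair classification instances.
TOP_LEVEL = {"Temporal", "Contingency", "Comparison", "Expansion"}

def to_top_level(sense: str) -> str:
    """Map a hierarchical sense such as 'Contingency.Cause.Reason'
    to its top-level class."""
    top = sense.split(".")[0]
    if top not in TOP_LEVEL:
        raise ValueError(f"Unknown top-level sense: {top}")
    return top

def make_instance(arg1: str, arg2: str, sense: str) -> dict:
    """Build one classification instance: the two arguments of an
    implicit relation plus the top-level sense label."""
    return {"text_a": arg1, "text_b": arg2, "label": to_top_level(sense)}
```

The two arguments are then fed to the encoder as a standard sentence pair (segment A / segment B), as in any sequence-pair classification task.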

Rhetorical Relation Classification (RST)
Rhetorical relations are just another name for discourse relations, but the term is most commonly associated with Rhetorical Structure Theory (RST) (Mann and Thompson, 1988). Similar to PDTB's discourse relations, rhetorical relations also denote links between discourse units, but they differ considerably from the former. The difference largely stems from the respective theories' views on the structure of discourse. RST conceives of discourse as one connected tree-shaped structure, assuming hierarchical relations among discourse relations. PDTB, on the other hand, does not make any claims regarding the structure of discourse and annotates discourse relations only in a local context (i.e. adjacent clauses/sentences) without assuming any relations at higher levels. Hence, evaluation on RST and PDTB relations can be seen as complementary: the former targets both global and local discourse structure, whereas PDTB focuses only on local structure. We use the English RST-DT (Carlson and Marcu, 2001) for training, where a randomly selected set of 35 documents is reserved for development. However, unlike for PDTB, there is no compact parallel RST corpus; RST annotations across languages usually differ from each other in several ways. Therefore, we follow Braud et al. (2017) and create a custom multilingual corpus for the zero-shot experiments, consisting of the following languages: Basque (Iruskieta et al., 2013), Brazilian Portuguese (Cardoso et al., 2011; Collovini et al., 2007; Pardo and Seno, 2005), Chinese (Cao et al., 2018), German (Stede, 2004), Spanish (Da Cunha et al., 2011) and Russian (Pisarevskaya et al., 2017). We perform a normalization step on each treebank, which includes binarization of non-binary trees and mapping all relations to the 18 coarse-grained classes described by Carlson and Marcu (2001). The normalization step is performed with the pre-processing scripts of Braud et al. (2017).
Due to memory constraints, we limit sequence lengths to 384 tokens. We therefore only keep relations whose first discourse unit is shorter than 150 words, so that both units can be equally represented; this leads to the omission of only 5% of all non-English relations.
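The length filter described above can be sketched as follows (the function name and the whitespace-based word count are our simplifications):

```python
def keep_relation(unit1: str, unit2: str, unit1_cap: int = 150) -> bool:
    """Keep an RST relation instance only if the first discourse unit is
    shorter than `unit1_cap` words, so that the second unit is not crowded
    out of the 384-token encoder window."""
    return len(unit1.split()) < unit1_cap

def filter_relations(relations: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Apply the filter to a list of (unit1, unit2) pairs."""
    return [(u1, u2) for u1, u2 in relations if keep_relation(u1, u2)]
```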

Stance Detection (X-Stance)
Stance detection is the task of determining the attitude expressed in a text towards a target claim. For our experiments, we mainly use the X-Stance corpus, which consists of 60K answers to 150 questions concerning politics in German, Italian and French (Vamvas and Sennrich, 2020). Unlike for the other tasks, we select German as the training language for stance detection, as it is the largest language in X-Stance. Following the official split, we use the German instances of the training and development sets during fine-tuning and the non-German instances of the test set for evaluation. Furthermore, we extend the scope of our zero-shot evaluation with two additional datasets, one in English (Chen et al., 2019b) and one in Chinese, which also consist of stance-annotated claim-answer pairs, albeit in different domains.

Natural Language Inference (XNLI)
Natural language inference (NLI) is the task of determining whether a premise sentence entails, contradicts or is neutral to a hypothesis sentence. MultiNLI and the mismatched part of the development data (Williams et al., 2018) are used for training and validation, respectively. The evaluation is performed on the test sets of the XNLI (Conneau et al., 2018) corpus which covers the following 14 languages in addition to English: French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu.

Question Answering (XQuAD)
Question answering is the task of identifying the span in a paragraph that answers a given question. We use SQuAD v1.1 (Rajpurkar et al., 2016) for training. We evaluate the models on the popular XQuAD dataset, which contains translations of the SQuAD v1.1 development set into ten languages (Artetxe et al., 2020): Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi.

Experiments
We evaluate a wide range of multilingual sentence encoders which learn contextual representations. The evaluated models represent a broad spectrum of model sizes, in order to allow practitioners to estimate the trade-off between model size and accuracy.

Sentence Encoders
The sentence encoders evaluated in the current paper are described in detail below, and their characteristics are summarized in Table 2.
Multilingual BERT (mBERT): mBERT is a transformer-based language model trained with masked language modelling and next-sentence prediction objectives, similar to the original English BERT model (Devlin et al., 2019). mBERT is pre-trained on the Wikipedias of 104 languages with a shared word-piece vocabulary. As discussed in Section 2, its input is not marked with any language-specific signal, and mBERT has no objective encouraging it to encode different languages in the same space.
distilmBERT: distilmBERT is a compressed version of mBERT obtained via model distillation (Sanh et al., 2019). Model distillation is a compression technique in which a smaller model, called the student, learns to mimic the behavior of a larger model, called the teacher, by matching its output distribution. distilmBERT is claimed to reach 92% of mBERT's performance on XNLI while being twice as fast and 25% smaller. However, to the best of our knowledge, there is no comprehensive analysis of distilmBERT's zero-shot performance.
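To make the distillation objective concrete, the core soft-target loss can be sketched as a temperature-softened KL divergence between teacher and student output distributions. This is a simplification: the actual distilmBERT training objective combines this term with a masked language modelling loss and a hidden-state cosine loss (Sanh et al., 2019).

```python
import math

def softmax(logits: list[float], T: float = 1.0) -> list[float]:
    """Temperature-softened softmax; higher T flattens the distribution."""
    exps = [math.exp(l / T) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits: list[float],
                      teacher_logits: list[float],
                      T: float = 2.0) -> float:
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 so gradients keep a comparable magnitude across T."""
    p = softmax(teacher_logits, T)  # soft teacher targets
    q = softmax(student_logits, T)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return T * T * kl
```

The loss is zero when the student exactly matches the teacher's distribution and grows as the two distributions diverge.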
XLM: XLM is a transformer-based language model aimed at extending BERT to the cross-lingual setting (Conneau and Lample, 2019). To this end, XLM supplements masked language modelling (MLM) with a translation language modelling (TLM) objective, in which masked tokens are predicted from a concatenated pair of parallel sentences, providing an explicit cross-lingual training signal. We evaluate two variants: XLM-100, trained with MLM only on 100 languages, and XLM-tlm, trained with both MLM and TLM on the XNLI languages.
XLM-RoBERTa (XLM-R): XLM-R is trained with the MLM objective on a large CommonCrawl corpus covering 100 languages (Conneau et al., 2020). There are two released XLM-R models, XLM-R-base and XLM-R-large, named after the BERT architecture they are based on. Compared to the original multilingual BERT, the XLM-RoBERTa models have a considerably larger vocabulary, which results in larger models.

Experimental Setup
A summary of the datasets used in the experiments is provided in Table 1. Except for PDTB, all datasets are publicly available. As stated earlier, the training language is English for all tasks except stance detection, where German is preferred due to the size of the available data. In the spirit of true zero-shot transfer, the validation sets only consist of instances in the training language; hence, no cross-lingual information whatsoever is utilized during training or model selection. For the evaluation metrics, we stick to the default metric of each task (Table 1). We set the sequence length to 384 for question answering and RST relation classification, to 250 for stance detection, and to 128 for the remaining tasks. At evaluation time, we keep the same configuration. For all models, the Adam epsilon is set to 1e-8 and the maximum gradient norm to 1.0. A learning rate of 2 × 10−5 is used for all models except XLM-R-large and XLM-100, where it is set to 5 × 10−6. We adopt the standard fine-tuning approach and fine-tune all models for 4 epochs. We do not apply any early stopping and use the model with the best validation performance in the zero-shot experiments. All tasks are implemented using Huggingface's Transformers library (Wolf et al., 2019). As the fine-tuning procedure is known to show high variance on small training datasets, all models are run 4 times with different seeds and the average performance is reported. For the XLM and XLM-tlm models, we fall back to English language embeddings for non-XNLI languages. All experiments are run on a single TITAN X (12 GB) GPU.
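For reference, the hyperparameters above can be collected into a small configuration sketch (the dictionary and function names are ours, not from any released codebase):

```python
# Per-task maximum sequence lengths (tokens), as described in the text.
MAX_SEQ_LEN = {
    "xquad": 384,     # question answering
    "rst": 384,       # RST relation classification
    "x-stance": 250,  # stance detection
    "pdtb": 128,      # implicit discourse relation classification
    "xnli": 128,      # natural language inference
}

# Optimizer settings shared by all models.
COMMON = {"adam_epsilon": 1e-8, "max_grad_norm": 1.0, "epochs": 4, "seeds": 4}

def learning_rate(model_name: str) -> float:
    """5e-6 for the two largest-vocabulary models, 2e-5 for the rest."""
    return 5e-6 if model_name in {"xlm-r-large", "xlm-100"} else 2e-5
```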

Results and Discussion
We provide an overview of the main results in Figure 1. The detailed results with per-language breakdown are provided in the Appendix A.
Overall, there is a clear difference between the training-language and zero-shot performance of all models. When averaged over all tasks, the performance loss in zero-shot transfer ranges from 15.58% (XLM-R-large) to 34.96% (distilmBERT), which clearly highlights the room for improvement, especially for smaller model sizes. In the rest of this section, we discuss the results in terms of encoder type, task and language.
Model-wise analysis The ranking of the encoders displays relatively little variation across tasks, with XLM-R-large exhibiting the best zero-shot performance across all tasks, outperforming the second-best model (XLM-R-base) by 5.98%. distilmBERT, on the other hand, fails to match the performance of the other encoders. The translation language modelling (TLM) objective proves to be a better training objective than MLM, consistently outperforming the vanilla XLM in all tasks. XLM-tlm outperforms XLM-100 on the XNLI languages as well, possibly because of the 'curse of multilinguality' (Conneau et al., 2020): the degradation of overall performance in proportion to the number of training languages. However, the training setting (e.g. training data, hyperparameters) outweighs the 'curse of multilinguality', as XLM-R-base clearly outperforms XLM-tlm even on the XNLI languages. It would be interesting to see how an XLM-R trained with the TLM objective on a small set of languages, e.g. the XNLI languages, would perform.
distilmBERT is the lightest model evaluated in the current investigation. It has been shown to retain 92% of mBERT's performance on certain XNLI languages. Our results suggest that distilmBERT delivers on this promise, although to a lesser extent. When averaged over all tasks, distilmBERT retains 93% of mBERT's source-language performance. However, its relative performance drops significantly, to 82%, in zero-shot transfer. That is, distilmBERT is not as successful when it comes to copying mBERT's cross-lingual abilities. Furthermore, its performance relative to mBERT is not stable across tasks either: it achieves only 69% of mBERT's zero-shot performance on RST, but 89% on XNLI. Its low memory requirements and its speed (with the same batch size, it is 2x faster than mBERT and 5x faster than XLM-R-large) certainly make distilmBERT an attractive option; however, the results show that its zero-shot performance is considerably lower than its source-language performance and highly task-dependent, hence hard to predict. Table 3 shows to what extent the encoders manage to transfer their source-language performance to zero-shot languages. Overall, zero-shot performance shows high variance across tasks, which is quite interesting given that all tasks are on the same linguistic level. It is also surprising that mBERT manages better zero-shot transfer than all XLM models while being almost as consistent as XLM-R-base.
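The relative scores of the kind reported in Table 3 can be computed as a simple retention percentage (the function name is ours):

```python
def retention(source_score: float, zero_shot_score: float) -> float:
    """Percentage of source-language performance retained under zero-shot
    transfer; 100 minus this value is the cross-lingual transfer gap."""
    if source_score <= 0:
        raise ValueError("source_score must be positive")
    return 100.0 * zero_shot_score / source_score
```

A value above 100 means the model scored higher on the zero-shot languages than on its training language.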

Task-wise Analysis
Overall, the results show that even modern sentence encoders struggle to capture inter-sentential interactions, in both monolingual and multilingual settings, contrary to what the high performance on well-known datasets (e.g. PAWS (Hu et al., 2020)) may suggest. We believe that this finding supports our motivation to propose new probing tasks to obtain a fuller picture of the capabilities of these encoders.
Language-wise Analysis: In all tasks, regardless of the model, training-language performance is better than even the best zero-shot performance. The only exception is XLM-R-large's performance on X-Stance, where the zero-shot performance is on par with its performance on the German test set.
An important aspect of cross-lingual research is predictability. The zero-shot performance of a given language does not seem stable across tasks (e.g. German is the language with the worst RST performance, yet one of the best in XNLI). We investigate this further following Lauscher et al. (2020), who report high correlations between syntactic similarity and zero-shot performance for the low-level tasks of POS tagging and dependency parsing. We conduct the same correlation analysis using Lang2Vec (Littell et al., 2017). However, syntactic and geographical similarity only weakly correlate with zero-shot performance across the tasks (on average, Pearson's r = .46 and Spearman's ρ = .53 for syntactic similarity; Pearson's r = .30 and Spearman's ρ = .45 for geographical similarity). Such low correlations are important, as they further support the claim that the tasks are beyond the sentence level, and they highlight a need for further research to reveal the factors at play in zero-shot transfer of discourse-level tasks.
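The correlation analysis itself needs only a few lines of code; a dependency-free sketch is given below (unlike standard implementations, ties in the Spearman ranking are ignored for simplicity):

```python
def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation, e.g. between per-language similarity scores
    and per-language zero-shot scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman correlation: Pearson on the ranks (no tie handling)."""
    def ranks(vs: list[float]) -> list[float]:
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(xs), ranks(ys))
```

In practice one would use `scipy.stats.pearsonr` and `scipy.stats.spearmanr`, which also return p-values and handle ties.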

Conclusion
As pre-trained multilingual sentence encoders have become prevalent in natural language processing, research on cross-lingual zero-shot transfer is gaining increasing importance (Hu et al., 2020; Liang et al., 2020). In this work, we evaluate a wide range of sentence encoders on a variety of discourse-level tasks in a zero-shot transfer setting. Firstly, we enrich the set of available probing tasks by introducing three resources which have not been utilized in this context before. We then systematically evaluate a broad range of widely used sentence encoders of considerably varying sizes, an analysis which has not been carried out before.
The main variable we look at is the performance gap between training-language evaluation and zero-shot evaluation. Unsurprisingly, there is nearly always such a gap, but its magnitude depends on a number of factors:
• Distillation: the distilled mBERT model has a larger gap than the full mBERT model, indicating loss of multilingual transfer ability during distillation.
• Language similarity: the gap correlates only weakly with measures of language similarity (syntactic and geographical), indicating that sentence encoders generally transfer discourse-level information about as well between similar and dissimilar languages.
• High variance: apart from the above, we also observe a generally high variance in the gap magnitude between different tasks in our benchmark suite.
These observations provide several starting points for future work: investigating why knowledge distillation seems to hurt zero-shot performance to a much greater extent than same-language sentence encoding ability, and what can be done to solve this problem; and explaining the large variations in the zero-shot transfer gap between different discourse-level NLP tasks.

Table 3: Relative zero-shot performance of each encoder compared to its source-language performance (metrics differ between tasks, but higher is better in all cases). The figures show what percentage of the source-language performance is retained through zero-shot transfer in each task; Hu et al. (2020) refer to this as the cross-lingual transfer gap. A score above 100 indicates better zero-shot performance than on the training language.