Saksham Singhal


pdf bib
XLM-E: Cross-lingual Language Model Pre-training via ELECTRA
Zewen Chi | Shaohan Huang | Li Dong | Shuming Ma | Bo Zheng | Saksham Singhal | Payal Bajaj | Xia Song | Xian-Ling Mao | Heyan Huang | Furu Wei
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In this paper, we introduce ELECTRA-style tasks to cross-lingual language model pre-training. Specifically, we present two pre-training tasks, namely multilingual replaced token detection, and translation replaced token detection. Besides, we pretrain the model, named as XLM-E, on both multilingual and parallel corpora. Our model outperforms the baseline models on various cross-lingual understanding tasks with much less computation cost. Moreover, analysis shows that XLM-E tends to obtain better cross-lingual transferability.


pdf bib
mT6: Multilingual Pretrained Text-to-Text Transformer with Translation Pairs
Zewen Chi | Li Dong | Shuming Ma | Shaohan Huang | Saksham Singhal | Xian-Ling Mao | Heyan Huang | Xia Song | Furu Wei
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Multilingual T5 pretrains a sequence-to-sequence model on massive monolingual texts, which has shown promising results on many cross-lingual tasks. In this paper, we improve multilingual text-to-text transfer Transformer with translation pairs (mT6). Specifically, we explore three cross-lingual text-to-text pre-training tasks, namely, machine translation, translation pair span corruption, and translation span corruption. In addition, we propose a partially non-autoregressive objective for text-to-text pre-training. We evaluate the methods on seven multilingual benchmark datasets, including sentence classification, named entity recognition, question answering, and abstractive summarization. Experimental results show that the proposed mT6 improves cross-lingual transferability over mT5.

pdf bib
Allocating Large Vocabulary Capacity for Cross-Lingual Language Model Pre-Training
Bo Zheng | Li Dong | Shaohan Huang | Saksham Singhal | Wanxiang Che | Ting Liu | Xia Song | Furu Wei
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Compared to monolingual models, cross-lingual models usually require a more expressive vocabulary to represent all languages adequately. We find that many languages are under-represented in recent cross-lingual language models due to the limited vocabulary capacity. To this end, we propose an algorithm VoCap to determine the desired vocabulary capacity of each language. However, increasing the vocabulary size significantly slows down the pre-training speed. In order to address the issues, we propose k-NN-based target sampling to accelerate the expensive softmax. Our experiments show that the multilingual vocabulary learned with VoCap benefits cross-lingual language model pre-training. Moreover, k-NN-based target sampling mitigates the side-effects of increasing the vocabulary size while achieving comparable performance and faster pre-training speed. The code and the pretrained multilingual vocabularies are available at

pdf bib
Multilingual Machine Translation Systems from Microsoft for WMT21 Shared Task
Jian Yang | Shuming Ma | Haoyang Huang | Dongdong Zhang | Li Dong | Shaohan Huang | Alexandre Muzio | Saksham Singhal | Hany Hassan | Xia Song | Furu Wei
Proceedings of the Sixth Conference on Machine Translation

This report describes Microsoft’s machine translation systems for the WMT21 shared task on large-scale multilingual machine translation. We participated in all three evaluation tracks including Large Track and two Small Tracks where the former one is unconstrained and the latter two are fully constrained. Our model submissions to the shared task were initialized with DeltaLM, a generic pre-trained multilingual encoder-decoder model, and fine-tuned correspondingly with the vast collected parallel data and allowed data sources according to track settings, together with applying progressive learning and iterative back-translation approaches to further improve the performance. Our final submissions ranked first on three tracks in terms of the automatic evaluation metric.

pdf bib
Consistency Regularization for Cross-Lingual Fine-Tuning
Bo Zheng | Li Dong | Shaohan Huang | Wenhui Wang | Zewen Chi | Saksham Singhal | Wanxiang Che | Ting Liu | Xia Song | Furu Wei
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Fine-tuning pre-trained cross-lingual language models can transfer task-specific supervision from one language to the others. In this work, we propose to improve cross-lingual fine-tuning with consistency regularization. Specifically, we use example consistency regularization to penalize the prediction sensitivity to four types of data augmentations, i.e., subword sampling, Gaussian noise, code-switch substitution, and machine translation. In addition, we employ model consistency to regularize the models trained with two augmented versions of the same training set. Experimental results on the XTREME benchmark show that our method significantly improves cross-lingual fine-tuning across various tasks, including text classification, question answering, and sequence labeling.

pdf bib
InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training
Zewen Chi | Li Dong | Furu Wei | Nan Yang | Saksham Singhal | Wenhui Wang | Xia Song | Xian-Ling Mao | Heyan Huang | Ming Zhou
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

In this work, we present an information-theoretic framework that formulates cross-lingual language model pre-training as maximizing mutual information between multilingual-multi-granularity texts. The unified view helps us to better understand the existing methods for learning cross-lingual representations. More importantly, inspired by the framework, we propose a new pre-training task based on contrastive learning. Specifically, we regard a bilingual sentence pair as two views of the same meaning and encourage their encoded representations to be more similar than the negative examples. By leveraging both monolingual and parallel corpora, we jointly train the pretext tasks to improve the cross-lingual transferability of pre-trained models. Experimental results on several benchmarks show that our approach achieves considerably better performance. The code and pre-trained models are available at