Vishrav Chaudhary


2022

Alternative Input Signals Ease Transfer in Multilingual Machine Translation
Simeng Sun | Angela Fan | James Cross | Vishrav Chaudhary | Chau Tran | Philipp Koehn | Francisco Guzmán
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent work in multilingual machine translation (MMT) has focused on the potential of positive transfer between languages, particularly cases where higher-resourced languages can benefit lower-resourced ones. While training an MMT model, the supervision signals learned from one language pair can be transferred to the other via the tokens shared by multiple source languages. However, the transfer is inhibited when the token overlap among source languages is small, which manifests naturally when languages use different writing systems. In this paper, we tackle inhibited transfer by augmenting the training data with alternative signals that unify different writing systems, such as phonetic, romanized, and transliterated input. We test these signals on Indic and Turkic languages, two language families where the writing systems differ but languages still share common features. Our results indicate that a straightforward multi-source self-ensemble (training a model on a mixture of various signals and ensembling the outputs of the same model fed with different signals during inference) outperforms strong ensemble baselines by 1.3 BLEU points on both language families. Further, we find that incorporating alternative inputs via self-ensemble can be particularly effective when the training set is small, leading to +5 BLEU when only 5% of the total training data is accessible. Finally, our analysis demonstrates that including alternative signals yields more consistency and translates named entities more accurately, which is crucial for increased factuality of automated systems.

AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages
Abteen Ebrahimi | Manuel Mager | Arturo Oncevay | Vishrav Chaudhary | Luis Chiruzzo | Angela Fan | John Ortega | Ricardo Ramos | Annette Rios | Ivan Vladimir Meza Ruiz | Gustavo Giménez-Lugo | Elisabeth Mager | Graham Neubig | Alexis Palmer | Rolando Coto-Solano | Thang Vu | Katharina Kann
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Pretrained multilingual models are able to perform cross-lingual transfer in a zero-shot setting, even for languages unseen during pretraining. However, prior work evaluating performance on unseen languages has largely been limited to low-level, syntactic tasks, and it remains unclear if zero-shot learning of high-level, semantic tasks is possible for unseen languages. To explore this question, we present AmericasNLI, an extension of XNLI (Conneau et al., 2018) to 10 Indigenous languages of the Americas. We conduct experiments with XLM-R, testing multiple zero-shot and translation-based approaches. Additionally, we explore model adaptation via continued pretraining and provide an analysis of the dataset by considering hypothesis-only models. We find that XLM-R’s zero-shot performance is poor for all 10 languages, with an average performance of 38.48%. Continued pretraining offers improvements, with an average accuracy of 43.85%. Surprisingly, training on poorly translated data by far outperforms all other methods with an accuracy of 49.12%.

OCR Improves Machine Translation for Low-Resource Languages
Oana Ignat | Jean Maillard | Vishrav Chaudhary | Francisco Guzmán
Findings of the Association for Computational Linguistics: ACL 2022

We aim to investigate the performance of current OCR systems on low-resource languages and low-resource scripts. We introduce and make publicly available a novel benchmark, OCR4MT, consisting of real and synthetic data, enriched with noise, for 60 low-resource languages in low-resource scripts. We evaluate state-of-the-art OCR systems on our benchmark and analyse the most common errors. We show that OCR monolingual data is a valuable resource that can increase the performance of Machine Translation models when used in backtranslation. We then perform an ablation study to investigate how OCR errors impact Machine Translation performance and to determine the minimum level of OCR quality needed for the monolingual data to be useful for Machine Translation.

2021

Findings of the AmericasNLP 2021 Shared Task on Open Machine Translation for Indigenous Languages of the Americas
Manuel Mager | Arturo Oncevay | Abteen Ebrahimi | John Ortega | Annette Rios | Angela Fan | Ximena Gutierrez-Vasques | Luis Chiruzzo | Gustavo Giménez-Lugo | Ricardo Ramos | Ivan Vladimir Meza Ruiz | Rolando Coto-Solano | Alexis Palmer | Elisabeth Mager-Hois | Vishrav Chaudhary | Graham Neubig | Ngoc Thang Vu | Katharina Kann
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

This paper presents the results of the 2021 Shared Task on Open Machine Translation for Indigenous Languages of the Americas. The shared task featured two independent tracks, and participants submitted machine translation systems for up to 10 indigenous languages. Overall, 8 teams participated with a total of 214 submissions. We provided training sets consisting of data collected from various sources, as well as manually translated sentences for the development and test sets. An official baseline trained on this data was also provided. Team submissions featured a variety of architectures, including both statistical and neural models, and for the majority of languages, many teams were able to considerably improve over the baseline. The best performing systems achieved 12.97 ChrF higher than the baseline when averaged across languages.

Findings of the 2021 Conference on Machine Translation (WMT21)
Farhad Akhbardeh | Arkady Arkhangorodsky | Magdalena Biesialska | Ondřej Bojar | Rajen Chatterjee | Vishrav Chaudhary | Marta R. Costa-jussa | Cristina España-Bonet | Angela Fan | Christian Federmann | Markus Freitag | Yvette Graham | Roman Grundkiewicz | Barry Haddow | Leonie Harter | Kenneth Heafield | Christopher Homan | Matthias Huck | Kwabena Amponsah-Kaakyire | Jungo Kasai | Daniel Khashabi | Kevin Knight | Tom Kocmi | Philipp Koehn | Nicholas Lourie | Christof Monz | Makoto Morishita | Masaaki Nagata | Ajay Nagesh | Toshiaki Nakazawa | Matteo Negri | Santanu Pal | Allahsera Auguste Tapo | Marco Turchi | Valentin Vydrin | Marcos Zampieri
Proceedings of the Sixth Conference on Machine Translation

This paper presents the results of the news translation task, the multilingual low-resource translation for Indo-European languages, the triangular translation task, and the automatic post-editing task organised as part of the Conference on Machine Translation (WMT) 2021. In the news task, participants were asked to build machine translation systems for any of 10 language pairs, to be evaluated on test sets consisting mainly of news stories. The task was also opened up to additional test suites to probe specific aspects of translation.

Findings of the WMT 2021 Shared Task on Large-Scale Multilingual Machine Translation
Guillaume Wenzek | Vishrav Chaudhary | Angela Fan | Sahir Gomez | Naman Goyal | Somya Jain | Douwe Kiela | Tristan Thrush | Francisco Guzmán
Proceedings of the Sixth Conference on Machine Translation

We present the results of the first task on Large-Scale Multilingual Machine Translation. The task consists of the many-to-many evaluation of a single model across a variety of source and target languages. This year, the task consisted of three different settings: (i) SMALL-TASK1 (Central/South-Eastern European Languages), (ii) SMALL-TASK2 (South-East Asian Languages), and (iii) FULL-TASK (all 101 x 100 language pairs). All the tasks used the FLORES-101 dataset as the evaluation benchmark. To ensure the longevity of the dataset, the test sets were not publicly released and the models were evaluated in a controlled environment on Dynabench. There were a total of 10 participating teams for the tasks, with a total of 151 intermediate model submissions and 13 final models. This year’s results show a significant improvement over the known baselines, with +17.8 BLEU for SMALL-TASK2, +10.6 for FULL-TASK and +3.6 for SMALL-TASK1.

Findings of the WMT 2021 Shared Task on Quality Estimation
Lucia Specia | Frédéric Blain | Marina Fomicheva | Chrysoula Zerva | Zhenhao Li | Vishrav Chaudhary | André F. T. Martins
Proceedings of the Sixth Conference on Machine Translation

We report the results of the WMT 2021 shared task on Quality Estimation, where the challenge is to predict the quality of the output of neural machine translation systems at the word and sentence levels. This edition focused on two main novel additions: (i) prediction for unseen languages, i.e. zero-shot settings, and (ii) prediction of sentences with catastrophic errors. In addition, new data was released for a number of languages, especially post-edited data. Participating teams from 19 institutions submitted altogether 1263 systems to different task variants and language pairs.

Classification-based Quality Estimation: Small and Efficient Models for Real-world Applications
Shuo Sun | Ahmed El-Kishky | Vishrav Chaudhary | James Cross | Lucia Specia | Francisco Guzmán
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Sentence-level Quality estimation (QE) of machine translation is traditionally formulated as a regression task, and the performance of QE models is typically measured by Pearson correlation with human labels. Recent QE models have achieved previously-unseen levels of correlation with human judgments, but they rely on large multilingual contextualized language models that are computationally expensive and make them infeasible for real-world applications. In this work, we evaluate several model compression techniques for QE and find that, despite their popularity in other NLP tasks, they lead to poor performance in this regression setting. We observe that a full model parameterization is required to achieve SoTA results in a regression task. However, we argue that the level of expressiveness of a model in a continuous range is unnecessary given the downstream applications of QE, and show that reframing QE as a classification problem and evaluating QE models using classification metrics would better reflect their actual performance in real-world applications.

Multilingual Translation from Denoising Pre-Training
Yuqing Tang | Chau Tran | Xian Li | Peng-Jen Chen | Naman Goyal | Vishrav Chaudhary | Jiatao Gu | Angela Fan
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Self-training Improves Pre-training for Natural Language Understanding
Jingfei Du | Edouard Grave | Beliz Gunel | Vishrav Chaudhary | Onur Celebi | Michael Auli | Veselin Stoyanov | Alexis Conneau
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Unsupervised pre-training has led to much recent progress in natural language understanding. In this paper, we study self-training as another way to leverage unlabeled data through semi-supervised learning. To obtain additional data for a specific task, we introduce SentAugment, a data augmentation method which computes task-specific query embeddings from labeled data to retrieve sentences from a bank of billions of unlabeled sentences crawled from the web. Unlike previous semi-supervised methods, our approach does not require in-domain unlabeled data and is therefore more generally applicable. Experiments show that self-training is complementary to strong RoBERTa baselines on a variety of tasks. Our augmentation approach leads to scalable and effective self-training with improvements of up to 2.6% on standard text classification benchmarks. Finally, we also show strong gains on knowledge-distillation and few-shot learning.

Quality Estimation without Human-labeled Data
Yi-Lin Tuan | Ahmed El-Kishky | Adithya Renduchintala | Vishrav Chaudhary | Francisco Guzmán | Lucia Specia
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Quality estimation aims to measure the quality of translated content without access to a reference translation. This is crucial for machine translation systems in real-world scenarios where high-quality translation is needed. While many approaches exist for quality estimation, they are based on supervised machine learning requiring costly human-labelled data. As an alternative, we propose a technique that does not rely on examples from human annotators and instead uses synthetic training data. We train off-the-shelf architectures for supervised quality estimation on our synthetic data and show that the resulting models achieve comparable performance to models trained on human-annotated data, both for sentence and word-level prediction.

WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia
Holger Schwenk | Vishrav Chaudhary | Shuo Sun | Hongyu Gong | Francisco Guzmán
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

We present an approach based on multilingual sentence embeddings to automatically extract parallel sentences from the content of Wikipedia articles in 96 languages, including several dialects or low-resource languages. We do not limit the extraction process to alignments with English, but we systematically consider all possible language pairs. In total, we are able to extract 135M parallel sentences for 16720 different language pairs, out of which only 34M are aligned with English. This corpus is freely available. To get an indication of the quality of the extracted bitexts, we train neural MT baseline systems on the mined data only for 1886 language pairs, and evaluate them on the TED corpus, achieving strong BLEU scores for many language pairs. The WikiMatrix bitexts seem to be particularly interesting to train MT systems between distant languages without the need to pivot through English.

Adapting High-resource NMT Models to Translate Low-resource Related Languages without Parallel Data
Wei-Jen Ko | Ahmed El-Kishky | Adithya Renduchintala | Vishrav Chaudhary | Naman Goyal | Francisco Guzmán | Pascale Fung | Philipp Koehn | Mona Diab
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

The scarcity of parallel data is a major obstacle for training high-quality machine translation systems for low-resource languages. Fortunately, some low-resource languages are linguistically related or similar to high-resource languages; these related languages may share many lexical or syntactic structures. In this work, we exploit this linguistic overlap to facilitate translating to and from a low-resource language with only monolingual data, in addition to any parallel data in the related high-resource language. Our method, NMT-Adapt, combines denoising autoencoding, back-translation and adversarial objectives to utilize monolingual data for low-resource adaptation. We experiment on 7 languages from three different language families and show that our technique significantly improves translation into low-resource languages compared to other translation baselines.

2020

Unsupervised Cross-lingual Representation Learning at Scale
Alexis Conneau | Kartikay Khandelwal | Naman Goyal | Vishrav Chaudhary | Guillaume Wenzek | Francisco Guzmán | Edouard Grave | Myle Ott | Luke Zettlemoyer | Veselin Stoyanov
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6% average accuracy on XNLI, +13% average F1 score on MLQA, and +2.4% F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 15.7% in XNLI accuracy for Swahili and 11.4% for Urdu over previous XLM models. We also present a detailed empirical analysis of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make our code and models publicly available.

CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs
Ahmed El-Kishky | Vishrav Chaudhary | Francisco Guzmán | Philipp Koehn
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Cross-lingual document alignment aims to identify pairs of documents in two distinct languages that are of comparable content or translations of each other. In this paper, we exploit the signals embedded in URLs to label web documents at scale with an average precision of 94.5% across different language pairs. We mine sixty-eight snapshots of the Common Crawl corpus and identify web document pairs that are translations of each other. We release a new web dataset consisting of over 392 million URL pairs from Common Crawl covering documents in 8144 language pairs of which 137 pairs include English. In addition to curating this massive dataset, we introduce baseline methods that leverage cross-lingual representations to identify aligned documents based on their textual content. Finally, we demonstrate the value of this parallel documents dataset through a downstream task of mining parallel sentences and measuring the quality of machine translations from models trained on this mined data. Our objective in releasing this dataset is to foster new research in cross-lingual NLP across a variety of low, medium, and high-resource languages.

An Exploratory Study on Multilingual Quality Estimation
Shuo Sun | Marina Fomicheva | Frédéric Blain | Vishrav Chaudhary | Ahmed El-Kishky | Adithya Renduchintala | Francisco Guzmán | Lucia Specia
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

Predicting the quality of machine translation has traditionally been addressed with language-specific models, under the assumption that the quality label distribution or linguistic features exhibit traits that are not shared across languages. An obvious disadvantage of this approach is the need for labelled data for each given language pair. We challenge this assumption by exploring different approaches to multilingual Quality Estimation (QE), including using scores from translation models. We show that these outperform single-language models, particularly in less balanced quality label distributions and low-resource settings. In the extreme case of zero-shot QE, we show that it is possible to accurately predict quality for any given new language from models trained on other languages. Our findings indicate that state-of-the-art neural QE models based on powerful pre-trained representations generalise well across languages, making them more applicable in real-world settings.

Findings of the WMT 2020 Shared Task on Machine Translation Robustness
Lucia Specia | Zhenhao Li | Juan Pino | Vishrav Chaudhary | Francisco Guzmán | Graham Neubig | Nadir Durrani | Yonatan Belinkov | Philipp Koehn | Hassan Sajjad | Paul Michel | Xian Li
Proceedings of the Fifth Conference on Machine Translation

We report the findings of the second edition of the shared task on improving robustness in Machine Translation (MT). The task aims to test current machine translation systems in their ability to handle challenges facing MT models to be deployed in the real world, including domain diversity and non-standard texts common in user generated content, especially in social media. We cover two language pairs, English-German and English-Japanese, and provide test sets in zero-shot and few-shot variants. Participating systems are evaluated both automatically and manually, with an additional human evaluation for “catastrophic errors”. We received 59 submissions by 11 participating teams from a variety of types of institutions.

Findings of the WMT 2020 Shared Task on Parallel Corpus Filtering and Alignment
Philipp Koehn | Vishrav Chaudhary | Ahmed El-Kishky | Naman Goyal | Peng-Jen Chen | Francisco Guzmán
Proceedings of the Fifth Conference on Machine Translation

Following two preceding WMT Shared Tasks on Parallel Corpus Filtering (Koehn et al., 2018, 2019), we again posed the challenge of assigning sentence-level quality scores for very noisy corpora of sentence pairs crawled from the web, with the goal of sub-selecting the highest-quality data to be used to train machine translation systems. This year, the task tackled the low resource condition of Pashto–English and Khmer–English and also included the challenge of sentence alignment from document pairs.

Findings of the WMT 2020 Shared Task on Quality Estimation
Lucia Specia | Frédéric Blain | Marina Fomicheva | Erick Fonseca | Vishrav Chaudhary | Francisco Guzmán | André F. T. Martins
Proceedings of the Fifth Conference on Machine Translation

We report the results of the WMT20 shared task on Quality Estimation, where the challenge is to predict the quality of the output of neural machine translation systems at the word, sentence and document levels. This edition included new data with open domain texts, direct assessment annotations, and multiple language pairs: English-German, English-Chinese, Russian-English, Romanian-English, Estonian-English, Sinhala-English and Nepali-English data for the sentence-level subtasks, English-German and English-Chinese for the word-level subtask, and English-French data for the document-level subtask. In addition, we made neural machine translation models available to participants. 19 participating teams from 27 institutions submitted altogether 1374 systems to different task variants and language pairs.

BERGAMOT-LATTE Submissions for the WMT20 Quality Estimation Shared Task
Marina Fomicheva | Shuo Sun | Lisa Yankovskaya | Frédéric Blain | Vishrav Chaudhary | Mark Fishel | Francisco Guzmán | Lucia Specia
Proceedings of the Fifth Conference on Machine Translation

This paper presents our submission to the WMT2020 Shared Task on Quality Estimation (QE). We participate in Task 1 and Task 2 focusing on sentence-level prediction. We explore (a) a black-box approach to QE based on pre-trained representations; and (b) glass-box approaches that leverage various indicators that can be extracted from the neural MT systems. In addition to training a feature-based regression model using glass-box quality indicators, we also test whether they can be used to predict MT quality directly with no supervision. We assess our systems in a multi-lingual setting and show that both types of approaches generalise well across languages. Our black-box QE models tied for the winning submission in four out of seven language pairs in Task 1, thus demonstrating very strong performance. The glass-box approaches also performed competitively, representing a light-weight alternative to the neural-based models.

A Survey of Qualitative Error Analysis for Neural Machine Translation Systems
Denise Díaz | James Cross | Vishrav Chaudhary | Ahmed Kishky | Philipp Koehn
Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 2: User Track)

CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data
Guillaume Wenzek | Marie-Anne Lachaux | Alexis Conneau | Vishrav Chaudhary | Francisco Guzmán | Armand Joulin | Edouard Grave
Proceedings of the 12th Language Resources and Evaluation Conference

Pre-training text representations has led to significant improvements in many areas of natural language processing. The quality of these models benefits greatly from the size of the pretraining corpora as long as its quality is preserved. In this paper, we describe an automatic pipeline to extract massive high-quality monolingual datasets from Common Crawl for a variety of languages. Our pipeline follows the data processing introduced in fastText (Mikolov et al., 2017; Grave et al., 2018), which deduplicates documents and identifies their language. We augment this pipeline with a filtering step to select documents that are close to high quality corpora like Wikipedia.

Unsupervised Quality Estimation for Neural Machine Translation
Marina Fomicheva | Shuo Sun | Lisa Yankovskaya | Frédéric Blain | Francisco Guzmán | Mark Fishel | Nikolaos Aletras | Vishrav Chaudhary | Lucia Specia
Transactions of the Association for Computational Linguistics, Volume 8

Quality Estimation (QE) is an important component in making Machine Translation (MT) useful in real-world applications, as it aims to inform the user of the quality of the MT output at test time. Existing approaches require large amounts of expert annotated data, computation, and time for training. As an alternative, we devise an unsupervised approach to QE where no training or access to additional resources besides the MT system itself is required. Different from most of the current work that treats the MT system as a black box, we explore useful information that can be extracted from the MT system as a by-product of translation. By utilizing methods for uncertainty quantification, we achieve very good correlation with human judgments of quality, rivaling state-of-the-art supervised QE models. To evaluate our approach we collect the first dataset that enables work on both black-box and glass-box approaches to QE.

2019

Findings of the WMT 2019 Shared Task on Parallel Corpus Filtering for Low-Resource Conditions
Philipp Koehn | Francisco Guzmán | Vishrav Chaudhary | Juan Pino
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

Following the WMT 2018 Shared Task on Parallel Corpus Filtering, we posed the challenge of assigning sentence-level quality scores for very noisy corpora of sentence pairs crawled from the web, with the goal of sub-selecting 2% and 10% of the highest-quality data to be used to train machine translation systems. This year, the task tackled the low resource condition of Nepali-English and Sinhala-English. Eleven participants from companies, national research labs, and universities participated in this task.

Low-Resource Corpus Filtering Using Multilingual Sentence Embeddings
Vishrav Chaudhary | Yuqing Tang | Francisco Guzmán | Holger Schwenk | Philipp Koehn
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

In this paper, we describe our submission to the WMT19 low-resource parallel corpus filtering shared task. Our main approach is based on the LASER toolkit (Language-Agnostic SEntence Representations), which uses an encoder-decoder architecture trained on a parallel corpus to obtain multilingual sentence representations. We then use the representations directly to score and filter the noisy parallel sentences without additionally training a scoring function. We contrast our approach to other promising methods and show that LASER yields strong results. Finally, we produce an ensemble of different scoring methods and obtain additional gains. Our submission achieved the best overall performance for both the Nepali-English and Sinhala-English 1M tasks by a margin of 1.3 and 1.4 BLEU respectively, as compared to the second best systems. Moreover, our experiments show that this technique is promising for low and even no-resource scenarios.

The FLORES Evaluation Datasets for Low-Resource Machine Translation: Nepali–English and Sinhala–English
Francisco Guzmán | Peng-Jen Chen | Myle Ott | Juan Pino | Guillaume Lample | Philipp Koehn | Vishrav Chaudhary | Marc’Aurelio Ranzato
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

For machine translation, a vast majority of language pairs in the world are considered low-resource because they have little parallel data available. Besides the technical challenges of learning with limited supervision, it is difficult to evaluate methods trained on low-resource language pairs because of the lack of freely and publicly available benchmarks. In this work, we introduce the FLORES evaluation datasets for Nepali–English and Sinhala–English, based on sentences translated from Wikipedia. Compared to English, these are languages with very different morphology and syntax, for which little out-of-domain parallel data is available and for which relatively large amounts of monolingual data are freely available. We describe our process to collect and cross-check the quality of translations, and we report baseline performance using several learning settings: fully supervised, weakly supervised, semi-supervised, and fully unsupervised. Our experiments demonstrate that current state-of-the-art methods perform rather poorly on this benchmark, posing a challenge to the research community working on low-resource MT. Data and code to reproduce our experiments are available at https://github.com/facebookresearch/flores.

Facebook AI’s WAT19 Myanmar-English Translation Task Submission
Peng-Jen Chen | Jiajun Shen | Matthew Le | Vishrav Chaudhary | Ahmed El-Kishky | Guillaume Wenzek | Myle Ott | Marc’Aurelio Ranzato
Proceedings of the 6th Workshop on Asian Translation

This paper describes Facebook AI’s submission to the WAT 2019 Myanmar-English translation task. Our baseline systems are BPE-based transformer models. We explore methods to leverage monolingual data to improve generalization, including self-training, back-translation and their combination. We further improve results by using noisy channel re-ranking and ensembling. We demonstrate that these techniques can significantly improve not only a system trained with additional monolingual data, but even the baseline system trained exclusively on the provided small parallel dataset. Our system ranks first in both directions according to human evaluation and BLEU, with a gain of over 8 BLEU points above the second best system.