Toan Q. Nguyen

Also published as: Toan Nguyen


pdf bib
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Julia Kreutzer | Isaac Caswell | Lisa Wang | Ahsan Wahab | Daan van Esch | Nasanbayar Ulzii-Orshikh | Allahsera Tapo | Nishant Subramani | Artem Sokolov | Claytone Sikasote | Monang Setyawan | Supheakmungkol Sarin | Sokhar Samb | Benoît Sagot | Clara Rivera | Annette Rios | Isabel Papadimitriou | Salomey Osei | Pedro Ortiz Suarez | Iroro Orife | Kelechi Ogueji | Andre Niyongabo Rubungo | Toan Q. Nguyen | Mathias Müller | André Müller | Shamsuddeen Hassan Muhammad | Nanda Muhammad | Ayanda Mnyakeni | Jamshidbek Mirzakhalov | Tapiwanashe Matangira | Colin Leong | Nze Lawson | Sneha Kudugunta | Yacine Jernite | Mathias Jenny | Orhan Firat | Bonaventure F. P. Dossou | Sakhile Dlamini | Nisansa de Silva | Sakine Çabuk Ballı | Stella Biderman | Alessia Battisti | Ahmed Baruwa | Ankur Bapna | Pallavi Baljekar | Israel Abebe Azime | Ayodele Awokoya | Duygu Ataman | Orevaoghene Ahia | Oghenefego Ahia | Sweta Agrawal | Mofetoluwa Adeyemi
Transactions of the Association for Computational Linguistics, Volume 10

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.


pdf bib
Data Augmentation by Concatenation for Low-Resource Translation: A Mystery and a Solution
Toan Q. Nguyen | Kenton Murray | David Chiang
Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)

In this paper, we investigate the driving factors behind concatenation, a simple but effective data augmentation method for low-resource neural machine translation. Our experiments suggest that discourse context is unlikely the cause for concatenation improving BLEU by about +1 across four language pairs. Instead, we demonstrate that the improvement comes from three other factors unrelated to discourse: context diversity, length diversity, and (to a lesser extent) position shifting.


pdf bib
Masked Language Model Scoring
Julian Salazar | Davis Liang | Toan Q. Nguyen | Katrin Kirchhoff
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Pretrained masked language models (MLMs) require finetuning for most NLP tasks. Instead, we evaluate MLMs out of the box via their pseudo-log-likelihood scores (PLLs), which are computed by masking tokens one by one. We show that PLLs outperform scores from autoregressive language models like GPT-2 in a variety of tasks. By rescoring ASR and NMT hypotheses, RoBERTa reduces an end-to-end LibriSpeech model’s WER by 30% relative and adds up to +1.7 BLEU on state-of-the-art baselines for low-resource translation pairs, with further gains from domain adaptation. We attribute this success to PLL’s unsupervised expression of linguistic acceptability without a left-to-right bias, greatly improving on scores from GPT-2 (+10 points on island effects, NPI licensing in BLiMP). One can finetune MLMs to give scores without masking, enabling computation in a single inference pass. In all, PLLs and their associated pseudo-perplexities (PPPLs) enable plug-and-play use of the growing number of pretrained MLMs; e.g., we use a single cross-lingual model to rescore translations in multiple languages. We release our library for language model scoring at


pdf bib
Neural Machine Translation of Text from Non-Native Speakers
Antonios Anastasopoulos | Alison Lui | Toan Q. Nguyen | David Chiang
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Neural Machine Translation (NMT) systems are known to degrade when confronted with noisy data, especially when the system is trained only on clean data. In this paper, we show that augmenting training data with sentences containing artificially-introduced grammatical errors can make the system more robust to such errors. In combination with an automatic grammar error correction system, we can recover 1.0 BLEU out of 2.4 BLEU lost due to grammatical errors. We also present a set of Spanish translations of the JFLEG grammar error correction corpus, which allows for testing NMT robustness to real grammatical errors.

pdf bib
Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation
Kenton Murray | Jeffery Kinnison | Toan Q. Nguyen | Walter Scheirer | David Chiang
Proceedings of the 3rd Workshop on Neural Generation and Translation

Neural sequence-to-sequence models, particularly the Transformer, are the state of the art in machine translation. Yet these neural networks are very sensitive to architecture and hyperparameter settings. Optimizing these settings by grid or random search is computationally expensive because it requires many training runs. In this paper, we incorporate architecture search into a single training run through auto-sizing, which uses regularization to delete neurons in a network over the course of training. On very low-resource language pairs, we show that auto-sizing can improve BLEU scores by up to 3.9 points while removing one-third of the parameters from the model.

pdf bib
Transformers without Tears: Improving the Normalization of Self-Attention
Toan Q. Nguyen | Julian Salazar
Proceedings of the 16th International Conference on Spoken Language Translation

We evaluate three simple, normalization-centric changes to improve Transformer training. First, we show that pre-norm residual connections (PRENORM) and smaller initializations enable warmup-free, validation-based training with large learning rates. Second, we propose l2 normalization with a single scale parameter (SCALENORM) for faster training and better performance. Finally, we reaffirm the effectiveness of normalizing word embeddings to a fixed length (FIXNORM). On five low-resource translation pairs from TED Talks-based corpora, these changes always converge, giving an average +1.1 BLEU over state-of-the-art bilingual baselines and a new 32.8 BLEU on IWSLT '15 English-Vietnamese. We ob- serve sharper performance curves, more consistent gradient norms, and a linear relationship between activation scaling and decoder depth. Surprisingly, in the high-resource setting (WMT '14 English-German), SCALENORM and FIXNORM remain competitive but PRENORM degrades performance.


pdf bib
Improving Lexical Choice in Neural Machine Translation
Toan Nguyen | David Chiang
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

We explore two solutions to the problem of mistranslating rare words in neural machine translation. First, we argue that the standard output layer, which computes the inner product of a vector representing the context with all possible output word embeddings, rewards frequent words disproportionately, and we propose to fix the norms of both vectors to a constant value. Second, we integrate a simple lexical module which is jointly trained with the rest of the model. We evaluate our approaches on eight language pairs with data sizes ranging from 100k to 8M words, and achieve improvements of up to +4.3 BLEU, surpassing phrase-based translation in nearly all settings.


pdf bib
Transfer Learning across Low-Resource, Related Languages for Neural Machine Translation
Toan Q. Nguyen | David Chiang
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

We present a simple method to improve neural translation of a low-resource language pair using parallel data from a related, also low-resource, language pair. The method is based on the transfer method of Zoph et al., but whereas their method ignores any source vocabulary overlap, ours exploits it. First, we split words using Byte Pair Encoding (BPE) to increase vocabulary overlap. Then, we train a model on the first language pair and transfer its parameters, including its source word embeddings, to another model and continue training on the second language pair. Our experiments show that transfer learning helps word-based translation only slightly, but when used on top of a much stronger BPE baseline, it yields larger improvements of up to 4.3 BLEU.