Many-to-English Machine Translation Tools, Data, and Pretrained Models

While there are more than 7000 languages in the world, most translation research efforts have targeted a few high resource languages. Commercial translation systems support only one hundred languages or fewer, and do not make these models available for transfer to low resource languages. In this work, we present useful tools for machine translation research: MTData, NLCodec and RTG. We demonstrate their usefulness by creating a multilingual neural machine translation model capable of translating from 500 source languages to English. We make this multilingual model readily downloadable and usable as a service, or as a parent model for transfer-learning to even lower-resource languages.


Introduction
Neural machine translation (NMT) (Bahdanau et al., 2015; Vaswani et al., 2017) has progressed to reach human performance on select benchmark tasks (Barrault et al., 2019, 2020). However, as MT research has mainly focused on translation between a small number of high resource languages, the unavailability of usable-quality translation models for low resource languages remains an ongoing concern. Even those commercial translation services attempting to broaden their language coverage have only reached around one hundred languages; this excludes most of the thousands of languages used around the world today.
Freely available corpora of parallel data exist for many languages, though they are hosted at various sites and come in various formats. A challenge in incorporating more languages into MT models is the lack of easy access to all of these datasets. While standards like ISO 639-3 have been established to bring consistency to the labeling of language resources, they are not yet widely adopted. In addition, scaling experimentation to several hundred languages on large corpora involves a significant engineering effort. Simple tasks such as dataset preparation, vocabulary creation, transformation of sentences into sequences, and training data selection become formidable at scale due to corpus size and the heterogeneity of data sources and file formats. We have developed tools that address precisely these challenges, which we demonstrate in this work.
Demo website: http://rtg.isi.edu/many-eng. Video demo: https://youtu.be/
Specifically, we offer three tools which can be used either independently or in combination to advance NMT research on a wider set of languages (Section 2): firstly, MTDATA, which helps to easily obtain parallel datasets (Section 2.1); secondly, NLCODEC, an efficient and scalable vocabulary manager and storage layer for transforming sentences to and from integer sequences (Section 2.2); and lastly, RTG, a feature-rich Pytorch-backed NMT toolkit that supports reproducible experiments (Section 2.3).
We demonstrate the capabilities of our tools by preparing a massive bitext dataset with more than 9 billion tokens per side, and training a single multilingual NMT model capable of translating 500 source languages to English (Section 3). We show that the multilingual model is usable either as a service for translating several hundred languages to English (Section 4.1), or as a parent model in a transfer learning setting for improving translation of low resource languages (Section 4.2).

Tools
Our tools are described in the following subsections:

MTDATA
MTDATA addresses an important yet often overlooked challenge: dataset preparation. By assigning ISO 639-3 language codes to parallel datasets hosted at many different sites and providing a single interface for downloading and combining them, MTDATA makes it easy to assemble consistent training corpora from heterogeneous sources. It is open-source: https://github.com/thammegowda/mtdata/

NLCODEC
NLCODEC is a vocabulary manager with encoding-decoding schemes to transform natural language sentences to and from integer sequences. Features:
• Versatile: Supports commonly used vocabulary schemes such as characters, words, and byte-pair-encoding (BPE) subwords (Sennrich et al., 2016).
• Scalable: An Apache Spark (Zaharia et al., 2016) backend can optionally be used to create vocabularies from massive datasets.
• Easy Setup: pip install nlcodec
• Open-source: https://github.com/isi-nlp/nlcodec/

When the training datasets are too big to be kept in primary random access memory (RAM), the use of secondary storage is inevitable. Training processes that require random examples lead to random access on the secondary storage device. Even though the latest advancements in secondary storage technology, such as solid-state drives (SSDs), provide fast serial reads and writes, their random access speeds are significantly lower than those of RAM. To address these problems, we include an efficient storage and retrieval layer, NLDB, which has the following features:
• Memory efficient by adapting datatypes based on vocabulary size (see the sketch at the end of this section). For instance, encodings with a vocabulary size less than 256 (such as characters) can be efficiently represented using 1-byte unsigned integers. Vocabularies with fewer than 65,536 types, such as might be generated when using subword models (Sennrich et al., 2016), require only 2-byte unsigned integers, and 4-byte unsigned integers are sufficient for vocabularies of up to about 4 billion types. As the default implementation of Python, CPython, uses 28 bytes for all integers, we accomplish this using NumPy (Harris et al., 2020). This optimization makes it possible to hold a large chunk of training data in a smaller amount of RAM, enabling fast random access.
• Parallelizable: Offers a multi-part database via horizontal sharding that supports parallel writes (e.g., from Apache Spark) and parallel reads (e.g., for distributed training).
• Supports commonly used batching mechanisms, such as random batches with approximately-equal-length sequences.

NLDB has a minimal footprint and is part of the NLCODEC package. In Section 3, we take advantage of the scalability and efficiency of NLCODEC and NLDB to process a large parallel dataset with 9 billion tokens on each side.
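To make the datatype optimization concrete, the following is a minimal sketch using only NumPy; the function and variable names are ours for illustration and this is not NLDB's actual code.

```python
import numpy as np

def smallest_uint_dtype(vocab_size: int):
    """Pick the smallest unsigned integer dtype that can hold all type IDs."""
    if vocab_size <= 256:            # e.g., character vocabularies
        return np.uint8              # 1 byte per token ID
    elif vocab_size <= 65_536:       # e.g., subword (BPE) vocabularies
        return np.uint16             # 2 bytes per token ID
    elif vocab_size <= 4_294_967_296:
        return np.uint32             # 4 bytes per token ID
    return np.uint64

# Example: a BPE-encoded sentence stored compactly instead of as Python ints
vocab_size = 64_000
ids = [101, 7, 2045, 9, 63_999]
arr = np.array(ids, dtype=smallest_uint_dtype(vocab_size))
print(arr.dtype, arr.nbytes, "bytes")   # uint16, 10 bytes
```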

RTG
Reader Translator Generator (RTG) is a neural machine translation (NMT) toolkit based on Pytorch (Paszke et al., 2019). Notable features of RTG are:
• Reproducible: All the required parameters of an experiment are included in a single YAML configuration file, which can be easily stored in a version control system such as git or shared with collaborators.
• Implements the Transformer (Vaswani et al., 2017) and recurrent neural network (RNN) models with cross-attention (Bahdanau et al., 2015; Luong et al., 2015).
• Supports distributed training on multiple nodes and GPUs, gradient accumulation, and float16 operations.
• Integrated Tensorboard helps in visualizing training and validation losses.
• Supports weight sharing (Press and Wolf, 2017), parent-child transfer (Zoph et al., 2016), beam decoding with length normalization (Wu et al., 2016) (see the sketch at the end of this section), early stopping, and checkpoint averaging.
• Flexible vocabulary options with NLCODEC and SentencePiece (Kudo and Richardson, 2018), which can be either shared or separate between source and target languages.
• Easy setup: pip install rtg
• Open-source: https://isi-nlp.github.io/rtg/
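As an illustration of the length normalization used during beam decoding, here is a small sketch of the length penalty proposed by Wu et al. (2016); the function names are ours and this is not RTG's actual implementation.

```python
def length_penalty(length: int, alpha: float = 0.6) -> float:
    """Length penalty from Wu et al. (2016): ((5 + |Y|) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha

def normalized_score(log_prob: float, length: int, alpha: float = 0.6) -> float:
    """Divide the hypothesis log-probability by the length penalty so that
    longer (but good) hypotheses are not unfairly penalized in beam search."""
    return log_prob / length_penalty(length, alpha)

# Example: compare a short and a long hypothesis from a beam
short = normalized_score(log_prob=-4.2, length=5)
long_ = normalized_score(log_prob=-7.5, length=12)
print(short, long_)
```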

Many-to-English Multilingual NMT
In this section, we demonstrate the use of our tools by creating a massively multilingual NMT model from publicly available datasets.

Dataset
We use MTDATA to download datasets from various sources, given in Table 1. To minimize data imbalance, we select only a subset of the datasets available for high resource languages, and select all available datasets for low resource languages. The selection is aimed at increasing the diversity of data domains and the quality of alignments.
[Table 1: sources of the parallel datasets used in this work; among the cited sources are Bojar et al. (2013, 2014, 2015, 2016, 2017, 2018) and Barrault et al. (2019, 2020).]

Cleaning: We use SACREMOSES to normalize Unicode punctuation and digits, followed by word tokenization. We remove records that are duplicates, have abnormal source-to-target length ratios, have many non-ASCII characters on the English side, contain a URL, or overlap exactly, on either the source or target side, with any sentence in the held-out sets. As preprocessing is compute-intensive, we parallelize it using Apache Spark. Cleaning and tokenization result in a corpus of 474 million sentences, with 9 billion tokens on each of the source and English sides. The token and sentence counts for each language are provided in Figure 1. Both the processed and raw datasets are available at http://rtg.isi.edu/many-eng/data/v1/.
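The following is a minimal sketch of the kinds of record-level filters described above (duplicate, length-ratio, non-ASCII, and URL checks); the helper names and thresholds are illustrative assumptions, not the exact values used in our Spark pipeline.

```python
import re

URL_RE = re.compile(r"https?://|www\.")

def non_ascii_fraction(text: str) -> float:
    return sum(ord(c) > 127 for c in text) / max(len(text), 1)

def keep_pair(src: str, eng: str, seen: set,
              max_ratio: float = 3.0, max_non_ascii: float = 0.5) -> bool:
    """Return True if a (source, English) pair passes the cleaning filters."""
    src_len, eng_len = len(src.split()), len(eng.split())
    if src_len == 0 or eng_len == 0:
        return False
    # Abnormal source-to-target length ratio
    if max(src_len, eng_len) / min(src_len, eng_len) > max_ratio:
        return False
    # Too many non-ASCII characters on the English side, or a URL anywhere
    if non_ascii_fraction(eng) > max_non_ascii or URL_RE.search(src + " " + eng):
        return False
    # Exact duplicates (in practice, also checked against held-out sets)
    key = (src, eng)
    if key in seen:
        return False
    seen.add(key)
    return True

seen = set()
pairs = [("bonjour le monde", "hello world"), ("bonjour le monde", "hello world")]
cleaned = [p for p in pairs if keep_pair(*p, seen)]
print(cleaned)   # only the first copy survives
```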

Many-to-English Multilingual Model
We use RTG to train Transformer NMT (Vaswani et al., 2017) with a few modifications. Firstly, instead of a shared BPE vocabulary for both source and target, we use two separate BPE vocabularies.
Since the source side has 500 languages but the target side has only English, we use a large source vocabulary and a relatively smaller target vocabulary. A larger target vocabulary leads to higher time and memory complexity, whereas a larger source vocabulary increases only the memory complexity, not the time complexity. We train several models, ranging from the standard 6-layer, 512-dimensional Transformer to larger ones with more parameters. Since the dataset is massive, a larger model trained on big mini-batches yields the best results. Our best performing model is a 768-dimensional model with 12 attention heads, 9 encoder layers, 6 decoder layers, a feed-forward dimension of 2048, dropout and label smoothing at 0.1, and 512,000 and 64,000 BPE types as source and target vocabularies, respectively. The decoder's input and output embeddings are shared. Since some of the English sentences are replicated to align with many sentences from different languages (e.g., the Bible corpus), BPE merges are learned from the deduplicated sentences using NLCODEC. Our best performing model is trained with an effective batch size of about 720,000 tokens per optimizer step. Such big batches are achieved by using mixed-precision distributed training on 8 NVIDIA A100 GPUs with gradient accumulation of 5 mini-batches, each having a maximum of 18,000 tokens. We use the Adam optimizer (Kingma and Ba, 2014) with 8,000 warm-up steps followed by a decaying learning rate, similar to Vaswani et al. (2017). We stop training after five days and six hours, when the optimizer has made a total of 200K updates; validation loss is still decreasing at this point. To assess the translation quality of our model, we report BLEU (Papineni et al., 2002; Post, 2018) on a subset of languages for which known test sets are available, as given in Figure 2, along with a comparison to the best model of Zhang et al. (2020).

[Figure 1: Training data statistics for the 500 languages, sorted in descending order of English token count. These statistics are obtained after de-duplication and filtering (see Section 3.1). The full names of the ISO 639-3 codes can be looked up using MTDATA, e.g., mtdata-iso eng.]
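For concreteness, the following sketch shows the effective-batch-size arithmetic stated above and the inverse-square-root learning-rate schedule of Vaswani et al. (2017); the exact schedule implemented in RTG may differ.

```python
# Effective batch size per optimizer step:
# 8 GPUs x 5 gradient-accumulation steps x ~18,000 tokens per mini-batch
gpus, accum_steps, tokens_per_batch = 8, 5, 18_000
effective_batch = gpus * accum_steps * tokens_per_batch
print(effective_batch)  # 720,000 tokens

def transformer_lr(step: int, d_model: int = 768, warmup: int = 8000) -> float:
    """Learning rate schedule from Vaswani et al. (2017):
    lr = d_model**-0.5 * min(step**-0.5, step * warmup**-1.5)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Warm up for 8,000 steps, then decay
print(transformer_lr(1), transformer_lr(8_000), transformer_lr(200_000))
```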

Applications
The model we trained as a demonstration of our tools is useful on its own, as described in the following sections.

[Figure 2: Many-to-English BLEU on OPUS-100 tests (Zhang et al., 2020). Despite having four times as many languages on the source side, our model scores BLEU competitive with the strongest system of Zhang et al. (2020) on most languages. The tests on which our model scores lower BLEU have shorter source sentences (mean length of about three tokens).]

Readily Usable Translation Service
Our pretrained NMT model is readily usable as a service capable of translating several hundred source languages to English. By design, source language identification is not necessary. Figure 2 shows that the model scores more than 20 BLEU, which may be a useful level of quality for certain downstream applications involving web and social media content analysis. Apache Tika (Mattmann and Zitting, 2011), a content detection and analysis toolkit capable of parsing thousands of file formats, has an option for translating any document into English using our multilingual NMT model (https://cwiki.apache.org/confluence/display/TIKA/NMT-RTG). Our model has been packaged and published to DockerHub (https://hub.docker.com/), and can be obtained by the following commands:

IMAGE=tgowda/rtg-model:500toEng-v1
docker run --rm -i -p 6060:6060 $IMAGE
# For GPU support: --gpus '"device=0"'

The above command starts a Docker container running an HTTP server, which provides a web interface (see Figure 3) and a REST API. An example interaction with the REST API is as follows:
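Below is a hypothetical sketch of such an interaction using Python's requests library; the endpoint path (/translate) and the form field name (source) are assumptions for illustration and may differ from the actual RTG REST API.

```python
import requests

# Assumes the Docker container above is running and listening on port 6060;
# the endpoint and field name below are assumptions, not confirmed by this paper.
resp = requests.post("http://localhost:6060/translate",
                     data={"source": "Bonjour le monde"})
print(resp.json())   # expected to contain the English translation
```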

Parent Model for Low Resource MT
[Figure 3: RTG Web Interface]

Fine-tuning is a useful transfer learning technique for improving the translation of low resource languages (Zoph et al., 2016; Neubig and Hu, 2018; Gheini and May, 2019). For instance, consider Breton-English (BRE-ENG) and Northern Sami-English (SME-ENG), two of the low resource settings for which our model has relatively poor BLEU (see Figure 2). To show the utility of fine-tuning with our model, we train a strong baseline Transformer model for each language from scratch using the OPUS-100 training data (Zhang et al., 2020), and fine-tune our multilingual model on the same dataset as the baselines. We shrink the parent model's vocabulary and embeddings to the child model's dataset, and train all models on NVIDIA P100 GPUs until convergence. Table 2, which shows BLEU on the OPUS-100 test set for the two low resource languages, indicates that our multilingual NMT parent model can be further improved by fine-tuning on limited training data; the fine-tuned model is significantly better than the baseline model.
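To make the vocabulary and embedding shrinking step concrete, here is a minimal PyTorch-style sketch that keeps only the parent embedding rows for types observed in the child data; variable and function names are ours, and this is not RTG's actual implementation.

```python
import torch

def shrink_embeddings(parent_emb: torch.Tensor, parent_vocab: dict, child_types: list):
    """Keep only the parent embedding rows for types that occur in the child data.

    parent_emb:   [parent_vocab_size, d_model] embedding matrix of the parent model
    parent_vocab: mapping from type string -> row index in parent_emb
    child_types:  types (e.g., BPE subwords) observed in the child dataset
    """
    kept_types = [t for t in child_types if t in parent_vocab]
    keep_rows = torch.tensor([parent_vocab[t] for t in kept_types])
    child_emb = parent_emb[keep_rows]                      # [child_vocab_size, d_model]
    child_vocab = {t: i for i, t in enumerate(kept_types)}
    return child_emb, child_vocab

# Example with toy values
parent_emb = torch.randn(6, 4)
parent_vocab = {"▁the": 0, "▁cat": 1, "▁dog": 2, "s": 3, "▁fish": 4, "<unk>": 5}
child_emb, child_vocab = shrink_embeddings(parent_emb, parent_vocab, ["▁the", "▁dog", "<unk>"])
print(child_emb.shape, child_vocab)   # torch.Size([3, 4]) {'▁the': 0, '▁dog': 1, '<unk>': 2}
```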
Related Work

NLCODEC: NLCODEC is a Python library for vocabulary management. It overcomes the multithreading bottleneck in Python by using PySpark. SentencePiece (Kudo and Richardson, 2018) and HuggingFace Tokenizers (Wolf et al., 2020) are the closest alternatives in terms of features; however, modification is relatively difficult for Python users, as these libraries are implemented in C++ and Rust, respectively. In addition, SentencePiece uses a binary format for model persistence in favor of efficiency, which takes away the inspectability of the model state. Retaining the ability to inspect models and modify core functionality is beneficial for further improving encoding schemes, e.g., subword regularization (Kudo, 2018), BPE dropout (Provilkov et al., 2020), and an optimal stopping condition for subword merges (Gowda and May, 2020). FastBPE is another efficient BPE tool, written in C++ (https://github.com/glample/fastBPE). Subword-nmt (Sennrich et al., 2016) is a Python implementation of BPE that stores the model in an inspectable plain-text format; however, it is not readily scalable to massive datasets such as the one used in this work. None of these tools has an equivalent to NLDB's mechanism for efficiently storing and retrieving variable-length sequences for distributed training.

RTG: Tensor2Tensor (Vaswani et al., 2018) originally offered the Transformer (Vaswani et al., 2017) implementation using Tensorflow (Abadi et al., 2015); our implementation uses Pytorch (Paszke et al., 2019), following the Annotated Transformer (Rush, 2018). OpenNMT currently offers separate implementations for both Pytorch and Tensorflow backends (Klein et al., 2017, 2020). As open-source toolkits evolve, many good features tend to propagate between them, leading to varying degrees of similarity. Some of the other available NMT toolkits are Nematus (Sennrich et al., 2017), xNMT (Neubig et al., 2018), Marian NMT (Junczys-Dowmunt et al., 2018), Joey NMT (Kreutzer et al., 2019), Fairseq (Ott et al., 2019), and Sockeye (Hieber et al., 2020). An exhaustive comparison of these NMT toolkits is beyond the scope of our current work.

Multilingual NMT: Johnson et al. (2017) show that NMT models are capable of multilingual translation without any architectural changes, and observe that when languages with abundant data are mixed with low resource languages, the translation quality of the low resource pairs is significantly improved. They use a private dataset of 12 language pairs; we use publicly available datasets for up to 500 languages. Qi et al. (2018) assemble a multi-parallel dataset for 58 languages from the TED Talks domain, which is included in our dataset. Zhang et al. (2020) curate OPUS-100, a multilingual dataset of 100 languages sampled from OPUS, including test sets, which are used in this work. Tiedemann (2020) has established a benchmark task for 500 languages, including single-direction baseline models. Wang et al. (2020) examine the language-wise imbalance problem in multilingual datasets and propose a method to address it using a scoring function, which we plan to explore in the future.

Conclusion
We have introduced our tools: MTDATA for downloading datasets, NLCODEC for processing, storing, and retrieving large-scale training data, and RTG for training NMT models. Using these tools, we have collected a massive dataset and trained a multilingual model for many-to-English translation. We have demonstrated that our model can be used independently as a translation service, and have also shown its use as a parent model for improving low resource language translation. All the described tools, the datasets used, and the trained models are made freely available to the public.

Acknowledgments

The authors acknowledge the Center for Advanced Research Computing (CARC) at the University of Southern California for providing computing resources that have contributed to the research results reported within this publication. URL: https://carc.usc.edu. The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources that have contributed to the research results reported within this paper. URL: http://www.tacc.utexas.edu. This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via AFRL Contract FA8650-17-C-9116. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

Ethical Consideration
Failure Modes: MTDATA will fail to operate, unless patched, when hosting services change their URLs or formats over time. In certain scenarios, when a dataset has previously been accessed and retained in the local cache, MTDATA continues to operate on the previous copy and ignores server-side updates. We have made our best effort to normalize language labels to the ISO 639-3 standard; however, our current version does not accommodate country and script variations of languages, e.g., UK English and US English are both mapped to eng. Our multilingual NMT model is trained to translate a full sentence at a time without considering source language information; translating short phrases without proper context might result in poor quality output.
Diversity and Fairness: We cover all languages on the source side for which publicly available datasets exist, which amounts to about 500 source languages. Our model translates into English only; hence, only English speakers benefit directly from this work.
Climate Impact: MTDATA reduces network transfers to a minimum by maintaining a local cache that avoids repetitive downloads. In addition to the raw datasets, preprocessed data is also made available to avoid repetitive computation. Our multilingual NMT model has a higher energy cost than a typical single-direction NMT model due to its larger number of parameters; however, since our single model translates hundreds of languages, its energy requirement is significantly lower than the total consumption of that many independent models. Our trained models, with all their weights, are also made available for download.
Dataset Ownership: MTDATA is a client-side library that does not claim ownership of the datasets in its index. Additions, removals, or modifications to its index can be requested by creating an issue at https://github.com/thammegowda/mtdata/issues. We ask dataset users to review each dataset's license and to acknowledge its original creators by citing their work; the BibTeX entries may be accessed using: mtdata list -n <NAME> -l <L1-L2> -full