PhoMT: A High-Quality and Large-Scale Benchmark Dataset for Vietnamese-English Machine Translation

We introduce a high-quality and large-scale Vietnamese-English parallel dataset of 3.02M sentence pairs, which is 2.9M pairs larger than the benchmark Vietnamese-English machine translation corpus IWSLT15. We conduct experiments comparing strong neural baselines and well-known automatic translation engines on our dataset and find that, in both automatic and human evaluations, the best performance is obtained by fine-tuning the pre-trained sequence-to-sequence denoising auto-encoder mBART. To the best of our knowledge, this is the first large-scale Vietnamese-English machine translation study. We hope our publicly available dataset and study can serve as a starting point for future research and applications on Vietnamese-English machine translation. We release our dataset at: https://github.com/VinAIResearch/PhoMT


Introduction
Vietnam has achieved rapid economic growth in the last two decades (Baum, 2020) and is now an attractive destination for trade and investment. Because of the language barrier, foreigners usually rely on automatic machine translation (MT) systems to translate Vietnamese texts into their native language or another language they are familiar with, e.g. the global language English, so that they can quickly catch up with ongoing events in Vietnam. The demand for high-quality Vietnamese-English MT has thus rapidly increased. However, state-of-the-art MT models require high-quality and large-scale corpora for training to reach near human-level translation quality (Wu et al., 2016; Ott et al., 2018). Despite being one of the most spoken languages in the world with about 100M speakers, Vietnamese is referred to as a low-resource language in MT research because publicly available parallel corpora for Vietnamese in general, and for Vietnamese-English MT in particular, are either not large enough or contain low-quality translation pairs, including pairs whose sentences differ in meaning (i.e. misalignment).
* The first three authors contributed equally to this work. † Work done during internship at VinAI Research. Email: qthai912@cs.washington.edu
Two main concerns are detailed as follows:
• High-quality Vietnamese-English parallel corpora are either not publicly available or small-scale. Ngo et al. (2013) and Phan-Vu et al. (2019) present two corpora, each comprising 800K sentence pairs; however, these two corpora are not publicly available. Thus, the Vietnamese-English parallel corpus IWSLT15 (Cettolo et al., 2015) of 133K sentence pairs extracted from TED-Talks transcripts is still considered the standard benchmark for MT when it comes to Vietnamese. Recently, the OPUS project (Tiedemann, 2012) has provided 300K+ sentence pairs extracted from the TED2020 v1 corpus of TED-Talks transcripts (Reimers and Gurevych, 2020).
• Larger Vietnamese-English parallel corpora are noisy, e.g. see discussions on the 300K-600K sentence pair corpora of JW300 (Agić and Vulić, 2019), OPUS's GNOME and QED (Abdelali et al., 2014) in Section 2.1, and on the OpenSubtitles corpus (Lison and Tiedemann, 2016) in Section 2.2. Recently, CCAligned (El-Kishky et al., 2020) and WikiMatrix (Schwenk et al., 2021) were created by using LASER sentence embeddings (Artetxe and Schwenk, 2019) and margin-based sentence alignment to mine parallel sentences from comparable web-document pairs. Though they contain millions of Vietnamese-English parallel sentence pairs, a large proportion of these pairs are misaligned or low-quality translations. In particular, we randomly sample 100 sentence pairs from each corpus and manually inspect their quality. We find that only 37/100 CCAligned pairs and 31/100 WikiMatrix pairs are high-quality translations.
As our first contribution, to help address the two concerns above, we present a high-quality and large-scale Vietnamese-English parallel dataset, named PhoMT, that consists of 3.02M sentence pairs. From PhoMT, we also prepare 38K sentence pairs with manual quality inspection, which are used for validation and test. We believe that our dataset construction process will help develop more efficient data creation strategies for other low-resource languages. As our second contribution, we empirically investigate strong baselines on our dataset, including Transformer-base, Transformer-big (Vaswani et al., 2017) and the pre-trained sequence-to-sequence denoising auto-encoder mBART (Liu et al., 2020), and compare these baselines with well-known automatic translation engines. We find that mBART obtains the highest scores in both automatic and human evaluations for both translation directions. To the best of our knowledge, this is the first large-scale empirical study for Vietnamese-English MT. As our final contribution, we publicly release our PhoMT dataset for research or educational purposes. We hope PhoMT together with our empirical study can serve as a starting point for future Vietnamese-English MT research and applications.

Our PhoMT dataset
Our dataset construction process consists of 4 phases. The 1st phase collects parallel document pairs. The 2nd phase is a pre-processing step that produces cleaned and high-quality parallel document pairs and then extracts sentences from these pairs. The 3rd phase aligns parallel sentences within each pair of parallel documents. The 4th phase is a post-processing step that filters out duplicated parallel sentence pairs and manually verifies the quality of the validation and test sets.

Collecting parallel document pairs
We collect the parallel document pairs from publicly available resources that contain original English documents and their corresponding Vietnamese-translated version.
• WikiHow: It is an online knowledge base of how-to guides that are available in multiple languages. We employ a multilingual WikiHow-based document summarization corpus (Ladhak et al., 2020) that contains 6616 pairs of WikiHow English articles and their Vietnamese-translated variant.
• TED-Talks: We use the TED2020 v1 corpus (Reimers and Gurevych, 2020) that includes 3123 English-Vietnamese subtitle pairs of TED talks.
• OpenSubtitles: We employ the latest version v2018 of the OpenSubtitles corpus (Lison and Tiedemann, 2016) that contains 3886 parallel movie and TV subtitles.
• MediaWiki: We also use parallel documents from the MediaWiki content translation data dump.
• News & Blogspot: We collect English and Vietnamese-translated versions of news and Blogspot articles from eight websites for English learners.
See URLs for the described resources in the Appendix.
Here, we do not include the available corpora JW300 (Agić and Vulić, 2019) and OPUS's GNOME and QED (Abdelali et al., 2014). We manually check 100 randomly sampled pairs from the 600K Vietnamese-English sentence pair corpus JW300 and find that 71 of them are high-quality translation pairs. However, it is worth noting that JW300 can introduce potential bias because of its religious domain. GNOME from OPUS contains 600K sentence pairs, in which most Vietnamese target sentences retain many English technical words that could have been translated, and thus do not read naturally. QED has 340K sentence pairs; however, our investigation finds that about half of the QED pairs are from the TED-Talks transcripts (Reimers and Gurevych, 2020), and when we randomly sample 100 pairs from the remaining sentence pairs, only 43 of them have a high-quality translation.

Pre-processing
We find that not all of the 3886 English-Vietnamese parallel document pairs in OpenSubtitles have a high-quality translation. We manually inspect each OpenSubtitles pair and remove 574/3886 (15%) document pairs with a low-quality translation, leaving 3312 pairs. In MediaWiki, some Vietnamese target documents contain original English paragraphs that have not yet been translated into Vietnamese. We employ the language identification module of fastText (Joulin et al., 2017) to identify those English paragraphs and filter them out of the Vietnamese documents. We also remove reference sections and tables that appear in some MediaWiki and Blogspot documents.
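The paragraph-level language filter can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' script: `drop_untranslated` and `toy_identify` are hypothetical names, and the toy identifier (which only looks for diacritics or the letter "đ") merely stands in for a real language identification model such as the fastText one.

```python
import unicodedata
from typing import Callable, List


def drop_untranslated(paragraphs: List[str],
                      identify_lang: Callable[[str], str],
                      unwanted: str = "en") -> List[str]:
    """Keep only the paragraphs whose predicted language is not `unwanted`."""
    kept = []
    for paragraph in paragraphs:
        text = paragraph.strip()
        if not text:
            continue  # nothing to classify in an empty paragraph
        if identify_lang(text) != unwanted:
            kept.append(paragraph)
    return kept


def toy_identify(text: str) -> str:
    """Toy stand-in for a real language identifier (assumption, demo only).

    Returns "vi" if the text contains "đ" or any accented letter,
    and "en" otherwise. A real setup would call a trained model instead.
    """
    decomposed = unicodedata.normalize("NFD", text.lower())
    has_diacritic = any(unicodedata.combining(ch) for ch in decomposed)
    return "vi" if has_diacritic or "đ" in decomposed else "en"


vi_doc = [
    "Hà Nội là thủ đô của Việt Nam.",
    "Hanoi is the capital of Vietnam.",  # untranslated English paragraph
    "Thành phố có lịch sử hơn một nghìn năm.",
]
cleaned = drop_untranslated(vi_doc, toy_identify)
print(cleaned)
```

With fastText, `identify_lang` could wrap the `predict` method of a loaded language identification model, whose labels take the form `__label__en`; the model file and any confidence threshold would need to be chosen separately.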
To extract sentences for parallel sentence alignment in the next phase, we perform (tokenization and) sentence segmentation by using the Stanford CoreNLP toolkit (Manning et al., 2014) and RDRSegmenter (Nguyen et al., 2018) for English and Vietnamese, respectively.
Table 1: Our dataset statistics. "#doc", "#pair", "#en/s" and "#vi/s" denote the number of parallel document pairs, the number of aligned parallel sentence pairs, the average number of word tokens per English sentence and the average number of syllable tokens per Vietnamese sentence, respectively. "OpenSub" abbreviates OpenSubtitles.

Aligning parallel sentence pairs
To align parallel sentences within a parallel document pair, our approach first employs the strong neural MT engine Google Translate to translate each English source sentence into Vietnamese. We then use three toolkits, Hunalign (Varga et al., 2007), Gargantua (Braune and Fraser, 2010) and Bleualign (Sennrich and Volk, 2011), to perform sentence alignment between the Google-translated variants of the English source sentences and the Vietnamese target sentences. Finally, we select only the sentence pairs that are aligned by at least two of the three toolkits as the output of our alignment process. The quality of our sentence alignment output is shown in Section 2.4.
Here, we discuss alignment coverage rates. On the same TED2020 v1 corpus, the automatic alignment approach of OPUS (Tiedemann, 2012), based on word alignments and phrase tables, aligns a total of 326K Vietnamese sentences (https://object.pouta.csc.fi/OPUS-TED2020/v1/moses/en-vi.txt.zip, wherein duplicate removal is not performed), while our approach aligns 350K Vietnamese sentences (i.e. a 7.5% relative improvement; see the Appendix for an additional discussion). Note that from each resource domain except OpenSubtitles, our approach selects 99+% of the Vietnamese sentences to be included in the output of our alignment process. In particular, a total of 14.6K Vietnamese sentences (i.e. 0.86%) are not selected from the five resource domains News, Blogspot, TED-Talks, MediaWiki and WikiHow. For OpenSubtitles, the rate drops to 95% (here, 120K Vietnamese sentences are not included in our alignment output). On average, the alignment coverage rate for English is about 2% absolute lower than the one for Vietnamese, as there are English source sentences that are not translated into Vietnamese in the collected corpora.
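The "at least two out of three" selection rule can be illustrated with a small voting sketch. It assumes, for simplicity, that each aligner's output has already been converted into a set of one-to-one (English index, Vietnamese index) pairs; the function name `vote_alignments` and the toy outputs are hypothetical, and the real toolkits' output formats are not modeled here.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

Pair = Tuple[int, int]  # (English sentence index, Vietnamese sentence index)


def vote_alignments(aligner_outputs: Iterable[Set[Pair]],
                    min_votes: int = 2) -> List[Pair]:
    """Return the pairs proposed by at least `min_votes` aligners, sorted."""
    votes: Counter = Counter()
    for pairs in aligner_outputs:
        votes.update(set(pairs))  # each aligner votes at most once per pair
    return sorted(pair for pair, n in votes.items() if n >= min_votes)


# Toy outputs standing in for the three toolkits (made-up indices).
hunalign = {(0, 0), (1, 1), (2, 3)}
gargantua = {(0, 0), (1, 2), (2, 3)}
bleualign = {(0, 0), (1, 1), (3, 4)}

selected = vote_alignments([hunalign, gargantua, bleualign])
print(selected)
```

Requiring agreement between two independent aligners trades a small loss in coverage for higher precision, which matches the coverage figures reported above.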

Post-processing
On the alignment output from the previous phase, we normalize punctuation, remove all duplicate sentence pairs within and across all document pairs, and also remove sentence pairs where the reference English sentence contains either only a single word token or swear words. We then randomly split each domain into training/validation/test sets at the document level with a 98.75/0.60/0.65 ratio, resulting in 2977999 training, 18876 validation and 19291 test sentence pairs. To verify the quality of our dataset, we manually inspect each sentence pair in the validation and test sets. Each pair is inspected independently by two of the first three co-authors: each inspector checks about (18876 + 19291) × 2 / 3 ≈ 25K sentence pairs in 125 hours on average (i.e. 200 sentence pairs/hour). After cross-checking, we find that in the validation set, 32 sentence pairs (0.17%) are misaligned (i.e. having a completely different sentence meaning or only partly preserving the sentence meaning), and 125 pairs (0.66%) are low-quality translations (i.e. mostly or completely preserving the sentence meaning, but with a Vietnamese target sentence that does not read naturally). In the test set, there are 27 misaligned sentence pairs (0.14%) and 113 low-quality translation pairs (0.59%). Note that performing a similar manual inspection on the training set is beyond our current human resources; however, with small proportions of misalignment and low-quality
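The filtering and document-level splitting described in this section can be sketched as below. This is only an illustration under assumptions: sentences are taken to be whitespace-tokenized, duplicates are detected by exact string match, the `banned_words` set is a hypothetical placeholder for a swear-word list, and the helper names are not from the paper.

```python
import random
from typing import Dict, FrozenSet, Iterable, List, Tuple

Pair = Tuple[str, str]  # (English sentence, Vietnamese sentence)


def clean_pairs(docs: Dict[str, List[Pair]],
                banned_words: FrozenSet[str] = frozenset()) -> Dict[str, List[Pair]]:
    """Drop single-token, banned-word and duplicate pairs across all documents."""
    seen = set()
    cleaned: Dict[str, List[Pair]] = {}
    for doc_id, pairs in docs.items():
        kept = []
        for en, vi in pairs:
            tokens = en.lower().split()
            if len(tokens) <= 1:
                continue  # English side has at most a single word token
            if banned_words.intersection(tokens):
                continue  # English side contains a banned word
            if (en, vi) in seen:
                continue  # exact duplicate within or across documents
            seen.add((en, vi))
            kept.append((en, vi))
        if kept:
            cleaned[doc_id] = kept
    return cleaned


def split_documents(doc_ids: Iterable[str],
                    ratios: Tuple[float, float, float] = (0.9875, 0.0060, 0.0065),
                    seed: int = 0) -> Tuple[List[str], List[str], List[str]]:
    """Randomly split document ids into train/validation/test lists."""
    ids = sorted(doc_ids)
    random.Random(seed).shuffle(ids)
    n_valid = round(len(ids) * ratios[1])
    n_test = round(len(ids) * ratios[2])
    valid = ids[:n_valid]
    test = ids[n_valid:n_valid + n_test]
    train = ids[n_valid + n_test:]
    return train, valid, test


docs = {
    "d1": [("Hello there .", "Xin chào ."), ("Yes", "Vâng"),
           ("Hello there .", "Xin chào .")],
    "d2": [("Hello there .", "Xin chào ."), ("Good morning .", "Chào buổi sáng .")],
}
cleaned = clean_pairs(docs)
train, valid, test = split_documents(f"doc{i}" for i in range(1000))
print(cleaned, len(train), len(valid), len(test))
```

Splitting by document rather than by sentence keeps all sentences of one document in the same set, so that validation and test sentences do not share a source document with training sentences.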

Experimental setup
We conduct experiments on our PhoMT dataset to study: (i) a comparison between the well-known automatic translation engines (here, Google Translate and Bing Translator) and strong neural MT baselines, and (ii) the usefulness of the pre-trained sequence-to-sequence denoising auto-encoder. In particular, we use the baseline models Transformer-base, Transformer-big (Vaswani et al., 2017), and the pre-trained denoising auto-encoder mBART (Liu et al., 2020). We report the standard metrics TER (Snover et al., 2006) and BLEU (Papineni et al., 2002), in which lower TER and higher BLEU indicate better performance. We compute the case-sensitive BLEU score using SacreBLEU (Post, 2018). See the Appendix for implementation details. Here, we select the model checkpoint that obtains the highest BLEU score on the validation set to apply to the test set.

Automatic evaluation results
Google Translate and Bing Translator, in which Transformer-big outperforms Transformer-base. In addition, mBART achieves the best performance among all models, reconfirming the quantitative effectiveness of multilingual denoising pre-training for neural MT (Liu et al., 2020).
We present BLEU scores on the Vi-to-En test set for each resource domain and sentence length bucket in Tables 3 and 4, respectively. Table 3 shows that the highest BLEU scores are reported for MediaWiki (wherein documents share and link common events or topics), followed by the ones reported for News and TED-Talks. The three remaining resource domains, Blogspot, WikiHow and OpenSubtitles, contain less common topic-specific document pairs, thus resulting in lower scores. In Table 4, we find that models produce lower BLEU scores for short- and medium-length sentences (i.e. < 20 tokens) than for long sentences. This is not surprising as a major proportion of short- and medium-
Figure 1 presents BLEU scores of Transformer-base on the validation set for the Vi-to-En setup when varying the number of training sentence pairs. Those scores clearly show the effectiveness of larger training sizes.
We perform an additional experiment to show that our curation effort has paid off. In particular, as not all of our data overlap with OPUS, for a fair comparison, we sample a set of 1.55M non-duplicate Vietnamese-English sentence pairs from OPUS's OpenSubtitles, which has the same size as our PhoMT's OpenSubtitles training subset and does not contain pairs appearing in our OpenSubtitles validation and test subsets. We train two Transformer-base models for Vi-to-En translation: one trained using the sampled OPUS's OpenSubtitles set and another one trained using our OpenSubtitles training subset. Hyper-parameter tuning is performed using our OpenSubtitles validation subset in the same manner as presented in the Appendix. We evaluate the models using our OpenSubtitles test subset. We find that Transformer-base trained using the sampled OPUS's OpenSubtitles set produces a significantly lower Vi-to-En BLEU score on our OpenSubtitles test subset than the one trained using our OpenSubtitles training subset (29.72 vs. 31.11), clearly showing the effectiveness of our quality control. Note that, as shown in Table 3, Transformer-base trained using the whole PhoMT training set obtains a higher Vi-to-En BLEU score of 32.29 on our OpenSubtitles test subset. This experiment thus also reconfirms the positive effect of a larger training size.
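Drawing a size-matched comparison set that excludes held-out pairs can be sketched as follows. This is a hedged illustration, not the authors' procedure: it assumes that duplicates and overlaps are detected by exact string match, and `sample_comparison_set` is a hypothetical helper name.

```python
import random
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (Vietnamese sentence, English sentence)


def sample_comparison_set(candidates: Iterable[Pair],
                          held_out: Iterable[Pair],
                          size: int,
                          seed: int = 0) -> List[Pair]:
    """Sample `size` unique pairs that do not occur in the held-out sets."""
    excluded = set(held_out)
    seen = set()
    pool = []
    for pair in candidates:
        if pair in excluded or pair in seen:
            continue  # skip validation/test pairs and duplicates
        seen.add(pair)
        pool.append(pair)
    if size > len(pool):
        raise ValueError("not enough eligible pairs to sample from")
    return random.Random(seed).sample(pool, size)


# Toy data (made up): 6 candidate pairs, one duplicate, one held-out pair.
candidates = [("a", "A"), ("b", "B"), ("a", "A"), ("c", "C"), ("d", "D"), ("e", "E")]
held_out = [("c", "C")]
sampled = sample_comparison_set(candidates, held_out, size=3)
print(sampled)
```

Matching the sample size to the curated training subset keeps the comparison about data quality rather than data quantity.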

Human evaluation results
We also conduct a human-based manual comparison between the outputs generated by the two automatic translation systems and our three neural baselines. For each translation direction, we randomly sample 100 source sentences in the test set; and for each sampled sentence, we anonymously shuffle the translation outputs from the five systems. Here, each sampled sentence is chosen so that no two of the five translation outputs are exactly the same. We then ask three external Vietnamese annotators to choose which translation they think is the best. The inter-annotator agreement between the three annotators, measured by Fleiss' kappa coefficient (Fleiss, 1971), is 0.63, which indicates substantial agreement. Our fourth co-author hosts and participates in a discussion session with the three annotators to resolve annotation conflicts. Table 2 shows the final results, where mBART gains the highest human evaluation scores, thus demonstrating its qualitative effectiveness for both En-to-Vi and Vi-to-En translation. Table 2 also shows that human preference is not always correlated with the automatic metrics TER and BLEU. For example, in the En-to-Vi setup, though the Transformer models have 2+ points better TER and BLEU than Google Translate, they are less preferred by humans than Google Translate (13 vs. 23 and 18 vs. 23). A detailed study is beyond the scope of our paper, but it is worth investigating in future work.
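For reference, Fleiss' kappa for this kind of annotation (several annotators each picking one of several systems per sentence) can be computed as in the following sketch. The input format, the function name and the toy counts are assumptions for illustration; the annotation data of the study are not reproduced here.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa from a table of counts.

    counts[i][j] is the number of annotators who chose category j
    (here: a translation system) for item i (here: a source sentence).
    Every item must have the same number of ratings.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Observed agreement: average over items of the pairwise agreement rate.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items

    # Expected agreement by chance, from the overall category proportions.
    p_categories = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in p_categories)

    if p_expected == 1.0:
        return 1.0  # all ratings fall in one category; convention used here
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy table (made up): 4 sentences, 3 annotators, 5 candidate systems.
toy_counts = [
    [3, 0, 0, 0, 0],
    [0, 2, 1, 0, 0],
    [0, 0, 0, 3, 0],
    [1, 0, 0, 0, 2],
]
print(round(fleiss_kappa(toy_counts), 3))
```

Values between 0.61 and 0.80 are commonly read as substantial agreement, which is the range in which the reported 0.63 falls.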

Conclusion
We have presented PhoMT, a high-quality and large-scale Vietnamese-English parallel dataset of 3.02M sentence pairs. We conduct experiments on our PhoMT dataset to compare strong baselines and demonstrate the effectiveness of the pre-trained denoising auto-encoder mBART for neural MT in both automatic and human evaluations. We hope that the public release of our dataset can serve as a starting point for further Vietnamese-English MT research and applications.

Discussion on the use of Google Translate
To align parallel sentences within a parallel document pair as described in Section 2.3, we first translate each English source sentence into Vietnamese by using Google Translate. In this step, Google Translate is used through the "GoogleTranslate" function in Google Sheets. However, we later find that this "GoogleTranslate" function in Google Sheets produces lower performance scores than the Google Translate API in both automatic and human evaluation setups. Therefore, the "Google Translate" scores reported in our result tables are obtained with the Google Translate API on the validation and test sets.

Implementation details
We employ Transformer and mBART implementations from fairseq (Ott et al., 2019). For both Transformer models (Vaswani et al., 2017), following Ding et al. (2019), we use subword-nmt to learn joint BPE with 32K merge operations (Sennrich et al., 2016). For mBART, we fine-tune the pre-trained sequence-to-sequence model mBART25 (Liu et al., 2020). Here, mBART25 is pre-trained on a Common Crawl dataset of 25 languages, which contains 300 GB of English texts and 137 GB of Vietnamese texts. Following Vaswani et al. (2017), we use beam search with a beam size of 4 and length normalization of 0.6 for decoding. Due to the model sizes, we apply batch sizes of 16K tokens for Transformer-base, 8K tokens for Transformer-big and 4K tokens for mBART. We optimize the models using Adam (Kingma and Ba, 2014) and run for 30 training epochs, wherein the initial learning rate of Adam is warmed up during the first epoch. In addition, we perform grid search to select the initial learning rate from {1e-4, 3e-4, 5e-4, 7e-4} for the Transformer models and from {1e-5, 3e-5, 5e-5, 7e-5} for mBART. For both English-to-Vietnamese and Vietnamese-to-English translation setups, the optimal learning rates selected for Transformer-base, Transformer-big and mBART are 5e-4, 3e-4 and 5e-5, respectively. Here, we evaluate each model 8 times during every training epoch, and then select the model checkpoint that obtains the highest BLEU score on the validation set to apply to the test set. We compute the detokenized and case-sensitive BLEU score using SacreBLEU (with the signature "BLEU+case.mixed+numrefs.1+smooth.exp+tok.1-3a+version.1.5.1"). Similarly, we also compute the detokenized and case-sensitive TER score (with the option "-N" of normalization).
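The selection logic (grid search over learning rates combined with picking the checkpoint with the highest validation BLEU) can be sketched as below. It is a minimal illustration that assumes the validation BLEU scores have already been collected per learning rate; `select_best_checkpoint` is a hypothetical name and the scores in the example are made up, not results from the paper.

```python
from typing import Dict, List, Tuple


def select_best_checkpoint(valid_bleu: Dict[float, List[float]]) -> Tuple[float, int, float]:
    """Return (learning rate, evaluation index, BLEU) of the best checkpoint.

    valid_bleu maps each candidate learning rate to the validation BLEU
    scores of its checkpoints, in the order in which they were evaluated.
    Evaluation indices start at 1; ties keep the first checkpoint found.
    """
    best = None
    for learning_rate, scores in valid_bleu.items():
        for index, bleu in enumerate(scores, start=1):
            if best is None or bleu > best[2]:
                best = (learning_rate, index, bleu)
    if best is None:
        raise ValueError("no validation scores were provided")
    return best


# Made-up validation scores for four candidate learning rates,
# with three evaluations each (the paper evaluates 8 times per epoch).
toy_scores = {
    1e-4: [20.1, 24.3, 25.0],
    3e-4: [22.5, 26.8, 27.1],
    5e-4: [23.0, 27.4, 27.2],
    7e-4: [21.7, 26.0, 26.5],
}
best_lr, best_index, best_bleu = select_best_checkpoint(toy_scores)
print(best_lr, best_index, best_bleu)
```

Only validation scores enter the selection, so the test set is touched once, by the single checkpoint chosen this way.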

Additional results
Tables 5 and 6 present details of TER and BLEU scores on the validation and test sets for each domain. In addition, we show TER and BLEU scores for each sentence length bucket in Table 7.