NECTEC’s Participation in WAT-2021

In this paper, we report the experimental results of the machine translation models built by the NECTEC team (Team-ID: NECTEC) for the WAT-2021 Myanmar-English translation task (Nakazawa et al., 2021). Our models are based on neural methods for both translation directions, English-to-Myanmar and Myanmar-to-English. Most existing Neural Machine Translation (NMT) models focus on the conversion of sequential data and do not directly use syntactic information. In contrast, we build multi-source NMT models that use several corpora: a string data corpus, a tree data corpus, and a POS-tagged data corpus. Multi-source translation is an approach that exploits multiple inputs (e.g., the same sentence in two different formats) to increase translation accuracy. We experimented with an RNN-based encoder-decoder model with attention mechanism and with transformer architectures. The experimental results showed that the proposed models of the RNN-based architecture outperform the baseline model for the English-to-Myanmar translation task, and the multi-source and shared-multi-source transformer models yield better translation results than the baseline.

Introduction
Machine translation (MT) is a quick and effective way to convey content from one language in another. MT is the automatic translation of human languages by computers. The first machine translation systems were rule-based and built using only linguistic information. The translation rules were manually created by experts. Although the rules are well defined, this process is very expensive and does not translate well across all domains and languages. Today, many researchers have successfully built Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) systems for various languages instead of rule-based ones.
NMT has become the state-of-the-art approach compared to the previously dominant phrase-based statistical machine translation (SMT) approaches. However, existing NMT models do not directly use syntactic information. Therefore, we propose tree-to-string and POS-to-string NMT systems built as multi-source translation models. We trained these multi-source translation models in both directions, Myanmar-to-English and English-to-Myanmar. They are based on the multi-source and shared-multi-source approaches of previous research work (Junczys-Dowmunt and Grundkiewicz, 2017). Figure 1 and Figure 2 show the architecture of the multi-source translation models. To train the proposed models with the transformer and s2s architectures for English-to-Myanmar translation, we applied word-level segmentation and the tree format on the English side and syllable-level segmentation on the Myanmar side. For Myanmar-to-English translation, we used syllable-level segmentation and POS-tagged words on the Myanmar side and word-level segmentation on the English side.
In this paper, section 2 describes our MT systems. The experimental setup is presented in section 3. Section 4 reports the results of our experiments, and section 5 presents the error analysis of the translated outputs. Finally, section 6 concludes the report.

System Description
In this section, we describe the methodology used in our experiments. To build the NMT systems, we chose the Marian framework 1 (Junczys-Dowmunt et al., 2018) with two architectures: Transformer and an RNN-based encoder-decoder model with attention mechanism (s2s). Marian is a self-contained neural machine translation toolkit focused on efficiency and research. This framework, a reimplementation of Nematus (Sennrich et al., 2017), is an efficient Neural Machine Translation framework written in pure C++ with minimal dependencies.
The main features of Marian are a pure C++ implementation, one engine for GPU/CPU training and decoding, fast multi-GPU training and batched translation on GPU/CPU, minimal dependencies on external software (CUDA or MKL, and Boost), static compilation (i.e., compile once, copy the binary, and use it anywhere), and a permissive open-source MIT license. The Marian framework supports several model types. Among them, we used the following six in our experiments:
transformer: a model originally proposed by Google (Vaswani et al., 2017) based on attention mechanisms.
multi-transformer: a transformer model that uses multiple encoders.
shared-multi-transformer: the same as multi-transformer, except that the two encoders share parameters during training.
s2s: an RNN-based encoder-decoder model with attention mechanism. The architecture is equivalent to the Nematus models (Sennrich et al., 2017).
multi-s2s: an s2s model that uses two or more encoders, allowing multi-source neural machine translation.
shared-multi-s2s: the same as multi-s2s, except that the two encoders share parameters during training.
1 https://github.com/marian-nmt/marian
In our experiments, two baseline models (transformer and RNN-based attention: s2s) are used for the English-to-Myanmar and Myanmar-to-English translation tasks. For the first translation task, the baseline models take a single input of English tree data {tree-en} and produce a Myanmar string {my} as output. The multi-transformer, shared-multi-transformer, multi-s2s, and shared-multi-s2s models take two inputs, English string data and English tree data {en, tree-en}, and produce a Myanmar string {my} as output. For the second translation task, the baseline models take Myanmar POS data {pos-my} as input and produce an English string {en} as output. The multi-source and shared-multi-source models take two inputs, Myanmar string data and Myanmar POS data {my, pos-my}, and produce an English string {en} as output. In other words, the baseline, multi-source, and shared-multi-source models work in the same way in both tasks and differ only in their inputs and outputs.
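The input and output configurations described above can be summarised in a short sketch that assembles a Marian training command for each model type. This is only an illustration under stated assumptions: the file names (train.en, train.tree-en, train.my, and the vocabulary and model files) are hypothetical, and the paper does not report its actual command lines or hyperparameters. Only the Marian options --type, --model, --train-sets, and --vocabs are used.

```python
from typing import List

# Hypothetical training files; the paper does not name them.
EN, TREE_EN, MY = "train.en", "train.tree-en", "train.my"


def marian_command(model_type: str, sources: List[str], target: str) -> List[str]:
    """Build an argument list for Marian training.

    Marian expects the source file(s) first and the target file last
    in --train-sets, with one vocabulary per training file.
    """
    train_sets = sources + [target]
    # Hypothetical vocabulary names derived from the file suffix.
    vocabs = [f"vocab.{path.split('.')[-1]}.yml" for path in train_sets]
    return [
        "marian",
        "--type", model_type,
        "--model", f"model.{model_type}.npz",
        "--train-sets", *train_sets,
        "--vocabs", *vocabs,
    ]


# English-to-Myanmar: the baseline sees only the tree input,
# while the multi-source models see both the string and the tree input.
baseline_cmd = marian_command("transformer", [TREE_EN], MY)
multi_cmd = marian_command("multi-transformer", [EN, TREE_EN], MY)
shared_cmd = marian_command("shared-multi-transformer", [EN, TREE_EN], MY)

print(" ".join(multi_cmd))
```

Replacing the transformer types with s2s, multi-s2s, and shared-multi-s2s gives the RNN-based counterparts, and swapping the inputs for {my, pos-my} with target {en} gives the Myanmar-to-English setup.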

Parallel Data
The parallel data for the Myanmar-English and English-Myanmar translation tasks was provided by the organizers of the competition and consists of two corpora: the ALT corpus and the UCSY corpus. The ALT corpus is one part of the Asian Language Treebank (ALT) Project (Riza et al., 2016) and consists of twenty thousand Myanmar-English parallel sentences from Wikinews. The UCSY corpus (Yi Mon ShweSin et al., 2018) contains 238,014 sentences from various domains, including news articles and textbooks. The UCSY corpus for WAT-2021 is not identical to the one used in WAT-2020 because the corpus has been extended. Unlike the ALT corpus, the Myanmar text in the UCSY corpus is not segmented. The ALT corpus is extremely small, and thus the development data and test data were chosen from it. We had planned to run experiments with and without the ALT training data, because the test data are drawn only from the ALT corpus. However, due to very limited hardware (a workstation with only 2 GPUs and 8 GB of memory), training took a very long time and crashed several times, and we could not finish both sets of experiments. Therefore, in this paper, we present the experimental results obtained with training data from the UCSY corpus only, which contains around 238,000 lines. Table 1 shows the statistics of the data used for the experiments.

Data Preprocessing
In this section, we describe the preprocessing steps applied before training. Proper syllable segmentation or word segmentation is essential for the quality of machine translation for the Myanmar language because this language has no clear definition of word boundaries. Although the Myanmar text in the ALT corpus is manually word-segmented, the text in the UCSY corpus is not segmented, so we needed to segment it. We prepared both syllable and word segmentation for the Myanmar language data. We used our in-house myWord 2 segmenter for Myanmar word segmentation and the Myanmar sylbreak 3 segmenter for syllable segmentation. The myWord segmenter can perform syllable segmentation, word segmentation, and phrase segmentation for the Myanmar language. In this paper, we used this tool only for word segmentation. The myWord segmenter tool will be released soon.
After word segmentation, we need to apply POS tagging to the segmented Myanmar data. In addition, we need to parse the English data to obtain the English tree data. We had several reasons for implementing a multi-source NMT system in this paper. To the best of our knowledge, no experiments have been conducted on a multi-source NMT system using POS data and syntactic tree information. In particular, no such multi-source NMT system has been developed for the Myanmar language. There is only one Factored SMT paper (Ye Kyaw Thu et al., 2014) that uses Myanmar POS data. Thus, we implemented a multi-source NMT system for Myanmar-to-English and English-to-Myanmar translation. To implement this system, we need POS tagging on the Myanmar side and the tree data format on the English side. Although we wanted to use the tree format on the Myanmar side as well, Myanmar data cannot currently be built in a syntactic tree format like the English data. Thus, we can only use Myanmar POS (part-of-speech) data and the English tree data format for implementing the multi-source translation models. The part-of-speech tagging and the parser that we used in our experiments are described in the following sections.

Part-of-speech Tagging
The POS-tagged data for the sentence "က န တ က သ တသ တစ ယ က ပ ။" (I am a researcher.) is described in the following:
က န တ /pron က/ppm သ တသ /n တစ /tn ယ က /part ပ /part ။/punc
We also evaluated the accuracy of the RDR model. To evaluate this model, 1,300 Myanmar sentences were retrieved from the UCSY corpus and tagged by the selected RDR model. We also tagged the same Myanmar sentences manually. Finally, we evaluated the accuracy of the RDR model by comparing the two sets of tagged data. We found that the RDR model achieves a tagging accuracy of 77% Precision, 81% Recall, and 79% F-Measure.
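As a quick consistency check on the reported figures, and assuming that the F-Measure is the balanced F1 score (the harmonic mean of precision and recall, which the text does not state explicitly), the three values agree with each other:

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.77 \times 0.81}{0.77 + 0.81}
    = \frac{1.2474}{1.58}
    \approx 0.79
```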

RegexpParser
Word-level segmentation and the tree data format were used on the English side for the experiment. The English data provided by WAT-2021 are already segmented, so no segmentation is needed for the English side. For parsing the English data, we tested several parsers on the English side of our experiment data: the English PCFG (Probabilistic Context-Free Grammar) parser from Stanford Parser 5 , BLLIP Parser 6 , Berkeley Neural Parser 7 , and RegexpParser 8 . The PCFG parser parses an English sentence into the tree data format, but it cannot parse long sentences of more than 70 words. The longest sentence in our experiment data contains approximately 1,000 words, so the PCFG parser cannot be used for parsing our experiment data. BLLIP Parser is a statistical natural language parser that includes a generative constituent parser and a discriminative maximum entropy re-ranker. It is available in Python and Java versions. Although it accepts longer sentences than the PCFG parser, it also cannot parse the long sentences in our experiment data.
Berkeley Neural Parser is a high-accuracy parser with models for 11 languages, implemented in Python. It is based on constituency parsing with a self-attentive encoder, with additional changes from multilingual constituency parsing with self-attention and pre-training. Although this parser can parse the long sentences in our experiment data, it takes far more time than the RegexpParser 9 (a grammar-based chunk parser) from the nltk package. Of the parsers compared, RegexpParser can parse the longest sentences and all of the experiment data within a few minutes. Moreover, RegexpParser is the simplest parser for generating parse tree data. Thus, we chose RegexpParser to produce the tree data format for the English side of our experiment data.
RegexpParser, a grammar-based chunk parser, uses a set of regular expression patterns to specify the behavior of the parser. The chunking of the text is encoded using a ChunkString, and each rule acts by modifying the chunking in the ChunkString. The rules are implemented using regular expression matching and substitution. A grammar contains one or more clauses in the following form:

    {<DT|JJ>}          # chunk determiners and adjectives
    }<[\.VI].*>+{      # strip any tag beginning with V, I, or .
    <.*>}{<DT>         # split a chunk at a determiner
    <DT|JJ>{}<NN.*>    # merge chunk ending with det/adj with one starting with a noun

The clauses of a grammar are also executed in order. A cascaded chunk parser is one having more than one clause. The maximum depth of a parse tree generated by RegexpParser is the same as the number of clauses in the grammar. To parse a sentence, we first create the chunker by calling the RegexpParser function with the grammar we built. Second, the input sentence is tokenized, and the tokenized sentence is tagged using functions from the nltk package. After tagging, the chunker's parse function is called with the tagged sentence as its parameter. This yields the output in parse tree format, which we then convert to a tree-format string. These procedures were used for parsing the English side of our experiment data. An example of an English parse tree produced by this RegexpParser is shown as follows:
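The parsing procedure described above can be sketched with nltk as follows. This is a minimal illustration, not the configuration used in the paper: the two-clause grammar is an assumption, and the sentence is tagged by hand so that the sketch runs without downloading the tokenizer and tagger resources that nltk.word_tokenize and nltk.pos_tag require.

```python
from nltk import RegexpParser

# Illustrative cascaded grammar with two clauses (an assumption; the
# grammar used in the paper is not given). The first clause chunks an
# optional determiner, adjectives, and nouns into NP; the second chunks
# a verb followed by an NP into VP.
GRAMMAR = r"""
NP: {<DT>?<JJ>*<NN.*>+}
VP: {<VB.*><NP>}
"""

# Step 1: create the chunker from the grammar.
chunker = RegexpParser(GRAMMAR)

# Step 2: tokenize and tag the sentence. In practice this would be
# nltk.pos_tag(nltk.word_tokenize(sentence)); here the tags are given
# directly as (word, tag) pairs.
tagged = [("I", "PRP"), ("am", "VBP"), ("a", "DT"),
          ("researcher", "NN"), (".", ".")]

# Step 3: parse the tagged sentence into an nltk Tree.
tree = chunker.parse(tagged)

# Step 4: flatten the tree into a single-line tree-format string.
tree_string = " ".join(str(tree).split())
print(tree_string)
```

Because the grammar has two clauses, the resulting tree nests chunks at most two levels below the root, which matches the depth property stated above.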

Evaluation Results
Our systems are evaluated on the ALT test set, and the evaluation results are shown in Table 2. For the evaluation of the Myanmar-to-English and English-to-Myanmar translation pairs, we used three evaluation metrics: Bilingual Evaluation Understudy (BLEU) (Papineni et al., 2002), Rank-based Intuitive Bilingual Evaluation Score (RIBES) (Isozaki et al., 2010), and Adequacy-Fluency Metrics (AMFM) (Banchs et al., 2015).
The BLEU score measures the precision of n-grams (over all n ≤ 4 in our case) with respect to a reference translation, with a penalty for short translations. Intuitively, the BLEU score measures the adequacy of the translation, and a larger BLEU score indicates better translation quality. RIBES is an automatic evaluation metric based on rank correlation coefficients modified with precision, and it pays special attention to the word order of the translation results. The RIBES score is suitable for distant language pairs such as Myanmar and English. Larger RIBES scores indicate better translation quality. AM-FM is a two-dimensional automatic evaluation metric for machine translation systems. It is designed to address the semantic and syntactic aspects of the translation independently. The larger the AMFM score, the better the translation quality.
English-to-Myanmar translation results are shown in the first part of Table 2. For the first architecture (i.e., transformer), the shared-multi-transformer model achieves a higher BLEU score (+1.18) than the baseline transformer model. Furthermore, the multi-transformer model performs better than the baseline transformer in terms of AMFM score. However, the RIBES scores of the multi-transformer and shared-multi-transformer models are lower than that of the baseline transformer model. For the second architecture (i.e., s2s or RNN-based attention), the multi-s2s model outperforms the baseline s2s model and the shared-multi-s2s model in terms of BLEU and AMFM scores. The shared-multi-s2s model provides a better RIBES score (0.626460). The highest BLEU score (13.90), from the shared-multi-transformer model, and the highest AMFM score (0.654780), from the multi-transformer model, are produced by the first architecture, while the highest RIBES score (0.625476) is achieved by the multi-s2s model of the second architecture.
Myanmar-to-English translation results are shown in the second part of Table 2. For Myanmar-to-English translation, the two baseline models (i.e., transformer and s2s) outperform the other models in terms of BLEU, RIBES, and AMFM scores, so the multi-source models bring no improvement in this translation task. For English-to-Myanmar translation, on the other hand, the multi-transformer model is better than the baseline transformer model in terms of AMFM score, and the shared-multi-transformer model performs better than the baseline in terms of BLEU score. Moreover, the multi-s2s and shared-multi-s2s models also provide better translation results than the baseline model.

Error Analysis
For both the English-to-Myanmar and Myanmar-to-English translation models, we analyzed the translated outputs using Word Error Rate 10 (WER). For the error analysis, we used the SCLITE (score speech recognition system output) program from the NIST scoring toolkit SCTK 11 version 2.4.10 to make dynamic-programming-based alignments between the reference (ref) and the hypothesis (hyp) and to calculate the WER. The WER formula can be described as the following equation:

\[ \mathrm{WER} = \frac{S + D + I}{N} \times 100\% = \frac{S + D + I}{S + D + C} \times 100\% \]

where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct words, and N is the number of words in the reference (N = S + D + C). The WER percentage can be greater than 100% when the number of insertions is very high.
Table 3 shows the WER scores of the English-to-Myanmar and Myanmar-to-English translation models. In this table, the lower WER scores are highlighted in bold. The lower the WER score, the better the translation model.
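The WER calculation described above can be sketched in a few lines of Python. This is a simplified illustration, not the SCLITE implementation: it assumes whitespace tokenization and unit costs for substitutions, deletions, and insertions, and the function names are our own.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of substitutions, deletions, and insertions
    (S + D + I) needed to turn ref into hyp, by dynamic programming."""
    prev = list(range(len(hyp) + 1))  # empty reference: insertions only
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # empty hypothesis: deletions only
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: (S + D + I) / N * 100."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hyp_words) / len(ref_words)


# One substitution (the -> a) and one insertion (down) over 3 reference words.
print(round(wer("the cat sat", "a cat sat down"), 1))
# Three insertions over a single reference word push WER above 100%.
print(wer("yes", "oh yes it is"))
```

The second call illustrates why the percentage can exceed 100%: the numerator counts insertions, while the denominator counts only reference words.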
For the first architecture in English-to-Myanmar translation, the baseline transformer model gives a lower WER score (81.3%) than the multi-transformer and shared-multi-transformer models. However, in the second architecture, the shared-multi-s2s model provides a lower WER score (82.5%) than the baseline (s2s) and multi-s2s models. In Myanmar-to-English translation, the multi-transformer and shared-multi-transformer models of the first architecture yield higher WER scores (90.0% and 88.2%) than the baseline transformer model.
As these higher WER scores indicate, in Myanmar-to-English translation the multi-transformer and shared-multi-transformer models did not provide better translation results than the baseline transformer model, and the multi-s2s and shared-multi-s2s models likewise did not improve on the baseline s2s model.
After analyzing the confusion pairs of the English-to-Myanmar and Myanmar-to-English translation models in detail, we found that most of the confusion pairs in the translations are caused by (1) the nature of the Myanmar language (written or spoken form), (2) incorrect word segmentation or data cleaning errors on the English side, (3) the absence of articles (i.e., a, an, and the) in the Myanmar language, and (4) the different nature of, and the language gaps between, Myanmar and English. The top 10 confusion pairs of the English-to-Myanmar and Myanmar-to-English translations of the transformer model are shown in Table 4. In this table, the first column gives the reference and hypothesis pairs (i.e., the output of the translation model) for English-to-Myanmar translation. The third column gives those for Myanmar-to-English translation.
All of the confusion pairs in the first column are caused by the nature of the Myanmar language. For example, across the written and spoken forms of Myanmar, the word "သည" ("is" in English) is the same as the word "တယ" ("is" in English). Moreover, the words "၏" ("of" or "'s" in English) and "ရ" ("of" or "'s" in English) in the possessive position, and the words "မ" (plural form in English) and " တ" (plural form in English), have the same meanings. In other words, these hypotheses are synonyms of the reference words. In the third column of Table 4, for Myanmar-to-English translation, the confusion pairs "apos → quot", "quot → apos", "the → &amp", ", → the", and "the → s" are caused by incorrect word segmentation or data cleaning errors on the English side. Furthermore, we found that the confusion pairs "the → a" and "a → the" are caused by the absence of articles (i.e., a, an, and the) in the Myanmar language. The confusion pairs "in → of", "to → of", and "with → and" are caused by the different nature of, and the language gaps between, Myanmar and English. Myanmar writers also often misuse the words "in", "of", and "with" in English writing.
For instance, for the Myanmar sentence "သ