Hybrid Statistical Machine Translation for English-Myanmar: UTYCC Submission to WAT-2021

In this paper we describe our submissions to WAT-2021 (Nakazawa et al., 2021) for the English-to-Myanmar (Burmese) translation task. Our team, ID “YCC-MT1”, focused on bringing transliteration knowledge to the decoder without changing the model. We manually extracted transliteration word/phrase pairs from the ALT corpus and applied the XML markup feature of the Moses decoder (i.e. -xml-input exclusive, -xml-input inclusive). We demonstrate that this hybrid translation technique can significantly improve (by around 6 BLEU points) the baselines of three well-known approaches: “Phrase-based SMT”, “Operation Sequence Model” and “Hierarchical Phrase-based SMT”. Moreover, this simple hybrid method achieved the second highest results among the submitted MT systems for the English-to-Myanmar WAT2021 shared translation task according to BLEU (Papineni et al., 2002) and AMFM scores (Banchs et al., 2015).


Introduction
While both statistical machine translation (SMT) and neural machine translation (NMT) have proven successful for high-resource languages, it is still an open research question how to make them work well for low-resource language pairs with long-distance reordering, such as English and Burmese (Duh et al., 2020; Kolachina et al., 2012; Trieu et al., 2019). To the best of our knowledge, there are only two publicly available English-Myanmar parallel corpora for research purposes, the ALT Corpus (Ding et al., 2020) and the UCSY Corpus (Yi Mon Shwe Sin and Khin Mar Soe, 2019), and their sizes are around 20K and 200K respectively. The parallel data for the Myanmar-English machine translation shared task at the Workshop on Asian Translation (WAT) is a combination of these two corpora, and thus the task is a good opportunity for NLP researchers who are working on low-resource machine translation. Motivated by this challenge, we represented the University of Technology, Yatanarpon Cyber City (UTYCC) and participated in the English-Myanmar (en-my) shared task of WAT2021 (Nakazawa et al., 2021).
In this paper, we propose a hybrid system based on plugging XML markup translation knowledge into the SMT decoder. The translation rules for transliterated and borrowed words, and for English words used directly in the target language, are constructed from a parallel word dictionary. The English-Myanmar transliteration dictionary was built by manually extracting parallel words/phrases from the whole ALT corpus. This simple hybrid method outperformed the three baselines and achieved the second highest results among the submitted MT systems for the English-to-Myanmar WAT2021 shared translation task according to BLEU (Papineni et al., 2002) and AMFM scores (Banchs et al., 2015).
The remainder of this paper is organized as follows. In Section 2, we introduce the data preprocessing, including word segmentation and cleaning steps. In Section 3, we describe the details of our three SMT systems. The machine translation evaluation metrics are presented in Section 4. The manual extraction process of transliteration word/phrase pairs from the ALT English-Myanmar parallel data is described in Section 5. Then, the SMT decoding with XML markup technique is described in Section 6. In Section 7, we present hybrid translation results achieved by all our systems. Section 8 concludes this paper.

Preprocessing for English and Myanmar
We tokenized the English data with the Moses tokenizer and escaped special characters with the Moses Perl script escape-special-chars.perl (Koehn et al., 2007). For Myanmar, the provided ALT training data was already segmented, but no word segmentation was provided for the UCSY corpus. We therefore performed syllable segmentation using sylbreak.pl (Ye Kyaw Thu, 2017).
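The escaping step can be illustrated with a minimal Python sketch. It assumes that escape-special-chars.perl replaces the characters that have a special meaning in Moses input with the entities listed below; the table is written from memory of that script and should be checked against the Moses version actually used. The function name is hypothetical.

```python
# Assumed character-to-entity table of Moses' escape-special-chars.perl.
# '&' comes first so that ampersands introduced by later entities
# are not escaped a second time.
MOSES_ESCAPES = [
    ("&", "&amp;"),
    ("|", "&#124;"),   # factor separator in Moses
    ("<", "&lt;"),     # XML markup
    (">", "&gt;"),
    ("'", "&apos;"),
    ('"', "&quot;"),
    ("[", "&#91;"),    # non-terminal brackets in hierarchical models
    ("]", "&#93;"),
]


def escape_special_chars(line: str) -> str:
    """Return the line with Moses-reserved characters replaced by entities."""
    for raw, entity in MOSES_ESCAPES:
        line = line.replace(raw, entity)
    return line


if __name__ == "__main__":
    print(escape_special_chars('AT&T [NYSE] | "quote"'))
```

Escaping matters for the hybrid setup in particular, because unescaped angle brackets in the source text would otherwise be read by the decoder as XML markup.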

Parallel Data Statistics
The corpus for the English-Myanmar shared task consists of two separate corpora: the UCSY corpus and the ALT corpus. The domain of the UCSY corpus is general and the

SMT Systems
In this section, we describe the methodology used in the machine translation experiments for this shared task.

Phrase-based Statistical Machine Translation
A PBSMT translation model is based on phrasal units (Koehn et al., 2003). Here, a phrase is simply a contiguous sequence of words and generally not a linguistically motivated phrase. A phrase-based translation model typically gives better translation performance than word-based models. A simple phrase-based translation model consists of phrase-pair probabilities extracted from a corpus, a basic reordering model, and an algorithm that extracts the phrases to build a phrase table (Specia, 2011). The phrase translation model is based on the noisy channel model: the goal is to find the best translation ê that maximizes the translation probability P(e|f) given the source sentence f. Here, the source language is French and the target language is English. The translation of a French sentence into an English sentence is modeled as Equation 1.
Applying Bayes' rule, we can factorize it into three parts.
The final mathematical formulation of the phrase-based model is as follows:
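The three steps described above can be written out as the standard noisy-channel derivation. The notation below (ê for the best translation, e for an English candidate, f for the French source sentence) is an assumption based on the surrounding text, not a copy of the original equations.

```latex
\begin{align}
\hat{e} &= \operatorname*{arg\,max}_{e} P(e \mid f) \\
P(e \mid f) &= \frac{P(f \mid e)\, P(e)}{P(f)} \\
\hat{e} &= \operatorname*{arg\,max}_{e} P(f \mid e)\, P(e)
\end{align}
```

The three parts after applying Bayes' rule are the translation model P(f | e), the language model P(e), and the source probability P(f). Since P(f) does not depend on e, it can be dropped from the maximization, which gives the last line.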

Operation Sequence Model
The operation sequence model (OSM) combines the benefits of two state-of-the-art SMT frameworks, n-gram-based SMT and phrase-based SMT. The model is based on minimal translation units, generates source and target units simultaneously, and does not have spurious ambiguity (Durrani et al., 2011; Durrani et al., 2015). It is a bilingual language model that also integrates reordering information. OSM provides a better reordering mechanism that handles local and non-local reordering uniformly, with a strong coupling of lexical generation and reordering. This means that OSM can handle both short- and long-distance reordering. The operation types include generate, insert gap, jump back and jump forward; the gap and jump operations perform the actual reordering.

Hierarchical Phrase-based Statistical Machine Translation
The hierarchical phrase-based SMT approach is a model based on synchronous context-free grammar (Specia, 2011). The model is able to be learned from a corpus of unannotated parallel text. The advantage this technique offers over the phrase-based approach is that the hierarchical structure is able to represent the word re-ordering process. The reordering is represented explicitly rather than encoded into a lexicalized re-ordering model (commonly used in purely phrase-based approaches). This makes the approach particularly applicable to language pairs that require long-distance re-ordering during the translation process (Braune et al., 2012).
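How a synchronous rule encodes reordering explicitly can be shown with a toy Python sketch. The rule, the non-terminal names X1 and X2, and the upper-case target glosses are all hypothetical placeholders chosen for illustration; they are not taken from the trained grammar.

```python
from typing import Dict, List, Tuple

# A synchronous rule pairs a source side with a target side.
# Symbols listed in `fillers` are non-terminals; all others are terminals.
Rule = Tuple[List[str], List[str]]
Filler = Tuple[List[str], List[str]]  # (source tokens, target tokens)


def apply_rule(rule: Rule, fillers: Dict[str, Filler]) -> Tuple[List[str], List[str]]:
    """Expand one synchronous rule on both sides at once."""
    src_rhs, tgt_rhs = rule
    source: List[str] = []
    target: List[str] = []
    for sym in src_rhs:
        source.extend(fillers[sym][0] if sym in fillers else [sym])
    for sym in tgt_rhs:
        target.extend(fillers[sym][1] if sym in fillers else [sym])
    return source, target


# Hypothetical rule: X -> < X1 read X2 , X1 X2 READ >
# The verb moves to the end on the target side (verb-final order).
rule = (["X1", "read", "X2"], ["X1", "X2", "READ"])
fillers = {
    "X1": (["the", "student"], ["STUDENT"]),
    "X2": (["a", "book"], ["BOOK"]),
}

if __name__ == "__main__":
    src, tgt = apply_rule(rule, fillers)
    print(" ".join(src))
    print(" ".join(tgt))
```

The position of the verb on the target side is fixed by the rule itself, regardless of how long the sub-phrases filling X1 and X2 are, which is why this formalism suits long-distance reordering.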

Moses SMT System
We used the Moses toolkit (Koehn et al., 2007) to train the PBSMT, HPBSMT and OSM statistical machine translation systems. The word-segmented source language was aligned with the word-segmented target language using GIZA++ (Och and Ney, 2000). The alignment was symmetrized with the grow-diag-final-and heuristic (Koehn et al., 2003). The lexicalized reordering model was trained with the msd-bidirectional-fe option (Tillmann, 2004). We used KenLM (Heafield, 2011) to train a 5-gram language model with modified Kneser-Ney discounting (Chen and Goodman, 1996). Minimum error rate training (MERT) (Och, 2003) was used to tune the decoder parameters, and decoding was done with the Moses decoder (version 2.1.1). We used the default Moses settings for all experiments.
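For orientation, the PBSMT part of this setup could be assembled roughly as in the following Python sketch, which only builds the command lines and does not run them. The exact commands used for the submission are not given in the text, so all paths, file names and function names here are hypothetical, and the commands follow the usual Moses baseline recipe. The extra options needed for OSM and HPBSMT are omitted.

```python
from typing import List


def lm_command(order: int, corpus: str, arpa: str) -> str:
    """KenLM lmplz reads text from stdin and writes an ARPA model to stdout."""
    return f"lmplz -o {order} < {corpus} > {arpa}"


def train_command(scripts: str, corpus_stem: str, src: str, tgt: str,
                  lm_path: str, bin_dir: str) -> List[str]:
    """Arguments for train-model.perl with the settings named in the text."""
    return [
        f"{scripts}/training/train-model.perl",
        "-root-dir", "train",
        "-corpus", corpus_stem,              # expects corpus_stem.en / corpus_stem.my
        "-f", src, "-e", tgt,
        "-alignment", "grow-diag-final-and",
        "-reordering", "msd-bidirectional-fe",
        "-lm", f"0:5:{lm_path}:8",           # factor 0, order 5, type 8 = KenLM
        "-external-bin-dir", bin_dir,        # location of GIZA++ binaries
    ]


def tune_command(scripts: str, dev_src: str, dev_ref: str,
                 decoder: str, ini: str, mert_dir: str) -> List[str]:
    """Arguments for MERT tuning with mert-moses.pl."""
    return [f"{scripts}/training/mert-moses.pl",
            dev_src, dev_ref, decoder, ini, "--mertdir", mert_dir]


if __name__ == "__main__":
    print(lm_command(5, "train.my", "lm.my.arpa"))
    print(" ".join(train_command("mosesdecoder/scripts", "corpus/train",
                                 "en", "my", "/abs/path/lm.my.arpa", "tools")))
    print(" ".join(tune_command("mosesdecoder/scripts", "dev.en", "dev.my",
                                "mosesdecoder/bin/moses", "train/model/moses.ini",
                                "mosesdecoder/bin")))
```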

Evaluation
Our systems are evaluated on the ALT test set using several evaluation metrics: Bilingual Evaluation Understudy (BLEU) (Papineni et al., 2002), Rank-based Intuitive Bilingual Evaluation Score (RIBES) (Isozaki et al., 2010), and Adequacy-Fluency Metrics (AMFM) (Banchs et al., 2015). For the official evaluation of the English-to-Myanmar shared task, we uploaded our hypothesis files to the WAT2021 evaluation server, and sub-syllable segmentation (almost the same as the syllable units of the sylbreak toolkit) was used for the Myanmar language. For human evaluation, we submitted the "hybrid PBSMT with XML markup (inclusive)" and "hybrid OSM with XML markup (inclusive)" systems, which were trained only with the UCSY corpus.
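To make the BLEU metric concrete, the following Python sketch computes a simplified, sentence-level BLEU over already segmented tokens (for Myanmar, syllable units). It is an illustration only: official BLEU is computed at corpus level by the WAT evaluation server, and this version assumes a single reference, uniform weights over 1- to 4-grams and no smoothing.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of length n in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hyp: List[str], ref: List[str], max_n: int = 4) -> float:
    """Unsmoothed single-reference BLEU for one segmented sentence."""
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clip each hypothesis n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any empty precision gives zero
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty punishes hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    reference = "a b c d e".split()
    print(bleu("a b c d e".split(), reference))
    print(bleu("a b c d".split(), reference))
```

Because the score depends on n-gram overlap, the choice of segmentation unit (word, syllable or sub-syllable) directly affects the absolute BLEU values, so scores are only comparable under the same segmentation.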

Manual Extraction of Parallel Transliteration Words
When we studied the Myanmar language corpus provided by WAT2021, we found that many sentences are very long and contain spelling errors, unnatural translations (i.e. translated from English to Myanmar) and many transliterated words (i.e. word-by-word, phrase-by-phrase and compound word transliteration). We manually extracted English-Myanmar transliteration word and phrase pairs from the whole ALT corpus and prepared a dictionary of 14,225 unique words 1 . The main categories are Country/City Names, Demonyms, Personal Names, Month Names, General Nouns, Organization Names, Abbreviations, Units and English-to-English Translation Words (see Table 2).
1 https://github.com/ye-kyaw-thu/MTRSS/tree/master/WAT2021/en-my_transliteration-dict

Hybrid Translation
Generally, hybrid translation integrates the strengths of rationalist and empiricist methods. Hunsicker et al. (2012) described how machine learning approaches can be used to improve the phrase substitution component of a hybrid machine translation system. The essence of hybrid translation is to integrate the core of MT engines. Multiple-engine HMT integrates all available MT methods, exploiting their respective benefits, in order to improve the quality of the output (Xuan et al., 2012). Popular combinations include "rule-based machine translation vs SMT" and combinations of multiple machine translation engines, for example "SMT vs neural machine translation". Our work in this paper focuses on hybrid machine translation that combines an SMT engine with XML tags inserted (i.e. rules applied) around the transliteration words of each source sentence.
We used the Moses SMT toolkit, which supports the -xml-input flag to activate the XML markup feature with one of five options: exclusive, inclusive, constraint, ignore and pass-through. Refer to the manual page of the Moses toolkit 2 for a detailed explanation. Although we studied all options, we present the two that work well for English-Myanmar hybrid translation. The Moses decoder has an XML markup scheme that allows the specification of translations for parts of the sentence. In its simplest form, we can tell the decoder what to use to translate certain transliteration words or phrases in the source sentence. We wrote a Perl script that inserts XML markup into the source English sentences based on the manually extracted transliteration dictionary. The XML markup scheme for HPBSMT is different from that for PBSMT and OSM, because the syntactic annotation of the HPBSMT system also uses XML markup. We therefore used the --xml-brackets "{{ }}" option when decoding with the hybrid HPBSMT system.
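The markup insertion step can be sketched in Python as follows. This is not the Perl script used for the submission: the function name, the tag name np, the longest-match strategy and the two dictionary entries are assumptions made for illustration. The translation attribute follows the Moses XML markup format, and the configurable brackets reflect the assumption that --xml-brackets "{{ }}" swaps the angle brackets for double braces.

```python
from typing import Dict, List


def insert_xml_markup(sentence: str, dictionary: Dict[str, str],
                      tag: str = "np", open_br: str = "<",
                      close_br: str = ">", max_len: int = 4) -> str:
    """Wrap dictionary phrases in Moses-style XML markup (longest match first)."""
    tokens = sentence.split()
    out: List[str] = []
    i = 0
    while i < len(tokens):
        match_len = 0
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if " ".join(tokens[i:i + n]) in dictionary:
                match_len = n
                break
        if match_len:
            phrase = " ".join(tokens[i:i + match_len])
            out.append(f'{open_br}{tag} translation="{dictionary[phrase]}"{close_br}'
                       f'{phrase}{open_br}/{tag}{close_br}')
            i += match_len
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


# Illustrative entries only; not taken from the authors' dictionary.
demo_dict = {"Yangon": "ရန်ကုန်", "Singapore": "စင်္ကာပူ"}

if __name__ == "__main__":
    line = "He flew from Yangon to Singapore ."
    print(insert_xml_markup(line, demo_dict))                   # PBSMT / OSM style
    print(insert_xml_markup(line, demo_dict, open_br="{{", close_br="}}"))  # HPBSMT style
```

The marked-up file would then be passed to the decoder with, for example, -xml-input exclusive, which makes the decoder use only the supplied translation for the marked span, or -xml-input inclusive, which lets the supplied translation compete with phrase-table entries. In a real pipeline the target side would also need to match the segmentation of the training data.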

Results
Our systems are evaluated on the ALT test set and the results are shown in Table 3. Our observations from the results are as follows:

1. Hybrid translation of SMT with the XML markup scheme showed significant improvement for all three SMT approaches: PBSMT, OSM and HPBSMT.
2. Generally, the -xml-input exclusive option gives slightly higher scores than -xml-input inclusive.
3. The difference in baseline translation performance between training with and without the ALT corpus is about 5.0 BLEU points.

Conclusion
In this paper we presented UTYCC's participation in the WAT-2021 shared translation task. Our hybrid SMT submission ranked second in the English-to-Myanmar translation direction according to several evaluation scores, including the de facto standard BLEU. Our results also confirmed that the XML markup technique for transliteration words dramatically increases the translation performance up