ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality Estimation and Corrective Feedback

We introduce ChrEnTranslate, an online machine translation demonstration system for translation between English and an endangered language, Cherokee. It supports both statistical and neural translation models and provides quality estimation to inform users of reliability, two user feedback interfaces for experts and common users respectively, example inputs to collect human translations for monolingual data, word alignment visualization, and relevant terms from the Cherokee-English dictionary. The quantitative evaluation demonstrates that our backbone translation models achieve state-of-the-art translation performance and that our quality estimation correlates well with both BLEU and human judgment. By analyzing 216 pieces of expert feedback, we find that NMT is preferable because it copies less than SMT and that, in general, current models can translate fragments of the source sentence but make major mistakes. When we add these 216 expert-corrected parallel texts to the training set and retrain the models, we observe equal or slightly better performance, which indicates the potential of human-in-the-loop learning.


Introduction
Machine translation is a relatively mature natural language processing technique that has been deployed in real-world applications. For instance, Google Translate currently supports translation for over 100 languages. However, many low-resource languages lack the support of modern technologies, which might accelerate their vanishing. In this work, we focus on one of those languages, Cherokee. Cherokee is one of the most well-known Native American languages, yet it is identified as an "endangered" language by UNESCO. Cherokee nations have carried out language revitalization plans (Nation, 2001) and established language immersion programs and K-12 language curricula. Cherokee language courses are offered at some universities, including UNC Chapel Hill, the University of Oklahoma, Stanford University, and Western Carolina University. A few pedagogical books have been published (Holmes and Smith, 1976; Joyner, 2014; Feeling, 2018), and a digital archive of historical Cherokee language documents has been built (Bourns, 2019; Cushman, 2019). However, there are still very limited resources available on the Internet for Cherokee learners; meanwhile, first-language speakers and translators of Cherokee are mostly elders and would likely benefit from machine translation's assistance. This motivates us to develop the first online Cherokee-English machine translation demonstration system. Extending our previous work (Frey, 2020; Zhang et al., 2020), we develop the backbone statistical and neural machine translation systems (SMT and NMT) on a larger parallel dataset (17K) and obtain state-of-the-art Cherokee-English (ChrEn) and English-Cherokee (EnChr) translation performance.
Besides translation, our system also supports quality estimation (QE) for both SMT and NMT. QE is an important but often missing component of machine translation systems; it informs users of the reliability of machine-translated content (Specia et al., 2010). Since our models are trained on a very limited number of parallel sentences, the translations are expected to be poor in most cases when used by Internet users. Therefore, QE is essential for avoiding misuse and warning users of potential risks. The best-performing existing QE models are usually trained under supervision with quality ratings from professional translators (Fomicheva et al., 2020a). However, we cannot easily collect many human ratings for Cherokee because of its state of endangerment. Nonetheless, we test both supervised and unsupervised QE methods: (1) Supervised: we use BLEU (Papineni et al., 2002) as the quality rating proxy and train a BLEU regressor; (2) Unsupervised: following the uncertainty estimation literature (Lakshminarayanan et al., 2017), we use the ensemble model's output probability as the estimate of quality. Furthermore, to evaluate how well the QE models perform, we collect 200 human quality ratings (50 ratings each for SMT ChrEn, SMT EnChr, NMT ChrEn, and NMT EnChr). We show that our methods obtain moderate to strong correlations with human judgment (Pearson correlation coefficient γ ≥ 0.44).
One main purpose of our system is to allow human-in-the-loop learning. Since limited parallel texts are available, it is important to involve humans, especially experts, in the loop to give feedback and then improve the models accordingly. We develop two different user feedback interfaces, one for experts and one for common users (shown in Figure 2). We ask experts to provide quality ratings, correct the model-translated content, and leave open-ended comments; we allow common users to rate how helpful the translation is and to provide open-ended comments. As of submission, we have collected 216 pieces of feedback from 4 experts. We find that experts favor NMT over SMT because SMT excessively copies from source sentences; according to their ratings and comments, current translation systems can translate fragments of the source sentence but make major mistakes. Our naive human-in-the-loop learning, which adds these 216 expert-corrected parallel texts back to the training set, obtains equal or slightly better translation results. In addition, the expert comments shed light on where the models often make mistakes. Our demo also allows users to input text or choose an example input to translate (shown in Figure 1). These examples come from our monolingual databases, so that experts annotate them when they provide translation corrections. Finally, to support an intermediate interpretation of the model translations, we visualize the word alignment learned by the translation model and link to cherokeedictionary to provide relevant terms from the dictionary.
Our code is hosted at ChrEnTranslate and our online website is at chren.cs.unc.edu. To avoid misuse, common users need to accept agreement terms before using our service; access to the expert page chren.cs.unc.edu/expert requires authorization. We encourage fluent Cherokee speakers to contact us and contribute to our human-in-the-loop learning procedure. A demonstration video of our website is at YouTube. In summary, our demo features (1) the first online machine translation system for translation between Cherokee and English, which can assist both professional translators and Cherokee learners; and (2) documentation of human feedback, which, in the long run, expands the Cherokee data corpus and allows human-in-the-loop model development. Additionally, our website can be easily adapted to any other low-resource translation pair.

Translation Models
As shown in Figure 1, our system allows users to choose the statistical or neural model (SMT or NMT).
SMT is more effective for out-of-domain translation between Cherokee and English (Zhang et al., 2020). We implement a phrase-based SMT model via Moses (Koehn et al., 2007), where we train a 3-gram KenLM (Heafield et al., 2013) and learn word alignment with GIZA++ (Och and Ney, 2003). Model weights are tuned on a development set by MERT (Och, 2003).
NMT has better in-domain performance and can generate more fluent texts. We implement the global attentional model proposed by Luong et al. (2015). Detailed hyperparameters can be found in Section 3.1. Note that we do not use the Transformer because it empirically works worse (Zhang et al., 2020). We also find that the multilingual techniques we explored significantly improve in-domain performance only when using multilingual Bible texts, so we suspect that they bias the model toward Bible-style texts. Hence, we do not apply multilingual techniques and train the backbone models only on our Cherokee-English parallel texts. We use a 3-model ensemble as our final working model.

Quality Estimation
Supervised QE. The QE task (Specia et al., 2010) in the WMT campaign provides thousands of model-translated texts plus corresponding human ratings, which allow participants to train supervised QE models. Fomicheva et al. (2020a) show that supervised models work significantly better than unsupervised ones. Since we are unable to collect thousands of human ratings, we use BLEU (Papineni et al., 2002) as the quality rating proxy and train a BLEU regressor. Fomicheva et al. (2020a,b) define three sets of features. However, we need to compute features online, so some features (e.g., dropout features) that require multiple forward computations would greatly increase latency. We therefore use only features that do not add much latency. For SMT, we use: For NMT, we use: (a) output length; (b) log probability and length-normalized log probability; (c) probability and length-normalized probability; (d) attention entropy (Fomicheva et al., 2020a,b), where L_s is the length of the source text and α_ij is the attention weight between target token i and source token j.
Finally, we use XGBoost (Chen and Guestrin, 2016) as the BLEU regressor. As shown in Figure 1, we use 5 stars to show QE; therefore, we rescale the estimated quality to 0-5 by dividing the predicted BLEU score (0-100) by 20.
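The NMT features and the star rescaling described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the function names are hypothetical, the attention entropy is assumed to be the per-row entropy of the attention weights averaged over target tokens (the exact formula is not recoverable from the text here), and clipping the regressor output to 0-100 is an added assumption.

```python
import math


def nmt_qe_features(token_logprobs, attention):
    """Compute sentence-level QE features for one NMT output.

    token_logprobs: log-probability of each generated target token.
    attention: one row per target token; each row holds the attention
               weights over the L_s source tokens (each row sums to 1).
    """
    length = len(token_logprobs)
    log_prob = sum(token_logprobs)

    # Assumed definition: entropy of each attention row, averaged over
    # the target tokens. Zero weights contribute nothing to the entropy.
    entropy = 0.0
    for row in attention:
        entropy -= sum(a * math.log(a) for a in row if a > 0.0)
    entropy /= length

    return {
        "length": length,
        "log_prob": log_prob,
        "norm_log_prob": log_prob / length,
        "prob": math.exp(log_prob),
        "norm_prob": math.exp(log_prob / length),
        "attention_entropy": entropy,
    }


def bleu_to_stars(predicted_bleu):
    """Map a predicted BLEU score (0-100) onto the 0-5 star scale."""
    # Assumption: clip first, since a regressor can predict outside 0-100.
    clipped = min(max(predicted_bleu, 0.0), 100.0)
    return clipped / 20.0
```

A feature dictionary like this would be flattened into a vector and passed to the BLEU regressor; the regressor's prediction then goes through `bleu_to_stars` before display.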
Unsupervised QE. Even though supervised QE works better (Fomicheva et al., 2020a), we suspect that the advantage cannot generalize to open-domain scenarios unless we have a large amount of human-rated data to learn from. Hence, we also explore unsupervised QE methods. Unsupervised QE is closely related to uncertainty estimation: we can use how uncertain the model is to quantify how low the quality of its output is. Though it is intuitive to use the output probability as the model's confidence, Guo et al. (2017) point out that the output probability is often poorly calibrated, and they propose to recalibrate the probability on the development set. However, this method is designed for classification tasks and is not applicable to language generation. Gal and Ghahramani (2016) show that "dropout" can be a good uncertainty estimator, which inspired Fomicheva et al. (2020b) to propose the dropout features. However, the multiple forward passes are not desirable for an online system. Lakshminarayanan et al. (2017) demonstrate that the ensemble model's output probability can estimate the model's uncertainty better than dropout. We find that this method is simple yet effective for NMT. Note that we normalize the output probability by the sentence length. Similarly, we rescale the normalized probability (0-1) to 0-5 by multiplying it by 5.
Human Quality Rating. So far, our QE development and evaluation are all based on BLEU. To better evaluate QE performance, we collect 200 human ratings (all rated by Prof. Benjamin Frey 3), 50 ratings each for ChrEn SMT, EnChr SMT, ChrEn NMT, and EnChr NMT. We follow the direct assessment setup used by FLoRes (Guzmán et al., 2019), 4 and thus each translated sentence receives a 0-100 quality rating.
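As a rough sketch of the unsupervised estimator just described, the snippet below turns per-token probabilities into a 0-5 star score using the length-normalized probability, i.e., exp(log probability / length). How the ensemble combines its members is an assumption here (simple averaging of token probabilities), and both function names are hypothetical.

```python
import math


def ensemble_token_probs(per_model_probs):
    """Average each target token's probability over the ensemble members.

    per_model_probs: one list per model, each holding that model's
    probability for every token of the same output sentence.
    Assumption: members are combined by averaging raw probabilities.
    """
    n_models = len(per_model_probs)
    return [sum(column) / n_models for column in zip(*per_model_probs)]


def unsupervised_stars(token_probs):
    """Length-normalized sentence probability, rescaled from 0-1 to 0-5."""
    log_prob = sum(math.log(p) for p in token_probs)
    # exp(log_prob / length) is the geometric mean of the token probabilities.
    norm_prob = math.exp(log_prob / len(token_probs))
    return 5.0 * norm_prob
```

Because the score is a geometric mean, a single very unlikely token pulls the rating down noticeably, while sentence length alone does not.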
3 Benjamin Frey is a proficient second-language Cherokee speaker and a citizen of the Eastern Band of Cherokee Indians.
4 0-10 represents a translation that is completely incorrect and inaccurate; 11-29 represents a translation with a few correct keywords, but the overall meaning is different from the source; 30-50 represents a translation that contains translated fragments of the source string, with major mistakes; 51-69 represents a translation that is understandable and conveys the overall meaning of the source string but contains typos or grammatical errors; 70-90 represents a translation that closely preserves the semantics of the source sentence; the 90-100 range represents a perfect translation.

User Feedback & Example Inputs
Enlarging the parallel texts is a fundamental approach to improving the translation model's performance. Besides compiling existing translated texts, it is important to have translators newly translate English texts into Cherokee. Our system is designed not only to assist these translators but also to document their feedback and post-edited correct translations, so that the model can be improved using this feedback, i.e., human-in-the-loop learning. To achieve this goal, we design two kinds of user feedback interfaces. One is for common users, who can rate how helpful the translation is (on a 5-point Likert scale) and leave open-ended comments, as shown in Figure 2 (a). The other is for experts: authorized users can rate the quality, correct the translated text, and leave open-ended comments, as shown in Figure 2 (b). As of submission, we have collected 216 pieces of feedback from 4 experts; a detailed analysis can be found in Section 3.3. Meanwhile, as shown in Figure 1, besides inputting text, users can also choose an example input to translate. These examples come from our Cherokee or English monolingual databases. On the one hand, this is more convenient for users; on the other hand, whenever experts submit translation corrections for an example, we update its status to "labeled". Hence, we can gradually collect human translations for the monolingual data.

Other Features
As shown in Figure 3, to make model predictions more interpretable to users, we visualize the word alignment learned by the translation model. For SMT, we visualize the hard word-to-word alignment; for NMT, we visualize the soft attention map between source and target tokens. Additionally, to provide users with reliable and handy references from the dictionary, we link to cherokeedictionary. We use each of the source and target tokens as a query and list up to 15 relevant terms on our web page.

Implementation Details
Data. To train translation models, we use the 14K parallel data collected in our previous work (Zhang et al., 2020) plus 3K newly compiled parallel texts. We randomly sample 1K as our development set and treat the rest as the training set. The data is open-sourced at ChrEn/data/demo. To collect human quality ratings, we randomly sample 50 examples from the development set, and for each of them we collect 4 ratings, one each for ChrEn SMT, EnChr SMT, ChrEn NMT, and EnChr NMT.
Setup. We implement SMT models via Moses (Koehn et al., 2007). After training and tuning, we run it as a server process. We develop our NMT models via OpenNMT (Klein et al., 2017). For both the ChrEn and EnChr NMT models, we use a 2-layer LSTM encoder and decoder, general attention (Luong et al., 2015), hidden size=1024, label smoothing (Szegedy et al., 2016) of 0.2, and dynamic batching with 1000 tokens. The two directions differ as follows: the ChrEn NMT model uses dropout=0.3, a BPE tokenizer (Sennrich et al., 2016), and minimum word frequency=10; the EnChr NMT model uses dropout=0.5, the Moses tokenizer, and minimum word frequency=0. We train each NMT model with three random seeds (7, 77, 777) and use the 3-model ensemble as the final translation model, and we use beam search (beam size=5) to generate translations. We implement the supervised QE model with XGBoost. XGBoost has three important hyperparameters: max depth, eta, and the number of rounds. Tuned on the development set, we set them to (5, 0.1, 100) for ChrEn SMT, (3, 0.1, 80) for EnChr SMT, (4, 0.5, 40) for ChrEn NMT, and (5, 0.1, 40) for EnChr NMT. Lastly, the backend of our demonstration website is based on the Flask framework.
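For reference, the four XGBoost settings reported above can be collected into a small lookup, sketched below. The dictionary layout and the helper name `xgboost_train_args` are hypothetical, and the `reg:squarederror` objective is an assumption chosen because the QE regressor is described as minimizing squared error.

```python
# (max_depth, eta, number of boosting rounds) per direction and system,
# copied from the values reported in the text.
QE_HYPERPARAMS = {
    ("ChrEn", "SMT"): (5, 0.1, 100),
    ("EnChr", "SMT"): (3, 0.1, 80),
    ("ChrEn", "NMT"): (4, 0.5, 40),
    ("EnChr", "NMT"): (5, 0.1, 40),
}


def xgboost_train_args(direction, system):
    """Return (params, num_boost_round) for one supervised QE regressor."""
    max_depth, eta, rounds = QE_HYPERPARAMS[(direction, system)]
    params = {
        "max_depth": max_depth,
        "eta": eta,
        # Assumed objective: squared-error regression on BLEU.
        "objective": "reg:squarederror",
    }
    return params, rounds
```

With the xgboost package installed, the returned pair could be passed as `xgboost.train(params, dtrain, num_boost_round=rounds)`, where `dtrain` is a `DMatrix` of QE features with BLEU scores as labels.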
Metrics. We evaluate translation systems by BLEU (Papineni et al., 2002) calculated via SacreBLEU 7 (Post, 2018). Supervised QE models are developed by minimizing the mean squared error of predicting BLEU, but all QE models are evaluated by their correlation with BLEU on the development set and their correlation with human ratings. We use Pearson correlation (Benesty et al., 2009).
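The Pearson correlation used to evaluate the QE models can be computed as in the following minimal sketch. It is a plain standard-library implementation for illustration, not the evaluation script used in this work; the function name is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances (the 1/n factors cancel in the ratio).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Here `xs` would hold the QE scores and `ys` either the sentence-level BLEU scores or the human ratings for the same sentences.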

Quantitative Results
Translation. Table 2 shows the translation performance on our 1K development set, which is significantly better than the single-model in-domain translation performance reported in our previous work (Zhang et al., 2020) and thus achieves state-of-the-art results. In addition, the 3-model NMT ensemble further boosts the performance.
Quality Estimation. Table 1 shows the performance of the quality estimation models. In our experiments, we take every feature used in supervised QE as an unsupervised quality estimator. Here, we present only those that have a high correlation with BLEU and human ratings. It can be observed that, for SMT, supervised QE consistently works better, whereas, for NMT, unsupervised QE has a better correlation with human ratings. The obtained correlations with human judgment are moderate (γ ≥ 0.3) to strong (γ ≥ 0.5) (Cohen, 1988). Therefore, in our online demonstration system, we use the trained XGBoost model for the SMT model's QE and the length-normalized probability (i.e., Exp(LogProbability / length)) for the NMT model's QE.
7 BLEU+c.mixed+#.1+s.exp+tok.13a+v.1.5.0

Qualitative Results
Expert Feedback. As of submission, we have received 216 pieces of feedback from 4 experts (including Prof. Benjamin Frey and 3 other fluent Cherokee speakers). The results are shown in Table 3. We received much more feedback on NMT than on SMT because SMT excessively copies words from source sentences when translating open-domain texts, whereas NMT can mostly translate into the target language. On average, there are only 2.3 tokens in the input or translated Cherokee sentence; however, the average translation quality rating is only 2.45 out of 5, which is close to the average rating (43.8 out of 100) of the 200 human ratings we collected. Therefore, according to FLoRes's rating standard (Guzmán et al., 2019) (see footnote 4), our translation systems can translate fragments of the source string but make major mistakes in general. Besides ratings, we received 36 open-ended comments that shed light on common mistakes made by the models. The most frequent comments are: (1) the model gets some parts correct but others wrong, for example, "it got the subject but not the verb", "it got the stem right but used 3rd person prefix", "it missed the part about going to town, but got 'today' correct", etc.; (2) the model uses archaic English terms, like "thy", "thou", "speaketh", etc., because the majority of our training set is the Cherokee Old Testament and the Cherokee New Testament.
Table 3: Expert feedback. In each cell, the 3 numbers are the number of feedback items received / average quality rating / Pearson correlation coefficient between quality rating and quality estimation.
Human-in-the-Loop Learning. To improve models based on expert feedback, we propose to simply add the 216 expert-corrected parallel texts back to our training set and retrain the translation models. The new BLEU results on our development set are 17.3, 13.0, 20.0, and 14.8 for ChrEn SMT, EnChr SMT, ChrEn NMT (ensemble), and EnChr NMT (ensemble), respectively, which are equal to or slightly better than the results in Table 2. To tackle the archaic English issue, we simply replace archaic English terms ("thy", "thou") with modern English terms ("your", "you").
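The two steps in this paragraph, adding expert-corrected pairs to the training data and replacing archaic English terms, can be sketched as below. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the term map contains only the two pairs named in the text, and matching whole words case-insensitively is an assumption about how the replacement is applied.

```python
import re

# Only the two replacements named in the text; any others are unknown.
ARCHAIC_TO_MODERN = {"thy": "your", "thou": "you"}

# Whole-word, case-insensitive match so that words such as "thousand"
# are left untouched.
_ARCHAIC_PATTERN = re.compile(
    r"\b(" + "|".join(ARCHAIC_TO_MODERN) + r")\b", re.IGNORECASE
)


def modernize(sentence):
    """Replace archaic English terms while keeping initial capitalization."""
    def swap(match):
        word = match.group(0)
        modern = ARCHAIC_TO_MODERN[word.lower()]
        return modern.capitalize() if word[0].isupper() else modern
    return _ARCHAIC_PATTERN.sub(swap, sentence)


def augment_training_set(train_pairs, expert_pairs):
    """Return the training pairs plus the expert-corrected pairs.

    Both arguments are lists of (source, target) sentence tuples; the
    inputs are not modified.
    """
    return list(train_pairs) + list(expert_pairs)
```

The models would then be retrained from scratch on the augmented list, which is the naive form of human-in-the-loop learning described above.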

Conclusion & Future Work
In this work, we develop a Cherokee-English machine translation demonstration system that intends to demonstrate and support automatic translation between Cherokee and English, collect user feedback and translations, allow human-in-the-loop development, and eventually contribute to the revitalization of the endangered Cherokee language. Future work involves inviting more experts and common users to test and use our system and proposing more efficient and effective human-in-the-loop learning methods.

Broader Impact Statement
As shown in Section 3.3, the current translation models are still far from being reliable enough for practical use. Therefore, our system is only a demonstration or prototype of translation between Cherokee and English, and the model-translated texts are not supposed to be directly applied anywhere else without confirmation from professional translators. We stress this point in our agreement terms. Common users need to accept those terms before using our system; experts need to agree to those terms as well before being authorized. Lastly, we sincerely thank David Montgomery, Barnes Powell, and Tom Belt for voluntarily participating in our system test and providing their feedback.