Understanding Pre-Editing for Black-Box Neural Machine Translation

Pre-editing is the process of modifying the source text (ST) so that it can be translated by machine translation (MT) with better quality. Despite the unpredictability of black-box neural MT (NMT), pre-editing has been deployed in various practical MT use cases. Although many studies have demonstrated the effectiveness of pre-editing methods for particular settings, a deep understanding of what pre-editing is and how it works for black-box NMT is still lacking. To build such an understanding, we extensively investigated human pre-editing practices. We first implemented a protocol to incrementally record the minimum edits for each ST and collected 6,652 instances of pre-editing across three translation directions, two MT systems, and four text domains. We then analysed the instances from three perspectives: the characteristics of the pre-edited STs, the diversity of pre-editing operations, and the impact of the pre-editing operations on NMT outputs. Our findings include the following: (1) enhancing the explicitness of the meaning of an ST and its syntactic structure is more important for obtaining better translations than making the ST shorter and simpler, and (2) although the impact of pre-editing on NMT is generally unpredictable, there are some tendencies in how the NMT outputs change depending on the editing operation types.


Introduction
Recent advances in machine translation (MT) have greatly facilitated its practical use in various settings, from business documentation to personal communication. In many practical cases, MT systems are used as black boxes, and one well-tested approach to making use of a black-box MT system is pre-editing, i.e., modifying the source text (ST) to make it suitable for the intended MT system.
The effectiveness of pre-editing has so far been demonstrated in many studies (Pym, 1990; O'Brien and Roturier, 2007; Seretan et al., 2014). A study focusing on statistical MT (SMT) has also shown that more than 90% of source sentences can be rewritten into texts that can be machine-translated with sufficient quality (Miyata and Fujita, 2017), exhibiting the potential of the pre-editing approach.
However, the feasibility and potential of pre-editing for neural MT (NMT) have not been examined extensively. While efforts have recently been invested in implementing pre-editing strategies for black-box NMT settings, achieving improved MT quality (e.g., Hiraoka and Yamada, 2019; Mehta et al., 2020), the potential gains of pre-editing remain unexplored. Notably, the impact of pre-editing on black-box MT is unpredictable in nature. In particular, NMT models trained in an end-to-end manner can be sensitive to minor modifications of the ST (Cheng et al., 2019), which may affect the feasibility of pre-editing.
In short, while pre-editing has been implemented in practical MT use cases, what pre-editing is and how it works with black-box NMT systems remain open questions. To explore the possibility of pre-editing and its automation, in this study, we provide fine-grained analyses of human pre-editing practices and their impact on NMT. We systematically collected pre-editing instances under various conditions, i.e., translation directions, NMT systems, and text domains (§3). We then conducted in-depth analyses of the collected instances from the following three perspectives: the characteristics of the pre-edited STs (§4), the diversity of pre-editing operations (§5), and the impact of pre-editing operations on the NMT outputs (§6). The findings of these analyses provide useful insights into the effective and efficient implementation of pre-editing for the better use of black-box NMT systems, as well as into the robustness of current NMT systems when STs are manually perturbed.

Related Work
Pre-editing is the process of rewriting the source text (ST) to be translated in order to obtain better translations by MT. Though the scope of effective pre-editing operations depends on the downstream MT system and there is no deterministic relation between pre-editing operations and the quality of MT output, its effectiveness has been demonstrated for various translation directions, MT architectures, and text domains.
Manual pre-editing has long been implemented in combination with controlled languages (Pym, 1990; Reuther, 2003; Kuhn, 2014). In the period of rule-based MT (RBMT), pre-editing was considered a promising approach, since the behaviour of RBMT is relatively predictable and controllable. For example, O'Brien and Roturier (2007) examined the impact of English controlled language rules on two different MT engines, identifying highly effective rules. The pre-editing approach with controlled languages has also been tested for statistical MT (SMT) (Aikawa et al., 2007; Hartley et al., 2012; Seretan et al., 2014). These studies developed or utilised sets of controlled language rules for rewriting STs. While these rule sets are optimised for particular MT systems and differ from each other, we can observe some shared characteristics among them. In particular, rules that prohibit long sentences (e.g., of more than 25 words) are widely adopted in existing rule sets (O'Brien, 2003).
Automation of pre-editing is also an important research field in natural language processing. Semi-automatic tools such as controlled language checkers (Bernth and Gdaniec, 2001) and interactive rewriting assistants (Mirkin et al., 2013; Gulati et al., 2015) have been developed to facilitate manual pre-editing activities. Fully automatic pre-editing has also long been explored (e.g., Shirai et al., 1998; Mitamura and Nyberg, 2001; Yoshimi, 2001; Sun et al., 2010). In particular, many researchers have examined methods of reordering the source-side word order as pre-translation processing (Xia and McCord, 2004; Li et al., 2007; Hoshino et al., 2015). While the reordering approach has generally proven effective for SMT, its effectiveness for NMT is not obvious; negative effects have even been reported (Zhu, 2015; Du and Way, 2017). In recent years, techniques of automatic text simplification have been applied to improve NMT outputs (Štajner and Popović, 2018; Mehta et al., 2020). The underlying assumption of these studies is that simpler sentences are more machine-translatable.
Previous studies have investigated various pre-editing methods from different perspectives, focusing on different linguistic phenomena. Indeed, individual research efforts have led to improved MT results. However, what is crucially needed is a broad understanding of what pre-editing is and how it works. For example, Miyata and Fujita (2017) addressed this issue by collecting instances of bilingual pre-editing, i.e., pre-editing an ST while referring to its MT output, performed by human editors, and analysing them in detail. They demonstrated the maximum gain of pre-editing for an SMT system and provided a comprehensive typology of editing operations. Nevertheless, their study has two major limitations: (1) recent NMT was not examined, and (2) practical insights for better pre-editing practices were not sufficiently presented.
NMT models trained in an end-to-end manner behave very differently from SMT and RBMT, which, in turn, affects pre-editing practices. As reported in several studies, despite their rapid improvement, NMT models are still vulnerable to input noise (Belinkov and Bisk, 2018; Ebrahimi et al., 2018; Cheng et al., 2019; Niu et al., 2020). The pre-editing operations identified in previous studies are not necessarily effective for current black-box NMT systems. [1] For example, Marzouk and Hansen-Schirra (2019) adopted nine controlled language rules [2] and evaluated their impact on the MT output for German-to-English translation in the technical domain. The human evaluation results revealed that these rules improved the performance of the RBMT, SMT, and hybrid systems, but did not have positive effects on the NMT system. Hiraoka and Yamada (2019) demonstrated the effectiveness of the following three pre-editing rules in improving Japanese-to-English TED Talk subtitle translation using a black-box NMT system: (1) inserting punctuation, (2) making implied subjects and objects explicit, and (3) writing proper nouns in the target language (English). As these studies cover a limited range of linguistic phenomena, translation directions, and text domains, we are not in a position to draw decisive conclusions; we still do not know what types of pre-editing operations are possible and how NMT is affected when these operations are performed. To elicit the best pre-editing practices for NMT, as a starting point, we need to understand what is happening and what can be obtained in the process of pre-editing, while also re-examining previous findings and conventional methods.

[1] The ideal goal of the pre-editing approach is to adapt the STs to what the intended NMT system can properly translate, and ultimately, to what it has been trained on, i.e., its training data. For a black-box MT system, because we cannot directly refer to its training data, we must grasp its statistical characteristics indirectly through the MT output.

[2] The rules are as follows: (1) using straight quotes for interface texts, (2) avoiding light-verb constructions, (3) formulating conditions as if-sentences, (4) using unambiguous pronominal references, (5) avoiding participial constructions, (6) avoiding passives, (7) avoiding constructions with "sein" + "zu" + infinitive, (8) avoiding superfluous prefixes, and (9) avoiding omitting parts of words (Marzouk and Hansen-Schirra, 2019, p. 184).

Table 1: MT evaluation criterion adopted in Miyata and Fujita (2017). The "Perfect" and "Good" ratings are regarded as satisfactory quality.

5. Perfect: Information in the original text has been completely translated. There are no grammatical errors in the translation. The word choice and phrasing are natural even from a native speaker's point of view.
4. Good: The word choice and phrasing are slightly unnatural, but the information in the original text has been completely translated, and there are no grammatical errors in the translation.
3. Fair: There are some minor errors in the translation of less important information in the original text, but the meaning of the original text can be easily understood.
2. Acceptable: Important parts of the original text are omitted or incorrectly translated, but the core meaning of the original text can still be understood with some effort.
1. Incorrect/nonsense: The meaning of the original text is incomprehensible.

Protocol
To collect fine-grained manual pre-editing instances, we adopted the protocol formalised by Miyata and Fujita (2017), in which a human editor incrementally and minimally rewrites an ST on a trial-and-error basis with the aim of obtaining better MT output. An original ST (Org-ST) and its pre-edited versions are collectively called a unit. Using an online editing platform we developed, editors implement the protocol in the following steps:

Step 1. Evaluate the MT output of the current ST based on the 5-point scale criterion shown in Table 1. If the quality of the MT output is satisfactory (i.e., "Perfect" or "Good"), go to Step 4; otherwise, go to Step 2.
Step 2. Select one of the versions of the ST in the unit to be rewritten and go to Step 3. If none of the versions are likely to become satisfactory through further edits, go to Step 4.
Step 3. Minimally edit the ST while maintaining its meaning, referring to the corresponding MT output. The MT output for the edited ST is automatically generated and registered in the unit. Return to Step 1.
Step 4. Select one version of the ST that achieves the best MT quality (Best-ST) from among all the versions in the unit, and terminate the process for the unit.
The pre-editing instances in a unit collected through this protocol form a tree structure, as shown in Figure 1. We refer to the shortest path between the Org-ST and the Best-ST as the Best path. An important extension to the work of Miyata and Fujita (2017) is that our platform provides editors with a visualisation of the tree representation of the pre-editing history, which can facilitate the selection of ST versions in Step 2.
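The unit structure described above can be sketched in code. The following is our own illustrative reconstruction, not the paper's platform code; class and method names such as `Unit` and `add_version` are hypothetical. Since every version derives from exactly one earlier version, a unit forms a tree rooted at the Org-ST, and the shortest path from the Org-ST to the Best-ST is simply the Best-ST's ancestor chain.

```python
class Unit:
    """A unit: the Org-ST plus all pre-edited versions, stored as a tree.

    Each version records the parent version it was derived from, its text,
    its MT output, and its 5-point rating (5 = Perfect ... 1 = Incorrect).
    """

    def __init__(self, org_st, mt_output, rating):
        # Version 0 is the Org-ST, the root of the tree.
        self.versions = [{"parent": None, "st": org_st,
                          "mt": mt_output, "rating": rating}]

    def add_version(self, parent_id, st, mt_output, rating):
        """Register a minimally edited version derived from parent_id."""
        self.versions.append({"parent": parent_id, "st": st,
                              "mt": mt_output, "rating": rating})
        return len(self.versions) - 1  # id of the new version

    def best_path(self, best_id):
        """Version ids from the Org-ST (id 0) to the Best-ST.

        Because the versions form a tree rooted at the Org-ST, the
        shortest path is the ancestor chain of the Best-ST.
        """
        path = [best_id]
        while self.versions[path[-1]]["parent"] is not None:
            path.append(self.versions[path[-1]]["parent"])
        return list(reversed(path))
```

For instance, if versions 1 and 2 both derive from the Org-ST and version 3 derives from version 1, then `best_path(3)` returns `[0, 1, 3]`, mirroring the Best-path pairs illustrated in Figure 1.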

Implementation
To extensively investigate pre-editing phenomena, we prepared the following conditions:

Translation directions: We targeted Japanese-to-English (Ja-En), Japanese-to-Chinese (Ja-Zh), and Japanese-to-Korean (Ja-Ko) translations.

MT systems: As black-box MT systems, we adopted Google Translate and TexTra. Both are general-purpose NMT systems that are prevalently used for translating Japanese texts into other languages.

Text domains: We selected four text domains whose linguistic characteristics, such as mode and sentence length, differ from each other (see Table 2 for details).
We randomly selected 25 Japanese sentences for each of the four text domains, and used the resulting ST set of 100 sentences for all six combinations of translation direction and MT system. We assigned one editor to each translation direction. Each editor was asked to work with both MT systems, without being informed of the type of MT system used in the task. All editors were professional translators with sufficient writing skills in Japanese and experience in evaluating MT outputs. Before the commencement of the formal tasks, we trained the editors using example sentences so that they could become accustomed to the task and platform. The Ja-En task was implemented from November to December 2019; the Ja-Zh and Ja-Ko tasks were implemented from December 2019 to February 2020.

Statistics

Table 3 shows statistics for the pre-editing instances collected through the protocol described above. In general, the numbers of collected instances for the hospital and municipal domains were smaller than those for the bccwj and reuters domains, reflecting the influence of the sentence length of the Org-ST: the shorter the sentence, the fewer parts there are to edit.
A notable finding is that while only about 11% (69/600) of the MT outputs for the Org-STs were of satisfactory quality, 95% (571/600) of the MT outputs for the Best-STs were satisfactory. This means that almost all STs can be pre-edited into a form that leads to satisfactory MT output, demonstrating the potential of both pre-editing and NMT.
The number of collected instances can be interpreted as the editing effort required to obtain the Best-ST from the Org-ST. In most of the settings, the median number of collected instances per unit falls in the range of 5 to 10. It is thus necessary to optimise the pre-editing process for an intended MT system. The length of the Best path approximates the minimum editing effort needed to obtain the Best-ST. The total number of pre-editing instances in the Best paths was 2,443, while the total of all instances was 6,652. This implies that there is substantial room for reducing pre-editing effort.

Characteristics of Pre-Edited Sentences
To understand the differences between the original and pre-edited STs, in this section, we describe their general linguistic characteristics. Here, we compare the Org-STs with the Best-STs that achieved a satisfactory MT result, in order to elicit the features of machine-translatable STs.

Structural Characteristics
To quantify structural complexity, we used the following three indices:

(1) sentence length: the number of words per sentence; [6]
(2) attachment distance: the average distance over all attachment pairs of the Japanese base phrases in a sentence;
(3) dependency depth: the maximum distance from the root word in the dependency tree.

We used the Japanese tokeniser MeCab [7] to calculate (1) and the Japanese dependency parser JUMAN/KNP [8] to calculate (2) and (3). The first three blocks in Table 4 show the results for these indices. It is evident that, on all indices, the Org-ST exhibits the lowest scores. In other words, the length and surface complexity of the sentences generally increased through the pre-editing operations. This is a counter-intuitive finding in that most previous pre-editing practices have axiomatically assumed that shorter and less complex sentences are better for MT. We delve further into this in §5.

[6] If an ST instance includes multiple sentences, we averaged the scores.
[7] https://taku910.github.io/mecab/
[8] http://nlp.ist.i.kyoto-u.ac.jp/index.php?KNP
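The three indices can be made concrete with a small sketch. We assume a sentence is already parsed into base phrases with one head index each (as a parser such as KNP would provide, with -1 marking the root); the function names are ours, and we interpret attachment distance as the mean absolute index distance between each phrase and its head, which is one plausible reading of the definition above.

```python
def sentence_length(tokens):
    """Index (1): number of words in the tokenised sentence."""
    return len(tokens)

def attachment_distance(heads):
    """Index (2): average distance over all attachment pairs.

    heads[i] is the index of the phrase that phrase i attaches to;
    -1 marks the root, which has no attachment pair.
    """
    pairs = [(i, h) for i, h in enumerate(heads) if h != -1]
    return sum(abs(i - h) for i, h in pairs) / len(pairs)

def dependency_depth(heads):
    """Index (3): maximum number of attachment steps up to the root."""
    def depth(i):
        d = 0
        while heads[i] != -1:
            i = heads[i]
            d += 1
        return d
    return max(depth(i) for i in range(len(heads)))
```

For example, with `heads = [1, 2, 4, 4, -1]` (a five-phrase sentence whose last phrase is the root), the attachment distances are 1, 1, 2, and 1, giving an average of 1.25, and the deepest chain (phrase 0 → 1 → 2 → 4) gives a dependency depth of 3.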

Lexical Characteristics
The remaining two blocks in Table 4 present statistics for the lexical characteristics of the STs. The results for lexical diversity indicate that both the total number of word types and the Token/Type ratio increased from the Org-ST to the Best-ST under all conditions. This suggests that although the diversity of words increased slightly, the word distribution became peakier through pre-editing. We also calculated the word frequency rank with Wikipedia as the reference. To assess word frequency in relation to MT, it would be ideal to use the training data of each MT system, but such data are unavailable in black-box MT settings; we therefore used Wikipedia as a convenient proxy for general word frequency. Lower numbers indicate higher word frequencies in Wikipedia. The 50th and 75th percentile values in the datasets imply that pre-editing induced the avoidance of low-frequency words.
To further inspect the differences between the Org-ST and the Best-ST, we extracted the word types (a) that appeared only in the Org-ST and (b) that appeared only in the Best-ST. Figure 2 illustrates the rank distributions of (a) and (b) for each condition. It is clear that low-frequency words with a frequency rank of around 10,000 decreased in the Best-ST, while words with a frequency rank of around 2,000-4,000 increased. As Koehn and Knowles (2017) demonstrated, low-frequency words still pose major obstacles for NMT systems. Our results endorse this claim from a different perspective and can inform general strategies for word choice in the pre-editing task.
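The lexical indices above can be sketched as follows. This is our own illustrative code, not the paper's analysis scripts: we assume a reference frequency list (e.g., built from a Wikipedia dump) is available as a mapping from word to rank, and words missing from it are assigned a rank just beyond the worst known rank.

```python
from collections import Counter

def lexical_stats(tokens):
    """Total word types and Token/Type ratio for a token list."""
    counts = Counter(tokens)
    types = len(counts)
    return {"types": types, "token_type_ratio": len(tokens) / types}

def rank_percentile(tokens, rank_of, pct):
    """pct-th percentile of frequency ranks (lower rank = more frequent).

    rank_of maps a word to its frequency rank in the reference corpus;
    unknown words get one rank worse than the worst known rank.
    """
    unknown = max(rank_of.values()) + 1
    ranks = sorted(rank_of.get(w, unknown) for w in tokens)
    idx = min(len(ranks) - 1, int(round(pct / 100 * (len(ranks) - 1))))
    return ranks[idx]
```

Comparing the 50th and 75th percentile ranks of the Org-ST and Best-ST token lists then reproduces the kind of comparison reported in Table 4.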

Typology of Edit Operations
To understand the diversity of edit operations for pre-editing, we manually annotated the collected pre-editing instances in terms of linguistic operations. Given that the Best path contains effective editing operations for improved MT quality, we focused on the pairs of ST versions in the Best path (e.g., the pairs {1→3, 3→7, 7→8} in Figure 1). We randomly selected 10 units for each of the 24 combinations of translation direction, MT system, and text domain, resulting in a total of 961 pre-editing instances. We then excluded 26 instances that could be decomposed into multiple smaller edits [10] and classified the remaining 935 instances, each of which consists of a minimum edit of the ST, based on the typology proposed by Miyata and Fujita (2017). Through the classification, we refined the existing typology to consistently accommodate all the instances. Table 5 presents our typology of editing operations with the number of instances under the different conditions. The typology consists of 39 operation types under 6 major categories, which enables us to grasp the diversity and trends of pre-editing operations. Compared to structural editing, local modifications of words and phrases were frequently used in the Best path. The dominant type is C01 (Use of synonymous words): a content word is replaced by a synonymous word. This operation is important for achieving appropriate word choice in the MT output. C07 (Change of content), the second most dominant type, includes the addition of information that is inferred by human editors based on the intra-sentential context or even external knowledge. For example, the named entity 'Nemuro-sho' (Nemuro office) was changed into 'Nemuro-keisatsu-sho' (Nemuro police office) using knowledge of the entity. It might be challenging to automate such creative operations.

[10] Only 2.7% of the edits were not regarded as minimum, which demonstrates satisfactory adherence to our instructions, compared with the implementation by Miyata and Fujita (2017), in which 568 pre-editing instances were finally decomposed into 979 instances.
It is also notable that S01 (Sentence splitting) amounts to only 1.5% of all instances, which supports the observation in §4.1 that, in general, sentence length was not reduced and even increased through pre-editing. Among the 14 cases of this type, nine of the split sentences were 60-67 words in length. These results support the empirical observation by Koehn and Knowles (2017) that NMT systems still have difficulty translating sentences longer than 60 words, and suggest that sentence splitting may only be promising for such very long sentences.

Strategies for Effective Pre-Editing
Towards the effective exercise of pre-editing, we further analysed the pre-editing instances in terms of informational strategies, based on the notion of explicitation/implicitation acknowledged in translation studies (Vinay and Darbelnet, 1958; Chesterman, 1997; Murtisari, 2016). Following these studies, we broadly defined explicitation as an act of indicating what is implied in the text to clarify its meaning, and implicitation as the inverse act of explicitation. We classified all the instances analysed above, except for the E01 and E02 types, into three general strategies, namely, explicitation, implicitation, and (information) preservation. The right side of Table 5 shows the classification result. The total numbers of instances classified into each strategy were 329, 88, and 480, respectively. Not surprisingly, this indicates that explicitation is an essential strategy for effective pre-editing. We also grouped all 329 instances of explicitation into the following four subcategories. [11]

Information addition is the strategy of adding supplementary information, such as subjects, modality, and explanations, to clarify the content of the ST. For example, subjects were sometimes inserted, as they tend to be omitted in Japanese sentences. This strategy generally corresponds to the operation C07 (Change of content) described earlier.

Use of clear relation includes structural changes and the use of explicit connective markers to make the relations between words, phrases, and clauses more intelligible. For example, the relation between the subject and object can be clarified by using the nominative case marker 'ga' in Japanese.

Use of narrower sense is the strategy of replacing general words with more specific ones. For example, the verb 'dasu,' which has multiple meanings such as 'put,' 'take,' and 'send,' was replaced with the verb 'teishutsusuru,' which has a narrower range of meaning and was correctly translated as 'submit.'

Normalisation includes the use of authorised or standardised expressions, style, and notation. For example, an elliptic sentence-ending was completed to construct a normal structure.

These strategies can be used as concise pre-editing principles for human editors and can guide researchers in devising effective tools for pre-editing. We also emphasise that these general informational strategies are not specific to the Japanese language and could be applied to other languages.

[11] See Appendix A for details.

Impact of Pre-Editing on Neural Machine Translation
This section investigates how pre-editing operations affect the NMT output. As indicated in §2, NMT systems still lack robustness, and minor modifications of the input can drastically change the output. From the practical viewpoint of deploying pre-editing, predictability is an important property to pursue. Here, we examine the impact of minimum edits of the ST on the NMT output. To measure the amount of text editing, hereafter, we use the Translation Edit Rate (TER), which is calculated by dividing the number of edits (insertions, deletions, substitutions, and shifts) required to change a string into the reference string by the average number of reference words (Snover et al., 2006). For any consecutive pair of STs or their corresponding MT outputs, we used the chronologically later version as the reference. For word-level tokenisation, we used MeCab for Japanese, NLTK for English, jieba for Chinese, and KoNLPy for Korean.
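As a rough illustration of the metric, the following sketch computes a simplified word-level TER. It ignores the shift operation for brevity (the full metric of Snover et al. (2006) also allows block moves, which the paper uses): the number of insertions, deletions, and substitutions needed to turn the hypothesis into the reference, divided by the reference length.

```python
def simple_ter(hyp_tokens, ref_tokens):
    """Shift-free approximation of TER: word-level edit distance
    divided by the number of reference words."""
    n, m = len(hyp_tokens), len(ref_tokens)
    # Standard Levenshtein dynamic programme over words.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if hyp_tokens[i - 1] == ref_tokens[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution/match
    return dp[n][m] / m
```

For instance, `simple_ter("a b c".split(), "a x c".split())` is 1/3: one substitution over a three-word reference.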

Correlation of the Amount of Edits between the ST and MT
To grasp the general tendency, using all the collected pre-editing instances (see Table 3), we computed Pearson's r and Spearman's ρ between the amount of edits in the ST and that in the corresponding MT output, both measured by TER. As shown in Table 6, most coefficients are in the range of 0.15-0.25, suggesting a very weak correlation. This means that the change in NMT output is hardly predictable from the amount of edits in the ST. For example, the replacement of a single particle in the ST sometimes caused drastic changes of lexical choice in the MT output.
The Japanese-to-Korean translation is an exception; in particular, the correlation coefficients of the TER for the Google NMT system, i.e., 0.580 for Pearson's r and 0.574 for Spearman's ρ, indicate a moderate positive relationship between the changes in the ST and those in the MT output. This is partly attributable to the fact that the syntactic structures of Japanese and Korean, including word order and the usage of particles, are substantially close, which makes it relatively easy to build sufficiently accurate MT systems.
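The two correlation measures reported in Table 6 can be sketched directly, without an external statistics library; Spearman's ρ is simply Pearson's r computed over ranks (with average ranks for ties).

```python
def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson's r over the rank vectors."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            j = i
            # Group tied values and assign them their average rank.
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r
    return pearson_r(ranks(x), ranks(y))
```

Feeding in the per-instance pairs (TER of the ST edit, TER of the resulting MT change) for one condition yields one cell of Table 6.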

Impact of Editing Operations on NMT
Finally, using the pre-editing instances in the Best path analysed in §5, we further investigated the extent to which each type of minimum editing operation affects the MT output. At this stage, we focused on the 28 editing types that have at least 10 instances, considering that it is difficult to derive reliable insights from fewer data. Figure 3 presents the distribution of the degree of change in the MT output when an ST is pre-edited, measured by TER(MT(ST), MT(ST′)). Most of the structural edits (S01-S04) resulted in sizeable changes in the MT output. This is reasonable, since structural modifications in the ST tended to cause major changes in the MT output as well, leading to high TER. In contrast, many of the editing types that involve local modifications of functional words and orthographic notation (F01-F03, F05, F06, O01, O02) did not have major impacts on the MT results.
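The per-type analysis above amounts to grouping the TER(MT(ST), MT(ST′)) scores by editing operation type and summarising each group, keeping only types with enough instances. The sketch below is our own illustration of that aggregation step; the input format and field names are hypothetical.

```python
from collections import defaultdict
from statistics import median

def summarise_by_type(instances, min_count=10):
    """Group TER scores by operation type and summarise each group.

    instances: iterable of (operation_type, ter_score) pairs, one per
    Best-path editing instance. Types with fewer than min_count
    instances are dropped, mirroring the threshold used for Figure 3.
    """
    groups = defaultdict(list)
    for op_type, ter in instances:
        groups[op_type].append(ter)
    return {t: {"n": len(v), "median": median(v)}
            for t, v in groups.items() if len(v) >= min_count}
```

A plot of these per-type distributions (e.g., as box plots) would correspond to Figure 3.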

Conclusion and Outlook
Towards a better understanding of pre-editing for black-box NMT settings, in this study, we collected instances of manual pre-editing under various conditions and conducted in-depth analyses of these instances. We implemented a human-in-the-loop protocol to incrementally record minimum edits of STs for all combinations of three translation directions, two NMT systems, and four text domains, and obtained a total of 6,652 instances of manual pre-editing. Since more than 95% of the STs were successfully pre-edited into versions that led to satisfactory MT quality, our collected instances contain empirical, tacit human knowledge on the effective use of black-box NMT systems. We also investigated the collected data from three perspectives: the characteristics of the pre-edited STs, the diversity of pre-editing operations, and the impact of pre-editing operations on the NMT output. The notable findings can be summarised as follows:

• Contrary to the acknowledged practices of pre-editing, the operation of making source sentences shorter and simpler was not frequently observed. Rather, it is more important to make the content, syntactic relations, and word senses clearer and more explicit, even if the ST becomes longer.
• As indicated by recent studies, the NMT systems are still sensitive to minor edits in the ST, and are unpredictable in general. However, there are recognisable tendencies in the MT output according to the types of editing operations, such as the relatively small impact of phrase reordering on NMT.
In future work, we plan to explore the effective implementation of pre-editing. The findings of this study provide a broad overview of the range of pre-editing operations and their expected benefits, which enables us to find feasible pre-editing solutions in practical use cases of black-box NMT systems. To develop automatic pre-editing tools using a collection of pre-editing instances, we need to handle the data insufficiency issue in machine learning, filling the gap between the training data and targeted black-box MT systems.
Moreover, as our pre-editing instances contain a wide variety of perturbations in the ST, they can also be used to evaluate the robustness of MT systems, which can lead to advances in MT research. We aim to jointly improve the two wheels of translation technology: pre-editing and MT.

Table 7: Examples of the explicitation strategy subcategories (MT outputs shown in English).

Information addition:
  MT before: The twelfth is a holiday in Taiwan.
  MT after: The stock market was closed on the twelfth due to a holiday in Taiwan.

Use of clear relation:
  MT before: Withdraw from your registered credit card in about 10 days without visiting the hospital.
  MT after: Even if you do not visit the hospital, your credit card will be debited in about 10 days.

Use of narrower sense:
  MT before: Please collect urine and feces.
  MT after: Please submit urine and stool samples.

Normalisation:
  Pre-edited ST: 単位は億円です。(Tan'i wa oku en desu.)
  MT before: Figures are in billions of yen.
  MT after: The unit is 100 million yen.

Table 7 shows the statistics and examples of each subcategory of the explicitation strategy. The 329 pre-editing instances of the explicitation strategy can be further classified into four subcategories: information addition, use of clear relation, use of narrower sense, and normalisation. The example of information addition illustrates the insertion of the subject 'kabushiki shijo wa' ('the stock market'), which is implicit in the preceding ST. The example of use of clear relation shows that the relation between the subject and object can be clarified by using the nominative case marker 'ga' instead of the accusative 'o' and accordingly changing the voice of the main clause. As a result, the inappropriate imperative construction 'Withdraw from ...' in the MT output is changed to the correct passive construction 'will be debited.' In the example of use of narrower sense, the verb 'dashite,' which has multiple meanings such as 'put,' 'take,' and 'send,' was replaced with the verb 'teishutsushite,' which has a narrower range of meaning and was correctly translated as 'submit.' In the example of normalisation, the elliptic sentence-ending was completed with the normal structure '... desu.' This operation led not only to an improved sentence construction, but also to semantic correctness in the MT output ('billions' → '100 million').