Bring More Attention to Syntactic Symmetry for Automatic Postediting of High-Quality Machine Translations

Automatic postediting (APE) is an automated process to refine a given machine translation (MT). Recent findings show that existing APE systems are not good at handling high-quality MTs even for a language pair with abundant data resources, English–German: the better the given MT is, the harder it is to decide what parts to edit and how to fix these errors. One possible solution to this problem is to instill deeper knowledge about the target language into the model. Thus, we propose a linguistically motivated method of regularization that is expected to enhance APE models' understanding of the target language: a loss function that encourages symmetric self-attention on the given MT. Our analysis of experimental results demonstrates that the proposed method helps improve the state-of-the-art architecture's APE quality for high-quality MTs.


Introduction
Automatic postediting (APE) is an automated process to transform a given machine translation (MT) into a higher-quality text (Knight and Chander, 1994). Since 2015, the Conference on Machine Translation (WMT) has been hosting an annual shared task for APE, and most of the recently developed APE systems fall within the common framework of representation learning, using artificial neural networks to learn postediting patterns from the training data (Chatterjee et al., 2019, 2020; Akhbardeh et al., 2021).
Since 2018, all participants in the shared task have used Transformer-based models (Vaswani et al., 2017), but recent findings of the shared task (Chatterjee et al., 2019, 2020; Akhbardeh et al., 2021) cast doubt on whether Transformer-based APE models learn good generalizations because such models' APE quality appears to be significantly affected by external factors such as the source–target language pair, the qualitative characteristics of the provided data, and the quality of the given MT.
In particular, the high quality of the given MTs made APE especially difficult on the WMT 2019 test data set: the better the given MT is, the harder it is to decide what parts to edit and how to correct these errors (Chatterjee et al., 2019). Notably, this outcome is not a question of data scarcity because the language pair of this test data set, English–German, is provided with abundant training, validation, and test data. It is not a question of data heterogeneity either: the domain of this test data set, IT, shows a high degree of lexical repetition, which indicates that data sets in this domain use the same small set of lexical items (Chatterjee et al., 2019; Akhbardeh et al., 2021). Thus, it would be a question of modeling, and one possible solution is to implant deeper knowledge about the target language into the model.
To this end, we propose a new method of regularization that is expected to enhance Transformer-based APE models' understanding of German translations. Specifically, the proposed method is based on the Feldermodell (§2), an established linguistic model, which implies the need for proper treatment of the underlying symmetry of German sentence structures. To instill the idea of syntactic symmetry into Transformer-based APE models, we introduce a loss function that encourages symmetric self-attention on the given MT. Based on experimental results, we conduct a careful analysis and conclude that the proposed method has a positive effect on improving the state-of-the-art architecture's APE quality for high-quality MTs.

Background
For such analyses, special tree structures such as the Doppelbaum ('double tree'; Wöllstein, 2018) can be used: a bimodal tree (Fig. 1) in which the two CP, C, IP, I, and VP subtrees are 'symmetric' with respect to V. We assume that this structural symmetry is parameterized from the perspective not only of generative linguistics (Wöllstein, 2018; Höhle, 2019) but also of a parametric model $\mathcal{P} = \{P_\theta \mid \theta \in \Theta\}$, where $P_\theta$ and $\Theta$ are a probability distribution and the parameter space, respectively.
In particular, if we look at APE in terms of sequence-to-sequence learning, the probability distribution of the output sequence $(y_1, \cdots, y_{L_y})$ is obtained in the following manner:

$$P_\theta(y_1, \cdots, y_{L_y} \mid x, z) = \prod_{t=1}^{L_y} P_\theta(y_t \mid y_{<t}, u, v),$$

where $u$ and $v$ are the representations of a source text $(x_1, \cdots, x_{L_x})$ and its MT $(z_1, \cdots, z_{L_z})$, respectively. In this process, we presume that the syntactic symmetry of the target language affects the resulting distribution $P_\theta$; in other words, this syntactic symmetry would be an inductive bias (Mitchell, 1980) that should be handled properly.
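To make this factorization concrete, here is a toy sketch in PyTorch (all names and values are ours, for illustration only): the log-probability of an output sequence is the sum of per-step conditional log-probabilities produced by a decoder conditioned on $u$, $v$, and the previous outputs.

```python
import torch

# Toy illustration of P(y_1..y_Ly | x, z) = prod_t P(y_t | y_<t, u, v).
# `step_logits` stands in for decoder outputs already conditioned on y_<t, u, v.
L_y, vocab_size = 5, 100
step_logits = torch.randn(L_y, vocab_size)        # hypothetical decoder outputs
y = torch.randint(vocab_size, (L_y,))             # an output token sequence
log_p = torch.log_softmax(step_logits, dim=-1)    # per-step conditional distributions
seq_log_prob = log_p[torch.arange(L_y), y].sum()  # log P(y_1..y_Ly | u, v)
```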

Methodology
We implement a multi-encoder Transformer model consisting of the "Joint-Final" encoder and the "Parallel" decoder, which constitute a state-of-the-art architecture for APE (Shin et al., 2021), and conduct a controlled experiment without performance-centered tuning techniques. Specifically, the Joint-Final encoder consists of a source-text encoder and an MT encoder, which process the given source text and MT, respectively. On top of this baseline architecture, we propose a method that encourages the MT encoder to perform symmetric self-attention by minimizing the skewness of each self-attention layer's categorical distribution $p_{\text{self}}$.
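As a rough sketch of this setup (not the authors' implementation; module names, dimensions, and the residual layout are our assumptions), an MT-encoder layer can be written so that it also exposes its per-head self-attention distributions, which the regularizer below needs:

```python
import torch.nn as nn

class MTEncoderLayer(nn.Module):
    """A Transformer encoder layer that also returns its per-head
    self-attention distributions p_self (a sketch, not Shin et al.'s code)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        # average_attn_weights=False keeps one (L_z x L_z) distribution per head
        attn_out, p_self = self.self_attn(x, x, x, need_weights=True,
                                          average_attn_weights=False)
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.ffn(x))
        return x, p_self  # p_self: (batch, n_heads, L_z, L_z)
```

Stacking N such layers and collecting their p_self tensors yields the (B, N, H, L_z, L_z) attention volume used below.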
The measure of skewness used is the third standardized moment of each self-attention distribution,

$$(\mu_3)_i = \sum_{j=1}^{L_z} p_{\text{self}}(j \mid z_i) \left( \frac{j - \mu_i}{s_i} \right)^3,$$

for each token $z_i$ in the given MT $(z_1, \cdots, z_{L_z})$, where $\mu_i$ and $s_i$ are the mean and the standard deviation of the position index $j$ under $p_{\text{self}}(\cdot \mid z_i)$. Accordingly, the basic cross-entropy loss $\mathcal{L}_{\text{CE}}$ is regularized by $(\mu_3)_i$, resulting in a new loss function

$$\mathcal{L} = \alpha \mathcal{L}_{\text{CE}} + (1 - \alpha)\, \bar{\mu}_3, \qquad \alpha = \sigma(vW + \beta),$$

in the given minibatch, where $\bar{\mu}_3$ is the expected value of $(\mu_3)_{b,n,h,i}$. In addition, $(1 - \alpha)$ is an initial inducement to utilize $\bar{\mu}_3$. In the equations above, $\sigma$ is the sigmoid function, $v$ is the output of the final layer of the MT encoder, $W \in \mathbb{R}^{d_{\text{model}}}$ and $\beta \in \mathbb{R}$ are learned parameters, $B$ is the number of data examples, $N$ is the number of layers, and $H$ is the number of heads.
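A possible realization of this regularizer in PyTorch is sketched below. Three details are not spelled out above, so the following are our assumptions: positions are indexed $1, \ldots, L_z$; the expected value $\bar{\mu}_3$ is taken over absolute skewness values so that left- and right-skewed distributions are penalized alike; and $v$ is mean-pooled before the gating projection.

```python
import torch

def attention_skewness(p_self: torch.Tensor) -> torch.Tensor:
    """Third standardized moment (mu_3)_{b,n,h,i} of each self-attention
    distribution. p_self: (B, N, H, L_z, L_z); the last dim sums to 1."""
    L = p_self.size(-1)
    j = torch.arange(1, L + 1, dtype=p_self.dtype, device=p_self.device)
    mean = (p_self * j).sum(-1, keepdim=True)               # E[j] per token
    var = (p_self * (j - mean) ** 2).sum(-1, keepdim=True)
    std = var.clamp_min(1e-9).sqrt()                        # guard: zero variance
    return (p_self * ((j - mean) / std) ** 3).sum(-1)       # (B, N, H, L_z)

def regularized_loss(ce_loss, p_self, v, W, beta):
    """L = alpha * L_CE + (1 - alpha) * mu3_bar with alpha = sigmoid(vW + beta).
    Mean-pooling v over batch and positions is our assumption."""
    mu3_bar = attention_skewness(p_self).abs().mean()       # E|(mu_3)_{b,n,h,i}|
    alpha = torch.sigmoid(v.mean(dim=(0, 1)) @ W + beta)
    return alpha * ce_loss + (1 - alpha) * mu3_bar
```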

Experiment
In the conducted experiment, all hyperparameters are the same as those of Shin et al. (2021). Both the baseline model and the proposed model are trained using the training data sets and the validation data set listed in Table 1; we first train the models on eSCAPE-NMT mixed with the WMT 2019 training data at a ratio of 27:1 and then fine-tune them on the WMT 2019 training data alone.
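The 27:1 mixing can be realized in several ways; the sampling scheme is not specified above, so the following generator (all names are ours) is just one hypothetical option that yields roughly one WMT 2019 example per 27 eSCAPE-NMT examples in expectation:

```python
import itertools
import random

def mixed_examples(escape_nmt, wmt19_train, ratio=27, seed=0):
    """Yield examples from eSCAPE-NMT and the WMT 2019 training data
    at roughly ratio:1 (a hypothetical sampling scheme)."""
    rng = random.Random(seed)
    wmt_cycle = itertools.cycle(wmt19_train)
    for example in escape_nmt:
        yield example
        if rng.random() < 1.0 / ratio:
            yield next(wmt_cycle)
```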

Results and Analysis
The result of the automatic evaluation (Table 2) indicates that the proposed model improves on the baseline model in terms of BLEU (75.47) but not in terms of TER (16.54), which is unusual. Although the two measures have a strong correlation overall (Fig. 2), the proposed model has more outliers whose δBLEU (the value obtained by subtracting a given MT's BLEU from the postedited result's BLEU) is over 20 than the baseline model does; these outliers must be what brings the improvement in BLEU.
Thus, to further investigate this mismatch between TER improvements and BLEU improvements, we present an additional evaluation result: a relative frequency distribution of successes and failures in APE with regard to the TER difference between a given MT and each model's output (Table 3). The mentioned outliers correspond to PERF, the set of cases in which an APE system succeeds in perfectly correcting a given MT that has one or more errors, considering that the proposed model's PERF cases have a µ_δBLEU (the average of sentence-level BLEU improvements) of 27.21. We see that the proposed model has substantially more PERF cases (5.87%) than the baseline model (4.30%) and that most of those 'new' (1.57pp) cases are results of nontrivial postediting.

[Table 2: For TER (Snover et al., 2006) and BLEU (Papineni et al., 2002), the sentence-level standard deviations (σ) are also presented. In each column, the figure implying the best performance is in bold. The daggers denote that the proposed model's quality improvement on the given MTs is statistically significant (p ≤ 0.05); the asterisks denote that the proposed model's improvement on the baseline model is statistically significant (p ≤ 0.05).]

[Table 3: A relative frequency distribution of the following groups, obtained by comparing the TER of the given MT with that of the postedited result: the cases where an APE system injects errors into an already perfect MT (RUIN); both the given MT and the APE result are imperfect, but the former is better in terms of TER (DEGR); both are imperfect and have the same TER although they differ from each other (EVEN); both are imperfect, but the latter is better (IMPR); the given MT is imperfect whereas the APE result is perfect (PERF); both are perfect (ACCE); and the MT is imperfect, but the APE system does not change anything (NEGL). The F1 score is based on two criteria: whether the given MT is perfect (for recall) and whether the APE system edits the given MT (for precision). % is the proportion of the cases belonging to each category, µ_δBLEU is the average of sentence-level BLEU improvements, and σ_δBLEU is their standard deviation.]

In addition, in an actual example where only the proposed model corrects the given MT perfectly (Table 5), we observe that the proposed model successfully captures the close relation between the verb "enthält" ('contains') and its object, so that the correct form "Variablen" ('variables') is used. Considering that the adverb phrase "zum Beispiel" ('for example') in the given MT puts some distance between the verb and its object, it appears that the proposed model integrates information from a wider range of constituents than the baseline model; hence the conclusion that the proposed method instills the Feldermodell's idea of syntactic symmetry into Transformer-based APE models and enhances their understanding of German translations.
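The seven groups above can be reproduced mechanically from sentence-level TER and an exact-match check; a minimal sketch (function and argument names are ours; TER 0.0 is taken to mean a perfect output):

```python
def categorize(ter_mt: float, ter_ape: float, ape_equals_mt: bool) -> str:
    """Assign one APE output to a group of Table 3 by comparing the TER
    of the given MT with that of the postedited result."""
    if ter_mt == 0.0:
        return "ACCE" if ter_ape == 0.0 else "RUIN"
    if ape_equals_mt:
        return "NEGL"   # imperfect MT left untouched
    if ter_ape == 0.0:
        return "PERF"   # imperfect MT corrected perfectly
    if ter_ape < ter_mt:
        return "IMPR"
    if ter_ape > ter_mt:
        return "DEGR"
    return "EVEN"       # same TER but different strings
```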
Another example (Table 6) suggests that the increase in the proportion of ACCE (0.3pp), the set of cases in which an APE system adopts the given, already perfect MT, should be interpreted cautiously. Although professional translators tend to perform "only the necessary and sufficient corrections" (Bojar et al., 2015), the validity of test data created by professional translators, including the WMT 2019 test data set, can be disputed because other native speakers might argue that they could perform better postediting. For example, some people may consider the hyphenated compound "Zoom-Werkzeug" ('Zoom tool') more natural than the closed compound "Zoomwerkzeug" (Table 6).
However, considering the big differences in the proportion of NEGL (2.35pp), the set of cases in which an APE system neglects to postedit the given MT, and in the F1 score (Table 3), it appears that such a risk need not be considered in this analysis. Moreover, the proposed model has fewer RUIN cases (1.56%), in which it injects errors into the given, already perfect MT, than the baseline model (1.86%). Although the proposed model has more DEGR cases (7.33%), in which it degrades the given MT, than the baseline model (6.65%), the proposed model's quality degradation (µ_δBLEU = −11.72) is less severe than the baseline model's (µ_δBLEU = −13.51). Therefore, we conclude that the proposed method results in small but certain improvements.

Conclusion
To improve the APE quality for high-quality MTs, we propose a linguistically motivated method of regularization that enhances Transformer-based APE models' understanding of the target language: a loss function that encourages APE models to perform symmetric self-attention on a given MT. Experimental results suggest that the proposed method helps improve the state-of-the-art architecture's APE quality for high-quality MTs; we also present a relative frequency distribution of successes and failures in APE and observe increases in the proportion of perfect postediting and in the F1 score. This evaluation method could be useful for assessing the APE quality for high-quality MTs in general. Actual cases support that the proposed method successfully instills the idea of syntactic symmetry into APE models. Future research should consider different language pairs and different sets of hyperparameters.

Acknowledgements
This work was supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (Ministry of Science and ICT) (No. 2019-0-01906, Artificial Intelligence Graduate School Program (POSTECH)). We thank Richard Albrecht for assistance in the manual categorization of cases.

Limitations
First, neither the Feldermodell (Reis, 1980; Wöllstein, 2018; Höhle, 2019) nor the Doppelbaum (Wöllstein, 2018) has achieved complete consensus among linguists. Also, we limit our scope to the English–German language pair and the IT domain, using the WMT 2019 training, validation, and test data sets. A broader scope would not necessarily lend more confidence to the validity of the conducted experiments because there are hardly any standard setups for experimental research (Chatterjee et al., 2019; Akhbardeh et al., 2021). In addition, the conducted experiment should take into consideration the effect of randomness inherent in the process of training artificial neural networks; different techniques, different hyperparameters, and multiple runs of optimizers (Clark et al., 2011) may produce different results. However, as previous studies (Chatterjee et al., 2019, 2020; Akhbardeh et al., 2021), including the study on the baseline model (Shin et al., 2021), do not consider the effect of randomness, we do not investigate it further, considering that training multiple models (Appendix A) to obtain good estimates of TER and BLEU would be costly.