Yuto Kuroda
2024
Word-level Translation Quality Estimation Based on Optimal Transport
Yuto Kuroda | Atsushi Fujita | Tomoyuki Kajiwara
Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)
Word-level translation quality estimation (TQE) is the task of identifying erroneous words in a translation with respect to the source. State-of-the-art methods for TQE exploit large quantities of synthetic training data generated from bilingual parallel corpora, where pseudo-quality labels are determined by comparing two independent translations of the same source text, i.e., an output from a machine translation (MT) system and a reference translation in the parallel corpora. However, this process is solely reliant on the surface forms of words, so acceptable synonyms and interchangeable word orderings are regarded as erroneous. This can potentially mislead the pre-training of models. In this paper, we describe a method that integrates a degree of uncertainty into the labeling of words in synthetic training data for TQE. To estimate the extent to which each word in the MT output is likely to be correct or erroneous with respect to the reference translation, we propose to use the concept of optimal transport (OT), which exploits contextual word embeddings. Empirical experiments using a public benchmarking dataset for word-level TQE demonstrate that pre-training TQE models with the pseudo-quality labels determined by OT produces better predictions of the word-level quality labels determined by manual post-editing than pre-training with surface-based pseudo-quality labels.
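As a rough illustration of the idea in the abstract above, the following sketch computes soft per-word error scores from an entropy-regularized optimal transport plan between MT-output embeddings and reference embeddings. It is not the authors' implementation: the Sinkhorn solver, the uniform word masses, the cosine-distance cost, the "expected transport cost" score, and the function names `sinkhorn_plan` and `soft_error_scores` are all assumptions made for this example, and the embeddings are toy vectors standing in for contextual word embeddings.

```python
import numpy as np


def sinkhorn_plan(cost, a, b, reg=0.1, n_iters=200):
    """Entropy-regularized OT plan between marginals a and b (Sinkhorn iterations)."""
    kernel = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    # Rows of the plan sum to a; columns sum (approximately) to b.
    return u[:, None] * kernel * v[None, :]


def soft_error_scores(mt_emb, ref_emb, reg=0.1):
    """Hypothetical soft label: average cost of moving each MT word onto the reference.

    Assumptions (not taken from the paper): uniform mass per word,
    cost = 1 - cosine similarity, score = transported cost per unit of mass.
    """
    mt = mt_emb / np.linalg.norm(mt_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    cost = 1.0 - mt @ ref.T
    a = np.full(len(mt), 1.0 / len(mt))
    b = np.full(len(ref), 1.0 / len(ref))
    plan = sinkhorn_plan(cost, a, b, reg=reg)
    return (plan * cost).sum(axis=1) / plan.sum(axis=1)


# Toy example: the third MT "word" has no close counterpart in the reference.
mt_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
ref_embeddings = np.array([[1.0, 0.0], [0.0, 1.0]])
scores = soft_error_scores(mt_embeddings, ref_embeddings)
print(scores)
```

Under these assumptions, a word whose embedding lies close to some reference word receives a low score even if its surface form differs, whereas a word with no nearby counterpart receives a high score. The scores are continuous rather than binary, which is one way to express uncertainty in a pseudo-label.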
2022
Adversarial Training on Disentangling Meaning and Language Representations for Unsupervised Quality Estimation
Yuto Kuroda | Tomoyuki Kajiwara | Yuki Arase | Takashi Ninomiya
Proceedings of the 29th International Conference on Computational Linguistics
We propose a method to distill language-agnostic meaning embeddings from multilingual sentence encoders for unsupervised quality estimation of machine translation. Our method encourages the meaning embeddings to focus on semantics through adversarial training that attempts to eliminate language-specific information. Experimental results on unsupervised quality estimation show that our method achieves higher correlations with human evaluations.