Translation of Multifaceted Data without Re-Training of Machine Translation Systems

Hyeonseok Moon, Seungyoon Lee, SeongTae Hong, Seungjun Lee, Chanjun Park, Heuiseok Lim


Abstract
Translating major language resources to build minor language resources has become a widely used approach. Particularly when translating complex data points composed of multiple components, it is common to translate each component separately. However, we argue that this practice often overlooks the interrelation between components within the same data point. To address this limitation, we propose a novel MT pipeline that considers the intra-data relation in implementing MT for training data. In our MT pipeline, all the components in a data point are concatenated to form a single translation sequence and subsequently reconstructed into the data components after translation. We introduce a Catalyst Statement (CS) to enhance the intra-data relation, and an Indicator Token (IT) to assist the decomposition of a translated sequence into its respective data components. Through our approach, we achieve a considerable improvement in translation quality itself, along with improved effectiveness of the translated data as training data. Compared with the conventional approach that translates each data component separately, our method yields better training data that enhances the performance of the trained model by 2.690 points on the web page ranking (WPR) task and 0.845 points on the question generation (QG) task in the XGLUE benchmark.
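The pipeline described in the abstract can be illustrated with a minimal sketch: components of one data point are joined into a single sequence, prefixed with a Catalyst Statement and delimited by Indicator Tokens, translated in one MT call, and then split back into components. The CS wording, the bracketed IT format, and the `translate` callable below are illustrative assumptions rather than the paper's actual implementation, and the sketch assumes the MT system copies the Indicator Tokens through unchanged.

```python
from typing import Callable, Dict, List

# Assumed Catalyst Statement and Indicator Token format (not the paper's exact forms).
CATALYST = "The following fields describe one example."
def it(name: str) -> str:
    return f"[{name.upper()}]"

def compose(example: Dict[str, str], keys: List[str]) -> str:
    """Concatenate all components of a data point into one translation sequence."""
    parts = [CATALYST]
    for key in keys:
        parts.append(f"{it(key)} {example[key]}")
    return " ".join(parts)

def decompose(translated: str, keys: List[str]) -> Dict[str, str]:
    """Split the translated sequence back into components using the Indicator Tokens."""
    result = {}
    for i, key in enumerate(keys):
        start = translated.index(it(key)) + len(it(key))
        end = translated.index(it(keys[i + 1])) if i + 1 < len(keys) else len(translated)
        result[key] = translated[start:end].strip()
    return result

def translate_example(example: Dict[str, str], keys: List[str],
                      translate: Callable[[str], str]) -> Dict[str, str]:
    """Translate a multi-component data point with a single MT call."""
    return decompose(translate(compose(example, keys)), keys)
```

For instance, a QG-style data point {"context": ..., "question": ...} would be translated in one pass via translate_example(example, ["context", "question"], translate), rather than issuing one MT call per component.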
Anthology ID:
2024.findings-emnlp.114
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2088–2108
URL:
https://aclanthology.org/2024.findings-emnlp.114
DOI:
10.18653/v1/2024.findings-emnlp.114
Cite (ACL):
Hyeonseok Moon, Seungyoon Lee, SeongTae Hong, Seungjun Lee, Chanjun Park, and Heuiseok Lim. 2024. Translation of Multifaceted Data without Re-Training of Machine Translation Systems. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 2088–2108, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Translation of Multifaceted Data without Re-Training of Machine Translation Systems (Moon et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-emnlp.114.pdf