How Effective is Synthetic Data and Instruction Fine-tuning for Translation with Markup using LLMs?

Raj Dabre; Haiyue Song; Miriam Exel; Bianka Buschbeck; Johannes Eschbach-Dymanus; Hideki Tanaka

Abstract

Recent work has shown that prompting large language models (LLMs) is effective for translation with markup: LLMs can transfer markup tags while ensuring that the content, both inside and outside tag pairs, is correctly translated. However, this work makes the rather unrealistic assumption that high-quality parallel sentences with markup are available for prompting. Furthermore, the impact of instruction fine-tuning (IFT) in this setting is unknown. In this paper, we provide a study, the first of its kind, of the effectiveness of synthetically created markup data and IFT for translation with markup using LLMs. We focus on translation from English to five European languages: German, French, Dutch, Finnish and Russian. We show that, with either few-shot prompting or IFT, synthetic data created via word alignments leads to inferior markup transfer compared to original data with markup, but does not negatively impact translation quality. Furthermore, compared to few-shot prompting, IFT mainly impacts translation quality and has slightly better markup transfer capabilities. We hope our work will help practitioners make effective decisions on modeling choices for LLM-based translation with markup.

Anthology ID: 2024.amta-research.8
Volume: Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)
Month: September
Year: 2024
Address: Chicago, USA
Editors: Rebecca Knowles, Akiko Eriguchi, Shivali Goel
Venue: AMTA
Publisher: Association for Machine Translation in the Americas
Pages: 73–87
URL: https://aclanthology.org/2024.amta-research.8/
Cite (ACL): Raj Dabre, Haiyue Song, Miriam Exel, Bianka Buschbeck, Johannes Eschbach-Dymanus, and Hideki Tanaka. 2024. How Effective is Synthetic Data and Instruction Fine-tuning for Translation with Markup using LLMs? In Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), pages 73–87, Chicago, USA. Association for Machine Translation in the Americas.
Cite (Informal): How Effective is Synthetic Data and Instruction Fine-tuning for Translation with Markup using LLMs? (Dabre et al., AMTA 2024)
PDF: https://aclanthology.org/2024.amta-research.8.pdf
