Saúl Buján

2026

Improving Machine Translation of Idioms: A Spanish–Galician Parallel Dataset and Synthetic Augmentation Approach
Lúa Santamaría Montesinos | Saúl Buján | Daniel Bardanca | Pablo Gamallo
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1

Idiomatic expressions are a well-known challenge for neural machine translation, including both traditional sequence-to-sequence models and large language models (LLMs). This paper presents a systematic approach to improve idiom translation between Spanish and Galician. First, we build a high-quality parallel dataset of idioms manually aligned across both languages. Then, we automatically extend this dataset into a large synthetic parallel corpus using LLMs, following a strategy that prioritizes the most frequent idioms observed in authentic corpora. This augmented dataset is used to retrain a seq2seq translation model. We evaluate the resulting system and compare it both to the baseline model without idiom data and to state-of-the-art LLM-based translators such as SalamandraTA. Results show that the translation of idioms improves significantly after the training, alongside a slight boost in the model’s overall performance.

Co-authors

Venues

PROPOR1

Fix author