Optimizing Word Alignments with Better Subword Tokenization

Anh Khoa Ngo Ho, François Yvon


Abstract
Word alignments identify translational correspondences between words in a parallel sentence pair and are used, for example, to train statistical machine translation, learn bilingual dictionaries or to perform quality estimation. Subword tokenization has become a standard preprocessing step for a large number of applications, notably for state-of-the-art open vocabulary machine translation systems. In this paper, we thoroughly study how this preprocessing step interacts with the word alignment task and propose several tokenization strategies to obtain well-segmented parallel corpora. Using these new techniques, we were able to improve baseline word-based alignment models for six language pairs.
Anthology ID:
2021.mtsummit-research.21
Volume:
Proceedings of Machine Translation Summit XVIII: Research Track
Month:
August
Year:
2021
Address:
Virtual
Editors:
Kevin Duh, Francisco Guzmán
Venue:
MTSummit
Publisher:
Association for Machine Translation in the Americas
Pages:
256–269
URL:
https://aclanthology.org/2021.mtsummit-research.21
Cite (ACL):
Anh Khoa Ngo Ho and François Yvon. 2021. Optimizing Word Alignments with Better Subword Tokenization. In Proceedings of Machine Translation Summit XVIII: Research Track, pages 256–269, Virtual. Association for Machine Translation in the Americas.
Cite (Informal):
Optimizing Word Alignments with Better Subword Tokenization (Ngo Ho & Yvon, MTSummit 2021)
PDF:
https://aclanthology.org/2021.mtsummit-research.21.pdf