A3-108 Controlling Token Generation in Low Resource Machine Translation Systems

Saumitra Yadav, Ananya Mukherjee, Manish Shrivastava


Abstract
Translation for languages with limited resources poses a persistent challenge due to the scarcity of high-quality training data. To enhance translation accuracy, we explored controlled generation mechanisms, focusing on the importance of control tokens. During training, we encoded the target sentence length as a control token added to the source sentence, treating it as an additional source-side feature. We developed several NMT models using the transformer architecture and conducted experiments across 8 language directions (English ↔ Assamese, Manipuri, Khasi, and Mizo), exploring four variations of length-encoding mechanisms. Based on a comparative analysis against the baseline model, we submitted two systems for each language direction. We report our findings in this work.
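For illustration only, the following minimal Python sketch shows one way a target-length control token could be added to the source sentence during data preparation. The bucket size, the <len_N> token format, and the function names are assumptions made for this example; they do not reproduce the paper's four specific length-encoding variants.

def length_bucket(n_tokens, bucket_size=5):
    # Map a target-sentence length to a coarse bucket token,
    # e.g. lengths 1-5 -> <len_1>, lengths 6-10 -> <len_2>.
    return f"<len_{(n_tokens - 1) // bucket_size + 1}>"

def add_length_token(src, tgt):
    # Prefix the source sentence with the control token derived
    # from the (whitespace-tokenized) target length.
    return f"{length_bucket(len(tgt.split()))} {src}"

src = "this is a short example sentence"
tgt = "placeholder target sentence of seven tokens here"  # stands in for the reference translation
print(add_length_token(src, tgt))  # -> "<len_2> this is a short example sentence"

In this family of approaches, the same control token is typically supplied on the source side at inference time to steer the decoder toward outputs of the desired length.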
Anthology ID: 2024.wmt-1.61
Volume: Proceedings of the Ninth Conference on Machine Translation
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
Venue: WMT
Publisher: Association for Computational Linguistics
Pages: 728–734
URL: https://aclanthology.org/2024.wmt-1.61
Cite (ACL): Saumitra Yadav, Ananya Mukherjee, and Manish Shrivastava. 2024. A3-108 Controlling Token Generation in Low Resource Machine Translation Systems. In Proceedings of the Ninth Conference on Machine Translation, pages 728–734, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): A3-108 Controlling Token Generation in Low Resource Machine Translation Systems (Yadav et al., WMT 2024)
PDF: https://aclanthology.org/2024.wmt-1.61.pdf