How Should Markup Tags Be Translated?

Greg Hanneman, Georgiana Dinu


Abstract
The ability of machine translation (MT) models to correctly place markup is crucial to generating high-quality translations of formatted input. This paper compares two commonly used methods of representing markup tags and tests the ability of MT models to learn tag placement via training data augmentation. We study the interactions of tag representation, data augmentation size, tag complexity, and language pair to show the drawbacks and benefits of each method. We construct and release new test sets containing tagged data for three language pairs of varying difficulty.
Anthology ID:
2020.wmt-1.138
Volume:
Proceedings of the Fifth Conference on Machine Translation
Month:
November
Year:
2020
Address:
Online
Editors:
Loïc Barrault, Ondřej Bojar, Fethi Bougares, Rajen Chatterjee, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Alexander Fraser, Yvette Graham, Paco Guzman, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, André Martins, Makoto Morishita, Christof Monz, Masaaki Nagata, Toshiaki Nakazawa, Matteo Negri
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
1160–1173
URL:
https://aclanthology.org/2020.wmt-1.138
Cite (ACL):
Greg Hanneman and Georgiana Dinu. 2020. How Should Markup Tags Be Translated? In Proceedings of the Fifth Conference on Machine Translation, pages 1160–1173, Online. Association for Computational Linguistics.
Cite (Informal):
How Should Markup Tags Be Translated? (Hanneman & Dinu, WMT 2020)
PDF:
https://aclanthology.org/2020.wmt-1.138.pdf
Video:
https://slideslive.com/38939626
Code:
amazon-research/mt-markup-tags