TransIns: Document Translation with Markup Reinsertion

Jörg Steffen, Josef van Genabith


Abstract
Many use cases require MT to translate not just raw text but complex formatted documents (e.g. websites, slides, spreadsheets), with the translation preserving the formatting. This is challenging because markup can be nested, can apply to spans that are contiguous in the source but non-contiguous in the target, and so on. Here we present TransIns, a system for translating non-plain-text documents that builds on the Okapi framework and MT models trained with Marian NMT. We develop, implement and evaluate different strategies for reinserting markup into translated sentences using token alignments between source and target sentences. We propose a simple and effective strategy that compiles all markup down to single source tokens and transfers it to the aligned target tokens. A first evaluation shows that this strategy yields highly accurate markup in the translated documents, outperforming the markup quality found in documents translated with popular translation services. We release TransIns under the MIT License as open-source software at https://github.com/DFKI-MLT/TransIns. An online demonstrator is available at https://transins.dfki.de.
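
To make the token-level strategy described in the abstract concrete, the following is a minimal sketch and not the TransIns implementation. It assumes markup is given as tag pairs over source token ranges and that token alignments are already available as (source index, target index) pairs; here they are hand-written, whereas in practice they would come from the MT system. The function names `compile_to_tokens` and `reinsert_markup` are hypothetical, and unaligned source tokens simply lose their markup in this simplified version.

```python
from collections import defaultdict


def compile_to_tokens(spans):
    """Compile span-level markup down to individual source tokens.

    spans: list of (start, end_exclusive, open_tag, close_tag) over source
    token indices. Returns {source_index: [(open_tag, close_tag), ...]}.
    (Hypothetical helper for illustration only.)
    """
    per_token = defaultdict(list)
    for start, end, open_tag, close_tag in spans:
        for i in range(start, end):
            per_token[i].append((open_tag, close_tag))
    return per_token


def reinsert_markup(src_tags, alignments, tgt_tokens):
    """Transfer per-token markup to the aligned target tokens.

    alignments: iterable of (source_index, target_index) pairs.
    Each tagged target token is wrapped individually; adjacent identical
    tags are not merged in this sketch.
    """
    src_to_tgt = defaultdict(list)
    for s, t in alignments:
        src_to_tgt[s].append(t)

    tgt_tags = defaultdict(list)
    for s, pairs in src_tags.items():
        for t in src_to_tgt.get(s, []):  # unaligned tokens are skipped
            tgt_tags[t].extend(pairs)

    out = []
    for i, tok in enumerate(tgt_tokens):
        opens = "".join(o for o, _ in tgt_tags[i])
        closes = "".join(c for _, c in reversed(tgt_tags[i]))
        out.append(f"{opens}{tok}{closes}")
    return " ".join(out)


# Source: "the <b>red</b> house"  ->  target: "la casa roja"
src_tags = compile_to_tokens([(1, 2, "<b>", "</b>")])
alignments = [(0, 0), (1, 2), (2, 1)]  # "red" aligns with "roja"
result = reinsert_markup(src_tags, alignments, ["la", "casa", "roja"])
print(result)
```

Because each tag is attached to a single token, word reordering in the target does not break the markup: the bold tag follows "red" to "roja" even though its position in the sentence changes.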
Anthology ID:
2021.emnlp-demo.4
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Editors:
Heike Adel, Shuming Shi
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
28–34
URL:
https://aclanthology.org/2021.emnlp-demo.4
DOI:
10.18653/v1/2021.emnlp-demo.4
Cite (ACL):
Jörg Steffen and Josef van Genabith. 2021. TransIns: Document Translation with Markup Reinsertion. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 28–34, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
TransIns: Document Translation with Markup Reinsertion (Steffen & van Genabith, EMNLP 2021)
PDF:
https://aclanthology.org/2021.emnlp-demo.4.pdf
Video:
https://aclanthology.org/2021.emnlp-demo.4.mp4
Code:
dfki-mlt/transins
Data:
OPUS-MT