Training an NMT system for legal texts of a low-resource language variety South Tyrolean German - Italian

Antoni Oliver, Sergi Alvarez-Vidal, Egon Stemle, Elena Chiocchetti


Abstract
This paper illustrates the process of training and evaluating NMT systems for a language pair that includes a low-resource language variety. A parallel corpus of legal texts for Italian and South Tyrolean German has been compiled, with South Tyrolean German being the low-resource language variety. As the size of the compiled corpus is insufficient for training, we have combined it with several parallel corpora using data weighting at sentence level. We then performed an evaluation of each combination and of two popular commercial systems.
Anthology ID:
2024.eamt-1.47
Volume:
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)
Month:
June
Year:
2024
Address:
Sheffield, UK
Editors:
Carolina Scarton, Charlotte Prescott, Chris Bayliss, Chris Oakley, Joanna Wright, Stuart Wrigley, Xingyi Song, Edward Gow-Smith, Rachel Bawden, Víctor M Sánchez-Cartagena, Patrick Cadwell, Ekaterina Lapshinova-Koltunski, Vera Cabarrão, Konstantinos Chatzitheodorou, Mary Nurminen, Diptesh Kanojia, Helena Moniz
Venue:
EAMT
Publisher:
European Association for Machine Translation (EAMT)
Pages:
573–579
URL:
https://aclanthology.org/2024.eamt-1.47
Cite (ACL):
Antoni Oliver, Sergi Alvarez-Vidal, Egon Stemle, and Elena Chiocchetti. 2024. Training an NMT system for legal texts of a low-resource language variety South Tyrolean German - Italian. In Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1), pages 573–579, Sheffield, UK. European Association for Machine Translation (EAMT).
Cite (Informal):
Training an NMT system for legal texts of a low-resource language variety South Tyrolean German - Italian (Oliver et al., EAMT 2024)
PDF:
https://aclanthology.org/2024.eamt-1.47.pdf