Roberto Cavicchioli
2024
Learning from Wrong Predictions in Low-Resource Neural Machine Translation
Jia Cheng Hu | Roberto Cavicchioli | Giulia Berardinelli | Alessandro Capotondi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Resource scarcity in Neural Machine Translation is a challenging problem both in industry applications and in the support of less-spoken languages, which in the worst case are endangered and low-resource. Many data augmentation methods rely on additional linguistic sources and software tools, but these are often unavailable for less favoured languages. For this reason, we present USKI (Unaligned Sentences Keytokens pre-traIning), a pre-training strategy that leverages the relationships and similarities that exist between unaligned sentences. By doing so, we increase the dataset size of endangered and low-resource languages to the square of the initial quantity, matching the typical size of high-resource language datasets such as WMT14 En-Fr. Results show the effectiveness of our approach, with an average increase of 0.9 BLEU across the benchmarks while using only a small fraction of the entire unaligned corpus, suggesting the importance of the research topic and the potential of a currently under-utilized resource and under-explored approach.
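
The quadratic growth mentioned in the abstract can be illustrated with a minimal counting sketch. It assumes that "unaligned sentences" means every combination of a source sentence with a target sentence other than its own translation; the abstract does not spell this out, and the sketch does not reproduce the USKI pre-training objective or its key-token selection. The function name `all_cross_pairs` and the toy corpus are hypothetical.

```python
from itertools import product


def all_cross_pairs(source_sentences, target_sentences):
    """Return every (source, target) combination, aligned or not.

    Assumption: this Cartesian product is only a counting illustration,
    not the pairing procedure used by the authors.
    """
    return list(product(source_sentences, target_sentences))


# Toy parallel corpus of N = 3 aligned pairs (hypothetical placeholder data).
source = ["a b", "c d", "e f"]
target = ["x y", "z w", "u v"]

aligned = list(zip(source, target))        # N pairs
pairs = all_cross_pairs(source, target)    # N * N pairs
unaligned = [p for p in pairs if p not in aligned]  # N * N - N pairs

print(len(aligned), len(pairs), len(unaligned))
```

Under this assumption, a corpus of N aligned pairs yields N² candidate pairs, of which N² - N are unaligned, so even a small parallel corpus gives rise to a pool comparable in size to a high-resource dataset.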