Continued Pre-training on Sentence Analogies for Translation with Small Data

Liyan Wang, Haotong Wang, Yves Lepage


Abstract
This paper introduces Continued Pre-training on Analogies (CPoA) to equip pre-trained language models with analogical abilities, aiming to improve performance on low-resource translation without data augmentation. We continue training the models on sentence analogies retrieved from a translation corpus. Given the sparsity of analogies in corpora, especially in low-resource scenarios, we propose exploring approximate analogies between sentences: analogies that may not satisfy the formal criteria over entire sentences, but do over partial spans. When training the models, we introduce a weighting scalar that reflects the quality of each analogy and adjusts its influence, emphasizing closer analogies while diminishing the impact of more distant ones. We evaluate our approach on a low-resource translation task, German-Upper Sorbian. The results show that CPoA, using 10 times fewer instances, attains gains of +1.4 and +1.3 BLEU points over the original model in the two translation directions. The improvement is more pronounced when fewer parallel examples are available.
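The abstract does not specify how the quality weight enters training; as a minimal sketch of one way a per-instance quality scalar could scale a training loss (the function name, tensor names, and weighting scheme below are illustrative assumptions, not the authors' implementation):

import torch

def quality_weighted_loss(per_instance_loss: torch.Tensor,
                          analogy_quality: torch.Tensor) -> torch.Tensor:
    # per_instance_loss: shape (batch,), unreduced loss per analogy instance.
    # analogy_quality: shape (batch,), a scalar in (0, 1]; 1.0 for an exact
    # (formal) analogy, lower for approximate analogies that hold only over
    # parts of the sentences.
    weights = analogy_quality
    # Weighted mean: closer analogies dominate, far ones are down-weighted.
    return (weights * per_instance_loss).sum() / weights.sum()

# Example: three analogy instances with differing quality scores.
losses = torch.tensor([0.9, 1.2, 0.4])
quality = torch.tensor([1.0, 0.5, 0.8])
print(quality_weighted_loss(losses, quality))

Under this reading, an instance with quality 0.5 contributes half as much gradient signal as an exact analogy, which matches the stated goal of diminishing the impact of far analogies.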
Anthology ID:
2024.lrec-main.344
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
3890–3896
URL:
https://aclanthology.org/2024.lrec-main.344
Cite (ACL):
Liyan Wang, Haotong Wang, and Yves Lepage. 2024. Continued Pre-training on Sentence Analogies for Translation with Small Data. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 3890–3896, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Continued Pre-training on Sentence Analogies for Translation with Small Data (Wang et al., LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.344.pdf