Grammar-based Data Augmentation for Low-Resource Languages: The Case of Guarani-Spanish Neural Machine Translation

Agustín Lucas; Alexis Baladón; Victoria Pardiñas; Marvin Agüero-Torales; Santiago Góngora; Luis Chiruzzo

doi:10.18653/v1/2024.naacl-long.354

Abstract

One of the main problems low-resource languages face in NLP can be pictured as a vicious circle: data is needed to build and test tools, but the available text is scarce and there are no powerful tools to collect it. To break this circle for Guarani, we explore whether text automatically generated from a grammar can work as a Data Augmentation technique to boost the performance of Guarani-Spanish Machine Translation (MT) systems. After building a grammar-based system that generates Spanish text and syntactically transfers it to Guarani, we perform several experiments by pretraining models using this synthetic text. We find that the MT systems that are pretrained with synthetic text perform better, even outperforming previous baselines.

Anthology ID: 2024.naacl-long.354
Volume: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month: June
Year: 2024
Address: Mexico City, Mexico
Editors: Kevin Duh, Helena Gomez, Steven Bethard
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 6385–6397
URL: https://aclanthology.org/2024.naacl-long.354/
DOI: 10.18653/v1/2024.naacl-long.354
Cite (ACL): Agustín Lucas, Alexis Baladón, Victoria Pardiñas, Marvin Agüero-Torales, Santiago Góngora, and Luis Chiruzzo. 2024. Grammar-based Data Augmentation for Low-Resource Languages: The Case of Guarani-Spanish Neural Machine Translation. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6385–6397, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal): Grammar-based Data Augmentation for Low-Resource Languages: The Case of Guarani-Spanish Neural Machine Translation (Lucas et al., NAACL 2024)
PDF: https://aclanthology.org/2024.naacl-long.354.pdf
Video: https://aclanthology.org/2024.naacl-long.354.mp4
