Boosting Zero-Shot Crosslingual Performance using LLM-Based Augmentations with Effective Data Selection

Barah Fazili, Ashish Agrawal, Preethi Jyothi


Abstract
Large language models (LLMs) are very proficient text generators. We leverage this capability of LLMs to generate task-specific data via zero-shot prompting and promote cross-lingual transfer for low-resource target languages. Given task-specific data in a source language and a teacher model trained on this data, we propose using this teacher to label LLM generations and employ a set of simple data selection strategies that use the teacher's label probabilities. Our data selection strategies identify a representative subset of diverse generations that boosts zero-shot accuracies while being more efficient than using all the LLM generations (without any subset selection). We also highlight other important design choices that affect cross-lingual performance, such as the use of translations of source data and which labels are best to use for the LLM generations. We observe significant performance gains across sentiment analysis and natural language inference tasks (of up to 7.13 absolute points and 1.5 absolute points on average) across a number of target languages (Hindi, Marathi, Urdu, Swahili) and domains.
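As a rough illustration of the selection idea described in the abstract, the sketch below pseudo-labels LLM generations with a teacher classifier and keeps the most confidently labeled generations per class. The teacher checkpoint, the top-k-per-label strategy, and all function names here are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch (not the paper's exact method): label LLM generations with a
# teacher model and select a confident subset using its label probabilities.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical teacher: a multilingual encoder assumed to be fine-tuned on the
# source-language task data before being used here.
TEACHER = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(TEACHER)
model = AutoModelForSequenceClassification.from_pretrained(TEACHER)
model.eval()

def teacher_label_and_score(texts, batch_size=32):
    """Pseudo-label each generation and record the teacher's confidence."""
    labeled = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        enc = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            probs = torch.softmax(model(**enc).logits, dim=-1)
        conf, label = probs.max(dim=-1)
        labeled += list(zip(batch, label.tolist(), conf.tolist()))
    return labeled

def select_top_k(labeled, k_per_label=500):
    """One simple strategy: keep the k most confidently labeled generations per class."""
    by_label = {}
    for text, label, conf in labeled:
        by_label.setdefault(label, []).append((conf, text))
    subset = []
    for label, items in by_label.items():
        items.sort(reverse=True)
        subset += [(text, label) for _, text in items[:k_per_label]]
    return subset
```

The selected subset would then be mixed into the training data for the target-language model; the paper explores several such selection strategies and labeling choices, of which this top-k-by-confidence variant is only one plausible example.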
Anthology ID:
2024.findings-acl.795
Volume:
Findings of the Association for Computational Linguistics ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand and virtual meeting
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
13406–13422
URL:
https://aclanthology.org/2024.findings-acl.795
Cite (ACL):
Barah Fazili, Ashish Agrawal, and Preethi Jyothi. 2024. Boosting Zero-Shot Crosslingual Performance using LLM-Based Augmentations with Effective Data Selection. In Findings of the Association for Computational Linguistics ACL 2024, pages 13406–13422, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
Boosting Zero-Shot Crosslingual Performance using LLM-Based Augmentations with Effective Data Selection (Fazili et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-acl.795.pdf