On the Economics of Multilingual Few-shot Learning: Modeling the Cost-Performance Trade-offs of Machine Translated and Manual Data

Kabir Ahuja, Monojit Choudhury, Sandipan Dandapat


Abstract
Borrowing ideas from Production functions in micro-economics, in this paper we introduce a framework to systematically evaluate the performance and cost trade-offs between machine-translated and manually-created labelled data for task-specific fine-tuning of massively multilingual language models. We illustrate the effectiveness of our framework through a case-study on the TyDIQA-GoldP dataset. One of the interesting conclusion of the study is that if the cost of machine translation is greater than zero, the optimal performance at least cost is always achieved with at least some or only manually-created data. To our knowledge, this is the first attempt towards extending the concept of production functions to study data collection strategies for training multilingual models, and can serve as a valuable tool for other similar cost vs data trade-offs in NLP.
Anthology ID:
2022.naacl-main.98
Volume:
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
July
Year:
2022
Address:
Seattle, United States
Editors:
Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
Venue:
NAACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
1369–1384
Language:
URL:
https://aclanthology.org/2022.naacl-main.98
DOI:
10.18653/v1/2022.naacl-main.98
Bibkey:
Cite (ACL):
Kabir Ahuja, Monojit Choudhury, and Sandipan Dandapat. 2022. On the Economics of Multilingual Few-shot Learning: Modeling the Cost-Performance Trade-offs of Machine Translated and Manual Data. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1369–1384, Seattle, United States. Association for Computational Linguistics.
Cite (Informal):
On the Economics of Multilingual Few-shot Learning: Modeling the Cost-Performance Trade-offs of Machine Translated and Manual Data (Ahuja et al., NAACL 2022)
Copy Citation:
PDF:
https://aclanthology.org/2022.naacl-main.98.pdf
Video:
 https://aclanthology.org/2022.naacl-main.98.mp4
Data
TyDiQATyDiQA-GoldP