RAR: Retrieval-augmented retrieval for code generation in low resource languages

Avik Dutta, Mukul Singh, Gust Verbruggen, Sumit Gulwani, Vu Le


Abstract
Language models struggle to generate code for low-resource programming languages, which are underrepresented in training data. Examples and documentation are commonly used, separately, to improve code generation. We propose using both types of information together and present retrieval-augmented retrieval (RAR), a two-step method for selecting relevant examples and documentation. Experiments on three low-resource languages (Power Query M, OfficeScript, and Excel formulas) show that RAR outperforms independent example and grammar retrieval (+2.81–26.14%). Interestingly, we show that two-step retrieval also selects better examples and documentation when each is used independently.
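The abstract's core idea — a first retrieval pass over examples whose results then condition a second retrieval pass over documentation — can be illustrated with a minimal sketch. All names, the corpora, and the token-overlap similarity below are illustrative assumptions, not the paper's actual retrievers:

```python
# Hedged sketch of two-step retrieval-augmented retrieval (RAR).
# The similarity function and corpora are stand-ins; the paper's
# learned retrievers and datasets are not reproduced here.

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity as a stand-in for a learned retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def retrieve(query: str, corpus: list[str], k: int) -> list[str]:
    """Return the top-k corpus entries by similarity to the query."""
    return sorted(corpus, key=lambda d: jaccard(query, d), reverse=True)[:k]

def rar(query: str, examples: list[str], docs: list[str],
        k: int = 2) -> tuple[list[str], list[str]]:
    # Step 1: retrieve examples relevant to the natural-language query.
    top_examples = retrieve(query, examples, k)
    # Step 2: retrieve documentation with the query augmented by the
    # retrieved examples, so one retrieval step informs the other.
    augmented = " ".join([query] + top_examples)
    top_docs = retrieve(augmented, docs, k)
    return top_examples, top_docs
```

Step 2 could equally augment example retrieval with retrieved documentation; the sketch shows only one direction of the conditioning.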
Anthology ID:
2024.emnlp-main.1199
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
21506–21515
URL:
https://aclanthology.org/2024.emnlp-main.1199
Cite (ACL):
Avik Dutta, Mukul Singh, Gust Verbruggen, Sumit Gulwani, and Vu Le. 2024. RAR: Retrieval-augmented retrieval for code generation in low resource languages. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21506–21515, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
RAR: Retrieval-augmented retrieval for code generation in low resource languages (Dutta et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.1199.pdf
Software:
 2024.emnlp-main.1199.software.zip
Data:
 2024.emnlp-main.1199.data.zip