Contrastive Learning with Keyword-based Data Augmentation for Code Search and Code Question Answering

Shinwoo Park, Youngwook Kim, Yo-Sub Han


Abstract
Semantic code search aims to find code snippets, from a collection of candidates, that match a user query describing functionality. Recent work on code search proposes data augmentation of queries for contrastive learning. This data augmentation approach modifies random words in queries. When a user's web query for a code snippet is too brief, the important word that represents the search intent of the query could be undesirably modified. A code snippet has informative components, such as the function name and documentation, that describe its functionality. We propose to utilize these code components to identify important words and preserve them in the data augmentation step. We present KeyDAC (Keyword-based Data Augmentation for Contrastive learning), which identifies important words for code search from queries and code components based on term matching. KeyDAC augments query-code pairs while preserving keywords, and then leverages the generated training instances for contrastive learning. We use KeyDAC to fine-tune various pre-trained language models and evaluate performance on code search and code question answering via CoSQA and WebQueryTest. The experimental results confirm that KeyDAC substantially outperforms the previous state of the art and achieves new state-of-the-art results on both tasks.
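The keyword-preserving idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`split_identifier`, `find_keywords`, `augment_query`), the small stopword list, and the use of random word deletion as the query modification are all assumptions made for illustration. The sketch treats a query word as a keyword when it also appears in the function name or documentation, and never deletes such words during augmentation.

```python
import random
import re

# Hypothetical stopword list; real systems would use a fuller one.
STOPWORDS = {"a", "an", "the", "to", "from", "in", "of", "how", "and", "or"}


def split_identifier(name):
    """Split snake_case or camelCase identifiers into lowercase subtokens."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def find_keywords(query, function_name, docstring):
    """Return query words that also occur in the code components (term matching)."""
    code_terms = set(split_identifier(function_name))
    code_terms.update(w.lower() for w in re.findall(r"[A-Za-z]+", docstring))
    code_terms -= STOPWORDS
    return {w for w in query.lower().split() if w in code_terms}


def augment_query(query, keywords, rng, p_delete=0.3):
    """Randomly delete non-keyword words; keywords are always preserved.

    Random deletion is an assumed stand-in for the query modification step.
    """
    kept = []
    for word in query.split():
        if word.lower() in keywords or rng.random() >= p_delete:
            kept.append(word)
    return " ".join(kept)


query = "python how to read json file"
keywords = find_keywords(query, "read_json_file", "Load a JSON document from disk.")
print(sorted(keywords))  # ['file', 'json', 'read']
print(augment_query(query, keywords, random.Random(0)))
```

The augmented query and its paired code snippet could then serve as an additional positive pair for contrastive fine-tuning. Passing an explicit `random.Random` instance keeps the augmentation reproducible.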
Anthology ID:
2023.eacl-main.262
Volume:
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
Month:
May
Year:
2023
Address:
Dubrovnik, Croatia
Editors:
Andreas Vlachos, Isabelle Augenstein
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
3609–3619
URL:
https://aclanthology.org/2023.eacl-main.262
DOI:
10.18653/v1/2023.eacl-main.262
Cite (ACL):
Shinwoo Park, Youngwook Kim, and Yo-Sub Han. 2023. Contrastive Learning with Keyword-based Data Augmentation for Code Search and Code Question Answering. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 3609–3619, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal):
Contrastive Learning with Keyword-based Data Augmentation for Code Search and Code Question Answering (Park et al., EACL 2023)
PDF:
https://aclanthology.org/2023.eacl-main.262.pdf
Dataset:
 2023.eacl-main.262.dataset.zip
Video:
 https://aclanthology.org/2023.eacl-main.262.mp4