CoSQA: 20,000+ Web Queries for Code Search and Question Answering

Junjie Huang, Duyu Tang, Linjun Shou, Ming Gong, Ke Xu, Daxin Jiang, Ming Zhou, Nan Duan


Abstract
Finding codes given natural language query is beneficial to the productivity of software developers. Future progress towards better semantic matching between query and code requires richer supervised training resources. To remedy this, we introduce CoSQA dataset. It includes 20,604 labels for pairs of natural language queries and codes, each annotated by at least 3 human annotators. We further introduce a contrastive learning method dubbed CoCLR to enhance text-code matching, which works as a data augmenter to bring more artificially generated training instances. We show that, evaluated on CodeXGLUE with the same CodeBERT model, training on CoSQA improves the accuracy of code question answering by 5.1% and incorporating CoCLR brings a further improvement of 10.5%.
Anthology ID:
2021.acl-long.442
Volume:
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Month:
August
Year:
2021
Address:
Online
Venues:
ACL | IJCNLP
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
5690–5700
Language:
URL:
https://aclanthology.org/2021.acl-long.442
DOI:
10.18653/v1/2021.acl-long.442
Bibkey:
Cite (ACL):
Junjie Huang, Duyu Tang, Linjun Shou, Ming Gong, Ke Xu, Daxin Jiang, Ming Zhou, and Nan Duan. 2021. CoSQA: 20,000+ Web Queries for Code Search and Question Answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5690–5700, Online. Association for Computational Linguistics.
Cite (Informal):
CoSQA: 20,000+ Web Queries for Code Search and Question Answering (Huang et al., ACL 2021)
Copy Citation:
PDF:
https://aclanthology.org/2021.acl-long.442.pdf
Code
 Jun-jie-Huang/CoCLR
Data
CoSQACodeSearchNetCodeXGLUEStaQC