Better Language Models of Code through Self-Improvement

Hung To, Nghi Bui, Jin L.C. Guo, Tien Nguyen


Abstract
Pre-trained language models for code (PLMCs) have gained attention in recent research. These models are pre-trained on large-scale datasets using multi-modal objectives. However, fine-tuning them requires extensive supervision and is limited by the size of the provided dataset. We address this issue by proposing a data augmentation framework based on knowledge distillation. Our framework uses knowledge gained during the pre-training and fine-tuning stages to augment the training data, which is then used in the next fine-tuning step. We incorporate this framework into state-of-the-art language models such as CodeT5, CodeBERT, and UnixCoder. The results show that our framework significantly improves PLMCs’ performance on sequence-generation tasks, such as code summarization and code generation, in the CodeXGLUE benchmark.
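The abstract describes augmenting training data with knowledge from an already fine-tuned model. A minimal sketch of the general self-improvement (pseudo-labeling) idea is shown below; this is an illustration of the concept, not the authors' exact method, and the toy `model` dictionary, `generate` helper, and confidence threshold are all hypothetical stand-ins for a real fine-tuned PLMC and its decoding scores.

```python
def generate(model, source):
    """Toy stand-in for a fine-tuned model: returns (prediction, confidence).

    In practice this would be beam-search decoding from a model such as
    CodeT5, with confidence derived from the sequence log-probability.
    """
    return model.get(source, ("", 0.0))


def self_improve(model, labeled, unlabeled, threshold=0.5):
    """Augment the labeled set with the model's own confident predictions.

    labeled:   list of (source, target) gold pairs
    unlabeled: list of source sequences without targets
    Returns the augmented training set used for the next fine-tuning round.
    """
    augmented = list(labeled)
    for src in unlabeled:
        pred, conf = generate(model, src)
        if conf >= threshold:
            augmented.append((src, pred))  # keep only confident pseudo-labels
    return augmented
```

In a real pipeline, the augmented set would be fed back into another fine-tuning step, and the generate-filter-retrain loop could repeat; the confidence filter is one common heuristic for keeping pseudo-label noise down.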
Anthology ID:
2023.findings-acl.823
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
12994–13002
URL:
https://aclanthology.org/2023.findings-acl.823
DOI:
10.18653/v1/2023.findings-acl.823
Cite (ACL):
Hung To, Nghi Bui, Jin L.C. Guo, and Tien Nguyen. 2023. Better Language Models of Code through Self-Improvement. In Findings of the Association for Computational Linguistics: ACL 2023, pages 12994–13002, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Better Language Models of Code through Self-Improvement (To et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-acl.823.pdf
Video:
https://aclanthology.org/2023.findings-acl.823.mp4