Todd Mytkowicz
2022
CodeExp: Explanatory Code Document Generation
Haotian Cui | Chenglong Wang | Junjie Huang | Jeevana Priya Inala | Todd Mytkowicz | Bo Wang | Jianfeng Gao | Nan Duan
Findings of the Association for Computational Linguistics: EMNLP 2022
Developing models that can automatically generate detailed code explanations can greatly benefit software maintenance and programming education. However, existing code-to-text generation models often produce only high-level summaries of code that do not capture the implementation-level choices essential for these scenarios. To fill this gap, we propose the code explanation generation task. We first conducted a human study to identify the criteria for high-quality explanatory docstrings for code. Based on that, we collected and refined a large-scale code docstring corpus and formulated automatic evaluation metrics that best match human assessments. Finally, we present a multi-stage fine-tuning strategy and baseline models for the task. Our experiments show that (1) our refined training dataset lets models achieve better performance on the explanation generation task than unrefined data that is 15x larger, and (2) fine-tuned models can generate well-structured long docstrings comparable to human-written ones. We envision that our training dataset, human-evaluation protocol, recommended metrics, and fine-tuning strategy can boost future code explanation research. The code and annotated data are available at https://github.com/subercui/CodeExp.