Novel Natural Language Summarization of Program Code via Leveraging Multiple Input Representations

Fuxiang Chen, Mijung Kim, Jaegul Choo


Abstract
The lack of a description for a given piece of program code is a major hurdle for developers who are new to a code base and trying to understand it. To tackle this problem, previous work on code summarization, the task of automatically generating a description for a given piece of code, reported that an auxiliary learning model trained to produce API (Application Programming Interface) embeddings showed promising results when applied to a downstream code summarization model. However, different pieces of code with different summaries can share the same API sequence. If we train a model to generate summaries given an API sequence, the model will not be able to learn effectively. Nevertheless, we note that the API sequence can still be useful and has not been actively utilized. This work proposes a novel multi-task approach that simultaneously trains two similar tasks: 1) summarizing a given code (code to summary), and 2) summarizing a given API sequence (API sequence to summary). We propose a novel code-level encoder based on BERT that is capable of expressing the semantics of code and obtains representations for every line of code. Our work is the first code summarization work that utilizes a natural language-based contextual pre-trained language model in its encoder. We evaluate our approach on two common datasets (Java and Python) that have been widely used in previous studies. Our experimental results show that our multi-task approach improves over the baselines and achieves a new state of the art.
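The ambiguity the abstract describes, where different code with different summaries shares one API sequence, can be illustrated with a small sketch. It is not taken from the paper: the helper `api_sequence`, the two toy functions, and the choice to treat "names of called functions in source order" as the API sequence are all assumptions made for illustration only.

```python
import ast


def api_sequence(source: str) -> list:
    """Hypothetical helper: names of called functions/methods in source order.

    This is only a crude stand-in for an API sequence; the paper's own
    extraction procedure is not described in the abstract.
    """
    calls = [n for n in ast.walk(ast.parse(source)) if isinstance(n, ast.Call)]
    # ast.walk does not guarantee source order, so sort by position.
    calls.sort(key=lambda n: (n.lineno, n.col_offset))
    names = []
    for call in calls:
        func = call.func
        if isinstance(func, ast.Attribute):   # e.g. f.read()
            names.append(func.attr)
        elif isinstance(func, ast.Name):      # e.g. open(path)
            names.append(func.id)
    return names


# Toy snippet whose summary might be "check whether a file is non-empty".
HAS_CONTENT = '''
def has_content(path):
    f = open(path)
    data = f.read()
    f.close()
    return data != ""
'''

# Toy snippet whose summary might be "return the first character of a file".
FIRST_CHAR = '''
def first_char(path):
    f = open(path)
    data = f.read()
    f.close()
    return data[:1]
'''

seq_a = api_sequence(HAS_CONTENT)
seq_b = api_sequence(FIRST_CHAR)
print(seq_a)
print(seq_a == seq_b)
```

Both snippets yield the same call sequence even though they would be summarized differently, which is consistent with the abstract's point that the API sequence alone is an ambiguous training signal and is better used alongside the code itself.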
Anthology ID:
2021.findings-emnlp.214
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2021
Month:
November
Year:
2021
Address:
Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
Findings
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
2510–2520
URL:
https://aclanthology.org/2021.findings-emnlp.214
DOI:
10.18653/v1/2021.findings-emnlp.214
Cite (ACL):
Fuxiang Chen, Mijung Kim, and Jaegul Choo. 2021. Novel Natural Language Summarization of Program Code via Leveraging Multiple Input Representations. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2510–2520, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Novel Natural Language Summarization of Program Code via Leveraging Multiple Input Representations (Chen et al., Findings 2021)
PDF:
https://aclanthology.org/2021.findings-emnlp.214.pdf
Video:
https://aclanthology.org/2021.findings-emnlp.214.mp4