Tool for Constructing a Large-Scale Corpus of Code Comments and Other Source Code Annotations

Luna Peck, Susan Brown


Abstract
The sublanguage of source code annotations, the explanatory natural-language writing that accompanies programming source code, is little studied in linguistics. To facilitate research into this domain, we have developed a prototype program that extracts code comments and changelogs (i.e., commit messages) from public, open-source code repositories and automatically tokenizes and part-of-speech tags the extracted text. For Python repositories, the program can also automatically detect and discard “commented-out” source code so that it does not pollute the corpus, which suggests that such sanitization is likely feasible for other programming languages as well. With the current tool, we have produced a 6-million-word corpus of English-language comments extracted from code in three programming languages: Python, C, and C++.
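To make the extraction and sanitization steps described in the abstract concrete, here is a minimal sketch for Python source. It is not the authors' implementation: the function names `extract_comments` and `looks_like_code` are hypothetical, and the parse-based test for commented-out code is an assumed heuristic chosen for illustration, using only the standard-library `tokenize` and `ast` modules.

```python
import ast
import io
import tokenize
from typing import List


def extract_comments(source: str) -> List[str]:
    """Return the text of every '#' comment in Python source, marker removed."""
    comments = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comments.append(tok.string.lstrip("#").strip())
    return comments


def looks_like_code(comment: str) -> bool:
    """Assumed heuristic (not the paper's method): a comment counts as
    commented-out code if it parses as Python and is more than a bare
    name or constant."""
    try:
        tree = ast.parse(comment)
    except SyntaxError:
        return False  # ordinary prose rarely parses as Python
    for node in tree.body:
        if isinstance(node, ast.Expr) and isinstance(
            node.value, (ast.Name, ast.Constant)
        ):
            continue  # single words such as "TODO" parse as a bare name
        return True
    return False


sample = (
    "x = 1  # set the counter\n"
    "# y = compute(x)\n"
    "# TODO\n"
    "print(x)\n"
)
comments = extract_comments(sample)
# Keep only natural-language comments for the corpus.
prose = [c for c in comments if not looks_like_code(c)]
```

A heuristic like this will misclassify some short comments that happen to be valid expressions (for example, a dotted name), so a real pipeline would likely need additional filtering.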
Anthology ID:
2024.cawl-1.3
Volume:
Proceedings of the Second Workshop on Computation and Written Language (CAWL) @ LREC-COLING 2024
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Kyle Gorman, Emily Prud'hommeaux, Brian Roark, Richard Sproat
Venues:
CAWL | WS
SIG:
SIGWrit
Publisher:
ELRA and ICCL
Pages:
18–22
URL:
https://aclanthology.org/2024.cawl-1.3
Cite (ACL):
Luna Peck and Susan Brown. 2024. Tool for Constructing a Large-Scale Corpus of Code Comments and Other Source Code Annotations. In Proceedings of the Second Workshop on Computation and Written Language (CAWL) @ LREC-COLING 2024, pages 18–22, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Tool for Constructing a Large-Scale Corpus of Code Comments and Other Source Code Annotations (Peck & Brown, CAWL-WS 2024)
PDF:
https://aclanthology.org/2024.cawl-1.3.pdf