Luna Peck


2024

Tool for Constructing a Large-Scale Corpus of Code Comments and Other Source Code Annotations
Luna Peck | Susan Brown
Proceedings of the Second Workshop on Computation and Written Language (CAWL) @ LREC-COLING 2024

The sublanguage of source code annotations (explanatory natural-language writing that accompanies programming source code) is little studied in linguistics. To facilitate research in this domain, we have developed a prototype program that extracts code comments and changelogs (i.e., commit messages) from public, open-source code repositories and automatically tokenizes and part-of-speech tags the extracted text. For Python repositories, the program also automatically detects and discards “commented-out” source code so that it does not pollute the corpus, which suggests that such sanitization is likely feasible for other programming languages as well. With the current tool, we have produced a 6-million-word corpus of English-language comments extracted from code in three programming languages: Python, C, and C++.
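
As a rough illustration of the comment-extraction and "commented-out" code filtering described in the abstract, the following is a minimal sketch using only the Python standard library. It is not the authors' implementation: the function names `extract_comments` and `looks_like_code` and the `SAMPLE` input are hypothetical, and the parse-based test is an assumed heuristic, not the method the tool uses.

```python
import ast
import io
import tokenize


def extract_comments(source: str) -> list[str]:
    """Return the text of every comment in Python source, without the leading '#'."""
    comments = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comments.append(tok.string.lstrip("#").strip())
    return comments


def looks_like_code(text: str) -> bool:
    """Assumed heuristic: the text parses as Python and is more than a bare name or literal."""
    try:
        tree = ast.parse(text)
    except SyntaxError:
        return False
    if not tree.body:
        return False
    for node in tree.body:
        # A lone word ("TODO") or a literal parses as an expression but is probably prose.
        if isinstance(node, ast.Expr) and isinstance(node.value, (ast.Name, ast.Constant)):
            return False
    return True


# Hypothetical input mixing prose comments with commented-out statements.
SAMPLE = '''\
# Compute the running total
total = 0
# total = total + 1
for x in items:  # TODO
    total += x
    # print(total)
'''

for comment in extract_comments(SAMPLE):
    label = "code" if looks_like_code(comment) else "prose"
    print(f"{label}: {comment}")
```

A parse-based check of this kind will misclassify short prose that happens to be valid Python (for example, "not used" parses as an expression), so a practical filter would likely need additional signals.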