Transforming Code Understanding: Clustering-Based Retrieval for Improved Summarization in Domain-Specific Languages

Baban Gain; Dibyanayan Bandyopadhyay; Samrat Mukherjee; Aryan Sahoo; Saswati Dana; Palanivel Kodeswaran; Sayandeep Sen; Asif Ekbal; Dinesh Garg

Abstract

Extended Berkeley Packet Filter (eBPF), a domain-specific extension of the C language, has gained widespread acceptance in the cloud community for tasks including observability, security, and network acceleration. Because eBPF is recent and complex, practitioners and developers have a pressing need for natural language summaries of existing eBPF code (particularly open-source code), which would ease both understanding existing code and developing new code. However, because eBPF is a niche Domain-Specific Language (DSL), training data is scarce. In this paper, we investigate the effectiveness of LLMs for summarizing low-resource DSLs, in the context of eBPF code. Specifically, we propose a clustering-based technique for retrieving in-context examples that are semantically closer to the test example, together with a simple yet powerful prompt design that yields higher-quality code summaries. Experimental results show that our proposed retrieval approach for prompt generation improves eBPF code summarization by up to 12.9 BLEU points over other prompting techniques. The code is available at https://github.com/babangain/ebpf_summ.
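
The abstract describes the retrieval idea only at a high level, so the following is a minimal sketch of what clustering-based retrieval of in-context examples could look like, not the authors' implementation (that is in the linked repository). Everything here is an assumption made for illustration: the toy bag-of-tokens `embed` function (a stand-in for a real code embedding model), the cosine-based k-means, the helper names `retrieve` and `build_prompt`, the prompt wording, and the four-example pool.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"[A-Za-z_]+")

def embed(code, vocab):
    """Toy bag-of-tokens vector (assumption: stands in for a real embedding model)."""
    counts = Counter(TOKEN.findall(code))
    return [float(counts[t]) for t in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def nearest(vec, centroids):
    """Index of the centroid most similar to vec."""
    return max(range(len(centroids)), key=lambda c: cosine(vec, centroids[c]))

def kmeans(vectors, k, iters=10):
    """k-means-style clustering with cosine similarity and deterministic init."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iters):
        labels = [nearest(v, centroids) for v in vectors]
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, [nearest(v, centroids) for v in vectors]

def retrieve(query, pool, k=2, n=2):
    """Pick the cluster closest to the query, then its n most similar examples."""
    vocab = sorted({t for code, _ in pool for t in TOKEN.findall(code)})
    vectors = [embed(code, vocab) for code, _ in pool]
    centroids, labels = kmeans(vectors, k)
    q = embed(query, vocab)
    best = nearest(q, centroids)
    members = [i for i, l in enumerate(labels) if l == best]
    members.sort(key=lambda i: cosine(q, vectors[i]), reverse=True)
    return [pool[i] for i in members[:n]]

def build_prompt(query, examples):
    """Hypothetical few-shot prompt layout: retrieved pairs first, query last."""
    shots = "\n\n".join(f"Code:\n{c}\nSummary: {s}" for c, s in examples)
    return f"{shots}\n\nCode:\n{query}\nSummary:"

# Hypothetical pool of (eBPF code, human-written summary) pairs.
pool = [
    ("int xdp_drop(struct xdp_md *ctx) { return XDP_DROP; }",
     "Drops every incoming packet."),
    ("int trace_open(struct pt_regs *ctx) { bpf_trace_printk(msg); return 0; }",
     "Logs a message on each open call."),
    ("int xdp_pass(struct xdp_md *ctx) { return XDP_PASS; }",
     "Passes every incoming packet."),
    ("int trace_exec(struct pt_regs *ctx) { bpf_trace_printk(msg); return 0; }",
     "Logs a message on each exec call."),
]
query = "int xdp_tx(struct xdp_md *ctx) { return XDP_TX; }"
examples = retrieve(query, pool)
prompt = build_prompt(query, examples)
```

With this toy pool, the XDP query lands in the cluster holding the two XDP programs, so the prompt shows the LLM packet-handling examples rather than tracing examples. In practice the embedding model, the number of clusters, and the number of in-context examples would all need tuning on real data.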

Anthology ID: 2025.coling-industry.47
Original: 2025.coling-industry.47v1
Version 2: 2025.coling-industry.47v2
Volume: Proceedings of the 31st International Conference on Computational Linguistics: Industry Track
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert, Kareem Darwish, Apoorv Agarwal
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 546–560
URL: https://aclanthology.org/2025.coling-industry.47/
Cite (ACL): Baban Gain, Dibyanayan Bandyopadhyay, Samrat Mukherjee, Aryan Sahoo, Saswati Dana, Palanivel Kodeswaran, Sayandeep Sen, Asif Ekbal, and Dinesh Garg. 2025. Transforming Code Understanding: Clustering-Based Retrieval for Improved Summarization in Domain-Specific Languages. In Proceedings of the 31st International Conference on Computational Linguistics: Industry Track, pages 546–560, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): Transforming Code Understanding: Clustering-Based Retrieval for Improved Summarization in Domain-Specific Languages (Gain et al., COLING 2025)
PDF: https://aclanthology.org/2025.coling-industry.47.pdf
