Anh Dau


DocChecker: Bootstrapping Code Large Language Model for Detecting and Resolving Code-Comment Inconsistencies
Anh Dau | Jin L.C. Guo | Nghi Bui
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

Comments in source code are crucial for developers to understand the purpose of the code and to use it correctly. However, keeping comments aligned with the evolving codebase poses a significant challenge. Despite increasing interest in automated solutions to identify and rectify discrepancies between code and its associated comments, most existing methods rely heavily on heuristic rules. This paper introduces DocChecker, a language model-based framework adept at detecting inconsistencies between code and comments and capable of generating synthetic comments. This functionality allows DocChecker to identify and rectify cases where comments do not accurately represent the code they describe. The efficacy of DocChecker is demonstrated using the Just-In-Time and CodeXGlue datasets in various scenarios. Notably, DocChecker sets a new benchmark in the Inconsistency Code-Comment Detection (ICCD) task, achieving 72.3% accuracy, and scores 33.64 in BLEU-4 on the code summarization task. These results surpass other Large Language Models (LLMs), including GPT 3.5 and CodeLlama. DocChecker is accessible for use and evaluation, and a demonstration video is available.


The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation
Dung Nguyen | Le Nam | Anh Dau | Anh Nguyen | Khanh Nghiem | Jin Guo | Nghi Bui
Findings of the Association for Computational Linguistics: EMNLP 2023

We present The Vault, an open-source dataset of high-quality code-text pairs in multiple programming languages for training large language models to understand and generate code. We propose methods for thoroughly extracting samples that use both rules and deep learning to ensure that they contain high-quality pairs of code and text, resulting in a dataset of 43 million high-quality code-text pairs. We thoroughly evaluated this dataset and found that common code language models (such as CodeT5, CodeBERT, and CodeGen) trained on it outperform the same models trained on other datasets such as CodeSearchNet. These evaluations included common coding tasks such as code generation, code summarization, and code search. The Vault can be used by researchers and practitioners to train a wide range of large language models that understand code. Alternatively, researchers can use our data cleaning methods and scripts to improve their own datasets. We anticipate that using The Vault to train large language models will improve their ability to understand and generate code, propelling AI research and software development forward. We are releasing our source code and a framework to make it easier for others to replicate our results.