Anh Nguyen


2023

pdf bib
The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation
Dung Nguyen | Le Nam | Anh Dau | Anh Nguyen | Khanh Nghiem | Jin Guo | Nghi Bui
Findings of the Association for Computational Linguistics: EMNLP 2023

We present The Vault, an open-source dataset of high quality code-text pairs in multiple programming languages for training large language models to understand and generate code. We propose methods for thoroughly extracting samples that use both rules and deep learning to ensure that they contain high-quality pairs of code and text, resulting in a dataset of 43 million high-quality code-text pairs. We thoroughly evaluated this dataset and discovered that when used to train common code language models (such as CodeT5, CodeBERT, and CodeGen), it outperforms the same models train on other datasets such as CodeSearchNet. These evaluations included common coding tasks such as code generation, code summarization, and code search. The Vault can be used by researchers and practitioners to train a wide range of big language models that understand code. Alternatively, researchers can use our data cleaning methods and scripts to improve their own datasets. We anticipate that using The Vault to train large language models will improve their ability to understand and generate code, propelling AI research and software development forward. We are releasing our source code and a framework to make it easier for others to replicate our results.

pdf bib
Prompt-based Zero-shot Text Classification with Conceptual Knowledge
Yuqi Wang | Wei Wang | Qi Chen | Kaizhu Huang | Anh Nguyen | Suparna De
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

In recent years, pre-trained language models have garnered significant attention due to their effectiveness, which stems from the rich knowledge acquired during pre-training. To mitigate the inconsistency issues between pre-training tasks and downstream tasks and to facilitate the resolution of language-related issues, prompt-based approaches have been introduced, which are particularly useful in low-resource scenarios. However, existing approaches mostly rely on verbalizers to translate the predicted vocabulary to task-specific labels. The major limitations of this approach are the ignorance of potentially relevant domain-specific words and being biased by the pre-training data. To address these limitations, we propose a framework that incorporates conceptual knowledge for text classification in the extreme zero-shot setting. The framework includes prompt-based keyword extraction, weight assignment to each prompt keyword, and final representation estimation in the knowledge graph embedding space. We evaluated the method on four widely-used datasets for sentiment analysis and topic detection, demonstrating that it consistently outperforms recently-developed prompt-based approaches in the same experimental settings.

pdf bib
PiC: A Phrase-in-Context Dataset for Phrase Understanding and Semantic Search
Thang Pham | Seunghyun Yoon | Trung Bui | Anh Nguyen
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

While contextualized word embeddings have been a de-facto standard, learning contextualized phrase embeddings is less explored and being hindered by the lack of a human-annotated benchmark that tests machine understanding of phrase semantics given a context sentence or paragraph (instead of phrases alone). To fill this gap, we propose PiC—a dataset of ∼28K of noun phrases accompanied by their contextual Wikipedia pages and a suite of three tasks for training and evaluating phrase embeddings. Training on PiC improves ranking-models’ accuracy and remarkably pushes span selection (SS) models (i.e., predicting the start and end index of the target phrase) near human accuracy, which is 95% Exact Match (EM) on semantic search given a query phrase and a passage. Interestingly, we find evidence that such impressive performance is because the SS models learn to better capture the common meaning of a phrase regardless of its actual context. SotA models perform poorly in distinguishing two senses of the same phrase in two contexts (∼60% EM) and in estimating the similarity between two different phrases in the same context (∼70% EM).

2022

pdf bib
Double Trouble: How to not Explain a Text Classifier’s Decisions Using Counterfactuals Synthesized by Masked Language Models?
Thang Pham | Trung Bui | Long Mai | Anh Nguyen
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

A principle behind dozens of attribution methods is to take the prediction difference between before-and-after an input feature (here, a token) is removed as its attribution. A popular Input Marginalization (IM) method (Kim et al., 2020) uses BERT to replace a token, yielding more plausible counterfactuals. While Kim et al., 2020 reported that IM is effective, we find this conclusion not convincing as the Deletion-BERT metric used in their paper is biased towards IM. Importantly, this bias exists in Deletion-based metrics, including Insertion, Sufficiency, and Comprehensiveness. Furthermore, our rigorous evaluation using 6 metrics and 3 datasets finds no evidence that IM is better than a Leave-One-Out (LOO) baseline. We find two reasons why IM is not better than LOO: (1) deleting a single word from the input only marginally reduces a classifier’s accuracy; and (2) a highly predictable word is always given near-zero attribution, regardless of its true importance to the classifier. In contrast, making LIME samples more natural via BERT consistently improves LIME accuracy under several ROAR metrics.

2021

pdf bib
Out of Order: How important is the sequential order of words in a sentence in Natural Language Understanding tasks?
Thang Pham | Trung Bui | Long Mai | Anh Nguyen
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

pdf bib
A Vietnamese Dataset for Evaluating Machine Reading Comprehension
Kiet Nguyen | Vu Nguyen | Anh Nguyen | Ngan Nguyen
Proceedings of the 28th International Conference on Computational Linguistics

Over 97 million inhabitants speak Vietnamese as the native language in the world. However, there are few research studies on machine reading comprehension (MRC) in Vietnamese, the task of understanding a document or text, and answering questions related to it. Due to the lack of benchmark datasets for Vietnamese, we present the Vietnamese Question Answering Dataset (UIT-ViQuAD), a new dataset for the low-resource language as Vietnamese to evaluate MRC models. This dataset comprises over 23,000 human-generated question-answer pairs based on 5,109 passages of 174 Vietnamese articles from Wikipedia. In particular, we propose a new process of dataset creation for Vietnamese MRC. Our in-depth analyses illustrate that our dataset requires abilities beyond simple reasoning like word matching and demands complicate reasoning such as single-sentence and multiple-sentence inferences. Besides, we conduct experiments on state-of-the-art MRC methods in English and Chinese as the first experimental models on UIT-ViQuAD, which will be compared to further models. We also estimate human performances on the dataset and compare it to the experimental results of several powerful machine models. As a result, the substantial differences between humans and the best model performances on the dataset indicate that improvements can be explored on UIT-ViQuAD through future research. Our dataset is freely available to encourage the research community to overcome challenges in Vietnamese MRC.