Long Nguyen

Also published as: L. Nguyen

2025

When in Doubt, Ask First: A Unified Retrieval Agent-Based System for Ambiguous and Unanswerable Question Answering
Long Nguyen | Quynh Vo | Hung Luu | Tho Quan
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Large Language Models (LLMs) have shown strong capabilities in Question Answering (QA), but their effectiveness in high-stakes, closed-domain settings is often constrained by hallucinations and limited handling of vague or underspecified queries. These challenges are especially pronounced in Vietnamese, a low-resource language with complex syntax and strong contextual dependence, where user questions are often short, informal, and ambiguous. We introduce the Unified Retrieval Agent-Based System (URASys), a QA framework that combines agent-based reasoning with dual retrieval under the Just Enough principle to address standard, ambiguous, and unanswerable questions in a unified manner. URASys performs lightweight query decomposition and integrates document retrieval with a question–answer layer via a two-phase indexing pipeline, engaging in interactive clarification when intent is uncertain and explicitly signaling unanswerable cases to avoid hallucination. We evaluate URASys on Vietnamese and English QA benchmarks spanning single-hop, multi-hop, and real-world academic advising tasks, and release new dual-language ambiguous subsets for benchmarking interactive clarification. Results show that URASys outperforms strong retrieval-based baselines in factual accuracy, improves unanswerable handling, and achieves statistically significant gains in human evaluations for clarity and trustworthiness.

Enhancing Named Entity Translation from Classical Chinese to Vietnamese in Traditional Vietnamese Medicine Domain: A Hybrid Masking and Dictionary-Augmented Approach
Nhu Pham | Uyen Nguyen | Long Nguyen | Dien Dinh
Proceedings of the 18th International Natural Language Generation Conference

Vietnam’s traditional medical texts were historically written in Classical Chinese using Sino-Vietnamese pronunciations. As the Vietnamese language transitioned to a Latin-based national script and interest in integrating traditional medicine with modern healthcare grows, accurate translation of these texts has become increasingly important. However, the diversity of terminology and the complexity of translating medical entities into modern contexts pose significant challenges. To address this, we propose a method that fine-tunes large language models (LLMs) using augmented data and a Hybrid Entity Masking and Replacement (HEMR) strategy to improve named entity translation. We also introduce a parallel named entity translation dataset specifically curated for traditional Vietnamese medicine. Our evaluation across multiple LLMs shows that the proposed approach achieves a translation accuracy of 71.91%, demonstrating its effectiveness. These results underscore the importance of incorporating named entity awareness into translation systems, particularly in low-resource and domain-specific settings like traditional Vietnamese medicine.

Serving the Underserved: Leveraging BARTBahnar Language Model for Bahnaric-Vietnamese Translation
Long Nguyen | Tran Le | Huong Nguyen | Quynh Vo | Phong Nguyen | Tho Quan
Proceedings of the 1st Workshop on Language Models for Underserved Communities (LM4UC 2025)

The Bahnar people, one of Vietnam’s ethnic minorities, represent an underserved community with limited access to modern technologies. Developing an effective Bahnaric-Vietnamese translation system is essential for fostering linguistic exchange, preserving cultural heritage, and empowering local communities by bridging communication barriers. With advancements in Artificial Intelligence (AI), Neural Machine Translation (NMT) has achieved remarkable success across various language pairs. However, the low-resource nature of Bahnaric, characterized by data scarcity, vocabulary constraints, and the lack of parallel corpora, poses significant challenges to building an accurate and efficient translation system. To address these challenges, we propose a novel hybrid architecture for Bahnaric-Vietnamese translation, with BARTBahnar as its core language model. BARTBahnar is developed by continually training a pre-trained Vietnamese model, BARTPho, on augmented monolingual Bahnaric data, followed by fine-tuning on bilingual datasets. This transfer learning approach reduces training costs while effectively capturing linguistic similarities between the two languages. Additionally, we implement advanced data augmentation techniques to enrich and diversify training data, further enhancing BARTBahnar’s robustness and translation accuracy. Beyond leveraging the language model, our hybrid system integrates rule-based and statistical methods to improve translation quality. Experimental results show substantial improvements on bilingual Bahnaric-Vietnamese datasets, validating the effectiveness of our approach for low-resource translation. To support further research, we open-source our code and related materials at https://github.com/ura-hcmut/BARTBahnar.

2024

ViMedAQA: A Vietnamese Medical Abstractive Question-Answering Dataset and Findings of Large Language Model
Minh-Nam Tran | Phu-Vinh Nguyen | Long Nguyen | Dien Dinh
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Question answering involves creating answers to questions. With the growth of large language models, the ability of question-answering systems has dramatically improved. However, there is a lack of Vietnamese abstractive question-answering datasets, especially in the medical domain. Therefore, this research aims to address this gap by introducing ViMedAQA. This **Vi**etnamese **Med**ical **A**bstractive **Q**uestion-**A**nswering dataset covers four topics in the Vietnamese medical domain: body parts, disease, drugs, and medicine. Additionally, empirical results on the proposed dataset examine the capabilities of large language models in the Vietnamese medical domain, including reasoning, memorization, and awareness of essential information.

ViGLUE: A Vietnamese General Language Understanding Benchmark and Analysis of Vietnamese Language Models
Minh-Nam Tran | Phu-Vinh Nguyen | Long Nguyen | Dien Dinh
Findings of the Association for Computational Linguistics: NAACL 2024

As the number of language models has increased, various benchmarks have been proposed to assess the proficiency of these models in natural language understanding. However, such a benchmark is lacking for Vietnamese because natural language processing datasets are difficult to access and task-specific datasets are scarce. **ViGLUE**, the proposed dataset collection, is a **Vi**etnamese **G**eneral **L**anguage **U**nderstanding **E**valuation benchmark developed using three methods: translating an existing benchmark, generating new corpora, and collecting available datasets. ViGLUE contains twelve tasks and encompasses over ten areas and subjects, enabling it to evaluate models comprehensively across a broad spectrum of aspects. Baseline models built on multilingual language models are also provided for all tasks in the proposed benchmark. In addition, the available Vietnamese large language models are studied to explore their ability in the few-shot learning framework, which leads to an exploration of the relationship between specific tasks and the number of shots.

Advancing Vietnamese Information Retrieval with Learning Objective and Benchmark
Vinh Nguyen | Nam Tran | Long Nguyen | Dien Dinh
Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation

ViHerbQA: A Robust QA Model for Vietnamese Traditional Herbal Medicine
Quyen Truong | Long Nguyen | Dien Dinh
Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation

EATT: Knowledge Graph Integration in Transformer Architecture
Phong Vo | Long Nguyen
Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation

Multi-mask Prefix Tuning: Applying Multiple Adaptive Masks on Deep Prompt Tuning
Qui Tu | Trung Nguyen | Long Nguyen | Dien Dinh
Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation

VHE: A New Dataset for Event Extraction from Vietnamese Historical Texts
Truc Hoang | Long Nguyen | Dien Dinh
Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation

A Comparative Study of Chart Summarization
An Chu | Thong Huynh | Long Nguyen | Dien Dinh
Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation

2022

Multi-level Community-awareness Graph Neural Networks for Neural Machine Translation
Binh Nguyen | Long Nguyen | Dien Dinh
Proceedings of the 29th International Conference on Computational Linguistics

Neural Machine Translation (NMT) aims to translate text from a source language into a target language while preserving the original meaning. Linguistic information such as morphology, syntax, and semantics should be captured in token embeddings to produce a high-quality translation. Recent work has leveraged Graph Neural Networks (GNNs) to encode such linguistic knowledge into token embeddings. Specifically, these approaches use a trained parser to construct semantic graphs from sentences and then apply GNNs. However, most semantic graphs are tree-shaped and too sparse for GNNs, which causes the over-smoothing problem. To alleviate this problem, we propose a novel Multi-level Community-awareness Graph Neural Network (MC-GNN) layer to jointly model local and global relationships between words and their linguistic roles in multiple communities. Intuitively, the MC-GNN layer substitutes for a self-attention layer on the encoder side of a transformer-based machine translation model. Extensive experiments on four language-pair datasets with common evaluation metrics show remarkable improvements from our method while reducing time complexity on very long sentences.