Long Nguyen

Also published as: L. Nguyen

Other people with similar names: Long S. T. Nguyen


2025

Vietnam’s traditional medical texts were historically written in Classical Chinese using Sino-Vietnamese pronunciations. As the Vietnamese language transitioned to a Latin-based national script and interest in integrating traditional medicine with modern healthcare grows, accurate translation of these texts has become increasingly important. However, the diversity of terminology and the complexity of translating medical entities into modern contexts pose significant challenges. To address this, we propose a method that fine-tunes large language models (LLMs) using augmented data and a Hybrid Entity Masking and Replacement (HEMR) strategy to improve named entity translation. We also introduce a parallel named entity translation dataset specifically curated for traditional Vietnamese medicine. Our evaluation across multiple LLMs shows that the proposed approach achieves a translation accuracy of 71.91%, demonstrating its effectiveness. These results underscore the importance of incorporating named entity awareness into translation systems, particularly in low-resource and domain-specific settings like traditional Vietnamese medicine.

2024

Question answering involves creating answers to questions. With the growth of large language models, the ability of question-answering systems has dramatically improved. However, there is a lack of Vietnamese abstractive question-answering datasets, especially in the medical domain. Therefore, this research aims to mitigate this gap by introducing ViMedAQA. This **Vi**etnamese **Med**ical **A**bstractive **Q**uestion-**A**nswering dataset covers four topics in the Vietnamese medical domain, including body parts, disease, drugs and medicine. Additionally, the empirical results on the proposed dataset examine the capability of the large language models in the Vietnamese medical domain, including reasoning, memorizing and awareness of essential information.
As the number of language models has increased, various benchmarks have been suggested to assess the proficiency of the models in natural language understanding. However, there is a lack of such a benchmark in Vietnamese due to the difficulty in accessing natural language processing datasets or the scarcity of task-specific datasets. **ViGLUE**, the proposed dataset collection, is a **Vi**etnamese **G**eneral **L**anguage **U**nderstanding **E**valuation benchmark developed using three methods: translating an existing benchmark, generating new corpora, and collecting available datasets. ViGLUE contains twelve tasks and encompasses over ten areas and subjects, enabling it to evaluate models comprehensively over a broad spectrum of aspects. Baseline models utilizing multilingual language models are also provided for all tasks in the proposed benchmarks. In addition, the study of the available Vietnamese large language models is conducted to explore the language models’ ability in the few-shot learning framework, leading to the exploration of the relationship between specific tasks and the number of shots.

2022

Neural Machine Translation (NMT) aims to translate the source- to the target-language while preserving the original meaning. Linguistic information such as morphology, syntactic, and semantics shall be grasped in token embeddings to produce a high-quality translation. Recent works have leveraged the powerful Graph Neural Networks (GNNs) to encode such language knowledge into token embeddings. Specifically, they use a trained parser to construct semantic graphs given sentences and then apply GNNs. However, most semantic graphs are tree-shaped and too sparse for GNNs which cause the over-smoothing problem. To alleviate this problem, we propose a novel Multi-level Community-awareness Graph Neural Network (MC-GNN) layer to jointly model local and global relationships between words and their linguistic roles in multiple communities. Intuitively, the MC-GNN layer substitutes a self-attention layer at the encoder side of a transformer-based machine translation model. Extensive experiments on four language-pair datasets with common evaluation metrics show the remarkable improvements of our method while reducing the time complexity in very long sentences.

1994

1993

1992