Luan Thanh Nguyen

Also published as: Luan Thanh Nguyen

2025

Don’t Take it Literally! Idiom-aware Vietnamese Translation via In-context Learning
Luan Thanh Nguyen | Parisa Kordjamshidi
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

The translation of idiomatic expressions often results in misunderstandings and inaccuracies, affecting everyday communication as well as machine translation systems. This paper introduces Idiom-aware Vietnamese Translation (IDiAT), a new framework for the evaluation of idiomatic translation for Vietnamese, along with state-of-the-art results for this task. We collect and curate a high-quality Vietnamese-English idiom set that serves as a resource for in-context learning (ICL). IDiAT’s evaluation benchmark includes both idiomatic and non-idiomatic text pairs to assess general translation quality and idiomatic translation performance. We leverage ICL in large language models to augment few-shot demonstrations with idiom and topic descriptions and consequently improve the translation accuracy. Empirical results demonstrate that our IDiAT-based ICL outperforms traditional supervised methods using only a few data samples. Multiple evaluations confirm the effectiveness of our proposed approach. Though focusing on the Vietnamese language, our approach advances idiomatic translation and contributes to the development of culturally aware translation systems, paving the way for future research in low-resource languages. The experimental materials are publicly available.

2024

pdf bib abs

Multi-Dialect Vietnamese: Task, Dataset, Baseline Models and Challenges
Nguyen Van Dinh | Thanh Chi Dang | Luan Thanh Nguyen | Kiet Van Nguyen
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Vietnamese, a low-resource language, is typically categorized into three primary dialect groups that belong to Northern, Central, and Southern Vietnam. However, each province within these regions exhibits its own distinct pronunciation variations. Despite the existence of various speech recognition datasets, none of them has provided a fine-grained classification of the 63 dialects specific to individual provinces of Vietnam. To address this gap, we introduce Vietnamese Multi-Dialect (ViMD) dataset, a novel comprehensive dataset capturing the rich diversity of 63 provincial dialects spoken across Vietnam. Our dataset comprises 102.56 hours of audio, consisting of approximately 19,000 utterances, and the associated transcripts contain over 1.2 million words. To provide benchmarks and simultaneously demonstrate the challenges of our dataset, we fine-tune state-of-the-art pre-trained models for two downstream tasks: (1) Dialect identification and (2) Speech recognition. The empirical results suggest two implications including the influence of geographical factors on dialects, and the constraints of current approaches in speech recognition tasks involving multi-dialect speech data. Our dataset is available for research purposes.

pdf bib abs

ViHateT5: Enhancing Hate Speech Detection in Vietnamese With a Unified Text-to-Text Transformer Model
Luan Thanh Nguyen
Findings of the Association for Computational Linguistics: ACL 2024

Recent advancements in hate speech detection (HSD) in Vietnamese have made significant progress, primarily attributed to the emergence of transformer-based pre-trained language models, particularly those built on the BERT architecture. However, the necessity for specialized fine-tuned models has resulted in the complexity and fragmentation of developing a multitasking HSD system. Moreover, most current methodologies focus on fine-tuning general pre-trained models, primarily trained on formal textual datasets like Wikipedia, which may not accurately capture human behavior on online platforms. In this research, we introduce ViHateT5, a T5-based model pre-trained on our proposed large-scale domain-specific dataset named VOZ-HSD. By harnessing the power of a text-to-text architecture, ViHateT5 can tackle multiple tasks using a unified model and achieve state-of-the-art performance across all standard HSD benchmarks in Vietnamese. Our experiments also underscore the significance of label distribution in pre-training data on model efficacy. We provide our experimental materials for research purposes, including the VOZ-HSD dataset, pre-trained checkpoint, the unified HSD-multitask ViHateT5 model, and related source code on GitHub publicly.

pdf bib

How Good Is Synthetic Data for Social Media Texts? A Study on Fine-Tuning Low-Resource Language Models for Vietnamese
Luan Thanh Nguyen
Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation

2022

pdf bib

SMTCE: A Social Media Text Classification Evaluation Benchmark and BERTology Models for Vietnamese
Luan Thanh Nguyen | Kiet Van Nguyen | Ngan Luu-Thuy Nguyen
Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation

2021

pdf bib abs

UIT-E10dot3 at SemEval-2021 Task 5: Toxic Spans Detection with Named Entity Recognition and Question-Answering Approaches
Phu Gia Hoang | Luan Thanh Nguyen | Kiet Nguyen
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

The increment of toxic comments on online space is causing tremendous effects on other vulnerable users. For this reason, considerable efforts are made to deal with this, and SemEval-2021 Task 5: Toxic Spans Detection is one of those. This task asks competitors to extract spans that have toxicity from the given texts, and we have done several analyses to understand its structure before doing experiments. We solve this task by two approaches, Named Entity Recognition with spaCy’s library and Question-Answering with RoBERTa combining with ToxicBERT, and the former gains the highest F1-score of 66.99%.

Co-authors

Ngan Nguyen 1

Venues

SemEval1

Fix author