Minh Nguyen

2026

VN-MTEB: Vietnamese Massive Text Embedding Benchmark
Loc Pham | Tung Luu | Thu Vo | Minh Nguyen | Viet Hoang
Findings of the Association for Computational Linguistics: EACL 2026

Vietnam ranks among the top countries in terms of both internet traffic and online toxicity. As a result, implementing embedding models for recommendation and content control duties in applications is crucial. However, a lack of large-scale test datasets, both in volume and task diversity, makes it tricky for scientists to effectively evaluate AI models before deploying them in real-world, large-scale projects. To solve this important problem, we introduce a Vietnamese benchmark, VN-MTEB, for embedding models, which we created by translating a large number of English samples from the Massive Text Embedding Benchmark using our new automated framework, thereby contributing an extension of the Massive Multilingual Text Embedding Benchmark with our additional Vietnamese tasks and datasets. We leverage the strengths of large language models (LLMs) and cutting-edge embedding models to conduct translation and filtering processes to retain high-quality samples, guaranteeing a natural flow of language and semantic fidelity while preserving named entity recognition (NER) and code snippets. Our comprehensive benchmark consists of 41 datasets from six tasks specifically designed for Vietnamese text embeddings. In our analysis, we find that bigger and more complex models using Rotary Positional Embedding outperform those using Absolute Positional Embedding in embedding tasks.

2025

A hybrid Approach to low-resource machine translation for Ojibwe verbs
Minh Nguyen | Christopher Hammerly | Miikka Silfverberg
Proceedings of the Fifth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP)

Machine translation is a tool that can help teachers, learners, and users of low-resourced languages. However, there are significant challenges in developing these tools, such as the lack of large-scale parallel corpora and complex morphology. We propose a novel hybrid system that combines LLM and rule-based methods in two distinct stages to translate inflected Ojibwe verbs into English. We use an LLM to automatically annotate dictionary data to build translation templates. Then, our rule-based module performs translation using inflection and slot-filling processes built on top of an FST-based analyzer. We test the system with a set of automated tests. Thanks to the ahead-of-time nature of the template-building process and the lightweight rule-based translation module, the end-to-end translation process has an average translation speed of 70 milliseconds per word. The system achieved an average ChrF score of 0.82 and a semantic similarity score of 0.93 among the successfully translated verbs in a test set. The approach has the potential to be extended to other low-resource Indigenous languages with dictionary data.

2023

FiRo: Finite-context Indexing of Restricted Output Space for NLP Models Facing Noisy Input
Minh Nguyen | Nancy Chen
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

CovRelex-SE: Adding Semantic Information for Relation Search via Sequence Embedding
Truong Do | Chau Nguyen | Vu Tran | Ken Satoh | Yuji Matsumoto | Minh Nguyen
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

In recent years, COVID-19 has impacted all aspects of human life. As a result, numerous publications relating to this disease have been issued. Due to the massive volume of publications, some retrieval systems have been developed to provide researchers with useful information. In these systems, lexical searching methods are widely used, which raises many issues related to acronyms, synonyms, and rare keywords. In this paper, we present a hybrid relation retrieval system, CovRelex-SE, based on embeddings to provide high-quality search results. Our system can be accessed through the following URL: https://www.jaist.ac.jp/is/labs/nguyen-lab/systems/covrelex-se/

2022

VIMQA: A Vietnamese Dataset for Advanced Reasoning and Explainable Multi-hop Question Answering
Khang Le | Hien Nguyen | Tung Le Thanh | Minh Nguyen
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Vietnamese is the native language of over 98 million people in the world. However, existing Vietnamese Question Answering (QA) datasets do not explore the model’s ability to perform advanced reasoning and provide evidence to explain the answer. We introduce VIMQA, a new Vietnamese dataset with over 10,000 Wikipedia-based multi-hop question-answer pairs. The dataset is human-generated and has four main features: (1) The questions require advanced reasoning over multiple paragraphs. (2) Sentence-level supporting facts are provided, enabling the QA model to reason and explain the answer. (3) The dataset offers various types of reasoning to test the model’s ability to reason and extract relevant proof. (4) The dataset is in Vietnamese, a low-resource language. We also conduct experiments on our dataset using state-of-the-art Multilingual single-hop and multi-hop QA methods. The results suggest that our dataset is challenging for existing methods, and there is room for improvement in Vietnamese QA systems. In addition, we propose a general process for data creation and publish a framework for creating multilingual multi-hop QA datasets. The dataset and framework are publicly available to encourage further research in Vietnamese QA systems.

Event Causality Identification via Generation of Important Context Words
Hieu Man | Minh Nguyen | Thien Nguyen
Proceedings of the 11th Joint Conference on Lexical and Computational Semantics

An important problem in Information Extraction is Event Causality Identification (ECI), which seeks to identify causal relations between pairs of event mentions. Prior models for ECI have mainly solved the problem using a classification framework that does not explore prediction/generation of important context words from input sentences for causal recognition. In this work, we consider the words along the dependency path between the two event mentions in the dependency tree as the important context words for ECI. We introduce dependency path generation as a complementary task for ECI, which can be solved jointly with causal label prediction to improve performance. To facilitate multi-task learning, we cast ECI as a generation problem that aims to generate both the causal relation and the dependency path words from the input sentence. In addition, we propose to use the REINFORCE algorithm to train our generative model, where novel reward functions are designed to capture both causal prediction accuracy and generation quality. Experiments on two benchmark datasets demonstrate state-of-the-art performance of the proposed model for ECI.

2021

Improving Named Entity Recognition in Spoken Dialog Systems by Context and Speech Pattern Modeling
Minh Nguyen | Zhou Yu
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue

While named entity recognition (NER) from speech has been around as long as NER from written text has, the accuracy of NER from speech has generally been much lower than that of NER from text. The rise in popularity of spoken dialog systems such as Siri or Alexa highlights the need for more accurate NER from speech because NER is a core component for understanding what users said in dialogs. Deployed spoken dialog systems receive user input in the form of automatic speech recognition (ASR) transcripts, and simply applying an NER model trained on written text to ASR transcripts often leads to low accuracy because, compared to written text, ASR transcripts lack important cues such as punctuation and capitalization. In addition, errors in ASR transcripts make NER from speech challenging. We propose two models that exploit dialog context and speech pattern clues to extract named entities more accurately from open-domain dialogs in spoken dialog systems. Our results show the benefit of modeling dialog context and speech patterns in two settings: a standard setting with a random partition of the data and a more realistic but also more difficult setting where many named entities encountered during deployment are unseen during training.

2019

Isolating the Effects of Modeling Recursive Structures: A Case Study in Pronunciation Prediction of Chinese Characters
Minh Nguyen | Gia H Ngo | Nancy Chen
Proceedings of the 2019 Workshop on Widening NLP

Finding that explicitly modeling structures leads to better generalization, we consider the task of predicting Cantonese pronunciations of logographs (Chinese characters) using logographs’ recursive structures. This task is a suitable case study for two reasons. First, logographs’ pronunciations depend on structures (i.e., the hierarchies of sub-units in logographs). Second, the quality of logographic structures is consistent since the structures are constructed automatically using a set of rules. Thus, this task is less affected by confounds such as varying quality between annotators. Empirical results show that modeling structures explicitly using a treeLSTM outperforms an LSTM baseline, reducing prediction error by 6.0% relative.

2018

Multimodal neural pronunciation modeling for spoken languages with logographic origin
Minh Nguyen | Gia H. Ngo | Nancy Chen
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Graphemes of most languages encode pronunciation, though some are more explicit than others. Languages like Spanish have a straightforward mapping between their graphemes and phonemes, while this mapping is more convoluted for languages like English. Spoken languages such as Cantonese present even more challenges in pronunciation modeling: (1) they do not have a standard written form, and (2) the closest graphemic origins are logographic Han characters, only a subset of which implicitly encode pronunciation. In this work, we propose a multimodal approach to predict the pronunciation of Cantonese logographic characters, using neural networks with a geometric representation of logographs and the pronunciation of cognates in historically related languages. The proposed framework improves performance by 18.1% and 25.0% relative to unimodal and multimodal baselines, respectively.

Who is Killed by Police: Introducing Supervised Attention for Hierarchical LSTMs
Minh Nguyen | Thien Huu Nguyen
Proceedings of the 27th International Conference on Computational Linguistics

Finding the names of people killed by police (police killing detection) has become increasingly important as police shootings receive more and more public attention. Unfortunately, there has not been much work in the literature addressing this problem. The early work in this field (Keith et al., 2017) proposed a distant supervision framework based on Expectation Maximization (EM) to deal with the multiple appearances of the names in documents. However, such an EM-based framework cannot take full advantage of deep learning models, necessitating the use of hand-designed features to improve the detection performance. In this work, we present a novel deep learning method to solve the problem of police killing recognition. The proposed method relies on hierarchical LSTMs to model the multiple sentences that contain the person names of interest, and introduces supervised attention mechanisms based on semantic word lists and dependency trees to upweight the important contextual words. Our experiments demonstrate the benefits of the proposed model and yield state-of-the-art performance for police killing detection.

Statistical Machine Transliteration Baselines for NEWS 2018
Snigdha Singhania | Minh Nguyen | Gia H. Ngo | Nancy Chen
Proceedings of the Seventh Named Entities Workshop

This paper reports the results of our transliteration experiments conducted on the NEWS 2018 Shared Task dataset. We focus on creating baseline systems trained using two open-source statistical transliteration tools, namely Sequitur and Moses. We discuss the pre-processing steps performed on this dataset for both systems. We also provide a re-ranking system which uses the top hypotheses from Sequitur and Moses to create a consolidated list of transliterations. The results obtained from each of these models can serve as a good starting point for the participating teams.