Van-Hien Tran

2025

pdf bib abs
PrahokBART: A Pre-trained Sequence-to-Sequence Model for Khmer Natural Language Generation
Hour Kaing | Raj Dabre | Haiyue Song | Van-Hien Tran | Hideki Tanaka | Masao Utiyama
Proceedings of the 31st International Conference on Computational Linguistics

This work introduces PrahokBART, a compact pre-trained sequence-to-sequence model trained from scratch for Khmer using carefully curated Khmer and English corpora. We focus on improving the pre-training corpus quality and addressing the linguistic issues of Khmer, which are ignored in existing multilingual models, by incorporating linguistic components such as word segmentation and normalization. We evaluate PrahokBART on three generative tasks: machine translation, text summarization, and headline generation, where our results demonstrate that it outperforms mBART50, a strong multilingual pre-trained model. Additionally, our analysis provides insights into the impact of each linguistic module and evaluates how effectively our model handles space during text generation, which is crucial for the naturalness of texts in Khmer.

pdf bib abs
Can Explicit Gender Information Improve Zero-Shot Machine Translation?
Van-Hien Tran | Huy Hien Vu | Hideki Tanaka | Masao Utiyama
Proceedings of the 6th Workshop on Gender Bias in Natural Language Processing (GeBNLP)

Large language models (LLMs) have demonstrated strong zero-shot machine translation (MT) performance but often exhibit gender bias that is present in their training data, especially when translating into grammatically gendered languages. In this paper, we investigate whether explicitly providing gender information can mitigate this issue and improve translation quality. We propose a two-step approach: (1) inferring entity gender from context, and (2) incorporating this information into prompts using either Structured Tagging or Natural Language. Experiments with five LLMs across four language pairs show that explicit gender cues consistently reduce gender errors, with structured tagging yielding the largest gains. Our results highlight prompt-level gender disambiguation as a simple yet effective strategy for more accurate and fair zero-shot MT.

pdf bib abs
Exploiting Word Sense Disambiguation in Large Language Models for Machine Translation
Van-Hien Tran | Raj Dabre | Hour Kaing | Haiyue Song | Hideki Tanaka | Masao Utiyama
Proceedings of the First Workshop on Language Models for Low-Resource Languages

Machine Translation (MT) has made great strides with the use of Large Language Models (LLMs) and advanced prompting techniques. However, translating sentences with ambiguous words remains challenging, especially when LLMs have limited proficiency in the source language. This paper introduces two methods to enhance MT performance by leveraging the word sense disambiguation capabilities of LLMs. The first method integrates all the available senses of an ambiguous word into the prompting template. The second method uses a pre-trained source language model to predict the correct sense of the ambiguous word, which is then incorporated into the prompting template. Additionally, we propose two prompting template styles for providing word sense information to LLMs. Experiments on the HOLLY dataset demonstrate the effectiveness of our approach in improving MT performance.

pdf bib abs
Enhanced Zero-Shot Machine Translation via Fixed Prefix Pair Bootstrapping
Van-Hien Tran | Masao Utiyama
Proceedings of the Eighth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2025)

Zero-shot in-context learning allows large language models (LLMs) to perform tasks using only provided instructions. However, pre-trained LLMs often face calibration issues in zero-shot scenarios, leading to challenges such as hallucinations and off-target translations that compromise output quality, particularly in machine translation (MT). This paper introduces a new method to improve zero-shot MT using fixed prefix pair bootstrapping. By initializing translations with an accurate bilingual prefix pair at the start of both source and target sentences, this approach effectively guides the model to generate precise target-language outputs. Extensive evaluations across four model architectures and multiple translation directions demonstrate significant and consistent improvements, showcasing the potential of this straightforward strategy to enhance zero-shot MT performance.

2023

pdf bib abs
Improving Embedding Transfer for Low-Resource Machine Translation
Van-Hien Tran | Chenchen Ding | Hideki Tanaka | Masao Utiyama
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track

Low-resource machine translation (LRMT) poses a substantial challenge due to the scarcity of parallel training data. This paper introduces a new method to improve the transfer of the embedding layer from the Parent model to the Child model in LRMT, utilizing trained token embeddings in the Parent model’s high-resource vocabulary. Our approach involves projecting all tokens into a shared semantic space and measuring the semantic similarity between tokens in the low-resource and high-resource languages. These measures are then utilized to initialize token representations in the Child model’s low-resource vocabulary. We evaluated our approach on three benchmark datasets of low-resource language pairs: Myanmar-English, Indonesian-English, and Turkish-English. The experimental results demonstrate that our method outperforms previous methods regarding translation quality. Additionally, our approach is computationally efficient, leading to reduced training time compared to prior works.

2022

pdf bib abs
Improving Discriminative Learning for Zero-Shot Relation Extraction
Van-Hien Tran | Hiroki Ouchi | Taro Watanabe | Yuji Matsumoto
Proceedings of the 1st Workshop on Semiparametric Methods in NLP: Decoupling Logic from Knowledge

Zero-shot relation extraction (ZSRE) aims to predict target relations that cannot be observed during training. While most previous studies have focused on fully supervised relation extraction and achieved considerably high performance, less effort has been made towards ZSRE. This study proposes a new model incorporating discriminative embedding learning for both sentences and semantic relations. In addition, a self-adaptive comparator network is used to judge whether the relationship between a sentence and a relation is consistent. Experimental results on two benchmark datasets showed that the proposed method significantly outperforms the state-of-the-art methods.

2021

pdf bib abs
CovRelex: A COVID-19 Retrieval System with Relation Extraction
Vu Tran | Van-Hien Tran | Phuong Nguyen | Chau Nguyen | Ken Satoh | Yuji Matsumoto | Minh Nguyen
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

This paper presents CovRelex, a scientific paper retrieval system targeting entities and relations via relation extraction on COVID-19 scientific papers. This work aims at building a system supporting users efficiently in acquiring knowledge across a huge number of COVID-19 scientific papers published rapidly. Our system can be accessed via https://www.jaist.ac.jp/is/labs/nguyen-lab/systems/covrelex/.

2019

pdf bib abs
Relation Classification Using Segment-Level Attention-based CNN and Dependency-based RNN
Van-Hien Tran | Van-Thuy Phi | Hiroyuki Shindo | Yuji Matsumoto
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Recently, relation classification has gained much success by exploiting deep neural networks. In this paper, we propose a new model effectively combining Segment-level Attention-based Convolutional Neural Networks (SACNNs) and Dependency-based Recurrent Neural Networks (DepRNNs). While SACNNs allow the model to selectively focus on the important information segment from the raw sequence, DepRNNs help to handle the long-distance relations from the shortest dependency path of relation entities. Experiments on the SemEval-2010 Task 8 dataset show that our model is comparable to the state-of-the-art without using any external lexical features.