2024
A Novel Instruction Tuning Method for Vietnamese Mathematical Reasoning using Trainable Open-Source Large Language Models
Nguyen Quang Vinh | Thanh-Do Nguyen | Vinh Van Nguyen | Nam Khac-Hoai Bui
Proceedings of the 28th Conference on Computational Natural Language Learning
This study introduces Simple Reasoning with Code (SiRC), a simple yet effective instruction fine-tuning method for solving mathematical reasoning problems. It is particularly effective for Vietnamese, which is considered a low-resource language. Solving mathematical problems requires strategic and logical reasoning, which remains challenging in this research area. Unlike previous approaches, the proposed method combines chain-of-thought reasoning with code transfer methods without requiring a sophisticated inference procedure. Furthermore, we focus on exploiting small open-source large language models (LLMs) for the Vietnamese language. To this end, we first introduce a trainable Vietnamese math reasoning dataset named ViMath-InstructCode. The dataset is then used to fine-tune open-source LLMs (e.g., those with fewer than 10 billion parameters). Experiments on our custom ViMath-Bench dataset, the largest benchmarking dataset focusing on Vietnamese mathematical problems, indicate promising results for the proposed method. Our source code and dataset are available for further use.
2022
KC4MT: A High-Quality Corpus for Multilingual Machine Translation
Vinh Van Nguyen | Ha Nguyen | Huong Thanh Le | Thai Phuong Nguyen | Tan Van Bui | Luan Nghia Pham | Anh Tuan Phan | Cong Hoang-Minh Nguyen | Viet Hong Tran | Anh Huu Tran
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Multilingual parallel corpora are an important resource for many applications of natural language processing (NLP). For machine translation, the size and quality of the training corpus strongly affect the quality of the translation models. In this work, we present a method for building a high-quality multilingual parallel corpus in the news domain for several low-resource languages, including Vietnamese, Lao, and Khmer, to improve the quality of multilingual machine translation for these languages. We also release the corpus, which includes 500,000 Vietnamese-Chinese bilingual sentence pairs, 150,000 Vietnamese-Lao bilingual sentence pairs, and 150,000 Vietnamese-Khmer bilingual sentence pairs.
2013
Vietnamese Text Accent Restoration with Statistical Machine Translation
Luan-Nghia Pham | Viet-Hong Tran | Vinh-Van Nguyen
Proceedings of the 27th Pacific Asia Conference on Language, Information, and Computation (PACLIC 27)
2012
Improving Statistical Machine Translation with Processing Shallow Parsing
Hoai-Thu Vuong | Vinh Van Nguyen | Viet Hong Tran | Akira Shimazu
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation
2009
Improving a Lexicalized Hierarchical Reordering Model Using Maximum Entropy
Vinh Van Nguyen | Akira Shimazu | Minh Le Nguyen | Thai Phuong Nguyen
Proceedings of Machine Translation Summit XII: Papers
2008
A Tree-to-String Phrase-based Model for Statistical Machine Translation
Thai Phuong Nguyen | Akira Shimazu | Tu-Bao Ho | Minh Le Nguyen | Vinh Van Nguyen
CoNLL 2008: Proceedings of the Twelfth Conference on Computational Natural Language Learning