Anh Tuan Nguyen

Also published as: Anh Tuan Nguyen, Anh-Tuan Nguyen


2023

ViDeBERTa: A powerful pre-trained language model for Vietnamese
Cong Dao Tran | Nhut Huy Pham | Anh Tuan Nguyen | Truong Son Hy | Tu Vu
Findings of the Association for Computational Linguistics: EACL 2023

This paper presents ViDeBERTa, a new pre-trained monolingual language model for Vietnamese, in three versions: ViDeBERTa_xsmall, ViDeBERTa_base, and ViDeBERTa_large, which are pre-trained on a large-scale corpus of high-quality and diverse Vietnamese texts using the DeBERTa architecture. Although many successful pre-trained language models based on the Transformer have been proposed for English, there are still few pre-trained models for Vietnamese, a low-resource language, that achieve good results on downstream tasks, especially Question answering. We fine-tune and evaluate our model on three important natural language downstream tasks: Part-of-speech tagging, Named-entity recognition, and Question answering. The empirical results demonstrate that ViDeBERTa, with far fewer parameters, surpasses the previous state-of-the-art models on multiple Vietnamese-specific natural language understanding tasks. Notably, ViDeBERTa_base has 86M parameters, only about 23% of the 370M parameters of PhoBERT_large, yet still matches or outperforms the previous state-of-the-art model. Our ViDeBERTa models are available at: https://github.com/HySonLab/ViDeBERTa.

2020

BERTweet: A pre-trained language model for English Tweets
Dat Quoc Nguyen | Thanh Vu | Anh Tuan Nguyen
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We present BERTweet, the first public large-scale pre-trained language model for English Tweets. Our BERTweet, having the same architecture as BERT-base (Devlin et al., 2019), is trained using the RoBERTa pre-training procedure (Liu et al., 2019). Experiments show that BERTweet outperforms the strong baselines RoBERTa-base and XLM-R-base (Conneau et al., 2020), performing better than the previous state-of-the-art models on three Tweet NLP tasks: Part-of-speech tagging, Named-entity recognition and text classification. We release BERTweet under the MIT License to facilitate future research and applications on Tweet data. Our BERTweet is available at https://github.com/VinAIResearch/BERTweet

ReINTEL: A Multimodal Data Challenge for Responsible Information Identification on Social Network Sites
Duc-Trong Le | Xuan-Son Vu | Nhu-Dung To | Huu-Quang Nguyen | Thuy-Trinh Nguyen | Thi Khanh-Linh Le | Anh-Tuan Nguyen | Minh-Duc Hoang | Nghia Le | Huyen Nguyen | Hoang D. Nguyen
Proceedings of the 7th International Workshop on Vietnamese Language and Speech Processing

PhoBERT: Pre-trained language models for Vietnamese
Dat Quoc Nguyen | Anh Tuan Nguyen
Findings of the Association for Computational Linguistics: EMNLP 2020

We present PhoBERT with two versions, PhoBERT-base and PhoBERT-large, the first public large-scale monolingual language models pre-trained for Vietnamese. Experimental results show that PhoBERT consistently outperforms the recent best pre-trained multilingual model XLM-R (Conneau et al., 2020) and improves the state-of-the-art in multiple Vietnamese-specific NLP tasks including Part-of-speech tagging, Dependency parsing, Named-entity recognition and Natural language inference. We release PhoBERT to facilitate future research and downstream applications for Vietnamese NLP. Our PhoBERT models are available at https://github.com/VinAIResearch/PhoBERT

A Pilot Study of Text-to-SQL Semantic Parsing for Vietnamese
Anh Tuan Nguyen | Mai Hoang Dao | Dat Quoc Nguyen
Findings of the Association for Computational Linguistics: EMNLP 2020

Semantic parsing is an important NLP task. However, Vietnamese is a low-resource language in this research area. In this paper, we present the first public large-scale Text-to-SQL semantic parsing dataset for Vietnamese. We extend and evaluate two strong semantic parsing baselines, EditSQL (Zhang et al., 2019) and IRNet (Guo et al., 2019), on our dataset. We compare the two baselines with key configurations and find that: automatic Vietnamese word segmentation improves the parsing results of both baselines; the normalized pointwise mutual information (NPMI) score (Bouma, 2009) is useful for schema linking; latent syntactic features extracted from a neural dependency parser for Vietnamese also improve the results; and the monolingual language model PhoBERT for Vietnamese (Nguyen and Nguyen, 2020) yields higher performance than the recent best multilingual language model XLM-R (Conneau et al., 2020).
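
For readers unfamiliar with the NPMI score mentioned in the abstract above, the following is a minimal sketch of the standard definition: pointwise mutual information divided by -log p(x, y), which bounds the score to [-1, 1]. The function name and its probability arguments are illustrative assumptions; the abstract does not say how the probabilities are estimated or how the score is applied to schema linking, so this is not the authors' implementation.

```python
import math


def npmi(p_xy: float, p_x: float, p_y: float) -> float:
    """Normalized pointwise mutual information of two events x and y.

    Assumes the probabilities were estimated elsewhere (for example from
    co-occurrence counts) and that 0 <= p_xy < 1, p_x > 0, p_y > 0.
    """
    if p_xy == 0.0:
        # Limit value when x and y never co-occur.
        return -1.0
    pmi = math.log(p_xy / (p_x * p_y))
    # Dividing by -log p(x, y) maps the score into [-1, 1].
    return pmi / -math.log(p_xy)


independent = npmi(p_xy=0.25, p_x=0.5, p_y=0.5)      # x and y independent
always_together = npmi(p_xy=0.25, p_x=0.25, p_y=0.25)  # x and y always co-occur
never_together = npmi(p_xy=0.0, p_x=0.5, p_y=0.5)     # x and y never co-occur
```

Under this definition, independent events score 0, events that only ever occur together score 1, and events that never co-occur score -1.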

Fast Word Predictor for On-Device Application
Huy Tien Nguyen | Khoi Tuan Nguyen | Anh Tuan Nguyen | Thanh Lac Thi Tran
Proceedings of the 28th International Conference on Computational Linguistics: System Demonstrations

Trained on large text corpora, deep neural networks achieve promising results in the next word prediction task. However, deploying these huge models on devices has to deal with the constraints of low latency and a small binary size. To address these challenges, we propose a fast word predictor that performs efficiently on mobile devices. Compared with a standard neural network that has a similar word prediction rate, the proposed model obtains a 60% reduction in memory size and 100X faster inference on a mid-range mobile device. The method is developed as a feature for a chat application which serves more than 100 million users.

TATL at WNUT-2020 Task 2: A Transformer-based Baseline System for Identification of Informative COVID-19 English Tweets
Anh Tuan Nguyen
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)

As the COVID-19 outbreak continues to spread throughout the world, more and more information about the pandemic has been shared publicly on social media. For example, a huge number of COVID-19 English Tweets are posted on Twitter every day. However, the majority of those Tweets are uninformative, and hence it is important to be able to automatically select only the informative ones for downstream applications. In this short paper, we present our participation in the W-NUT 2020 Shared Task 2: Identification of Informative COVID-19 English Tweets. Inspired by the recent advances in pretrained Transformer language models, we propose a simple yet effective baseline for the task. Despite its simplicity, our proposed approach shows very competitive results on the leaderboard, ranking 8th out of 56 participating teams.