Kiet Nguyen


SMTCE: A Social Media Text Classification Evaluation Benchmark and BERTology Models for Vietnamese
Luan Nguyen | Kiet Nguyen | Ngan Nguyen
Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation


UIT-E10dot3 at SemEval-2021 Task 5: Toxic Spans Detection with Named Entity Recognition and Question-Answering Approaches
Phu Gia Hoang | Luan Thanh Nguyen | Kiet Nguyen
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

The increase in toxic comments in online spaces is having serious effects on vulnerable users. For this reason, considerable efforts have been made to address the problem, and SemEval-2021 Task 5: Toxic Spans Detection is one of them. This task asks competitors to extract the toxic spans from given texts, and we performed several analyses to understand its structure before running experiments. We approach the task in two ways, Named Entity Recognition with the spaCy library and Question-Answering with RoBERTa combined with ToxicBERT; the former achieves the highest F1-score of 66.99%.

Joint Chinese Word Segmentation and Part-of-speech Tagging via Two-stage Span Labeling
Duc-Vu Nguyen | Linh-Bao Vo | Ngoc-Linh Tran | Kiet Nguyen | Ngan Nguyen
Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation


A Vietnamese Dataset for Evaluating Machine Reading Comprehension
Kiet Nguyen | Vu Nguyen | Anh Nguyen | Ngan Nguyen
Proceedings of the 28th International Conference on Computational Linguistics

Over 97 million people worldwide speak Vietnamese as their native language. However, there are few research studies on machine reading comprehension (MRC) in Vietnamese, the task of understanding a document or text and answering questions related to it. Due to the lack of benchmark datasets for Vietnamese, we present the Vietnamese Question Answering Dataset (UIT-ViQuAD), a new dataset for evaluating MRC models in a low-resource language such as Vietnamese. This dataset comprises over 23,000 human-generated question-answer pairs based on 5,109 passages of 174 Vietnamese articles from Wikipedia. In particular, we propose a new process of dataset creation for Vietnamese MRC. Our in-depth analyses illustrate that our dataset requires abilities beyond simple reasoning such as word matching and demands complicated reasoning such as single-sentence and multiple-sentence inference. In addition, we conduct experiments with state-of-the-art MRC methods for English and Chinese as the first experimental models on UIT-ViQuAD, which will serve as baselines for comparison with future models. We also estimate human performance on the dataset and compare it to the experimental results of several powerful machine models. The substantial gap between human performance and that of the best model on the dataset indicates that improvements can be explored on UIT-ViQuAD through future research. Our dataset is freely available to encourage the research community to overcome challenges in Vietnamese MRC.

Empirical Study of Text Augmentation on Social Media Text in Vietnamese
Son Luu | Kiet Nguyen | Ngan Nguyen
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation

UIT-HSE at WNUT-2020 Task 2: Exploiting CT-BERT for Identifying COVID-19 Information on the Twitter Social Network
Khiem Tran | Hao Phan | Kiet Nguyen | Ngan Luu Thuy Nguyen
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)

Recently, COVID-19 has affected many aspects of daily life around the world and led to dreadful consequences. More and more tweets about COVID-19 have been shared publicly on Twitter. However, most of those tweets are uninformative, which makes it challenging to build automatic systems that detect the informative ones for useful AI applications. In this paper, we present our results at the W-NUT 2020 Shared Task 2: Identification of Informative COVID-19 English Tweets. In particular, we propose a simple but effective approach using transformer-based models built on COVID-Twitter-BERT (CT-BERT) with different fine-tuning techniques. As a result, we achieve an F1-score of 90.94%, placing third on the leaderboard of this task, which attracted 56 submitting teams in total.