Peerat Limkonchotiwat


2022

Thai Nested Named Entity Recognition Corpus
Weerayut Buaphet | Can Udomcharoenchaikit | Peerat Limkonchotiwat | Attapol Rutherford | Sarana Nutanong
Findings of the Association for Computational Linguistics: ACL 2022

This paper presents the first Thai Nested Named Entity Recognition (N-NER) dataset. Thai N-NER consists of 264,798 mentions, 104 classes, and a maximum depth of 8 layers obtained from 4,894 documents in the domains of news articles and restaurant reviews. Our work, to the best of our knowledge, presents the largest non-English N-NER dataset and the first non-English one with fine-grained classes. To understand the new challenges our proposed dataset brings to the field, we conduct an experimental study on (i) cutting-edge N-NER models with state-of-the-art accuracy in English and (ii) baseline methods based on well-known language model architectures. From the experimental results, we obtained two key findings. First, all models produced poor F1 scores in the tail region of the class distribution. Second, the cutting-edge models provided little or no performance improvement over the baseline methods on our Thai dataset. These findings suggest that further investigation is required to make a multilingual N-NER solution that works well across different languages.

2021

Handling Cross- and Out-of-Domain Samples in Thai Word Segmentation
Peerat Limkonchotiwat | Wannaphong Phatthiyaphaibun | Raheem Sarwar | Ekapol Chuangsuwanich | Sarana Nutanong
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Robust Fragment-Based Framework for Cross-lingual Sentence Retrieval
Nattapol Trijakwanich | Peerat Limkonchotiwat | Raheem Sarwar | Wannaphong Phatthiyaphaibun | Ekapol Chuangsuwanich | Sarana Nutanong
Findings of the Association for Computational Linguistics: EMNLP 2021

Cross-lingual Sentence Retrieval (CLSR) aims at retrieving parallel sentence pairs that are translations of each other from a multilingual set of comparable documents. The retrieved parallel sentence pairs can be used in other downstream NLP tasks such as machine translation and cross-lingual word sense disambiguation. We propose a CLSR framework called Robust Fragment-level Representation (RFR) to address Out-of-Domain (OOD) CLSR problems. In particular, we improve the sentence retrieval robustness by representing each sentence as a collection of fragments. In this way, we change the retrieval granularity from the sentence level to the fragment level. We performed CLSR experiments based on three OOD datasets, four language pairs, and three well-known base sentence encoders: m-USE, LASER, and LaBSE. Experimental results show that RFR significantly improves the base encoders’ performance in more than 85% of the cases.

2020

Domain Adaptation of Thai Word Segmentation Models using Stacked Ensemble
Peerat Limkonchotiwat | Wannaphong Phatthiyaphaibun | Raheem Sarwar | Ekapol Chuangsuwanich | Sarana Nutanong
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Like many Natural Language Processing tasks, Thai word segmentation is domain-dependent. Researchers have been relying on transfer learning to adapt an existing model to a new domain. However, this approach is inapplicable to cases where we can interact with only the input and output layers of the models, also known as “black boxes”. We propose a filter-and-refine solution based on the stacked-ensemble learning paradigm to address this black-box limitation. We conducted extensive experimental studies comparing our method against state-of-the-art models and transfer learning. Experimental results show that our proposed solution is an effective domain adaptation method and performs comparably to the transfer learning method.