Can Udomcharoenchaikit


2023

PyThaiNLP: Thai Natural Language Processing in Python
Wannaphong Phatthiyaphaibun | Korakot Chaovavanich | Charin Polpanumas | Arthit Suriyawongkul | Lalita Lowphansirikul | Pattarawat Chormai | Peerat Limkonchotiwat | Thanathip Suntorntip | Can Udomcharoenchaikit
Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)

We present PyThaiNLP, a free and open-source natural language processing (NLP) library for the Thai language, implemented in Python. It provides a wide range of software, models, and datasets for the Thai language. We first provide a brief historical context of tools for the Thai language prior to the development of PyThaiNLP. We then outline the functionalities it provides, as well as its datasets and pre-trained language models. We later summarize its development milestones and discuss our experience during its development. We conclude by demonstrating how industrial and research communities utilize PyThaiNLP in their work. The library is freely available at https://github.com/pythainlp/pythainlp.

Cross-Lingual Data Augmentation For Thai Question-Answering
Parinthapat Pengpun | Can Udomcharoenchaikit | Weerayut Buaphet | Peerat Limkonchotiwat
Proceedings of the 1st GenBench Workshop on (Benchmarking) Generalisation in NLP

This paper presents an innovative data augmentation framework with data quality control designed to enhance the robustness of Question Answering (QA) models in low-resource languages, particularly Thai. Recognizing the challenges posed by the scarcity and quality of training data, we leverage data augmentation techniques in both monolingual and cross-lingual settings. Our approach augments and enriches the original dataset, thereby increasing its linguistic diversity and robustness. We evaluate the robustness of our framework on Machine Reading Comprehension, and the experimental results illustrate the potential of data augmentation to effectively increase training data and improve model generalization in low-resource language settings, offering a promising direction for data augmentation.

Typo-Robust Representation Learning for Dense Retrieval
Panuthep Tasawong | Wuttikorn Ponwitayarat | Peerat Limkonchotiwat | Can Udomcharoenchaikit | Ekapol Chuangsuwanich | Sarana Nutanong
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Dense retrieval is a basic building block of information retrieval applications. One of the main challenges of dense retrieval in real-world settings is the handling of queries containing misspelled words. A popular approach for handling misspelled queries is minimizing the representation discrepancy between misspelled queries and their pristine counterparts. Unlike the existing approaches, which only focus on the alignment between misspelled and pristine queries, our method also improves the contrast between each misspelled query and its surrounding queries. To assess the effectiveness of our proposed method, we compare it against the existing competitors using two benchmark datasets and two base encoders. Our method outperforms the competitors in all cases with misspelled queries. Our code and models are available at https://github.com/panuthept/DST-DenseRetrieval.

2022

Proceedings of the 3rd Workshop on Evaluation and Comparison of NLP Systems
Daniel Deutsch | Can Udomcharoenchaikit | Juri Opitz | Yang Gao | Marina Fomicheva | Steffen Eger
Proceedings of the 3rd Workshop on Evaluation and Comparison of NLP Systems

Thai Nested Named Entity Recognition Corpus
Weerayut Buaphet | Can Udomcharoenchaikit | Peerat Limkonchotiwat | Attapol Rutherford | Sarana Nutanong
Findings of the Association for Computational Linguistics: ACL 2022

This paper presents the first Thai Nested Named Entity Recognition (N-NER) dataset. Thai N-NER consists of 264,798 mentions, 104 classes, and a maximum depth of 8 layers obtained from 4,894 documents in the domains of news articles and restaurant reviews. Our work, to the best of our knowledge, presents the largest non-English N-NER dataset and the first non-English one with fine-grained classes. To understand the new challenges our proposed dataset brings to the field, we conduct an experimental study on (i) cutting-edge N-NER models with state-of-the-art accuracy in English and (ii) baseline methods based on well-known language model architectures. From the experimental results, we obtained two key findings. First, all models produced poor F1 scores in the tail region of the class distribution. Second, these models provide little or no performance improvement over the baseline methods on our Thai dataset. These findings suggest that further investigation is required to make a multilingual N-NER solution that works well across different languages.
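
To make the notion of "maximum depth" in a nested NER corpus concrete, here is a minimal sketch of how nesting depth could be computed from mention spans. It is an illustration only, not the corpus's actual format or tooling: the `Mention` class, the `nesting_depth` function, and the labels are hypothetical, and the sketch assumes spans are either disjoint or fully contained in one another (no crossing spans).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """A hypothetical entity mention over token positions."""
    start: int  # inclusive token index
    end: int    # exclusive token index
    label: str  # illustrative label, not taken from the dataset


def nesting_depth(mentions):
    """Return the largest number of mentions nested inside one another.

    Assumes spans never cross: any two spans are disjoint or one contains
    the other. Identical spans with different labels count as nested.
    """
    # Visit outer spans before inner ones: start ascending, end descending.
    ordered = sorted(mentions, key=lambda m: (m.start, -m.end))
    stack = []  # chain of spans that contain the current one
    deepest = 0
    for m in ordered:
        # Drop spans from the chain that do not contain the current span.
        while stack and not (stack[-1].start <= m.start and m.end <= stack[-1].end):
            stack.pop()
        stack.append(m)
        deepest = max(deepest, len(stack))
    return deepest


example = [
    Mention(0, 5, "organization"),
    Mention(0, 2, "person"),
    Mention(3, 5, "location"),
    Mention(3, 4, "district"),
]
print(nesting_depth(example))
```

In this example the chain organization, location, district gives three layers; a corpus with a maximum depth of 8 would contain at least one such chain of eight mentions.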

CL-ReLKT: Cross-lingual Language Knowledge Transfer for Multilingual Retrieval Question Answering
Peerat Limkonchotiwat | Wuttikorn Ponwitayarat | Can Udomcharoenchaikit | Ekapol Chuangsuwanich | Sarana Nutanong
Findings of the Association for Computational Linguistics: NAACL 2022

Cross-Lingual Retrieval Question Answering (CL-ReQA) is concerned with retrieving answer documents or passages to a question written in a different language. A common approach to CL-ReQA is to create a multilingual sentence embedding space such that question-answer pairs across different languages are close to each other. In this paper, we propose a novel CL-ReQA method utilizing the concept of language knowledge transfer and a new cross-lingual consistency training technique to create a multilingual embedding space for ReQA. To assess the effectiveness of our work, we conducted comprehensive experiments on CL-ReQA and a downstream task, machine reading QA. We compared our proposed method with the current state-of-the-art solutions across three public CL-ReQA corpora. Our method outperforms competitors in 19 out of 21 settings of CL-ReQA. When used with a downstream machine reading QA task, our method outperforms the best existing language-model-based method by 10% in F1 while being 10 times faster in sentence embedding computation. The code and models are available at https://github.com/mrpeerat/CL-ReLKT.

ConGen: Unsupervised Control and Generalization Distillation For Sentence Representation
Peerat Limkonchotiwat | Wuttikorn Ponwitayarat | Lalita Lowphansirikul | Can Udomcharoenchaikit | Ekapol Chuangsuwanich | Sarana Nutanong
Findings of the Association for Computational Linguistics: EMNLP 2022

Sentence representations are essential in many NLP tasks operating at the sentence level. Recently, research attention has shifted towards learning how to represent sentences without any annotations, i.e., unsupervised representation learning. Despite the benefit of training without supervised data, there is still a performance penalty compared to supervised methods. Furthermore, the supervised-unsupervised performance gap widens as we reduce the model size. In this paper, we propose an unsupervised sentence representation method to reduce the supervised-unsupervised performance gap, especially for smaller models. Utilizing the concept of knowledge distillation, we derive a distillation framework comprising two training objectives, control and generalize, called ConGen. Experiments on semantic textual similarity (STS), text classification (transfer), and natural language inference (NLI) tasks show that ConGen is on par with supervised training even on smaller models. Furthermore, our method consistently outperformed competitors on multilingual STS. The code and models are available at https://github.com/KornWtp/ConGen.

Topic-Regularized Authorship Representation Learning
Jitkapat Sawatphol | Nonthakit Chaiwong | Can Udomcharoenchaikit | Sarana Nutanong
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Authorship attribution is a task that aims to identify the author of a given piece of writing. We aim to develop a generalized solution that can handle a large number of texts from authors and topics unavailable in training data. Previous studies have proposed strategies that address either unseen authors or unseen topics, but not both. Authorship representation learning has been shown to work in open-set environments with a large number of unseen authors but has not been explicitly designed for cross-topic environments at the same time. To handle a large number of unseen authors and topics, we propose Authorship Representation Regularization (ARR), a distillation framework that creates authorship representation with reduced reliance on topic-specific information. To assess the performance of our framework, we also propose a cross-topic-open-set evaluation method. Our proposed method improves performance over baselines in the cross-topic-open-set setup in 4 out of 6 cases.

Mitigating Spurious Correlation in Natural Language Understanding with Counterfactual Inference
Can Udomcharoenchaikit | Wuttikorn Ponwitayarat | Patomporn Payoungkhamdee | Kanruethai Masuk | Weerayut Buaphet | Ekapol Chuangsuwanich | Sarana Nutanong
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Despite their promising results on standard benchmarks, NLU models are still prone to making predictions based on shortcuts caused by unintended bias in the dataset. For example, an NLI model may use lexical overlap as a shortcut to make entailment predictions due to repetitive data generation patterns from annotators, also called annotation artifacts. In this paper, we propose a causal analysis framework to help debias NLU models. We show that (1) by defining causal relationships, we can introspect how much annotation artifacts affect the outcomes; (2) with this knowledge, we can utilize counterfactual inference to mitigate bias, and we found that viewing a model as a treatment mitigates bias more effectively than viewing annotation artifacts as the treatment; and (3) in addition to bias mitigation, we can interpret how much each debiasing strategy is affected by annotation artifacts. Our experimental results show that using counterfactual inference can improve out-of-distribution performance in all settings while maintaining high in-distribution performance.
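
The lexical-overlap shortcut mentioned in this abstract can be illustrated with a small sketch. It does not reproduce the paper's causal framework; it only shows the kind of surface feature a biased NLI model might lean on. The `lexical_overlap` function and the example sentences are hypothetical, and whitespace tokenization is a simplifying assumption.

```python
def lexical_overlap(premise: str, hypothesis: str) -> float:
    """Fraction of hypothesis tokens that also appear in the premise.

    Uses lowercased whitespace tokens, a deliberate simplification.
    """
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = hypothesis.lower().split()
    if not hypothesis_tokens:
        return 0.0
    shared = sum(token in premise_tokens for token in hypothesis_tokens)
    return shared / len(hypothesis_tokens)


# Illustrative sentences: the second hypothesis reuses every premise word
# but swaps the roles, so it is not entailed by the premise.
premise = "the doctor visited the lawyer"
entailed = "the doctor visited the lawyer"
not_entailed = "the lawyer visited the doctor"

print(lexical_overlap(premise, entailed))
print(lexical_overlap(premise, not_entailed))
```

Both hypotheses share all of their tokens with the premise, so a model that predicts entailment from overlap alone would label the second pair incorrectly.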