PyThaiNLP: Thai Natural Language Processing in Python

We present PyThaiNLP, a free and open-source natural language processing (NLP) library for the Thai language, implemented in Python. It provides a wide range of software, models, and datasets for Thai. We first provide a brief historical context of tools for the Thai language prior to the development of PyThaiNLP. We then outline the functionalities it provides, as well as its datasets and pre-trained language models. We later summarize its development milestones and discuss our experience during its development. We conclude by demonstrating how industrial and research communities utilize PyThaiNLP in their work. The library is freely available at https://github.com/pythainlp/pythainlp.


Introduction
In recent years, the field of natural language processing has witnessed remarkable advancements, catalyzing breakthroughs in various applications. However, Thai has remained comparatively underserved due to the challenges posed by limited language resources (Arreerard et al., 2022).
Thai is the de facto national language of Thailand. It belongs to the Tai linguistic group within the Kra-Dai language family. According to Ethnologue (Eberhard et al., 2023), there are 60.2 million users of Central Thai, of which 20.8 million are native (2000). Including the Northern (6 million, 2004), Northeastern (15 million, 1983), and Southern (4.5 million, 2006) variants, there are an estimated 85.7 million Thai speakers around the world.
Thai is written in scriptio continua: in its most common writing style, there are neither spaces nor other marks between words or sentences (Sornlertlamvanich et al., 2000). The lack of clear word and sentence boundaries leads to ambiguity that cannot be resolved using grammatical knowledge alone (Supnithi et al., 2004).
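To make the ambiguity concrete, a single unspaced string can admit more than one valid segmentation. The toy enumeration below uses a four-word illustrative dictionary and the classic example ตากลม, which can be read as ตา|กลม ("round eyes") or ตาก|ลม ("to expose to the wind"):

```python
# Enumerate all ways to split an unspaced string into dictionary words,
# illustrating the ambiguity inherent in scriptio continua text.
# The tiny dictionary below is illustrative only.
DICT = {"ตา", "กลม", "ตาก", "ลม"}  # eye, round, to dry/expose, wind

def segmentations(text, dictionary):
    """Return every split of `text` into words found in `dictionary`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in dictionary:
            for rest in segmentations(text[end:], dictionary):
                results.append([prefix] + rest)
    return results

print(segmentations("ตากลม", DICT))
# Yields both ["ตา", "กลม"] and ["ตาก", "ลม"]: grammatically valid readings
# that only context (not spelling) can disambiguate.
```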
Although many closed-source open APIs for NLP are able to process the Thai language 1 , we believe that an open-source toolbox is essential for both researchers and practitioners, not only to access NLP capabilities but also to gain full transparency of, and trust in, both the training data and the algorithms.
2 This allows the community to adapt and further develop the functionalities as needed, a crucial step towards democratizing NLP.
This paper introduces PyThaiNLP, an open-source Thai natural language processing library written in the Python programming language. Its features span from a simple dictionary-based word tokenizer to a statistical named-entity recognizer and an instruction-following large language model. The library was first released in 2016 under the Open Source Initiative-approved Apache License 2.0, which allows free use and modification of the software, including commercial use.

Open-source Thai NLP before PyThaiNLP
Before PyThaiNLP started in 2016, some free and open-source software did exist for various Thai NLP tasks, but there was no open-source toolkit that unified multiple tools or tasks in a single library, and the number of available Thai NLP datasets was low compared to high-resource languages like Chinese, English, or German.
Natural Language Toolkit (NLTK) (Bird and Loper, 2004), one of the most comprehensive and most popular NLP libraries in Python at the time, did not support Thai. Neither did OpenNLP, another popular free and open-source NLP toolkit, written in Java.
1 Such as those provided by commercial cloud service providers and "AI for Thai", the government-funded Thai AI service platform at https://aiforthai.in.th/.
2 For a discussion about concentrated power and the political economy of 'open' AI, see Widder et al. (2023).
Open Thai language resources, like annotated corpora, were also limited in size and number. "Publicly available" datasets tended to have restricted access, through restrictive licenses 5 , a registration requirement, or both.
Because the few toolkits available were limited in documentation and performance, short of rigorous benchmarking, and/or lacking maintenance, Thai NLP researchers had to spend their limited time and resources building basic components and/or collecting datasets before they could move on to more advanced problems. The limited availability of source code and datasets also hurt reproducibility.
5 Even today, this practice continues: take, for instance, the LST20 corpus from NECTEC, which has multiple layers of linguistic annotation. However, the free version can only be used for non-commercial purposes. See https://opend-portal.nectec.or.th/en/dataset/lst20-corpus.
2002) which provide not only POS but also word boundaries.
Apart from the ones listed above, more open-source Thai word tokenizers were released after 2009 as a result of the BEST (Benchmark for Enhancing the Standard of Thai language processing) evaluations for Thai word segmentation organized by the National Electronics and Computer Technology Center (NECTEC) in 2009 (Kosawat, 2009) and 2010 6 . Unfortunately, these tokenizers are no longer maintained and were not accessible at the time of writing. The most impactful contribution from BEST, however, is the BEST-2010 word segmentation dataset, which was publicly released. This dataset provides a basis for much of the modern Thai open-source word segmentation software. We should also mention the Thai Language Toolkit (TLTK) (Aroonmanakun and Thamrongrattanarit, 2018). Its first release on the Python Package Index (version 0.3.4, February 2018) includes statistical syllable and word segmentation (Aroonmanakun, 2002), POS tagging, and spelling suggestion. Its latest version, as of writing, features discourse unit segmentation, NER, grapheme-to-phoneme conversion, IPA transcription, romanization, and more. To date, TLTK and PyThaiNLP are the only two comprehensive Thai NLP libraries for Python. However, TLTK's documentation is still quite limited. For more reviews of Thai NLP tools and datasets, including more recent ones (post-2016), see Arreerard et al. (2022).

PyThaiNLP and Its Ecosystem
Our primary objective is to ensure the user-friendliness and simplicity of the library. Drawing inspiration from NLTK, we follow numerous established interfaces, for example, word_tokenize and pos_tag. In addition, we also create datasets and pre-trained models for the Thai language. Figure 1 illustrates the overview of PyThaiNLP's functionalities and its ecosystem. Table 1 displays the development milestones of PyThaiNLP.
We will discuss here only popular features and major datasets/models.

Word and Sentence Tokenization
PyThaiNLP supports many word tokenization algorithms.
7 The default algorithm is NewMM, a dictionary-based maximum matching tokenizer (Sornlertlamvanich, 1993) that utilizes Thai character clusters (Theeramunkong et al., 2000). This pure-Python tokenizer performs reasonably well on public benchmarks. Chormai et al. (2020) demonstrated that it is the fastest word tokenizer on the BEST-2010 benchmark, with 71.18% accuracy (compared to the state of the art at 95.60%). Thanathip Suntorntip ported NewMM to the Rust programming language 8 , resulting in an even faster word tokenizer in our toolbox.
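The dictionary-based idea can be sketched as a greedy longest-matching pass; note this is a greatly simplified stand-in, not PyThaiNLP's actual NewMM, which performs true maximum matching over a trie and respects Thai character-cluster boundaries:

```python
# A greatly simplified greedy longest-matching tokenizer. The real NewMM
# minimizes the total number of tokens (maximum matching) and never splits
# inside a Thai character cluster; the dictionary here is illustrative only.
DICT = {"ตา", "กลม", "ตาก", "ลม"}

def longest_matching(text, dictionary):
    """Greedily take the longest dictionary word at each position;
    fall back to a single character for out-of-vocabulary spans."""
    tokens, i = [], 0
    while i < len(text):
        match = None
        for end in range(len(text), i, -1):
            if text[i:end] in dictionary:
                match = text[i:end]
                break
        if match is None:        # OOV: emit one character and move on
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens

print(longest_matching("ตากลม", DICT))  # → ['ตาก', 'ลม']
```

Greedy matching resolves the ambiguity by always preferring the longer prefix; a full maximum-matching tokenizer would instead search for the segmentation with the fewest tokens overall.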
For sentence tokenization, we trained a conditional random field (CRF) model, using python-crfsuite (Peng and Korobov, 2014), on translated TED transcripts, where Thai sentence boundaries are assumed to correspond to the English sentence boundaries (Lowphansirikul et al., 2021b).
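A CRF for this task consumes per-token feature dictionaries in the style expected by python-crfsuite. The sketch below shows the general shape of such features; the feature set is illustrative, not PyThaiNLP's actual one:

```python
# Token-level features for CRF-based sentence segmentation, in the
# dict-per-token format python-crfsuite consumes. The concrete feature
# set here is an illustrative assumption.
def token_features(tokens, i):
    """Features for token i: surface form, neighboring words, and simple
    shape cues; whitespace is a strong sentence-boundary hint in Thai."""
    feats = {
        "word": tokens[i],
        "is_space": tokens[i].isspace(),
        "is_digit": tokens[i].isdigit(),
        "bos": i == 0,
        "eos": i == len(tokens) - 1,
    }
    if i > 0:
        feats["prev_word"] = tokens[i - 1]
    if i < len(tokens) - 1:
        feats["next_word"] = tokens[i + 1]
    return feats

# Word-tokenized input; the model would label each token as
# boundary/non-boundary.
tokens = ["วันนี้", "อากาศ", "ดี", " ", "ฉัน", "ไป", "โรงเรียน"]
X = [token_features(tokens, i) for i in range(len(tokens))]
```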
PyThaiNLP supports the following transliteration implementations: Thai romanization using the Royal Thai General System of Transcription (RTGS), transliteration of romanized Japanese/Korean/Mandarin/Vietnamese text to Thai using the Wunsen library (cakimpei, 2022) 10 , and Thai word pronunciation.

Coreference Resolution and Entity Linking
For coreference resolution, we created Han-Coref, a Thai coreference resolution corpus and model (Phatthiyaphaibun and Limkonchotiwat, 2023).

Machine Translation
We collaborated with the VISTEC-depa Thailand Artificial Intelligence Research Institute (AIResearch.in.th) 11 to create an English-Thai translation dataset and model. The model outperformed Google Translate on an out-of-sample test set at the time of release (Lowphansirikul et al., 2021b).

Automatic Speech Recognition
In order to develop a dataset for ASR, PyThaiNLP members contributed to the development of the Common Voice corpus (Ardila et al., 2020), including Thai sentence cleanup and validation rules for its Sentence Collector 12 , an online campaign inviting people to contribute Thai sentences, and offline events for volunteers to contribute their voices and validate recordings. Utilizing Common Voice Corpus 7.0, we created a Thai ASR model in collaboration with AIResearch.in.th and achieved the lowest character error rate in a benchmark (VISTEC-depa AI Research Institute of Thailand, 2023).

VISTEC-TPTH-2020: Word Tokenization, Spell Checking and Correction
VISTEC-TPTH-2020 is a Thai word tokenization and spell checking dataset in the social media domain, the largest one to date (Limkonchotiwat et al., 2021). We collected 50,000 sentences from top trending posts on Twitter in 2020 and selected only posts with substantial character counts. It is a multi-task dataset, covering mention detection, spell checking, and spell correction.

Thai NER: Named Entity Recognition
Thai NER is a Thai named-entity recognition dataset. We curated text from various domains, including news, Wikipedia articles, and government documents, as well as text from other Thai NER datasets. The data was manually re-labeled for consistency (Phatthiyaphaibun, 2022).

Pre-trained Language Models
WangchanBERTa is an encoder-only pre-trained Thai language model. Based on public benchmarks, it is the current state of the art (Lowphansirikul et al., 2021a). It is also a collaborative work with AIResearch.in.th. WangChanGLM (Polpanumas et al., 2023) is a multilingual instruction-following model finetuned from XGLM (Lin et al., 2022).

Development Milestones
Wannaphong Phatthiyaphaibun, a high school student at the time, created PyThaiNLP in 2016 as a hobby project. He wanted to create a simple Thai chatbot in Python. He used PyICU as a word tokenizer and soon found out that the Thai language did not have a comprehensive NLP toolkit in Python like NLTK (Bird and Loper, 2004). He decided to create PyThaiNLP and hosted the project on GitHub 13 .

Community and Project Milestones
After the first few official releases, following Korakot Chaovavanich's suggestion, a "Thai Natural Language Processing" group was created as a public Facebook group 14 . It serves as the main venue to showcase PyThaiNLP's capabilities and a hub for Thai NLP researchers and practitioners to discuss the field. Today, the group has over 16,000 members and is Thailand's largest NLP interest group. This communication channel also performs a recruiting function for us. The first offline meetup of the group occurred on 24 May 2018 as a birds-of-a-feather session after a Data Science BKK meetup 15 .
Many of our main contributors, such as Charin Polpanumas and Arthit Suriyawongkul, organically joined the project from the community. At this stage, we created foundational capabilities such as word tokenization, part-of-speech tagging, subword tokenization, named-entity recognition, and word vectors. A lot of code cleaning, reorganization, and documentation also happened around 2018-2019. This included the adoption of PEP 484 type hints 16 and other Python best practices to make the code even more readable and to facilitate offline type checkers. The adoption of PyThaiNLP is reflected by the number of stars the project has received on GitHub over the years (Figure 2).
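The PEP 484 adoption mentioned above amounts to annotating public interfaces with explicit parameter and return types, roughly in this style (the signature and placeholder body here are simplified illustrations, not the library's real code):

```python
# Illustration of PEP 484 type hints on a tokenizer-like interface.
# The real function dispatches to the selected tokenization engine;
# the body below is a placeholder for this sketch only.
from typing import List

def word_tokenize(text: str, engine: str = "newmm") -> List[str]:
    """Split `text` into tokens using the named engine (placeholder)."""
    return text.split()
```

Annotations like these let offline type checkers such as mypy catch, for example, a caller passing `bytes` where `str` is expected, before the code ever runs.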

Gaining Resources for Large Language Models (2019-present)
The growing activity of PyThaiNLP development can be seen from the number of code commits to the Git repository, which reached its peak in Q4 2019 17 . In 2020, the project began a collaboration with AIResearch.in.th, whose main focus was to create and distribute open-source models and datasets. This collaboration has provided PyThaiNLP with the computational resources we need to scale up our operations, as well as additional developers to maintain the project, such as Lalita Lowphansirikul.
Under the collaboration, we have built an English-Thai sentence pair dataset and the state-of-the-art English-Thai translation model (Lowphansirikul et al., 2021b), the RoBERTa-based monolingual language model WangchanBERTa (Lowphansirikul et al., 2021a), and most recently the multilingual instruction-following model WangChanGLM (Polpanumas et al., 2023).
Due to limited computational and human resources, we prioritize features with the highest impact-to-effort ratio. For example, during 2019-2020, there were two dominant types of transformer-based language models: the encoder-only BERT family and the decoder-only GPT family. We opted to pursue the encoder-only models and trained WangchanBERTa because, at the time, they required relatively fewer resources to train and had better performance across impactful tasks such as text classification, sequence tagging, and extractive question answering. It was not until decoder-only models proved to create more added value in 2022 that we started to train such models as WangChanGLM.

Community and Infrastructure for Software Quality
It is important to note that the community contributed not only feature improvements but also documentation, including computational documentation (e.g., Jupyter notebooks), improvements to code quality and the test suite, and the streamlining of software testing and delivery. Some of these may not be visible to users but are crucial for the development of the project. The pip installation package is built and tested against the test suite on Linux, macOS, and Windows 24 . The package can then be published to the Python Package Index automatically, directly from the CI, once it passes all the tests on every platform. PyThaiNLP code coverage reached 80% towards the end of 2018, compared to under 60% in 2017. Code coverage is a metric that helps assess the quality of the test suite and therefore reflects how thoroughly the functionalities are tested. The coverage went over 90% in August 2019 and stayed stable at that level until 2022 25 .

From early 2022, we experienced a gradual drop in code coverage to 80%. The main reason is a growing number of features that require a large language model that cannot fit inside our standard GitHub-hosted runners. We had to remove some of the tests for those features. Before 2022, we also tested our library against several versions of CPython and PyPy, but this has now been reduced to only CPython 3.8 due to the lack of support for other Python versions in some of our machine learning dependencies.
24 Easy installation and consistent behavior across platforms are what we aim for. This is one of the reasons why we developed a pure-Python NewMM. The previous implementation of our default word tokenizer required marisa-trie, a trie data structure library in C++. Unfortunately, marisa-trie does not officially support the mingw32 compiler on Windows.
25 Our code coverage is measured by coverage.py, which is included in our continuous integration workflow. The coverage stats are made available online by Coveralls at: https://coveralls.io/github/PyThaiNLP/pythainlp
Some of the common code improvements we made after analyzing code coverage and other tests were the removal of unused code; fixing inconsistent behavior across operating systems; better handling of very long strings, empty strings, empty lists, null, and/or negative values; and better handling of exceptions in control flow, resulting in code that is smaller and more robust.
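The edge-case hardening described above can be sketched as defensive input handling around a tokenizer. This wrapper and its behavior for each input class are illustrative assumptions, not PyThaiNLP's actual code:

```python
# Defensive handling of empty/None/non-string input, mirroring the kinds
# of edge-case fixes described above. This wrapper is hypothetical.
def safe_tokenize(text):
    if text is None:
        raise TypeError("text must be a string, not None")
    if not isinstance(text, str):
        raise TypeError(f"text must be a string, got {type(text).__name__}")
    if text == "":
        return []            # empty input yields an empty token list
    return text.split()      # placeholder for the real tokenizer
```

Pinning down one behavior per input class (raise vs. return empty) and asserting it on every platform is what keeps downstream code simple and consistent.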

PyThaiNLP and Its Research Impact
Researchers worldwide use PyThaiNLP to work with the Thai language, for instance, for word tokenization in cross-lingual language model pretraining (Lample and Conneau, 2019), universal dependency parsing (Smith et al., 2018), and cross-lingual representation learning (Conneau et al., 2020). In addition, research and industry-grade tools, namely SEACoreNLP 26 , an open-source initiative by NLPHub of AI Singapore, and spaCy (Honnibal et al., 2020), include PyThaiNLP as part of their toolkits.

PyThaiNLP and Its Industry Impact
PyThaiNLP is used in many real-world business use cases in firms of all sizes, both domestic and international. User feedback generally highlights how the library has sped up product development cycles involving Thai NLP, as well as its effectiveness in terms of business outcomes. The most frequently used functionalities are tokenization and text normalization. We introduce here selected use cases from national and multinational firms in banking, telecommunications, insurance, retail, and software development.
Siam Commercial Bank (BKK:SCB; USD 10B market cap) is one of Thailand's largest banks. The bank operates a chatbot to automatically answer customer queries. Their data analytics team finetuned WangchanBERTa for intent classification to enhance its question-answering capabilities, as well as to detect personal information in customers' inputs in order to exclude it from their internal training sets. Moreover, the team relies on basic text processing functions such as tokenization and normalization to speed up their development process. They have also found the published performance benchmarks useful when selecting models for their tasks.
True Corporation (BKK:TRUE; 6B) is one of the two providers in Thailand's duopolistic telecommunications market. Its subsidiary, True Digital Group, uses PyThaiNLP both for digital media analysis and for its recommendation engine in production. They featurized their Thai-text content using thai2fit word vectors and saw a noticeable uplift in user engagement and subsequent business outcomes. They also combined our word vectors with Top2Vec (Angelov, 2020) to perform topic modeling and improve customer experience.
Central Retail Digital (BKK:CRC; 6B) is the digital transformation unit serving Central Retail, Thailand's largest department store operator. Their data science team used PyThaiNLP mainly to enhance search and recommendation offerings across five business units and over six million customers. Word tokenization and text normalization were used to preprocess product information and search queries as input for the product search system. Since most search systems are built for languages that use white space as a word delimiter, this preprocessing step has allowed their product search to outperform out-of-the-box solutions that are not compatible with Thai. For content-based recommendations, the team featurized product information to create a model that recommends similar products to customers.
AIA Thailand (HKG:1299; 109B global) is the Thai office of the global insurance firm AIA Group (formerly American International Assurance). Their data science team employs PyThaiNLP to analyze their inbound and outbound call logs using word tokenization, text normalization, stop word handling, and local-time-format string handling functionalities. For inbound calls, they normalize and tokenize the logs to perform topic modeling and identify critical topics of conversation, informing both automated voice bot and human staff training and allocation. This resulted in an improved percentage of calls that the voice bot fulfilled successfully and reduced call waiting times. For outbound calls, they perform keyword identification on the logs processed by PyThaiNLP to gain insights to improve customer retention.
VISAI is a VISTEC university spin-off that provides machine learning tools and consulting services. It has finetuned WangchanBERTa to perform text classification, named entity recognition, and relation extraction on its clients' unstructured data to create queryable knowledge graphs. They also use the tokenization and text normalization functionalities to facilitate text processing across all their NLP-based products.

Conclusion and Future Works
This paper introduces the PyThaiNLP library, explains its features and datasets (as illustrated in Figure 1), and discusses the community and the engineering efforts supporting the library.
By 2023, we will have implemented open-source versions of most general NLP capabilities available for English, for Thai 27 . We see the following items as the next major milestones:
• Domain-specific datasets/models: Some capabilities do not perform well in specific use cases, for instance, named-entity recognition in financial reports, medical term translation, and question answering over legal documents. We believe more domain-specific datasets and models will help close this gap.
• Robust benchmarks for Thai NLP tasks: As NLP has garnered more attention, more models and datasets, both open- and closed-source, will become available. It will, therefore, be imperative to have robust benchmarks for comparing the models' performance and the datasets' quality.
• Correctness and consistency: Search key generation (such as Soundex), sorting, and tokenization 28 have to be deterministic and strictly follow a specification, or an application may behave in an unexpected fashion. More test cases and verification might be needed for these features.
• Efficient mechanism to load and manage datasets/models: To reduce the size of the library and to cater to use in systems with restricted network connections 29 .
• Seamless integration with language-agnostic tools: The ultimate goal is for developers to no longer need PyThaiNLP, as the Thai language becomes supported by standard NLP libraries such as spaCy and Hugging Face (Wolf et al., 2020). We have begun this work by integrating our text processing functions and models into spaCy.

Acknowledgements
and Pongtachchai Panachaiboonpipop; 3) Ekapol Chuangsuwanich for academic guidance and contributions to models and datasets; 4) MacStadium for infrastructure support; and 5) NLP-OSS 2023 anonymous reviewers. We are much obliged to the free and open-source software community for software building blocks and best practices, including but not limited to NumFOCUS, fast.ai, Hugging Face, and the Thai Linux Working Group. Moreover, we thank the organizations who care enough to develop multilingual resources to accommodate low-resource languages, most notably Meta AI. Lastly, we cannot thank enough the volunteers of various open-content communities, including Wikipedia, Common Voice, TED Translators, and similar local initiatives; modern NLP would not be possible without their accumulated effort.

Limitations
In our current CI workflow, every code commit to the repository triggers an automated test suite for all supported platforms. The process can be challenging when our package depends on large language models (LLMs), because a single LLM can exhaust the memory of our free-tier CI infrastructure. Some of the components can be cached to reduce build time, but they still have to be loaded into memory in any case. This forced us to drop some LLM-related tests and sacrificed some of the library's code coverage, as discussed in Section 4.3.
Even if we had the resources to run such tests with the current design, it would be neither economical nor sustainable. An improved test design utilizing a stub, mock, or spy (proxy) test pattern that provides an offline "fake inference" can help here. These techniques have proven useful in other software testing involving expensive database/API queries or network connections. Lyra (2019) and Microsoft (2020) provide such examples, using the Python Standard Library's unittest.mock. This can reduce the number of times an LLM is actually loaded or called. The required real inference could be handled either by a non-free-tier CI plan from the same or a different provider (which should be more affordable now due to the reduced number of calls) or by a computer outside the cloud.
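The pattern can be sketched with the standard library's unittest.mock: the expensive model loader is injectable and replaced by a Mock in tests, so no real model is ever loaded. The names `load_model` and `summarize` are hypothetical, not part of PyThaiNLP's API:

```python
# Sketch: stubbing out an expensive LLM load/call in a test, using only
# the standard library. `load_model` and `summarize` are hypothetical.
from unittest import mock

def load_model(name):
    # Placeholder for an expensive loader that would pull a multi-GB model.
    raise RuntimeError("would download and load a multi-GB model")

def summarize(text, loader=load_model):
    """Production-style code path; `loader` is injectable for testing."""
    model = loader("big-thai-llm")
    return model.generate(f"summarize: {text}")

# In a test, pass a Mock loader so no real model is ever loaded:
fake_loader = mock.Mock()
fake_loader.return_value.generate.return_value = "a short summary"
result = summarize("long document ...", loader=fake_loader)
```

The Mock also acts as a spy: the test can assert exactly which model name and prompt the code under test requested, without any inference happening.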

Figure 2: Number of stars PyThaiNLP has received from GitHub users over the years.

Table 1: Notable features introduced to PyThaiNLP over the years.