ThaiLMCut: Unsupervised Pretraining for Thai Word Segmentation

Suteera Seeha, Ivan Bilan, Liliana Mamani Sanchez, Johannes Huber, Michael Matuschek, Hinrich Schütze


Abstract
We propose ThaiLMCut, a semi-supervised approach for Thai word segmentation that uses a bi-directional character language model (LM) to leverage linguistic knowledge from unlabeled data. After the language model is trained on substantial unlabeled corpora, the weights of its embedding and recurrent layers are transferred to a supervised word segmentation model, which continues fine-tuning them on a word segmentation task. Our experimental results demonstrate that applying the LM always leads to a performance gain, especially when the amount of labeled data is small. In such cases, the F1 Score increased by up to 2.02%. Even on a big labeled dataset, a small improvement can still be obtained. The approach has also been shown to be very beneficial in out-of-domain settings, with a gain in F1 Score of up to 3.13%. Finally, we show that ThaiLMCut can outperform other open-source state-of-the-art models, achieving an F1 Score of 98.78% on the standard benchmark, InterBEST2009.
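The abstract reports results as F1 Score but does not say how it is computed. As an illustration only, the sketch below assumes a common convention for word segmentation: the task is cast as per-character boundary labeling, and F1 is computed over words whose start and end offsets both match the gold segmentation. The function names `boundary_labels`, `word_spans`, and `word_f1` are hypothetical and are not taken from the ThaiLMCut code.

```python
def boundary_labels(words):
    """Per-character labels: 1 for the first character of a word, else 0.

    Assumes every word is a non-empty string.
    """
    labels = []
    for word in words:
        labels.extend([1] + [0] * (len(word) - 1))
    return labels


def word_spans(words):
    """Set of (start, end) character offsets, one pair per word."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def word_f1(gold_words, pred_words):
    """Word-level F1: a predicted word counts only if both edges match gold."""
    if "".join(gold_words) != "".join(pred_words):
        raise ValueError("gold and prediction must cover the same text")
    gold, pred = word_spans(gold_words), word_spans(pred_words)
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy example (not from the paper): the prediction merges the last two words.
gold = ["ฉัน", "กิน", "ข้าว"]
pred = ["ฉัน", "กินข้าว"]
score = word_f1(gold, pred)  # precision 1/2, recall 1/3
```

Here only the first word is segmented correctly, so precision is 1/2 and recall is 1/3. Offsets are counted in Unicode code points, so Thai vowel and tone marks each count as one character.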
Anthology ID:
2020.lrec-1.858
Volume:
Proceedings of the 12th Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
6947–6957
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.858
Cite (ACL):
Suteera Seeha, Ivan Bilan, Liliana Mamani Sanchez, Johannes Huber, Michael Matuschek, and Hinrich Schütze. 2020. ThaiLMCut: Unsupervised Pretraining for Thai Word Segmentation. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 6947–6957, Marseille, France. European Language Resources Association.
Cite (Informal):
ThaiLMCut: Unsupervised Pretraining for Thai Word Segmentation (Seeha et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.858.pdf
Code:
meanna/ThaiLMCUT