ThaiLMCut: Unsupervised Pretraining for Thai Word Segmentation

Suteera Seeha; Ivan Bilan; Liliana Mamani Sanchez; Johannes Huber; Michael Matuschek; Hinrich Schütze

Abstract

We propose ThaiLMCut, a semi-supervised approach for Thai word segmentation which utilizes a bi-directional character language model (LM) as a way to leverage useful linguistic knowledge from unlabeled data. After the language model is trained on substantial unlabeled corpora, the weights of its embedding and recurrent layers are transferred to a supervised word segmentation model, which continues fine-tuning them on a word segmentation task. Our experimental results demonstrate that applying the LM always leads to a performance gain, especially when the amount of labeled data is small. In such cases, the F1 Score increased by up to 2.02%. Even on a big labeled dataset, a small improvement can still be obtained. The approach has also been shown to be very beneficial for out-of-domain settings, with a gain in F1 Score of up to 3.13%. Finally, we show that ThaiLMCut can outperform other open source state-of-the-art models, achieving an F1 Score of 98.78% on the standard benchmark, InterBEST2009.
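
The weight transfer described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the parameter names (`embedding.weight`, `recurrent.weight`, `lm_head.weight`, `boundary_classifier.weight`) and the helper `transfer_pretrained_weights` are hypothetical, and plain Python dictionaries and lists stand in for a deep learning framework's parameter containers and tensors. It only shows the idea that the embedding and recurrent layers are initialised from the pretrained LM while the task-specific output layer is not.

```python
import copy

# Hypothetical layer-name prefixes for the layers shared between the
# language model and the segmentation model (an assumption for illustration).
SHARED_PREFIXES = ("embedding.", "recurrent.")


def transfer_pretrained_weights(lm_params, seg_params, shared_prefixes=SHARED_PREFIXES):
    """Return a copy of seg_params whose shared layers are initialised from lm_params.

    Only parameters whose names start with one of the shared prefixes and
    that exist in both models are copied; everything else is left as-is.
    """
    merged = copy.deepcopy(seg_params)
    transferred = []
    for name, value in lm_params.items():
        if name.startswith(shared_prefixes) and name in merged:
            merged[name] = copy.deepcopy(value)
            transferred.append(name)
    return merged, sorted(transferred)


# Toy "pretrained" language model parameters (nested lists stand in for tensors).
lm_params = {
    "embedding.weight": [[0.1, 0.2], [0.3, 0.4]],
    "recurrent.weight": [[0.5, 0.6], [0.7, 0.8]],
    "lm_head.weight": [[0.9, 1.0]],  # LM-specific output layer, not transferred
}

# Freshly initialised segmentation model parameters.
seg_params = {
    "embedding.weight": [[0.0, 0.0], [0.0, 0.0]],
    "recurrent.weight": [[0.0, 0.0], [0.0, 0.0]],
    "boundary_classifier.weight": [[0.0, 0.0]],  # task-specific layer, kept as initialised
}

merged, transferred = transfer_pretrained_weights(lm_params, seg_params)
print(transferred)  # ['embedding.weight', 'recurrent.weight']
```

In this sketch the merged parameters would then be fine-tuned on labeled segmentation data, which corresponds to the "continues fine-tuning" step in the abstract; the training loop itself is omitted.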

Anthology ID: 2020.lrec-1.858
Volume: Proceedings of the Twelfth Language Resources and Evaluation Conference
Month: May
Year: 2020
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 6947–6957
Language: English
URL: https://aclanthology.org/2020.lrec-1.858/
Cite (ACL): Suteera Seeha, Ivan Bilan, Liliana Mamani Sanchez, Johannes Huber, Michael Matuschek, and Hinrich Schütze. 2020. ThaiLMCut: Unsupervised Pretraining for Thai Word Segmentation. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6947–6957, Marseille, France. European Language Resources Association.
Cite (Informal): ThaiLMCut: Unsupervised Pretraining for Thai Word Segmentation (Seeha et al., LREC 2020)
PDF: https://aclanthology.org/2020.lrec-1.858.pdf
