Nonparametric Bayesian Semi-supervised Word Segmentation

Ryo Fujii; Ryo Domoto; Daichi Mochihashi

doi:10.1162/tacl_a_00054

Abstract

This paper presents a novel hybrid generative/discriminative model of word segmentation based on nonparametric Bayesian methods. Unlike ordinary discriminative word segmentation, which relies only on labeled data, our semi-supervised model also leverages huge amounts of unlabeled text to automatically learn new “words”, and further constrains them by using labeled data to segment non-standard texts such as those found in social networking services. Specifically, our hybrid model combines a discriminative classifier (CRF; Lafferty et al. (2001)) and unsupervised word segmentation (NPYLM; Mochihashi et al. (2009)), with a transparent exchange of information between these two model structures within the semi-supervised framework (JESS-CM; Suzuki and Isozaki (2008)). We confirmed that it can appropriately segment non-standard texts like those in Twitter and Weibo and has nearly state-of-the-art accuracy on standard datasets in Japanese, Chinese, and Thai.
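As a rough illustration of what combining a discriminative and a generative segmenter can look like, the toy sketch below sums a discriminative span score and a weighted generative word log-probability, then finds the best segmentation by dynamic programming. It is not the authors' implementation. The function names, the hand-written lexicon, the `gen_weight` parameter, and the position-based `toy_disc` rule are all hypothetical stand-ins for a trained CRF and an NPYLM, and the example uses English text only for readability.

```python
import math


def viterbi_segment(text, disc_score, gen_logp, gen_weight=1.0, max_len=8):
    """Return the best-scoring segmentation of `text` as a list of words.

    Each candidate word text[i:j] is scored as
        disc_score(text, i, j) + gen_weight * gen_logp(text[i:j])
    and word scores are summed over the whole segmentation.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[j]: best score for segmenting text[:j]
    back = [0] * (n + 1)          # back[j]: start index of the last word
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            score = best[i] + disc_score(text, i, j) + gen_weight * gen_logp(text[i:j])
            if score > best[j]:
                best[j], back[j] = score, i
    # Follow back-pointers from the end of the string to recover the words.
    words, j = [], n
    while j > 0:
        words.append(text[back[j]:j])
        j = back[j]
    return words[::-1]


# Hypothetical generative model: a tiny fixed unigram lexicon (assumption).
TOY_LEXICON = {"the": 0.3, "cat": 0.3, "sat": 0.2, "cats": 0.1, "at": 0.1}


def toy_gen_logp(word):
    """Log-probability of a word; unknown words pay a per-character penalty."""
    if word in TOY_LEXICON:
        return math.log(TOY_LEXICON[word])
    return len(word) * math.log(1e-4)


def no_disc(text, i, j):
    """Discriminative score that expresses no preference."""
    return 0.0


def toy_disc(text, i, j):
    """Hypothetical stand-in for a trained classifier: favors a boundary at offset 7."""
    return 2.5 if j == 7 else 0.0


generative_only = viterbi_segment("thecatsat", no_disc, toy_gen_logp)
combined = viterbi_segment("thecatsat", toy_disc, toy_gen_logp)
print(generative_only)
print(combined)
```

With the generative score alone, the toy lexicon prefers "the cat sat"; adding the discriminative preference for a boundary at offset 7 shifts the result to "the cats at". The point is only that each component can change the other's preferred segmentation, not how the paper's model is trained or how its components exchange information.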

Anthology ID: Q17-1013
Volume: Transactions of the Association for Computational Linguistics, Volume 5
Year: 2017
Address: Cambridge, MA
Editors: Lillian Lee, Mark Johnson, Kristina Toutanova
Venue: TACL
Publisher: MIT Press
Pages: 179–189
URL: https://aclanthology.org/Q17-1013/
DOI: 10.1162/tacl_a_00054
Cite (ACL): Ryo Fujii, Ryo Domoto, and Daichi Mochihashi. 2017. Nonparametric Bayesian Semi-supervised Word Segmentation. Transactions of the Association for Computational Linguistics, 5:179–189.
Cite (Informal): Nonparametric Bayesian Semi-supervised Word Segmentation (Fujii et al., TACL 2017)
PDF: https://aclanthology.org/Q17-1013.pdf
Video: https://aclanthology.org/Q17-1013.mp4
