Multi-Grained Chinese Word Segmentation

Chen Gong, Zhenghua Li, Min Zhang, Xinzhou Jiang


Abstract
Traditionally, word segmentation (WS) adopts the single-grained formalism, where a sentence corresponds to a single word sequence. However, Sproat et al. (1997) show that the inter-native-speaker consistency ratio over Chinese word boundaries is only 76%, indicating single-grained WS (SWS) imposes unnecessary challenges on both manual annotation and statistical modeling. Moreover, WS results of different granularities can be complementary and beneficial for high-level applications. This work proposes and addresses multi-grained WS (MWS). We build a large-scale pseudo MWS dataset for model training and tuning by leveraging the annotation heterogeneity of three SWS datasets. Then we manually annotate 1,500 test sentences with true MWS annotations. Finally, we propose three benchmark approaches by casting MWS as constituent parsing and sequence labeling. Experiments and analysis lead to many interesting findings.
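As a rough illustration of how single-grained segmentations of different granularities can be combined into one multi-grained annotation, the following minimal sketch merges several segmentations of the same sentence into a set of word spans. It is not the authors' data-construction procedure: it assumes that alternative segmentations of one sentence are already available, the function names (`spans_of`, `merge_granularities`, `has_crossing`) are hypothetical, and the example segmentations are made up for illustration rather than taken from any of the three SWS datasets.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # [start, end) character offsets into the sentence


def spans_of(words: List[str]) -> Set[Span]:
    """Convert one single-grained segmentation into character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans


def merge_granularities(segmentations: List[List[str]]) -> List[Span]:
    """Union the word spans of several segmentations of the same sentence."""
    sentence = "".join(segmentations[0])
    merged: Set[Span] = set()
    for words in segmentations:
        if "".join(words) != sentence:
            raise ValueError("all segmentations must cover the same sentence")
        merged |= spans_of(words)
    # Sort by start offset; at equal starts, coarser (longer) words come first.
    return sorted(merged, key=lambda s: (s[0], -s[1]))


def has_crossing(spans: List[Span]) -> bool:
    """True if two spans partially overlap, so no single hierarchy fits them."""
    return any(a[0] < b[0] < a[1] < b[1] for a in spans for b in spans)


# Illustrative (assumed) coarse- and fine-grained segmentations.
coarse = ["全国各地", "医学界", "专家"]
fine = ["全国", "各地", "医学", "界", "专家"]

sentence = "".join(coarse)
merged = merge_granularities([coarse, fine])
multi_grained_words = [sentence[s:e] for s, e in merged]
```

When the merged spans contain no crossing pair, they nest into a hierarchy, which is the property that makes a tree-style view of the annotation (as in the constituent-parsing formulation mentioned in the abstract) possible.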
Anthology ID:
D17-1072
Volume:
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
Month:
September
Year:
2017
Address:
Copenhagen, Denmark
Editors:
Martha Palmer, Rebecca Hwa, Sebastian Riedel
Venue:
EMNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
692–703
URL:
https://aclanthology.org/D17-1072
DOI:
10.18653/v1/D17-1072
Cite (ACL):
Chen Gong, Zhenghua Li, Min Zhang, and Xinzhou Jiang. 2017. Multi-Grained Chinese Word Segmentation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 692–703, Copenhagen, Denmark. Association for Computational Linguistics.
Cite (Informal):
Multi-Grained Chinese Word Segmentation (Gong et al., EMNLP 2017)
PDF:
https://aclanthology.org/D17-1072.pdf