Splitting compounds with ngrams

Naomi Tachikawa Shapiro


Abstract
Compound words with unmarked word boundaries are problematic for many tasks in NLP and computational linguistics, including information extraction, machine translation, and syllabification. This paper introduces a simple, proof-of-concept language modeling approach to automatic compound segmentation, as applied to Finnish. This approach utilizes an off-the-shelf morphological analyzer to split training words into their constituent morphemes. A language model is subsequently trained on ngrams composed of morphemes, morpheme boundaries, and word boundaries. Linguistic constraints are then used to weed out phonotactically ill-formed segmentations, thereby allowing the language model to select the best grammatical segmentation. This approach achieves an accuracy of ~97%.
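The pipeline described in the abstract (score candidate segmentations with an ngram model, after discarding ill-formed ones) can be illustrated with a minimal sketch. This is not the paper's implementation: the bigram order, the add-one smoothing, the boundary symbols `#` and `=`, the toy training data, and the vowel-based `is_well_formed` check are all assumptions chosen for illustration. In particular, `is_well_formed` is only a hypothetical stand-in for the phonotactic constraints the abstract mentions.

```python
import math
from collections import Counter

WORD = "#"  # word-boundary symbol (assumed notation)
COMP = "="  # compound-boundary symbol (assumed notation)
VOWELS = set("aeiouyäö")


class BigramModel:
    """Add-one-smoothed bigram model over morphemes and boundary symbols."""

    def __init__(self, sequences):
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        vocab = set()
        for seq in sequences:
            padded = [WORD] + list(seq) + [WORD]
            vocab.update(padded)
            self.context_counts.update(padded[:-1])
            self.bigram_counts.update(zip(padded, padded[1:]))
        self.vocab_size = len(vocab)

    def log_prob(self, seq):
        """Log-probability of a token sequence, padded with word boundaries."""
        padded = [WORD] + list(seq) + [WORD]
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            numerator = self.bigram_counts[(prev, cur)] + 1
            denominator = self.context_counts[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


def is_well_formed(seq):
    """Hypothetical filter: every part between boundaries must contain a vowel."""
    return all(any(ch in VOWELS for ch in part) for part in seq if part != COMP)


def best_segmentation(model, candidates):
    """Drop ill-formed candidates, then return the highest-scoring one."""
    allowed = [c for c in candidates if is_well_formed(c)]
    return max(allowed, key=model.log_prob)


# Toy training data: words already split into parts (assumed, not from the paper).
training = [
    ["kirja", COMP, "kauppa"],
    ["kirja", COMP, "hylly"],
    ["ruoka", COMP, "kauppa"],
    ["kauppa"],
    ["kirja"],
]
model = BigramModel(training)

# Candidate segmentations of "kirjakauppa" ("bookstore").
candidates = [
    ["kirja", COMP, "kauppa"],
    ["kirjakauppa"],
    ["kirjak", COMP, "auppa"],
    ["k", COMP, "irjakauppa"],  # rejected by the filter: "k" has no vowel
]
print(best_segmentation(model, candidates))
```

Filtering before scoring means the language model only ever chooses among candidates that pass the constraints, which mirrors the order of steps given in the abstract.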
Anthology ID:
C16-1061
Volume:
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
Month:
December
Year:
2016
Address:
Osaka, Japan
Editors:
Yuji Matsumoto, Rashmi Prasad
Venue:
COLING
Publisher:
The COLING 2016 Organizing Committee
Pages:
630–640
URL:
https://aclanthology.org/C16-1061/
Cite (ACL):
Naomi Tachikawa Shapiro. 2016. Splitting compounds with ngrams. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 630–640, Osaka, Japan. The COLING 2016 Organizing Committee.
Cite (Informal):
Splitting compounds with ngrams (Shapiro, COLING 2016)
PDF:
https://aclanthology.org/C16-1061.pdf