From Segmentation to Analyses: a Probabilistic Model for Unsupervised Morphology Induction

Toms Bergmanis; Sharon Goldwater


Abstract

A major motivation for unsupervised morphological analysis is to reduce the sparse data problem in under-resourced languages. Most previous work focuses on segmenting surface forms into their constituent morphs (taking: tak +ing), but surface form segmentation does not solve the sparse data problem, as the analyses of take and taking are not connected to each other. We present a system that adapts the MorphoChains system (Narasimhan et al., 2015) to provide morphological analyses that aim to abstract over spelling differences in functionally similar morphs. This results in analyses that are neither compelled to use all the orthographic material of a word (stopping: stop +ing) nor limited to only that material (acidified: acid +ify +ed). On average across six typologically varied languages, our system has a similar or better F-score on EMMA (a measure of underlying morpheme accuracy) than three strong baselines; moreover, the total number of distinct morphemes identified by our system is on average 12.8% lower than for Morfessor (Virpioja et al., 2013), a state-of-the-art surface segmentation system.

Anthology ID: E17-1032
Volume: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers
Month: April
Year: 2017
Address: Valencia, Spain
Editors: Mirella Lapata, Phil Blunsom, Alexander Koller
Venue: EACL
Publisher: Association for Computational Linguistics
Pages: 337–346
URL: https://aclanthology.org/E17-1032/
Cite (ACL): Toms Bergmanis and Sharon Goldwater. 2017. From Segmentation to Analyses: a Probabilistic Model for Unsupervised Morphology Induction. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 337–346, Valencia, Spain. Association for Computational Linguistics.
Cite (Informal): From Segmentation to Analyses: a Probabilistic Model for Unsupervised Morphology Induction (Bergmanis & Goldwater, EACL 2017)
PDF: https://aclanthology.org/E17-1032.pdf
