A Little Linguistics Goes a Long Way: Unsupervised Segmentation with Limited Language Specific Guidance

Alexander Erdmann, Salam Khalifa, Mai Oudah, Nizar Habash, Houda Bouamor


Abstract
We present de-lexical segmentation, a linguistically motivated alternative to greedy or other unsupervised methods, requiring only minimal language-specific input. Our technique involves creating a small grammar of closed-class affixes, which can be written in a few hours. The grammar overgenerates analyses for word forms attested in a raw corpus; these analyses are then disambiguated based on features of the linguistic base proposed for each form. Extending the grammar to cover orthographic, morpho-syntactic, or lexical variation is simple, making it an ideal solution for challenging corpora with noisy, dialect-inconsistent, or otherwise non-standard content. In two evaluations, we consistently outperform competitive unsupervised baselines and approach the performance of state-of-the-art supervised models trained on large amounts of data, providing evidence for the value of linguistic input during preprocessing.
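The overgenerate-then-disambiguate idea in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the authors' implementation: the English-like affix lists `PREFIXES` and `SUFFIXES`, the helper names `overgenerate` and `segment`, and in particular the disambiguation rule, which simply prefers the base shared by the most word types in the corpus. The abstract does not specify which features of the base the paper actually uses.

```python
from collections import Counter

# Hypothetical closed-class affix lists (toy, English-like; not from the paper).
PREFIXES = ["", "un", "re"]
SUFFIXES = ["", "s", "ed", "ing"]


def overgenerate(word, min_base_len=2):
    """Return every (prefix, base, suffix) split licensed by the affix lists.

    The unsegmented analysis ("", word, "") is always included, so every
    word has at least one candidate.
    """
    analyses = []
    for prefix in PREFIXES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in SUFFIXES:
            if not rest.endswith(suffix):
                continue
            base = rest[:len(rest) - len(suffix)]
            unsegmented = not prefix and not suffix
            if unsegmented or len(base) >= min_base_len:
                analyses.append((prefix, base, suffix))
    return analyses


def segment(corpus_words):
    """Choose one analysis per word type.

    Assumed heuristic: prefer the base proposed for the largest number of
    distinct word types, breaking ties in favor of the longer base.
    """
    vocab = set(corpus_words)
    candidates = {word: overgenerate(word) for word in vocab}

    # Count how many word types propose each base at least once.
    base_support = Counter()
    for analyses in candidates.values():
        for base in {b for _, b, _ in analyses}:
            base_support[base] += 1

    return {
        word: max(analyses, key=lambda a: (base_support[a[1]], len(a[1])))
        for word, analyses in candidates.items()
    }


if __name__ == "__main__":
    corpus = ["walk", "walks", "walked", "walking", "rewalking"]
    for word, analysis in sorted(segment(corpus).items()):
        print(word, analysis)
```

In this toy corpus the base "walk" is proposed for all five word types, so the sketch segments "rewalking" as re + walk + ing and not as the competing re + walking. Extending coverage to spelling or dialect variants would amount to adding entries to the affix lists, which mirrors the extensibility claim in the abstract.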
Anthology ID:
W19-4214
Volume:
Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology
Month:
August
Year:
2019
Address:
Florence, Italy
Venue:
ACL
SIG:
SIGMORPHON
Publisher:
Association for Computational Linguistics
Pages:
113–124
URL:
https://aclanthology.org/W19-4214
DOI:
10.18653/v1/W19-4214
Cite (ACL):
Alexander Erdmann, Salam Khalifa, Mai Oudah, Nizar Habash, and Houda Bouamor. 2019. A Little Linguistics Goes a Long Way: Unsupervised Segmentation with Limited Language Specific Guidance. In Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 113–124, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
A Little Linguistics Goes a Long Way: Unsupervised Segmentation with Limited Language Specific Guidance (Erdmann et al., ACL 2019)
PDF:
https://aclanthology.org/W19-4214.pdf