A Masked Segmental Language Model for Unsupervised Natural Language Segmentation

C.M. Downey, Fei Xia, Gina-Anne Levow, Shane Steinert-Threlkeld


Abstract
We introduce a Masked Segmental Language Model (MSLM) for joint language modeling and unsupervised segmentation. While near-perfect supervised methods have been developed for segmenting human-like linguistic units in resource-rich languages such as Chinese, many of the world’s languages are morphologically complex and lack large datasets of “gold” segmentations for supervised training. Segmental Language Models (SLMs) offer a distinctive approach by performing unsupervised segmentation as a byproduct of a neural language modeling objective. However, current SLMs are limited in scalability by their recurrent architecture. We propose a new type of SLM for use in both unsupervised and lightly supervised segmentation tasks. The MSLM is built on a span-masking transformer architecture, which harnesses a masked bidirectional modeling context and attention and adds the potential for model scalability. In a series of experiments, our model achieves better segmentation quality than recurrent SLMs on Chinese and performs similarly to the recurrent model on English.
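
To illustrate how segmentation can fall out of a language modeling objective, the following is a minimal sketch of the dynamic program that segmental language models generally rely on: the probability of a sequence is the sum over all of its segmentations, and the best segmentation is recovered by swapping the sum for a max. This is not the authors' implementation. The toy lexicon `TOY_LOGP`, the fallback cost `CHAR_LOGP`, the `max_len` limit, and all function names are assumptions made for illustration; an actual MSLM would score each candidate segment with a span-masking transformer rather than a lookup table.

```python
import math

# Hypothetical toy segment scores (log-probabilities). A real segmental LM
# would compute these with a neural network conditioned on context.
TOY_LOGP = {"the": math.log(0.4), "cat": math.log(0.3), "sat": math.log(0.2)}
CHAR_LOGP = math.log(0.01)  # assumed fallback cost per character


def segment_logp(segment):
    """Log-score of one candidate segment under the toy model."""
    return TOY_LOGP.get(segment, CHAR_LOGP * len(segment))


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def marginal_logp(text, max_len=4):
    """Log-score of `text` summed over all segmentations (forward pass)."""
    alpha = [0.0] + [-math.inf] * len(text)
    for t in range(1, len(text) + 1):
        # Every segment of length k that ends at position t.
        scores = [
            alpha[t - k] + segment_logp(text[t - k:t])
            for k in range(1, min(max_len, t) + 1)
        ]
        alpha[t] = logsumexp(scores)
    return alpha[-1]


def best_segmentation(text, max_len=4):
    """Highest-scoring segmentation (Viterbi: max instead of sum)."""
    best = [0.0] + [-math.inf] * len(text)
    back = [0] * (len(text) + 1)  # length of the best segment ending at t
    for t in range(1, len(text) + 1):
        for k in range(1, min(max_len, t) + 1):
            score = best[t - k] + segment_logp(text[t - k:t])
            if score > best[t]:
                best[t], back[t] = score, k
    # Follow the back-pointers from the end of the text to the start.
    segments = []
    t = len(text)
    while t > 0:
        segments.append(text[t - back[t]:t])
        t -= back[t]
    return segments[::-1]


print(best_segmentation("thecatsat"))  # ['the', 'cat', 'sat']
```

In this sketch, training would maximize `marginal_logp` over unsegmented text, and `best_segmentation` would be applied afterwards to read off the segmentation the model prefers.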
Anthology ID:
2022.sigmorphon-1.5
Volume:
Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology
Month:
July
Year:
2022
Address:
Seattle, Washington
Editors:
Garrett Nicolai, Eleanor Chodroff
Venue:
SIGMORPHON
SIG:
SIGMORPHON
Publisher:
Association for Computational Linguistics
Pages:
39–50
URL:
https://aclanthology.org/2022.sigmorphon-1.5
DOI:
10.18653/v1/2022.sigmorphon-1.5
Cite (ACL):
C.M. Downey, Fei Xia, Gina-Anne Levow, and Shane Steinert-Threlkeld. 2022. A Masked Segmental Language Model for Unsupervised Natural Language Segmentation. In Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 39–50, Seattle, Washington. Association for Computational Linguistics.
Cite (Informal):
A Masked Segmental Language Model for Unsupervised Natural Language Segmentation (Downey et al., SIGMORPHON 2022)
PDF:
https://aclanthology.org/2022.sigmorphon-1.5.pdf
Code:
cmdowney88/SegmentalLMs