Subword Segmental Language Modelling for Nguni Languages

Francois Meyer, Jan Buys


Abstract
Subwords have become the standard units of text in NLP, enabling efficient open-vocabulary models. With algorithms like byte-pair encoding (BPE), subword segmentation is viewed as a preprocessing step applied to the corpus before training. This can lead to sub-optimal segmentations for low-resource languages with complex morphologies. We propose a subword segmental language model (SSLM) that learns how to segment words while being trained for autoregressive language modelling. By unifying subword segmentation and language modelling, our model learns subwords that optimise LM performance. We train our model on the 4 Nguni languages of South Africa. These are low-resource agglutinative languages, so subword information is critical. As an LM, SSLM outperforms existing approaches such as BPE-based models on average across the 4 languages. Furthermore, it outperforms standard subword segmenters on unsupervised morphological segmentation. We also train our model as a word-level sequence model, resulting in an unsupervised morphological segmenter that outperforms existing methods by a large margin for all 4 languages. Our results show that learning subword segmentation is an effective alternative to existing subword segmenters, enabling the model to discover morpheme-like subwords that improve its LM capabilities.
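The abstract does not say how segmentation and language modelling are unified. One common approach in segmental language models, assumed here purely for illustration, is to sum the probability of a word over all of its possible segmentations with a forward dynamic program. The sketch below shows that idea only. The names `marginal_log_prob`, `toy_segment_log_prob` and `max_len` are hypothetical, and the toy scorer stands in for whatever neural scorer a real model would use.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-values."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def marginal_log_prob(word, segment_log_prob, max_len=4):
    """Log-probability of `word` summed over all segmentations.

    alpha[t] holds the log-probability of generating word[:t],
    marginalised over every way of splitting it into segments
    of at most `max_len` characters.
    """
    alpha = [float("-inf")] * (len(word) + 1)
    alpha[0] = 0.0  # empty prefix has probability 1
    for t in range(1, len(word) + 1):
        terms = []
        for j in range(max(0, t - max_len), t):
            # Extend any segmentation of word[:j] with the segment word[j:t].
            terms.append(alpha[j] + segment_log_prob(word[:j], word[j:t]))
        alpha[t] = log_sum_exp(terms)
    return alpha[len(word)]


def toy_segment_log_prob(history, segment):
    """Hypothetical scorer: ignores history, charges log(0.1) per character."""
    return len(segment) * math.log(0.1)


score = marginal_log_prob("ngiyabonga", toy_segment_log_prob)
```

Because the marginal is a differentiable function of the segment scores, a model trained on it can shift probability towards segmentations that help prediction, which is the sense in which segmentation is learned rather than fixed in preprocessing.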
Anthology ID:
2022.findings-emnlp.494
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2022
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
6636–6649
URL:
https://aclanthology.org/2022.findings-emnlp.494
DOI:
10.18653/v1/2022.findings-emnlp.494
Cite (ACL):
Francois Meyer and Jan Buys. 2022. Subword Segmental Language Modelling for Nguni Languages. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 6636–6649, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
Subword Segmental Language Modelling for Nguni Languages (Meyer & Buys, Findings 2022)
PDF:
https://aclanthology.org/2022.findings-emnlp.494.pdf
Video:
https://aclanthology.org/2022.findings-emnlp.494.mp4