Beyond Characters: Subword-level Morpheme Segmentation

Ben Peters, Andre F. T. Martins


Abstract
This paper presents DeepSPIN’s submissions to the SIGMORPHON 2022 Shared Task on Morpheme Segmentation. We make three submissions, all to the word-level subtask. First, we show that entmax-based sparse sequence-to-sequence models deliver large improvements over conventional softmax-based models, echoing results from other tasks. Then, we challenge the assumption that models for morphological tasks should be trained at the character level by building a transformer that generates morphemes as sequences of unigram language model-induced subwords. This subword transformer outperforms all of our character-level models and wins the word-level subtask. Although we do not make an official submission to the sentence-level subtask, we show that this subword-based approach is highly effective there as well.
Anthology ID:
2022.sigmorphon-1.14
Volume:
Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology
Month:
July
Year:
2022
Address:
Seattle, Washington
Editors:
Garrett Nicolai, Eleanor Chodroff
Venue:
SIGMORPHON
SIG:
SIGMORPHON
Publisher:
Association for Computational Linguistics
Pages:
131–138
URL:
https://aclanthology.org/2022.sigmorphon-1.14
DOI:
10.18653/v1/2022.sigmorphon-1.14
Cite (ACL):
Ben Peters and Andre F. T. Martins. 2022. Beyond Characters: Subword-level Morpheme Segmentation. In Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 131–138, Seattle, Washington. Association for Computational Linguistics.
Cite (Informal):
Beyond Characters: Subword-level Morpheme Segmentation (Peters & Martins, SIGMORPHON 2022)
PDF:
https://aclanthology.org/2022.sigmorphon-1.14.pdf
Data
UniMorph 4.0