Beyond Characters: Subword-level Morpheme Segmentation

Ben Peters; André F. T. Martins

doi:10.18653/v1/2022.sigmorphon-1.14

Abstract

This paper presents DeepSPIN’s submissions to the SIGMORPHON 2022 Shared Task on Morpheme Segmentation. We make three submissions, all to the word-level subtask. First, we show that entmax-based sparse sequence-to-sequence models deliver large improvements over conventional softmax-based models, echoing results from other tasks. Then, we challenge the assumption that models for morphological tasks should be trained at the character level by building a transformer that generates morphemes as sequences of unigram language model-induced subwords. This subword transformer outperforms all of our character-level models and wins the word-level subtask. Although we do not make an official submission to the sentence-level subtask, we show that this subword-based approach is highly effective there as well.
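
To make the contrast between softmax-based and sparse output distributions concrete, the sketch below compares softmax with sparsemax, which is the alpha = 2 member of the entmax family. This is only an illustration under stated assumptions: the abstract does not say which entmax setting the submissions use, and the function names and the example scores are hypothetical, not taken from the paper.

```python
import math


def softmax(scores):
    """Dense baseline: every entry receives strictly positive probability."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(scores):
    """Sparsemax (entmax with alpha = 2): Euclidean projection onto the simplex.

    Low-scoring entries can receive exactly zero probability.
    """
    ordered = sorted(scores, reverse=True)
    cumulative = 0.0
    support_size = 0
    support_sum = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        # Entry k stays in the support while 1 + k * z_(k) > sum of the top-k scores.
        if 1.0 + k * value > cumulative:
            support_size = k
            support_sum = cumulative
    threshold = (support_sum - 1.0) / support_size
    return [max(s - threshold, 0.0) for s in scores]


# Illustrative scores for three candidate output symbols (made-up values).
scores = [1.0, 0.5, -1.0]
dense = softmax(scores)
sparse = sparsemax(scores)
print("softmax:  ", [round(p, 3) for p in dense])
print("sparsemax:", [round(p, 3) for p in sparse])
```

With these scores, sparsemax assigns zero probability to the lowest-scoring symbol while softmax keeps all three symbols in play, which is the kind of sparsity the abstract refers to.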

Anthology ID: 2022.sigmorphon-1.14
Volume: Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology
Month: July
Year: 2022
Address: Seattle, Washington
Editors: Garrett Nicolai, Eleanor Chodroff
Venue: SIGMORPHON
SIG: SIGMORPHON
Publisher: Association for Computational Linguistics
Pages: 131–138
URL: https://aclanthology.org/2022.sigmorphon-1.14/
DOI: 10.18653/v1/2022.sigmorphon-1.14
Cite (ACL): Ben Peters and André F. T. Martins. 2022. Beyond Characters: Subword-level Morpheme Segmentation. In Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 131–138, Seattle, Washington. Association for Computational Linguistics.
Cite (Informal): Beyond Characters: Subword-level Morpheme Segmentation (Peters & Martins, SIGMORPHON 2022)
PDF: https://aclanthology.org/2022.sigmorphon-1.14.pdf