Morphological Segmentation for Seneca

Zoey Liu, Robert Jimerson, Emily Prud’hommeaux


Abstract
This study takes up the task of low-resource morphological segmentation for Seneca, a critically endangered and morphologically complex Native American language primarily spoken in what is now New York State and Ontario. The labeled data in our experiments comes from two sources: one digitized from a publicly available grammar book and the other collected from informal sources. We treat these two sources as distinct domains and investigate different evaluation designs for model selection. The first design abides by standard practices and evaluates models with the in-domain development set, while the second carries out evaluation using a development domain, that is, the out-of-domain development set. Across a series of monolingual and crosslinguistic training settings, our results demonstrate the utility of a neural encoder-decoder architecture when coupled with multi-task learning.
Anthology ID:
2021.americasnlp-1.10
Volume:
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas
Month:
June
Year:
2021
Address:
Online
Editors:
Manuel Mager, Arturo Oncevay, Annette Rios, Ivan Vladimir Meza Ruiz, Alexis Palmer, Graham Neubig, Katharina Kann
Venue:
AmericasNLP
Publisher:
Association for Computational Linguistics
Pages:
90–101
URL:
https://aclanthology.org/2021.americasnlp-1.10
DOI:
10.18653/v1/2021.americasnlp-1.10
Cite (ACL):
Zoey Liu, Robert Jimerson, and Emily Prud’hommeaux. 2021. Morphological Segmentation for Seneca. In Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas, pages 90–101, Online. Association for Computational Linguistics.
Cite (Informal):
Morphological Segmentation for Seneca (Liu et al., AmericasNLP 2021)
PDF:
https://aclanthology.org/2021.americasnlp-1.10.pdf
Code:
zoeyliu18/seneca