Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages

C. Downey; Shannon Drizin; Levon Haroutunian; Shivin Thukral

doi:10.18653/v1/2022.acl-long.366

Abstract

We show that unsupervised sequence-segmentation performance can be transferred to extremely low-resource languages by pre-training a Masked Segmental Language Model (Downey et al., 2021) multilingually. Further, we show that this transfer can be achieved by training over a collection of low-resource languages that are typologically similar (but phylogenetically unrelated) to the target language. In our experiments, we transfer from a collection of 10 Indigenous American languages (AmericasNLP, Mager et al., 2021) to K’iche’, a Mayan language. We compare our multilingual model to a monolingual (from-scratch) baseline, as well as a model pre-trained on Quechua only. We show that the multilingual pre-trained approach yields consistent segmentation quality across target dataset sizes, exceeding the monolingual baseline in 6/10 experimental settings. Our model yields especially strong results at small target sizes, including a zero-shot performance of 20.6 F1. These results have promising implications for low-resource NLP pipelines involving human-like linguistic units, such as the sparse transcription framework proposed by Bird (2020).
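The abstract reports segmentation quality as F1. As a rough illustration only, the sketch below scores a predicted segmentation against a gold one by treating each segment as a character span and counting exact span matches. This is one common way to compute segmentation F1; it is an assumption here, not necessarily the paper's exact scoring procedure. The function names `spans` and `segmentation_f1` and the toy strings are hypothetical.

```python
def spans(segments):
    """Convert a list of segments into a set of (start, end) character spans."""
    out, pos = set(), 0
    for seg in segments:
        out.add((pos, pos + len(seg)))
        pos += len(seg)
    return out


def segmentation_f1(gold, pred):
    """Span-level F1 between a gold and a predicted segmentation of one string.

    Assumption: a predicted segment counts as correct only if both of its
    boundaries match a gold segment exactly.
    """
    gold_spans, pred_spans = spans(gold), spans(pred)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Toy example (not data from the paper): the same string "abcde"
# segmented two different ways.
gold = ["ab", "c", "de"]
pred = ["ab", "cde"]
score = segmentation_f1(gold, pred)
```

In this toy case only the span for "ab" matches, so precision is 1/2 and recall is 1/3. Scores like the 20.6 reported above are on a 0 to 100 scale, so a value from this sketch would be multiplied by 100 for comparison.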

Anthology ID: 2022.acl-long.366
Volume: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: May
Year: 2022
Address: Dublin, Ireland
Editors: Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 5331–5346
URL: https://aclanthology.org/2022.acl-long.366/
DOI: 10.18653/v1/2022.acl-long.366
Cite (ACL): C. Downey, Shannon Drizin, Levon Haroutunian, and Shivin Thukral. 2022. Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5331–5346, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal): Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages (Downey et al., ACL 2022)
PDF: https://aclanthology.org/2022.acl-long.366.pdf
