Toward a Coarse-Labeled Spoken Language Identification Dataset for Central Alaskan Yup’ik and Samoan from US Broadcast Archives

Yangyang Chen; Kyeongmin Rim; James Pustejovsky

Abstract

Publicly available spoken language identification (LID) systems provide sparse and inconsistent coverage of indigenous languages of the Americas and languages of the Pacific Islands. No system on HuggingFace covers Central Alaskan Yup’ik except the largest variant of Meta’s MMS-LID family, and only three MMS-LID variants cover Samoan, while Whisper and VoxLingua107-based models lack both despite including other Polynesian languages. We describe an ongoing effort to build a coarse-labeled LID dataset for Yup’ik and Samoan from US public broadcast archives, benchmark publicly available LID systems on it, and train a simple MLP classifier on frozen wav2vec 2.0 representations as a prototype. We report preliminary corpus statistics, off-the-shelf model performance, and prototype results. Guided by the distinctive phonological typology of the target languages, we outline a phonologically informed fine-tuning direction as future work.

Anthology ID: 2026.americasnlp-6.18
Volume: Proceedings of the Sixth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Manuel Mager, Abteen Ebrahimi, Minh Duc Bui, Robert Pugh, Arturo Oncevay, Luis Chiruzzo, Rolando Coto Solano, Shruti Rijhwani, Katharina Von Der Wense
Venues: AmericasNLP | WS
Publisher: Association for Computational Linguistics
Pages: 203–211
URL: https://aclanthology.org/2026.americasnlp-6.18/
Cite (ACL): Yangyang Chen, Kyeongmin Rim, and James Pustejovsky. 2026. Toward a Coarse-Labeled Spoken Language Identification Dataset for Central Alaskan Yup’ik and Samoan from US Broadcast Archives. In Proceedings of the Sixth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP), pages 203–211, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): Toward a Coarse-Labeled Spoken Language Identification Dataset for Central Alaskan Yup’ik and Samoan from US Broadcast Archives (Chen et al., AmericasNLP 2026)
PDF: https://aclanthology.org/2026.americasnlp-6.18.pdf
