Highland Puebla Nahuatl Speech Translation Corpus for Endangered Language Documentation

Jiatong Shi, Jonathan D. Amith, Xuankai Chang, Siddharth Dalmia, Brian Yan, Shinji Watanabe


Abstract
Documentation of endangered languages (ELs) has become increasingly urgent as thousands of languages are on the verge of disappearing by the end of the 21st century. One challenging aspect of documentation is to develop machine learning tools to automate the processing of EL audio via automatic speech recognition (ASR), machine translation (MT), or speech translation (ST). This paper presents an open-access speech translation corpus of Highland Puebla Nahuatl (glottocode high1278), an EL spoken in central Mexico. It then addresses machine learning contributions to endangered language documentation and argues for the importance of speech translation as a key element in the documentation process. In our experiments, we observed that state-of-the-art end-to-end ST models could outperform a cascaded ST (ASR > MT) pipeline when translating endangered language documentation materials.
Anthology ID:
2021.americasnlp-1.7
Volume:
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas
Month:
June
Year:
2021
Address:
Online
Venues:
AmericasNLP | NAACL
Publisher:
Association for Computational Linguistics
Pages:
53–63
URL:
https://aclanthology.org/2021.americasnlp-1.7
DOI:
10.18653/v1/2021.americasnlp-1.7
Cite (ACL):
Jiatong Shi, Jonathan D. Amith, Xuankai Chang, Siddharth Dalmia, Brian Yan, and Shinji Watanabe. 2021. Highland Puebla Nahuatl Speech Translation Corpus for Endangered Language Documentation. In Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas, pages 53–63, Online. Association for Computational Linguistics.
Cite (Informal):
Highland Puebla Nahuatl Speech Translation Corpus for Endangered Language Documentation (Shi et al., AmericasNLP 2021)
PDF:
https://aclanthology.org/2021.americasnlp-1.7.pdf
Data
MuST-C