Toward Creation of Ancash Lexical Resources from OCR

Johanna Cordova, Damien Nouvel


Abstract
The Quechua linguistic family has a limited number of NLP resources, most of them dedicated to Southern Quechua, whereas the varieties of Central Quechua have, to the best of our knowledge, no specific resources (software, lexicon or corpus). Our work addresses this issue by producing two resources for Ancash Quechua: a full digital version of a dictionary, and an OCR model adapted to this variety. In this paper, we describe the steps towards this goal: we first measure the performance of existing models on the task of digitising a Quechua dictionary, then adapt a model to the Ancash variety, and finally create a reliable resource for NLP in XML-TEI format. We hope that this work will be a basis for initiating NLP projects for Central Quechua, and that it will encourage digitisation initiatives for under-resourced languages.
Anthology ID:
2021.americasnlp-1.18
Volume:
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas
Month:
June
Year:
2021
Address:
Online
Venues:
AmericasNLP | NAACL
Publisher:
Association for Computational Linguistics
Pages:
163–167
URL:
https://aclanthology.org/2021.americasnlp-1.18
DOI:
10.18653/v1/2021.americasnlp-1.18
Cite (ACL):
Johanna Cordova and Damien Nouvel. 2021. Toward Creation of Ancash Lexical Resources from OCR. In Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas, pages 163–167, Online. Association for Computational Linguistics.
Cite (Informal):
Toward Creation of Ancash Lexical Resources from OCR (Cordova & Nouvel, AmericasNLP 2021)
PDF:
https://aclanthology.org/2021.americasnlp-1.18.pdf