NLP Pipeline for Annotating (Endangered) Tibetan and Newar Varieties

Christian Faggionato, Nathan Hill, Marieke Meelen


Abstract
In this paper we present our work in progress on a fully implemented pipeline for creating deeply annotated corpora of several historical and contemporary Tibetan and Newar varieties. Our off-the-shelf tools allow researchers to create corpora with five different layers of annotation, ranging from morphosyntactic to information-structural annotation. We build on and optimise existing tools (in line with FAIR principles) and develop new ones, and we show how they can be adapted to other Tibetan and Newar languages, most notably modern endangered languages that are both extremely low-resourced and under-researched.
Anthology ID:
2022.eurali-1.1
Volume:
Proceedings of the Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia within the 13th Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Atul Kr. Ojha, Sina Ahmadi, Chao-Hong Liu, John P. McCrae
Venue:
EURALI
Publisher:
European Language Resources Association
Pages:
1–6
URL:
https://aclanthology.org/2022.eurali-1.1
Cite (ACL):
Christian Faggionato, Nathan Hill, and Marieke Meelen. 2022. NLP Pipeline for Annotating (Endangered) Tibetan and Newar Varieties. In Proceedings of the Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia within the 13th Language Resources and Evaluation Conference, pages 1–6, Marseille, France. European Language Resources Association.
Cite (Informal):
NLP Pipeline for Annotating (Endangered) Tibetan and Newar Varieties (Faggionato et al., EURALI 2022)
PDF:
https://aclanthology.org/2022.eurali-1.1.pdf
Code:
lothelanor/actib