Hybrid Neural-LLM Pipeline for Morphological Glossing in Endangered Language Documentation: A Case Study of Jungar Tuvan

Siyu Liang, Talant Mawkanuli, Gina-Anne Levow


Abstract
Interlinear glossed text (IGT) creation remains a major bottleneck in linguistic documentation and fieldwork, particularly for low-resource, morphologically rich languages. We present a hybrid automatic glossing pipeline that combines neural sequence labeling with large language model (LLM) post-correction, evaluated on Jungar Tuvan, a low-resource Turkic language. Through systematic ablation studies, we show that retrieval-augmented prompting provides substantial gains over random example selection. We further find that, in most cases, supplying a morpheme dictionary paradoxically hurts performance relative to supplying no dictionary at all, and that performance scales approximately logarithmically with the number of few-shot examples. Most significantly, our two-stage pipeline, which combines a BiLSTM-CRF model with LLM post-correction, yields substantial gains for most models and meaningfully reduces annotation workload. Drawing on these findings, we establish concrete design principles for integrating structured prediction models with LLM reasoning in morphologically complex fieldwork contexts. Our results suggest that hybrid architectures offer a promising direction for computationally light solutions to automatic linguistic annotation in endangered language documentation.
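
The retrieval-augmented prompting step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how examples are retrieved or how the prompt is worded, so everything here is an assumption made for illustration: the character-trigram cosine similarity, the function names (`retrieve_examples`, `build_prompt`), the prompt layout, and the toy data, which is invented and is not Jungar Tuvan.

```python
from collections import Counter
from math import sqrt

# Toy glossed examples (invented placeholders, NOT real Jungar Tuvan data).
POOL = [
    {"source": "kata-lo mi", "gloss": "STEM-PL 1SG"},
    {"source": "suno-ti re", "gloss": "STEM-PST 3SG"},
    {"source": "kata-ti mi", "gloss": "STEM-PST 1SG"},
]


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(max(len(padded) - n + 1, 1)))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_examples(query: str, pool: list[dict], k: int = 2) -> list[dict]:
    """Return the k pool entries most similar to the query (assumed retriever)."""
    q = char_ngrams(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, char_ngrams(ex["source"])), reverse=True)
    return ranked[:k]


def build_prompt(query: str, draft_gloss: str, examples: list[dict]) -> str:
    """Assemble a hypothetical post-correction prompt around a first-stage draft gloss."""
    shots = "\n\n".join(f"Text: {ex['source']}\nGloss: {ex['gloss']}" for ex in examples)
    return (
        "Correct the draft gloss if needed.\n\n"
        f"{shots}\n\n"
        f"Text: {query}\nDraft gloss: {draft_gloss}\nCorrected gloss:"
    )


# The draft gloss stands in for the output of a first-stage sequence labeler.
examples = retrieve_examples("kata-lo re", POOL, k=2)
prompt = build_prompt("kata-lo re", "STEM-PL 3SG", examples)
```

Replacing `retrieve_examples` with a uniform random sample from the pool would give the random-selection baseline that the abstract compares against.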
Anthology ID:
2026.fieldmatters-1.3
Volume:
Proceedings of the Fifth Workshop on NLP Applications to Field Linguistics
Month:
March
Year:
2026
Address:
Rabat, Morocco
Venues:
FieldMatters | WS
Publisher:
Association for Computational Linguistics
Pages:
16–30
URL:
https://aclanthology.org/2026.fieldmatters-1.3/
Cite (ACL):
Siyu Liang, Talant Mawkanuli, and Gina-Anne Levow. 2026. Hybrid Neural-LLM Pipeline for Morphological Glossing in Endangered Language Documentation: A Case Study of Jungar Tuvan. In Proceedings of the Fifth Workshop on NLP Applications to Field Linguistics, pages 16–30, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
Hybrid Neural-LLM Pipeline for Morphological Glossing in Endangered Language Documentation: A Case Study of Jungar Tuvan (Liang et al., FieldMatters 2026)
PDF:
https://aclanthology.org/2026.fieldmatters-1.3.pdf