Automated Quality Control for Language Documentation: Detecting Phonotactic Inconsistencies in a Kokborok Wordlist

Kellen Parker van Dam, Abishek Stephen


Abstract
Lexical data collected for language documentation often contain transcription errors and borrowings that can mislead linguistic analysis. We present unsupervised methods for identifying phonotactic inconsistencies in wordlists and apply them to a multilingual dataset of Kokborok varieties alongside Bangla. Using phoneme-level and syllable-level n-gram language models, our approach flags potential transcription errors and borrowings. We evaluate the methods against a hand-annotated gold standard, ranking the phonotactic outliers and reporting precision and recall at K. The ranking gives field linguists a way to flag entries that require verification, supporting data quality improvement in low-resource language documentation.
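The ranking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a phoneme-level bigram model with add-one smoothing, treats each character as one phoneme, scores a word by its mean negative log probability, and uses an invented toy wordlist and gold set (not Kokborok data). All function names are hypothetical, and the syllable-level model is not covered.

```python
import math
from collections import Counter

def train_bigram(words):
    """Count phoneme bigrams, padding each word with boundary symbols."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for w in words:
        seq = ["<s>"] + list(w) + ["</s>"]
        vocab.update(seq)
        for a, b in zip(seq, seq[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    return bigrams, contexts, len(vocab)

def surprisal(word, model):
    """Mean negative log probability per transition (add-one smoothed)."""
    bigrams, contexts, v = model
    seq = ["<s>"] + list(word) + ["</s>"]
    total = 0.0
    for a, b in zip(seq, seq[1:]):
        p = (bigrams[(a, b)] + 1) / (contexts[a] + v)
        total -= math.log(p)
    return total / (len(seq) - 1)

def rank_outliers(words, model):
    """Most phonotactically surprising words first."""
    return sorted(words, key=lambda w: surprisal(w, model), reverse=True)

def precision_recall_at_k(ranked, gold, k):
    """Precision and recall of the top-k ranked items against a gold set."""
    hits = sum(1 for w in ranked[:k] if w in gold)
    return hits / k, hits / len(gold)

# Invented toy data: five similar forms and one form with an unusual cluster.
words = ["kami", "kama", "mika", "maka", "kima", "strak"]
gold = {"strak"}  # hypothetical hand annotation

model = train_bigram(words)  # unsupervised: trained on the wordlist itself
ranked = rank_outliers(words, model)
for k in (1, 2):
    print(k, precision_recall_at_k(ranked, gold, k))
```

In this toy setting the form with the unattested onset cluster rises to the top of the ranking, which is the kind of entry a field linguist would then check by hand.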
Anthology ID:
2026.fieldmatters-1.1
Volume:
Proceedings of the Fifth Workshop on NLP Applications to Field Linguistics
Month:
March
Year:
2026
Address:
Rabat, Morocco
Venues:
FieldMatters | WS
Publisher:
Association for Computational Linguistics
Pages:
1–7
URL:
https://aclanthology.org/2026.fieldmatters-1.1/
Cite (ACL):
Kellen Parker van Dam and Abishek Stephen. 2026. Automated Quality Control for Language Documentation: Detecting Phonotactic Inconsistencies in a Kokborok Wordlist. In Proceedings of the Fifth Workshop on NLP Applications to Field Linguistics, pages 1–7, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
Automated Quality Control for Language Documentation: Detecting Phonotactic Inconsistencies in a Kokborok Wordlist (van Dam & Stephen, FieldMatters 2026)
PDF:
https://aclanthology.org/2026.fieldmatters-1.1.pdf