Jambu: A historical linguistic database for South Asian languages

Aryaman Arora, Adam Farris, Samopriya Basu, Suresh Kolichala


Abstract
We introduce JAMBU, a cognate database of South Asian languages which unifies dozens of previous sources in a structured and accessible format. The database includes nearly 287k lemmata from 602 lects, grouped together in 23k sets of cognates. We outline the data wrangling necessary to compile the dataset and train neural models for reflex prediction on the Indo-Aryan subset of the data. We hope that JAMBU is an invaluable resource for all historical linguists and Indologists, and look towards further improvement and expansion of the database.
Anthology ID:
2023.sigmorphon-1.8
Volume:
Proceedings of the 20th SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Garrett Nicolai, Eleanor Chodroff, Frederic Mailhot, Çağrı Çöltekin
Venue:
SIGMORPHON
SIG:
SIGMORPHON
Publisher:
Association for Computational Linguistics
Pages:
68–77
URL:
https://aclanthology.org/2023.sigmorphon-1.8
DOI:
10.18653/v1/2023.sigmorphon-1.8
Cite (ACL):
Aryaman Arora, Adam Farris, Samopriya Basu, and Suresh Kolichala. 2023. Jambu: A historical linguistic database for South Asian languages. In Proceedings of the 20th SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 68–77, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Jambu: A historical linguistic database for South Asian languages (Arora et al., SIGMORPHON 2023)
PDF:
https://aclanthology.org/2023.sigmorphon-1.8.pdf