Samopriya Basu


2023

Jambu: A historical linguistic database for South Asian languages
Aryaman Arora | Adam Farris | Samopriya Basu | Suresh Kolichala
Proceedings of the 20th SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology

We introduce JAMBU, a cognate database of South Asian languages which unifies dozens of previous sources in a structured and accessible format. The database includes nearly 287k lemmata from 602 lects, grouped together in 23k sets of cognates. We outline the data wrangling necessary to compile the dataset and train neural models for reflex prediction on the Indo-Aryan subset of the data. We hope that JAMBU is an invaluable resource for all historical linguists and Indologists, and look towards further improvement and expansion of the database.

2022

Computational Historical Linguistics and Language Diversity in South Asia
Aryaman Arora | Adam Farris | Samopriya Basu | Suresh Kolichala
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

South Asia is home to a plethora of languages, many of which severely lack access to new language technologies. This linguistic diversity also results in a research environment conducive to the study of comparative, contact, and historical linguistics, fields which necessitate the gathering of extensive data from many languages. We claim that data scatteredness (rather than scarcity) is the primary obstacle in the development of South Asian language technology, and suggest that the study of language history is uniquely aligned with surmounting this obstacle. We review recent developments in and at the intersection of South Asian NLP and historical-comparative linguistics, describing our and others’ current efforts in this area. We also offer new strategies towards breaking the data barrier.

2021

Bhāṣācitra: Visualising the dialect geography of South Asia
Aryaman Arora | Adam Farris | Gopalakrishnan R | Samopriya Basu
Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021

We present Bhāṣācitra, a dialect mapping system for South Asia built on a database of linguistic studies of languages of the region annotated for topic and location data. We analyse language coverage and look towards applications to typology by visualising example datasets. The application is not only meant to be useful for feature mapping, but also serves as a new kind of interactive bibliography for linguists of South Asian languages.