Bhāṣācitra: Visualising the dialect geography of South Asia

We present Bhāṣācitra, a dialect mapping system for South Asia built on a database of linguistic studies of languages of the region annotated for topic and location data. We analyse language coverage and look towards applications to typology by visualising example datasets. The application is not only meant to be useful for feature mapping, but also serves as a new kind of interactive bibliography for linguists of South Asian languages.


Introduction
South Asia is extremely linguistically diverse. A common saying, found in several languages of the region, illustrates this diversity; it is given in Hindi below.
kos kos par pānī badle, cār kos par bānī.
'The taste of water changes every mile, and the language every four.'
One issue with this vast scale of diversity is the difficulty it poses for linguists in collecting and cataloguing linguistic data, which further impedes comprehensive typological analysis. India alone contains known living speakers of 461 languages (Eberhard et al., 2021). It is also difficult to assess the availability of linguistic literature for all of these languages, leading to gaps in the typological databases we end up compiling. Print linguistic bibliographies for the region become outdated as new work is published and do not encode useful metadata, such as the specific dialect studied in each work or the linguistic features studied.
In this paper we present Bhāṣācitra, a database of linguistic sources for South Asian languages that we have compiled and annotated, as well as a dialect mapping and visualising system built from the location data extracted from those sources. Currently it includes 1104 labelled sources covering 311 lects. The site is online at http://aryamanarora.github.io/bhasacitra.

Background and related work
Dialects are defined by isoglosses, geographical boundaries separating linguistic features. The mapping of dialect geography is a well-established problem in linguistics, and has been done for many languages; two illustrative examples are English (Orton et al., 1998; Kretzschmar, 2001) and Japanese (Kumagai, 2016). Dialect mapping is instrumentally important for the study of historical-comparative linguistics, since the present-day geography of isoglosses is a result of past language change and language contact. The distribution of synchronic features is data for theories of diachronic language change.
Computational approaches to dialect geography have worked on many parts of the issue, including the compilation of broad databases of linguistic features (Dryer and Haspelmath, 2013; Carling et al., 2018), dialect identification and clustering on modern social media corpora (Abdul-Mageed et al., 2018; Jones, 2015), and statistical modelling of dialect groups (e.g. Murawaki, 2020).
South Asia is a linguistic area (Masica, 1993; Bashir, 2016), a region of typological convergence due to historical contact between speakers of languages of different families. Families represented in South Asia are Indo-European, Dravidian, Austroasiatic, Sino-Tibetan, and some unclassified isolates (Nihali, Kusunda, and Burushaski).
Visualisation of data for linguistic typology has a long history, beginning with the first lexical isogloss maps created by aggregating data from dialect surveys and with more recent work specifically for visualising historical change, such as Kalouli et al. (2019). As linguists adopt computational methods that deal with vast amounts of data, it becomes a challenge for humans to interpret datasets. Modern approaches to visualisation like Visual Analytics (VA) try to address this issue (Keim et al., 2008; MacEachren, 2017).
The use of point-based mapping in linguistic data visualisation is well known, e.g. in WALS (Dryer and Haspelmath, 2013). This format has been used to map data in South Asian languages (Arsenault, 2017; Liljegren et al., 2021) as well as the languages of Iran (Anonby et al., 2018, 2019). We develop this paradigm further to map areal language extents based on the location data in published linguistic fieldwork.

Data model
We built Bhāṣācitra to be an easy-to-use system for researchers with no computational background. We implemented the application in JavaScript on a statically-hosted webpage. There are three data files in JSON format, for reference metadata (in BibTeX-compatible format with additional fields for location and topic information; see appendix A), language metadata (traditional genetic classification and coordinates for reference locations), and the typological database (containing per-language per-location data).
Figure 2: Hovering on the circle for Marwari (a language of Rajasthan, India) highlights the regions from which linguistic sources for it draw data.
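To make the three-file layout concrete, here is a minimal sketch in Python of what one record of each kind might look like. Every field name, identifier, and value below is a hypothetical placeholder chosen for illustration; the actual schema is the one given in appendix A and may differ.

```python
import json

# Hypothetical reference record: BibTeX-style fields plus the extra
# location and topic annotations described in the text.
reference = {
    "id": "example2020grammar",          # placeholder citation key
    "type": "book",
    "title": "A Grammar of an Example Lect",
    "author": "Example, A.",
    "year": "2020",
    "language": ["example-lect"],        # placeholder lect identifier
    "location": ["Example Town"],        # where the data was collected
    "topics": ["phonology"],             # placeholder topic label
}

# Hypothetical language record: classification plus reference coordinates.
language = {
    "example-lect": {
        "family": "Indo-European",
        "locations": {"Example Town": [26.0, 73.0]},  # placeholder lat, lon
    }
}

# Hypothetical typological entry: one value per language per location.
typology = {"example-lect": {"Example Town": {"some_feature": True}}}


def sources_for(lect, references):
    """Return the citation keys of all references annotated for `lect`."""
    return [ref["id"] for ref in references if lect in ref["language"]]


# The records are plain JSON-serialisable structures.
round_tripped = json.loads(json.dumps(reference))
```

Keeping each record as plain JSON means a static webpage can load the files directly, without a server-side database.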
The primary interface is an interactive map displaying geographical points corresponding to locations from which language data has been collected. The map is generated and manipulated using the D3.js library, which has a complete pipeline for web cartography (Bostock et al., 2011). Dialect zones are partitioned using the Voronoi algorithm; for a point $P_k$ in the set of points $P$, its Voronoi region $R_k$ is defined as all points closer to $P_k$ than to any other point in $P$.
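The Voronoi definition can be illustrated with a brute-force sketch in Python. This is not the D3.js implementation used by the application: it assigns each query point to its nearest site, assuming plain Euclidean distance on two-dimensional coordinates. The function names and the toy coordinates are hypothetical.

```python
import math


def nearest_site(point, sites):
    """Return the index k of the site P_k closest to `point`.

    A point lies in Voronoi region R_k exactly when P_k is its nearest site.
    Ties are resolved in favour of the lowest index.
    """
    return min(range(len(sites)), key=lambda k: math.dist(point, sites[k]))


def label_grid(sites, xs, ys):
    """Label every (x, y) grid cell with the index of its Voronoi region."""
    return [[nearest_site((x, y), sites) for x in xs] for y in ys]


# Toy sites; in the application these would be projected data locations.
sites = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
grid = label_grid(sites, xs=[0, 1, 2, 3, 4], ys=[0, 1, 2, 3, 4])
```

A library routine computes the polygon boundaries directly rather than testing every point, but the partition it produces follows the same nearest-site rule.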
In the primary interface (see figure 3), zones are colour-coded by consensus genetic classification of the languages covering the zone, with circles (with size proportional to the number of sources) centered at the weighted average of the coordinates of descriptions of the languages. In the case where multiple languages share a zone, the RGB components of the colouring are averaged.
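The placement and colouring rules above can be sketched as follows. This Python sketch assumes that each location is weighted by its number of sources and that RGB components are averaged with equal weight per language; both are our reading of the description rather than confirmed details of the implementation. All names and values are hypothetical.

```python
def weighted_centroid(points):
    """Source-weighted mean of (lat, lon, n_sources) triples."""
    total = sum(n for _, _, n in points)
    lat = sum(la * n for la, _, n in points) / total
    lon = sum(lo * n for _, lo, n in points) / total
    return lat, lon


def average_rgb(colours):
    """Component-wise mean of RGB triples, rounded to integers."""
    count = len(colours)
    return tuple(round(sum(c[i] for c in colours) / count) for i in range(3))


# Toy example: two locations for one lect, the second with three sources.
centre = weighted_centroid([(10.0, 20.0, 1), (20.0, 40.0, 3)])

# Toy example: a zone shared by two languages with different family colours.
shared_colour = average_rgb([(200, 0, 0), (100, 0, 50)])
```

The weighted mean pulls each language circle towards the locations with the most sources, so the circle marks where the documentation is concentrated, not the geometric middle of the language's extent.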

Interface
The primary interface map is fully interactive (draggable and zoomable). Hovering over a language circle shows all the geographical points and Voronoi polygons associated with the sources compiled for that language (see figure 2). Like the language circles, each geographical point's size is weighted by the number of sources corresponding to it. Clicking on a language circle brings up the scrollable bibliography for that language, with each entry in human-readable format with the corresponding location and topic annotations appended.

Limitations
In South Asia (as elsewhere), geography is hardly the only variable encoding language use. As noted by Deo (2018) and shown in sociolinguistic studies (Gumperz, 1958), factors such as caste, social status, political affiliation, and religion play a large role in language use and adoption. Migrant speaker communities have also developed distinct dialects even in regions where they are a minority language group (e.g. Marathi speakers in Thanjavur and Burushaski speakers in Srinagar).
To deal with geographical overlap (different language sources for the same location), we allowed the areal zones of multiple languages to encompass the same location. A complete solution to the limitations of the geographical model would require collection of demographic data indexed to language use, which has not yet been collected on a large scale in South Asia.

Compiling the database
There are some existing bibliographies of language references for South Asia. In compiling data for Bhāṣācitra, we prioritised the incorporation of sources that provided the greatest coverage of language information, such as grammars and grammatical sketches, analysed corpora, and sociolinguistic surveys.
We began with data from Glottolog for broad coverage (Hammarström et al., 2020); South Asia-specific sources we drew from are Peterson (2018), Baart and Baart-Bremer (2001), and Perera (2021). We then searched for literature not included in existing bibliographies. Many new sources were obtained from Shodhganga, a platform for open-access digitised theses completed at Indian universities. These theses were difficult to access before the past decade, so from this resource we were able to incorporate many new references.
We annotated information on topic coverage for every source (see table 1) and location data (see §4.1) when possible. We also preferred to link to open-access versions of sources. In total, we compiled 1104 sources describing 311 lects with data collected from 763 locations. This number is continually increasing as we actively improve our coverage of the linguistic literature and new work is published.

Locations
The primary new contribution of the Bhāṣācitra database is location data manually collected from the included references, shown in figure 3. The geocoding of the locations was done through the Google Maps API and manually verified. While databases such as Glottolog and WALS do include location data for languages, their representation reduces the language's geographical distribution to a single point. We instead represent multiple points per language based on data from the sources we catalogued.
For example, in Glottolog, Hindi is placed at a single point in central India, whereas in Bhāṣācitra there are 21 locations associated with Hindi-Urdu, with most sources describing the standard dialect in Delhi, but also work dealing with varieties in Varanasi, Lahore, and the rural regions surrounding Delhi. Areal mapping of linguistic references allows for better assessment of the coverage of dialects in our sources, and for explicit coverage of dialect variation when mapping features.

Mapping datasets
To illustrate the value of areal visualisation of language features, we mapped two datasets: the phoneme inventories of a large number of Indian languages from Ramaswami (1999), and the outcomes of selected sound changes from Sanskrit to the modern Indo-Aryan languages based on the Jambu database (Arora and Farris, 2021) parsed from Turner (1962-1966). Note that we only visually analyse the map in these examples; these observations would need to be corroborated with statistical analysis and modelling to result in any verifiable claims.

Phoneme inventories
From the data in Ramaswami (1999) collected in the PHOIBLE database (Moran and McCloy, 2019), we were able to map the phoneme inventories of 62 major South Asian languages. Several works have studied the phonetic typology of the South Asian linguistic area, e.g. Ramanujan and Masica (2016) and Arsenault (2017), but have not used areal mapping visualisations.
Some interesting phonological features for mapping are retroflexion (which is prevalent throughout the region, but weakly distinguished or not distinguished at all in the eastern periphery) and breathy-voiced stops (which are less common in much of the Dravidian and Munda families and in the northwestern languages). Figure 4a shows the distribution of the breathy-voiced retroflex stop /ɖʱ/ (in IAST: ḍh) using the Bhāṣācitra system. While Arsenault (2017) did use mapping, the feature-separating lines were calculated based on point coordinates for each language, not areal zones. Bhāṣācitra produces more accurate visualisations; it is immediately clear that the northwest Indo-Aryan and Nuristani, Dravidian, and Munda languages lack the phoneme, and this information can be used to inform locations for future fieldwork at the isogloss boundaries to refine our data.
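As a minimal sketch of how a phoneme inventory becomes a mappable feature, the Python function below records, for every location attached to a lect, whether that lect's inventory contains a given phoneme. The lect names, place names, and inventories are toy placeholders, not data from Ramaswami (1999) or PHOIBLE, and the function name is hypothetical.

```python
def feature_by_location(inventories, locations, phoneme):
    """Map each (lect, place) pair to True/False for presence of `phoneme`.

    Keys are (lect, place) pairs so that two lects documented at the same
    place keep separate values.
    """
    values = {}
    for lect, places in locations.items():
        present = phoneme in inventories[lect]
        for place in places:
            values[(lect, place)] = present
    return values


# Toy inventories (partial and invented) and toy location lists.
inventories = {
    "lect_a": {"ʈ", "ɖ", "ɖʱ"},
    "lect_b": {"ʈ", "ɖ"},
}
locations = {"lect_a": ["place_1", "place_2"], "lect_b": ["place_3"]}

presence = feature_by_location(inventories, locations, "ɖʱ")
```

Each resulting value can then colour the Voronoi zone of its location, which is what turns a per-language feature into an areal map.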

Indo-Aryan sound changes
As another demonstration, we use an etymological database of Indo-Aryan languages that is currently under development (Arora and Farris, 2021) and builds on Turner (1962-1966) to map the outcomes of some key Indo-Aryan sound changes. The Indo-Aryan (IA) languages show complex overlapping phonological isoglosses as a symptom of intense cross-dialectal contact over a long period of time, and this complexity makes it difficult to make sense of the family's linguistic history. For example, the Sanskrit cluster /kʂ/ generally develops to /kʰ/ in the core region of modern Indo-Aryan and /t͡ʃʰ/ in the periphery, but some doublets are evidence of dialect contact, e.g. Sanskrit /kʂaːrə/ > Hindi /t͡ʃʰaːr/ 'ashes' as well as /kʰaːr/ 'alkali' (Masica, 1993). The variability of these sound changes has recently been used to statistically model dialect components in IA languages (Cathcart, 2019a,b, 2020; Cathcart and Rama, 2020).
Thus, a visualisation of the probability of certain IA sound changes based on a lexical database would be useful for finding isoglosses and the geographical extent of historical dialect contact. We aligned the cognate forms given in Arora and Farris (2021) using the LingPy library's multiple alignment function (List et al., 2019). Based on the alignments, the likelihood of /kʂ/ > /k(ː)ʰ/ is mapped in figure 4b. A rough core-periphery distinction indeed emerges, with languages in the northwest, south, and east having fewer outcomes of /k(ː)ʰ/. It is also apparent that the language coverage in Turner (1962-1966) is limited, with many core IA languages lacking data.
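One simple way to turn aligned reflexes into a mappable quantity is sketched below in Python. It assumes the alignment step has already been done and that the "likelihood" of an outcome is estimated as its relative frequency among a language's reflexes of the cluster; that estimator is our assumption, not a confirmed detail of the method. The lect names and data are toy placeholders, and no LingPy calls are shown.

```python
from collections import Counter, defaultdict


def outcome_proportions(reflexes, target):
    """Share of each language's reflexes that show the `target` outcome.

    `reflexes` is an iterable of (language, outcome) pairs, one pair per
    cognate in which the Sanskrit cluster has a reflex in that language.
    """
    counts = defaultdict(Counter)
    for language, outcome in reflexes:
        counts[language][outcome] += 1
    return {
        language: tally[target] / sum(tally.values())
        for language, tally in counts.items()
    }


# Toy data: outcomes of the cluster in two hypothetical lects.
reflexes = [
    ("lect_a", "kʰ"), ("lect_a", "kʰ"), ("lect_a", "t͡ʃʰ"), ("lect_a", "kʰ"),
    ("lect_b", "t͡ʃʰ"), ("lect_b", "t͡ʃʰ"),
]
shares = outcome_proportions(reflexes, target="kʰ")
```

A language with few attested reflexes gets a noisy estimate under this scheme, which is one reason the visual impressions need statistical corroboration.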

Future work
We intend to maximise coverage of South Asian languages in Bhāṣācitra. In the interest of achieving this goal we welcome contributions to our open-source database on GitHub: https://github.com/aryamanarora/bhasacitra. Ultimately, this sort of database would be useful for all languages of the world, but we lack the domain knowledge for non-South Asian languages, so we welcome any collaborators who feel this system would be beneficial.
As for directions for technical work, Bhāṣācitra would benefit from a SQL database for faster querying, and from precomputation of some data (e.g. language circle sizes and coordinates) to improve performance in the browser. In the interface, we will explore continuous alternatives to discretised Voronoi polygons, which force rigid transitions between lects and do not show where location coverage is sparse (we thank both reviewers for pointing out this limitation). This will also help us with the issue of large polygons at the edges of our research area. Also, a basemap with administrative boundaries and other contextual geographical information would be useful. All of these will require substantial changes to the code beyond the capabilities of visualisation with pure D3.js.
Bhāṣācitra is one step towards our larger goal of improving the study of South Asian languages with computational methods. Our future work on historical/comparative linguistics (Arora and Farris, 2021) and corpus linguistics for under-studied languages of the region will benefit from Bhāṣācitra's visualisation capabilities.

Conclusion
We developed and presented Bhāṣācitra, a database of linguistic resources for South Asia and a language visualisation system based on location data from those resources. We analysed the coverage of our database and used the areal mapping system to visualise phoneme inventories and Indo-Aryan sound change outcomes. We hope that researchers find the tool useful, especially as we move forward with studying the typology of South Asian languages.