Automatic Speech Recognition in Sanskrit: A New Speech Corpus and Modelling Insights

Automatic speech recognition (ASR) in Sanskrit is interesting, owing to the various linguistic peculiarities present in the language. The Sanskrit language is lexically productive, undergoes euphonic assimilation of phones at the word boundaries and exhibits variations in spelling conventions and in pronunciations. In this work, we propose the first large scale study of automatic speech recognition (ASR) in Sanskrit, with an emphasis on the impact of unit selection in Sanskrit ASR. In this work, we release a 78 hour ASR dataset for Sanskrit, which faithfully captures several of the linguistic characteristics expressed by the language. We investigate the role of different acoustic model and language model units in ASR systems for Sanskrit. We also propose a new modelling unit, inspired by the syllable level unit selection, that captures character sequences from one vowel in the word to the next vowel. We also highlight the importance of choosing graphemic representations for Sanskrit and show the impact of this choice on word error rates (WER). Finally, we extend these insights from Sanskrit ASR for building ASR systems in two other Indic languages, Gujarati and Telugu. For both these languages, our experimental results show that the use of phonetic based graphemic representations in ASR results in performance improvements as compared to ASR systems that use native scripts.


Introduction
Sanskrit is a language with fairly advanced disciplines of phonetics (Śiks ̣ā), prosody (Chandas), and grammar (Vyākaran ̣a). The language has a rich oral tradition and it tends to follow a phonemic-orthography resulting in systematic grapheme-phoneme correspondences. Connected speech leads to phonemic transformations in utteracnes, and in Sanskrit this is faithfully preserved in writing as well . This is called as Sandhi and is defined as the euphonic assimilation of sounds, i.e., modification and fusion of sounds, at or across the boundaries of grammatical units (Matthews, 2007, p. 353). Phonemic orthography is beneficial for a language, when it comes to designing automatic speech recognition Systems (ASR), specifically for unit selection at both the Acoustic Model (AM) and Language Model (LM) levels.
Regardless of the aforementioned commonalities preserved in both the speech and text in Sanskrit, designing a large scale ASR system raises several challenges. The Unicode encoding for the native scripts in Sanskrit, both in Roman and Devanāgari, does not preserve the correspondence with the phonemic encoding. Further, mapping the graphemes in Unicode to the corresponding phonemes either leads to ambiguity and redundancy or often requires multi-grapheme combinations.
The language is lexically productive, which results in long compound words with multiple components in usage. This results in the speakers segmenting the compounds at arbitrary lexeme boundaries of the compound, as it need not always be possible to utter the compound in one breath and also to convey the meaning clearly. Similarly, such arbitrary segmentations at the word boundaries are possible in utterance of long text sequences where multiple lexical items are fused together via Sandhi. These segmentations are accompanied with the corresponding Sandhi based transformations, resulting in a new phonetic sequence different from the original sequence. Finally, Sanskrit might be one of those rare natural languages where the number of non-native proficient speakers are manifold in comparison to the native speakers (Krishna et al., 2020). This makes the ASR task further challenging, as the speakers are prone to carry their influence from their corresponding mother tongues into the Sanskrit utterances as well.
While there exist several computational models for processing Sanskrit texts (Kulkarni, 2013;Kumar et al., 2010;Shukla et al., 2010;Kulkarni et al., 2010a;Goyal et al., 2012;Kulkarni et al., 2010c;Mishra et al., 2013;Saluja et al., 2017;Anoop and Ramakrishnan, 2019;Krishna et al., 2021), large scale systems for processing of speech in Sanskrit, are almost non-existent. First, we present a new dataset, with 78 hours of speech covering about 46,000 sentences, for ASR in Sanskrit. Keeping the rich and long cultural heritage the language carries, we prepare our dataset to be diverse both chronologically and in terms of the domain coverage. Further, the dataset contains utterances from 27 different speakers, representing 6 different native languages. The dataset splits have disjoint speakers, with 12 in the training and 5 each in the validation, test and out-of-domain test data sets. Further, we explicitly mark the segmentation decisions made by a speaker to segment long compound words and fused phrases and include the corresponding transformations due to sandhi.
Using this dataset, we propose a new, largevocabulary Sanskrit ASR system, which, to the best of our knowledge, is the first such system for Sanskrit. The phonemic orthography followed in Sanskrit has influenced our design choices in terms of unit selection at the level of the acoustic and language models. We investigate three different encoding schemes used to model LM tokens, namely, word-based encoding, byte pair encoding (BPE) and a new vowel split encoding inspired by existing linguistic theories of syllabic structure popularly used within text-to-speech systems (Kishore and Black, 2003;Mishra et al., 2013). Further, to address the redundancy issues in Unicode representations, we make use of the Sanskrit Library Phonetic (SLP1) encoding scheme proposed by Scharf and Hyman (2011). SLP1 is designed such that it preserves the phonemic orthography. Building on the study by Scharf and Hyman (2011), we focus on two graphemic representations only, viz., native script (Devanagari) and SLP1.
Finally, we extend our insights to model ASR systems for two more Indian languages, viz., Telugu and Gujarati. We extend the SLP1 to include graphemes relevant for these languages which are missing from Sanskrit. We report the performance of these ASR systems on two publicly available ASR datasets.
Our main contributions in this work are: 1) We present (in Section 2) a new, large vocabulary Sanskrit ASR system and the first ever ASRbased study for Sanskrit using a new, large and diverse, labeled speech corpus वाक् सञ्चयः (/Vāksañ cayah ̣/). 2) We investigate (in Sections 3 and 4) different modeling choices for both acoustic models and language models in Sanskrit ASR systems, along with different graphemic representations. We propose a new word segmentation technique based on splitting at vowels that can be used with both the acoustic model and the language model. 3) We also contextualize our findings for Sanskrit by providing comparisons on ASR systems built for two other Indian languages, viz., Gujarati and Telugu. from one topical domain to another, specifically one Śāstra (branch of learning) to another (Adiga et al., 2018). Our corpus contain samples from diverse domains, including philosophy, literature, commentary on poetry and grammar. It also includes contemporary recordings such as stories, live lectures, spiritual discourse and radio program/podcast, so that collecting a wide range of Sanskrit vocabulary. The recordings were primarily collected with the help of volunteers, recording their speech by using the Recorder app on Android phones and the Audacity platform, and from various sources available online. 3 oTranscribe 3 was used to transcribe the audio files. We had a total of 9 volunteers for recording and 18 unique speakers for the content collected online. Each of these speakers are proficient Sanskrit speakers, with at least an undergraduate or equivalent degree in Sanskrit. These speakers are native speakers of one of the 6 following Indic languages, Hindi, Kannada, Malayalam, Marathi, Tamil and Telugu. In Table 1, we provide the details of the training/validation/test splits for our corpus, वाक् सञ्चयः (/Vāksañcayah ̣/). The speakers in all these 4 splits, train, validation, test, and out-of-domain test sets are disjoint. The out-ofdomain test dataset is a stratified sampled dataset, consisting of speech samples from 5 unique speakers. Two of these were added to include utterances in Sanskrit from speakers with more pronounced influence of their native languages (in Hindi and Tamil). The domain of the training dataset primarily is a speech collection of readings from various well known texts. Further, the speech in the training data is in accordance with the traditional phonetic prescriptions of Sanskrit (Śiks ̣ā). Hence, the remaining three in the out-of-domain test set were added to include utterances from different speech domains, extempore discourse, lecture and radio program, differing from the speech domain in the training set. The radio program is studio produced, while the other two are live recorded.

Challenges in Sanskrit ASR
In this section, we describe various linguistic phenomena that are important to consider when preparing datasets and building ASR systems for Sanskrit with the help of illustrative examples. 3 The URLs of the tools and the list of the texts we use are available in the supplementary material.
Word Length: The tokens in Sanskrit texts can be very long owing to "Sandhi" and the lexically productive process of compounding (``Samāsa"). For instance, consider a compound word, वागथर् प्र तपत्तये (/vāgarthapratipattaye/). It forms a 19 letter word in SLP1 (vAgarTapratipattaye), and is formed by combining the three Sanskrit stems वाक् , अथर् , प्र तप त्त (/vāk, artha, prati patti/), as per the rules of Sandhi and Samāsa. In Figure 1, we present the distribution of the number of characters (in SLP1 format) per word across the three languages that we experimentally analyse in this work, viz., Sanskrit, Telugu and Gujarati. The plots are normalized with respect to the size of the vocabulary. The average word length is much higher in Sanskrit (10.75) compared to Gujarati (7.79) and Telugu (9.35). Table 2 compares the distribution of number of characters (wrt SLP1) per word in the training vocabulary of three ASR datasets. More than 26% of the words in Sanskrit have length exceeding 12 characters.   Sandhi: Sandhi can occur between successive words in a sentence or between the lexemes in a compound, or between the morphemes of the a) nityaśśabdārthasambandhah ̣ b) nityah ̣ śabdārthasambandhah ̣ c) nityah ̣ śabdārtha sambandhah ̣ d) nityaśśabdārtha sambandhah ̣ Figure 2: a) Utterance of a sequence without any pause b) Sandhi split at word boundary c) Sandhi based segmentation d) utterance splits without sandhi same word. While recognising longer sequences due to sandhi and compounding is a challenge in itself, the external sandhi gives rise to the issue of arbitrary points of segmentation performed by speakers at the time of utterance. Figure 2a shows text-sequence where the sequence contains a word nityah ̣ and a compound śabdārthasambandhah ̣ fused together via Sandhi. Further, śabdārthasam bandhah ̣ is a compound with śabda, artha and sambandha as its components. While it is expected to be uttered without any pause after considering Samhitā (As ̣t ̣ādhyāyī-1-4-109), a speaker may choose to segment at the lexical boundaries as shown in Figure 2c and Figure 2d. However in doing so, a proficient speaker would prefer a sequence similar to Figure 2a or 2b, rather than Figure 2c or 2d. This is because, the former two, though result in phonetic transformations, preserve the syntactic and semantic validity of the sequence. However, the latter do not preserve the syntactic and semantic validity of the sequence. Similarly, there are cases where there can be phonetic transformations between the bound morphemes and the free morpheme of a word. These transformations do not result in any modification to the word, other than phonetic variations. However, this makes it challenging for an ASR system. The case of Diphthongs is the quite prevalent under these cases. In Diphthongs, which can occur both at internal or external sandhi, the independent vowel can only occur at the start of a word. Any vowel appearing in the middle of a word either gets converted to a dependent vowel or a diphthong. When a word ending with "ए(/ē/) or ओ(/ō/)" and followed by any vowel, then ending will be changed to either "अय् (/ay/) or अव् (/av/) respectively" or "अ(/a/)". For example िवष्णो+इह(/vis ̣n ̣o+iha/) will get converted to िवष्णइह(/vis ̣n ̣aiha/) or िवष्णिवह(/vis ̣n ̣aviha/).
Unicode encoding: The Unicode encoding for the native scripts in Sanskrit, similar to several indian languages, does not preserve the correspondence with the phonemic encoding. Further, mapping the graphemes in Unicode to the corresponding phonemes either suffers from ambiguity and redundancy or often requires multigrapheme combinations.
For instance, consider the word वागथार् िवव(/vāgarthāviva/) in Sanskrit.Here the graphemes in Devanagari 'व ◌ा ग र ◌् थ ◌ा व ि◌ व'and Roman (v ā g a r t h ā v i v a) do not exhibit a one-to-one mapping with the phonemes. For instance, a single grapheme (e.g., ग) may correspond to 2 phonemes while two graphemes (e.g., र ◌् in devanagari, 't h' in roman) may correspond to a single phoneme.

Unit Selection Alternatives for AM and LM
In this section, we discuss different alternatives for identifying units of the language model and the acoustic model that we subsequently employ in our experimental evaluation and analysis.

Alternatives for Graphemic Representation
The Unicode standard for native scripts of Sanskrit : Devanagari, Gujarati and Telugu faces challenge for computational language processing due to redundancy in mappings between phonemes and graphemes as previously discussed. So for Sanskrit, we use Sanskrit Library Phonetic encodings (SLP1) designed by Scharf and Hyman (2011). This encoding gives unique one-to-one mapping to the phoneme. However Gujarati possesses extra native characters such as ઍ(/e/), ઑ(/o/) . Telugu also possesses extra characters such as ఎ(/e/), ఒ(/o/), and ఱ(/r/). So we extend SLP1 to fit to the character set of Gujarati and Telugu in our experiments.

Alternatives for Sub-word units
One possibility for deriving subword units for the language modeling is to segment words in Sanskrit based on Sandhi rules. However, Sandhi splitting can change some phonemes corresponding to the words in almost all cases. Consider the word रामाये दम् = रामाय+इदम् (/rāmāyedam/ = /rāmāya/+/idam/), wherein the vowel ए (/e/) is changed into अ+इ (/a/+/i/) after performing Sandhi-based splitting. This leads to a mismatch between the speech transcript and the speech audio, potentially creating further complications for ASR.

BPE-based Word Segmentation
Byte pair encoding (BPE) is a simple data compression algorithm that iteratively replaces the frequently occurring subword units with a single unused byte (Gage, 1994). This technique was first adopted to model rare words using subword units in neural machine translation (Sennrich et al., 2016). Interestingly, BPE has been explored for learning new vocabulary for poetry to prose conversion in Sanskrit (Krishna et al., 2019). We consider the benefits of using BPE as a subword unit for Sanskrit ASR. While BPE is a purely data-driven segmentation strategy, we next present a linguistically motivated segmentation approach that might be aligned with finding syllable units for ASR that are more phonetically compliant. We refer to this technique as vowel segmentation.

Vowel Segmentation
Splitting the tokens based on vowels and adjacent consonants is inspired by the identification of metres in Sanskrit prosody, where the metre of a verse is identified by using syllable segmentation, followed by identification of syllable weights and it's combinations (Melnad et al., 2013). The syllable weight of a syllable can either be laghu (light)(represented by the symbol ।) or guru (heavy)(represented by the symbol ऽ). Syllables with short vowels generally form Laghu and those with loing vowels form a Guru. Also, when a short vowel is followed by a conjunct consonant or Anusvāra (nasal sound /ṃ /) or Visarga (voiceless glottal fricative /h ̣/), the short vowel now becomes Guru. E.g., the Laghu-Guru mapping of "अन्यािन सं या त(/anyāni saṃ yāti/)" is "ऽऽ। ऽऽ।". In prior work involving Indian languages for TTS, Kishore et al. (2002) proposed various syllabification rules for words. Herein (with a few exceptions), if a vowel is followed by 3 or more consonants, only the first following vowel is grouped with the preceding vowel to form the subword unit.
Our proposed algorithm for vowel segmentation (VS) is outlined in Algorithm 1. We propose segmenting words at vowel boundaries to extract the units for which alignment with speech is learnt within the ASR system. For acoustic models, an effective unit of a word for ASR would arguably be the syllable (Lee et al., 2013). Representing a word in terms of syllables demands the mapping ALGORITHM 1: Vowel segmentation algorithm for Indian languages Input: word in Indian language Output: Vowel segments in output output = ""; for each graphemic unit of a word from graphemes to phonemes. To create syllable units, phonemes are then combined together based on the sonority sequencing principle (Clements, 1990). Absence of accurate syllabifiers for Indian languages restricts the use of syllables as units for learning alignment. Our approach produces units which can be viewed as a rough approximation to a syllable. A syllable is composed of three parts viz., onset, nucleus and coda, where nucleus has the highest sonority and is always a vowel. In our approach, the onset is always one or zero consonants and the coda is zero or n-1 consonants if the nucleus is followed by n consonants. It is also observed in the pronunciation of conjunct consonant by professional speakers that the beginning part of conjunct consonant gets associated more with the preceding vowel than the following. We consider nasal Anusvāra (◌ं ), Chandrabindu (◌ॅ or ◌ँ ), and Visarga (◌ः) to be part of the consonant set. For example, in Sanskrit, the units for a word उद्यान: (udyāna , park) will be 'उद् या न:(/ud yā na /)' and subword units of the Telugu word తలిల్ తండు ర్ లు(/tallitaṃ d ̣rulu/) will be 'తల్ లి తండ్ రు లు(/tal li taṃ d ̣ ru lu/)'.

Vocabulary size as a function of graphemic unit
For each choice of graphemic unit (viz. native script and SLP1) described in Section 3.1, we study three different units for the acoustic modeling (AM) in ASR, viz., graphemic unit and vowel segmentation for Sanskrit and also phonemic unit across the two other representative Indian languages viz., Gujarati and Telugu. Whereas, for language modeling (LM), we study word, BPE and VS based units. In Figure 3, we report the vocabulary size based on each of these three different unit selections and contrast the sizes with that of two extreme hypothetical systems -one that considers the entire word as a single unit for AM and the other that treats the phoneme as a single unit for AM. Note that while phonetic dictionaries are available for Telugu and Gujarati, our dataset for Sanskrit does not have an accompanying phonetic dictionary. We present the variation in vocabulary size as a function of the graphemic unit (native script vs. SLP1). In both Gujarati and Telugu, we point out that the number of SLP1 graphemic units almost coincide with the number of phonemes, while the native script-based graphemes are much larger in number compared to phonemes. We can also roughly estimate the extent of data sparsity in terms of vocabulary size in each setting -larger the vocabulary size, higher is the chance of data sparsity. We note that data sparsity is minimal for graphemes and highest for a hypothetical system where whole words are the unit of selection.

Experiments and Results
Description of Datasets: In addition to reporting ASR results on the carefully created वाक् सञ्चयः (/Vāksañcayah ̣/) dataset (described in Section 2), we also contrast through experimental analysis on two other Indian languages, viz., Telugu and Gujarati. For Telugu and Gujarati, we used the publicly available speech corpora released by Microsoft (Srivastava et al., 2018) that contains 36.2/8.7 hours and 33.2/5.8 hours of training/test speech in Telugu and Gujarati, respectively. We use a new train-test split for the Gujarati and Telugu datasets because the original split had overlapping spekaers in their train and test. Our new split ensures that the train-test split have disjoint speakers. Transcript of this corpora was cleaned for orthographic errors. Corpora in these two languages were accompanied by pronunciation lexicons, which we used to build phoneme-based ASR systems to compare against our grapheme-based systems.
Experimental Setup: We use the Kaldi toolkit (Povey et al., 2011) for all our ASR experiments. Our acoustic model is implemented using Time Delay Neural Networks (TDNNs) (Peddinti et al., 2015) containing 14 layers. We use 40dimensional MFCCs as our input features along with 100-dimensional i-vector based speaker embeddings (Saon et al., 2013). We used ngram language models with Kneser-Ney smoothing implemented using the SRILM toolkit (Stolcke, 2002). The language models were trained using both training transcripts from the speech data, as well as additional textual data derived from the Leipzig Corpora Collection for Gujarati and Telugu (Goldhahn et al., 2012) and the Digital Corpus of Sanskrit (Hellwig, 2010) for Sanskrit. The word vocabulary sizes in the lexicons for Sanskrit, Telugu and Gujarati are 76K, 43K and 48K, respectively. Tables 3, 4 and 5, present the WERs from ASR systems built using different choices of AM and LM units using both the graphemic representations (Native and SLP1) for Sanskrit, Gujarati and Telugu, respectively. From Table 3    clearly. For example సావ్తనత్ య్మ్(/svātantryam/) is pronounced as సావ్తనత్ మ్(/svātantram/). However, in Sanskrit, conjunct consonants having 5 consonants together such as कात्स्न्यर् म् (/kārtsnyam/) are enunciated very clearly and all consonants appear articulated in its pronunciation. Due to the morphological richness (Kulkarni et al., 2015), inflections and compounds, Sanskrit always has the highest number of rare words. In the training dataset used in the Sanskrit ASR experiments with the vocab size of 70.5K, more than 87.25% words have a frequency less than 3, where as in Telugu and Gujarati training dataset, this is 76.76% and 77.26%, respectively. Clear articulation of conjunct consonants and higher rare word rates makes the BPE and VS based models performs better in Sanskrit than other two languages along with the impact of OOVs.

Results:
We observe that use of SLP1 as a graphemic representation schemes performs best for all the three languages. SLP1 is designed to capture the phonemic-graphemic correspondences present in Indic languages. We also find that ASR performance using phonemes is comparable to graphemes for Gujarati and Telugu. In Sanskrit, we observe that purely grapheme-based acous-N Sanskrit Telugu Gujarati 1 77.30% 75.77% 89.37% 2 21.41% 23.27% 10.06% 3 1.26% 0.96% 0.56% 4 0.03% 0% 0% Table 6: Number of continuous consonants (N) distribution in three ASR datasets tic models outperform grapheme+vowel segmentbased acoustic models. With the consistent mapping between graphemes and phonemes and the absence of schwa deletion, it is intuitive that grapheme-based models would be most appropriate for Sanskrit. Even though for Sanskrit in some cases Devanagari as a graphemic representation outperforms the SLP1 (Sr. 1,3,9,11 in Table 3), the model that uses SLP1 script always outperforms the other in terms of character error rate. In Sanskrit the pause given between the subwords of a compound word and in between two words varies depending on the fluency of the speaker and the complexity of the text, which can deteriorate the WER. The utterance for 'महान् प्राकारः' /mahān prākārah ̣/ may get recognised as 'महान्प्राकारः' /mahānprākārah ̣/, where two correctly recognised words will be evaluated as one deletion and one substituion by the evaluation model. Similarly if the audio of 'शोभमानमासीत् ' /śobhamā namāsīt/ gets recognised as 'शोभमानम् आसीत् ' /śob hamānam āsīt/, then it will be considered as one insertion followed by one substituion. After negating these two particular errors, we will get 17.79% as the modulo substitution deletion WER for our best model of Sanskrit (Sr. 6 of Table 3). The character error rate 3.10% for the best model in Sanskrit also ensures the performance of the model and the quality of the dataset, where as the CER for the best model of Gujarati and Telugu are 5.49% and 5.60% respectively, much higher than Sanskrit. Table 3 Percentage of OOVs recognised  Table  3 OOV Analysis: The OOVs at word level are 44.16%, 15.26% and 18.63% for Sanskrit, Telugu and Gujarati, respectively. Figure 4 depicts the percentage of OOVs recovered by different ASR experimental setup described in Table 3, also giving modulo substitution deletion OOVs. As we move from word based LM Unit to BPE and VS, system is able to recognize more than 57% of OOVs. For Telugu and Gujarati OOVs recognition rate is 53.67% and 48.61% respectively for their best performing models.

Experiment number in
# WER Speech Description 1* 37.79% Tamil influenced accents 2* 37.60% Hindi influenced accents 3* 46.27% Radio Program 4 46.62% Extempore Discourse 5 51.52% Live Lecture Out-of-domain test set. Table 7 presents results on the out-of-domain test set described in Section 2. It shows the WERs we can expect from our models when the speakers and content largely vary in domain from our dataset. This test set was sampled for specific speakers and content that qualify as being out-of-domain. These test utterances were evaluated using our best performing Sanskrit ASR models. Speakers #1 and #2 were included, as their utterances show more pronounced influence of their native languages, Tamil and Hindi respectively. It is observed that speaker #1, does not often attempt to distinguish between the pronunciation of the phoneme pairs such as /ta/ and /da/, /ka/ and /ga/, etc. This is in congruence with the orthography followed in Tamil, the speaker's native language. Speaker #2's reading was influenced by schwa deletion, i.e., the phenomena of deleting vowel markers accompanying consonants at certain contexts (elaborated in the supplementary material) which is dominant in Hindi. For example, गतवान् /gatavān/ is pronounced as /gatvān/ by this speaker and the ASR system correctly predicts it as /gatvān/. Here, the acoustic model clearly dominates the language model. Speaker #3's utterances are the Sanskrit translation of the Indian government's public outreach program known as 'मन क बात' /man kī bāt/. This widely differs from the training speech dataset and vocabulary of the LM that is used. Similar observation can be made for speakers #4 and #5, albeit for different reasons. Speaker #4 in the discourse tends to use rare words, especially domain-specific proper nouns and derivational verbs, both of which are scarce in our LM vocabulary. Speaker #5 tends to deviate from conventional speech patterns, by providing emphasis on specific words, for the purpose of pedagogy.

Conclusion
We presented a new Sanskrit speech corpus वाक् सञ्चयः (/Vāksañcayah ̣/) along with a new largevocabulary ASR system. We explored different unit selection alternatives for both AM and LM, along with a new segmentation approach. We observe that SLP1, when used as the script instead of the native scripts generally results in better performances for Gujarati and Telugu. For Sanskrit, SLP1 based model results in better character error rate and BPE based model with SLP1 giving the best result. Similarly, we observe that vowel split based segmentation consistently yields better performance than word based model and close to results of the best model. Our current dataset and the model are specifically designed to handle data in prose. The inclusion of poetry data would require substantial changes to the system, which we plan to address in the near future. For instance, the poetry data would greatly benefit from insights from Sanskrit prosody. More importantly, the degree of free word orderness in prose and poetry greatly varies in Sanskrit, so much so that an n-gram LM will not be effective .

A Differences between Sanskrit and other Indic languages for ASR
Many Indian languages are known to be derived from Sanskrit (Kulkarni et al., 2010b) and their scripts derived from the Brahmi script (Salomon, 1996;Sproat, 2003), which leads to graphemebased similarites amongst them. In Figure 5, we illustrate through an example, the spectrum of mapping the native character/grapheme (units) in words across languages; at one end of the spectrum is राम(/rām/) in Hindi mapped to రామ(/rāma/) in Telugu as an example where direct correspondence with the native character exists. Going further in the spectrum are examples for which direct character correspondence does not exist. सीता(/sītā/) in Hindi going to ಸೀತೆ(/sīte/) in Kannada is an instance where there is a change in the ending vowel.

Schwa Deletion
The schwa deletion phenomenon plays a crucial role in the north Indian languages. Every consonant by itself includes a short /a/ vowel sound (referred to as "schwa") unless otherwise specified. For example, the letter 'त' in Hindi is pronounced as /ta/. This sound can be associated with any other vowel sound by the use of "Mātras". Mātras are dependent forms of vowels. Schwa is the default vowel for a consonant and hence does not require any explicit Mātra to represent it. Schwa deletion is a phenomenon where implicit schwas of a word are deleted during pronunciation. For example, in Hindi, the proper noun, 'अजु र् न (/arjun/, the name of a person) has schwa deletion after the consonant 'न' and is pronounced as Arjun. This phenomenon is not observed in the South Indian languages. For instance, in Kannada it is pronounced as 'Arjuna'. There is no implicit schwa deletion in Sanskrit as well as in the traditional use of South Indian languages such as Kannada. North Indian languages observe schwa deletion not only at the end of the word, but also in the middle of a word in some cases. For example, the word 'गलती' (/galtī/ meaning mistake) in Hindi observes implicit schwa deletion after the consonant 'ल' (/la/). ASR becomes challenging because of this phenomenon since the occurrence of schwa deletion is not always explicitly specified in the orthography. For example, the name रामबाबु (/rāmbābu/) has two basic words concatenated to form a name. In Hindi, this name has an implicit schwa deleted at म (consonant sounding 'ma') of राम (/rām/). While constructing phonetic representations for ASR, such deletions introduce ambiguities in pronunciation which could be alleviated by enforcing more consistency between graphemes and phonemes. This same word रामबाबु written in Telugu would be phonetically represented as రామాబ్బు (/rāmbābu/) instead of రామబాబు (/rāmabābu/) which is intuitive. Note that in the former case, there is an addition of '◌్ '(halant: an explicit schwa deletion marker) at మ(/ma/). This forces the consonants మ(/ma/) and బ(/ba/) to combine and form a conjunct. In the latter case there is a grapheme consistency across both Hindi and Telugu languages but there is a variation in their pronunciation due to the schwa deletion phenomenon. In contrast, in the case of Sanskrit, since pronunciation is strictly governed by the शक्षा(/śiks ̣ā/) (Manomohan and Pān ̣ini, 1938), a treatise on phonetics, schwa deletion is not observed.

B BPE Experiment Details
In Sanskrit a noun can have 24 to 92 inflections (depending on base word's gender and alternate forms) and a verb can have 90 to 180 inflections. Derivative nouns (Taddhitas) and verbs (passive (Karman ̣i), san, n ̣ic, yaṅ, etc) are also used often in Sanskrit literature. Due to this morphological richness and frequently occurring compound words, vocab size can be reduced by properly selecting repetitive stems and suffices using BPE by specifying the number of merge operations. Therefore we experimented varying number of subword unit with vocabulary sizes of 2K, 4K, 8K, 16K, 32K and 64K (K=1000). Table 8 shows the varying BPE configuration on our best configuration, i.e graphemes as AM unit and BPE as LM unit. However, the performance of these configurations are comparable irrespective of their BPE vocabulary size.
BPE with vocabulary size of 32,000 stands closes to that of VS, with a vocabulary size of 29,147. Even in this configuration, BPE outperforms VS, as BPE reports a WER of 21.94 as