Itihasa: A large-scale corpus for Sanskrit to English translation

This work introduces Itihasa, a large-scale translation dataset containing 93,000 pairs of Sanskrit shlokas and their English translations. The shlokas are extracted from two Indian epics, viz. The Ramayana and The Mahabharata. We first describe the motivation behind the curation of such a dataset and follow up with empirical analysis to bring out its nuances. We then benchmark the performance of standard translation models on this corpus and show that even state-of-the-art transformer architectures perform poorly, emphasizing the complexity of the dataset.


Introduction
Sanskrit is one of the oldest languages in the world, and most Indo-European languages are influenced by it (Beekes, 1995). There are about 30 million pieces of Sanskrit literature available to us today (Goyal et al., 2012), most of which have not been digitized. Among those that have been, few have been translated. The main reason for this is the lack of expertise and funding. An automatic translation system would not only aid and accelerate this process but also help democratize the knowledge, history, and culture present in this literature. In this work, we present Itihāsa, a large-scale Sanskrit-English translation corpus consisting of more than 93,000 shlokas and their translations.
Itihāsa, literally meaning 'it happened this way', is a collection of historical records of important events in Indian history. These bodies of work are mostly composed in the form of verses or shlokas, a poetic form which usually consists of four parts containing eight syllables each (Fig. 1). The most important among these works are The Rāmāyana and The Mahābhārata. (The processed and split dataset can be found at https://github.com/rahular/itihasa, and a human-readable version at http://rahular.com/itihasa.) The Rāmāyana, which describes the events in the life of Lord Rāma, consists of 24,000 verses. The Mahābhārata details the war between cousins of the Kuru dynasty, in 100,000 verses. The Mahābhārata is the longest poem ever written, with about 1.8 million words in total, and is roughly ten times the length of the Iliad and the Odyssey combined.
Only two authors have attempted to translate the unabridged versions of both The Rāmāyana and The Mahābhārata to English: Manmatha Nāth Dutt in the 1890s and Bibek Debroy in the 2010s. M. N. Dutt was a prolific translator whose works are now in the public domain. These works are published in a shloka-wise format, as shown in Fig. 1, which makes it easy to automatically align shlokas with their translations. Though many of M. N. Dutt's works are freely available, we choose to extract data from The Rāmāyana (Vālmiki and Dutt, 1891) and The Mahābhārata (Dwaipāyana and Dutt, 1895), mainly due to their size and popularity. To the best of our knowledge, this is the biggest Sanskrit-English translation dataset to be released in the public domain.
We also train and evaluate standard translation systems on this dataset. In both translation directions, we use Moses as an SMT baseline and Transformer-based seq2seq models as NMT baselines (see §4). We find that models which are generally on par with human performance on other translation tasks perform poorly on Itihāsa, with the best models scoring between 7 and 8 BLEU points. This indicates the complex nature of the dataset (see §3 for a detailed analysis of the dataset and its vocabulary).

Motivation
The main motivation behind this work is to provide an impetus for the Indic NLP community to build better translation systems for Sanskrit. Additionally, since The Rāmāyana and The Mahābhārata are so pervasive in Indian culture, and have been translated into all major Indian languages, there is a possibility of creating an n-way parallel corpus with Sanskrit as the pivot language, similar to the Europarl (Koehn, 2005) and PMIndia (Haddow and Kirefu, 2020) datasets. The existence of Sanskrit-English parallel data has other advantages as well. Because Sanskrit is morphologically rich, agglutinative, and highly inflected, complex concepts can be expressed in compact forms by combining individual words through Sandhi and Samasa. (Sandhi refers to the concatenation of words, where the edge characters combine to form a new one. Samasa can be thought of as being similar to elliptic constructions in English, where certain phrases are elided since their meaning is obvious from the context.) This also enables a speaker to potentially create an infinite number of unique words in Sanskrit. Having a parallel corpus can help us induce word translations through bilingual dictionary induction (Søgaard et al., 2018). It also allows us to use English as a surrogate language for tasks like knowledge base population. Constituency or dependency parsing, NER, and word sense disambiguation can be improved using indirect supervision (Täckström, 2013). Essentially, a parallel corpus allows us to apply a plethora of transfer learning techniques to improve NLP tools for Sanskrit.

Data Preparation
The translated works of The Rāmāyana and The Mahābhārata were published in four and nine volumes, respectively. All volumes have a standard two-column format, as shown in Fig. 2. Each page has a header with the chapter name and page number, separated from the main text by a horizontal line. The two columns of text are separated by a vertical line. The process of data preparation can be divided into (i) automatic OCR extraction and (ii) manual inspection for alignment errors.

Automatic Extraction
The OCR systems we experimented with performed poorly on the digitized documents due to their two-column format. They often fail to recognize line breaks, which results in the concatenation of text from different columns. To mitigate this issue, we use an edge detector to find the largest horizontal and vertical lines and, using the indices of the detected lines, split the original page horizontally and vertically to remove the header and separate the columns (see Fig. 2). We then feed the single-column documents to Google Cloud's OCR API to extract text from them. To verify the accuracy of the extracted text, one chapter from each volume (13 chapters in total) is manually checked for mistakes. We find that the extracted text is more than 99% and 97% accurate for Sanskrit and English, respectively. The surprisingly high accuracy of the Devanagari OCR can be attributed to the distinctness of its alphabet. For English, this number is lower as the OCR system often misclassifies similar-looking characters (e.g., e and c, i and l).
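The page-splitting step can be illustrated with a simplified sketch. Instead of a full edge detector, the sketch below uses a NumPy-only projection profile (counting ink pixels per row and per column) to locate the header rule and the column divider; this is a simplification of the actual edge-detection step, and all names are illustrative.

```python
import numpy as np

def find_separators(page):
    """Locate the header rule and the column divider on a scanned page.

    `page` is a 2-D grayscale array (0 = ink, 255 = background).
    Returns the row index with the most ink (the horizontal rule)
    and the column index with the most ink (the vertical divider).
    """
    ink = page < 128                               # binarize: True where there is ink
    header_row = int(np.argmax(ink.sum(axis=1)))   # row with the longest dark run
    divider_col = int(np.argmax(ink.sum(axis=0)))  # column with the longest dark run
    return header_row, divider_col

def split_page(page):
    """Drop the header and return the left and right text columns."""
    r, c = find_separators(page)
    body = page[r + 1:]                # everything below the header rule
    return body[:, :c], body[:, c + 1:]
```

Each resulting single-column image can then be sent to the OCR API separately, avoiding the cross-column concatenation errors described above.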
Manual Inspection An important limitation of the OCR system is its misclassification of alignment spaces and line breaks. It sometimes wrongly treats large gaps between words as line breaks, moving the rest of the text on the line to the end of the paragraph, which results in translations being misaligned with their shlokas. Therefore, the output of all 13 volumes was manually inspected and such misalignments were corrected. Upon manual inspection, other kinds of errors were discovered and corrected where possible. These errors can be categorized as follows: (i) print errors: caused by occluded or faded text, smudged ink, etc.; an example can be seen in Fig. 3a; (ii) input errors: human errors made while typesetting the volumes, which include typos (Fig. 3b), exclusion of words, inclusion of spurious words, etc.; (iii) subjective errors: contextual errors in the translation itself; for example, in Fig. 3c, the word dharma is incorrectly translated as 'religion' instead of 'righteousness'; and (iv) OCR errors: errors arising from the underlying OCR system, such as the improper handling of words split across lines in the Devanagari script; if the OCR system encounters a hyphen as the last character of a line, the entire line is ignored. In general, print errors are corrected as much as possible, subjective errors are retained for originality, and other types of errors are corrected when encountered.

Analysis
In total, we extract 19,371 translation pairs from 642 chapters of The Rāmāyana and 73,659 translation pairs from 2,110 chapters of The Mahābhārata. It should be noted that these numbers do not correspond to the number of shlokas because, in the original volumes, shlokas are sometimes split and often combined to make the English translations flow better. We reserve 80% of the data from each text for training MT systems and use the rest for evaluation. From the evaluation set, 33% is used for development and 67% for testing. The absolute sizes of the split data are shown in Tab. 1.
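The split arithmetic above can be sketched as follows. This is an illustrative function, not the released splitting script: in the paper the split is applied to each text separately before merging, and the exact counts in Tab. 1 may differ from this sketch due to rounding.

```python
def split_corpus(pairs, train_frac=0.80, dev_frac=0.33):
    """Split a list of translation pairs into train/dev/test.

    Mirrors the paper's scheme: 80% for training, and the remaining
    20% divided 33/67 into development and test sets.
    """
    n_train = int(len(pairs) * train_frac)
    train, heldout = pairs[:n_train], pairs[n_train:]
    n_dev = int(len(heldout) * dev_frac)
    return train, heldout[:n_dev], heldout[n_dev:]
```

Applying this per text (once to the 19,371 Rāmāyana pairs and once to the 73,659 Mahābhārata pairs) and concatenating the pieces reproduces the overall structure of the released splits.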
Due to Sanskrit's agglutinative nature, the dataset is asymmetric: conveying the same information requires fewer words in Sanskrit than in English. The Rāmāyana's English translations have, on average, 2.54 words for every word in its shlokas. This value is even larger for The Mahābhārata, with 2.82 translated words per shloka word.
This effect is clearly seen when we consider the vocabulary sizes and the percentage of common tokens between the texts. For this, we tokenize the data with two different tokenization schemes: word-level and byte-pair encoding (Sennrich et al., 2016, BPE). For word-level tokenization, the translations of The Rāmāyana (The Mahābhārata) have 16,820 (31,055) unique word tokens, and the shlokas have 66,072 (184,407) tokens. The English vocabularies have 11,579 common tokens, which is 68.8% of The Rāmāyana's and 37.3% of The Mahābhārata's. However, the overlap percentages drop significantly for the Sanskrit vocabularies. In this case, we find 21,635 common tokens, which amounts to an overlap of 32.7% and 11.7%, respectively. As shown in Fig. 4, this trend holds for BPE tokenization as well.
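The overlap statistics above follow a simple computation, sketched here for illustration (the function name and toy inputs are our own):

```python
def vocab_overlap(tokens_a, tokens_b):
    """Return the number of shared types and the overlap as a
    percentage of each side's vocabulary."""
    va, vb = set(tokens_a), set(tokens_b)
    common = len(va & vb)
    return common, 100.0 * common / len(va), 100.0 * common / len(vb)
```

For example, with the reported English word-level vocabularies, 11,579 shared types out of 16,820 and 31,055 give 100 * 11579 / 16820 ≈ 68.8% and 100 * 11579 / 31055 ≈ 37.3%, matching the figures above.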

Experiments
We train one SMT and five NMT systems in both directions and report the (i) character n-gram F-score, (ii) token accuracy, (iii) BLEU (Papineni et al., 2002), and (iv) Translation Edit Ratio (Snover et al., 2006, TER) scores in Tab. 2. For SMT, we use Moses (Koehn et al., 2007), and for NMT, we use sequence-to-sequence (seq2seq) Transformers (Vaswani et al., 2017). We train the seq2seq models from scratch, initializing the encoders and decoders with standard BERT architectures (B2B). These Tiny, Mini, Small, Medium, and Base models have 2/128, 4/256, 4/512, 8/512, and 12/768 layers/dimensions, respectively. See Turc et al. (2019) for more details. In our early experiments, we also tried initializing the encoders and decoders with weights from pre-trained Indic language models like MuRIL (Khanuja et al., 2021), but they showed poor performance and thus are not reported here.
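A minimal sketch of assembling such a BERT-to-BERT model with HuggingFace's `EncoderDecoderModel` is shown below. The head counts and feed-forward widths follow the usual BERT ratios (hidden/64 heads, 4x hidden FFN) and the default vocabulary size; these are our assumptions, not necessarily the paper's exact settings.

```python
from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel

def make_b2b(num_layers, hidden_size, vocab_size=30522):
    """Build an untrained BERT-to-BERT (B2B) seq2seq model of a given size.

    Head count and intermediate size are derived from hidden_size using
    standard BERT conventions (assumptions, for illustration).
    """
    def cfg():
        return BertConfig(
            vocab_size=vocab_size,
            num_hidden_layers=num_layers,
            hidden_size=hidden_size,
            num_attention_heads=hidden_size // 64,
            intermediate_size=hidden_size * 4,
        )
    encoder, decoder = cfg(), cfg()
    decoder.is_decoder = True           # decoder attends causally ...
    decoder.add_cross_attention = True  # ... and over encoder outputs
    config = EncoderDecoderConfig.from_encoder_decoder_configs(encoder, decoder)
    return EncoderDecoderModel(config=config)

# e.g., the Tiny configuration: 2 layers, 128 dimensions
tiny = make_b2b(2, 128)
```

The other sizes follow by substituting the layer/dimension pairs listed above (Mini 4/256, Small 4/512, Medium 8/512, Base 12/768).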
Implementation Details All models are trained using HuggingFace Transformers (Wolf et al., 2020). Both source and target sequences are truncated at 128 tokens. We train WordPiece tokenizers on our dataset and use them for all models. The Adam optimizer (Kingma and Ba, 2014) with weight decay of 0.01 and a learning rate of 5 × 10⁻⁵ is used. All models are trained for 100 epochs. The learning rate is warmed up over 8,000 steps and decayed later with a linear scheduler. We use a batch size of 128 and standard cross-entropy loss with no label smoothing. We run into memory errors on bigger models (Medium and Base), but maintain the effective batch size and optimization steps by introducing gradient accumulation and increasing the number of epochs, respectively. Also, to reduce the total training time of bigger models, we stop training if the BLEU score does not improve over 10 epochs. During generation, we use a beam size of 5 and compute all metrics against truncated references.
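Training a WordPiece tokenizer on one's own corpus, as described above, can be done with the HuggingFace `tokenizers` library. The sketch below is illustrative: the vocabulary size, special tokens, and pre-tokenizer are our assumptions, not the paper's reported settings.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def train_wordpiece(corpus, vocab_size=8000):
    """Train a WordPiece tokenizer on an iterable of raw sentences.

    vocab_size and special tokens here are illustrative defaults.
    """
    tok = Tokenizer(models.WordPiece(unk_token="[UNK]"))
    tok.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.WordPieceTrainer(
        vocab_size=vocab_size,
        special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]"],
    )
    tok.train_from_iterator(corpus, trainer)
    return tok
```

One such tokenizer would be trained per language side (Sanskrit shlokas and English translations) and then reused across all model sizes, as the paper does.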
Discussion We see that all models perform poorly, with low token accuracy and high TER. While the English-to-Sanskrit (E2S) models get better with size, this pattern is not clearly seen in the Sanskrit-to-English (S2E) models. Surprisingly, for S2E models, token accuracy progressively decreases as model size increases. Also, Moses has the best TER among S2E models, which suggests that the seq2seq models have not learned even simple co-occurrences between source and target tokens. This leads us to hypothesize that the Sanskrit encoders produce sub-optimal representations. One way to improve them would be to add a Sandhi-splitting step to the tokenization pipeline, thereby decreasing the Sanskrit vocabulary size. Another natural extension to improve the quality of representations would be to initialize the encoders with a pre-trained language model.
Figure 5: A gold sentence and shloka from the test set, and its corresponding small model prediction.
Though it is clear that there is a large scope for improvement, the models are able to learn some interesting features of the dataset. Fig. 5 shows a random gold translation pair and the small model's prediction. Though we see repetitions of phrases and semantic errors, the prediction follows the meter in which the original shlokas are written, i.e., it also consists of four parts containing eight syllables each.

Related Work
Early translation efforts from Sanskrit to English were limited to the construction of dictionaries by Western Indologists (Müller, 1866; Monier-Williams, 1899). Over the years, though notable translation works like Ganguli (1883) have been published, the lack of digitization has been a bottleneck hindering any meaningful progress towards automatic translation systems. This has changed recently, at least for monolingual data, with the curation of digital libraries like GRETIL and DCS. Currently, the largest freely available repositories of translations are for The Bhagavadgita (Prabhakar et al., 2000) and The Rāmāyana (Geervani et al., 1989). However, labeled datasets for other tasks, like the ones proposed in Kulkarni (2013) and Bhardwaj et al. (2018), have resulted in parsers (Krishna et al., 2020, 2021) and sandhi splitters (Aralikatte et al., 2018), which are precursors to modular translation systems. Though there have been attempts at building Sanskrit translation tools (Bharati and Kulkarni, 2009), they are mostly rule-based and rely on manual intervention. We hope that the availability of the Itihāsa corpus pushes the domain towards end-to-end systems.

Conclusion
In this work, we introduce Itihāsa, a large-scale dataset containing more than 93,000 pairs of Sanskrit shlokas and their English translations from The Rāmāyana and The Mahābhārata. First, we detail the extraction process, which includes an automated OCR phase and a manual alignment phase. Next, we analyze the dataset to give an intuition of its asymmetric nature and to showcase its complexities. Lastly, we train state-of-the-art translation models, which perform poorly, demonstrating the need for further work in this area.