Amrith Krishna


2022

pdf bib
Does Meta-learning Help mBERT for Few-shot Question Generation in a Cross-lingual Transfer Setting for Indic Languages?
Aniruddha Roy | Rupak Kumar Thakur | Isha Sharma | Ashim Gupta | Amrith Krishna | Sudeshna Sarkar | Pawan Goyal
Proceedings of the 29th International Conference on Computational Linguistics

Few-shot Question Generation (QG) is an important and challenging problem in the Natural Language Generation (NLG) domain. Multilingual BERT (mBERT) has been successfully used in various Natural Language Understanding (NLU) applications. However, the question of how to utilize mBERT for few-shot QG, possibly with cross-lingual transfer, remains. In this paper, we try to explore how mBERT performs in few-shot QG (cross-lingual transfer) and also whether applying meta-learning on mBERT further improves the results. In our setting, we consider mBERT as the base model and fine-tune it using a seq-to-seq language modeling framework in a cross-lingual setting. Further, we apply the model agnostic meta-learning approach to our base model. We evaluate our model for two low-resource Indian languages, Bengali and Telugu, using the TyDi QA dataset. The proposed approach consistently improves the performance of the base model in few-shot settings and even works better than some heavily parameterized models. Human evaluation also confirms the effectiveness of our approach.

2021

pdf bib
Automatic Speech Recognition in Sanskrit: A New Speech Corpus and Modelling Insights
Devaraja Adiga | Rishabh Kumar | Amrith Krishna | Preethi Jyothi | Ganesh Ramakrishnan | Pawan Goyal
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
A Little Pretraining Goes a Long Way: A Case Study on Dependency Parsing Task for Low-resource Morphologically Rich Languages
Jivnesh Sandhan | Amrith Krishna | Ashim Gupta | Laxmidhar Behera | Pawan Goyal
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

Neural dependency parsing has achieved remarkable performance for many domains and languages. The bottleneck of massive labelled data limits the effectiveness of these approaches for low resource languages. In this work, we focus on dependency parsing for morphological rich languages (MRLs) in a low-resource setting. Although morphological information is essential for the dependency parsing task, the morphological disambiguation and lack of powerful analyzers pose challenges to get this information for MRLs. To address these challenges, we propose simple auxiliary tasks for pretraining. We perform experiments on 10 MRLs in low-resource settings to measure the efficacy of our proposed pretraining method and observe an average absolute gain of 2 points (UAS) and 3.6 points (LAS).

2020

pdf bib
A Graph-Based Framework for Structured Prediction Tasks in Sanskrit
Amrith Krishna | Bishal Santra | Ashim Gupta | Pavankumar Satuluri | Pawan Goyal
Computational Linguistics, Volume 46, Issue 4 - December 2020

We propose a framework using energy-based models for multiple structured prediction tasks in Sanskrit. Ours is an arc-factored model, similar to the graph-based parsing approaches, and we consider the tasks of word segmentation, morphological parsing, dependency parsing, syntactic linearization, and prosodification, a “prosody-level” task we introduce in this work. Ours is a search-based structured prediction framework, which expects a graph as input, where relevant linguistic information is encoded in the nodes, and the edges are then used to indicate the association between these nodes. Typically, the state-of-the-art models for morphosyntactic tasks in morphologically rich languages still rely on hand-crafted features for their performance. But here, we automate the learning of the feature function. The feature function so learned, along with the search space we construct, encode relevant linguistic information for the tasks we consider. This enables us to substantially reduce the training data requirements to as low as 10%, as compared to the data requirements for the neural state-of-the-art models. Our experiments in Czech and Sanskrit show the language-agnostic nature of the framework, where we train highly competitive models for both the languages. Moreover, our framework enables us to incorporate language-specific constraints to prune the search space and to filter the candidates during inference. We obtain significant improvements in morphosyntactic tasks for Sanskrit by incorporating language-specific constraints into the model. In all the tasks we discuss for Sanskrit, we either achieve state-of-the-art results or ours is the only data-driven solution for those tasks.

pdf bib
SHR++: An Interface for Morpho-syntactic Annotation of Sanskrit Corpora
Amrith Krishna | Shiv Vidhyut | Dilpreet Chawla | Sruti Sambhavi | Pawan Goyal
Proceedings of the Twelfth Language Resources and Evaluation Conference

We propose a web-based annotation framework, SHR++, for morpho-syntactic annotation of corpora in Sanskrit. SHR++ is designed to generate annotations for the word-segmentation, morphological parsing and dependency analysis tasks in Sanskrit. It incorporates analyses and predictions from various tools designed for processing texts in Sanskrit, and utilise them to ease the cognitive load of the human annotators. Specifically, SHR++ uses Sanskrit Heritage Reader, a lexicon driven shallow parser for enumerating all the phonetically and lexically valid word splits along with their morphological analyses for a given string. This would help the annotators in choosing the solutions, rather than performing the segmentations by themselves. Further, predictions from a word segmentation tool are added as suggestions that can aid the human annotators in their decision making. Our evaluation shows that enabling this segmentation suggestion component reduces the annotation time by 20.15 %. SHR++ can be accessed online at http://vidhyut97.pythonanywhere.com/ and the codebase, for the independent deployment of the system elsewhere, is hosted at https://github.com/iamdsc/smart-sanskrit-annotator.

pdf bib
Keep it Surprisingly Simple: A Simple First Order Graph Based Parsing Model for Joint Morphosyntactic Parsing in Sanskrit
Amrith Krishna | Ashim Gupta | Deepak Garasangi | Pavankumar Satuluri | Pawan Goyal
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Morphologically rich languages seem to benefit from joint processing of morphology and syntax, as compared to pipeline architectures. We propose a graph-based model for joint morphological parsing and dependency parsing in Sanskrit. Here, we extend the Energy based model framework (Krishna et al., 2020), proposed for several structured prediction tasks in Sanskrit, in 2 simple yet significant ways. First, the framework’s default input graph generation method is modified to generate a multigraph, which enables the use of an exact search inference. Second, we prune the input search space using a linguistically motivated approach, rooted in the traditional grammatical analysis of Sanskrit. Our experiments show that the morphological parsing from our joint model outperforms standalone morphological parsers. We report state of the art results in morphological parsing, and in dependency parsing, both in standalone (with gold morphological tags) and joint morphosyntactic parsing setting.

pdf bib
Evaluating Neural Morphological Taggers for Sanskrit
Ashim Gupta | Amrith Krishna | Pawan Goyal | Oliver Hellwig
Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

Neural sequence labelling approaches have achieved state of the art results in morphological tagging. We evaluate the efficacy of four standard sequence labelling models on Sanskrit, a morphologically rich, fusional Indian language. As its label space can theoretically contain more than 40,000 labels, systems that explicitly model the internal structure of a label are more suited for the task, because of their ability to generalise to labels not seen during training. We find that although some neural models perform better than others, one of the common causes for error for all of these models is mispredictions due to syncretism.

2019

pdf bib
Revisiting the Role of Feature Engineering for Compound Type Identification in Sanskrit
Jivnesh Sandhan | Amrith Krishna | Pawan Goyal | Laxmidhar Behera
Proceedings of the 6th International Sanskrit Computational Linguistics Symposium

pdf bib
Poetry to Prose Conversion in Sanskrit as a Linearisation Task: A Case for Low-Resource Languages
Amrith Krishna | Vishnu Sharma | Bishal Santra | Aishik Chakraborty | Pavankumar Satuluri | Pawan Goyal
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

The word ordering in a Sanskrit verse is often not aligned with its corresponding prose order. Conversion of the verse to its corresponding prose helps in better comprehension of the construction. Owing to the resource constraints, we formulate this task as a word ordering (linearisation) task. In doing so, we completely ignore the word arrangement at the verse side. kāvya guru, the approach we propose, essentially consists of a pipeline of two pretraining steps followed by a seq2seq model. The first pretraining step learns task-specific token embeddings from pretrained embeddings. In the next step, we generate multiple possible hypotheses for possible word arrangements of the input %using another pretraining step. We then use them as inputs to a neural seq2seq model for the final prediction. We empirically show that the hypotheses generated by our pretraining step result in predictions that consistently outperform predictions based on the original order in the verse. Overall, kāvya guru outperforms current state of the art models in linearisation for the poetry to prose conversion task in Sanskrit.

2018

pdf bib
Free as in Free Word Order: An Energy Based Model for Word Segmentation and Morphological Tagging in Sanskrit
Amrith Krishna | Bishal Santra | Sasi Prasanth Bandaru | Gaurav Sahu | Vishnu Dutt Sharma | Pavankumar Satuluri | Pawan Goyal
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

The configurational information in sentences of a free word order language such as Sanskrit is of limited use. Thus, the context of the entire sentence will be desirable even for basic processing tasks such as word segmentation. We propose a structured prediction framework that jointly solves the word segmentation and morphological tagging tasks in Sanskrit. We build an energy based model where we adopt approaches generally employed in graph based parsing techniques (McDonald et al., 2005a; Carreras, 2007). Our model outperforms the state of the art with an F-Score of 96.92 (percentage improvement of 7.06%) while using less than one tenth of the task-specific training data. We find that the use of a graph based approach instead of a traditional lattice-based sequential labelling approach leads to a percentage gain of 12.6% in F-Score for the segmentation task.

pdf bib
Upcycle Your OCR: Reusing OCRs for Post-OCR Text Correction in Romanised Sanskrit
Amrith Krishna | Bodhisattwa P. Majumder | Rajesh Bhat | Pawan Goyal
Proceedings of the 22nd Conference on Computational Natural Language Learning

We propose a post-OCR text correction approach for digitising texts in Romanised Sanskrit. Owing to the lack of resources our approach uses OCR models trained for other languages written in Roman. Currently, there exists no dataset available for Romanised Sanskrit OCR. So, we bootstrap a dataset of 430 images, scanned in two different settings and their corresponding ground truth. For training, we synthetically generate training images for both the settings. We find that the use of copying mechanism (Gu et al., 2016) yields a percentage increase of 7.69 in Character Recognition Rate (CRR) than the current state of the art model in solving monotone sequence-to-sequence tasks (Schnober et al., 2016). We find that our system is robust in combating OCR-prone errors, as it obtains a CRR of 87.01% from an OCR output with CRR of 35.76% for one of the dataset settings. A human judgement survey performed on the models shows that our proposed model results in predictions which are faster to comprehend and faster to improve for a human than the other systems.

pdf bib
Building a Word Segmenter for Sanskrit Overnight
Vikas Reddy | Amrith Krishna | Vishnu Sharma | Prateek Gupta | Vineeth M R | Pawan Goyal
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf bib
A Dataset for Sanskrit Word Segmentation
Amrith Krishna | Pavan Kumar Satuluri | Pawan Goyal
Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

The last decade saw a surge in digitisation efforts for ancient manuscripts in Sanskrit. Due to various linguistic peculiarities inherent to the language, even the preliminary tasks such as word segmentation are non-trivial in Sanskrit. Elegant models for Word Segmentation in Sanskrit are indispensable for further syntactic and semantic processing of the manuscripts. Current works in word segmentation for Sanskrit, though commendable in their novelty, often have variations in their objective and evaluation criteria. In this work, we set the record straight. We formally define the objectives and the requirements for the word segmentation task. In order to encourage research in the field and to alleviate the time and effort required in pre-processing, we release a dataset of 115,000 sentences for word segmentation. For each sentence in the dataset we include the input character sequence, ground truth segmentation, and additionally lexical and morphological information about all the phonetically possible segments for the given sentence. In this work, we also discuss the linguistic considerations made while generating the candidate space of the possible segments.

pdf bib
A Graph Based Semi-Supervised Approach for Analysis of Derivational Nouns in Sanskrit
Amrith Krishna | Pavankumar Satuluri | Harshavardhan Ponnada | Muneeb Ahmed | Gulab Arora | Kaustubh Hiware | Pawan Goyal
Proceedings of TextGraphs-11: the Workshop on Graph-based Methods for Natural Language Processing

Derivational nouns are widely used in Sanskrit corpora and represent an important cornerstone of productivity in the language. Currently there exists no analyser that identifies the derivational nouns. We propose a semi supervised approach for identification of derivational nouns in Sanskrit. We not only identify the derivational words, but also link them to their corresponding source words. Our novelty comes in the design of the network structure for the task. The edge weights are featurised based on the phonetic, morphological, syntactic and the semantic similarity shared between the words to be identified. We find that our model is effective for the task, even when we employ a labelled dataset which is only 5 % to that of the entire dataset.

2016

pdf bib
Word Segmentation in Sanskrit Using Path Constrained Random Walks
Amrith Krishna | Bishal Santra | Pavankumar Satuluri | Sasi Prasanth Bandaru | Bhumi Faldu | Yajuvendra Singh | Pawan Goyal
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

In Sanskrit, the phonemes at the word boundaries undergo changes to form new phonemes through a process called as sandhi. A fused sentence can be segmented into multiple possible segmentations. We propose a word segmentation approach that predicts the most semantically valid segmentation for a given sentence. We treat the problem as a query expansion problem and use the path-constrained random walks framework to predict the correct segments.

pdf bib
Compound Type Identification in Sanskrit: What Roles do the Corpus and Grammar Play?
Amrith Krishna | Pavankumar Satuluri | Shubham Sharma | Apurv Kumar | Pawan Goyal
Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)

We propose a classification framework for semantic type identification of compounds in Sanskrit. We broadly classify the compounds into four different classes namely, Avyayībhāva, Tatpuruṣa, Bahuvrīhi and Dvandva. Our classification is based on the traditional classification system followed by the ancient grammar treatise Adṣṭādhyāyī, proposed by Pāṇini 25 centuries back. We construct an elaborate features space for our system by combining conditional rules from the grammar Adṣṭādhyāyī, semantic relations between the compound components from a lexical database Amarakoṣa and linguistic structures from the data using Adaptor Grammars. Our in-depth analysis of the feature space highlight inadequacy of Adṣṭādhyāyī, a generative grammar, in classifying the data samples. Our experimental results validate the effectiveness of using lexical databases as suggested by Amba Kulkarni and Anil Kumar, and put forward a new research direction by introducing linguistic patterns obtained from Adaptor grammars for effective identification of compound type. We utilise an ensemble based approach, specifically designed for handling skewed datasets and we %and Experimenting with various classification methods, we achieve an overall accuracy of 0.77 using random forest classifiers.