Ganesh Ramakrishnan


2024

pdf bib
FAIR: Filtering of Automatically Induced Rules
Divya Jyoti Bajpai | Ayush Maheshwari | Manjesh Hanawal | Ganesh Ramakrishnan
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

The availability of large annotated data can be a critical bottleneck in training machine learning algorithms successfully, especially when applied to diverse domains. Weak supervision offers a promising alternative by accelerating the creation of labeled training data using domainspecific rules. However, it requires users to write a diverse set of high-quality rules to assign labels to the unlabeled data. Automatic Rule Induction (ARI) approaches circumvent this problem by automatically creating rules from features on a small labeled set and filtering a final set of rules from them. In the ARI approach, the crucial step is to filter out a set of a high-quality useful subset of rules from the large set of automatically created rules. In this paper, we propose an algorithm FAIR (Filtering of Automatically Induced Rules) to filter rules from a large number of automatically induced rules using submodular objective functions that account for the collective precision, coverage, and conflicts of the rule set. We experiment with three ARI approaches and five text classification datasets to validate the superior performance of our algorithm with respect to several semi-supervised label aggregation approaches. Further, we show that FAIR achieves statistically significant results in comparison to existing rule-filtering approaches. The source code is available at https://github.com/ ayushbits/FAIR-LF-Induction.

pdf bib
SMART: Submodular Data Mixture Strategy for Instruction Tuning
H S V N S Kowndinya Renduchintala | Sumit Bhatia | Ganesh Ramakrishnan
Findings of the Association for Computational Linguistics: ACL 2024

Instruction Tuning involves finetuning a language model on a collection of instruction-formatted datasets in order to enhance the generalizability of the model to unseen tasks. Studies have shown the importance of balancing different task proportions during finetuning, but finding the right balance remains challenging. Unfortunately, there’s currently no systematic method beyond manual tuning or relying on practitioners’ intuition. In this paper, we introduce SMART (Submodular data Mixture strAtegy for instRuction Tuning) — a novel data mixture strategy which makes use of a submodular function to assign importance scores to tasks which are then used to determine the mixture weights. Given a fine-tuning budget, SMART redistributes the budget among tasks and selects non-redundant samples from each task. Experimental results demonstrate that SMART significantly outperforms traditional methods such as examples proportional mixing and equal mixing. Furthermore, SMART facilitates the creation of data mixtures based on a few representative subsets of tasks alone and through task pruning analysis, we reveal that in a limited budget setting, allocating budget among a subset of representative tasks yields superior performance compared to distributing the budget among all tasks. The code for reproducing our results is open-sourced at https://github.com/kowndinya-renduchintala/SMART.

pdf bib
Few shot chain-of-thought driven reasoning to prompt LLMs for open-ended medical question answering
Saeel Sandeep Nachane | Ojas Gramopadhye | Prateek Chanda | Ganesh Ramakrishnan | Kshitij Sharad Jadhav | Yatin Nandwani | Dinesh Raghu | Sachindra Joshi
Findings of the Association for Computational Linguistics: EMNLP 2024

In this paper, we propose a modified version of the MedQA-USMLE dataset, named MEDQA-OPEN, which contains open-ended medical questions without options to mimic clinical scenarios, along with clinician-approved reasoned answers. Additionally, we implement a prompt driven by Chain of Thought (CoT) reasoning, CLINICR, to mirror the prospective process of incremental reasoning, reaching a correct response to medical questions. We empirically demonstrate how CLINICR outperforms the state-of-the-art 5-shot CoT-based prompt (Liévin et al., 2022). We also present an approach that mirrors real-life clinical practice by first exploring multiple differential diagnoses through MCQ-CLINICR and subsequently narrowing down to a final diagnosis using MCQ-ELIMINATIVE. Finally, emphasizing the importance of response verification in medical settings, we utilize a reward model mechanism, replacing the elimination process performed by MCQ-ELIMINATIVE.

pdf bib
Beyond Common Words: Enhancing ASR Cross-Lingual Proper Noun Recognition Using Large Language Models
Rishabh Kumar | Sabyasachi Ghosh | Ganesh Ramakrishnan
Findings of the Association for Computational Linguistics: EMNLP 2024

In this work, we address the challenge of cross-lingual proper noun recognition in automatic speech recognition (ASR), where proper nouns in an utterance may originate from a language different from the language in which the ASR system is trained. We enhance the performance of end-to-end ASR systems by instructing a large language model (LLM) to correct the ASR model’s predictions. The LLM’s context is augmented with a dictionary of cross-lingual words that are phonetically and graphemically similar to the potentially incorrect proper nouns in the ASR predictions. Our dictionary-based method DiP-ASR (Dictionary-based Prompting for Automatic Speech Recognition) significantly reduces word error rates compared to both the end-to-end ASR baseline and instruction-based prompting of the LLM without the dictionary across cross-lingual proper noun recognition tasks involving three secondary languages.

pdf bib
DictDis: Dictionary Constrained Disambiguation for Improved NMT
Ayush Maheshwari | Preethi Jyothi | Ganesh Ramakrishnan
Findings of the Association for Computational Linguistics: EMNLP 2024

Domain-specific neural machine translation (NMT) systems (, in educational applications) are socially significant with the potential to help make information accessible to a diverse set of users in multilingual societies. Such NMT systems should be lexically constrained and draw from domain-specific dictionaries. Dictionaries could present multiple candidate translations for a source word/phrase due to the polysemous nature of words. The onus is then on the NMT model to choose the contextually most appropriate candidate. Prior work has largely ignored this problem and focused on the single candidate constraint setting wherein the target word or phrase is replaced by a single constraint. In this work, we present DictDis, a lexically constrained NMT system that disambiguates between multiple candidate translations derived from dictionaries. We achieve this by augmenting training data with multiple dictionary candidates to actively encourage disambiguation during training by implicitly aligning multiple candidate constraints. We demonstrate the utility of DictDis via extensive experiments on English-Hindi, English-German, and English-French datasets across a variety of domains including regulatory, finance, engineering, health and standard benchmark test datasets. In comparison with existing approaches for lexically constrained and unconstrained NMT, we demonstrate superior performance for the copy constraint and disambiguation-related measures on all domains, while also obtaining improved fluency of up to 2-3 BLEU points on some domains. We also release our test set consisting of 4K English-Hindi sentences in multiple domains.

pdf bib
Samayik: A Benchmark and Dataset for English-Sanskrit Translation
Ayush Maheshwari | Ashim Gupta | Amrith Krishna | Atul Kumar Singh | Ganesh Ramakrishnan | Anil Kumar Gourishetty | Jitin Singla
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We release Saamayik, a dataset of around 53,000 parallel English-Sanskrit sentences, written in contemporary prose. Sanskrit is a classical language still in sustenance and has a rich documented heritage. However, due to the limited availability of digitized content, it still remains a low-resource language. Existing Sanskrit corpora, whether monolingual or bilingual, have predominantly focused on poetry and offer limited coverage of contemporary written materials. Saamayik is curated from a diverse range of domains, including language instruction material, textual teaching pedagogy, and online tutorials, among others. It stands out as a unique resource that specifically caters to the contemporary usage of Sanskrit, with a primary emphasis on prose writing. Translation models trained on our dataset demonstrate statistically significant improvements when translating out-of-domain contemporary corpora, outperforming models trained on older classical-era poetry datasets. Finally, we also release benchmark models by adapting four multilingual pre-trained models, three of them have not been previously exposed to Sanskrit for translating between English and Sanskrit while one of them is multi-lingual pre-trained translation model including English and Sanskrit. The dataset and source code can be found at https://github.com/ayushbits/saamayik.

2023

pdf bib
DITTO: Data-efficient and Fair Targeted Subset Selection for ASR Accent Adaptation
Suraj Kothawade | Anmol Mekala | D.Chandra Sekhara Hetha Havya | Mayank Kothyari | Rishabh Iyer | Ganesh Ramakrishnan | Preethi Jyothi
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

State-of-the-art Automatic Speech Recognition (ASR) systems are known to exhibit disparate performance on varying speech accents. To improve performance on a specific target accent, a commonly adopted solution is to finetune the ASR model using accent-specific labeled speech. However, acquiring large amounts of labeled speech for specific target accents is challenging. Choosing an informative subset of speech samples that are most representative of the target accents becomes important for effective ASR finetuning. To address this problem, we propose DITTO (Data-efficient and faIr Targeted subseT selectiOn that uses Submodular Mutual Information (SMI) functions as acquisition functions to find the most informative set of utterances matching a target accent within a fixed budget. An important feature of DITTO is that it supports fair targeting for multiple accents, i.e. it can automatically select representative data points from multiple accents when the ASR model needs to perform well on more than one accent. We show that compared to other speech selection methods, DITTO is 3-5 times as label-efficient for its improvements on the Indic-TTS and L2 datasets.

pdf bib
INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Language Models
H S V N S Kowndinya Renduchintala | Krishnateja Killamsetty | Sumit Bhatia | Milan Aggarwal | Ganesh Ramakrishnan | Rishabh Iyer | Balaji Krishnamurthy
Findings of the Association for Computational Linguistics: EMNLP 2023

A salient characteristic of pre-trained language models (PTLMs) is a remarkable improvement in their generalization capability and emergence of new capabilities with increasing model capacity and pre-training dataset size. Consequently, we are witnessing the development of enormous models pushing the state-of-the-art. It is, however, imperative to realize that this inevitably leads to prohibitively long training times, extortionate computing costs, and a detrimental environmental impact. Significant efforts are underway to make PTLM training more efficient through innovations in model architectures, training pipelines, and loss function design, with scant attention being paid to optimizing the utility of training data. The key question that we ask is whether it is possible to train PTLMs by employing only highly informative subsets of the training data while maintaining downstream performance? Building upon the recent progress in informative data subset selection, we show how we can employ submodular optimization to select highly representative subsets of the training corpora and demonstrate that the proposed framework can be applied to efficiently train multiple PTLMs (BERT, BioBERT, GPT-2) using only a fraction of data. Further, we perform a rigorous empirical evaluation to show that the resulting models achieve up to ~99% of the performance of the fully-trained models. We made our framework publicly available at https://github.com/Efficient-AI/ingenious.

2022

pdf bib
WARM: A Weakly (+Semi) Supervised Math Word Problem Solver
Oishik Chatterjee | Isha Pandey | Aashish Waikar | Vishwajeet Kumar | Ganesh Ramakrishnan
Proceedings of the 29th International Conference on Computational Linguistics

Solving math word problems (MWPs) is an important and challenging problem in natural language processing. Existing approaches to solving MWPs require full supervision in the form of intermediate equations. However, labeling every MWP with its corresponding equations is a time-consuming and expensive task. In order to address this challenge of equation annotation, we propose a weakly supervised model for solving MWPs by requiring only the final answer as supervision. We approach this problem by first learning to generate the equation using the problem description and the final answer, which we subsequently use to train a supervised MWP solver. We propose and compare various weakly supervised techniques to learn to generate equations directly from the problem description and answer. Through extensive experiments, we demonstrate that without using equations for supervision, our approach achieves accuracy gains of 4.5% and 32% over the current state-of-the-art weakly-supervised approach, on the standard Math23K and AllArith datasets respectively. Additionally, we curate and release new datasets of roughly 10k MWPs each in English and in Hindi (a low-resource language). These datasets are suitable for training weakly supervised models. We also present an extension of our model to semi-supervised learning and present further improvements on results, along with insights.

pdf bib
SPEAR : Semi-supervised Data Programming in Python
Guttu Abhishek | Harshad Ingole | Parth Laturia | Vineeth Dorna | Ayush Maheshwari | Ganesh Ramakrishnan | Rishabh Iyer
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We present SPEAR, an open-source python library for data programming with semi supervision. The package implements several recent data programming approaches including facility to programmatically label and build training data. SPEAR facilitates weak supervision in the form of heuristics (or rules) and association of noisy labels to the training dataset. These noisy labels are aggregated to assign labels to the unlabeled data for downstream tasks. We have implemented several label aggregation approaches that aggregate the noisy labels and then train using the noisily labeled set in a cascaded manner. Our implementation also includes other approaches that jointly aggregate and train the model for text classification tasks. Thus, in our python package, we integrate several cascade and joint data-programming approaches while also providing the facility of data programming by letting the user define labeling functions or rules. The code and tutorial notebooks are available at https://github.com/decile-team/spear. Further, extensive documentation can be found at https://spear-decile.readthedocs.io/. Video tutorials demonstrating the usage of our package are available https://youtube.com/playlist?list=PLW8agt_HvkVnOJoJAqBpaerFb-z-ZlqlP. We also present some real-world use cases of SPEAR.

pdf bib
Learning to Robustly Aggregate Labeling Functions for Semi-supervised Data Programming
Ayush Maheshwari | Krishnateja Killamsetty | Ganesh Ramakrishnan | Rishabh Iyer | Marina Danilevsky | Lucian Popa
Findings of the Association for Computational Linguistics: ACL 2022

A critical bottleneck in supervised machine learning is the need for large amounts of labeled data which is expensive and time-consuming to obtain. Although a small amount of labeled data cannot be used to train a model, it can be used effectively for the generation of humaninterpretable labeling functions (LFs). These LFs, in turn, have been used to generate a large amount of additional noisy labeled data in a paradigm that is now commonly referred to as data programming. Previous methods of generating LFs do not attempt to use the given labeled data further to train a model, thus missing opportunities for improving performance. Additionally, since the LFs are generated automatically, they are likely to be noisy, and naively aggregating these LFs can lead to suboptimal results. In this work, we propose an LF-based bi-level optimization framework WISDOM to solve these two critical limitations. WISDOM learns a joint model on the (same) labeled dataset used for LF induction along with any unlabeled data in a semi-supervised manner, and more critically, reweighs each LF according to its goodness, influencing its contribution to the semi-supervised loss using a robust bi-level optimization algorithm. We show that WISDOM significantly outperforms prior approaches on several text classification datasets.

pdf bib
Partitioned Gradient Matching-based Data Subset Selection for Compute-Efficient Robust ASR Training
Ashish Mittal | Durga Sivasubramanian | Rishabh Iyer | Preethi Jyothi | Ganesh Ramakrishnan
Findings of the Association for Computational Linguistics: EMNLP 2022

Training state-of-the-art ASR systems such as RNN-T often has a high associated financial and environmental cost. Training with a subset of training data could mitigate this problem if the subset selected could achieve on-par performance with training with the entire dataset. Although there are many data subset selection(DSS) algorithms, direct application to the RNN-T is difficult, especially the DSS algorithms that are adaptive and use learning dynamics such as gradients, as RNN-T tend to have gradients with a significantly larger memory footprint. In this paper, we propose Partitioned Gradient Matching (PGM) a novel distributable DSS algorithm, suitable for massive datasets like those used to train RNN-T. Through extensive experiments on Librispeech 100H and Librispeech 960H, we show that PGM achieves between 3x to 6x speedup with only a very small accuracy degradation (under 1% absolute WER difference). In addition, we demonstrate similar results for PGM even in settings where the training data is corrupted with noise.

pdf bib
A Benchmark and Dataset for Post-OCR text correction in Sanskrit
Ayush Maheshwari | Nikhil Singh | Amrith Krishna | Ganesh Ramakrishnan
Findings of the Association for Computational Linguistics: EMNLP 2022

Sanskrit is a classical language with about 30 million extant manuscripts fit for digitisation, available in written, printed or scanned-image forms. However, it is still considered to be a low-resource language when it comes to available digital resources. In this work, we release a post-OCR text correction dataset containing around 218,000 sentences, with 1.5 million words, from 30 different books. Texts in Sanskrit are known to be diverse in terms of their linguistic and stylistic usage since Sanskrit was the ‘lingua francua’ for discourse in the Indian subcontinent for about 3 millennia. Keeping this in mind, we release a multi-domain dataset, from areas as diverse as astronomy, medicine and mathematics, with some of them as old as 18 centuries. Further, we release multiple strong baselines as benchmarks for the task, based on pre-trained Seq2Seq language models. We find that our best-performing model, consisting of byte level tokenization in conjunction with phonetic encoding (Byt5+SLP1), yields a 23% point increase over the OCR output in terms of word and character error rates. Moreover, we perform extensive experiments in evaluating these models on their performance and analyse common causes of mispredictions both at the graphemic and lexical levels. Our code and dataset is publicly available at https://github.com/ayushbits/pe-ocr-sanskrit.

2021

pdf bib
Joint Learning of Hyperbolic Label Embeddings for Hierarchical Multi-label Classification
Soumya Chatterjee | Ayush Maheshwari | Ganesh Ramakrishnan | Saketha Nath Jagarlapudi
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

We consider the problem of multi-label classification, where the labels lie on a hierarchy. However, unlike most existing works in hierarchical multi-label classification, we do not assume that the label-hierarchy is known. Encouraged by the recent success of hyperbolic embeddings in capturing hierarchical relations, we propose to jointly learn the classifier parameters as well as the label embeddings. Such a joint learning is expected to provide a twofold advantage: i) the classifier generalises better as it leverages the prior knowledge of existence of a hierarchy over the labels, and ii) in addition to the label co-occurrence information, the label-embedding may benefit from the manifold structure of the input datapoints, leading to embeddings that are more faithful to the label hierarchy. We propose a novel formulation for the joint learning and empirically evaluate its efficacy. The results show that the joint learning improves over the baseline that employs label co-occurrence based pre-trained hyperbolic embeddings. Moreover, the proposed classifiers achieve state-of-the-art generalization on standard benchmarks. We also present evaluation of the hyperbolic embeddings obtained by joint learning and show that they represent the hierarchy more accurately than the other alternatives.

pdf bib
Meta-Learning for Effective Multi-task and Multilingual Modelling
Ishan Tarunesh | Sushil Khyalia | Vishwajeet Kumar | Ganesh Ramakrishnan | Preethi Jyothi
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Natural language processing (NLP) tasks (e.g. question-answering in English) benefit from knowledge of other tasks (e.g., named entity recognition in English) and knowledge of other languages (e.g., question-answering in Spanish). Such shared representations are typically learned in isolation, either across tasks or across languages. In this work, we propose a meta-learning approach to learn the interactions between both tasks and languages. We also investigate the role of different sampling strategies used during meta-learning. We present experiments on five different tasks and six different languages from the XTREME multilingual benchmark dataset. Our meta-learned model clearly improves in performance compared to competitive baseline models that also include multi-task baselines. We also present zero-shot evaluations on unseen target languages to demonstrate the utility of our proposed model.

pdf bib
Semi-Supervised Data Programming with Subset Selection
Ayush Maheshwari | Oishik Chatterjee | Krishnateja Killamsetty | Ganesh Ramakrishnan | Rishabh Iyer
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
Rule Augmented Unsupervised Constituency Parsing
Atul Sahay | Anshul Nasery | Ayush Maheshwari | Ganesh Ramakrishnan | Rishabh Iyer
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
Automatic Speech Recognition in Sanskrit: A New Speech Corpus and Modelling Insights
Devaraja Adiga | Rishabh Kumar | Amrith Krishna | Preethi Jyothi | Ganesh Ramakrishnan | Pawan Goyal
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

pdf bib
Vocabulary Matters: A Simple yet Effective Approach to Paragraph-level Question Generation
Vishwajeet Kumar | Manish Joshi | Ganesh Ramakrishnan | Yuan-Fang Li
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

Question generation (QG) has recently attracted considerable attention. Most of the current neural models take as input only one or two sentences, and perform poorly when multiple sentences or complete paragraphs are given as input. However, in real-world scenarios it is very important to be able to generate high-quality questions from complete paragraphs. In this paper, we present a simple yet effective technique for answer-aware question generation from paragraphs. We augment a basic sequence-to-sequence QG model with dynamic, paragraph-specific dictionary and copy attention that is persistent across the corpus, without requiring features generated by sophisticated NLP pipelines or handcrafted rules. Our evaluation on SQuAD shows that our model significantly outperforms current state-of-the-art systems in question generation from paragraphs in both automatic and human evaluation. We achieve a 6-point improvement over the best system on BLEU-4, from 16.38 to 22.62.

2019

pdf bib
ParaQG: A System for Generating Questions and Answers from Paragraphs
Vishwajeet Kumar | Sivaanandh Muneeswaran | Ganesh Ramakrishnan | Yuan-Fang Li
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations

Generating syntactically and semantically valid and relevant questions from paragraphs is useful with many applications. Manual generation is a labour-intensive task, as it requires the reading, parsing and understanding of long passages of text. A number of question generation models based on sequence-to-sequence techniques have recently been proposed. Most of them generate questions from sentences only, and none of them is publicly available as an easy-to-use service. In this paper, we demonstrate ParaQG, a Web-based system for generating questions from sentences and paragraphs. ParaQG incorporates a number of novel functionalities to make the question generation process user-friendly. It provides an interactive interface for a user to select answers with visual insights on generation of questions. It also employs various faceted views to group similar questions as well as filtering techniques to eliminate unanswerable questions.

pdf bib
Putting the Horse before the Cart: A Generator-Evaluator Framework for Question Generation from Text
Vishwajeet Kumar | Ganesh Ramakrishnan | Yuan-Fang Li
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

Automatic question generation (QG) is a useful yet challenging task in NLP. Recent neural network-based approaches represent the state-of-the-art in this task. In this work, we attempt to strengthen them significantly by adopting a holistic and novel generator-evaluator framework that directly optimizes objectives that reward semantics and structure. The generator is a sequence-to-sequence model that incorporates the structure and semantics of the question being generated. The generator predicts an answer in the passage that the question can pivot on. Employing the copy and coverage mechanisms, it also acknowledges other contextually important (and possibly rare) keywords in the passage that the question needs to conform to, while not redundantly repeating words. The evaluator model evaluates and assigns a reward to each predicted question based on its conformity to the structure of ground-truth questions. We propose two novel QG-specific reward functions for text conformity and answer conformity of the generated question. The evaluator also employs structure-sensitive rewards based on evaluation measures such as BLEU, GLEU, and ROUGE-L, which are suitable for QG. In contrast, most of the previous works only optimize the cross-entropy loss, which can induce inconsistencies between training (objective) and testing (evaluation) measures. Our evaluation shows that our approach significantly outperforms state-of-the-art systems on the widely-used SQuAD benchmark as per both automatic and human evaluation.

pdf bib
Cross-Lingual Training for Automatic Question Generation
Vishwajeet Kumar | Nitish Joshi | Arijit Mukherjee | Ganesh Ramakrishnan | Preethi Jyothi
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Automatic question generation (QG) is a challenging problem in natural language understanding. QG systems are typically built assuming access to a large number of training instances where each instance is a question and its corresponding answer. For a new language, such training instances are hard to obtain making the QG problem even more challenging. Using this as our motivation, we study the reuse of an available large QG dataset in a secondary language (e.g. English) to learn a QG model for a primary language (e.g. Hindi) of interest. For the primary language, we assume access to a large amount of monolingual text but only a small QG dataset. We propose a cross-lingual QG model which uses the following training regime: (i) Unsupervised pretraining of language models in both primary and secondary languages and (ii) joint supervised training for QG in both languages. We demonstrate the efficacy of our proposed approach using two different primary languages, Hindi and Chinese. Our proposed framework clearly outperforms a number of baseline models, including a fully-supervised transformer-based model trained on the QG datasets in the primary language. We also create and release a new question answering dataset for Hindi consisting of 6555 sentences.

2018

pdf bib
Entity Resolution and Location Disambiguation in the Ancient Hindu Temples Domain using Web Data
Ayush Maheshwari | Vishwajeet Kumar | Ganesh Ramakrishnan | J. Saketha Nath
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

We present a system for resolving entities and disambiguating locations based on publicly available web data in the domain of ancient Hindu Temples. Scarce, unstructured information poses a challenge to Entity Resolution(ER) and snippet ranking. Additionally, because the same set of entities may be associated with multiple locations, Location Disambiguation(LD) is a problem. The mentions and descriptions of temples exist in the order of hundreds of thousands, with such data generated by various users in various forms such as text (Wikipedia pages), videos (YouTube videos), blogs, etc. We demonstrate an integrated approach using a combination of grammar rules for parsing and unsupervised (clustering) algorithms to resolve entity and locations with high confidence. A demo of our system is accessible at tinyurl.com/templedemos. Our system is open source and available on GitHub.

2015

pdf bib
A machine-assisted human translation system for technical documents
Vishwajeet Kumar | Ashish Kulkarni | Pankaj Singh | Ganesh Ramakrishnan | Ganesh Arnaal
Proceedings of Machine Translation Summit XV: User Track

pdf bib
Optimizing Multivariate Performance Measures for Learning Relation Extraction Models
Gholamreza Haffari | Ajay Nagesh | Ganesh Ramakrishnan
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Summarization of Multi-Document Topic Hierarchies using Submodular Mixtures
Ramakrishna Bairi | Rishabh Iyer | Ganesh Ramakrishnan | Jeff Bilmes
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf bib
An Approach to Collective Entity Linking
Ashish Kulkarni | Kanika Agarwal | Pararth Shah | Sunny Raj Rathod | Ganesh Ramakrishnan
Proceedings of the 12th International Conference on Natural Language Processing

2014

pdf bib
Noisy Or-based model for Relation Extraction using Distant Supervision
Ajay Nagesh | Gholamreza Haffari | Ganesh Ramakrishnan
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf bib
Efficient Reuse of Structured and Unstructured Resources for Ontology Population
Chetana Gavankar | Ashish Kulkarni | Ganesh Ramakrishnan
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We study the problem of ontology population for a domain ontology and present solutions based on semi-automatic techniques. A domain ontology for an organization, often consists of classes whose instances are either specific to, or independent of the organization. E.g. in an academic domain ontology, classes like Professor, Department could be organization (university) specific, while Conference, Programming languages are organization independent. This distinction allows us to leverage data sources both―within the organization and those in the Internet ― to extract entities and populate an ontology. We propose techniques that build on those for open domain IE. Together with user input, we show through comprehensive evaluation, how these semi-automatic techniques achieve high precision. We experimented with the academic domain and built an ontology comprising of over 220 classes. Intranet documents from five universities formed our organization specific corpora and we used open domain knowledge bases like Wikipedia, Linked Open Data, and web pages from the Internet as the organization independent data sources. The populated ontology that we built for one of the universities comprised of over 75,000 instances. We adhere to the semantic web standards and tools and make the resources available in the OWL format. These could be useful for applications such as information extraction, text annotation, and information retrieval.

2013

pdf bib
Learning to Generate Diversified Query Interpretations using Biconvex Optimization
Ramakrishna Bairi | Ambha A | Ganesh Ramakrishnan
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
Structure Cognizant Pseudo Relevance Feedback
Arjun Atreya V | Yogesh Kakde | Pushpak Bhattacharyya | Ganesh Ramakrishnan
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
SATTY : Word Sense Induction Application in Web Search Clustering
Satyabrata Behera | Upasana Gaikwad | Ramakrishna Bairi | Ganesh Ramakrishnan
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

2012

pdf bib
Towards Efficient Named-Entity Rule Induction for Customizability
Ajay Nagesh | Ganesh Ramakrishnan | Laura Chiticariu | Rajasekar Krishnamurthy | Ankush Dharkar | Pushpak Bhattacharyya
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

pdf bib
Building Multilingual Search Index using open source framework
Arjun Atreya | Swapnil Chaudhari | Pushpak Bhattacharyya | Ganesh Ramakrishnan
Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing

pdf bib
Error tracking in search engine development
Swapnil Chaudhari | Arjun Atreya V | Pushpak Bhattacharyya | Ganesh Ramakrishnan
Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing

pdf bib
Proceedings of the Workshop on Information Extraction and Entity Analytics on Social Media Data
Sriram Raghavan | Ganesh Ramakrishnan
Proceedings of the Workshop on Information Extraction and Entity Analytics on Social Media Data

pdf bib
Effective Mentor Suggestion System for Online Collaboration Platform
Advait Raut | Upasana Gaikwad | Ramakrishna Bairi | Ganesh Ramakrishnan
Proceedings of the Workshop on Speech and Language Processing Tools in Education

pdf bib
Enriching An Academic knowledge base using Linked Open Data
Chetana Gavankar | Ashish Kulkarni | Yuan Fang Li | Ganesh Ramakrishnan
Proceedings of the Workshop on Speech and Language Processing Tools in Education

pdf bib
Content Bookmarking and Recommendation
Ananth Vyasarayamut | Satyabrata Behera | Ganesh Ramakrishnan
Proceedings of the Workshop on Speech and Language Processing Tools in Education

pdf bib
Proceedings of the Workshop on Question Answering for Complex Domains
Nanda Kambhatla | Sachindra Joshi | Ganesh Ramakrishnan | Kiran Kate | Priyanka Agrawal
Proceedings of the Workshop on Question Answering for Complex Domains

2008

pdf bib
Learning Decision Lists with Known Rules for Text Mining
Venkatesan Chakravarthy | Sachindra Joshi | Ganesh Ramakrishnan | Shantanu Godbole | Sreeram Balakrishnan
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II

2007

pdf bib
USP-IBM-1 and USP-IBM-2: The ILP-based Systems for Lexical Sample WSD in SemEval-2007
Lucia Specia | Maria das Graças | Volpe Nunes | Ashwin Srinivasan | Ganesh Ramakrishnan
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

2006

pdf bib
Entity Annotation based on Inverse Index Operations
Ganesh Ramakrishnan | Sreeram Balakrishnan | Sachindra Joshi
Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing

2004

pdf bib
Generic Text Summarization Using WordNet
Kedar Bellare | Anish Das Sarma | Atish Das Sarma | Navneet Loiwal | Vaibhav Mehta | Ganesh Ramakrishnan | Pushpak Bhattacharyya
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf bib
A gloss-centered algorithm for disambiguation
Ganesh Ramakrishnan | B. Prithviraj | Pushpak Bhattacharyya
Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text

2003

pdf bib
Question Answering via Bayesian Inference on Lexical Relations
Ganesh Ramakrishnan | Apurva Jadhav | Ashutosh Joshi | Soumen Chakrabarti | Pushpak Bhattacharyya
Proceedings of the ACL 2003 Workshop on Multilingual Summarization and Question Answering

Search
Co-authors
Fix data