Diptesh Kanojia


2021

Cognition-aware Cognate Detection
Diptesh Kanojia | Prashant Sharma | Sayali Ghodekar | Pushpak Bhattacharyya | Gholamreza Haffari | Malhar Kulkarni
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Automatic detection of cognates helps downstream NLP tasks such as Machine Translation, Cross-lingual Information Retrieval, Computational Phylogenetics and Cross-lingual Named Entity Recognition. Previous approaches to cognate detection use feature sets based on orthographic, phonetic and semantic similarity. In this paper, we propose a novel method for enriching these feature sets with cognitive features extracted from human readers’ gaze behaviour. We collect gaze behaviour data for a small sample of cognates and show that the extracted cognitive features help the task of cognate detection. However, gaze data collection and annotation is a costly task. We therefore use the collected gaze behaviour data to predict cognitive features for a larger sample and show that the predicted cognitive features also significantly improve task performance. We report improvements of 10% with the collected gaze features, and 12% using the predicted gaze features, over the previously proposed approaches. Furthermore, we release the collected gaze behaviour data along with our code and cross-lingual models.

FrameNet-assisted Noun Compound Interpretation
Girishkumar Ponkiya | Diptesh Kanojia | Pushpak Bhattacharyya | Girish Palshikar
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

Harnessing Cross-lingual Features to Improve Cognate Detection for Low-resource Languages
Diptesh Kanojia | Raj Dabre | Shubham Dewangan | Pushpak Bhattacharyya | Gholamreza Haffari | Malhar Kulkarni
Proceedings of the 28th International Conference on Computational Linguistics

Cognates are variants of the same lexical form across different languages; for example, “fonema” in Spanish and “phoneme” in English are cognates, both of which mean “a unit of sound”. The task of automatic detection of cognates between any two languages can help downstream NLP tasks such as Cross-lingual Information Retrieval, Computational Phylogenetics, and Machine Translation. In this paper, we demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian languages. Our approach introduces the use of context from a knowledge graph to generate improved feature representations for cognate detection. We then evaluate the impact of our cognate detection mechanism on neural machine translation (NMT) as a downstream task. We evaluate our methods to detect cognates on a challenging dataset of twelve Indian languages, namely, Sanskrit, Hindi, Assamese, Oriya, Kannada, Gujarati, Tamil, Telugu, Punjabi, Bengali, Marathi, and Malayalam. Additionally, we create evaluation datasets for two more Indian languages, Konkani and Nepali. We observe an improvement of up to 18 percentage points, in terms of F-score, for cognate detection. Furthermore, we observe that cognates extracted using our method help improve NMT quality by up to 2.76 BLEU. We also release our code, newly constructed datasets and cross-lingual models publicly.

Challenge Dataset of Cognates and False Friend Pairs from Indian Languages
Diptesh Kanojia | Malhar Kulkarni | Pushpak Bhattacharyya | Gholamreza Haffari
Proceedings of the 12th Language Resources and Evaluation Conference

Cognates are present in multiple variants of the same text across different languages (e.g., “hund” in German and “hound” in the English language mean “dog”). They pose a challenge to various Natural Language Processing (NLP) applications such as Machine Translation, Cross-lingual Sense Disambiguation, Computational Phylogenetics, and Information Retrieval. A possible solution to address this challenge is to identify cognates across language pairs. In this paper, we describe the creation of two cognate datasets for twelve Indian languages namely Sanskrit, Hindi, Assamese, Oriya, Kannada, Gujarati, Tamil, Telugu, Punjabi, Bengali, Marathi, and Malayalam. We digitize the cognate data from an Indian language cognate dictionary and utilize linked Indian language Wordnets to generate cognate sets. Additionally, we use the Wordnet data to create a False Friends’ dataset for eleven language pairs. We also evaluate the efficacy of our dataset using previously available baseline cognate detection approaches. We also perform a manual evaluation with the help of lexicographers and release the curated gold-standard dataset with this paper.

Recommendation Chart of Domains for Cross-Domain Sentiment Analysis: Findings of A 20 Domain Study
Akash Sheoran | Diptesh Kanojia | Aditya Joshi | Pushpak Bhattacharyya
Proceedings of the 12th Language Resources and Evaluation Conference

Cross-domain sentiment analysis (CDSA) helps to address the problem of data scarcity in scenarios where labelled data for a domain (known as the target domain) is unavailable or insufficient. However, the decision to choose a domain (known as the source domain) to leverage from is, at best, intuitive. In this paper, we investigate text similarity metrics to facilitate source domain selection for CDSA. We report results on 20 domains (all possible pairs) using 11 similarity metrics. Specifically, we compare CDSA performance with these metrics for different domain-pairs to enable the selection of a suitable source domain, given a target domain. These metrics include two novel metrics for evaluating domain adaptability to help source domain selection of labelled data and utilize word and sentence-based embeddings as metrics for unlabelled data. The goal of our experiments is a recommendation chart that gives the K best source domains for CDSA for a given target domain. We show that the best K source domains returned by our similarity metrics have a precision of over 50%, for varying values of K.
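
As an illustration of how the precision of a K-best recommendation can be measured, here is a minimal sketch. The function name, the domain names and the choice to divide by K are assumptions made for this example; the paper's own evaluation code is not shown in the abstract.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended source domains that are relevant.

    recommended: list of domain names, best first (hypothetical ranking).
    relevant: set of domain names that are truly good sources (hypothetical).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = recommended[:k]
    hits = sum(1 for domain in top_k if domain in relevant)
    # Dividing by k (not by len(top_k)) penalises rankings shorter than k.
    return hits / k


# Hypothetical ranking of source domains for one target domain.
ranking = ["books", "movies", "music", "kitchen"]
good_sources = {"books", "music"}
print(precision_at_k(ranking, good_sources, 2))
print(precision_at_k(ranking, good_sources, 4))
```

With this toy input, one of the top two and two of the top four recommendations are relevant, so both calls give a precision of one half.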

Cognitively Aided Zero-Shot Automatic Essay Grading
Sandeep Mathias | Rudra Murthy | Diptesh Kanojia | Pushpak Bhattacharyya
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

Automatic essay grading (AEG) is a process in which machines assign a grade to an essay written in response to a topic, called the prompt. Zero-shot AEG is when we train a system to grade essays written to a new prompt which was not present in our training data. In this paper, we describe a solution to the problem of zero-shot automatic essay grading, using cognitive information, in the form of gaze behaviour. Our experiments show that using gaze behaviour helps in improving the performance of AEG systems, especially when we provide a new essay written in response to a new prompt for scoring, by an average of almost 5 percentage points of QWK.
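
QWK here refers to quadratic weighted kappa, a standard agreement measure between two sets of ordinal ratings. The sketch below computes it in plain Python under the usual definition; the function name and the toy ratings are assumptions for illustration and are not taken from the paper.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Quadratic weighted kappa between two equal-length lists of integer ratings."""
    n_ratings = max_rating - min_rating + 1
    if n_ratings < 2:
        raise ValueError("need at least two possible ratings")
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rating lists must be non-empty and of equal length")

    n = len(rater_a)
    # Observed confusion matrix: observed[i][j] counts (a == i, b == j).
    observed = [[0] * n_ratings for _ in range(n_ratings)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n_ratings)) for j in range(n_ratings)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n_ratings):
        for j in range(n_ratings):
            weight = (i - j) ** 2 / (n_ratings - 1) ** 2
            expected = hist_a[i] * hist_b[j] / n  # chance agreement
            numerator += weight * observed[i][j]
            denominator += weight * expected

    if denominator == 0:
        # Both raters used one and the same rating throughout.
        return 1.0
    return 1.0 - numerator / denominator


# Toy ratings on a 1-3 scale (hypothetical).
print(quadratic_weighted_kappa([1, 2, 3], [1, 2, 3], 1, 3))
print(quadratic_weighted_kappa([1, 2, 3], [3, 2, 1], 1, 3))
```

Identical ratings give a kappa of 1, and the fully reversed toy ratings give -1.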

Happy Are Those Who Grade without Seeing: A Multi-Task Learning Approach to Grade Essays Using Gaze Behaviour
Sandeep Mathias | Rudra Murthy | Diptesh Kanojia | Abhijit Mishra | Pushpak Bhattacharyya
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

The gaze behaviour of a reader is helpful in solving several NLP tasks such as automatic essay grading. However, collecting gaze behaviour from readers is costly in terms of time and money. In this paper, we propose a way to improve automatic essay grading using gaze behaviour, which is learnt at run time using a multi-task learning framework. To demonstrate the efficacy of this multi-task learning based approach to automatic essay grading, we collect gaze behaviour for 48 essays across 4 essay sets, and learn gaze behaviour for the rest of the essays, numbering over 7000 essays. Using the learnt gaze behaviour, we can achieve a statistically significant improvement in performance over the state-of-the-art system for the essay sets where we have gaze data. We also achieve a statistically significant improvement for 4 other essay sets, numbering about 6000 essays, where we have no gaze behaviour data available. Our approach establishes that learning gaze behaviour improves automatic essay grading.

“A Passage to India”: Pre-trained Word Embeddings for Indian Languages
Saurav Kumar | Saunack Kumar | Diptesh Kanojia | Pushpak Bhattacharyya
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

Dense word vectors or ‘word embeddings’ which encode semantic properties of words, have now become integral to NLP tasks like Machine Translation (MT), Question Answering (QA), Word Sense Disambiguation (WSD), and Information Retrieval (IR). In this paper, we use various existing approaches to create multiple word embeddings for 14 Indian languages. We place these embeddings for all these languages, viz., Assamese, Bengali, Gujarati, Hindi, Kannada, Konkani, Malayalam, Marathi, Nepali, Odiya, Punjabi, Sanskrit, Tamil, and Telugu in a single repository. Relatively newer approaches that emphasize catering to context (BERT, ELMo, etc.) have shown significant improvements, but require a large amount of resources to generate usable models. We release pre-trained embeddings generated using both contextual and non-contextual approaches. We also use MUSE and XLM to train cross-lingual embeddings for all pairs of the aforementioned languages. To show the efficacy of our embeddings, we evaluate our embedding models on XPOS, UPOS and NER tasks for all these languages. We release a total of 436 models using 8 different approaches. We hope they are useful for the resource-constrained Indian language NLP. The title of this paper refers to the famous novel “A Passage to India” by E.M. Forster, published initially in 1924.

2019

Utilizing Wordnets for Cognate Detection among Indian Languages
Diptesh Kanojia | Kevin Patel | Malhar Kulkarni | Pushpak Bhattacharyya | Gholamreza Haffari
Proceedings of the 10th Global Wordnet Conference

Automatic Cognate Detection (ACD) is a challenging task which has been utilized to help NLP applications like Machine Translation, Information Retrieval and Computational Phylogenetics. Unidentified cognate pairs can pose a challenge to these applications and result in a degradation of performance. In this paper, we detect cognate word pairs among ten Indian languages with Hindi and use deep learning methodologies to predict whether a word pair is cognate or not. We identify IndoWordnet as a potential resource to detect cognate word pairs based on orthographic similarity-based methods and train neural network models using the data obtained from it. We identify parallel corpora as another potential resource and perform the same experiments for them. We also validate the contribution of Wordnets through further experimentation and report improved performance of up to 26%. We discuss the nuances of cognate detection among closely related Indian languages and release the lists of detected cognates as a dataset. We also observe the behaviour of, to an extent, unrelated Indian language pairs and release the lists of detected cognates among them as well.
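
One common orthographic similarity measure is normalized edit distance. The sketch below is an assumption about what such a feature could look like, not the measure used in the paper; the function names are hypothetical, and comparing Unicode code points is only a rough proxy for words written in Indian scripts.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def orthographic_similarity(a, b):
    """Similarity in [0, 1]: 1 minus edit distance over the longer word's length."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


# German "hund" and English "hound" differ by a single inserted letter.
print(levenshtein("hund", "hound"))
print(orthographic_similarity("hund", "hound"))
```

A score like this could serve as one input feature to a classifier that decides whether a word pair is cognate; a threshold alone would also flag false friends, which look alike but differ in meaning.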

Utilizing Word Embeddings based Features for Phylogenetic Tree Generation of Sanskrit Texts
Diptesh Kanojia | Abhijeet Dubey | Malhar Kulkarni | Pushpak Bhattacharyya | Gholamreza Haffari
Proceedings of the 6th International Sanskrit Computational Linguistics Symposium

An Introduction to the Textual History Tool
Diptesh Kanojia | Malhar Kulkarni | Pushpak Bhattacharyya | Eivind Kahrs
Proceedings of the 6th International Sanskrit Computational Linguistics Symposium

2018

Semi-automatic WordNet Linking using Word Embeddings
Kevin Patel | Diptesh Kanojia | Pushpak Bhattacharyya
Proceedings of the 9th Global Wordnet Conference

Wordnets are rich lexico-semantic resources. Linked wordnets are extensions of wordnets, which link similar concepts in wordnets of different languages. Such resources are extremely useful in many Natural Language Processing (NLP) applications, primarily those based on knowledge-based approaches. In such approaches, these resources are considered as gold standard/oracle. Thus, it is crucial that these resources hold correct information. Thereby, they are created by human experts. However, manual maintenance of such resources is a tedious and costly affair. Thus techniques that can aid the experts are desirable. In this paper, we propose an approach to link wordnets. Given a synset of the source language, the approach returns a ranked list of potential candidate synsets in the target language from which the human expert can choose the correct one(s). Our technique is able to retrieve a winner synset in the top 10 ranked list for 60% of all synsets and 70% of noun synsets.
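
The ranked-list idea can be sketched as follows, assuming each synset is represented by a single vector and that source and target vectors live in a shared space. The abstract does not specify either point, so the representation, the function names and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(source_vec, candidates, top_k=10):
    """Return up to top_k (synset_id, score) pairs, most similar first.

    candidates: dict mapping a target-language synset id to its vector.
    """
    scored = [(synset_id, cosine(source_vec, vec))
              for synset_id, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy two-dimensional vectors standing in for synset embeddings.
targets = {"t1": [1.0, 0.0], "t2": [0.0, 1.0], "t3": [1.0, 1.0]}
for synset_id, score in rank_candidates([1.0, 0.1], targets, top_k=2):
    print(synset_id, round(score, 3))
```

A human expert would then pick the correct synset(s) from the returned list, so the useful metric is whether the right answer appears within the top K rather than at rank one.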

Hindi Wordnet for Language Teaching: Experiences and Lessons Learnt
Hanumant Redkar | Rajita Shukla | Sandhya Singh | Jaya Saraswati | Laxmi Kashyap | Diptesh Kanojia | Preethi Jyothi | Malhar Kulkarni | Pushpak Bhattacharyya
Proceedings of the 9th Global Wordnet Conference

This paper reports the work related to making Hindi Wordnet available as a digital resource for language learning and teaching, and the experiences and lessons that were learnt during the process. The language data of the Hindi Wordnet has been suitably modified and enhanced to make it into a language learning aid. This aid is based on modern pedagogical axioms and is aligned to the learning objectives of the syllabi of school education in India. To make it into a comprehensive language tool, grammatical information has also been encoded, as far as it can be marked on the lexical items. The delivery of information is multi-layered, multi-sensory and is available across multiple digital platforms. The front end has been designed to offer an eye-catching user-friendly interface which is suitable for learners starting from age six onward. Preliminary testing of the tool has been done and it has been modified as per the feedback that was received. Above all, the entire exercise has offered gainful insights into learning based on associative networks and how knowledge based on such networks can be made available to modern learners.

pyiwn: A Python based API to access Indian Language WordNets
Ritesh Panjwani | Diptesh Kanojia | Pushpak Bhattacharyya
Proceedings of the 9th Global Wordnet Conference

Indian language WordNets have their individual web-based browsing interfaces along with a common interface for IndoWordNet. These interfaces prove to be useful for language learners and in an educational domain; however, they do not provide the functionality of connecting to them and browsing their data through a lucid application programming interface (API). In this paper, we present our work on creating such an easy-to-use framework, which is bundled with the data for Indian language WordNets and provides core functionality similar to the NLTK WordNet interface in Python. Additionally, we use a pre-built speech synthesis system for the Hindi language and augment Hindi data with audio for words, glosses, and example sentences. We provide a detailed usage of our API and explain the functions for ease of the user. Also, we package the IndoWordNet data along with the source code and provide it openly for the purpose of research. We aim to provide all our work as an open source framework for further development.

Synthesizing Audio for Hindi WordNet
Diptesh Kanojia | Preethi Jyothi | Pushpak Bhattacharyya
Proceedings of the 9th Global Wordnet Conference

In this paper, we describe our work on the creation of a voice model using a speech synthesis system for the Hindi Language. We use pre-existing “voices” and publicly available speech corpora to create a “voice” using the Festival Speech Synthesis System (Black, 1997). Our contribution is two-fold: (1) We scrutinize multiple speech synthesis systems and provide an extensive report on the currently available state-of-the-art systems. We also develop voices using the existing implementations of the aforementioned systems, and (2) We use these voices to generate sample audio for randomly chosen words, manually evaluate the audio generated, and produce audio for all WordNet words using the winner voice model. We also produce audio for the Hindi WordNet glosses and example sentences. We describe our efforts to use pre-existing implementations for WaveNet, a model to generate raw audio using neural nets (Oord et al., 2016), and generate speech for Hindi. Our lexicographers perform a manual evaluation of the audio generated using multiple voices. A qualitative and quantitative analysis reveals that the voice model generated by us performs the best with an accuracy of 0.44.

Indian Language Wordnets and their Linkages with Princeton WordNet
Diptesh Kanojia | Kevin Patel | Pushpak Bhattacharyya
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Eyes are the Windows to the Soul: Predicting the Rating of Text Quality Using Gaze Behaviour
Sandeep Mathias | Diptesh Kanojia | Kevin Patel | Samarth Agrawal | Abhijit Mishra | Pushpak Bhattacharyya
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Predicting a reader’s rating of text quality is a challenging task that involves estimating different subjective aspects of the text, like structure, clarity, etc. Such subjective aspects are better handled using cognitive information. One such source of cognitive information is gaze behaviour. In this paper, we show that gaze behaviour does indeed help in effectively predicting the rating of text quality. To do this, we first model text quality as a function of three properties: organization, coherence and cohesion. Then, we demonstrate how capturing gaze behaviour helps in predicting each of these properties, and hence the overall quality, by reporting improvements obtained by adding gaze features to traditional textual features for score prediction. We also hypothesize that if a reader has fully understood the text, the corresponding gaze behaviour would give a better indication of the assigned rating, as opposed to partial understanding. Our experiments validate this hypothesis by showing greater agreement between the given rating and the predicted rating when the reader has a full understanding of the text.

2017

Is your Statement Purposeless? Predicting Computer Science Graduation Admission Acceptance based on Statement Of Purpose
Diptesh Kanojia | Nikhil Wani | Pushpak Bhattacharyya
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

2016

Harnessing Cognitive Features for Sarcasm Detection
Abhijit Mishra | Diptesh Kanojia | Seema Nagar | Kuntal Dey | Pushpak Bhattacharyya
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

That’ll Do Fine!: A Coarse Lexical Resource for English-Hindi MT, Using Polylingual Topic Models
Diptesh Kanojia | Aditya Joshi | Pushpak Bhattacharyya | Mark James Carman
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Parallel corpora are often injected with bilingual lexical resources for improved Indian language machine translation (MT). In the absence of such lexical resources, multilingual topic models have been used to create coarse lexical resources in the past, using a Cartesian product approach. Our results show that for morphologically rich languages like Hindi, the Cartesian product approach is detrimental for MT. We then present a novel ‘sentential’ approach to use this coarse lexical resource from a multilingual topic model. Our coarse lexical resource, when injected into a parallel corpus, outperforms a system trained using a parallel corpus and a good quality lexical resource. As demonstrated by the quality of our coarse lexical resource and its benefit to MT, we believe that our sentential approach to create such a resource will help MT for resource-constrained languages.

SlangNet: A WordNet like resource for English Slang
Shehzaad Dhuliawala | Diptesh Kanojia | Pushpak Bhattacharyya
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present a WordNet-like structured resource for slang words and neologisms on the internet. The dynamism of language is often an indication that current language technology tools, trained on today’s data, may not be able to process the language in the future. Our resource could be (1) used to augment the WordNet, and (2) used in several Natural Language Processing (NLP) applications which make use of noisy data on the internet, like Information Retrieval and Web Mining. Such a resource can also be used to distinguish slang word senses from conventional word senses. To stimulate similar innovations widely in the NLP community, we test the efficacy of our resource for detecting slang using standard bag-of-words Word Sense Disambiguation (WSD) algorithms (Lesk and Extended Lesk) for English data on the internet.

Leveraging Cognitive Features for Sentiment Analysis
Abhijit Mishra | Diptesh Kanojia | Seema Nagar | Kuntal Dey | Pushpak Bhattacharyya
Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning

Sophisticated Lexical Databases - Simplified Usage: Mobile Applications and Browser Plugins For Wordnets
Diptesh Kanojia | Raj Dabre | Pushpak Bhattacharyya
Proceedings of the 8th Global WordNet Conference (GWC)

India is a country with 22 officially recognized languages and 17 of these have WordNets, a crucial resource. Web browser based interfaces are available for these WordNets, but are not suited for mobile devices which deters people from effectively using this resource. We present our initial work on developing mobile applications and browser extensions to access WordNets for Indian Languages. Our contribution is two fold: (1) We develop mobile applications for the Android, iOS and Windows Phone OS platforms for Hindi, Marathi and Sanskrit WordNets which allow users to search for words and obtain more information along with their translations in English and other Indian languages. (2) We also develop browser extensions for English, Hindi, Marathi, and Sanskrit WordNets, for both Mozilla Firefox, and Google Chrome. We believe that such applications can be quite helpful in a classroom scenario, where students would be able to access the WordNets as dictionaries as well as lexical knowledge bases. This can help in overcoming the language barrier along with furthering language understanding.

A picture is worth a thousand words: Using OpenClipArt library for enriching IndoWordNet
Diptesh Kanojia | Shehzaad Dhuliawala | Pushpak Bhattacharyya
Proceedings of the 8th Global WordNet Conference (GWC)

WordNet has proved to be immensely useful for Word Sense Disambiguation, and thence Machine translation, Information Retrieval and Question Answering. It can also be used as a dictionary for educational purposes. The semantic nature of concepts in a WordNet motivates one to try to express this meaning in a more visual way. In this paper, we describe our work of enriching IndoWordNet with image acquisitions from the OpenClipArt library. We describe an approach used to enrich WordNets for eighteen Indian languages. Our contribution is three fold: (1) We develop a system, which, given a synset in English, finds an appropriate image for the synset. The system uses the OpenclipArt library (OCAL) to retrieve images and ranks them. (2) After retrieving the images, we map the results along with the linkages between Princeton WordNet and Hindi WordNet, to link several synsets to corresponding images. We choose and sort top three images based on our ranking heuristic per synset. (3) We develop a tool that allows a lexicographer to manually evaluate these images. The top images are shown to a lexicographer by the evaluation tool for the task of choosing the best image representation. The lexicographer also selects the number of relevant images. Using our system, we obtain an Average Precision (P @ 3) score of 0.30.

Mapping it differently: A solution to the linking challenges
Meghna Singh | Rajita Shukla | Jaya Saraswati | Laxmi Kashyap | Diptesh Kanojia | Pushpak Bhattacharyya
Proceedings of the 8th Global WordNet Conference (GWC)

This paper reports the work of creating bilingual mappings in English for certain synsets of Hindi wordnet, the need for doing this, the methods adopted and the tools created for the task. Hindi wordnet, which forms the foundation for other Indian language wordnets, has been linked to the English WordNet. To maximize linkages, an important strategy of using direct and hypernymy linkages has been followed. However, the hypernymy linkages were found to be inadequate in certain cases and posed a challenge due to sense granularity of language. Thus, the idea of creating bilingual mappings was adopted as a solution. A bilingual mapping means a linkage between a concept in two different languages, with the help of translation and/or transliteration. Such mappings retain meaningful representations, while capturing semantic similarity at the same time. This has also proven to be a great enhancement of Hindi wordnet and can be a crucial resource for multilingual applications in natural language processing, including machine translation and cross language information retrieval.

2015

TransChat: Cross-Lingual Instant Messaging for Indian Languages
Diptesh Kanojia | Shehzaad Dhuliawala | Abhijit Mishra | Naman Gupta | Pushpak Bhattacharyya
Proceedings of the 12th International Conference on Natural Language Processing

Using Multilingual Topic Models for Improved Alignment in English-Hindi MT
Diptesh Kanojia | Aditya Joshi | Pushpak Bhattacharyya | Mark James Carman
Proceedings of the 12th International Conference on Natural Language Processing

2014

Do not do processing, when you can look up: Towards a Discrimination Net for WSD
Diptesh Kanojia | Pushpak Bhattacharyya | Raj Dabre | Siddhartha Gunti | Manish Shrivastava
Proceedings of the Seventh Global Wordnet Conference

PaCMan : Parallel Corpus Management Workbench
Diptesh Kanojia | Manish Shrivastava | Raj Dabre | Pushpak Bhattacharyya
Proceedings of the 11th International Conference on Natural Language Processing

2013

More than meets the eye: Study of Human Cognition in Sense Annotation
Salil Joshi | Diptesh Kanojia | Pushpak Bhattacharyya
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2012

Discrimination-Net for Hindi
Diptesh Kanojia | Arindam Chatterjee | Salil Joshi | Pushpak Bhattacharyya
Proceedings of COLING 2012: Demonstration Papers