Pushpak Bhattacharyya

Also published as: Pushpak Bhattacharya


2021

Invited Presentation
Pushpak Bhattacharyya
Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021)

AI, now and in the future, will have to grapple continuously with the problem of low resources. AI will increasingly be ML intensive, but ML needs data, often with annotation, and annotation is costly. Over the years, through work on multiple problems, we have developed insight into how to do language processing in low resource settings. The following six methods, individually and in combination, seem to be the way forward: 1) Artificially augment resources (e.g., subwords); 2) Cooperative NLP (e.g., pivot in MT); 3) Linguistic embellishment (e.g., factor based MT, source reordering); 4) Joint Modeling (e.g., Coref and NER, Sentiment and Emotion: each task helping the other to either boost accuracy or reduce resource requirement); 5) Multimodality (e.g., eye tracking based NLP, also picture+text+speech based Sentiment Analysis); 6) Cross Lingual Embedding (e.g., embedding from multiple languages helping MT, close to 2 above). The present talk will focus on low resource machine translation. We describe the use of techniques from the above list and bring home the seriousness and methodology of doing Machine Translation in low resource settings.

Proceedings of the 8th Workshop on Asian Translation (WAT2021)
Toshiaki Nakazawa | Hideki Nakayama | Isao Goto | Hideya Mino | Chenchen Ding | Raj Dabre | Anoop Kunchukuttan | Shohei Higashiyama | Hiroshi Manabe | Win Pa Pa | Shantipriya Parida | Ondřej Bojar | Chenhui Chu | Akiko Eriguchi | Kaori Abe | Yusuke Oda | Katsuhito Sudoh | Sadao Kurohashi | Pushpak Bhattacharyya
Proceedings of the 8th Workshop on Asian Translation (WAT2021)

Language Relatedness and Lexical Closeness can help Improve Multilingual NMT: IITBombay@MultiIndicNMT WAT2021
Jyotsana Khatri | Nikhil Saini | Pushpak Bhattacharyya
Proceedings of the 8th Workshop on Asian Translation (WAT2021)

Multilingual Neural Machine Translation has achieved remarkable performance by training a single translation model for multiple languages. This paper describes our submission (Team ID: CFILT-IITB) for the MultiIndicMT: An Indic Language Multilingual Task at WAT 2021. We train multilingual NMT systems by sharing encoder and decoder parameters with language embedding associated with each token in both encoder and decoder. Furthermore, we demonstrate the use of transliteration (script conversion) for Indic languages in reducing the lexical gap for training a multilingual NMT system. Further, we show improvement in performance by training a multilingual NMT system using languages of the same family, i.e., related languages.

Multilingual Machine Translation Systems at WAT 2021: One-to-Many and Many-to-One Transformer based NMT
Shivam Mhaskar | Aditya Jain | Aakash Banerjee | Pushpak Bhattacharyya
Proceedings of the 8th Workshop on Asian Translation (WAT2021)

In this paper, we present the details of the systems that we have submitted for the WAT 2021 MultiIndicMT: An Indic Language Multilingual Task. We have submitted two separate multilingual NMT models: one for English to 10 Indic languages and another for 10 Indic languages to English. We discuss the implementation details of two separate multilingual NMT approaches, namely one-to-many and many-to-one, that make use of a shared decoder and a shared encoder, respectively. From our experiments, we observe that the multilingual NMT systems outperform the bilingual baseline MT systems for each of the language pairs under consideration.

IITP-MT at WAT2021: Indic-English Multilingual Neural Machine Translation using Romanized Vocabulary
Ramakrishna Appicharla | Kamal Kumar Gupta | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 8th Workshop on Asian Translation (WAT2021)

This paper describes the systems submitted to WAT 2021 MultiIndicMT shared task by IITP-MT team. We submit two multilingual Neural Machine Translation (NMT) systems (Indic-to-English and English-to-Indic). We romanize all Indic data and create subword vocabulary which is shared between all Indic languages. We use back-translation approach to generate synthetic data which is appended to parallel corpus and used to train our models. The models are evaluated using BLEU, RIBES and AMFM scores with Indic-to-English model achieving 40.08 BLEU for Hindi-English pair and English-to-Indic model achieving 34.48 BLEU for English-Hindi pair. However, we observe that the shared romanized subword vocabulary is not helping English-to-Indic model at the time of generation, leading it to produce poor quality translations for Tamil, Telugu and Malayalam to English pairs with BLEU score of 8.51, 6.25 and 3.79 respectively.

Role of Language Relatedness in Multilingual Fine-tuning of Language Models: A Case Study in Indo-Aryan Languages
Tejas Dhamecha | Rudra Murthy | Samarth Bharadwaj | Karthik Sankaranarayanan | Pushpak Bhattacharyya
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We explore the impact of leveraging the relatedness of languages that belong to the same family in NLP models using multilingual fine-tuning. We hypothesize and validate that multilingual fine-tuning of pre-trained language models can yield better performance on downstream NLP applications, compared to models fine-tuned on individual languages. A first of its kind detailed study is presented to track performance change as languages are added to a base language in a graded and greedy (in the sense of best boost of performance) manner, which reveals that careful selection of a subset of related languages can improve performance significantly more than utilizing all related languages. The Indo-Aryan (IA) language family is chosen for the study, the exact languages being Bengali, Gujarati, Hindi, Marathi, Oriya, Punjabi and Urdu. The script barrier is crossed by simple rule-based transliteration of the text of all languages to Devanagari. Experiments are performed on mBERT, IndicBERT, MuRIL and two RoBERTa-based LMs, the last two being pre-trained by us. Low resource languages, such as Oriya and Punjabi, are found to be the largest beneficiaries of multilingual fine-tuning. Textual Entailment, Entity Classification, Section Title Prediction, tasks of IndicGLUE and POS tagging form our test bed. Compared to monolingual fine-tuning, we get relative performance improvement of up to 150% in the downstream tasks. The surprise take-away is that for any language there is a particular combination of other languages which yields the best performance, and any additional language is in fact detrimental.
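
The rule-based script conversion mentioned in this abstract can be illustrated with a minimal sketch. It assumes a Unicode block-offset mapping, which works because the major Brahmi-derived scripts occupy parallel 128-code-point blocks. This is not necessarily the rule set the authors used; the function name `to_devanagari` and the script table are hypothetical. Urdu, written in Perso-Arabic script, cannot be handled by offsets and would need a separate mapping.

```python
# Start code points of the Unicode blocks for some Brahmi-derived scripts.
# Each block is 128 code points wide and laid out in parallel with Devanagari.
SCRIPT_BLOCK_START = {
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,  # used for Punjabi
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
}
DEVANAGARI_START = 0x0900
BLOCK_SIZE = 0x80


def to_devanagari(text: str, script: str) -> str:
    """Shift every character of `script` into the Devanagari block.

    Assumption: a plain offset is good enough for illustration. A few
    script-specific characters have no exact Devanagari counterpart, so a
    production system would need extra rules on top of this.
    """
    start = SCRIPT_BLOCK_START[script]
    converted = []
    for ch in text:
        code_point = ord(ch)
        if start <= code_point < start + BLOCK_SIZE:
            converted.append(chr(code_point - start + DEVANAGARI_START))
        else:
            # Leave spaces, digits, punctuation and other scripts untouched.
            converted.append(ch)
    return "".join(converted)
```

Mapping everything to one script lets a shared subword vocabulary expose cognates across related languages, which is the lexical overlap the abstract relies on.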

“So You Think You’re Funny?”: Rating the Humour Quotient in Standup Comedy
Anirudh Mittal | Pranav Jeevan P | Prerak Gandhi | Diptesh Kanojia | Pushpak Bhattacharyya
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Computational Humour (CH) has attracted the interest of Natural Language Processing and Computational Linguistics communities. Creating datasets for automatic measurement of humour quotient is difficult due to multiple possible interpretations of the content. In this work, we create a multi-modal humour-annotated dataset (~40 hours) using stand-up comedy clips. We devise a novel scoring mechanism to annotate the training data with a humour quotient score using the audience’s laughter. The normalized duration (laughter duration divided by the clip duration) of laughter in each clip is used to compute this humour coefficient score on a five-point scale (0-4). This method of scoring is validated by comparing with manually annotated scores, wherein a quadratic weighted kappa of 0.6 is obtained. We use this dataset to train a model that provides a ‘funniness’ score, on a five-point scale, given the audio and its corresponding text. We compare various neural language models for the task of humour-rating and achieve an accuracy of 0.813 in terms of Quadratic Weighted Kappa (QWK). Our ‘Open Mic’ dataset is released for further research along with the code.
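
The two quantities this abstract relies on, the normalized laughter duration and the Quadratic Weighted Kappa, can be sketched as follows. The function names are hypothetical, and the thresholds that turn a normalized duration into a 0-4 score are not given in the abstract, so they are deliberately left out.

```python
def normalized_laughter(laughter_seconds: float, clip_seconds: float) -> float:
    """Laughter duration divided by clip duration, as described in the abstract."""
    return laughter_seconds / clip_seconds


def quadratic_weighted_kappa(rater_a, rater_b, num_classes=5):
    """Quadratic weighted kappa between two lists of integer ratings in [0, num_classes)."""
    n = len(rater_a)
    # Observed confusion matrix between the two raters.
    observed = [[0] * num_classes for _ in range(num_classes)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_classes)) for j in range(num_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            # Quadratic penalty: disagreements further apart cost more.
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count if the two raters were independent.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n
    if weighted_expected == 0.0:
        # Both raters used a single identical class throughout.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected
```

A value of 1 means perfect agreement, 0 means chance-level agreement, and negative values mean systematic disagreement.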

IITP-MT at CALCS2021: English to Hinglish Neural Machine Translation using Unsupervised Synthetic Code-Mixed Parallel Corpus
Ramakrishna Appicharla | Kamal Kumar Gupta | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching

This paper describes the system submitted by IITP-MT team to Computational Approaches to Linguistic Code-Switching (CALCS 2021) shared task on MT for English→Hinglish. We submit a neural machine translation (NMT) system which is trained on the synthetic code-mixed (cm) English-Hinglish parallel corpus. We propose an approach to create code-mixed parallel corpus from a clean parallel corpus in an unsupervised manner. It is an alignment based approach and we do not use any linguistic resources for explicitly marking any token for code-switching. We also train NMT model on the gold corpus provided by the workshop organizers augmented with the generated synthetic code-mixed parallel corpus. The model trained over the generated synthetic cm data achieves 10.09 BLEU points over the given test set.

FrameNet-assisted Noun Compound Interpretation
Girishkumar Ponkiya | Diptesh Kanojia | Pushpak Bhattacharyya | Girish Palshikar
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Modelling Context Emotions using Multi-task Learning for Emotion Controlled Dialog Generation
Deeksha Varshney | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

A recent topic of research in natural language generation has been the development of automatic response generation modules that can automatically respond to a user’s utterance in an empathetic manner. Previous research has tackled this task using neural generative methods by augmenting emotion classes with the input sequences. However, the outputs by these models may be inconsistent. We employ multi-task learning to predict the emotion label and to generate a viable response for a given utterance using a common encoder with multiple decoders. Our proposed encoder-decoder model consists of a self-attention based encoder and a decoder with dot product attention mechanism to generate response with a specified emotion. We use the focal loss to handle imbalanced data distribution, and utilize the consistency loss to allow coherent decoding by the decoders. Human evaluation reveals that our model produces more emotionally pertinent responses. In addition, our model outperforms multiple strong baselines on automatic evaluation measures such as F1 and BLEU scores, thus resulting in more fluent and adequate responses.
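
For readers unfamiliar with the focal loss mentioned in this abstract, here is a minimal sketch of its standard form for a single example. The abstract does not state the focusing parameter or any class weighting, so `gamma=2.0` is only an illustrative default, and the function name is hypothetical.

```python
import math


def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one example.

    probs  -- predicted class probabilities (should sum to 1)
    target -- index of the gold class
    gamma  -- focusing parameter; gamma = 0 recovers plain cross-entropy
    """
    p_t = probs[target]
    # The (1 - p_t)^gamma factor shrinks the loss of well-classified examples,
    # so rare, hard classes contribute relatively more to the gradient.
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

With `gamma=0` the modulating factor is 1 and the function reduces to the usual negative log-likelihood, which makes the effect of the extra term easy to compare.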

Cognition-aware Cognate Detection
Diptesh Kanojia | Prashant Sharma | Sayali Ghodekar | Pushpak Bhattacharyya | Gholamreza Haffari | Malhar Kulkarni
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Automatic detection of cognates helps downstream NLP tasks of Machine Translation, Cross-lingual Information Retrieval, Computational Phylogenetics and Cross-lingual Named Entity Recognition. Previous approaches for the task of cognate detection use orthographic, phonetic and semantic similarity based feature sets. In this paper, we propose a novel method for enriching the feature sets with cognitive features extracted from human readers’ gaze behaviour. We collect gaze behaviour data for a small sample of cognates and show that extracted cognitive features help the task of cognate detection. However, gaze data collection and annotation is a costly task. We use the collected gaze behaviour data to predict cognitive features for a larger sample and show that predicted cognitive features also significantly improve the task performance. We report improvements of 10% with the collected gaze features, and 12% using the predicted gaze features, over the previously proposed approaches. Furthermore, we release the collected gaze behaviour data along with our code and cross-lingual models.

Disfluency Correction using Unsupervised and Semi-supervised Learning
Nikhil Saini | Drumil Trivedi | Shreya Khare | Tejas Dhamecha | Preethi Jyothi | Samarth Bharadwaj | Pushpak Bhattacharyya
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Spoken language is different from the written language in its style and structure. Disfluencies that appear in transcriptions from speech recognition systems generally hamper the performance of downstream NLP tasks. Thus, a disfluency correction system that converts disfluent to fluent text is of great value. This paper introduces a disfluency correction model that translates disfluent to fluent text by drawing inspiration from recent encoder-decoder unsupervised style-transfer models for text. We also show considerable benefits in performance when utilizing a small sample of 500 parallel disfluent-fluent sentences in a semi-supervised way. Our unsupervised approach achieves a BLEU score of 79.39 on the Switchboard corpus test set, with further improvement to a BLEU score of 85.28 with semi-supervision. Both are comparable to two competitive fully-supervised models.

SEPRG: Sentiment aware Emotion controlled Personalized Response Generation
Mauajama Firdaus | Umang Jain | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 14th International Conference on Natural Language Generation

Social chatbots have gained immense popularity, and their appeal lies not just in their capacity to respond to the diverse requests from users, but also in the ability to develop an emotional connection with users. To further develop and promote social chatbots, we need to concentrate on increasing user interaction and take into account both the intellectual and emotional quotient in the conversational agents. Therefore, in this work, we propose the task of sentiment aware emotion controlled personalized dialogue generation, giving the machine the capability to respond emotionally and in accordance with the persona of the user. As sentiment and emotions are highly correlated, we use the sentiment knowledge of the previous utterance to generate the correct emotional response in accordance with the user persona. We design a Transformer based Dialogue Generation framework that generates responses that are sensitive to the emotion of the user and correspond to the persona and sentiment as well. Moreover, the persona information is encoded by a different Transformer encoder and, along with the dialogue history, is fed to the decoder for generating responses. We annotate the PersonaChat dataset with sentiment information to improve the response quality. Experimental results on the PersonaChat dataset show that the proposed framework significantly outperforms the existing baselines, thereby generating personalized emotional responses in accordance with the sentiment that provide better emotional connection and user satisfaction, as desired in a social chatbot.

BERT based Adverse Drug Effect Tweet Classification
Tanay Kayastha | Pranjal Gupta | Pushpak Bhattacharyya
Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task

This paper describes models developed for the Social Media Mining for Health (SMM4H) 2021 shared tasks. Our team participated in the first subtask that classifies tweets with Adverse Drug Effect (ADE) mentions. Our best performing model utilizes BERTweet followed by a single layer of BiLSTM. The system achieves an F-score of 0.45 on the test set without the use of any auxiliary resources such as Part-of-Speech tags, dependency tags, or knowledge from medical dictionaries.

CFILT IIT Bombay@LT-EDI-EACL2021: Hope Speech Detection for Equality, Diversity, and Inclusion using Multilingual Representation from Transformers
Pankaj Singh | Prince Kumar | Pushpak Bhattacharyya
Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion

With the internet becoming part and parcel of our lives, engagement in social media has increased a lot. Identifying and eliminating offensive content from social media has become of utmost priority to prevent any kind of violence. However, detecting encouraging, supportive and positive content is equally important to prevent misuse of censorship targeted to attack freedom of speech. This paper presents our system for the shared task Hope Speech Detection for Equality, Diversity, and Inclusion at LT-EDI, EACL 2021. The data for this shared task is provided in English, Tamil, and Malayalam which was collected from YouTube comments. It is a multiclass classification problem where each data instance is categorized into one of the three classes: ‘Hope speech’, ‘Not hope speech’, and ‘Not in intended language’. We propose a system that employs multilingual transformer models to obtain the representation of text and classifies it into one of the three classes. We explored the use of multilingual models trained specifically for Indian languages along with generic multilingual models. Our system was ranked 2nd for English, 2nd for Malayalam, and 7th for the Tamil language in the final leader board published by organizers and obtained a weighted F1-score of 0.92, 0.84, 0.55 respectively on the hidden test dataset used for the competition. We have made our system publicly available at GitHub.

How low is too low? A monolingual take on lemmatisation in Indian languages
Kumar Saunack | Kumar Saurav | Pushpak Bhattacharyya
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Lemmatization aims to reduce the sparse data problem by relating the inflected forms of a word to its dictionary form. Most prior work on ML based lemmatization has focused on high resource languages, where data sets (word forms) are readily available. For languages which have no linguistic work available, especially on morphology, or in languages where the computational realization of linguistic rules is complex and cumbersome, machine learning based lemmatizers are the way to go. In this paper, we devote our attention to lemmatisation for low resource, morphologically rich scheduled Indian languages using neural methods. Here, low resource means only a small number of word forms are available. We perform tests to analyse the variance in monolingual models’ performance on varying the corpus size and contextual morphological tag data for training. We show that monolingual approaches with data augmentation can give competitive accuracy even in the low resource setting, which augurs well for NLP in low resource settings.

Towards Sentiment and Emotion aided Multi-modal Speech Act Classification in Twitter
Tulika Saha | Apoorva Upadhyaya | Sriparna Saha | Pushpak Bhattacharyya
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Speech Act Classification, which determines the communicative intent of an utterance, has been investigated widely over the years as a standalone task. This holds true for discussion in any fora, including social media platforms such as Twitter. But the emotional state of the tweeter, which has a considerable effect on the communication, has not received the attention it deserves. Closely related to emotion is sentiment, and understanding of one helps understand the other. In this work, we first create a new multi-modal, emotion-TA (‘TA’ means tweet act, i.e., speech act in Twitter) dataset called EmoTA, collected from an open-source Twitter dataset. We propose a Dyadic Attention Mechanism (DAM) based multi-modal, adversarial multi-tasking framework. DAM incorporates intra-modal and inter-modal attention to fuse multiple modalities and learns generalized features across all the tasks. Experimental results indicate that the proposed framework boosts the performance of the primary task, i.e., TA classification (TAC), by benefitting from the two secondary tasks, i.e., Sentiment and Emotion Analysis, compared to its uni-modal and single task TAC (tweet act classification) variants.

Investigating Active Learning in Interactive Neural Machine Translation
Kamal Gupta | Dhanvanth Boppana | Rejwanul Haque | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of Machine Translation Summit XVIII: Research Track

Interactive-predictive translation is a collaborative iterative process, where human translators produce translations with the help of machine translation (MT) systems interactively. Various sampling techniques in active learning (AL) exist to update the neural MT (NMT) model in the interactive-predictive scenario. In this paper, we explore term based (named entity count (NEC)) and quality based (quality estimation (QE), sentence similarity (Sim)) sampling techniques, which are used to find the ideal candidates from the incoming data for human supervision and for updating the MT model’s weights. We carried out experiments with three language pairs, viz. German-English, Spanish-English and Hindi-English. Our proposed sampling technique yields 1.82, 0.77 and 0.81 BLEU points improvements for German-English, Spanish-English and Hindi-English, respectively, over the random sampling based baseline. It also improves the present state-of-the-art by 0.35 and 0.12 BLEU points for German-English and Spanish-English, respectively. Human editing effort in terms of number-of-words-changed also improves by 5 and 4 points for German-English and Spanish-English, respectively, compared to the state-of-the-art.
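
The sampling step described in this abstract, ranking incoming sentences by a score and sending the top candidates for human supervision, can be sketched generically. Everything here is an assumption made for illustration: the function names are hypothetical, the capitalisation heuristic is only a crude stand-in for a real named entity counter, and the abstract does not say whether high or low scores are preferred for each criterion.

```python
def select_for_supervision(sentences, score_fn, budget, highest_first=True):
    """Return the `budget` sentences that rank best under `score_fn`."""
    ranked = sorted(sentences, key=score_fn, reverse=highest_first)
    return ranked[:budget]


def capitalised_token_count(sentence):
    """Crude proxy for named entity count (NEC).

    Counts capitalised tokens after the first one; a real system would use
    an NER tagger instead of this heuristic.
    """
    tokens = sentence.split()
    return sum(1 for token in tokens[1:] if token[:1].isupper())


incoming = [
    "The cat sat.",
    "Angela Merkel met Barack Obama in Berlin.",
    "He visited Paris.",
]
# Pick the single sentence with the most (approximate) named entities.
chosen = select_for_supervision(incoming, capitalised_token_count, budget=1)
```

A quality based criterion would plug in the same way: pass a function that returns a quality estimation or similarity score and set `highest_first` according to which end of the scale should reach the human translator.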

Crosslingual Embeddings are Essential in UNMT for distant languages: An English to IndoAryan Case Study
Tamali Banerjee | Rudra V Murthy | Pushpak Bhattacharya
Proceedings of Machine Translation Summit XVIII: Research Track

Recent advances in Unsupervised Neural Machine Translation (UNMT) have minimized the gap between supervised and unsupervised machine translation performance for closely related language-pairs. However, the situation is very different for distant language pairs. Lack of overlap in lexicon and low syntactic similarity, such as between English and IndoAryan languages, leads to poor translation quality in existing UNMT systems. In this paper, we show that initialising the embedding layer of UNMT models with cross-lingual embeddings leads to significant BLEU score improvements over existing UNMT models where the embedding layer weights are randomly initialized. Further, freezing the embedding layer weights leads to better gains compared to updating the embedding layer weights during training. We experimented using Masked Sequence to Sequence (MASS) and Denoising Autoencoder (DAE) UNMT approaches for three distant language pairs. The proposed cross-lingual embedding initialization yields BLEU score improvement of as much as ten times over the baseline for English-Hindi, English-Bengali and English-Gujarati. Our analysis shows that initialising the embedding layer with static cross-lingual embedding mapping is essential for training of UNMT models for distant language-pairs.

Neural Machine Translation in Low-Resource Setting: a Case Study in English-Marathi Pair
Aakash Banerjee | Aditya Jain | Shivam Mhaskar | Sourabh Dattatray Deoghare | Aman Sehgal | Pushpak Bhattacharya
Proceedings of Machine Translation Summit XVIII: Research Track

In this paper, we explore different techniques of overcoming the challenges of low-resource in Neural Machine Translation (NMT), specifically focusing on the case of English-Marathi NMT. NMT systems require a large amount of parallel corpora to obtain good quality translations. We try to mitigate the low-resource problem by augmenting parallel corpora or by using transfer learning. Techniques such as Phrase Table Injection (PTI), back-translation and mixing of language corpora are used for enhancing the parallel data, whereas pivoting and multilingual embeddings are used to leverage transfer learning. For pivoting, Hindi comes in as the assisting language for English-Marathi translation. Compared to the baseline transformer model, a significant improvement trend in BLEU score is observed across various techniques. We have done extensive manual, automatic and qualitative evaluation of our systems. Since the trend in Machine Translation (MT) today is post-editing and measuring of Human Effort Reduction (HER), we have given our preliminary observations on a Translation Edit Rate (TER) vs. BLEU score study, where TER is regarded as a measure of HER.

Scrambled Translation Problem: A Problem of Denoising UNMT
Tamali Banerjee | Rudra V Murthy | Pushpak Bhattacharya
Proceedings of Machine Translation Summit XVIII: Research Track

In this paper, we identify an interesting kind of error in the output of Unsupervised Neural Machine Translation (UNMT) systems like Undreamt. We refer to this error type as the Scrambled Translation problem. We observe that UNMT models which use word shuffle noise (as in the case of Undreamt) can generate correct words, but fail to stitch them together to form phrases. As a result, words of the translated sentence look scrambled, resulting in decreased BLEU. We hypothesise that the reason behind the scrambled translation problem is the ‘shuffling noise’ which is introduced in every input sentence as a denoising strategy. To test our hypothesis, we experiment by retraining UNMT models with a simple retraining strategy. We stop the training of the Denoising UNMT model after a pre-decided number of iterations and resume the training for the remaining iterations (the number of which is also pre-decided) using the original sentence as input without adding any noise. Our proposed solution achieves significant performance improvement over UNMT models that train conventionally. We demonstrate these performance gains on four language pairs, viz. English-French, English-German, English-Spanish and Hindi-Punjabi. Our qualitative and quantitative analysis shows that the retraining strategy helps achieve better alignment, as observed by attention heatmap, and better phrasal translation, leading to statistically significant improvement in BLEU scores.
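
The retraining strategy in this abstract amounts to a two-phase input schedule: shuffled input for a pre-decided number of iterations, then clean input. The sketch below assumes one common formulation of word-shuffle noise, a local shuffle in which no token moves more than `max_shift` positions. Undreamt's own noise function may differ, and the names `shuffle_noise`, `denoising_input` and `noise_steps` are hypothetical.

```python
import random


def shuffle_noise(tokens, max_shift=3, rng=random):
    """Locally shuffle tokens so that none moves more than `max_shift` positions.

    Each token gets the sort key (index + uniform noise in [0, max_shift + 1));
    sorting by that key perturbs word order only within a small window.
    """
    keys = [i + rng.uniform(0, max_shift + 1) for i in range(len(tokens))]
    ordered = sorted(zip(keys, tokens), key=lambda pair: pair[0])
    return [token for _, token in ordered]


def denoising_input(tokens, step, noise_steps, max_shift=3, rng=random):
    """Encoder input for the denoising objective at training iteration `step`.

    Phase 1 (step < noise_steps): shuffled input, as in conventional training.
    Phase 2 (step >= noise_steps): the original sentence, with no noise added.
    """
    if step < noise_steps:
        return shuffle_noise(tokens, max_shift=max_shift, rng=rng)
    return list(tokens)
```

In a training loop, `denoising_input` would be called once per sentence per iteration, while the reconstruction target stays the original sentence in both phases.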

Evaluating the Performance of Back-translation for Low Resource English-Marathi Language Pair: CFILT-IITBombay @ LoResMT 2021
Aditya Jain | Shivam Mhaskar | Pushpak Bhattacharyya
Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021)

In this paper, we discuss the details of the various Machine Translation (MT) systems that we have submitted for the English-Marathi LoResMT task. As a part of this task, we have submitted three different Neural Machine Translation (NMT) systems: a Baseline English-Marathi system, a Baseline Marathi-English system, and an English-Marathi system that is based on the back-translation technique. We explore the performance of these NMT systems between English and Marathi, which form a low resource language pair due to the unavailability of sufficient parallel data. We also explore the performance of the back-translation technique when the back-translated data is obtained from NMT systems that are trained on a very small amount of data. From our experiments, we observe that the back-translation technique can help improve the MT quality over the baseline for the English-Marathi language pair.

2020

IIITBH-IITP@CL-SciSumm20, CL-LaySumm20, LongSumm20
Saichethan Reddy | Naveen Saini | Sriparna Saha | Pushpak Bhattacharyya
Proceedings of the First Workshop on Scholarly Document Processing

In this paper, we present the IIIT Bhagalpur and IIT Patna team’s effort to solve three shared tasks, namely CL-SciSumm 2020, CL-LaySumm 2020 and LongSumm 2020, at SDP 2020. The theme of these tasks is to generate medium-scale, lay and long summaries, respectively, for scientific articles. For the first two tasks, unsupervised systems are developed, while for the third one, we develop a supervised system. The performances of all the systems were evaluated on the datasets associated with the shared tasks in terms of the well-known ROUGE metric.

IITP-AI-NLP-ML@ CL-SciSumm 2020, CL-LaySumm 2020, LongSumm 2020
Santosh Kumar Mishra | Harshavardhan Kundarapu | Naveen Saini | Sriparna Saha | Pushpak Bhattacharyya
Proceedings of the First Workshop on Scholarly Document Processing

The publication rate of scientific literature is increasing rapidly, which makes it challenging for researchers to keep themselves updated with the new state-of-the-art. Scientific document summarization addresses this problem by summarizing the essential facts and findings of the document. In the current paper, we present the participation of the IITP-AI-NLP-ML team in three shared tasks, namely CL-SciSumm 2020, LaySumm 2020 and LongSumm 2020, which aim to generate medium, lay, and long summaries of scientific articles, respectively. To solve the CL-SciSumm 2020 and LongSumm 2020 tasks, three well-known clustering techniques are used, and then various sentence scoring functions, including textual entailment, are used to extract the sentences from each cluster for summary generation. For LaySumm 2020, an encoder-decoder based deep learning model has been utilized. Performances of our developed systems are evaluated in terms of ROUGE measures on the datasets associated with the shared task.

EL-BERT at SemEval-2020 Task 10: A Multi-Embedding Ensemble Based Approach for Emphasis Selection in Visual Media
Chandresh Kanani | Sriparna Saha | Pushpak Bhattacharyya
Proceedings of the Fourteenth Workshop on Semantic Evaluation

In visual media, text emphasis is the strengthening of words in a text to convey the intent of the author. Text emphasis in visual media is generally done by using different colors, backgrounds, or fonts for the text; it helps in conveying the actual meaning of the message to the readers. Emphasis selection is the task of choosing candidate words for emphasis; it helps in automatically designing posters and other media content with written text. If we consider only the text and do not know the intent, then there can be multiple valid emphasis selections. We propose the use of ensembles for emphasis selection to improve over single emphasis selection models. We show that the use of multi-embedding helps in enhancing the results for base models. To show the efficacy of the proposed approach, we have also compared our results with state-of-the-art models.
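
As an illustration of ensembling for emphasis selection, the sketch below averages per-word emphasis scores from several base models and picks the top-scoring words. The abstract does not specify how the ensemble combines its members, so simple averaging is an assumption, and the function names are hypothetical.

```python
def ensemble_emphasis(per_model_scores):
    """Average per-word emphasis scores across models.

    per_model_scores -- one list of scores per model, all of the same length
                        (one score per word of the input text).
    """
    num_models = len(per_model_scores)
    return [sum(word_scores) / num_models for word_scores in zip(*per_model_scores)]


def top_k_words(words, scores, k):
    """Return the k words with the highest emphasis scores, best first."""
    ranked = sorted(range(len(words)), key=lambda i: scores[i], reverse=True)
    return [words[i] for i in ranked[:k]]


words = ["never", "give", "up"]
model_a = [0.2, 0.8, 0.4]  # scores from a base model using one embedding
model_b = [0.4, 0.6, 0.8]  # scores from a base model using another embedding
combined = ensemble_emphasis([model_a, model_b])
emphasised = top_k_words(words, combined, k=2)
```

Averaging smooths out disagreements between base models, which matters here because several emphasis selections can be valid for the same text.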

IITP-AINLPML at SemEval-2020 Task 12: Offensive Tweet Identification and Target Categorization in a Multitask Environment
Soumitra Ghosh | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the Fourteenth Workshop on Semantic Evaluation

In this paper, we describe the participation of the IITP-AINLPML team in the SemEval-2020 Shared Task 12 on Offensive Language Identification and Target Categorization in English Twitter data. Our proposed model learns to extract textual features using a BiGRU-based deep neural network supported by a Hierarchical Attention architecture to focus on the most relevant areas in the text. We leverage the effectiveness of multitask learning while building our models for sub-tasks A and B. We do the necessary undersampling of the over-represented classes in sub-tasks A and C. During training, we consider a threshold of 0.5 as the separation margin between the instances belonging to classes OFF and NOT in sub-task A, and UNT and TIN in sub-task B. For sub-task C, the class corresponding to the maximum score among the given confidence scores of the classes (IND, GRP and OTH) is considered as the final label for an instance. Our proposed model obtains macro F1-scores of 90.95%, 55.69% and 63.88% in sub-tasks A, B and C, respectively.
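
The label assignment rules described in this abstract can be sketched as follows. The helper names are hypothetical, and two details are assumptions because the abstract does not state them: which class sits above the 0.5 threshold in each sub-task, and how a score of exactly 0.5 is treated.

```python
def binary_label(score, above, below, threshold=0.5):
    """Sub-tasks A and B: map one confidence score to one of two classes.

    `above` is returned when score >= threshold, `below` otherwise.
    Treating a score of exactly 0.5 as `above` is an assumption.
    """
    return above if score >= threshold else below


def argmax_label(class_scores):
    """Sub-task C: pick the class with the maximum confidence score."""
    return max(class_scores, key=class_scores.get)


label_a = binary_label(0.73, above="OFF", below="NOT")
label_c = argmax_label({"IND": 0.2, "GRP": 0.5, "OTH": 0.3})
```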

pdf bib
Knowledge Graph and Deep Neural Network for Extractive Text Summarization by Utilizing Triples
Amit Vhatkar | Pushpak Bhattacharyya | Kavi Arya
Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation

In our research work, we represent the content of the sentence in graphical form after extracting triples from the sentences. In this paper, we will discuss novel methods to generate an extractive summary by scoring the triples. Our work has also touched upon sequence-to-sequence encoding of the content of the sentence, to classify it as a summary or a non-summary sentence. Our findings help to decide the nature of the sentences forming the summary and the length of the system generated summary as compared to the length of the reference summary.

pdf bib
Can Neural Networks Automatically Score Essay Traits?
Sandeep Mathias | Pushpak Bhattacharyya
Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications

Essay traits are attributes of an essay that can help explain how well written (or badly written) the essay is. Examples of traits include Content, Organization, Language, Sentence Fluency, Word Choice, etc. A lot of research in the last decade has dealt with automatic holistic essay scoring - where a machine rates an essay and gives a score for the essay. However, writers need feedback, especially if they want to improve their writing - which is why trait-scoring is important. In this paper, we show how a deep-learning based system can outperform feature-based machine learning systems, as well as a string kernel system in scoring essay traits.

pdf bib
Sentiment and Emotion help Sarcasm? A Multi-task Learning Framework for Multi-Modal Sarcasm, Sentiment and Emotion Analysis
Dushyant Singh Chauhan | Dhanush S R | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

In this paper, we hypothesize that sarcasm is closely related to sentiment and emotion, and thereby propose a multi-task deep learning framework to solve all these three problems simultaneously in a multi-modal conversational scenario. We, at first, manually annotate the recently released multi-modal MUStARD sarcasm dataset with sentiment and emotion classes, both implicit and explicit. For multi-tasking, we propose two attention mechanisms, viz. Inter-segment Inter-modal Attention (Ie-Attention) and Intra-segment Inter-modal Attention (Ia-Attention). The main motivation of Ie-Attention is to learn the relationship between the different segments of the sentence across the modalities. In contrast, Ia-Attention focuses within the same segment of the sentence across the modalities. Finally, representations from both the attentions are concatenated and shared across the five classes (i.e., sarcasm, implicit sentiment, explicit sentiment, implicit emotion, explicit emotion) for multi-tasking. Experimental results on the extended version of the MUStARD dataset show the efficacy of our proposed approach for sarcasm detection over the existing state-of-the-art systems. The evaluation also shows that the proposed multi-task framework yields better performance for the primary task, i.e., sarcasm detection, with the help of two secondary tasks, emotion and sentiment analysis.

pdf bib
Towards Emotion-aided Multi-modal Dialogue Act Classification
Tulika Saha | Aditya Patra | Sriparna Saha | Pushpak Bhattacharyya
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

The task of Dialogue Act Classification (DAC) that purports to capture communicative intent has been studied extensively. But these studies limit themselves to text. Non-verbal features (change of tone, facial expressions etc.) can provide cues to identify DAs, thus stressing the benefit of incorporating multi-modal inputs in the task. Also, the emotional state of the speaker has a substantial effect on the choice of the dialogue act, since conversations are often influenced by emotions. Hence, the effect of emotion on automatic identification of DAs also needs to be studied. In this work, we address the role of both multi-modality and emotion recognition (ER) in DAC. DAC and ER help each other by way of multi-task learning. One of the major contributions of this work is a new multi-modal, emotion-aware dialogue act dataset called EMOTyDA, collected from open-sourced dialogue datasets. To demonstrate the utility of EMOTyDA, we build an attention based (self, inter-modal, inter-task) multi-modal, multi-task Deep Neural Network (DNN) for joint learning of DAs and emotions. We show empirically that multi-modality and multi-tasking achieve better performance of DAC compared to uni-modal and single task DAC variants.

pdf bib
Modelling Source- and Target- Language Syntactic Information as Conditional Context in Interactive Neural Machine Translation
Kamal Kumar Gupta | Rejwanul Haque | Asif Ekbal | Pushpak Bhattacharyya | Andy Way
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

In interactive machine translation (MT), human translators correct errors in automatic translations in collaboration with the MT systems, which is seen as an effective way to improve the productivity gain in translation. In this study, we model source-language syntactic constituency parse and target-language syntactic descriptions in the form of supertags as conditional context for interactive prediction in neural MT (NMT). We found that the supertags significantly improve productivity gain in translation in interactive-predictive NMT (INMT), while syntactic parsing was found to be somewhat effective in reducing human effort in translation. Furthermore, when we model this source- and target-language syntactic information together as the conditional context, both types complement each other and our fully syntax-informed INMT model statistically significantly reduces human effort in a French–to–English translation task, achieving 4.30 points absolute (corresponding to 9.18% relative) improvement in terms of word prediction accuracy (WPA) and 4.84 points absolute (corresponding to 9.01% relative) reduction in terms of word stroke ratio (WSR) over the baseline.

pdf bib
All-in-One: A Deep Attentive Multi-task Learning Framework for Humour, Sarcasm, Offensive, Motivation, and Sentiment on Memes
Dushyant Singh Chauhan | Dhanush S R | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

In this paper, we aim at learning the relationships and similarities of a variety of tasks, such as humour detection, sarcasm detection, offensive content detection, motivational content detection and sentiment analysis on a somewhat complicated form of information, i.e., memes. We propose a multi-task, multi-modal deep learning framework to solve multiple tasks simultaneously. For multi-tasking, we propose two attention-like mechanisms viz., Inter-task Relationship Module (iTRM) and Inter-class Relationship Module (iCRM). The main motivation of iTRM is to learn the relationship between the tasks to realize how they help each other. In contrast, iCRM develops relations between the different classes of tasks. Finally, representations from both the attentions are concatenated and shared across the five tasks (i.e., humour, sarcasm, offensive, motivational, and sentiment) for multi-tasking. We use the recently released dataset in the Memotion Analysis task @ SemEval 2020, which consists of memes annotated for the classes as mentioned above. Empirical results on Memotion dataset show the efficacy of our proposed approach over the existing state-of-the-art systems (Baseline and SemEval 2020 winner). The evaluation also indicates that the proposed multi-task framework yields better performance over the single-task learning.

pdf bib
Unsupervised Aspect-Level Sentiment Controllable Style Transfer
Mukuntha Narayanan Sundararaman | Zishan Ahmad | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

Unsupervised style transfer in text has previously been explored through the sentiment transfer task. The task entails inverting the overall sentiment polarity in a given input sentence, while preserving its content. From the Aspect-Based Sentiment Analysis (ABSA) task, we know that multiple sentiment polarities can often be present together in a sentence with multiple aspects. In this paper, the task of aspect-level sentiment controllable style transfer is introduced, where each of the aspect-level sentiments can individually be controlled at the output. To achieve this goal, a BERT-based encoder-decoder architecture with saliency weighted polarity injection is proposed, with unsupervised training strategies, such as ABSA masked-language-modelling. Through both automatic and manual evaluation, we show that the system is successful in controlling aspect-level sentiments.

pdf bib
Happy Are Those Who Grade without Seeing: A Multi-Task Learning Approach to Grade Essays Using Gaze Behaviour
Sandeep Mathias | Rudra Murthy | Diptesh Kanojia | Abhijit Mishra | Pushpak Bhattacharyya
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

The gaze behaviour of a reader is helpful in solving several NLP tasks such as automatic essay grading. However, collecting gaze behaviour from readers is costly in terms of time and money. In this paper, we propose a way to improve automatic essay grading using gaze behaviour, which is learnt at run time using a multi-task learning framework. To demonstrate the efficacy of this multi-task learning based approach to automatic essay grading, we collect gaze behaviour for 48 essays across 4 essay sets, and learn gaze behaviour for the rest of the essays, numbering over 7000 essays. Using the learnt gaze behaviour, we can achieve a statistically significant improvement in performance over the state-of-the-art system for the essay sets where we have gaze data. We also achieve a statistically significant improvement for 4 other essay sets, numbering about 6000 essays, where we have no gaze behaviour data available. Our approach establishes that learning gaze behaviour improves automatic essay grading.

pdf bib
A Unified Framework for Multilingual and Code-Mixed Visual Question Answering
Deepak Gupta | Pabitra Lenka | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

In this paper, we propose an effective deep learning framework for multilingual and code-mixed visual question answering. The proposed model is capable of predicting answers from the questions in Hindi, English or Code-mixed (Hinglish: Hindi-English) languages. The majority of the existing techniques on Visual Question Answering (VQA) focus on English questions only. However, many applications such as medical imaging, tourism, visual assistants require a multilinguality-enabled module for their widespread usages. As there is no available dataset in English-Hindi VQA, we firstly create Hindi and Code-mixed VQA datasets by exploiting the linguistic properties of these languages. We propose a robust technique capable of handling the multilingual and code-mixed question to provide the answer against the visual information (image). To better encode the multilingual and code-mixed questions, we introduce a hierarchy of shared layers. We control the behaviour of these shared layers by an attention-based soft layer sharing mechanism, which learns how shared layers are applied in different ways for the different languages of the question. Further, our model uses bi-linear attention with a residual connection to fuse the language and image features. We perform extensive evaluation and ablation studies for English, Hindi and Code-mixed VQA. The evaluation shows that the proposed multilingual model achieves state-of-the-art performance in all these settings.

pdf bib
Extracting Message Sequence Charts from Hindi Narrative Text
Swapnil Hingmire | Nitin Ramrakhiyani | Avinash Kumar Singh | Sangameshwar Patil | Girish Palshikar | Pushpak Bhattacharyya | Vasudeva Varma
Proceedings of the First Joint Workshop on Narrative Understanding, Storylines, and Events

In this paper, we propose the use of Message Sequence Charts (MSC) as a representation for visualizing narrative text in Hindi. An MSC is a formal representation allowing the depiction of actors and interactions among these actors in a scenario, apart from supporting a rich framework for formal inference. We propose an approach to extract MSC actors and interactions from a Hindi narrative. As a part of the approach, we enrich an existing event annotation scheme where we provide guidelines for annotation of the mood of events (realis vs irrealis) and guidelines for annotation of event arguments. We report performance on multiple evaluation criteria by experimenting with Hindi narratives from Indian History. Though Hindi is the fourth most-spoken first language in the world, from the NLP perspective it has comparatively fewer resources than English. Moreover, there is relatively little work in the context of event processing in Hindi. Hence, we believe that this work is among the initial works for Hindi event processing.

pdf bib
A Retrofitting Model for Incorporating Semantic Relations into Word Embeddings
Sapan Shah | Sreedhar Reddy | Pushpak Bhattacharyya
Proceedings of the 28th International Conference on Computational Linguistics

We present a novel retrofitting model that can leverage relational knowledge available in a knowledge resource to improve word embeddings. The knowledge is captured in terms of relation inequality constraints that compare similarity of related and unrelated entities in the context of an anchor entity. These constraints are used as training data to learn a non-linear transformation function that maps original word vectors to a vector space respecting these constraints. The transformation function is learned in a similarity metric learning setting using Triplet network architecture. We applied our model to synonymy, antonymy and hypernymy relations in WordNet and observed large gains in performance over original distributional models as well as other retrofitting approaches on word similarity task and significant overall improvement on lexical entailment detection task.
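
A relation inequality constraint of the kind described here (a related entity should be more similar to the anchor than an unrelated one) is commonly turned into a triplet hinge loss. The sketch below illustrates that general idea only. Cosine similarity, the margin value of 0.5, and the function names are assumptions made for the example, not details taken from the paper, and the learned non-linear transformation is omitted.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_hinge_loss(anchor: Sequence[float],
                       related: Sequence[float],
                       unrelated: Sequence[float],
                       margin: float = 0.5) -> float:
    """Penalize violations of sim(anchor, related) >= sim(anchor, unrelated) + margin.

    The loss is zero when the constraint holds with the (assumed) margin.
    """
    return max(0.0, margin - cosine(anchor, related) + cosine(anchor, unrelated))
```

In a full system the three vectors would first pass through the transformation being learned, and the loss would be minimized over all constraints extracted from the knowledge resource.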

pdf bib
Harnessing Cross-lingual Features to Improve Cognate Detection for Low-resource Languages
Diptesh Kanojia | Raj Dabre | Shubham Dewangan | Pushpak Bhattacharyya | Gholamreza Haffari | Malhar Kulkarni
Proceedings of the 28th International Conference on Computational Linguistics

Cognates are variants of the same lexical form across different languages; for example “fonema” in Spanish and “phoneme” in English are cognates, both of which mean “a unit of sound”. The task of automatic detection of cognates between any two languages can help downstream NLP tasks such as Cross-lingual Information Retrieval, Computational Phylogenetics, and Machine Translation. In this paper, we demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian languages. Our approach introduces the use of context from a knowledge graph to generate improved feature representations for cognate detection. We then evaluate the impact of our cognate detection mechanism on neural machine translation (NMT), as a downstream task. We evaluate our methods to detect cognates on a challenging dataset of twelve Indian languages, namely, Sanskrit, Hindi, Assamese, Oriya, Kannada, Gujarati, Tamil, Telugu, Punjabi, Bengali, Marathi, and Malayalam. Additionally, we create evaluation datasets for two more Indian languages, Konkani and Nepali. We observe an improvement of up to 18 percentage points, in terms of F-score, for cognate detection. Furthermore, we observe that cognates extracted using our method help improve NMT quality by up to 2.76 BLEU. We also release our code, newly constructed datasets and cross-lingual models publicly.

pdf bib
Reinforced Multi-task Approach for Multi-hop Question Generation
Deepak Gupta | Hardik Chauhan | Ravi Tej Akella | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 28th International Conference on Computational Linguistics

Question generation (QG) attempts to solve the inverse of the question answering (QA) problem by generating a natural language question given a document and an answer. While sequence to sequence neural models surpass rule-based systems for QG, they are limited in their capacity to focus on more than one supporting fact. For QG, we often require multiple supporting facts to generate high-quality questions. Inspired by recent works on multi-hop reasoning in QA, we take up Multi-hop question generation, which aims at generating relevant questions based on supporting facts in the context. We employ multitask learning with the auxiliary task of answer-aware supporting fact prediction to guide the question generator. In addition, we propose a question-aware reward function in a Reinforcement Learning (RL) framework to maximize the utilization of the supporting facts. We demonstrate the effectiveness of our approach through experiments on the multi-hop question answering dataset, HotPotQA. Empirical evaluation shows our model to outperform the single-hop neural question generation models on both automatic evaluation metrics such as BLEU, METEOR, and ROUGE and human evaluation metrics for quality and coverage of the generated questions.

pdf bib
Filtering Back-Translated Data in Unsupervised Neural Machine Translation
Jyotsana Khatri | Pushpak Bhattacharyya
Proceedings of the 28th International Conference on Computational Linguistics

Unsupervised neural machine translation (NMT) utilizes only monolingual data for training. The quality of back-translated data plays an important role in the performance of NMT systems. In back-translation, not all generated pseudo parallel sentence pairs are of the same quality. Taking inspiration from domain adaptation, where in-domain sentences are given more weight in training, in this paper we propose an approach to filter back-translated data as part of the training process of unsupervised NMT. Our approach gives more weight to good pseudo parallel sentence pairs in the back-translation phase. We calculate the weight of each pseudo parallel sentence pair using a sentence-wise round-trip BLEU score which is normalized batch-wise. We compare our approach with the current state-of-the-art approaches for unsupervised NMT.
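
The weighting step can be illustrated with a small Python sketch. It takes per-sentence round-trip BLEU scores as given (computing BLEU itself is left out) and turns them into weights for one batch. The abstract does not say which batch-wise normalization is used, so dividing by the batch sum is an assumption here, as are the function names and the uniform fallback when every score is zero.

```python
from typing import List


def batch_normalized_weights(round_trip_bleu: List[float]) -> List[float]:
    """Turn per-sentence round-trip BLEU scores into weights for one batch.

    Assumption: normalization divides each score by the batch sum, so the
    weights in a batch add up to 1. If every score is zero, fall back to
    uniform weights.
    """
    if not round_trip_bleu:
        return []
    total = sum(round_trip_bleu)
    if total == 0:
        return [1.0 / len(round_trip_bleu)] * len(round_trip_bleu)
    return [score / total for score in round_trip_bleu]


def weighted_batch_loss(losses: List[float], weights: List[float]) -> float:
    """Combine per-sentence losses so better pseudo pairs contribute more."""
    return sum(w * loss for w, loss in zip(weights, losses))
```

With this choice, a pseudo parallel pair whose round-trip translation scores three times higher than another pair in the same batch contributes three times as much to the batch loss.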

pdf bib
MEISD: A Multimodal Multi-Label Emotion, Intensity and Sentiment Dialogue Dataset for Emotion Recognition and Sentiment Analysis in Conversations
Mauajama Firdaus | Hardik Chauhan | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 28th International Conference on Computational Linguistics

Emotion and sentiment classification in dialogues is a challenging task that has gained popularity in recent times. Humans tend to have multiple emotions with varying intensities while expressing their thoughts and feelings. Emotions in an utterance of dialogue can either be independent or dependent on the previous utterances, thus making the task complex and interesting. Multi-label emotion detection in conversations is a significant task that enables the system to understand the various emotions of the users interacting. Sentiment analysis in dialogue/conversation, on the other hand, helps in understanding the perspective of the user with respect to the ongoing conversation. Along with text, additional information in the form of audio and video assists in identifying the correct emotions with the appropriate intensity and sentiments in an utterance of a dialogue. Lately, quite a few datasets have been made available for dialogue emotion and sentiment classification, but these datasets are imbalanced in representing different emotions and consist of only a single emotion. Hence, we first present a large-scale balanced Multimodal Multi-label Emotion, Intensity, and Sentiment Dialogue dataset (MEISD), collected from different TV series, that has textual, audio and visual features, and then establish a baseline setup for further research.

pdf bib
Analysing cross-lingual transfer in lemmatisation for Indian languages
Kumar Saurav | Kumar Saunack | Pushpak Bhattacharyya
Proceedings of the 28th International Conference on Computational Linguistics

Lemmatization aims to reduce the sparse data problem by relating the inflected forms of a word to its dictionary form. However, most of the prior work on this topic has focused on high resource languages. In this paper, we evaluate cross-lingual approaches for low resource languages, especially in the context of morphologically rich Indian languages. We test our model on six languages from two different families and develop linguistic insights into each model’s performance.

pdf bib
Proceedings of the 17th International Conference on Natural Language Processing (ICON)
Pushpak Bhattacharyya | Dipti Misra Sharma | Rajeev Sangal
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

pdf bib
Cognitively Aided Zero-Shot Automatic Essay Grading
Sandeep Mathias | Rudra Murthy | Diptesh Kanojia | Pushpak Bhattacharyya
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

Automatic essay grading (AEG) is a process in which machines assign a grade to an essay written in response to a topic, called the prompt. Zero-shot AEG is when we train a system to grade essays written to a new prompt which was not present in our training data. In this paper, we describe a solution to the problem of zero-shot automatic essay grading, using cognitive information, in the form of gaze behaviour. Our experiments show that using gaze behaviour helps in improving the performance of AEG systems, especially when we provide a new essay written in response to a new prompt for scoring, by an average of almost 5 percentage points of QWK.

pdf bib
Semantic Extractor-Paraphraser based Abstractive Summarization
Anubhav Jangra | Raghav Jain | Vaibhav Mavi | Sriparna Saha | Pushpak Bhattacharyya
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

The anthology of spoken languages today is inundated with textual information, necessitating the development of automatic summarization models. In this manuscript, we propose an extractor-paraphraser based abstractive summarization system that exploits semantic overlap as opposed to its predecessors that focus more on syntactic information overlap. Our model outperforms the state-of-the-art baselines in terms of ROUGE, METEOR and word mover similarity (WMS), establishing the superiority of the proposed system via extensive ablation experiments. We have also challenged the summarization capabilities of the state-of-the-art Pointer Generator Network (PGN), and through thorough experimentation, shown that PGN is more of a paraphraser, contrary to the prevailing notion of a summarizer, illustrating its inability to accumulate information across multiple sentences.

pdf bib
A Multi-modal Personality Prediction System
Chanchal Suman | Aditya Gupta | Sriparna Saha | Pushpak Bhattacharyya
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

Automatic prediction of personality traits has many real-life applications, e.g., in forensics, recommender systems, personalized services etc. In this work, we have proposed a solution framework for solving the problem of predicting the personality traits of a user from videos. Ambient, facial and audio features are extracted from the video of the user. These features are used for the final output prediction. The visual and audio modalities are combined in two different ways: averaging of predictions obtained from the individual modalities, and concatenation of features in a multi-modal setting. The dataset released in Chalearn-16 is used for evaluating the performance of the system. Experimental results illustrate that it is possible to obtain better performance with a handful of images, rather than using all the images present in the video.

pdf bib
D-Coref: A Fast and Lightweight Coreference Resolution Model using DistilBERT
Chanchal Suman | Jeetu Kumar | Sriparna Saha | Pushpak Bhattacharyya
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

Smart devices are often deployed as edge devices, which require quality solutions within a limited amount of memory. In most user-interaction based smart devices, coreference resolution is often required. Keeping this in view, we have developed a fast and lightweight coreference resolution model which meets the minimum memory requirement and converges faster. In order to generate the embeddings for solving the task of coreference resolution, DistilBERT, a lightweight BERT module, is utilized. DistilBERT consumes less memory (only 60% of the memory of a BERT-based heavy model) and it is suitable for deployment in edge devices. DistilBERT embedding helps in 60% faster convergence with an accuracy compromise of 2.59% and 6.49% with respect to its base model and the current state-of-the-art, respectively.

pdf bib
Leveraging Alignment and Phonology for low-resource Indic to English Neural Machine Transliteration
Parth Patel | Manthan Mehta | Pushpak Bhattacharya | Arjun Atreya
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

In this paper we present a novel transliteration technique based on Orthographic Syllable (OS) segmentation for low-resource Indian languages (ILs). Given that alignment has produced promising results in Statistical Machine Transliteration systems and phonology plays an important role in transliteration, we introduce a new model which uses an alignment representation similar to that of IBM Model 3 to pre-process the tokenized input sequence and then uses pre-trained source and target OS-embeddings for training. We apply our model for transliteration from ILs to English and report our accuracy based on Top-1 Exact Match. We also compare our accuracy with a previously proposed Phrase-Based model and report improvements.

pdf bib
Annotated Corpus of Tweets in English from Various Domains for Emotion Detection
Soumitra Ghosh | Asif Ekbal | Pushpak Bhattacharyya | Sriparna Saha | Vipin Tyagi | Alka Kumar | Shikha Srivastava | Nitish Kumar
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

Emotion recognition is a very well-attended problem in Natural Language Processing (NLP). Most of the existing works on emotion recognition focus on the general domain and in some cases to specific domains like fairy tales, blogs, weather, Twitter etc. But emotion analysis systems in the domains of security, social issues, technology, politics, sports, etc. are very rare. In this paper, we create a benchmark setup for emotion recognition in these specialised domains. First, we construct a corpus of 18,921 tweets in English annotated with Paul Ekman’s six basic emotions (Anger, Disgust, Fear, Happiness, Sadness, Surprise) and a non-emotive class Others. Thereafter, we propose a deep neural framework to perform emotion recognition in an end-to-end setting. We build various models based on Convolutional Neural Network (CNN), Bi-directional Long Short Term Memory (Bi-LSTM), Bi-directional Gated Recurrent Unit (Bi-GRU). We propose a Hierarchical Attention-based deep neural network for Emotion Detection (HAtED). We also develop multiple systems by considering different sets of emotion classes for each system and report the detailed comparative analysis of the results. Experiments show the hierarchical attention-based model achieves best results among the considered baselines with accuracy of 69%.

pdf bib
Proceedings of the 7th Workshop on Asian Translation
Toshiaki Nakazawa | Hideki Nakayama | Chenchen Ding | Raj Dabre | Anoop Kunchukuttan | Win Pa Pa | Ondřej Bojar | Shantipriya Parida | Isao Goto | Hideya Mino | Hiroshi Manabe | Katsuhito Sudoh | Sadao Kurohashi | Pushpak Bhattacharyya
Proceedings of the 7th Workshop on Asian Translation

pdf bib
“A Passage to India”: Pre-trained Word Embeddings for Indian Languages
Saurav Kumar | Saunack Kumar | Diptesh Kanojia | Pushpak Bhattacharyya
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

Dense word vectors or ‘word embeddings’ which encode semantic properties of words, have now become integral to NLP tasks like Machine Translation (MT), Question Answering (QA), Word Sense Disambiguation (WSD), and Information Retrieval (IR). In this paper, we use various existing approaches to create multiple word embeddings for 14 Indian languages. We place these embeddings for all these languages, viz., Assamese, Bengali, Gujarati, Hindi, Kannada, Konkani, Malayalam, Marathi, Nepali, Odiya, Punjabi, Sanskrit, Tamil, and Telugu in a single repository. Relatively newer approaches that emphasize catering to context (BERT, ELMo, etc.) have shown significant improvements, but require a large amount of resources to generate usable models. We release pre-trained embeddings generated using both contextual and non-contextual approaches. We also use MUSE and XLM to train cross-lingual embeddings for all pairs of the aforementioned languages. To show the efficacy of our embeddings, we evaluate our embedding models on XPOS, UPOS and NER tasks for all these languages. We release a total of 436 models using 8 different approaches. We hope they are useful for the resource-constrained Indian language NLP. The title of this paper refers to the famous novel “A Passage to India” by E.M. Forster, published initially in 1924.

pdf bib
Generating Fluent Translations from Disfluent Text Without Access to Fluent References: IIT Bombay@IWSLT2020
Nikhil Saini | Jyotsana Khatri | Preethi Jyothi | Pushpak Bhattacharyya
Proceedings of the 17th International Conference on Spoken Language Translation

Machine translation systems perform reasonably well when the input is well-formed speech or text. Conversational speech is spontaneous and inherently consists of many disfluencies. Producing fluent translations of disfluent source text would typically require parallel disfluent to fluent training data. However, fluent translations of spontaneous speech are an additional resource that is tedious to obtain. This work describes the submission of IIT Bombay to the Conversational Speech Translation challenge at IWSLT 2020. We specifically tackle the problem of disfluency removal in disfluent-to-fluent text-to-text translation assuming no access to fluent references during training. Common patterns of disfluency are extracted from disfluent references and a noise induction model is used to simulate them starting from a clean monolingual corpus. This synthetically constructed dataset is then considered as a proxy for labeled data during training. We also make use of additional fluent text in the target language to help generate fluent translations. This work uses no fluent references during training and beats a baseline model by a margin of 4.21 and 3.11 BLEU points where the baseline uses disfluent and fluent references, respectively. Index Terms- disfluency removal, machine translation, noise induction, leveraging monolingual data, denoising for disfluency removal.

pdf bib
A Semi-supervised Approach to Generate the Code-Mixed Text using Pre-trained Encoder and Transfer Learning
Deepak Gupta | Asif Ekbal | Pushpak Bhattacharyya
Findings of the Association for Computational Linguistics: EMNLP 2020

Code-mixing, the interleaving of two or more languages within a sentence or discourse, is ubiquitous in multilingual societies. The lack of code-mixed training data is one of the major concerns for the development of end-to-end neural network-based models to be deployed for a variety of natural language processing (NLP) applications. A potential solution is to either manually create or crowd-source the code-mixed labelled data for the task at hand, but that requires much human effort and is often not feasible because of the language-specific diversity in code-mixed text. To circumvent the data scarcity issue, we propose an effective deep learning approach for automatically generating code-mixed text from English to multiple languages without any parallel data. In order to train the neural network, we create synthetic code-mixed texts from the available parallel corpus by modelling various linguistic properties of code-mixing. Our code-mixed text generator is built upon the encoder-decoder framework, where the encoder is augmented with the linguistic and task-agnostic features obtained from the transformer-based language model. We also transfer the knowledge from a neural machine translation (NMT) model to warm-start the training of the code-mixed generator. Experimental results and in-depth analysis show the effectiveness of our proposed code-mixed text generation on eight diverse language pairs.

pdf bib
Looking inside Noun Compounds: Unsupervised Prepositional and Free Paraphrasing
Girishkumar Ponkiya | Rudra Murthy | Pushpak Bhattacharyya | Girish Palshikar
Findings of the Association for Computational Linguistics: EMNLP 2020

A noun compound is a sequence of contiguous nouns that acts as a single noun, although the predicate denoting the semantic relation between its components is dropped. Noun Compound Interpretation is the task of uncovering the relation, in the form of a preposition or a free paraphrase. Prepositional paraphrasing refers to the use of preposition to explain the semantic relation, whereas free paraphrasing refers to invoking an appropriate predicate denoting the semantic relation. In this paper, we propose an unsupervised methodology for these two types of paraphrasing. We use pre-trained contextualized language models to uncover the ‘missing’ words (preposition or predicate). These language models are usually trained to uncover the missing word/words in a given input sentence. Our approach uses templates to prepare the input sequence for the language model. The template uses a special token to indicate the missing predicate. As the model has already been pre-trained to uncover a missing word (or a sequence of words), we exploit it to predict missing words for the input sequence. Our experiments using four datasets show that our unsupervised approach (a) performs comparably to supervised approaches for prepositional paraphrasing, and (b) outperforms supervised approaches for free paraphrasing. Paraphrasing (prepositional or free) using our unsupervised approach is potentially helpful for NLP tasks like machine translation and information extraction.
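
As a rough illustration of the template step described above, the sketch below builds masked input strings for a noun compound. The template wording, the `[MASK]` token string, and the function names are all assumptions made for illustration; the abstract does not give the actual templates.

```python
MASK = "[MASK]"  # assumed mask token; the real token depends on the language model


def prepositional_template(modifier, head):
    """One masked slot between head and modifier, to be filled by a preposition."""
    return f"{head} {MASK} {modifier}"


def free_paraphrase_template(modifier, head, n_masks=2):
    """Several masked slots, to be filled by a predicate of n_masks tokens."""
    masks = " ".join([MASK] * n_masks)
    return f"{head} that {masks} {modifier}"


if __name__ == "__main__":
    # "olive oil": a masked language model might fill the slot with "from".
    print(prepositional_template("olive", "oil"))
    print(free_paraphrase_template("olive", "oil"))
```

The resulting strings would be passed to a pre-trained masked language model, whose predictions for the masked positions serve as the paraphrase.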

pdf bib
Part-of-Speech Annotation Challenges in Marathi
Gajanan Rane | Nilesh Joshi | Geetanjali Rane | Hanumant Redkar | Malhar Kulkarni | Pushpak Bhattacharyya
Proceedings of the WILDRE5– 5th Workshop on Indian Language Data: Resources and Evaluation

Part of Speech (POS) annotation is a significant challenge in natural language processing. The paper discusses issues and challenges faced in the process of POS annotation of the Marathi data from four domains viz., tourism, health, entertainment and agriculture. During POS annotation, a lot of issues were encountered. Some of the major ones are discussed in detail in this paper. Also, the two approaches viz., the lexical (L approach) and the functional (F approach) of POS tagging have been discussed and presented with examples. Further, some ambiguous cases in POS annotation are presented in the paper.

pdf bib
CEASE, a Corpus of Emotion Annotated Suicide notes in English
Soumitra Ghosh | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 12th Language Resources and Evaluation Conference

A suicide note is usually written shortly before the suicide and it provides a chance to comprehend the self-destructive state of mind of the deceased. From a psychological point of view, suicide notes have been utilized for recognizing the motive behind the suicide. To the best of our knowledge, there is no openly accessible suicide note corpus at present, making it challenging for the researchers and developers to deep dive into the area of mental health assessment and suicide prevention. In this paper, we create a fine-grained emotion annotated corpus (CEASE) of suicide notes in English and develop various deep learning models to perform emotion detection on the curated dataset. The corpus consists of 2393 sentences from around 205 suicide notes collected from various sources. Each sentence is annotated with a particular emotion class from a set of 15 fine-grained emotion labels, namely (forgiveness, happiness_peacefulness, love, pride, hopefulness, thankfulness, blame, anger, fear, abuse, sorrow, hopelessness, guilt, information, instructions). For the evaluation, we develop an ensemble architecture, where the base models correspond to three supervised deep learning models, namely Convolutional Neural Network (CNN), Gated Recurrent Unit (GRU) and Long Short Term Memory (LSTM). We obtain the highest test accuracy of 60.17% and cross-validation accuracy of 60.32%.

pdf bib
A Platform for Event Extraction in Hindi
Sovan Kumar Sahoo | Saumajit Saha | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 12th Language Resources and Evaluation Conference

Event Extraction is an important task in the widespread field of Natural Language Processing (NLP). Though this task is adequately addressed in English with sufficient resources, we are unaware of any benchmark setup in Indian languages. Hindi is one of the most widely spoken languages in the world. In this paper, we present an Event Extraction framework for Hindi language by creating an annotated resource for benchmarking, and then developing deep learning based models to set as the baselines. We crawl more than seventeen hundred disaster related Hindi news articles from the various news sources. We also develop deep learning based models for Event Trigger Detection and Classification, Argument Detection and Classification and Event-Argument Linking.

pdf bib
Challenge Dataset of Cognates and False Friend Pairs from Indian Languages
Diptesh Kanojia | Malhar Kulkarni | Pushpak Bhattacharyya | Gholamreza Haffari
Proceedings of the 12th Language Resources and Evaluation Conference

Cognates are present in multiple variants of the same text across different languages (e.g., “hund” in German and “hound” in the English language mean “dog”). They pose a challenge to various Natural Language Processing (NLP) applications such as Machine Translation, Cross-lingual Sense Disambiguation, Computational Phylogenetics, and Information Retrieval. A possible solution to address this challenge is to identify cognates across language pairs. In this paper, we describe the creation of two cognate datasets for twelve Indian languages namely Sanskrit, Hindi, Assamese, Oriya, Kannada, Gujarati, Tamil, Telugu, Punjabi, Bengali, Marathi, and Malayalam. We digitize the cognate data from an Indian language cognate dictionary and utilize linked Indian language Wordnets to generate cognate sets. Additionally, we use the Wordnet data to create a False Friends’ dataset for eleven language pairs. We also evaluate the efficacy of our dataset using previously available baseline cognate detection approaches. We also perform a manual evaluation with the help of lexicographers and release the curated gold-standard dataset with this paper.

pdf bib
Incorporating Politeness across Languages in Customer Care Responses: Towards building a Multi-lingual Empathetic Dialogue Agent
Mauajama Firdaus | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 12th Language Resources and Evaluation Conference

Customer satisfaction is an essential aspect of customer care systems. It is imperative for such systems to be polite while handling customer requests/demands. In this paper, we present a large multi-lingual conversational dataset for English and Hindi. We choose data from Twitter having both generic and courteous responses between customer care agents and aggrieved users. We also propose strong baselines that can induce courteous behaviour in generic customer care response in a multi-lingual scenario. We build a deep learning framework that can simultaneously handle different languages and incorporate polite behaviour in the customer care agent’s responses. Our system is competent in generating responses in different languages (here, English and Hindi) depending on the customer’s preference and also is able to converse with humans in an empathetic manner to ensure customer satisfaction and retention. Experimental results show that our proposed models can converse in both the languages and the information shared between the languages helps in improving the performance of the overall system. Qualitative and quantitative analysis shows that the proposed method can converse in an empathetic manner by incorporating courteousness in the responses and hence increasing customer satisfaction.

pdf bib
Recommendation Chart of Domains for Cross-Domain Sentiment Analysis: Findings of A 20 Domain Study
Akash Sheoran | Diptesh Kanojia | Aditya Joshi | Pushpak Bhattacharyya
Proceedings of the 12th Language Resources and Evaluation Conference

Cross-domain sentiment analysis (CDSA) helps to address the problem of data scarcity in scenarios where labelled data for a domain (known as the target domain) is unavailable or insufficient. However, the decision to choose a domain (known as the source domain) to leverage from is, at best, intuitive. In this paper, we investigate text similarity metrics to facilitate source domain selection for CDSA. We report results on 20 domains (all possible pairs) using 11 similarity metrics. Specifically, we compare CDSA performance with these metrics for different domain-pairs to enable the selection of a suitable source domain, given a target domain. These metrics include two novel metrics for evaluating domain adaptability to help source domain selection of labelled data and utilize word and sentence-based embeddings as metrics for unlabelled data. The goal of our experiments is a recommendation chart that gives the K best source domains for CDSA for a given target domain. We show that the best K source domains returned by our similarity metrics have a precision of over 50%, for varying values of K.
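
The recommendation chart described above amounts to ranking candidate source domains by a similarity score and checking how many of the top K are truly good sources. The following is a minimal sketch under that reading; the domain names and scores are invented, and the precision definition is an assumption rather than the paper's exact evaluation protocol.

```python
def top_k_sources(similarity, target, k):
    """Rank candidate source domains for `target` by similarity, highest first."""
    ranked = sorted(similarity[target].items(), key=lambda item: item[1], reverse=True)
    return [domain for domain, _ in ranked[:k]]


def precision_at_k(recommended, actual_best):
    """Fraction of recommended sources that are among the truly best sources."""
    if not recommended:
        return 0.0
    hits = sum(1 for domain in recommended if domain in actual_best)
    return hits / len(recommended)


if __name__ == "__main__":
    # Toy similarity scores between a target domain and candidate sources.
    similarity = {"books": {"movies": 0.8, "music": 0.6, "kitchen": 0.2}}
    recommended = top_k_sources(similarity, "books", k=2)
    # Hypothetical ground truth: sources that gave the best CDSA accuracy.
    print(recommended, precision_at_k(recommended, {"movies", "kitchen"}))
```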

pdf bib
Multi-domain Tweet Corpora for Sentiment Analysis: Resource Creation and Evaluation
Mamta . | Asif Ekbal | Pushpak Bhattacharyya | Shikha Srivastava | Alka Kumar | Tista Saha
Proceedings of the 12th Language Resources and Evaluation Conference

Due to the phenomenal growth of online content in recent times, sentiment analysis has attracted the attention of researchers and developers. A number of benchmark annotated corpora are available for domains like movie reviews, product reviews, hotel reviews, etc. The pervasiveness of social media has also led to a huge amount of content posted by users who are misusing the power of social media to spread false beliefs and to negatively influence others. This type of content comes from domains like terrorism, cybersecurity, technology, social issues, etc. Mining of opinions from these domains is important to create a socially intelligent system to provide security to the public and to maintain law and order. To the best of our knowledge, there are no publicly available tweet corpora for such pervasive domains. Hence, we first create multi-domain tweet sentiment corpora and then establish a deep neural network based baseline framework to address the above-mentioned issues. The annotated corpus has a Cohen’s Kappa measurement for annotation quality of 0.770, which shows that the data is of acceptable quality. We are able to achieve 84.65% accuracy for sentiment analysis by using an ensemble of Convolutional Neural Network (CNN), Long Short Term Memory (LSTM), and Gated Recurrent Unit (GRU).
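
For readers unfamiliar with the agreement measure quoted above, Cohen's Kappa for two annotators is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The sketch below computes it on made-up labels; it is not the authors' annotation data or tooling.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    # Chance agreement: product of each annotator's label proportions.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    annotator_1 = ["pos", "pos", "neg", "neg"]
    annotator_2 = ["pos", "neg", "neg", "neg"]
    print(cohens_kappa(annotator_1, annotator_2))
```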

pdf bib
ScholarlyRead: A New Dataset for Scientific Article Reading Comprehension
Tanik Saikh | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 12th Language Resources and Evaluation Conference

We present ScholarlyRead, a span-of-word-based Reading Comprehension (RC) dataset of scholarly articles with approximately 10K manually checked passage-question-answer instances. ScholarlyRead was constructed in a semi-automatic way. We consider the articles from two popular journals of a reputed publishing house. First, we generate questions from these articles in an automatic way. The generated questions are then manually checked by human annotators. We propose a baseline model based on the Bi-Directional Attention Flow (BiDAF) network that yields an F1 score of 37.31%. The framework would be useful for building Question-Answering (QA) systems on scientific articles.

2019

pdf bib
Utilizing Wordnets for Cognate Detection among Indian Languages
Diptesh Kanojia | Kevin Patel | Malhar Kulkarni | Pushpak Bhattacharyya | Gholamreza Haffari
Proceedings of the 10th Global Wordnet Conference

Automatic Cognate Detection (ACD) is a challenging task which has been utilized to help NLP applications like Machine Translation, Information Retrieval and Computational Phylogenetics. Unidentified cognate pairs can pose a challenge to these applications and result in a degradation of performance. In this paper, we detect cognate word pairs among ten Indian languages with Hindi and use deep learning methodologies to predict whether a word pair is cognate or not. We identify IndoWordnet as a potential resource to detect cognate word pairs based on orthographic similarity-based methods and train neural network models using the data obtained from it. We identify parallel corpora as another potential resource and perform the same experiments for them. We also validate the contribution of Wordnets through further experimentation and report improved performance of up to 26%. We discuss the nuances of cognate detection among closely related Indian languages and release the lists of detected cognates as a dataset. We also observe the behaviour of, to an extent, unrelated Indian language pairs and release the lists of detected cognates among them as well.

pdf bib
Proceedings of the 16th International Conference on Natural Language Processing
Dipti Misra Sharma | Pushpak Bhattacharya
Proceedings of the 16th International Conference on Natural Language Processing

pdf bib
A Deep Ensemble Framework for Fake News Detection and Multi-Class Classification of Short Political Statements
Arjun Roy | Kingshuk Basak | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 16th International Conference on Natural Language Processing

Fake news, rumor, incorrect information, and misinformation detection are nowadays crucial issues as these might have serious consequences for our social fabric. Such information is increasing rapidly due to the availability of enormous web information sources including social media feeds, news blogs, online newspapers etc. In this paper, we develop various deep learning models for detecting fake news and classifying them into the pre-defined fine-grained categories. At first, we develop individual models based on Convolutional Neural Network (CNN), and Bi-directional Long Short Term Memory (Bi-LSTM) networks. The representations obtained from these two models are fed into a Multi-layer Perceptron Model (MLP) for the final classification. Our experiments on a benchmark dataset show promising results with an overall accuracy of 44.87%, which outperforms the current state of the art.

pdf bib
Multi-linguality helps: Event-Argument Extraction for Disaster Domain in Cross-lingual and Multi-lingual setting
Zishan Ahmad | Deeksha Varshney | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 16th International Conference on Natural Language Processing

Automatic extraction of disaster-related events and their arguments from natural language text is vital for building a decision support system for crisis management. Event extraction from various news sources is a well-explored area for this objective. However, extracting events alone, without any context, provides only partial help for this purpose. Extracting related arguments like Time, Place, Casualties, etc., provides a complete picture of the disaster event. In this paper, we create a disaster domain dataset in Hindi by annotating disaster-related event and arguments. We also obtain equivalent datasets for Bengali and English from a collaboration. We build a multi-lingual deep learning model for argument extraction in all the three languages. We also compare our multi-lingual system with a similar baseline mono-lingual system trained for each language separately. It is observed that a single multi-lingual system is able to compensate for lack of training data, by using joint training of dataset from different languages in shared space, thus giving a better overall result.

pdf bib
A Multi-task Model for Multilingual Trigger Detection and Classification
Sovan Kumar Sahoo | Saumajit Saha | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 16th International Conference on Natural Language Processing

In this paper, we present a deep multi-task learning framework for multilingual event and argument trigger detection and classification. In our current work, we identify detection and classification of both event and argument triggers as related tasks and follow a multi-tasking approach to solve them simultaneously, in contrast to previous works where these tasks were solved separately or only some of them were learned jointly. We evaluate the proposed approach with multiple low-resource Indian languages. As there were no datasets available for the Indian languages, we have annotated disaster-related news data crawled from the online news portal for different low-resource Indian languages for our experiments. Our empirical evaluation shows that the multi-task model performs better than the single-task model, and that classification helps in trigger detection and vice versa.

pdf bib
Converting Sentiment Annotated Data to Emotion Annotated Data
Manasi Kulkarni | Pushpak Bhattacharyya
Proceedings of the 16th International Conference on Natural Language Processing

Existing supervised solutions for emotion classification demand a large amount of emotion annotated data. Such resources may not be available for many languages. However, it is common to have sentiment annotated data available in these languages. The sentiment information (+1 or -1) is useful to segregate positive emotions from negative emotions. In this paper, we propose an unsupervised approach for emotion recognition by taking advantage of the sentiment information. Given a sentence and its sentiment information, the task is to recognize the best possible emotion for it. For every sentence, the semantic relatedness between the words of the sentence and a set of emotion-specific words is calculated using cosine similarity. An emotion vector representing the emotion score for each emotion category of Ekman’s model is created. It is further improved with the dependency relations, and the best possible emotion is predicted. The results show a significant improvement in f-score values for text with sentiment information as input over our baseline of text without sentiment information. We report the weighted f-score on three different datasets with Ekman’s emotion model. This supports the claim that, by leveraging the sentiment value, better emotion annotated data can be created.
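
The scoring step described above can be sketched as follows. The toy 3-dimensional vectors, the seed words, the reduced emotion set, and the polarity grouping are all illustrative assumptions; the sketch also omits the dependency-relation refinement mentioned in the abstract.

```python
import math

# Toy 3-d vectors; real embeddings would come from a trained model (assumption).
VECTORS = {
    "happy": [0.9, 0.1, 0.0],
    "delighted": [0.8, 0.2, 0.1],
    "sad": [0.1, 0.9, 0.0],
    "tears": [0.2, 0.8, 0.1],
    "angry": [0.0, 0.2, 0.9],
    "furious": [0.1, 0.1, 0.9],
}

# Hypothetical seed words per emotion and a hypothetical polarity grouping.
EMOTION_SEEDS = {"joy": ["happy"], "sadness": ["sad"], "anger": ["angry"]}
POLARITY = {"joy": 1, "sadness": -1, "anger": -1}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def emotion_vector(words):
    """Average similarity between sentence words and each emotion's seed words."""
    scores = {}
    for emotion, seeds in EMOTION_SEEDS.items():
        sims = [cosine(VECTORS[w], VECTORS[s])
                for w in words if w in VECTORS for s in seeds]
        scores[emotion] = sum(sims) / len(sims) if sims else 0.0
    return scores


def predict_emotion(words, sentiment):
    """Pick the best-scoring emotion whose polarity matches the sentiment label."""
    scores = emotion_vector(words)
    allowed = {e: s for e, s in scores.items() if POLARITY[e] == sentiment}
    return max(allowed, key=allowed.get)


if __name__ == "__main__":
    print(predict_emotion(["tears"], -1))
    print(predict_emotion(["delighted"], 1))
```

Restricting the candidates by sentiment is what lets the sentiment label act as a cheap filter before the similarity scores are compared.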

pdf bib
A Deep Learning Approach for Automatic Detection of Fake News
Tanik Saikh | Arkadipta De | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 16th International Conference on Natural Language Processing

Fake news detection is a very prominent and essential task in the field of journalism. This challenging problem is seen so far in the field of politics, but it could be even more challenging when it is to be determined in the multi-domain platform. In this paper, we propose two effective models based on deep learning for solving fake news detection problem in online news contents of multiple domains. We evaluate our techniques on the two recently released datasets, namely Fake News AMT and Celebrity for fake news detection. The proposed systems yield encouraging performance, outperforming the current hand-crafted feature engineering based state-of-the-art system with a significant margin of 3.08% and 9.3% by the two models, respectively. In order to exploit the datasets, available for the related tasks, we perform cross-domain analysis (model trained on FakeNews AMT and tested on Celebrity and vice versa) to explore the applicability of our systems across the domains.

pdf bib
Context-aware Interactive Attention for Multi-modal Sentiment and Emotion Analysis
Dushyant Singh Chauhan | Md Shad Akhtar | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

In recent times, multi-modal analysis has been an emerging and highly sought-after field at the intersection of natural language processing, computer vision, and speech processing. The prime objective of such studies is to leverage the diversified information, (e.g., textual, acoustic and visual), for learning a model. The effective interaction among these modalities often leads to a better system in terms of performance. In this paper, we introduce a recurrent neural network based approach for the multi-modal sentiment and emotion analysis. The proposed model learns the inter-modal interaction among the participating modalities through an auto-encoder mechanism. We employ a context-aware attention module to exploit the correspondence among the neighboring utterances. We evaluate our proposed approach for five standard multi-modal affect analysis datasets. Experimental results suggest the efficacy of the proposed model for both sentiment and emotion analysis over various existing state-of-the-art systems.

pdf bib
DeepSentiPeer: Harnessing Sentiment in Review Texts to Recommend Peer Review Decisions
Tirthankar Ghosal | Rajeev Verma | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Automatically validating a research artefact is one of the frontiers in Artificial Intelligence (AI) that directly brings it close to competing with human intellect and intuition. Although criticised sometimes, the existing peer review system still stands as the benchmark of research validation. The present-day peer review process is not straightforward and demands profound domain knowledge, expertise, and intelligence of human reviewer(s), which is somewhat elusive with the current state of AI. However, the peer review texts, which contain rich sentiment information of the reviewer, reflecting his/her overall attitude towards the research in the paper, could be a valuable entity to predict the acceptance or rejection of the manuscript under consideration. Here in this work, we investigate the role of reviewer sentiment embedded within peer review texts to predict the peer review outcome. Our proposed deep neural architecture takes into account three channels of information: the paper, the corresponding reviews, and review’s polarity to predict the overall recommendation score as well as the final decision. We achieve significant performance improvement over the baselines (∼ 29% error reduction) proposed in a recently released dataset of peer reviews. An AI of this kind could assist the editors/program chairs as an additional layer of confidence, especially when non-responding/missing reviewers are frequent in present day peer review.

pdf bib
Multilingual Unsupervised NMT using Shared Encoder and Language-Specific Decoders
Sukanta Sen | Kamal Kumar Gupta | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

In this paper, we propose a multilingual unsupervised NMT scheme which jointly trains multiple languages with a shared encoder and multiple decoders. Our approach is based on denoising autoencoding of each language and back-translating between English and multiple non-English languages. This results in a universal encoder which can encode any language participating in training into an inter-lingual representation, and language-specific decoders. Our experiments using only monolingual corpora show that multilingual unsupervised model performs better than the separately trained bilingual models achieving improvement of up to 1.48 BLEU points on WMT test sets. We also observe that even if we do not train the network for all possible translation directions, the network is still able to translate in a many-to-many fashion leveraging encoder’s ability to generate interlingual representation.

pdf bib
A Unified Multi-task Adversarial Learning Framework for Pharmacovigilance Mining
Shweta Yadav | Asif Ekbal | Sriparna Saha | Pushpak Bhattacharyya
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

The mining of adverse drug reaction (ADR) has a crucial role in pharmacovigilance. The traditional ways of identifying ADR are reliable but time-consuming, non-scalable and offer a very limited amount of ADR relevant information. With the unprecedented growth of information sources in the forms of social media texts (Twitter, Blogs, Reviews etc.), biomedical literature, and Electronic Medical Records (EMR), it has become crucial to extract the most pertinent ADR related information from these free-form texts. In this paper, we propose a neural network inspired multi-task learning framework that can simultaneously extract ADRs from various sources. We adopt a novel adversarial learning-based approach to learn features across multiple ADR information sources. Unlike the other existing techniques, our approach is capable of extracting fine-grained information (such as ‘Indications’, ‘Symptoms’, ‘Finding’, ‘Disease’, ‘Drug’) which provides important cues in pharmacovigilance. We evaluate our proposed approach on three publicly available real-world benchmark pharmacovigilance datasets, a Twitter dataset from PSB 2016 Social Media Shared Task, CADEC corpus and Medline ADR corpus. Experiments show that our unified framework achieves state-of-the-art performance on individual tasks associated with the different benchmark datasets. This establishes the fact that our proposed approach is generic, which enables it to achieve high performance on the diverse datasets.

pdf bib
Ordinal and Attribute Aware Response Generation in a Multimodal Dialogue System
Hardik Chauhan | Mauajama Firdaus | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Multimodal dialogue systems have opened new frontiers in the traditional goal-oriented dialogue systems. The state-of-the-art dialogue systems are primarily based on unimodal sources, predominantly the text, and hence cannot capture the information present in the other sources such as videos, audios, images etc. With the availability of large scale multimodal dialogue dataset (MMD) (Saha et al., 2018) on the fashion domain, the visual appearance of the products is essential for understanding the intention of the user. Without capturing the information from both the text and image, the system will be incapable of generating correct and desirable responses. In this paper, we propose a novel position and attribute aware attention mechanism to learn enhanced image representation conditioned on the user utterance. Our evaluation shows that the proposed model can generate appropriate responses while preserving the position and attribute information. Experimental results also prove that our proposed approach attains superior performance compared to the baseline models, and outperforms the state-of-the-art approaches on text similarity based evaluation metrics.

pdf bib
Language-Agnostic Model for Aspect-Based Sentiment Analysis
Md Shad Akhtar | Abhishek Kumar | Asif Ekbal | Chris Biemann | Pushpak Bhattacharyya
Proceedings of the 13th International Conference on Computational Semantics - Long Papers

In this paper, we propose a language-agnostic deep neural network architecture for aspect-based sentiment analysis. The proposed approach is based on Bidirectional Long Short-Term Memory (Bi-LSTM) network, which is further assisted with extra hand-crafted features. We define three different architectures for the successful combination of word embeddings and hand-crafted features. We evaluate the proposed approach for six languages (i.e. English, Spanish, French, Dutch, German and Hindi) and two problems (i.e. aspect term extraction and aspect sentiment classification). Experiments show that the proposed model attains state-of-the-art performance in most of the settings.

pdf bib
“When Numbers Matter!”: Detecting Sarcasm in Numerical Portions of Text
Abhijeet Dubey | Lakshya Kumar | Arpan Somani | Aditya Joshi | Pushpak Bhattacharyya
Proceedings of the Tenth Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Research in sarcasm detection spans almost a decade. However, a particular form of sarcasm remains unexplored: sarcasm expressed through numbers, which we estimate forms about 11% of the sarcastic tweets in our dataset. The sentence ‘Love waking up at 3 am’ is sarcastic because of the number. In this paper, we focus on detecting sarcasm in tweets arising out of numbers. Initially, to get an insight into the problem, we implement a rule-based and a statistical machine learning-based (ML) classifier. The rule-based classifier conveys the crux of the numerical sarcasm problem, namely, incongruity arising out of numbers. The statistical ML classifier uncovers the indicators, i.e., features of such sarcasm. The actual systems in place, however, are two deep learning (DL) models, a CNN and an attention network, that obtain F-scores of 0.93 and 0.91 on our dataset of tweets containing numbers. To the best of our knowledge, this is the first line of research investigating the phenomenon of sarcasm arising out of numbers, culminating in a detector thereof.

pdf bib
Extraction of Message Sequence Charts from Narrative History Text
Girish Palshikar | Sachin Pawar | Sangameshwar Patil | Swapnil Hingmire | Nitin Ramrakhiyani | Harsimran Bedi | Pushpak Bhattacharyya | Vasudeva Varma
Proceedings of the First Workshop on Narrative Understanding

In this paper, we advocate the use of Message Sequence Chart (MSC) as a knowledge representation to capture and visualize multi-actor interactions and their temporal ordering. We propose algorithms to automatically extract an MSC from a history narrative. For a given narrative, we first identify verbs which indicate interactions and then use dependency parsing and Semantic Role Labelling based approaches to identify senders (initiating actors) and receivers (other actors involved) for these interaction verbs. As a final step in MSC extraction, we employ a state-of-the-art algorithm to temporally re-order these interactions. Our evaluation on multiple publicly available narratives shows improvements over four baselines.

pdf bib
IITP-MT System for Gujarati-English News Translation Task at WMT 2019
Sukanta Sen | Kamal Kumar Gupta | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

We describe our submission to the WMT 2019 News translation shared task for the Gujarati-English language pair. We submit constrained systems, i.e., we rely on the data provided for this language pair and do not use any external data. We train a Transformer-based subword-level neural machine translation (NMT) system using the original parallel corpus along with a synthetic parallel corpus obtained through back-translation of monolingual data. Our primary systems achieve BLEU scores of 10.4 and 8.1 for Gujarati→English and English→Gujarati, respectively. We observe that incorporating monolingual data through back-translation improves the BLEU score significantly over baseline NMT and SMT systems for this language pair.

pdf bib
Utilizing Monolingual Data in NMT for Similar Languages: Submission to Similar Language Translation Task
Jyotsana Khatri | Pushpak Bhattacharyya
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

This paper describes our submission to the Shared Task on Similar Language Translation at the Fourth Conference on Machine Translation (WMT 2019). We submitted three systems for the Hindi→Nepali direction, in which we examined the performance of an RNN-based NMT system, a semi-supervised NMT system where monolingual data of both languages is utilized using the architecture by and a system trained with extra synthetic sentences generated using copies of source and target sentences without using any additional monolingual data.

pdf bib
Parallel Corpus Filtering Based on Fuzzy String Matching
Sukanta Sen | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

In this paper, we describe IIT Patna’s submission to the WMT 2019 shared task on parallel corpus filtering. This shared task asks the participants to develop methods for scoring each parallel sentence from a given noisy parallel corpus. The quality of a scoring method is judged by the quality of SMT and NMT systems trained on a smaller set of high-quality parallel sentences sub-sampled from the original noisy corpus. The task has two language pairs, Nepali-English and Sinhala-English, and we submit for both. We define a fuzzy string matching score, based on Levenshtein distance, between the English sentence and the source sentence translated into English. Based on the scores, we sub-sample two sets (having 1 million and 5 million English tokens) of parallel sentences from each parallel corpus, and train SMT systems for development purposes only. The organizers publish the official evaluation using both SMT and NMT on the final official test set. In total, 10 teams participated in the shared task, and according to the official evaluation, our scoring method obtains 2nd position in the team ranking for the 1-million Nepali-English NMT and 5-million Sinhala-English NMT categories.
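The Levenshtein-based fuzzy matching score mentioned above could be sketched roughly as follows. This is only an illustration: the abstract does not state how the distance is normalized, so the `1 - distance / max_length` form, the character-level comparison, and the function names `levenshtein` and `fuzzy_score` are all assumptions rather than the authors' implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_score(english: str, translated_source: str) -> float:
    """Hypothetical similarity in [0, 1]; the normalization is an assumption."""
    longest = max(len(english), len(translated_source))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(english, translated_source) / longest
```

Under this assumed normalization, sentence pairs would be ranked by `fuzzy_score` and the highest-scoring pairs sub-sampled until the token budget is reached.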

pdf bib
Introduction to Sanskrit Shabdamitra: An Educational Application of Sanskrit Wordnet
Malhar Kulkarni | Nilesh Joshi | Sayali Khare | Hanumant Redkar | Pushpak Bhattacharyya
Proceedings of the 6th International Sanskrit Computational Linguistics Symposium

pdf bib
Utilizing Word Embeddings based Features for Phylogenetic Tree Generation of Sanskrit Texts
Diptesh Kanojia | Abhijeet Dubey | Malhar Kulkarni | Pushpak Bhattacharyya | Gholemreza Haffari
Proceedings of the 6th International Sanskrit Computational Linguistics Symposium

pdf bib
An Introduction to the Textual History Tool
Diptesh Kanojia | Malhar Kulkarni | Pushpak Bhattacharyya | Eivind Kahrs
Proceedings of the 6th International Sanskrit Computational Linguistics Symposium

pdf bib
Multi-task Learning for Multi-modal Emotion Recognition and Sentiment Analysis
Md Shad Akhtar | Dushyant Chauhan | Deepanway Ghosal | Soujanya Poria | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Related tasks are often inter-dependent and perform better when solved in a joint framework. In this paper, we present a deep multi-task learning framework that jointly performs both sentiment and emotion analysis. The multi-modal inputs (i.e., text, acoustic and visual frames) of a video convey diverse and distinctive information, and usually do not contribute equally to the decision making. We propose a context-level inter-modal attention framework for simultaneously predicting the sentiment and expressed emotions of an utterance. We evaluate our proposed approach on the CMU-MOSEI dataset for multi-modal sentiment and emotion analysis. Evaluation results suggest that the multi-task learning framework offers improvement over the single-task framework. The proposed approach reports new state-of-the-art performance for both sentiment analysis and emotion analysis.

pdf bib
Courteously Yours: Inducing courteous behavior in Customer Care responses using Reinforced Pointer Generator Network
Hitesh Golchha | Mauajama Firdaus | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

In this paper, we propose an effective deep learning framework for inducing courteous behavior in customer care responses. The interaction between a customer and the customer care representative contributes substantially to the overall customer experience. Thus, it is imperative for customer care agents and chatbots engaging with humans to be personal, cordial and empathetic to ensure customer satisfaction and retention. Our system aims at automatically transforming neutral customer care responses into courteous replies. Along with stylistic transfer (of courtesy), our system ensures that responses are coherent with the conversation history, and generates courteous expressions consistent with the emotional state of the customer. Our technique is based on a reinforced pointer-generator model for the sequence-to-sequence task. The model is also conditioned on a hierarchically encoded and emotionally aware conversational context. We use real interactions on Twitter between customer care professionals and aggrieved customers to create a large conversational dataset having both forms of agent responses: ‘generic’ and ‘courteous’. We perform quantitative and qualitative analyses on established and task-specific metrics, both automatic and human evaluation based. Our evaluation shows that the proposed models can generate emotionally-appropriate courteous expressions while preserving the content. Experimental results also show that our proposed approach performs better than the baseline models.

pdf bib
Addressing word-order Divergence in Multilingual Neural Machine Translation for extremely Low Resource Languages
Rudra Murthy | Anoop Kunchukuttan | Pushpak Bhattacharyya
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Transfer learning approaches for Neural Machine Translation (NMT) train an NMT model on an assisting language-target language pair (parent model), which is later fine-tuned for the source language-target language pair of interest (child model), with the target language being the same. In many cases, the assisting language has a different word order from the source language. We show that divergent word order limits the benefits from transfer learning when little to no parallel corpus between the source and target language is available. To bridge this divergence, we propose to pre-order the assisting language sentences to match the word order of the source language and train the parent model. Our experiments on many language pairs show that bridging the word order gap leads to significant improvement in the translation quality in extremely low-resource scenarios.

pdf bib
Extraction of Message Sequence Charts from Software Use-Case Descriptions
Girish Palshikar | Nitin Ramrakhiyani | Sangameshwar Patil | Sachin Pawar | Swapnil Hingmire | Vasudeva Varma | Pushpak Bhattacharyya
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers)

Software Requirement Specification documents provide natural language descriptions of the core functional requirements as a set of use-cases. Essentially, each use-case contains a set of actors and sequences of steps describing the interactions among them. Goals of use-case reviews and analyses include their correctness, completeness, detection of ambiguities, prototyping, verification, test case generation and traceability. Message Sequence Charts (MSCs) have been proposed as an expressive, rigorous yet intuitive visual representation of use-cases. In this paper, we describe a linguistic knowledge-based approach to extract MSCs from use-cases. Compared to existing techniques, we extract richer constructs of the MSC notation such as timers, conditions and alt-boxes. We apply this tool to extract MSCs from several real-life software use-case descriptions and show that it performs better than the existing techniques. We also discuss the benefits and limitations of the extracted MSCs to meet the above goals.

2018

pdf bib
Semi-automatic WordNet Linking using Word Embeddings
Kevin Patel | Diptesh Kanojia | Pushpak Bhattacharyya
Proceedings of the 9th Global Wordnet Conference

Wordnets are rich lexico-semantic resources. Linked wordnets are extensions of wordnets, which link similar concepts in wordnets of different languages. Such resources are extremely useful in many Natural Language Processing (NLP) applications, primarily those based on knowledge-based approaches. In such approaches, these resources are considered as gold standard/oracle. Thus, it is crucial that these resources hold correct information. Therefore, they are created by human experts. However, manual maintenance of such resources is a tedious and costly affair. Thus, techniques that can aid the experts are desirable. In this paper, we propose an approach to link wordnets. Given a synset of the source language, the approach returns a ranked list of potential candidate synsets in the target language from which the human expert can choose the correct one(s). Our technique is able to retrieve a winner synset in the top 10 ranked list for 60% of all synsets and 70% of noun synsets.

pdf bib
An Iterative Approach for Unsupervised Most Frequent Sense Detection using WordNet and Word Embeddings
Kevin Patel | Pushpak Bhattacharyya
Proceedings of the 9th Global Wordnet Conference

Given a word, what is the most frequent sense in which it occurs in a given corpus? Most Frequent Sense (MFS) is a strong baseline for unsupervised word sense disambiguation. If we have large amounts of sense-annotated corpora, MFS can be trivially created. However, sense-annotated corpora are a rarity. In this paper, we propose a method which can compute MFS from raw corpora. Our approach iteratively exploits the semantic congruity among related words in corpus. Our method performs better compared to another similar work.

pdf bib
Hindi Wordnet for Language Teaching: Experiences and Lessons Learnt
Hanumant Redkar | Rajita Shukla | Sandhya Singh | Jaya Saraswati | Laxmi Kashyap | Diptesh Kanojia | Preethi Jyothi | Malhar Kulkarni | Pushpak Bhattacharyya
Proceedings of the 9th Global Wordnet Conference

This paper reports the work related to making Hindi Wordnet available as a digital resource for language learning and teaching, and the experiences and lessons that were learnt during the process. The language data of the Hindi Wordnet has been suitably modified and enhanced to make it into a language learning aid. This aid is based on modern pedagogical axioms and is aligned to the learning objectives of the syllabi of school education in India. To make it into a comprehensive language tool, grammatical information has also been encoded, as far as it can be marked on the lexical items. The delivery of information is multi-layered, multi-sensory and is available across multiple digital platforms. The front end has been designed to offer an eye-catching user-friendly interface which is suitable for learners starting from age six onward. Preliminary testing of the tool has been done and it has been modified as per the feedback that was received. Above all, the entire exercise has offered gainful insights into learning based on associative networks and how knowledge based on such networks can be made available to modern learners.

pdf bib
pyiwn: A Python based API to access Indian Language WordNets
Ritesh Panjwani | Diptesh Kanojia | Pushpak Bhattacharyya
Proceedings of the 9th Global Wordnet Conference

Indian language WordNets have their individual web-based browsing interfaces along with a common interface for IndoWordNet. These interfaces prove to be useful for language learners and in the educational domain; however, they do not provide the functionality of connecting to them and browsing their data through a lucid application programming interface (API). In this paper, we present our work on creating such an easy-to-use framework, which is bundled with the data for Indian language WordNets and provides core functionalities in Python similar to those of the NLTK WordNet interface. Additionally, we use a pre-built speech synthesis system for the Hindi language and augment the Hindi data with audio for words, glosses, and example sentences. We provide a detailed usage of our API and explain the functions for ease of the user. Also, we package the IndoWordNet data along with the source code and provide it openly for the purpose of research. We aim to provide all our work as an open source framework for further development.

pdf bib
Synthesizing Audio for Hindi WordNet
Diptesh Kanojia | Preethi Jyothi | Pushpak Bhattacharyya
Proceedings of the 9th Global Wordnet Conference

In this paper, we describe our work on the creation of a voice model using a speech synthesis system for the Hindi Language. We use pre-existing “voices” and publicly available speech corpora to create a “voice” using the Festival Speech Synthesis System (Black, 1997). Our contribution is two-fold: (1) We scrutinize multiple speech synthesis systems and provide an extensive report on the currently available state-of-the-art systems. We also develop voices using the existing implementations of the aforementioned systems, and (2) We use these voices to generate sample audios for randomly chosen words, manually evaluate the audio generated, and produce audio for all WordNet words using the winner voice model. We also produce audios for the Hindi WordNet Glosses and Example sentences. We describe our efforts to use pre-existing implementations for WaveNet, a model to generate raw audio using neural nets (Oord et al., 2016), and generate speech for Hindi. Our lexicographers perform a manual evaluation of the audio generated using multiple voices. A qualitative and quantitative analysis reveals that the voice model generated by us performs the best with an accuracy of 0.44.

pdf bib
Contextual Inter-modal Attention for Multi-modal Sentiment Analysis
Deepanway Ghosal | Md Shad Akhtar | Dushyant Chauhan | Soujanya Poria | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Multi-modal sentiment analysis offers various challenges, one being the effective combination of different input modalities, namely text, visual and acoustic. In this paper, we propose a recurrent neural network based multi-modal attention framework that leverages the contextual information for utterance-level sentiment prediction. The proposed approach applies attention on multi-modal multi-utterance representations and tries to learn the contributing features amongst them. We evaluate our proposed approach on two multi-modal sentiment analysis benchmark datasets, viz. the CMU Multi-modal Opinion-level Sentiment Intensity (CMU-MOSI) corpus and the recently released CMU Multi-modal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) corpus. Evaluation results show the effectiveness of our proposed approach, with accuracies of 82.31% and 79.80% for the MOSI and MOSEI datasets, respectively. These are improvements of approximately 2 and 1 points over the state-of-the-art models for the respective datasets.

pdf bib
Identifying Transferable Information Across Domains for Cross-domain Sentiment Classification
Raksha Sharma | Pushpak Bhattacharyya | Sandipan Dandapat | Himanshu Sharad Bhatt
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Getting manually labeled data in each domain is always an expensive and time-consuming task. Cross-domain sentiment analysis has emerged as a demanding concept where a labeled source domain facilitates a sentiment classifier for an unlabeled target domain. However, polarity orientation (positive or negative) and the significance of a word to express an opinion often differ from one domain to another. Owing to these differences, cross-domain sentiment classification is still a challenging task. In this paper, we propose that words that do not change their polarity and significance represent the transferable (usable) information across domains for cross-domain sentiment classification. We present a novel approach based on the χ² test and cosine similarity between context vectors of words to identify polarity-preserving significant words across domains. Furthermore, we show that a weighted ensemble of the classifiers enhances the cross-domain classification performance.
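The two ingredients named in the abstract, a χ² test for word significance and cosine similarity between context vectors, could look roughly like the following. This is a minimal sketch under stated assumptions: the abstract does not give the exact test formulation, so the two-class goodness-of-fit form used here (observed positive and negative counts against an even split), and the function names `chi_square` and `cosine`, are hypothetical choices rather than the authors' method.

```python
import math


def chi_square(pos_count: int, neg_count: int) -> float:
    """Assumed two-class chi-square statistic against an even split of counts."""
    expected = (pos_count + neg_count) / 2
    if expected == 0:
        return 0.0  # an unseen word carries no evidence either way
    return ((pos_count - expected) ** 2 + (neg_count - expected) ** 2) / expected


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length context vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

In such a setup, a word with a high `chi_square` value in the labeled source domain would count as significant, and its polarity in the unlabeled target domain could be checked by comparing its context vector with those of known polar words.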

pdf bib
Eyes are the Windows to the Soul: Predicting the Rating of Text Quality Using Gaze Behaviour
Sandeep Mathias | Diptesh Kanojia | Kevin Patel | Samarth Agrawal | Abhijit Mishra | Pushpak Bhattacharyya
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Predicting a reader’s rating of text quality is a challenging task that involves estimating different subjective aspects of the text, like structure, clarity, etc. Such subjective aspects are better handled using cognitive information. One such source of cognitive information is gaze behaviour. In this paper, we show that gaze behaviour does indeed help in effectively predicting the rating of text quality. To do this, we first model text quality as a function of three properties: organization, coherence and cohesion. Then, we demonstrate how capturing gaze behaviour helps in predicting each of these properties, and hence the overall quality, by reporting improvements obtained by adding gaze features to traditional textual features for score prediction. We also hypothesize that if a reader has fully understood the text, the corresponding gaze behaviour would give a better indication of the assigned rating, as opposed to partial understanding. Our experiments validate this hypothesis by showing greater agreement between the given rating and the predicted rating when the reader has a full understanding of the text.

pdf bib
Identification of Alias Links among Participants in Narratives
Sangameshwar Patil | Sachin Pawar | Swapnil Hingmire | Girish Palshikar | Vasudeva Varma | Pushpak Bhattacharyya
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Identification of distinct and independent participants (entities of interest) in a narrative is an important task for many NLP applications. This task becomes challenging because these participants are often referred to using multiple aliases. In this paper, we propose an approach based on linguistic knowledge for identification of aliases mentioned using proper nouns, pronouns or noun phrases with common noun headword. We use Markov Logic Network (MLN) to encode the linguistic knowledge for identification of aliases. We evaluate on four diverse history narratives of varying complexity. Our approach performs better than the state-of-the-art approach as well as a combination of standard named entity recognition and coreference resolution techniques.

pdf bib
Judicious Selection of Training Data in Assisting Language for Multilingual Neural NER
Rudra Murthy | Anoop Kunchukuttan | Pushpak Bhattacharyya
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Multilingual learning for Neural Named Entity Recognition (NNER) involves jointly training a neural network for multiple languages. Typically, the goal is improving the NER performance of one of the languages (the primary language) using the other assisting languages. We show that the divergence in the tag distributions of the common named entities between the primary and assisting languages can reduce the effectiveness of multilingual learning. To alleviate this problem, we propose a metric based on symmetric KL divergence to filter out the highly divergent training instances in the assisting language. We empirically show that our data selection strategy improves NER performance in many languages, including those with very limited training data.
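The symmetric KL divergence underlying the proposed filtering metric can be illustrated with a small sketch. The abstract does not say how tag distributions are estimated, smoothed, or thresholded, so the dictionary representation, the `eps` smoothing constant, and the function names below are assumptions made only for illustration.

```python
import math


def kl_divergence(p: dict[str, float], q: dict[str, float], eps: float = 1e-9) -> float:
    """KL(p || q) over the union of tags, with a small assumed smoothing term."""
    total = 0.0
    for tag in set(p) | set(q):
        p_tag = p.get(tag, 0.0) + eps
        q_tag = q.get(tag, 0.0) + eps
        total += p_tag * math.log(p_tag / q_tag)
    return total


def symmetric_kl(p: dict[str, float], q: dict[str, float]) -> float:
    """Symmetric KL divergence: KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Hypothetical tag distributions of one entity in the two languages.
primary = {"PER": 0.9, "LOC": 0.1}
assisting = {"PER": 0.2, "LOC": 0.8}
divergence = symmetric_kl(primary, assisting)
```

An assisting-language training instance whose entities show a high divergence from the primary language would then be a candidate for filtering; where the cut-off lies is left open here.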

pdf bib
Uncovering Code-Mixed Challenges: A Framework for Linguistically Driven Question Generation and Neural Based Question Answering
Deepak Gupta | Pabitra Lenka | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 22nd Conference on Computational Natural Language Learning

Existing research on question answering (QA) and reading comprehension (RC) is mainly focused on resource-rich languages like English. In recent times, the rapid growth of multi-lingual web content has posed several challenges to the existing QA systems. Code-mixing is one such challenge that makes the task more complex. In this paper, we propose a linguistically motivated technique for code-mixed question generation (CMQG) and a neural network based architecture for code-mixed question answering (CMQA). For evaluation, we manually create code-mixed questions for the Hindi-English language pair. In order to show the effectiveness of our neural network based CMQA technique, we utilize two benchmark datasets, SQuAD and MMQA. Experiments show that our proposed model achieves encouraging performance on CMQG and CMQA.

pdf bib
Sentence Level Temporality Detection using an Implicit Time-sensed Resource
Sabyasachi Kamila | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
ASAP++: Enriching the ASAP Automated Essay Grading Dataset with Essay Attribute Scores
Sandeep Mathias | Pushpak Bhattacharyya
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
A Deep Neural Network based Approach for Entity Extraction in Code-Mixed Indian Social Media Text
Deepak Gupta | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Morphology Injection for English-Malayalam Statistical Machine Translation
Sreelekha S | Pushpak Bhattacharyya
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Sarcasm Target Identification: Dataset and An Introductory Approach
Aditya Joshi | Pranav Goel | Pushpak Bhattacharyya | Mark Carman
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
MMQA: A Multi-domain Multi-lingual Question-Answering Framework for English and Hindi
Deepak Gupta | Surabhi Kumari | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Medical Sentiment Analysis using Social Media: Towards building a Patient Assisted System
Shweta Yadav | Asif Ekbal | Sriparna Saha | Pushpak Bhattacharyya
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Towards a Standardized Dataset for Noun Compound Interpretation
Girishkumar Ponkiya | Kevin Patel | Pushpak Bhattacharyya | Girish K Palshikar
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
The IIT Bombay English-Hindi Parallel Corpus
Anoop Kunchukuttan | Pratik Mehta | Pushpak Bhattacharyya
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
TAP-DLND 1.0 : A Corpus for Document Level Novelty Detection
Tirthankar Ghosal | Amitra Salam | Swati Tiwari | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Indian Language Wordnets and their Linkages with Princeton WordNet
Diptesh Kanojia | Kevin Patel | Pushpak Bhattacharyya
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Leveraging Orthographic Similarity for Multilingual Neural Transliteration
Anoop Kunchukuttan | Mitesh Khapra | Gurneet Singh | Pushpak Bhattacharyya
Transactions of the Association for Computational Linguistics, Volume 6

We address the task of joint training of transliteration models for multiple language pairs (multilingual transliteration). This is an instance of multitask learning, where individual tasks (language pairs) benefit from sharing knowledge with related tasks. We focus on transliteration involving related tasks, i.e., languages sharing writing systems and phonetic properties (orthographically similar languages). We propose a modified neural encoder-decoder model that maximizes parameter sharing across language pairs in order to effectively leverage orthographic similarity. We show that multilingual transliteration significantly outperforms bilingual transliteration in different scenarios (average increase of 58% across a variety of languages we experimented with). We also show that multilingual transliteration models can generalize well to languages/language pairs not encountered during training and hence perform well on the zero-shot transliteration task. We show that further improvements can be achieved by using phonetic feature input.

pdf bib
The Whole is Greater than the Sum of its Parts: Towards the Effectiveness of Voting Ensemble Classifiers for Complex Word Identification
Nikhil Wani | Sandeep Mathias | Jayashree Aanand Gajjam | Pushpak Bhattacharyya
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

In this paper, we present an effective system using voting ensemble classifiers to detect contextually complex words for non-native English speakers. To make the final decision, we channel a set of eight calibrated classifiers based on lexical, size and vocabulary features and train our model with annotated datasets collected from a mixture of native and non-native speakers. Thereafter, we test our system on three datasets, namely News, WikiNews, and Wikipedia, and report competitive results with an F1-Score ranging from 0.777 to 0.855 for each of the datasets. Our system outperforms multiple other models and falls within 0.042 to 0.026 percent of the best-performing model’s score in the shared task.

pdf bib
Meaningless yet meaningful: Morphology grounded subword-level NMT
Tamali Banerjee | Pushpak Bhattacharyya
Proceedings of the Second Workshop on Subword/Character LEvel Models

We explore the use of two independent subsystems, Byte Pair Encoding (BPE) and Morfessor, as basic units for subword-level neural machine translation (NMT). We show that, for linguistically distant language pairs, the Morfessor-based segmentation algorithm produces significantly better quality translation than BPE. However, for close language pairs, BPE-based subword-NMT may translate better than Morfessor-based subword-NMT. We propose a combined approach of these two segmentation algorithms, Morfessor-BPE (M-BPE), which outperforms these two baseline systems in terms of BLEU score. Our results are supported by experiments on three language pairs: English-Hindi, Bengali-Hindi and English-Bengali.

pdf bib
Thank “Goodness”! A Way to Measure Style in Student Essays
Sandeep Mathias | Pushpak Bhattacharyya
Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications

Essays have two major components for scoring - content and style. In this paper, we describe a property of the essay, called goodness, and use it to predict the score given for the style of student essays. We compare our approach to solve this problem with baseline approaches, like language modeling and also a state-of-the-art deep learning system. We show that, despite being quite intuitive, our approach is very powerful in predicting the style of the essays.

pdf bib
Can Taxonomy Help? Improving Semantic Question Matching using Question Taxonomy
Deepak Gupta | Rajkumar Pujari | Asif Ekbal | Pushpak Bhattacharyya | Anutosh Maitra | Tom Jain | Shubhashis Sengupta
Proceedings of the 27th International Conference on Computational Linguistics

In this paper, we propose a hybrid technique for semantic question matching. It uses a proposed two-layered taxonomy for English questions by augmenting state-of-the-art deep learning models with question classes obtained from a deep learning based question classifier. Experiments performed on three open-domain datasets demonstrate the effectiveness of our proposed approach. We achieve state-of-the-art results on partial ordering question ranking (POQR) benchmark dataset. Our empirical analysis shows that coupling standard distributional features (provided by the question encoder) with knowledge from taxonomy is more effective than either deep learning or taxonomy-based knowledge alone.

pdf bib
Treat us like the sequences we are: Prepositional Paraphrasing of Noun Compounds using LSTM
Girishkumar Ponkiya | Kevin Patel | Pushpak Bhattacharyya | Girish Palshikar
Proceedings of the 27th International Conference on Computational Linguistics

Interpreting noun compounds is a challenging task. It involves uncovering the underlying predicate which is dropped in the formation of the compound. In most cases, this predicate is of the form VERB+PREP. It has been observed that uncovering the preposition is a significant step towards uncovering the predicate. In this paper, we attempt to paraphrase noun compounds using prepositions. We consider noun compounds and their corresponding prepositional paraphrases as parallelly aligned sequences of words. This enables us to adapt different architectures from cross-lingual embedding literature. We choose the architecture where we create representations of both noun compound (source sequence) and its corresponding prepositional paraphrase (target sequence), such that their similarity is high. We use LSTMs to learn these representations. We use these representations to decide the correct preposition. Our experiments show that this approach performs considerably well on different datasets of noun compounds that are manually annotated with prepositions.

pdf bib
Novelty Goes Deep. A Deep Neural Solution To Document Level Novelty Detection
Tirthankar Ghosal | Vignesh Edithal | Asif Ekbal | Pushpak Bhattacharyya | George Tsatsaronis | Srinivasa Satya Sameer Kumar Chivukula
Proceedings of the 27th International Conference on Computational Linguistics

The rapid growth of documents across the web has necessitated finding means of discarding redundant documents and retaining novel ones. Capturing redundancy is challenging as it may involve investigating at a deep semantic level. Techniques for detecting such semantic redundancy at the document level are scarce. In this work, we propose a deep Convolutional Neural Network (CNN) based model to classify a document as novel or redundant with respect to a set of relevant documents already seen by the system. The system is simple and does not require any manual feature engineering. Our novel scheme encodes relevant and relative information from both source and target texts to generate an intermediate representation which we coin as the Relative Document Vector (RDV). The proposed method outperforms the existing state-of-the-art on a document-level novelty detection dataset by a margin of ∼5% in terms of accuracy. We further demonstrate the effectiveness of our approach on a standard paraphrase detection dataset where paraphrased passages closely resemble semantically redundant documents.

pdf bib
IITP-MT at WAT2018: Transformer-based Multilingual Indic-English Neural Machine Translation System
Sukanta Sen | Kamal Kumar Gupta | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation: 5th Workshop on Asian Translation

pdf bib
Multilingual Indian Language Translation System at WAT 2018: Many-to-one Phrase-based SMT
Tamali Banerjee | Anoop Kunchukuttan | Pushpak Bhattacharya
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation: 5th Workshop on Asian Translation

pdf bib
Solving Data Sparsity for Aspect Based Sentiment Analysis Using Cross-Linguality and Multi-Linguality
Md Shad Akhtar | Palaash Sawant | Sukanta Sen | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Efficient word representations play an important role in solving various problems related to Natural Language Processing (NLP), data mining, text mining etc. The issue of data sparsity poses a great challenge in creating an efficient word representation model for solving the underlying problem. The problem intensifies in resource-poor scenarios due to the absence of a sufficiently large corpus. In this work we propose to minimize the effect of data sparsity by leveraging bilingual word embeddings learned through a parallel corpus. We train and evaluate a Long Short Term Memory (LSTM) based architecture for aspect level sentiment classification. The neural network architecture is further assisted by hand-crafted features for the prediction. We show the efficacy of the proposed model against state-of-the-art methods in two experimental setups, i.e. multi-lingual and cross-lingual.

pdf bib
Fine-Grained Temporal Orientation and its Relationship with Psycho-Demographic Correlates
Sabyasachi Kamila | Mohammed Hasanuzzaman | Asif Ekbal | Pushpak Bhattacharyya | Andy Way
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Temporal orientation refers to an individual’s tendency to connect to the psychological concepts of past, present or future, and it affects personality, motivation, emotion, decision making and stress coping processes. The study of social media users’ psycho-demographic attributes from the perspective of human temporal orientation can be of great interest and importance to business and administrative decision makers, as it can provide valuable additional information for making informed decisions. In this paper, we propose a first study to demonstrate the association between the sentiment view of the temporal orientation of users and their different psycho-demographic attributes by analyzing their tweets. We first create a temporal orientation classifier in a minimally supervised way, which classifies each tweet of a user into one of three temporal categories, namely past, present, and future. A deep Bi-directional Long Short Term Memory (BLSTM) is used for the tweet classification task. Our tweet classifier achieves an accuracy of 78.27% when tested on a manually created test set. We then determine the users’ overall temporal orientation based on their tweets on social media. Sentiment is added to the tweets at a fine-grained level, where each temporal tweet is assigned a positive, negative or neutral sentiment. Our experiment reveals that a user’s attributes vary depending upon the sentiment view of temporal orientation. We finally measure the correlation between the users’ sentiment view of temporal orientation and their different psycho-demographic factors using regression.

pdf bib
Multi-Task Learning Framework for Mining Crowd Intelligence towards Clinical Treatment
Shweta Yadav | Asif Ekbal | Sriparna Saha | Pushpak Bhattacharyya | Amit Sheth
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

In the recent past, social media has emerged as an active platform in the context of healthcare and medicine. In this paper, we present a study where medical users’ opinions on health-related issues are analyzed to capture the medical sentiment at a blog level. The medical sentiments can be studied in various facets such as medical condition, treatment, and medication that characterize the overall health status of the user. Considering these facets, we treat analysis of this information as a multi-task classification problem. In this paper, we adopt a novel adversarial learning approach for our multi-task learning framework to learn the strength of the sentiment expressed in a medical blog. Our evaluation shows promising results for our target tasks.

2017

pdf bib
Learning Cognitive Features from Gaze Data for Sentiment and Sarcasm Classification using Convolutional Neural Network
Abhijit Mishra | Kuntal Dey | Pushpak Bhattacharyya
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Cognitive NLP systems, i.e., NLP systems that make use of behavioral data, augment traditional text-based features with cognitive features extracted from eye-movement patterns, EEG signals, brain-imaging etc. Such extraction of features is typically manual. We contend that manual extraction of features may not be the best way to tackle text subtleties that characteristically prevail in complex classification tasks like Sentiment Analysis and Sarcasm Detection, and that even the extraction and choice of features should be delegated to the learning system. We introduce a framework to automatically extract cognitive features from the eye-movement/gaze data of human readers reading the text and use them as features along with textual features for the tasks of sentiment polarity and sarcasm detection. Our proposed framework is based on a Convolutional Neural Network (CNN). The CNN learns features from both gaze and text and uses them to classify the input text. We test our technique on published sentiment and sarcasm labeled datasets, enriched with gaze information, to show that using a combination of automatically learned text and gaze features often yields better classification performance than (i) CNN based systems that rely on text input alone and (ii) existing systems that rely on handcrafted gaze and textual features.

pdf bib
Towards Lower Bounds on Number of Dimensions for Word Embeddings
Kevin Patel | Pushpak Bhattacharyya
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Word embeddings are a relatively new addition to the modern NLP researcher’s toolkit. However, unlike other tools, word embeddings are used in a black box manner. There are very few studies regarding various hyperparameters. One such hyperparameter is the dimension of word embeddings. It is usually decided based on a rule of thumb: in the range 50 to 300. In this paper, we show that the dimension should instead be chosen based on corpus statistics. More specifically, we show that the number of pairwise equidistant words of the corpus vocabulary (as defined by some distance/similarity metric) gives a lower bound on the number of dimensions, and going below this bound results in degradation of quality of learned word embeddings. Through our evaluations on standard word embedding evaluation tasks, we show that for dimensions higher than or equal to the bound, we get better results as compared to the ones below it.

pdf bib
Utilizing Lexical Similarity between Related, Low-resource Languages for Pivot-based SMT
Anoop Kunchukuttan | Maulik Shah | Pradyot Prakash | Pushpak Bhattacharyya
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

We investigate pivot-based translation between related languages in a low resource, phrase-based SMT setting. We show that a subword-level pivot-based SMT model using a related pivot language is substantially better than word and morpheme-level pivot models. It is also highly competitive with the best direct translation model, which is encouraging as no direct source-target training corpus is used. We also show that combining multiple related language pivot models can rival a direct translation model. Thus, the use of subwords as translation units coupled with multiple related pivot languages can compensate for the lack of a direct parallel corpus.

pdf bib
IITP at IJCNLP-2017 Task 4: Auto Analysis of Customer Feedback using CNN and GRU Network
Deepak Gupta | Pabitra Lenka | Harsimran Bedi | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the IJCNLP 2017, Shared Tasks

Analyzing customer feedback is the best way to channel the data into new marketing strategies that benefit entrepreneurs as well as customers. Therefore an automated system which can analyze customer behavior is in great demand. Users may write feedback in any language, and hence mining appropriate information often becomes intractable. Especially in a traditional feature-based supervised model, it is difficult to build a generic system as one has to understand the concerned language for finding the relevant features. In order to overcome this, we propose deep Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) based approaches that do not require handcrafting of features. We evaluate these techniques for analyzing customer feedback sentences in four languages, namely English, French, Japanese and Spanish. Our empirical analysis shows that our models perform well in all four languages on the setups of the IJCNLP Shared Task on Customer Feedback Analysis. Our model achieved the second rank in French, with an accuracy of 71.75%, and the third rank for all the other languages.

pdf bib
A Multilayer Perceptron based Ensemble Technique for Fine-grained Financial Sentiment Analysis
Md Shad Akhtar | Abhishek Kumar | Deepanway Ghosal | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

In this paper, we propose a novel method for combining deep learning and classical feature based models using a Multi-Layer Perceptron (MLP) network for financial sentiment analysis. We develop various deep learning models based on Convolutional Neural Network (CNN), Long Short Term Memory (LSTM) and Gated Recurrent Unit (GRU). These are trained on top of pre-trained, autoencoder-based, financial word embeddings and lexicon features. An ensemble is constructed by combining these deep learning models and a classical supervised model based on Support Vector Regression (SVR). We evaluate our proposed technique on a benchmark dataset of the SemEval-2017 shared task on financial sentiment analysis. The proposed model shows impressive results on two datasets, i.e. microblogs and news headlines datasets. Comparisons show that our proposed model performs better than the existing state-of-the-art systems for the above two datasets by 2.0 and 4.1 cosine points, respectively.

pdf bib
Sentiment Intensity Ranking among Adjectives Using Sentiment Bearing Word Embeddings
Raksha Sharma | Arpan Somani | Lakshya Kumar | Pushpak Bhattacharyya
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Identification of intensity ordering among polar (positive or negative) words which have the same semantics can lead to fine-grained sentiment analysis. For example, ‘master’, ‘seasoned’ and ‘familiar’ point to different intensity levels, though they all convey the same meaning (semantics), i.e., expertise: having a good knowledge of. In this paper, we propose a semi-supervised technique that uses sentiment bearing word embeddings to produce a continuous ranking among adjectives that share common semantics. Our system demonstrates a strong Spearman’s rank correlation of 0.83 with the gold standard ranking. We show that sentiment bearing word embeddings facilitate a more accurate intensity ranking system than other standard word embeddings (word2vec and GloVe). Word2vec is the state-of-the-art for the intensity ordering task.
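
As an illustration of the evaluation measure named in this abstract, the following is a minimal sketch of Spearman’s rank correlation between a gold intensity ordering and a predicted one. The function name `spearman_rho` and the example scores are hypothetical, and the closed-form formula used here assumes there are no tied scores; it is not the paper’s evaluation script.

```python
def spearman_rho(gold, predicted):
    """Spearman's rank correlation for two equal-length score lists.

    Assumption: no ties within either list, so the closed form
    1 - 6 * sum(d^2) / (n * (n^2 - 1)) applies.
    """
    if len(gold) != len(predicted) or len(gold) < 2:
        raise ValueError("need two lists of equal length >= 2")

    def ranks(values):
        # Rank 1 goes to the smallest score, rank n to the largest.
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, index in enumerate(order, start=1):
            result[index] = rank
        return result

    gold_ranks = ranks(gold)
    predicted_ranks = ranks(predicted)
    n = len(gold)
    squared_diffs = sum((g - p) ** 2 for g, p in zip(gold_ranks, predicted_ranks))
    return 1 - 6 * squared_diffs / (n * (n ** 2 - 1))


# Hypothetical intensity scores for 'familiar', 'seasoned', 'master'.
gold_intensity = [1, 2, 3]
system_scores = [0.2, 0.5, 0.9]
rho = spearman_rho(gold_intensity, system_scores)
```

A value close to 1 means the system orders the adjectives in nearly the same way as the gold standard; a value close to -1 means the order is nearly reversed.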

pdf bib
Computational Sarcasm
Pushpak Bhattacharyya | Aditya Joshi
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts

Sarcasm is a form of verbal irony that is intended to express contempt or ridicule. Motivated by challenges posed by sarcastic text to sentiment analysis, computational approaches to sarcasm have witnessed a growing interest at NLP forums in the past decade. Computational sarcasm refers to automatic approaches pertaining to sarcasm. The tutorial will provide a bird’s-eye view of the research in computational sarcasm for text, while focusing on significant milestones. The tutorial begins with linguistic theories of sarcasm, with a focus on incongruity: a useful notion that underlies sarcasm and other forms of figurative language. Since the most significant work in computational sarcasm is sarcasm detection, i.e., predicting whether a given piece of text is sarcastic or not, sarcasm detection forms the focus hereafter. We begin our discussion on sarcasm detection with datasets, touching on strategies, challenges and the nature of datasets. Then, we describe algorithms for sarcasm detection: rule-based (where a specific evidence of sarcasm is utilised as a rule), statistical classifier-based (where features are designed for a statistical classifier), a topic model-based technique, and deep learning-based algorithms. For each of these algorithms, we refer to our work on sarcasm detection and share our learnings. Since information beyond the text to be classified, i.e., contextual information, is useful for sarcasm detection, we then describe approaches that use such information through conversational context or author-specific context. We then follow it with novel areas in computational sarcasm such as sarcasm generation, sarcasm v/s irony classification, etc. We then summarise the tutorial and describe future directions based on errors reported in past work.
The tutorial will end with a demonstration of our work on sarcasm detection. This tutorial will be of interest to researchers investigating computational sarcasm and related areas such as computational humour, figurative language understanding, emotion and sentiment analysis, etc. The tutorial is motivated by our continually evolving survey paper on sarcasm detection, which is available on arXiv: Joshi, Aditya, Pushpak Bhattacharyya, and Mark James Carman. “Automatic Sarcasm Detection: A Survey.” arXiv preprint arXiv:1602.03426 (2016).

pdf bib
Adapting Pre-trained Word Embeddings For Use In Medical Coding
Kevin Patel | Divya Patel | Mansi Golakiya | Pushpak Bhattacharyya | Nilesh Birari
BioNLP 2017

Word embeddings are a crucial component in modern NLP. Pre-trained embeddings released by different groups have been a major reason for their popularity. However, they are trained on generic corpora, which limits their direct use for domain specific tasks. In this paper, we propose a method to add task specific information to pre-trained word embeddings. Such information can improve their utility. We add information from medical coding data, as well as the first level from the hierarchy of the ICD-10 medical code set, to different pre-trained word embeddings. We adapt the CBOW algorithm from the word2vec package for our purpose. We evaluated our approach on five different pre-trained word embeddings. Both the original word embeddings and their modified versions (the ones with added information) were used for automated review of medical coding. The modified word embeddings give an improvement in F-score of 1% on the 5-fold evaluation on a private medical claims dataset. Our results show that adding extra information is possible and beneficial for the task at hand.

pdf bib
Towards Harnessing Memory Networks for Coreference Resolution
Joe Cheri | Pushpak Bhattacharyya
Proceedings of the 2nd Workshop on Representation Learning for NLP

The coreference resolution task demands comprehending a discourse, especially for anaphoric mentions which require semantic information for resolving antecedents. We investigate how memory networks can be helpful for coreference resolution when posed as a question answering problem. The comprehension capability of memory networks assists coreference resolution, particularly for the mentions that require semantic and context information. We experiment with memory networks for coreference resolution, using 4 synthetic datasets generated for coreference resolution with varying difficulty levels. Our system’s performance is compared with a traditional coreference resolution system to show why memory networks can be promising for coreference resolution.

pdf bib
Learning variable length units for SMT between related languages via Byte Pair Encoding
Anoop Kunchukuttan | Pushpak Bhattacharyya
Proceedings of the First Workshop on Subword and Character Level Models in NLP

We explore the use of segments learnt using Byte Pair Encoding (referred to as BPE units) as basic units for statistical machine translation between related languages and compare it with orthographic syllables, which are currently the best performing basic units for this translation task. BPE identifies the most frequent character sequences as basic units, while orthographic syllables are linguistically motivated pseudo-syllables. We show that BPE units modestly outperform orthographic syllables as units of translation, showing up to 11% increase in BLEU score. While orthographic syllables can be used only for languages whose writing systems use vowel representations, BPE is writing system independent and we show that BPE outperforms other units for non-vowel writing systems too. Our results are supported by extensive experimentation spanning multiple language families and writing systems.
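
To make the notion of BPE units concrete, here is a minimal sketch of how merge operations can be learnt from a word-frequency table: the most frequent adjacent symbol pair is merged repeatedly. The function name `learn_bpe`, the toy vocabulary, and the tie-breaking rule are assumptions for illustration; this is not the authors’ pipeline and it omits details such as end-of-word markers.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges BPE merge operations from {word: frequency}.

    Returns (merges, vocab), where vocab maps each word's current
    segmentation (a tuple of symbols) to its frequency.
    """
    # Start from character-level segmentations.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties broken lexicographically (an assumption).
        best = max(pairs, key=lambda pair: (pairs[pair], pair))
        merges.append(best)
        # Rewrite every word, replacing occurrences of the best pair.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab


# Toy vocabulary (hypothetical): 'abab' seen 3 times, 'abc' seen twice.
merges, segmented = learn_bpe({"abab": 3, "abc": 2}, num_merges=2)
```

Because the procedure only counts character sequences, it needs no knowledge of vowels or syllable structure, which is the writing-system independence the abstract refers to.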

pdf bib
IITP at EmoInt-2017: Measuring Intensity of Emotions using Sentence Embeddings and Optimized Features
Md Shad Akhtar | Palaash Sawant | Asif Ekbal | Jyoti Pawar | Pushpak Bhattacharyya
Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

This paper describes the system that we submitted as part of our participation in the shared task on Emotion Intensity (EmoInt-2017). We propose a Long Short Term Memory (LSTM) based architecture cascaded with a Support Vector Regressor (SVR) for intensity prediction. We also employ a Particle Swarm Optimization (PSO) based feature selection algorithm for obtaining an optimized feature set for training and evaluation. System evaluation shows interesting results on the four emotion datasets, i.e. anger, fear, joy and sadness. In comparison to the other participating teams, our system was ranked 5th in the competition.

pdf bib
Comparing Recurrent and Convolutional Architectures for English-Hindi Neural Machine Translation
Sandhya Singh | Ritesh Panjwani | Anoop Kunchukuttan | Pushpak Bhattacharyya
Proceedings of the 4th Workshop on Asian Translation (WAT2017)

In this paper, we empirically compare two encoder-decoder neural machine translation architectures, the convolutional sequence to sequence model (ConvS2S) and the recurrent sequence to sequence model (RNNS2S), for the English-Hindi language pair as part of IIT Bombay’s submission to the WAT2017 shared task. We report the results for both the English-Hindi and Hindi-English directions.

pdf bib
Hindi Shabdamitra: A Wordnet based E-Learning Tool for Language Learning and Teaching
Hanumant Redkar | Sandhya Singh | Meenakshi Somasundaram | Dhara Gorasia | Malhar Kulkarni | Pushpak Bhattacharyya
Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017)

In today’s technology driven digital era, the education domain is undergoing a transformation from traditional approaches to more learner controlled and flexible methods of learning. This transformation has opened new avenues for interdisciplinary research in the field of educational technology and natural language processing in developing quality digital aids for learning and teaching. The tool presented here, Hindi Shabdamitra, developed using Hindi Wordnet for Hindi language learning, is one such e-learning tool. It has been developed as a teaching and learning aid suitable for a formal school based curriculum and an informal setup for self learning users. Besides vocabulary, it also provides word based grammar along with images and pronunciation for better learning and retention. This aid demonstrates how a rich lexical resource like wordnet can be systematically remodeled for practical usage in the educational domain.

pdf bib
Document Level Novelty Detection: Textual Entailment Lends a Helping Hand
Tanik Saikh | Tirthankar Ghosal | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

pdf bib
Is your Statement Purposeless? Predicting Computer Science Graduation Admission Acceptance based on Statement Of Purpose
Diptesh Kanojia | Nikhil Wani | Pushpak Bhattacharyya
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

pdf bib
Hindi Shabdamitra: A Wordnet based E-Learning Tool for Language Learning and Teaching
Hanumant Redkar | Sandhya Singh | Dhara Gorasia | Meenakshi Somasundaram | Malhar Kulkarni | Pushpak Bhattacharyya
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

pdf bib
End-to-end Relation Extraction using Neural Networks and Markov Logic Networks
Sachin Pawar | Pushpak Bhattacharyya | Girish Palshikar
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

End-to-end relation extraction refers to identifying boundaries of entity mentions, entity types of these mentions and the appropriate semantic relation for each pair of mentions. Traditionally, separate predictive models were trained for each of these tasks and were used in a “pipeline” fashion where the output of one model is fed as input to another. But it was observed that addressing some of these tasks jointly results in better performance. We propose a single, joint neural network based model to carry out all three tasks of boundary identification, entity type classification and relation type classification. This model is referred to as the “All Word Pairs” model (AWP-NN) as it assigns an appropriate label to each word pair in a given sentence for performing end-to-end relation extraction. We also propose to refine the output of the AWP-NN model by using inference in Markov Logic Networks (MLN) so that additional domain knowledge can be effectively incorporated. We demonstrate the effectiveness of our approach by achieving better end-to-end relation extraction performance than all 4 previous joint modelling approaches on the standard dataset of ACE 2004.

pdf bib
Entity Extraction in Biomedical Corpora: An Approach to Evaluate Word Embedding Features with PSO based Feature Selection
Shweta Yadav | Asif Ekbal | Sriparna Saha | Pushpak Bhattacharyya
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

Text mining has drawn significant attention in the recent past due to the rapid growth in biomedical and clinical records. Entity extraction is one of the fundamental components for biomedical text mining. In this paper, we propose a novel approach of feature selection for entity extraction that exploits the concept of deep learning and Particle Swarm Optimization (PSO). The system utilizes word embedding features along with several other features extracted by studying the properties of the datasets. We obtain an interesting observation that compact word embedding features as determined by PSO are more effective compared to the entire word embedding feature set for entity extraction. The proposed system is evaluated on three benchmark biomedical datasets, namely GENIA, GENETAG, and AiMed. The effectiveness of the proposed approach is evident with significant performance gains over the baseline models as well as the other existing systems. We observe improvements of 7.86%, 5.27% and 7.25% F-measure points over the baseline models for the GENIA, GENETAG, and AiMed datasets, respectively.

pdf bib
IIT-UHH at SemEval-2017 Task 3: Exploring Multiple Features for Community Question Answering and Implicit Dialogue Identification
Titas Nandi | Chris Biemann | Seid Muhie Yimam | Deepak Gupta | Sarah Kohail | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

In this paper we present the system for Answer Selection and Ranking in Community Question Answering, which we built as part of our participation in SemEval-2017 Task 3. We develop a Support Vector Machine (SVM) based system that makes use of textual, domain-specific, word-embedding and topic-modeling features. In addition, we propose a novel method for dialogue chain identification in comment threads. Our primary submission won subtask C, outperforming other systems in all the primary evaluation metrics. We performed well in other English subtasks, ranking third in subtask A and eighth in subtask B. We also developed open source toolkits for all three English subtasks by the name cQARank [https://github.com/TitasNandi/cQARank].

pdf bib
IITP at SemEval-2017 Task 8: A Supervised Approach for Rumour Evaluation
Vikram Singh | Sunny Narayan | Md Shad Akhtar | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This paper describes our system’s participation in the SemEval-2017 Task 8 ‘RumourEval: Determining rumour veracity and support for rumours’. The objective of this task was to predict the stance and veracity of the underlying rumour. We propose a supervised classification approach employing several lexical, content and Twitter-specific features for learning. Evaluation shows promising results for both the problems.

pdf bib
IITPB at SemEval-2017 Task 5: Sentiment Prediction in Financial Text
Abhishek Kumar | Abhishek Sethi | Md Shad Akhtar | Asif Ekbal | Chris Biemann | Pushpak Bhattacharyya
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This paper reports team IITPB’s participation in the SemEval 2017 Task 5 on ‘Fine-grained sentiment analysis on financial microblogs and news’. We developed 2 systems for the two tracks. One system was based on an ensemble of Support Vector Classifier and Logistic Regression. This system relied on Distributional Thesaurus (DT), word embeddings and lexicon features to predict a floating sentiment value between -1 and +1. The other system was based on Support Vector Regression using word embeddings, lexicon features, and PMI scores as features. The system was ranked 5th in track 1 and 8th in track 2.

pdf bib
IITP at SemEval-2017 Task 5: An Ensemble of Deep Learning and Feature Based Models for Financial Sentiment Analysis
Deepanway Ghosal | Shobhit Bhatnagar | Md Shad Akhtar | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

In this paper we propose an ensemble based model which combines state of the art deep learning sentiment analysis algorithms like Convolutional Neural Network (CNN) and Long Short Term Memory (LSTM) along with feature based models to identify optimistic or pessimistic sentiments associated with companies and stocks in financial texts. We build our system to participate in a competition organized by the Semantic Evaluation 2017 International Workshop. We combined predictions from various models using an artificial neural network to determine the opinion towards an entity in (a) Microblog Messages and (b) News Headlines data. Our models achieved cosine similarity scores of 0.751 and 0.697 for the above two tracks, ranking us 2nd and 7th respectively.
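
For readers unfamiliar with the metric reported here, the sketch below computes plain cosine similarity between a vector of gold sentiment scores and a vector of predicted scores. The function name `cosine_similarity` and the example values are hypothetical, and any additional weighting applied by the shared task’s official scorer is not reproduced.

```python
import math


def cosine_similarity(gold, predicted):
    """Cosine similarity between two equal-length lists of sentiment scores."""
    if len(gold) != len(predicted):
        raise ValueError("score lists must have the same length")
    dot = sum(g * p for g, p in zip(gold, predicted))
    norm_gold = math.sqrt(sum(g * g for g in gold))
    norm_predicted = math.sqrt(sum(p * p for p in predicted))
    if norm_gold == 0 or norm_predicted == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_gold * norm_predicted)


# Hypothetical gold and predicted sentiment scores in [-1, +1].
gold_scores = [0.6, -0.4, 0.1]
system_scores = [0.5, -0.3, 0.2]
score = cosine_similarity(gold_scores, system_scores)
```

The score rewards predictions that point in the same direction as the gold scores across all instances, so a system that gets the sign and relative magnitude right scores close to 1.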

2016

pdf bib
Harnessing Sequence Labeling for Sarcasm Detection in Dialogue from TV Series ‘Friends’
Aditya Joshi | Vaibhav Tripathi | Pushpak Bhattacharyya | Mark J. Carman
Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning

pdf bib
Leveraging Cognitive Features for Sentiment Analysis
Abhijit Mishra | Diptesh Kanojia | Seema Nagar | Kuntal Dey | Pushpak Bhattacharyya
Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning

pdf bib
Substring-based unsupervised transliteration with phonetic and contextual knowledge
Anoop Kunchukuttan | Pushpak Bhattacharyya | Mitesh M. Khapra
Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning

pdf bib
Political Issue Extraction Model: A Novel Hierarchical Topic Model That Uses Tweets By Political And Non-Political Authors
Aditya Joshi | Pushpak Bhattacharyya | Mark Carman
Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

pdf bib
Leveraging Annotators’ Gaze Behaviour for Coreference Resolution
Joe Cheri | Abhijit Mishra | Pushpak Bhattacharyya
Proceedings of the 7th Workshop on Cognitive Aspects of Computational Language Learning

pdf bib
How Do Cultural Differences Impact the Quality of Sarcasm Annotation?: A Case Study of Indian Annotators and American Text
Aditya Joshi | Pushpak Bhattacharyya | Mark Carman | Jaya Saraswati | Rajita Shukla
Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

pdf bib
Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)
Dekai Wu | Pushpak Bhattacharyya
Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)

pdf bib
Deep Learning Architecture for Patient Data De-identification in Clinical Records
Shweta Yadav | Asif Ekbal | Sriparna Saha | Pushpak Bhattacharyya
Proceedings of the Clinical Natural Language Processing Workshop (ClinicalNLP)

Rapid growth in Electronic Medical Records (EMR) has led to an expansion of data in the clinical domain. The majority of the available health care information is locked in the form of narrative documents, which form a rich source of clinical information. Text mining of such clinical records has gained huge attention in various medical applications like treatment and decision making. However, medical records contain patient Private Health Information (PHI) which can reveal the identities of the patients. In order to retain the privacy of patients, it is mandatory to remove all PHI prior to making the records publicly available. The aim is to de-identify or encrypt the PHI from the patient medical records. In this paper, we propose an algorithm based on a deep learning architecture to solve this problem. We perform de-identification of seven PHI terms from the clinical records. Experiments on benchmark datasets show that our proposed approach achieves encouraging performance, which is better than the baseline model developed with a Conditional Random Field.

pdf bib
Proceedings of the 3rd Workshop on Asian Translation (WAT2016)
Toshiaki Nakazawa | Hideya Mino | Chenchen Ding | Isao Goto | Graham Neubig | Sadao Kurohashi | Ir. Hammam Riza | Pushpak Bhattacharyya
Proceedings of the 3rd Workshop on Asian Translation (WAT2016)

pdf bib
IIT Bombay’s English-Indonesian submission at WAT: Integrating Neural Language Models with SMT
Sandhya Singh | Anoop Kunchukuttan | Pushpak Bhattacharyya
Proceedings of the 3rd Workshop on Asian Translation (WAT2016)

This paper describes IIT Bombay’s submission to the WAT 2016 shared task for the English–Indonesian language pair. The results reported here are for both directions of the language pair. Among the various approaches we experimented with, the Operation Sequence Model (OSM) and the Neural Language Model have been submitted for WAT. The OSM approach integrates the translation and reordering processes, resulting in relatively improved translation. Similarly, the neural experiment integrates a Neural Language Model with Statistical Machine Translation (SMT) as a feature for translation. The Neural Probabilistic Language Model (NPLM) gave relatively high BLEU points for the Indonesian to English translation system, while the Neural Network Joint Model (NNJM) performed better for the English to Indonesian direction. The results indicate improvement over the baseline Phrase-based SMT by 0.61 BLEU points for the English-Indonesian system and 0.55 BLEU points for the Indonesian-English translation system.

pdf bib
IITP English-Hindi Machine Translation System at WAT 2016
Sukanta Sen | Debajyoty Banik | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 3rd Workshop on Asian Translation (WAT2016)

In this paper we describe the system that we developed as part of our participation in WAT 2016. We develop a system based on hierarchical phrase-based SMT for the English to Hindi language pair. We perform re-ordering and augment bilingual dictionary to improve the performance. As a baseline we use a phrase-based SMT model. The MT models are fine-tuned on the development set, and the best configurations are used to report the evaluation on the test set. Experiments show a BLEU score of 13.71 on the benchmark test data, which is better than the official baseline BLEU score of 10.79.

pdf bib
Faster Decoding for Subword Level Phrase-based SMT between Related Languages
Anoop Kunchukuttan | Pushpak Bhattacharyya
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

A common and effective way to train translation systems between related languages is to consider sub-word level basic units. However, this increases the length of the sentences resulting in increased decoding time. The increase in length is also impacted by the specific choice of data format for representing the sentences as subwords. In a phrase-based SMT framework, we investigate different choices of decoder parameters as well as data format and their impact on decoding time and translation accuracy. We suggest best options for these settings that significantly improve decoding time with little impact on the translation accuracy.

pdf bib
‘Who would have thought of that!’: A Hierarchical Topic Model for Extraction of Sarcasm-prevalent Topics and Sarcasm Detection
Aditya Joshi | Prayas Jain | Pushpak Bhattacharyya | Mark Carman
Proceedings of the Workshop on Extra-Propositional Aspects of Meaning in Computational Linguistics (ExProM)

Topic Models have been reported to be beneficial for aspect-based sentiment analysis. This paper reports the first topic model for sarcasm detection, to the best of our knowledge. Designed on the basis of the intuition that sarcastic tweets are likely to have a mixture of words of both sentiments as against tweets with literal sentiment (either positive or negative), our hierarchical topic model discovers sarcasm-prevalent topics and topic-level sentiment. Using a dataset of tweets labeled using hashtags, the model estimates topic-level and sentiment-level distributions. Our evaluation shows that topics such as ‘work’, ‘gun laws’, ‘weather’ are sarcasm-prevalent topics. Our model is also able to discover the mixture of sentiment-bearing words that exist in a text of a given sentiment-related label. Finally, we apply our model to predict sarcasm in tweets. We outperform two prior works based on statistical classifiers with specific features, by around 25%.

pdf bib
Can SMT and RBMT Improve each other’s Performance?- An Experiment with English-Hindi Translation
Debajyoty Banik | Sukanta Sen | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 13th International Conference on Natural Language Processing

pdf bib
Improving Document Ranking using Query Expansion and Classification Techniques for Mixed Script Information Retrieval
Subham Kumar | Anwesh Sinha Ray | Sabyasachi Kamila | Asif Ekbal | Sriparna Saha | Pushpak Bhattacharyya
Proceedings of the 13th International Conference on Natural Language Processing

pdf bib
Meaning Matters: Senses of Words are More Informative than Words for Cross-domain Sentiment Analysis
Raksha Sharma | Sudha Bhingardive | Pushpak Bhattacharyya
Proceedings of the 13th International Conference on Natural Language Processing

pdf bib
A Recurrent Neural Network Architecture for De-identifying Clinical Records
Shweta | Ankit Kumar | Asif Ekbal | Sriparna Saha | Pushpak Bhattacharyya
Proceedings of the 13th International Conference on Natural Language Processing

pdf bib
Opinion Mining in a Code-Mixed Environment: A Case Study with Government Portals
Deepak Gupta | Ankit Lamba | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 13th International Conference on Natural Language Processing

pdf bib
On Why Coarse Class Classification is Bottleneck in Noun Compound Interpretation
Girishkumar Ponkiya | Pushpak Bhattacharyya | Girish K. Palshikar
Proceedings of the 13th International Conference on Natural Language Processing

pdf bib
Verbframator:Semi-Automatic Verb Frame Annotator Tool with Special Reference to Marathi
Hanumant Redkar | Sandhya Singh | Nandini Ghag | Jai Paranjape | Nilesh Joshi | Malhar Kulkarni | Pushpak Bhattacharyya
Proceedings of the 13th International Conference on Natural Language Processing

pdf bib
How Challenging is Sarcasm versus Irony Classification?: A Study With a Dataset from English Literature
Aditya Joshi | Vaibhav Tripathi | Pushpak Bhattacharyya | Mark Carman | Meghna Singh | Jaya Saraswati | Rajita Shukla
Proceedings of the Australasian Language Technology Association Workshop 2016

pdf bib
Are Word Embedding-based Features Useful for Sarcasm Detection?
Aditya Joshi | Vaibhav Tripathi | Kevin Patel | Pushpak Bhattacharyya | Mark Carman
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
Orthographic Syllable as basic unit for SMT between Related Languages
Anoop Kunchukuttan | Pushpak Bhattacharyya
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
Statistical Machine Translation between Related Languages
Pushpak Bhattacharyya | Mitesh M. Khapra | Anoop Kunchukuttan
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorial Abstracts

pdf bib
Detecting Most Frequent Sense using Word Embeddings and BabelNet
Harpreet Singh Arora | Sudha Bhingardive | Pushpak Bhattacharyya
Proceedings of the 8th Global WordNet Conference (GWC)

Since the inception of the SENSEVAL evaluation exercises there has been a great deal of recent research into Word Sense Disambiguation (WSD). Over the years, various supervised, unsupervised and knowledge based WSD systems have been proposed. Beating the first sense heuristics is a challenging task for these systems. In this paper, we present our work on Most Frequent Sense (MFS) detection using Word Embeddings and BabelNet features. The semantic features from BabelNet viz., synsets, gloss, relations, etc. are used for generating sense embeddings. We compare word embedding of a word with its sense embeddings to obtain the MFS with the highest similarity. The MFS is detected for six languages viz., English, Spanish, Russian, German, French and Italian. However, this approach can be applied to any language provided that word embeddings are available for that language.
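The comparison step described in this abstract (score each sense embedding against the word embedding and keep the most similar sense) can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the abstract does not say how sense embeddings are composed, so this sketch simply averages the vectors of each sense's feature words, and the two-dimensional vectors, sense labels, and function names are invented toy values, not BabelNet data or the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def average(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def most_frequent_sense(word, sense_features, embeddings):
    """Return the sense whose (averaged) embedding is closest to the word embedding.

    sense_features maps a hypothetical sense id to the feature words
    (e.g. gloss or synset words) used to build that sense's embedding.
    """
    word_vec = embeddings[word]
    scores = {}
    for sense_id, features in sense_features.items():
        vectors = [embeddings[f] for f in features if f in embeddings]
        if vectors:  # skip senses with no embeddable features
            scores[sense_id] = cosine(word_vec, average(vectors))
    return max(scores, key=scores.get), scores


# Invented toy vectors, for illustration only.
toy_embeddings = {
    "bank": [0.9, 0.1],
    "money": [1.0, 0.0],
    "deposit": [0.8, 0.2],
    "river": [0.0, 1.0],
    "shore": [0.1, 0.9],
}
toy_senses = {
    "bank#finance": ["money", "deposit"],
    "bank#river": ["river", "shore"],
}

best_sense, sense_scores = most_frequent_sense("bank", toy_senses, toy_embeddings)
```

With these toy vectors the finance sense scores higher because its feature words lie closer to the vector for "bank"; with real embeddings the outcome depends on the training corpus.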

pdf bib
IndoWordNet::Similarity- Computing Semantic Similarity and Relatedness using IndoWordNet
Sudha Bhingardive | Hanumant Redkar | Prateek Sappadla | Dhirendra Singh | Pushpak Bhattacharyya
Proceedings of the 8th Global WordNet Conference (GWC)

Semantic similarity and relatedness measures play an important role in natural language processing applications. In this paper, we present the IndoWordNet::Similarity tool and interface, designed for computing the semantic similarity and relatedness between two words in IndoWordNet. A Java-based tool and a web interface have been developed to compute this semantic similarity and relatedness. A Java API has also been developed for this purpose. The tool, the web interface and the API are made available for research purposes.

pdf bib
Sophisticated Lexical Databases - Simplified Usage: Mobile Applications and Browser Plugins For Wordnets
Diptesh Kanojia | Raj Dabre | Pushpak Bhattacharyya
Proceedings of the 8th Global WordNet Conference (GWC)

India is a country with 22 officially recognized languages and 17 of these have WordNets, a crucial resource. Web browser based interfaces are available for these WordNets, but are not suited for mobile devices which deters people from effectively using this resource. We present our initial work on developing mobile applications and browser extensions to access WordNets for Indian Languages. Our contribution is two fold: (1) We develop mobile applications for the Android, iOS and Windows Phone OS platforms for Hindi, Marathi and Sanskrit WordNets which allow users to search for words and obtain more information along with their translations in English and other Indian languages. (2) We also develop browser extensions for English, Hindi, Marathi, and Sanskrit WordNets, for both Mozilla Firefox, and Google Chrome. We believe that such applications can be quite helpful in a classroom scenario, where students would be able to access the WordNets as dictionaries as well as lexical knowledge bases. This can help in overcoming the language barrier along with furthering language understanding.

pdf bib
A picture is worth a thousand words: Using OpenClipArt library for enriching IndoWordNet
Diptesh Kanojia | Shehzaad Dhuliawala | Pushpak Bhattacharyya
Proceedings of the 8th Global WordNet Conference (GWC)

WordNet has proved to be immensely useful for Word Sense Disambiguation, and thence Machine Translation, Information Retrieval and Question Answering. It can also be used as a dictionary for educational purposes. The semantic nature of concepts in a WordNet motivates one to try to express this meaning in a more visual way. In this paper, we describe our work of enriching IndoWordNet with image acquisitions from the OpenClipArt library. We describe an approach used to enrich WordNets for eighteen Indian languages. Our contribution is three fold: (1) We develop a system, which, given a synset in English, finds an appropriate image for the synset. The system uses the OpenClipArt library (OCAL) to retrieve images and ranks them. (2) After retrieving the images, we map the results along with the linkages between Princeton WordNet and Hindi WordNet, to link several synsets to corresponding images. We choose and sort the top three images per synset based on our ranking heuristic. (3) We develop a tool that allows a lexicographer to manually evaluate these images. The top images are shown to a lexicographer by the evaluation tool for the task of choosing the best image representation. The lexicographer also selects the number of relevant images. Using our system, we obtain an Average Precision (P @ 3) score of 0.30.

pdf bib
IndoWordNet Conversion to Web Ontology Language (OWL)
Apurva Nagvenkar | Jyoti Pawar | Pushpak Bhattacharyya
Proceedings of the 8th Global WordNet Conference (GWC)

WordNet plays a significant role in the Linked Open Data (LOD) cloud. It has numerous applications ranging from ontology annotation to ontology mapping. IndoWordNet is a linked WordNet connecting 18 Indian language WordNets with Hindi as a source WordNet. The Hindi WordNet was initially developed by linking it to English WordNet. In this paper, we present a data representation of IndoWordNet in Web Ontology Language (OWL). The schema of Princeton WordNet has been enhanced to support the representation of IndoWordNet. This IndoWordNet representation in OWL format is now available to link other web resources. This representation is implemented for eight Indian languages.

pdf bib
Samāsa-Kartā: An Online Tool for Producing Compound Words using IndoWordNet
Hanumant Redkar | Nilesh Joshi | Sandhya Singh | Irawati Kulkarni | Malhar Kulkarni | Pushpak Bhattacharyya
Proceedings of the 8th Global WordNet Conference (GWC)

Samāsa or compounds are a regular feature of Indian Languages. They are also found in other languages like German, Italian, French, Russian, Spanish, etc. A compound word is constructed from two or more words to form a single word. The meaning of this word is derived from each of the individual words of the compound. Developing a system to generate, identify and interpret compounds is an important task in Natural Language Processing. This paper introduces a web based tool - Samāsa-Kartā for producing compound words. Here, the focus is on the Sanskrit language due to its richness in usage of compounds; however, this approach can be applied to any Indian language as well as other languages. IndoWordNet is used as a resource for words to be compounded. The motivation behind creating compound words is to create, to improve the vocabulary, to reduce sense ambiguity, etc. in order to enrich the WordNet. The Samāsa-Kartā can be used for various applications viz., compound categorization, sandhi creation, morphological analysis, paraphrasing, synset creation, etc.

pdf bib
High, Medium or Low? Detecting Intensity Variation Among polar synonyms in WordNet
Raksha Sharma | Pushpak Bhattacharyya
Proceedings of the 8th Global WordNet Conference (GWC)

For fine-grained sentiment analysis, we need to go beyond zero-one polarity and find a way to compare adjectives (synonyms) that share the same sense. The choice of a word from a set of synonyms provides a way to select the exact polarity-intensity. For example, choosing to describe a person as benevolent rather than kind changes the intensity of the expression. In this paper, we present a sense based lexical resource, where synonyms are assigned intensity levels, viz., high, medium and low. We show that the measure P(s|w) (probability of a sense s given the word w) can derive the intensity of a word within the sense. We observe a statistically significant positive correlation between P(s|w) and intensity of synonyms for three languages, viz., English, Marathi and Hindi. The average correlation scores are 0.47 for English, 0.56 for Marathi and 0.58 for Hindi.
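As a rough illustration of the measure named in this abstract, P(s|w) can be estimated from a sense-tagged corpus as the count of word w tagged with sense s divided by the total count of w. The sketch below assumes this relative-frequency estimate; the abstract does not state the estimator or any thresholds for the high, medium and low levels, so none are shown. The counts, sense labels and function name are invented for illustration only.

```python
from collections import Counter


def sense_given_word(tagged_tokens):
    """Estimate P(s|w) as count(w, s) / count(w) from (word, sense) pairs."""
    pair_counts = Counter(tagged_tokens)
    word_counts = Counter(word for word, _ in tagged_tokens)
    return {
        (word, sense): count / word_counts[word]
        for (word, sense), count in pair_counts.items()
    }


# Invented toy corpus: "kind" is mostly used in a different sense ("type"),
# while "benevolent" is always used in the "generous" sense.
toy_corpus = (
    [("kind", "generous")] * 2
    + [("kind", "type")] * 6
    + [("benevolent", "generous")] * 4
)

probs = sense_given_word(toy_corpus)

# Rank the synonyms that share the "generous" sense by P(s|w), highest first.
ranked = sorted(
    (word for (word, sense) in probs if sense == "generous"),
    key=lambda word: probs[(word, "generous")],
    reverse=True,
)
```

In this toy data "benevolent" ranks above "kind" within the shared sense, which mirrors the direction of the example in the abstract; real rankings would depend on the sense-tagged corpus used.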

pdf bib
Mapping it differently: A solution to the linking challenges
Meghna Singh | Rajita Shukla | Jaya Saraswati | Laxmi Kashyap | Diptesh Kanojia | Pushpak Bhattacharyya
Proceedings of the 8th Global WordNet Conference (GWC)

This paper reports the work of creating bilingual mappings in English for certain synsets of Hindi wordnet, the need for doing this, the methods adopted and the tools created for the task. Hindi wordnet, which forms the foundation for other Indian language wordnets, has been linked to the English WordNet. To maximize linkages, an important strategy of using direct and hypernymy linkages has been followed. However, the hypernymy linkages were found to be inadequate in certain cases and posed a challenge due to sense granularity of language. Thus, the idea of creating bilingual mappings was adopted as a solution. A bilingual mapping means a linkage between a concept in two different languages, with the help of translation and/or transliteration. Such mappings retain meaningful representations, while capturing semantic similarity at the same time. This has also proven to be a great enhancement of Hindi wordnet and can be a crucial resource for multilingual applications in natural language processing, including machine translation and cross language information retrieval.

pdf bib
A Hybrid Deep Learning Architecture for Sentiment Analysis
Md Shad Akhtar | Ayush Kumar | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

In this paper, we propose a novel hybrid deep learning architecture which is highly efficient for sentiment analysis in resource-poor languages. We learn sentiment embedded vectors from the Convolutional Neural Network (CNN). These are augmented to a set of optimized features selected through a multi-objective optimization (MOO) framework. The sentiment augmented optimized vector obtained at the end is used for the training of SVM for sentiment classification. We evaluate our proposed approach for coarse-grained (i.e. sentence level) as well as fine-grained (i.e. aspect level) sentiment analysis on four Hindi datasets covering varying domains. In order to show that our proposed method is generic in nature we also evaluate it on two benchmark English datasets. Evaluation shows that the results of the proposed method are consistent across all the datasets and often outperform the state-of-the-art systems. To the best of our knowledge, this is the very first attempt where such a deep learning model is used for less-resourced languages such as Hindi.

pdf bib
Borrow a Little from your Rich Cousin: Using Embeddings and Polarities of English Words for Multilingual Sentiment Classification
Prerana Singhal | Pushpak Bhattacharyya
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

In this paper, we provide a solution to multilingual sentiment classification using deep learning. Given input text in a language, we use word translation into English and then the embeddings of these English words to train a classifier. This projection into the English space plus word embeddings gives a simple and uniform framework for multilingual sentiment analysis. A novel idea is augmentation of the training data with polar words, appearing in these sentences, along with their polarities. This approach leads to a performance gain of 7-10% over traditional classifiers on many languages, irrespective of text genre, despite the scarcity of resources in most languages.

pdf bib
Lexical Resources to Enrich English Malayalam Machine Translation
Sreelekha S | Pushpak Bhattacharyya
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we present our work on the usage of lexical resources for machine translation between English and Malayalam. We compare the performance of different Statistical Machine Translation (SMT) systems built on top of a phrase-based SMT baseline. We explore different ways of utilizing lexical resources to improve the quality of English Malayalam statistical machine translation. In order to enrich the training corpus we have augmented the lexical resources in two ways: (a) additional vocabulary and (b) inflected verbal forms. Lexical resources include the IndoWordnet semantic relation set, lexical words, verb phrases, etc. We have described case studies, evaluations and have given detailed error analysis for both Malayalam to English and English to Malayalam machine translation systems. We observed significant improvement in evaluations of translation quality. Lexical resources do help uplift performance when parallel corpora are scanty.

pdf bib
That’ll Do Fine!: A Coarse Lexical Resource for English-Hindi MT, Using Polylingual Topic Models
Diptesh Kanojia | Aditya Joshi | Pushpak Bhattacharyya | Mark James Carman
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Parallel corpora are often injected with bilingual lexical resources for improved Indian language machine translation (MT). In absence of such lexical resources, multilingual topic models have been used to create coarse lexical resources in the past, using a Cartesian product approach. Our results show that for morphologically rich languages like Hindi, the Cartesian product approach is detrimental for MT. We then present a novel ‘sentential’ approach to use this coarse lexical resource from a multilingual topic model. Our coarse lexical resource when injected with a parallel corpus outperforms a system trained using parallel corpus and a good quality lexical resource. As demonstrated by the quality of our coarse lexical resource and its benefit to MT, we believe that our sentential approach to create such a resource will help MT for resource-constrained languages.

pdf bib
Multiword Expressions Dataset for Indian Languages
Dhirendra Singh | Sudha Bhingardive | Pushpak Bhattacharyya
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Multiword Expressions (MWEs) are used frequently in natural languages, but understanding the diversity in MWEs is one of the open problems in the area of Natural Language Processing. In the context of Indian languages, MWEs play an important role. In this paper, we present an MWE annotation dataset created for Indian languages, viz., Hindi and Marathi. We extract possible MWE candidates using two repositories: 1) the POS-tagged corpus and 2) the IndoWordNet synsets. Annotation is done for two types of MWEs: compound nouns and light verb constructions. In the process of annotation, human annotators tag valid MWEs from these candidates based on the standard guidelines provided to them. We obtained 3178 compound nouns and 2556 light verb constructions in Hindi and 1003 compound nouns and 2416 light verb constructions in Marathi using the two repositories mentioned above. The resource is made publicly available and can be used as a gold standard for Hindi and Marathi MWE systems.

pdf bib
Aspect based Sentiment Analysis in Hindi: Resource Creation and Evaluation
Md Shad Akhtar | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Due to the phenomenal growth of online product reviews, sentiment analysis (SA) has gained huge attention, for example, by online service providers. A number of benchmark datasets for a wide range of domains have been made available for sentiment analysis, especially in resource-rich languages. In this paper we assess the challenges of SA in Hindi by providing a benchmark setup, where we create an annotated dataset of high quality, build machine learning models for sentiment analysis in order to show the effective usage of the dataset, and finally make the resource available to the community for further advancement of research. The dataset comprises Hindi product reviews crawled from various online sources. Each sentence of the review is annotated with aspect term and its associated sentiment. As classification algorithms we use Conditional Random Field (CRF) and Support Vector Machine (SVM) for aspect term extraction and sentiment analysis, respectively. Evaluation results show the average F-measure of 41.07% for aspect term extraction and accuracy of 54.05% for sentiment classification.

pdf bib
Synset Ranking of Hindi WordNet
Sudha Bhingardive | Rajita Shukla | Jaya Saraswati | Laxmi Kashyap | Dhirendra Singh | Pushpak Bhattacharyya
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Word Sense Disambiguation (WSD) is one of the open problems in the area of natural language processing. Various supervised, unsupervised and knowledge based approaches have been proposed for automatically determining the sense of a word in a particular context. It has been observed that such approaches often find it difficult to beat the WordNet First Sense (WFS) baseline which assigns the sense irrespective of context. In this paper, we present our work on creating the WFS baseline for the Hindi language by manually ranking the synsets of Hindi WordNet. A ranking tool is developed where human experts can see the frequency of the word senses in the sense-tagged corpora and are asked to rank the senses of a word using this information and their own intuition. The accuracy of the WFS baseline is tested on several standard datasets. F-score is found to be 60%, 65% and 55% on Health, Tourism and News datasets respectively. The created rankings can also be used in other NLP applications viz., Machine Translation, Information Retrieval, Text Summarization, etc.

pdf bib
SlangNet: A WordNet like resource for English Slang
Shehzaad Dhuliawala | Diptesh Kanojia | Pushpak Bhattacharyya
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present a WordNet like structured resource for slang words and neologisms on the internet. The dynamism of language is often an indication that current language technology tools trained on today’s data, may not be able to process the language in the future. Our resource could be (1) used to augment the WordNet, (2) used in several Natural Language Processing (NLP) applications which make use of noisy data on the internet like Information Retrieval and Web Mining. Such a resource can also be used to distinguish slang word senses from conventional word senses. To stimulate similar innovations widely in the NLP community, we test the efficacy of our resource for detecting slang using standard bag of words Word Sense Disambiguation (WSD) algorithms (Lesk and Extended Lesk) for English data on the internet.

pdf bib
Harnessing Cognitive Features for Sarcasm Detection
Abhijit Mishra | Diptesh Kanojia | Seema Nagar | Kuntal Dey | Pushpak Bhattacharyya
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2015

pdf bib
Monotone Submodularity in Opinion Summaries
Jayanth Jayanth | Jayaprakash Sundararaj | Pushpak Bhattacharyya
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
Adjective Intensity and Sentiment Analysis
Raksha Sharma | Mohit Gupta | Astha Agarwal | Pushpak Bhattacharyya
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
Leveraging Small Multilingual Corpora for SMT Using Many Pivot Languages
Raj Dabre | Fabien Cromieres | Sadao Kurohashi | Pushpak Bhattacharyya
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Unsupervised Most Frequent Sense Detection using Word Embeddings
Sudha Bhingardive | Dhirendra Singh | Rudramurthy V | Hanumant Redkar | Pushpak Bhattacharyya
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Brahmi-Net: A transliteration and script conversion system for languages of the Indian subcontinent
Anoop Kunchukuttan | Ratish Puduppully | Pushpak Bhattacharyya
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

pdf bib
A Computational Approach to Automatic Prediction of Drunk-Texting
Aditya Joshi | Abhijit Mishra | Balamurali AR | Pushpak Bhattacharyya | Mark J. Carman
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

pdf bib
Harnessing Context Incongruity for Sarcasm Detection
Aditya Joshi | Vinita Sharma | Pushpak Bhattacharyya
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

pdf bib
Your Sentiment Precedes You: Using an author’s historical tweets to predict sarcasm
Anupam Khattri | Aditya Joshi | Pushpak Bhattacharyya | Mark Carman
Proceedings of the 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

pdf bib
Data representation methods and use of mined corpora for Indian language transliteration
Anoop Kunchukuttan | Pushpak Bhattacharyya
Proceedings of the Fifth Named Entity Workshop

pdf bib
Addressing Class Imbalance in Grammatical Error Detection with Evaluation Metric Optimization
Anoop Kunchukuttan | Pushpak Bhattacharyya
Proceedings of the 12th International Conference on Natural Language Processing

pdf bib
Noun Phrase Chunking for Marathi using Distant Supervision
Sachin Pawar | Nitin Ramrakhiyani | Girish K. Palshikar | Pushpak Bhattacharyya | Swapnil Hingmire
Proceedings of the 12th International Conference on Natural Language Processing

pdf bib
Using Word Embeddings for Bilingual Unsupervised WSD
Sudha Bhingardive | Dhirendra Singh | Rudramurthy V | Pushpak Bhattacharyya
Proceedings of the 12th International Conference on Natural Language Processing

pdf bib
IndoWordNet Dictionary: An Online Multilingual Dictionary using IndoWordNet
Hanumant Redkar | Sandhya Singh | Nilesh Joshi | Anupam Ghosh | Pushpak Bhattacharyya
Proceedings of the 12th International Conference on Natural Language Processing

pdf bib
Let Sense Bags Do Talking: Cross Lingual Word Semantic Similarity for English and Hindi
Apurva Nagvenkar | Jyoti Pawar | Pushpak Bhattacharyya
Proceedings of the 12th International Conference on Natural Language Processing

pdf bib
A temporal expression recognition system for medical documents by
Naman Gupta | Aditya Joshi | Pushpak Bhattacharyya
Proceedings of the 12th International Conference on Natural Language Processing

pdf bib
Solving Data Sparsity by Morphology Injection in Factored SMT
Sreelekha S | Piyush Dungarwal | Pushpak Bhattacharyya | Malathi D
Proceedings of the 12th International Conference on Natural Language Processing

pdf bib
TransChat: Cross-Lingual Instant Messaging for Indian Languages
Diptesh Kanojia | Shehzaad Dhuliawala | Abhijit Mishra | Naman Gupta | Pushpak Bhattacharyya
Proceedings of the 12th International Conference on Natural Language Processing

pdf bib
Domain Sentiment Matters: A Two Stage Sentiment Analyzer
Raksha Sharma | Pushpak Bhattacharyya
Proceedings of the 12th International Conference on Natural Language Processing

pdf bib
Judge a Book by its Cover: Conservative Focused Crawling under Resource Constraints
Shehzaad Dhuliawala | Arjun Atreya V | Ravi Kumar Yadav | Pushpak Bhattacharyya
Proceedings of the 12th International Conference on Natural Language Processing

pdf bib
Logistic Regression for Automatic Lexical Level Morphological Paradigm Selection for Konkani Nouns
Shilpa Desai | Jyoti Pawar | Pushpak Bhattacharyya
Proceedings of the 12th International Conference on Natural Language Processing

pdf bib
Automated Analysis of Bangla Poetry for Classification and Poet Identification
Geetanjali Rakshit | Anupam Ghosh | Pushpak Bhattacharyya | Gholamreza Haffari
Proceedings of the 12th International Conference on Natural Language Processing

pdf bib
Detection of Multiword Expressions for Hindi Language using Word Embeddings and WordNet-based Features
Dhirendra Singh | Sudha Bhingardive | Kevin Patel | Pushpak Bhattacharyya
Proceedings of the 12th International Conference on Natural Language Processing

pdf bib
Augmenting Pivot based SMT with word segmentation
Rohit More | Anoop Kunchukuttan | Pushpak Bhattacharyya | Raj Dabre
Proceedings of the 12th International Conference on Natural Language Processing

pdf bib
Using Multilingual Topic Models for Improved Alignment in English-Hindi MT
Diptesh Kanojia | Aditya Joshi | Pushpak Bhattacharyya | Mark James Carman
Proceedings of the 12th International Conference on Natural Language Processing

pdf bib
Triangulation of Reordering Tables: An Advancement Over Phrase Table Triangulation in Pivot-Based SMT
Deepak Patil | Harshad Chavan | Pushpak Bhattacharyya
Proceedings of the 12th International Conference on Natural Language Processing

pdf bib
Post-editing a chapter of a specialized textbook into 7 languages: importance of terminological proximity with English for productivity
Ritesh Shah | Christian Boitet | Pushpak Bhattacharyya | Mithun Padmakumar | Leonardo Zilio | Ruslan Kalitvianski | Mohammad Nasiruddin | Mutsuko Tomokiyo | Sandra Castellanos Páez
Proceedings of the 12th International Conference on Natural Language Processing

pdf bib
Investigating the potential of post-ordering SMT output to improve translation quality
Pratik Mehta | Anoop Kunchukuttan | Pushpak Bhattacharyya
Proceedings of the 12th International Conference on Natural Language Processing

pdf bib
Coreference Resolution to Support IE from Indian Classical Music Forums
Joe Cheri | Pushpak Bhattacharyya
Proceedings of the International Conference Recent Advances in Natural Language Processing

2014

pdf bib
Graph Based Algorithm for Automatic Domain Segmentation of WordNet
Brijesh Bhatt | Subhash Kunnath | Pushpak Bhattacharyya
Proceedings of the Seventh Global Wordnet Conference

pdf bib
Do not do processing, when you can look up: Towards a Discrimination Net for WSD
Diptesh Kanojia | Pushpak Bhattacharyya | Raj Dabre | Siddhartha Gunti | Manish Shrivastava
Proceedings of the Seventh Global Wordnet Conference

pdf bib
Facilitating Multi-Lingual Sense Annotation: Human Mediated Lemmatizer
Pushpak Bhattacharyya | Ankit Bahuguna | Lavita Talukdar | Bornali Phukan
Proceedings of the Seventh Global Wordnet Conference

pdf bib
Semi-Automatic Extension of Sanskrit Wordnet using Bilingual Dictionary
Sudha Bhingardive | Tanuja Ajotikar | Irawati Kulkarni | Malhar Kulkarni | Pushpak Bhattacharyya
Proceedings of the Seventh Global Wordnet Conference

pdf bib
IndoWordnet Visualizer: A Graphical User Interface for Browsing and Exploring Wordnets of Indian Languages
Devendra Singh Chaplot | Sudha Bhingardive | Pushpak Bhattacharyya
Proceedings of the Seventh Global Wordnet Conference

pdf bib
Tuning a Grammar Correction System for Increased Precision
Anoop Kunchukuttan | Sriram Chaudhury | Pushpak Bhattacharyya
Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task

pdf bib
Dive deeper: Deep Semantics for Sentiment Analysis
Nikhilkumar Jadhav | Pushpak Bhattacharyya
Proceedings of the 5th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

pdf bib
A cognitive study of subjectivity extraction in sentiment annotation
Abhijit Mishra | Aditya Joshi | Pushpak Bhattacharyya
Proceedings of the 5th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

pdf bib
The IIT Bombay Hindi-English Translation System at WMT 2014
Piyush Dungarwal | Rajen Chatterjee | Abhijit Mishra | Anoop Kunchukuttan | Ritesh Shah | Pushpak Bhattacharyya
Proceedings of the Ninth Workshop on Statistical Machine Translation

pdf bib
LAYERED: Metric for Machine Translation Evaluation
Shubham Gautam | Pushpak Bhattacharyya
Proceedings of the Ninth Workshop on Statistical Machine Translation

pdf bib
Tackling Close Cousins: Experiences In Developing Statistical Machine Translation Systems For Marathi And Hindi
Raj Dabre | Jyotesh Choudhari | Pushpak Bhattacharyya
Proceedings of the 11th International Conference on Natural Language Processing

pdf bib
Supertag Based Pre-ordering in Machine Translation
Rajen Chatterjee | Anoop Kunchukuttan | Pushpak Bhattacharyya
Proceedings of the 11th International Conference on Natural Language Processing

pdf bib
HinMA: Distributed Morphology based Hindi Morphological Analyzer
Ankit Bahuguna | Lavita Talukdar | Pushpak Bhattacharyya | Smriti Singh
Proceedings of the 11th International Conference on Natural Language Processing

pdf bib
Anou Tradir: Experiences In Building Statistical Machine Translation Systems For Mauritian Languages – Creole, English, French
Raj Dabre | Aneerav Sukhoo | Pushpak Bhattacharyya
Proceedings of the 11th International Conference on Natural Language Processing

pdf bib
Introduction to Synskarta: An Online Interface for Synset Creation with Special Reference to Sanskrit
Hanumant Redkar | Jai Paranjape | Nilesh Joshi | Irawati Kulkarni | Malhar Kulkarni | Pushpak Bhattacharyya
Proceedings of the 11th International Conference on Natural Language Processing

pdf bib
A Sentiment Analyzer for Hindi Using Hindi Senti Lexicon
Raksha Sharma | Pushpak Bhattacharyya
Proceedings of the 11th International Conference on Natural Language Processing

pdf bib
PaCMan : Parallel Corpus Management Workbench
Diptesh Kanojia | Manish Shrivastava | Raj Dabre | Pushpak Bhattacharyya
Proceedings of the 11th International Conference on Natural Language Processing

pdf bib
AutoParSe: An Automatic Paradigm Selector For Nouns in Konkani
Shilpa Desai | Neenad Desai | Jyoti Pawar | Pushpak Bhattacharyya
Proceedings of the 11th International Conference on Natural Language Processing

pdf bib
Merging Verb Senses of Hindi WordNet using Word Embeddings
Sudha Bhingardive | Ratish Puduppully | Dhirendra Singh | Pushpak Bhattacharyya
Proceedings of the 11th International Conference on Natural Language Processing

pdf bib
A Framework for Learning Morphology using Suffix Association Matrix
Shilpa Desai | Jyoti Pawar | Pushpak Bhattacharyya
Proceedings of the Fifth Workshop on South and Southeast Asian Natural Language Processing

pdf bib
Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing
Jorge Baptista | Pushpak Bhattacharyya | Christiane Fellbaum | Mikel Forcada | Chu-Ren Huang | Svetla Koeva | Cvetana Krstev | Eric Laporte
Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing

pdf bib
Measuring Sentiment Annotation Complexity of Text
Aditya Joshi | Abhijit Mishra | Nivvedan Senthamilselvan | Pushpak Bhattacharyya
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Shata-Anuvadak: Tackling Multiway Translation of Indian Languages
Anoop Kunchukuttan | Abhijit Mishra | Rajen Chatterjee | Ritesh Shah | Pushpak Bhattacharyya
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present a compendium of 110 Statistical Machine Translation systems built from parallel corpora of 11 Indian languages belonging to both Indo-Aryan and Dravidian families. We analyze the relationship between translation accuracy and the language families involved. We feel that insights obtained from this analysis will provide guidelines for creating machine translation systems of specific Indian language pairs. We build phrase based systems and some extensions. Across multiple languages, we show improvements on the baseline phrase based systems using these extensions: (1) source side reordering for English-Indian language translation, and (2) transliteration of untranslated words for Indian language-Indian language translation. These enhancements harness shared characteristics of Indian languages. To stimulate similar innovation widely in the NLP community, we have made the trained models for these language pairs publicly available.

pdf bib
When Transliteration Met Crowdsourcing: An Empirical Study of Transliteration via Crowdsourcing using Efficient, Non-redundant and Fair Quality Control
Mitesh M. Khapra | Ananthakrishnan Ramanathan | Anoop Kunchukuttan | Karthik Visweswariah | Pushpak Bhattacharyya
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Sufficient parallel transliteration pairs are needed for training state of the art transliteration engines. Given the cost involved, it is often infeasible to collect such data using experts. Crowdsourcing could be a cheaper alternative, provided that a good quality control (QC) mechanism can be devised for this task. Most QC mechanisms employed in crowdsourcing are aggressive (unfair to workers) and expensive (unfair to requesters). In contrast, we propose a low-cost QC mechanism which is fair to both workers and requesters. At the heart of our approach lies a rule-based Transliteration Equivalence approach which takes as input a list of vowels in the two languages and a mapping of the consonants in the two languages. We empirically show that our approach outperforms other popular QC mechanisms (viz., consensus and sampling) on two vital parameters: (i) fairness to requesters (lower cost per correct transliteration) and (ii) fairness to workers (lower rate of rejecting correct answers). Further, as an extrinsic evaluation we use the standard NEWS 2010 test set and show that such quality controlled crowdsourced data compares well to expert data when used for training a transliteration engine.

2013

pdf bib
Detecting Domain Dedicated Polar Words
Raksha Sharma | Pushpak Bhattacharyya
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
Little by Little: Semi Supervised Stemming through Stem Set Minimization
Vasudevan N | Pushpak Bhattacharyya
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
Automated Grammar Correction Using Hierarchical Phrase-Based Statistical Machine Translation
Bibek Behera | Pushpak Bhattacharyya
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
Structure Cognizant Pseudo Relevance Feedback
Arjun Atreya V | Yogesh Kakde | Pushpak Bhattacharyya | Ganesh Ramakrishnan
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
Making Headlines in Hindi: Automatic English to Hindi News Headline Translation
Aditya Joshi | Kashyap Popat | Shubham Gautam | Pushpak Bhattacharyya
The Companion Volume of the Proceedings of IJCNLP 2013: System Demonstrations

pdf bib
The Haves and the Have-Nots: Leveraging Unlabelled Corpora for Sentiment Analysis
Kashyap Popat | Balamurali A.R | Pushpak Bhattacharyya | Gholamreza Haffari
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
IndoNet: A Multilingual Lexical Knowledge Network for Indian Languages
Brijesh Bhatt | Lahari Poddar | Pushpak Bhattacharyya
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Automatically Predicting Sentence Translation Difficulty
Abhijit Mishra | Pushpak Bhattacharyya | Michael Carl
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Neighbors Help: Bilingual Unsupervised WSD Using Context
Sudha Bhingardive | Samiulla Shaikh | Pushpak Bhattacharyya
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Detecting Turnarounds in Sentiment Analysis: Thwarting
Ankit Ramteke | Akshat Malu | Pushpak Bhattacharyya | J. Saketha Nath
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
TransDoop: A Map-Reduce based Crowdsourced Translation for Complex Domain
Anoop Kunchukuttan | Rajen Chatterjee | Shourya Roy | Abhijit Mishra | Pushpak Bhattacharyya
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations

pdf bib
CFILT-CORE: Semantic Textual Similarity using Universal Networking Language
Avishek Dan | Pushpak Bhattacharyya
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity

pdf bib
IITB-Sentiment-Analysts: Participation in Sentiment Analysis in Twitter SemEval 2013 Task
Karan Chawla | Ankit Ramteke | Pushpak Bhattacharyya
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

pdf bib
More than meets the eye: Study of Human Cognition in Sense Annotation
Salil Joshi | Diptesh Kanojia | Pushpak Bhattacharyya
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
IITB System for CoNLL 2013 Shared Task: A Hybrid Approach to Grammatical Error Correction
Anoop Kunchukuttan | Ritesh Shah | Pushpak Bhattacharyya
Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task

pdf bib
Proceedings of the 11th Workshop on Asian Language Resources
Pushpak Bhattacharyya | Key-Sun Choi
Proceedings of the 11th Workshop on Asian Language Resources

pdf bib
Proceedings of the 4th Workshop on South and Southeast Asian Natural Language Processing
Pushpak Bhattacharyya | M. G. Abbas Malik
Proceedings of the 4th Workshop on South and Southeast Asian Natural Language Processing

pdf bib
Urdu Hindi Machine Transliteration using SMT
M. G. Abbas Malik | Christian Boitet | Laurent Besacier | Pushpak Bhattacharyya
Proceedings of the 4th Workshop on South and Southeast Asian Natural Language Processing

2012

pdf bib
janardhan: Semantic Textual Similarity using Universal Networking Language graph matching
Janardhan Singh | Arindam Bhattacharya | Pushpak Bhattacharyya
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

pdf bib
Towards Efficient Named-Entity Rule Induction for Customizability
Ajay Nagesh | Ganesh Ramakrishnan | Laura Chiticariu | Rajasekar Krishnamurthy | Ankush Dharkar | Pushpak Bhattacharyya
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

pdf bib
Sentiment Analysis in Twitter with Lightweight Discourse Analysis
Subhabrata Mukherjee | Pushpak Bhattacharyya
Proceedings of COLING 2012

pdf bib
YouCat: Weakly Supervised Youtube Video Categorization System from Meta Data & User Comments using WordNet & Wikipedia
Subhabrata Mukherjee | Pushpak Bhattacharyya
Proceedings of COLING 2012

pdf bib
Cross-Lingual Sentiment Analysis for Indian Languages using Linked WordNets
Balamurali A.R. | Aditya Joshi | Pushpak Bhattacharyya
Proceedings of COLING 2012: Posters

pdf bib
Morphological Analyzer for Affix Stacking Languages: A Case Study of Marathi
Raj Dabre | Archana Amberkar | Pushpak Bhattacharyya
Proceedings of COLING 2012: Posters

pdf bib
Automated Paradigm Selection for FSA based Konkani Verb Morphological Analyzer
Shilpa Desai | Jyoti Pawar | Pushpak Bhattacharyya
Proceedings of COLING 2012: Demonstration Papers

pdf bib
Eating Your Own Cooking: Automatically Linking Wordnet Synsets of Two Languages
Salil Joshi | Arindam Chatterjee | Arun Karthikeyan Karra | Pushpak Bhattacharyya
Proceedings of COLING 2012: Demonstration Papers

pdf bib
I Can Sense It: a Comprehensive Online System for WSD
Salil Joshi | Mitesh M Khapra | Pushpak Bhattacharyya
Proceedings of COLING 2012: Demonstration Papers

pdf bib
Discrimination-Net for Hindi
Diptesh Kanojia | Arindam Chatterjee | Salil Joshi | Pushpak Bhattacharyya
Proceedings of COLING 2012: Demonstration Papers

pdf bib
Experiences in Resource Generation for Machine Translation through Crowdsourcing
Anoop Kunchukuttan | Shourya Roy | Pratik Patel | Kushal Ladha | Somya Gupta | Mitesh M. Khapra | Pushpak Bhattacharyya
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The logistics of collecting resources for Machine Translation (MT) has always been a cause of concern for some of the resource deprived languages of the world. The recent advent of crowdsourcing platforms provides an opportunity to explore the large scale generation of resources for MT. However, before venturing into this mode of resource collection, it is important to understand the various factors, such as task design, crowd motivation and quality control, which can influence the success of such a crowdsourcing venture. In this paper, we present our experiences based on a series of experiments performed. This is an attempt to provide a holistic view of the different facets of translation crowdsourcing and to identify key challenges which need to be addressed for building a practical crowdsourcing solution for MT.

pdf bib
Cost and Benefit of Using WordNet Senses for Sentiment Analysis
Balamurali AR | Aditya Joshi | Pushpak Bhattacharyya
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Typically, accuracy is used to represent the performance of an NLP system. However, accuracy attainment is a function of investment in annotation. Typically, the more the amount and sophistication of annotation, higher is the accuracy. However, a moot question is "is the accuracy improvement commensurate with the cost incurred in annotation"? We present an economic model to assess the marginal benefit accruing from increase in cost of annotation. In particular, as a case in point we have chosen the sentiment analysis (SA) problem. In SA, documents normally are polarity classified by running them through classifiers trained on document vectors constructed from lexeme features, i.e., words. If, however, instead of words, one uses word senses (synset ids in wordnets) as features, the accuracy improves dramatically. But is this improvement significant enough to justify the cost of annotation? This question, to the best of our knowledge, has not been investigated with the seriousness it deserves. We perform a cost benefit study based on a vendor-machine model. By setting up a cost price, selling price and profit scenario, we show that although extra cost is incurred in sense annotation, the profit margin is high, justifying the cost.

pdf bib
Proceedings of the First Workshop on Eye-tracking and Natural Language Processing
Michael Carl | Pushpak Bhattacharyya | Kamal Kumar Choudhary
Proceedings of the First Workshop on Eye-tracking and Natural Language Processing

pdf bib
A heuristic-based approach for systematic error correction of gaze data for reading
Abhijit Mishra | Michael Carl | Pushpak Bhattacharyya
Proceedings of the First Workshop on Eye-tracking and Natural Language Processing

pdf bib
Building Multilingual Search Index using open source framework
Arjun Atreya | Swapnil Chaudhari | Pushpak Bhattacharyya | Ganesh Ramakrishnan
Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing

pdf bib
Error tracking in search engine development
Swapnil Chaudhari | Arjun Atreya V | Pushpak Bhattacharyya | Ganesh Ramakrishnan
Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing

pdf bib
Domain Specific Ontology Extractor For Indian Languages
Brijesh Bhatt | Pushpak Bhattacharyya
Proceedings of the 10th Workshop on Asian Language Resources

pdf bib
Textbook Construction from Lecture Transcripts
Aliabbas Petiwala | Kannan Moudgalya | Pushpak Bhattacharyya
Proceedings of the Workshop on Speech and Language Processing Tools in Education

pdf bib
Partially modelling word reordering as a sequence labelling problem
Anoop Kunchukuttan | Pushpak Bhattacharyya
Proceedings of the Workshop on Reordering for Statistical Machine Translation

pdf bib
Proceedings of the First International Workshop on Optimization Techniques for Human Language Technology
Pushpak Bhattacharyya | Asif Ekbal | Sriparna Saha | Mark Johnson | Diego Molla-Aliod | Mark Dras
Proceedings of the First International Workshop on Optimization Techniques for Human Language Technology

2011

pdf bib
It Takes Two to Tango: A Bilingual Unsupervised Approach for Estimating Sense Distributions using Expectation Maximization
Mitesh M. Khapra | Salil Joshi | Pushpak Bhattacharyya
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf bib
Clause-Based Reordering Constraints to Improve Statistical Machine Translation
Ananthakrishnan Ramanathan | Pushpak Bhattacharyya | Karthik Visweswariah | Kushal Ladha | Ankur Gandhe
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf bib
Harnessing WordNet Senses for Supervised Sentiment Classification
Balamurali AR | Aditya Joshi | Pushpak Bhattacharyya
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

pdf bib
Together We Can: Bilingual Bootstrapping for WSD
Mitesh M. Khapra | Salil Joshi | Arindam Chatterjee | Pushpak Bhattacharyya
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf bib
C-Feel-It: A Sentiment Analyzer for Micro-blogs
Aditya Joshi | Balamurali AR | Pushpak Bhattacharyya | Rajat Mohanty
Proceedings of the ACL-HLT 2011 System Demonstrations

pdf bib
Robust Sense-based Sentiment Classification
Balamurali AR | Aditya Joshi | Pushpak Bhattacharyya
Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2.011)

pdf bib
Hybrid Inflectional Stemmer and Rule-based Derivational Stemmer for Gujarati
Kartik Suba | Dipti Jiandani | Pushpak Bhattacharyya
Proceedings of the 2nd Workshop on South Southeast Asian Natural Language Processing (WSSANLP)

2010

pdf bib
Multilingual Pseudo-Relevance Feedback: Performance Study of Assisting Languages
Manoj Kumar Chinnakotla | Karthik Raman | Pushpak Bhattacharyya
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

pdf bib
All Words Domain Adapted WSD: Finding a Middle Ground between Supervision and Unsupervision
Mitesh Khapra | Anup Kulkarni | Saurabh Sohoney | Pushpak Bhattacharyya
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

pdf bib
Weak Translation Problems – a case study of Scriptural Translation
Muhammad Ghulam Abbas Malik | Christian Boitet | Pushpak Bhattacharyya | Laurent Besacier
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

General purpose, high quality and fully automatic MT is believed to be impossible. We are interested in scriptural translation problems, which are weak sub-problems of the general problem of translation. We introduce the characteristics of the weak problems of translation and of the scriptural translation problems, describe different computational approaches (finite-state, statistical and hybrid) to solve these problems, and report our results on several combinations of Indo-Pak languages and writing systems.

pdf bib
Everybody loves a rich cousin: An empirical study of transliteration through bridge languages
Mitesh M. Khapra | A Kumaran | Pushpak Bhattacharyya
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
OWNS: Cross-lingual Word Sense Disambiguation Using Weighted Overlap Counts and Wordnet Based Similarity Measures
Lipta Mahapatra | Meera Mohan | Mitesh Khapra | Pushpak Bhattacharyya
Proceedings of the 5th International Workshop on Semantic Evaluation

pdf bib
CFILT: Resource Conscious Approaches for All-Words Domain Specific WSD
Anup Kulkarni | Mitesh Khapra | Saurabh Sohoney | Pushpak Bhattacharyya
Proceedings of the 5th International Workshop on Semantic Evaluation

pdf bib
Value for Money: Balancing Annotation Effort, Lexicon Building and Accuracy for Multilingual WSD
Mitesh Khapra | Saurabh Sohoney | Anup Kulkarni | Pushpak Bhattacharyya
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

pdf bib
Verbs are where all the action lies: Experiences of Shallow Parsing of a Morphologically Rich Language
Harshada Gune | Mugdha Bapat | Mitesh M. Khapra | Pushpak Bhattacharyya
Coling 2010: Posters

pdf bib
Finite-state Scriptural Translation
M. G. Abbas Malik | Christian Boitet | Pushpak Bhattacharyya
Coling 2010: Posters

pdf bib
Think Globally, Apply Locally: Using Distributional Characteristics for Hindi Named Entity Identification
Shalini Gupta | Pushpak Bhattacharyya
Proceedings of the 2010 Named Entities Workshop

pdf bib
A Paradigm-Based Finite State Morphological Analyzer for Marathi
Mugdha Bapat | Harshada Gune | Pushpak Bhattacharyya
Proceedings of the 1st Workshop on South and Southeast Asian Natural Language Processing

pdf bib
Hybrid Stemmer for Gujarati
Pratikkumar Patel | Kashyap Popat | Pushpak Bhattacharyya
Proceedings of the 1st Workshop on South and Southeast Asian Natural Language Processing

pdf bib
Word Sense Disambiguation and IR
Pushpak Bhattacharyya
Proceedings of the 4th Workshop on Cross Lingual Information Access

pdf bib
More Languages, More MAP?: A Study of Multiple Assisting Languages in Multilingual PRF
Vishal Vachhani | Manoj Chinnakotla | Mitesh Khapra | Pushpak Bhattacharyya
Proceedings of the 4th Workshop on Cross Lingual Information Access

pdf bib
IndoWordNet
Pushpak Bhattacharyya
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

India is a multilingual country where machine translation and cross lingual search are highly relevant problems. These problems require large resources, like wordnets and lexicons, of high quality and coverage. Wordnets are lexical structures composed of synsets and semantic relations. Synsets are sets of synonyms. They are linked by semantic relations like hypernymy (is-a), meronymy (part-of), troponymy (manner-of) etc. IndoWordnet is a linked structure of wordnets of major Indian languages from Indo-Aryan, Dravidian and Sino-Tibetan families. These wordnets have been created by following the expansion approach from Hindi wordnet which was made available free for research in 2006. Since then a number of Indian languages have been creating their wordnets. In this paper we discuss the methodology, coverage, important considerations and multifarious benefits of IndoWordnet. Case studies are provided for Marathi, Sanskrit, Bodo and Telugu, to bring out the basic methodology of and challenges involved in the expansion approach. The guidelines the lexicographers follow for wordnet construction are enumerated. The difference between IndoWordnet and EuroWordnet also is discussed.

2009

pdf bib
Proceedings of the Third International Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societies (CLIAWS3)
Sivaji Bandyopadhyay | Pushpak Bhattacharyya | Vasudeva Varma | Sudeshna Sarkar | A Kumaran | Raghavendra Udupa
Proceedings of the Third International Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societies (CLIAWS3)

pdf bib
Improving Transliteration Accuracy Using Word-Origin Detection and Lexicon Lookup
Mitesh Khapra | Pushpak Bhattacharyya
Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration (NEWS 2009)

pdf bib
A Hybrid Model for Urdu Hindi Transliteration
Abbas Malik | Laurent Besacier | Christian Boitet | Pushpak Bhattacharyya
Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration (NEWS 2009)

pdf bib
Projecting Parameters for Multilingual Word Sense Disambiguation
Mitesh M. Khapra | Sapan Shah | Piyush Kedia | Pushpak Bhattacharyya
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

pdf bib
Case markers and Morphology: Addressing the crux of the fluency problem in English-Hindi SMT
Ananthakrishnan Ramanathan | Hansraj Choudhary | Avishek Ghosh | Pushpak Bhattacharyya
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

2008

pdf bib
A Common Parts-of-Speech Tagset Framework for Indian Languages
Baskaran Sankaran | Kalika Bali | Monojit Choudhury | Tanmoy Bhattacharya | Pushpak Bhattacharyya | Girish Nath Jha | S. Rajendran | K. Saravanan | L. Sobha | K.V. Subbarao
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We present a universal Parts-of-Speech (POS) tagset framework covering most of the Indian languages (ILs) following the hierarchical and decomposable tagset schema. In spite of significant number of speakers, there is no workable POS tagset and tagger for most ILs, which serve as fundamental building blocks for NLP research. Existing IL POS tagsets are often designed for a specific language; the few that have been designed for multiple languages cover only shallow linguistic features ignoring linguistic richness and the idiosyncrasies. The new framework that is proposed here addresses these deficiencies in an efficient and principled manner. We follow a hierarchical schema similar to that of EAGLES and this enables the framework to be flexible enough to capture rich features of a language/ language family, even while capturing the shared linguistic structures in a methodical way. The proposed common framework further facilitates the sharing and reusability of scarce resources in these languages and ensures cross-linguistic compatibility.

pdf bib
Lexical Resources for Semantics Extraction
Rajat Mohanty | Pushpak Bhattacharyya
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper, we report our work on the creation of a number of lexical resources that are crucial for an interlingua based MT from English to other languages. These lexical resources are in the form of sub-categorization frames, verb knowledge bases and rule templates for establishing semantic relations and speech act like attributes. We have created these resources over a long period of time from Oxford Advanced Learners’ Dictionary (OALD) [1], VerbNet [2], Princeton WordNet 2.1 [3], LCS database [4], Penn Tree Bank [5], and XTAG lexicon [6]. On the challenging problem of generating interlingua from domain and structure unrestricted English sentences, we are able to demonstrate that the use of these lexical resources makes a difference in terms of accuracy figures.

pdf bib
Simple Syntactic and Morphological Processing Can Help English-Hindi Statistical Machine Translation
Ananthakrishnan Ramanathan | Jayprasad Hegde | Ritesh M. Shah | Pushpak Bhattacharyya | Sasikumar M.
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

pdf bib
Hindi and Marathi to English Cross Language Information Retrieval
Manoj Kumar Chinnakotla | Sagar Ranadive | Om P. Damani | Pushpak Bhattacharyya
Proceedings of the 2nd workshop on Cross Lingual Information Access (CLIA) Addressing the Information Need of Multilingual Societies

pdf bib
Designing a Common POS-Tagset Framework for Indian Languages
Sankaran Baskaran | Kalika Bali | Tanmoy Bhattacharya | Pushpak Bhattacharyya | Girish Nath Jha | Rajendran S | Saravanan K | Sobha L | Subbarao K V.
Proceedings of the 6th Workshop on Asian Language Resources

pdf bib
Hindi Urdu Machine Transliteration using Finite-State Transducers
M G Abbas Malik | Christian Boitet | Pushpak Bhattacharyya
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

pdf bib
Hindi Compound Verbs and their Automatic Extraction
Debasri Chakrabarti | Hemang Mandalia | Ritwik Priya | Vaijayanthi Sarma | Pushpak Bhattacharyya
Coling 2008: Companion volume: Posters

2007

pdf bib
Hindi generation from interlingua
Smriti Singh | Mrugank Dalal | Vishal Vachhani | Pushpak Bhattacharyya | Om P. Damani
Proceedings of Machine Translation Summit XI: Papers

2006

pdf bib
Morphological Richness Offsets Resource Demand – Experiences in Constructing a POS Tagger for Hindi
Smriti Singh | Kuhoo Gupta | Manish Shrivastava | Pushpak Bhattacharyya
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

2005

pdf bib
Semantically Relatable Sets: Building Blocks for Representing Semantics
Rajat Kumar Mohanty | Anupama Dutta | Pushpak Bhattacharyya
Proceedings of Machine Translation Summit X: Papers

2004

pdf bib
Generic Text Summarization Using WordNet
Kedar Bellare | Anish Das Sarma | Atish Das Sarma | Navneet Loiwal | Vaibhav Mehta | Ganesh Ramakrishnan | Pushpak Bhattacharyya
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf bib
A gloss-centered algorithm for disambiguation
Ganesh Ramakrishnan | B. Prithviraj | Pushpak Bhattacharyya
Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text

2003

pdf bib
Question Answering via Bayesian Inference on Lexical Relations
Ganesh Ramakrishnan | Apurva Jadhav | Ashutosh Joshi | Soumen Chakrabarti | Pushpak Bhattacharyya
Proceedings of the ACL 2003 Workshop on Multilingual Summarization and Question Answering
