Proceedings of the 16th International Conference on Natural Language Processing

Dipti Misra Sharma, Pushpak Bhattacharya (Editors)

International Institute of Information Technology, Hyderabad, India
NLP Association of India

Robust Text Classification using Sub-Word Information in Input Word Representations.
Bhanu Prakash Mahanti | Priyank Chhipa | Vivek Sridhar | Vinuthkumar Prasan

Word based deep learning approaches have been used with increasing success recently to solve Natural Language Processing problems like Machine Translation, Language Modelling and Text Classification. However, performance of these word based models is limited by the vocabulary of the training corpus. Alternate approaches using character based models have been proposed to overcome the unseen word problems arising for a variety of reasons. However, character based models fail to capture the sequential relationship of words inherently present in texts. Hence, there is scope for improvement by addressing the unseen word problem while also maintaining the sequential context through word based models. In this work, we propose a method where the input embedding vector incorporates sub-word information but is also suitable for use with models which successfully capture the sequential nature of text. We further attempt to establish that using such a word representation as input makes the model robust to unseen words, particularly arising due to tokenization and spelling errors, which is a common problem in systems where a typing interface is one of the input modalities.

A Deep Ensemble Framework for Fake News Detection and Multi-Class Classification of Short Political Statements
Arjun Roy | Kingshuk Basak | Asif Ekbal | Pushpak Bhattacharyya

Fake news, rumor, incorrect information, and misinformation detection are nowadays crucial issues as these might have serious consequences for our social fabric. Such information is increasing rapidly due to the availability of enormous web information sources including social media feeds, news blogs, online newspapers etc. In this paper, we develop various deep learning models for detecting fake news and classifying it into pre-defined fine-grained categories. First, we develop individual models based on Convolutional Neural Network (CNN) and Bi-directional Long Short Term Memory (Bi-LSTM) networks. The representations obtained from these two models are fed into a Multi-layer Perceptron (MLP) for the final classification. Our experiments on a benchmark dataset show promising results with an overall accuracy of 44.87%, which outperforms the current state of the art.

Building Discourse Parser for Thirukkural
Anita R | Subalalitha C N

Thirukkural is one of the most famous works of Tamil literature. It was written by Thiruvalluvar, and focuses on ethics and morality. It provides all possible solutions to lead a successful and a peaceful life fitting any generation. It has been translated into 82 global languages, which necessitates access to Thirukkural in any language on the World Wide Web (WWW) and computational processing of the Thirukkural. This paper aims at constructing the Thirukkural Discourse Parser, which finds the semantic relations in the Thirukkurals, extracts the hidden meaning in them, and helps in utilizing it in various Natural Language Processing (NLP) applications, such as Summary Generation Systems, Information Retrieval (IR) Systems and Question Answering (QA) Systems. Rhetorical Structure Theory (RST) is one of the discourse theories used in NLP to find the coherence between texts. This paper finds the relations within the Thirukkurals, and the discourse structure is created using the Thirukkural Discourse Parser. The resultant discourse structure of Thirukkural can be indexed and further used by Summary Generation Systems, IR Systems and QA Systems. This helps the end user to access Thirukkural on the WWW and benefit from it. The Thirukkural Discourse Parser has been tested on all 1330 Thirukkurals using precision and recall.

Introducing Aspects of Creativity in Automatic Poetry Generation
Brendan Bena | Jugal Kalita

Poetry Generation involves teaching systems to automatically generate text that resembles poetic work. A deep learning system can learn to generate poetry on its own by training on a corpus of poems and modeling the particular style of language. In this paper, we propose taking an approach that fine-tunes GPT-2, a pre-trained language model, to our downstream task of poetry generation. We extend prior work on poetry generation by introducing creative elements. Specifically, we generate poems that express emotion and elicit the same in readers, and poems that use the language of dreams—called dream poetry. We are able to produce poems that correctly elicit the emotions of sadness and joy 87.5 and 85 percent, respectively, of the time. We produce dreamlike poetry by training on a corpus of texts that describe dreams. Poems from this model are shown to capture elements of dream poetry with scores of no less than 3.2 on the Likert scale. We perform crowdsourced human-evaluation for all our poems. We also make use of the Coh-Metrix tool, outlining metrics we use to gauge the quality of text generated.

Incorporating Sub-Word Level Information in Language Invariant Neural Event Detection
Suhan Prabhu | Pranav Goel | Alok Debnath | Manish Shrivastava

Detection of TimeML events in text has traditionally been done on corpora such as TimeBanks. However, deep learning methods have not been applied to these corpora, because these datasets seldom contain more than 10,000 event mentions. Traditional architectures revolve around highly feature engineered, language specific statistical models. In this paper, we present a Language Invariant Neural Event Detection (ALINED) architecture. ALINED uses an aggregation of sub-word level features as well as lexical and structural information. This is achieved by combining convolution over character embeddings with recurrent layers over contextual word embeddings. We find that our model extracts relevant features for event span identification without relying on language specific features. We compare the performance of our language invariant model to the current state of the art in English, Spanish, Italian and French. We outperform the F1-score of the state of the art in English by 1.65 points. We achieve F1-scores of 84.96, 80.87 and 74.81 on Spanish, Italian and French respectively, which are comparable to the current state of the art for these languages. We also introduce the automatic annotation of events in Hindi, a low resource language, with an F1-score of 77.13.

Event Centric Entity Linking for Hindi News Articles: A Knowledge Graph Based Approach
Pranav Goel | Suhan Prabhu | Alok Debnath | Manish Shrivastava

We describe the development of a knowledge graph from an event annotated corpus by presenting a pipeline that identifies and extracts the relations between entities and events from Hindi news articles. Due to the semantic implications of argument identification for events in Hindi, we use a combined syntactic argument and semantic role identification methodology. To the best of our knowledge, no other architecture exists for this purpose. The extracted combined role information is incorporated in a knowledge graph that can be queried via subgraph extraction for basic questions. The architectures presented in this paper can be used for participant extraction and event-entity linking in most Indo-Aryan languages, due to similar syntactic and semantic properties of event arguments.

Language Modelling with NMT Query Translation for Amharic-Arabic Cross-Language Information Retrieval
Ibrahim Gashaw | H.l Shashirekha

This paper describes our first experiment on Neural Machine Translation (NMT) based query translation for Amharic-Arabic Cross-Language Information Retrieval (CLIR) task to retrieve relevant documents from Amharic and Arabic text collections in response to a query expressed in the Amharic language. We used a pre-trained NMT model to map a query in the source language into an equivalent query in the target language. The relevant documents are then retrieved using a Language Modeling (LM) based retrieval algorithm. Experiments are conducted on four conventional IR models, namely Uni-gram and Bi-gram LM, Probabilistic model, and Vector Space Model (VSM). The results obtained illustrate that the proposed Uni-gram LM outperforms all other models for both Amharic and Arabic language document collections.
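As an illustration of the unigram language-model retrieval step mentioned in this abstract, the following is a minimal sketch of query-likelihood ranking. The Jelinek-Mercer smoothing, the weight `lam`, the function name `query_log_likelihood` and the toy English documents are all assumptions made for the example; the abstract does not state which smoothing or data the authors used, and in the described pipeline the query would first be translated by the NMT model.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection, lam=0.5):
    """Score a document by the log-probability of the query under a
    unigram model of the document, interpolated with the collection model
    (Jelinek-Mercer smoothing; the smoothing choice is an assumption)."""
    doc_counts = Counter(doc)
    coll_counts = Counter(collection)
    score = 0.0
    for term in query:
        p_doc = doc_counts[term] / len(doc)
        p_coll = coll_counts[term] / len(collection)
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:
            # Term never seen anywhere: the query is impossible under the model.
            return float("-inf")
        score += math.log(p)
    return score


# Toy, hypothetical collection (already tokenized).
docs = {
    "d1": ["rain", "floods", "city"],
    "d2": ["team", "wins", "match"],
}
collection = [term for doc in docs.values() for term in doc]
query = ["floods", "city"]

# Rank documents from most to least likely to have generated the query.
ranking = sorted(
    docs,
    key=lambda name: query_log_likelihood(query, docs[name], collection),
    reverse=True,
)
```

Here `ranking` lists the document identifiers in descending order of query likelihood; a bigram variant would replace the term counts with counts over adjacent term pairs.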

Non-native Accent Partitioning for Speakers of Indian Regional Languages
Radha Krishna Guntur | Krishnan Ramakrishnan | Vinay Kumar Mittal

Acoustic features extracted from the speech signal can help in identifying multiple kinds of speaker-related information such as geographical origin, regional accent and nativity. In this paper, classification of native speakers of South Indian languages is carried out based upon the accent of their non-native language, i.e., English. Four South Indian languages are examined: Kannada, Malayalam, Tamil, and Telugu. A database of English speech from the native speakers of these languages, along with native language speech data, was collected from a non-overlapping set of speakers. Segment level acoustic features F0 and Mel-frequency cepstral coefficients (MFCCs) are used. Accent partitioning of non-native English speech data is carried out using multiple classifiers: k-nearest neighbour (KNN), linear discriminant analysis (LDA) and support vector machine (SVM), for validation and comparison of results. Classification accuracies of 86.6% are observed using KNN, and 89.2% or more than 90% using SVM classifier. A study of acoustic feature F0 contour, related to L2 intonation, showed that native speakers of Kannada are quite distinct from those of Tamil or Telugu. It is also observed that identification of Malayalam and Kannada speakers from their English speech accent is relatively easier than identification of Telugu or Tamil speakers.

A little perturbation makes a difference: Treebank augmentation by perturbation improves transfer parsing
Ayan Das | Sudeshna Sarkar

We present an approach for cross-lingual transfer of dependency parser so that the parser trained on a single source language can more effectively cater to diverse target languages. In this work, we show that the cross-lingual performance of the parsers can be enhanced by over-generating the source language treebank. For this, the source language treebank is augmented with its perturbed version in which controlled perturbation is introduced in the parse trees by stochastically reordering the positions of the dependents with respect to their heads while keeping the structure of the parse trees unchanged. This enables the parser to capture diverse syntactic patterns in addition to those that are found in the source language. The resulting parser is found to more effectively parse target languages with different syntactic structures. With English as the source language, our system shows an average improvement of 6.7% and 7.7% in terms of UAS and LAS over 29 target languages compared to the baseline single source parser trained using unperturbed source language treebank. This also results in significant improvement over the transfer parser proposed by (CITATION) that involves an “order-free” parser algorithm.
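The perturbation idea in this abstract (reordering dependents around their heads while leaving the tree structure unchanged) can be sketched as follows. This is only an illustrative sketch: the uniform random reordering, the function name `perturb_order` and the toy sentence are assumptions, whereas the paper describes a controlled perturbation whose details are not given here.

```python
import random


def perturb_order(heads, rng):
    """Return a new left-to-right order of token indices.

    heads[i] is the index of token i's head, or -1 for the root.
    Head-dependent links are untouched; only the position of each
    dependent relative to its head and siblings is resampled.
    """
    children = {i: [] for i in range(len(heads))}
    root = None
    for token, head in enumerate(heads):
        if head == -1:
            root = token
        else:
            children[head].append(token)

    def linearize(node):
        deps = children[node][:]
        rng.shuffle(deps)                  # random sibling order
        split = rng.randint(0, len(deps))  # number of dependents placed before the head
        order = []
        for dep in deps[:split]:
            order.extend(linearize(dep))
        order.append(node)
        for dep in deps[split:]:
            order.extend(linearize(dep))
        return order

    return linearize(root)


# Toy, hypothetical sentence: "reads" is the root, "old" modifies "books".
words = ["She", "reads", "old", "books"]
heads = [1, -1, 3, 1]
order = perturb_order(heads, random.Random(0))
perturbed_sentence = [words[i] for i in order]
```

Because each subtree is emitted as one contiguous block, the perturbed sentence stays projective and every token keeps its original head; an augmented treebank would contain both the original and the perturbed versions of each tree.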

Autism Speech Analysis using Acoustic Features
Abhijit Mohanta | Vinay Kumar Mittal

Autism speech has distinct acoustic patterns, different from normal speech. Analyzing acoustic features derived from the speech of children affected with autism spectrum disorder (ASD) can help its early detection. In this study, a comparative analysis of the discriminating acoustic characteristics is carried out between the speech of ASD affected and normal children, from a speech production point of view. Datasets of English speech of children affected with ASD and normal children were recorded. Changes in the speech production characteristics are examined using the excitation source features F0 and strength of excitation (SoE), the vocal tract filter features formants (F1 to F5) and dominant frequencies (FD1, FD2), and the combined source-filter features signal energy and zero-crossing rate. Changes in the acoustic features are compared in the five vowel regions of the English language. Significant changes in a few acoustic features are observed for ASD affected speech as compared to normal speech. The differences between the mean values of the formants and dominant frequencies, for ASD affected and normal children, are highest for vowel /i/. This indicates that ASD affected children possibly have more difficulty in speaking words with the vowel /i/. This study can be helpful towards developing systems for automatic detection of ASD.

A Survey on Ontology Enrichment from Text
Vivek Iyer | Lalit Mohan | Mehar Bhatia | Y. Raghu Reddy

Increased internet bandwidth at low cost is leading to the creation of large volumes of unstructured data. This data explosion opens up opportunities for the creation of a variety of data-driven intelligent systems, such as the Semantic Web. Ontologies form one of the most crucial layers of semantic web, and the extraction and enrichment of ontologies given this data explosion becomes an inevitable research problem. In this paper, we survey the literature on semi-automatic and automatic ontology extraction and enrichment and classify them into four broad categories based on the approach. Then, we proceed to narrow down four algorithms from each of these categories, implement and analytically compare them based on parameters like context relevance, efficiency and precision. Lastly, we propose a Long Short Term Memory Networks (LSTM) based deep learning approach to try and overcome the gaps identified in these approaches.

Sanskrit Segmentation revisited
Sriram Krishnan | Amba Kulkarni

Computationally analyzing Sanskrit texts requires proper segmentation in the initial stages. There have been various tools developed for Sanskrit text segmentation. Of these, Gérard Huet’s Reader in the Sanskrit Heritage Engine analyzes the input text and segments it based on the word parameters - phases like iic, ifc, Pr, Subst, etc., and sandhi (or transition) that takes place at the end of a word with the initial part of the next word. It also lists all the possible solutions, differentiating them with the help of the phases. The phases and their analyses have their use in the domain of sentential parsers. In segmentation, though, they are not used beyond deciding whether the words formed with the phases are morphologically valid. This paper tries to modify the above segmenter by ignoring the phase details (except for a few cases), and also proposes a probability function to prioritize the list of solutions to bring up the most valid solutions at the top.

Integrating Lexical Knowledge in Word Embeddings using Sprinkling and Retrofitting
Aakash Srinivasan | Harshavardhan Kamarthi | Devi Ganesan | Sutanu Chakraborti

Neural network based word embeddings, such as Word2Vec and GloVe, are purely data driven in that they capture the distributional information about words from the training corpus. Past works have attempted to improve these embeddings by incorporating semantic knowledge from lexical resources like WordNet. Some techniques like retrofitting modify word embeddings in the post-processing stage while some others use a joint learning approach by modifying the objective function of neural networks. In this paper, we discuss two novel approaches for incorporating semantic knowledge into word embeddings. In the first approach, we take advantage of Levy et al.’s work, which showed that using SVD based methods on the co-occurrence matrix provides similar performance to neural network based embeddings. We propose a ‘sprinkling’ technique to add semantic relations to the co-occurrence matrix directly before factorization. In the second approach, WordNet similarity scores are used to improve the retrofitting method. We evaluate the proposed methods in both intrinsic and extrinsic tasks and observe significant improvements over the baselines in many of the datasets.
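To make the retrofitting step concrete, here is a minimal sketch of the standard iterative retrofitting update, in which each word vector is pulled towards its lexical neighbours while staying close to its original value. Using a per-edge similarity score as the neighbour weight is an assumption about how WordNet similarity might enter the update, not a description of the authors' method; the function name `retrofit`, the weight `alpha` and the toy vectors are likewise hypothetical.

```python
def retrofit(vectors, neighbours, alpha=1.0, iterations=10):
    """Iteratively move each vector towards its lexical neighbours.

    vectors:    word -> list of floats (original embeddings)
    neighbours: word -> {neighbour word: edge weight}; the edge weight
                stands in for a lexical similarity score (an assumption)
    alpha:      weight that keeps a word close to its original vector
    """
    fitted = {word: vec[:] for word, vec in vectors.items()}
    for _ in range(iterations):
        for word, edges in neighbours.items():
            edges = {n: w for n, w in edges.items() if n in fitted}
            if word not in fitted or not edges:
                continue
            dim = len(vectors[word])
            total = alpha + sum(edges.values())
            # Weighted average of the original vector and current neighbour vectors.
            updated = [alpha * vectors[word][k] for k in range(dim)]
            for neighbour, weight in edges.items():
                for k in range(dim):
                    updated[k] += weight * fitted[neighbour][k]
            fitted[word] = [value / total for value in updated]
    return fitted


# Toy, hypothetical embeddings and a two-word synonym graph.
vectors = {"happy": [1.0, 0.0], "glad": [0.0, 1.0], "table": [-1.0, 0.0]}
neighbours = {"happy": {"glad": 0.9}, "glad": {"happy": 0.9}}
fitted = retrofit(vectors, neighbours)
```

In this toy run the vectors for "happy" and "glad" move closer together, while "table", which has no lexical neighbours, keeps its original vector.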

Robust Deep Learning Based Sentiment Classification of Code-Mixed Text
Siddhartha Mukherjee | Vinuthkumar Prasan | Anish Nediyanchath | Manan Shah | Nikhil Kumar

India is one of the unique countries in the world that has a legacy of linguistic diversity. Most of its languages are influenced by English, which causes a large presence of code-mixed text on social media. The enormous presence of this code-mixed text provides an important research area for Natural Language Processing (NLP). This paper proposes a novel attention based deep learning technique for Sentiment Classification on Code-Mixed Text (ACCMT) of Hindi-English. The proposed architecture uses a fusion of character and word features. The non-availability of suitable word embeddings to represent code-mixed text is another important hurdle for this class of NLP tasks. This paper also proposes a novel technique for preparing word embeddings of code-mixed text. The embedding is prepared from two word embeddings trained separately on Romanized Hindi and English. This embedding is further used in the proposed deep learning based architecture for robust classification. The proposed technique achieves 71.97% accuracy, which exceeds the baseline accuracy.

Dataset for Aspect Detection on Mobile reviews in Hindi
Pruthwik Mishra | Ayush Joshi | Dipti Sharma

In recent years Opinion Mining has become one of the very interesting fields of Language Processing. To extract the gist of a sentence in a shorter and efficient manner is what opinion mining provides. In this paper we focus on detecting aspects for a particular domain. While relevant research work has been done in aspect detection in resource rich languages like English, we are trying to do the same in a relatively resource poor Hindi language. Here we present a corpus of mobile reviews which are labelled with carefully curated aspects. The motivation behind Aspect detection is to get information on a finer level about the data. In this paper we identify all aspects related to the gadget which are present on the reviews given online on various websites. We also propose baseline models to detect aspects in Hindi text after conducting various experiments.

Multi-linguality helps: Event-Argument Extraction for Disaster Domain in Cross-lingual and Multi-lingual setting
Zishan Ahmad | Deeksha Varshney | Asif Ekbal | Pushpak Bhattacharyya

Automatic extraction of disaster-related events and their arguments from natural language text is vital for building a decision support system for crisis management. Event extraction from various news sources is a well-explored area for this objective. However, extracting events alone, without any context, provides only partial help for this purpose. Extracting related arguments like Time, Place, Casualties, etc., provides a complete picture of the disaster event. In this paper, we create a disaster domain dataset in Hindi by annotating disaster-related event and arguments. We also obtain equivalent datasets for Bengali and English from a collaboration. We build a multi-lingual deep learning model for argument extraction in all the three languages. We also compare our multi-lingual system with a similar baseline mono-lingual system trained for each language separately. It is observed that a single multi-lingual system is able to compensate for lack of training data, by using joint training of dataset from different languages in shared space, thus giving a better overall result.

Development of POS tagger for English-Bengali Code-Mixed data
Tathagata Raha | Sainik Mahata | Dipankar Das | Sivaji Bandyopadhyay

Code-mixed texts are widespread nowadays due to the advent of social media. Since these texts combine two languages to formulate a sentence, it gives rise to various research problems related to Natural Language Processing. In this paper, we try to excavate one such problem, namely, Parts of Speech tagging of code-mixed texts. We have built a system that can POS tag English-Bengali code-mixed data where the Bengali words were written in Roman script. Our approach initially involves the collection and cleaning of English-Bengali code-mixed tweets. These tweets were used as a development dataset for building our system. The proposed system is a modular approach that starts by tagging individual tokens with their respective languages and then passes them to different POS taggers, designed for different languages (English and Bengali, in our case). Tags given by the two systems are later joined together and the final result is then mapped to a universal POS tag set. Our system was checked using 100 manually POS tagged code-mixed sentences and it returned an accuracy of 75.29%.
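The modular pipeline described in this abstract (token-level language tagging, routing to a language-specific tagger, then mapping to a universal tag set) might be sketched as below. Everything in the example is a toy assumption: the dictionary lookups stand in for real language identification and POS tagging components, and the lexicons, tag names and function names are hypothetical rather than taken from the paper.

```python
# Hypothetical lexicons standing in for the two language-specific taggers.
EN_TAGS = {"i": "PRP", "love": "VBP", "this": "DT"}
BN_TAGS = {"gaan": "NN", "khub": "RB", "bhalo": "JJ"}  # romanized Bengali

# Hypothetical mapping from fine-grained tags to a universal tag set.
TO_UNIVERSAL = {
    "PRP": "PRON", "VBP": "VERB", "DT": "DET",
    "NN": "NOUN", "JJ": "ADJ", "RB": "ADV",
}


def identify_language(token):
    """Toy language identifier: Bengali if the token is in the Bengali lexicon."""
    return "bn" if token.lower() in BN_TAGS else "en"


def tag_code_mixed(tokens):
    """Tag each token with its language and a universal POS tag."""
    tagged = []
    for token in tokens:
        lang = identify_language(token)
        lexicon = BN_TAGS if lang == "bn" else EN_TAGS
        fine_tag = lexicon.get(token.lower())
        # Unknown words fall back to the catch-all universal tag "X".
        universal_tag = TO_UNIVERSAL.get(fine_tag, "X")
        tagged.append((token, lang, universal_tag))
    return tagged


result = tag_code_mixed(["I", "love", "this", "gaan"])
```

Keeping language identification, tagging and tag mapping as separate steps means each component can be replaced independently, for example by swapping the lookup tables for trained models.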

Towards Handling Verb Phrase Ellipsis in English-Hindi Machine Translation
Niyati Bafna | Dipti Sharma

English-Hindi machine translation systems have difficulty interpreting verb phrase ellipsis (VPE) in English, and commit errors in translating sentences with VPE. We present a solution and theoretical backing for the treatment of English VPE, with the specific scope of enabling English-Hindi MT, based on an understanding of the syntactical phenomenon of verb-stranding verb phrase ellipsis in Hindi (VVPE). We implement a rule-based system to perform the following sub-tasks: 1) verb ellipsis identification in the English source sentence, 2) elided verb phrase head identification, 3) identification of the verb segment which needs to be induced at the site of ellipsis, and 4) modification of the input sentence, i.e. resolving VPE and inducing the required verb segment. This system obtains 94.83 percent precision and 83.04 percent recall on subtask (1), tested on 3900 sentences from the BNC corpus. This is competitive with state-of-the-art results. We measure accuracy of subtasks (2) and (3) together, and obtain a 91 percent accuracy on 200 sentences taken from the WSJ corpus. Finally, in order to indicate the relevance of ellipsis handling to MT, we carried out a manual analysis of the English-Hindi MT outputs of 100 sentences after passing them through our system. We set up a basic metric (1-5) for this evaluation, where 5 indicates drastic improvement, and obtained an average of 3.55. As far as we know, this is the first attempt to target ellipsis resolution in the context of improving English-Hindi machine translation.

A Multi-task Model for Multilingual Trigger Detection and Classification
Sovan Kumar Sahoo | Saumajit Saha | Asif Ekbal | Pushpak Bhattacharyya

In this paper we present a deep multi-task learning framework for multilingual event and argument trigger detection and classification. In our current work, we identify detection and classification of both event and argument triggers as related tasks and follow a multi-tasking approach to solve them simultaneously, in contrast to previous works where these tasks were solved separately or only some of them were learned jointly. We evaluate the proposed approach with multiple low-resource Indian languages. As there were no datasets available for the Indian languages, we have annotated disaster related news data crawled from the online news portal for different low-resource Indian languages for our experiments. Our empirical evaluation shows that the multi-task model performs better than the single task model, and that classification helps in trigger detection and vice-versa.

Converting Sentiment Annotated Data to Emotion Annotated Data
Manasi Kulkarni | Pushpak Bhattacharyya

Existing supervised solutions for emotion classification demand a large amount of emotion annotated data. Such resources may not be available for many languages. However, it is common to have sentiment annotated data available in these languages. The sentiment information (+1 or -1) is useful to distinguish between positive and negative emotions. In this paper, we propose an unsupervised approach for emotion recognition by taking advantage of the sentiment information. Given a sentence and its sentiment information, the task is to recognize the best possible emotion for it. For every sentence, the semantic relatedness between the words of the sentence and a set of emotion-specific words is calculated using cosine similarity. An emotion vector representing the emotion score for each emotion category of Ekman’s model is created. It is further improved with dependency relations, and the best possible emotion is predicted. The results show a significant improvement in f-score values for text with sentiment information as input over our baseline, which uses text without sentiment information. We report the weighted f-score on three different datasets with Ekman’s emotion model. This supports the claim that by leveraging the sentiment value, better emotion annotated data can be created.
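The scoring step in this abstract (cosine similarity between sentence words and emotion-specific words, restricted by the sentiment label) can be illustrated with the sketch below. The two-dimensional word vectors, the seed word lists, the polarity assignment and the averaging of similarities are all assumptions made for the example, and only two of Ekman's categories are shown for brevity; the dependency-based refinement mentioned in the abstract is not covered.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy, hypothetical word vectors.
VECTORS = {
    "won": [0.9, 0.1], "prize": [0.8, 0.3],
    "delighted": [1.0, 0.0], "cheerful": [0.9, 0.2],
    "grief": [0.0, 1.0], "mourning": [0.1, 0.9],
}
# Hypothetical emotion-specific seed words and their assumed polarity.
EMOTION_WORDS = {"joy": ["delighted", "cheerful"], "sadness": ["grief", "mourning"]}
POLARITY = {"joy": +1, "sadness": -1}


def emotion_scores(tokens, sentiment):
    """Score only the emotions whose polarity matches the sentiment label."""
    scores = {}
    for emotion, seeds in EMOTION_WORDS.items():
        if POLARITY[emotion] != sentiment:
            continue  # the sentiment label rules this emotion out
        sims = [
            cosine(VECTORS[token], VECTORS[seed])
            for token in tokens if token in VECTORS
            for seed in seeds
        ]
        scores[emotion] = sum(sims) / len(sims) if sims else 0.0
    return scores


scores = emotion_scores(["won", "prize"], sentiment=+1)
best_emotion = max(scores, key=scores.get)
```

The sentiment label acts as a filter here: with `sentiment=+1` only positive emotions receive a score, which shrinks the set of candidates before the best one is chosen.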

Towards measuring lexical complexity in Malayalam
Richard Shallam | Ashwini Vaidya

This paper proposes a metric to quantify lexical complexity in Malayalam. The metric utilizes word frequency, orthography and morphology as the three factors affecting visual word recognition in Malayalam. Malayalam differs from other Indian languages due to its agglutinative morphology and orthography, which are incorporated into our model. The predictions made by our model are then evaluated against reaction times in a lexical decision task. We find that reaction times are predicted by frequency, morphological complexity and script complexity. We also explore the interactions between morphological complexity with frequency and script in our results. To the best of our knowledge, this is the first study on lexical complexity in Malayalam.

Kunji : A Resource Management System for Higher Productivity in Computer Aided Translation Tools
Priyank Gupta | Manish Shrivastava | Dipti Misra Sharma | Rashid Ahmad

Complex NLP applications, such as machine translation systems, utilize various kinds of resources namely lexical, multiword, domain dictionaries, maps and rules etc. Similarly, translators working on Computer Aided Translation workbenches, also require help from various kinds of resources - glossaries, terminologies, concordances and translation memory in the workbenches in order to increase their productivity. Additionally, translators have to look away from the workbenches for linguistic resources like Named Entities, Multiwords, lexical and lexeme dictionaries in order to get help, as the available resources like concordances, terminologies and glossaries are often not enough. In this paper we present Kunji, a resource management system for translation workbenches and MT modules. This system can be easily integrated in translation workbenches and can also be used as a management tool for resources for MT systems. The described resource management system has been integrated in a translation workbench Transzaar. We also study the impact of providing this resource management system along with linguistic resources on the productivity of translators for English-Hindi language pair. When the linguistic resources like lexeme, NER and MWE dictionaries were made available to translators in addition to their regular translation memories, concordances and terminologies, their productivity increased by 15.61%.

Identification of Synthetic Sentence in Bengali News using Hybrid Approach
Soma Das | Sanjay Chatterji

Often sentences of correct news are either biased towards a particular person, group of persons or party, or distorted to add some sentiment or importance to them. Engaged readers are often not able to extract the inherent meaning of such synthetic sentences. In Bengali, the news contents of the synthetic sentences are presented in such a rich way that it usually becomes difficult to identify the synthetic part. We have used machine learning algorithms to classify Bengali news sentences into synthetic and legitimate and then used some rule-based postprocessing on each of these models. Finally, we have developed a voting based combination of these models to build a hybrid model for Bengali synthetic sentence identification. This is a new task and therefore we could not compare it with any existing work in the field. Identification of such types of sentences may be used to improve the performance of identifying fake news and satire news, and thus to identify molecular level bias in news articles.

An LSTM-Based Deep Learning Approach for Detecting Self-Deprecating Sarcasm in Textual Data
Ashraf Kamal | Muhammad Abulaish

Self-deprecating sarcasm is a special category of sarcasm, which is nowadays popular and useful for many real-life applications, such as brand endorsement, product campaign, digital marketing, and advertisement. The self-deprecating style of campaign and marketing strategy is mainly adopted to excel brand endorsement and product sales value. In this paper, we propose an LSTM-based deep learning approach for detecting self-deprecating sarcasm in textual data. To the best of our knowledge, there is no prior work related to self-deprecating sarcasm detection using deep learning techniques. Starting with a filtering step to identify self-referential tweets, the proposed approach adopts a deep learning model using LSTM for detecting self-deprecating sarcasm. The proposed approach is evaluated over three Twitter datasets and performs significantly better in terms of precision, recall, and f-score.

Unsung Challenges of Building and Deploying Language Technologies for Low Resource Language Communities
Pratik Joshi | Christain Barnes | Sebastin Santy | Simran Khanuja | Sanket Shah | Anirudh Srinivasan | Satwik Bhattamishra | Sunayana Sitaram | Monojit Choudhury | Kalika Bali

In this paper, we examine and analyze the challenges associated with developing and introducing language technologies to low-resource language communities. While doing so, we bring to light the successes and failures of past work in this area, the challenges faced in that work, and what it has achieved. Throughout this paper, we take a problem-facing approach and describe essential factors which the success of such technologies hinges upon. We present the various aspects in a manner which clarifies and lays out the different tasks involved, which can aid organizations looking to make an impact in this area. We take the example of Gondi, an extremely low-resource Indian language, to reinforce and complement our discussion.

DRCoVe: An Augmented Word Representation Approach using Distributional and Relational Context
Md. Aslam Parwez | Muhammad Abulaish | Mohd Fazil

Word representation using the distributional information of words from a sizeable corpus is considered efficacious in many natural language processing and text mining applications. However, distributional representation of a word is unable to capture distant relational knowledge, representing the relational semantics. In this paper, we propose a novel word representation approach using distributional and relational contexts, DRCoVe, which augments the distributional representation of a word using the relational semantics extracted as syntactic and semantic association among entities from the underlying corpus. Unlike existing approaches that use external knowledge bases representing the relational semantics for enhanced word representation, DRCoVe uses typed dependencies (aka syntactic dependencies) to extract relational knowledge from the underlying corpus. The proposed approach is applied over a biomedical text corpus to learn word representation and compared with GloVe, which is one of the most popular word embedding approaches. The evaluation results on various benchmark datasets for word similarity and word categorization tasks demonstrate the effectiveness of DRCoVe over the GloVe.

A Deep Learning Approach for Automatic Detection of Fake News
Tanik Saikh | Arkadipta De | Asif Ekbal | Pushpak Bhattacharyya

Fake news detection is a very prominent and essential task in the field of journalism. This challenging problem is seen so far in the field of politics, but it could be even more challenging when it is to be determined in the multi-domain platform. In this paper, we propose two effective models based on deep learning for solving fake news detection problem in online news contents of multiple domains. We evaluate our techniques on the two recently released datasets, namely Fake News AMT and Celebrity for fake news detection. The proposed systems yield encouraging performance, outperforming the current hand-crafted feature engineering based state-of-the-art system with a significant margin of 3.08% and 9.3% by the two models, respectively. In order to exploit the datasets, available for the related tasks, we perform cross-domain analysis (model trained on FakeNews AMT and tested on Celebrity and vice versa) to explore the applicability of our systems across the domains.

Samajh-Boojh: A Reading Comprehension system in Hindi
Shalaka Vaidya | Hiranmai Sri Adibhatla | Radhika Mamidi

This paper presents a novel approach designed to answer questions on a reading comprehension passage. It is an end-to-end system which first focuses on comprehending the given passage, converting the unstructured passage into structured data, and later proceeds to answer the questions related to the passage using solely the aforementioned structured data. To the best of our knowledge, the proposed design is the first of its kind to account for the entire process of comprehending the passage and then answering the questions associated with it. The comprehension stage converts the passage into a Discourse Collection that comprises the relations shared amongst logical sentences in the given passage along with the key characteristics of each sentence. This design has applications in the academic domain and in query comprehension in speech systems, among others.