pdf
bib
Proceedings of the 19th International Conference on Natural Language Processing (ICON)
Md. Shad Akhtar
|
Tanmoy Chakraborty
pdf
bib
abs
EdgeGraph: Revisiting Statistical Measures for Language Independent Keyphrase Extraction Leveraging on Bi-grams
Muskan Garg
|
Amit Gupta
The NLP research community resorts to the conventional Word Co-occurrence Network (WCN) for keyphrase extraction, using random walk sampling mechanisms such as the PageRank algorithm to identify candidate words/phrases. We argue that the WCN is by nature a path-based network and does not follow the core-periphery structure observed in web-page linking networks. Thus, language networks leveraging bi-grams may represent better semantics for keyphrase extraction using random walks. In this work, we use bi-grams as nodes and link adjacent bi-grams together to generate an EdgeGraph. We validate our method over four publicly available datasets to demonstrate the effectiveness of our simple yet effective language network, and our extensive experiments show that random walks over the EdgeGraph representation perform better than those over the conventional WCN. We make our code and supplementary materials available on GitHub.
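As a rough illustration of the construction described above (a sketch, not the authors’ released code), the following builds an EdgeGraph with networkx: bi-grams become nodes, adjacent bi-grams are linked, and PageRank ranks candidates. The naive tokenisation and the top-k cutoff are our assumptions.

```python
import networkx as nx

def edgegraph_keyphrases(text, top_k=5):
    tokens = text.lower().split()            # naive whitespace tokenisation (assumption)
    bigrams = list(zip(tokens, tokens[1:]))  # bi-grams act as nodes
    graph = nx.Graph()
    graph.add_nodes_from(bigrams)
    # adjacent bi-grams (overlapping by one word) are linked together
    for left, right in zip(bigrams, bigrams[1:]):
        graph.add_edge(left, right)
    scores = nx.pagerank(graph)              # random walk over the EdgeGraph
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [" ".join(bigram) for bigram in ranked[:top_k]]

print(edgegraph_keyphrases(
    "keyphrase extraction from language networks using random walk sampling"))
```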
pdf
bib
abs
Massively Multilingual Language Models for Cross Lingual Fact Extraction from Low Resource Indian Languages
Bhavyajeet Singh
|
Siri Venkata Pavan Kumar Kandru
|
Anubhav Sharma
|
Vasudeva Varma
Massive knowledge graphs like Wikidata attempt to capture world knowledge about multiple entities. Recent approaches concentrate on automatically enriching these KGs from text. However, a lot of information present in the form of natural text in low resource languages is often missed out. Cross Lingual Information Extraction aims at extracting factual information in the form of English triples from low resource Indian language text. Despite its massive potential, progress made on this task lags behind Monolingual Information Extraction. In this paper, we propose the task of Cross Lingual Fact Extraction (CLFE) from text and devise an end-to-end generative approach for the same, which achieves an overall F1 score of 77.46.
pdf
bib
abs
Analysing Syntactic and Semantic Features in Pre-trained Language Models in a Fully Unsupervised Setting
Necva Bölücü
|
Burcu Can
Transformer-based pre-trained language models (PLMs) have been used across NLP tasks with great success. This has led to the question of whether we can transfer this knowledge to syntactic or semantic parsing in a completely unsupervised setting. In this study, we leverage PLMs as a source of external knowledge to build a fully unsupervised parser for semantic, constituency and dependency parsing. We analyse the results for English, German, French, and Turkish to understand the impact of the PLMs on different languages for syntactic and semantic parsing. We visualize the attention layers and heads in PLMs to understand what information is learned throughout the layers and attention heads for different levels of parsing tasks. The results obtained from dependency, constituency, and semantic parsing are similar to each other, and the middle layers and the ones closer to the final layers carry more syntactic and semantic information.
pdf
bib
abs
Knowledge Enhanced Deep Learning Model for Radiology Text Generation
Kaveri Kale
|
Pushpak Bhattacharya
|
Aditya Shetty
|
Milind Gune
|
Kush Shrivastava
|
Rustom Lawyer
|
Spriha Biswas
Manual radiology report generation is a time-consuming task. First, radiologists prepare brief notes while carefully examining the imaging report. Then, radiologists or their secretaries create a full-text report that describes the findings by referring to the notes. Automatic radiology report generation is the primary objective of this research. The central part of automatic radiology report generation is generating the finding section (main body of the report) from the radiologists’ notes. In this research, we suggest a knowledge graph (KG) enhanced radiology text generator that can provide additional domain-specific information. Our approach uses a KG-BART model to generate a description of clinical findings (referred to as pathological description) from radiologists’ brief notes. We have constructed a parallel dataset of radiologists’ notes and corresponding pathological descriptions to train the KG-BART model. Our findings demonstrate that, compared to the BART-large and T5-large models, the BLEU-2 score of the pathological descriptions generated by our approach is raised by 4% and 9%, and the ROUGE-L score by 2% and 2%, respectively. Our analysis shows that the KG-BART model for radiology text generation outperforms the T5-large model. Furthermore, we apply our proposed radiology text generator for whole radiology report generation.
pdf
bib
abs
Named Entity Recognition for Code-Mixed Kannada-English Social Media Data
Poojitha Nandigam
|
Abhinav Appidi
|
Manish Shrivastava
Named Entity Recognition (NER) is a critical task in the field of Natural Language Processing (NLP) and is also a sub-task of Information Extraction. There has been a significant amount of work done in entity extraction and Named Entity Recognition for resource-rich languages. Entity extraction from code-mixed social media data like tweets from Twitter complicates the problem due to the unstructured, informal, and incomplete nature of the information available in tweets. Here, we present work on NER for a Kannada-English code-mixed social media corpus with corresponding named entity tags referring to Organisation (Org), Person (Pers), and Location (Loc). We experimented with machine learning classification models like Conditional Random Fields (CRF), Bi-LSTM, and Bi-LSTM-CRF on our corpus.
pdf
bib
abs
PAR: Persona Aware Response in Conversational Systems
Abhijit Nargund
|
Sandeep Pandey
|
Jina Ham
To make human-computer interaction more user-friendly and persona-aligned, detecting the user’s persona is of utmost significance. Towards achieving this objective, we describe a novel approach that selects the persona of a user from a pre-determined list of personas and utilizes it to generate personalized responses. This is achieved in two steps. First, the closest matching persona is detected for the user from a set of pre-determined personas. The second step involves the use of a fine-tuned natural language generation (NLG) model to generate persona-compliant responses. Through experiments, we demonstrate that the proposed architecture generates better responses than current approaches by using the detected persona. Experimental evaluation on the PersonaChat dataset demonstrates notable performance in terms of perplexity and F1-score.
pdf
bib
abs
IAEmp: Intent-aware Empathetic Response Generation
Mrigank Tiwari
|
Vivek Dahiya
|
Om Mohanty
|
Girija Saride
In the domain of virtual assistants or conversational systems, it is important to empathise with the user. Being empathetic involves understanding the emotion of the ongoing dialogue and responding to the situation with empathy. We propose a novel approach for empathetic response generation that leverages predicted intents for the future response and prompts the encoder-decoder model to improve empathy in generated responses. Our model exploits the combination of dialogues and their respective emotions to generate empathetic responses. As the responding intent plays an important part in generation, we also employ one or more intents to generate responses with relevant empathy. We achieve improvements on both human and automated metrics compared to the baselines.
pdf
bib
abs
KILDST: Effective Knowledge-Integrated Learning for Dialogue State Tracking using Gazetteer and Speaker Information
Hyungtak Choi
|
Hyeonmok Ko
|
Gurpreet Kaur
|
Lohith Ravuru
|
Kiranmayi Gandikota
|
Manisha Jhawar
|
Simma Dharani
|
Pranamya Patil
Dialogue State Tracking (DST) is a core research area in dialogue systems and has received much attention. As a step toward conversational AI that extracts and recommends information from dialogues between users, it is also necessary to define a new problem that deals with such dialogues. We therefore introduce a new task: DST from dialogue between users about scheduling an event (DST-USERS). The DST-USERS task is much more challenging since it requires the model to understand and track dialogue states in the dialogue between users, as well as to understand who suggested the schedule and who agreed to the proposed schedule. To facilitate DST-USERS research, we develop dialogue datasets between users who plan a schedule. The annotated slot values which need to be extracted from the dialogue are date, time, and location. Previous approaches, such as Machine Reading Comprehension (MRC) and traditional DST techniques, have not achieved good results in our extensive evaluations. By adopting the knowledge-integrated learning method, we achieve exceptional results. The proposed model architecture combines gazetteer features and speaker information efficiently. Our evaluations on these datasets show that our model outperforms the baseline model.
pdf
bib
abs
Efficient Dialog State Tracking Using Gated-Intent based Slot Operation Prediction for On-device Dialog Systems
Pranamya Patil
|
Hyungtak Choi
|
Ranjan Samal
|
Gurpreet Kaur
|
Manisha Jhawar
|
Aniruddha Tammewar
|
Siddhartha Mukherjee
Conversational agents on smart devices need to be efficient concerning latency in responding, for enhanced user experience and real-time utility. This demands on-device processing (as on-device processing is quicker), which limits the availability of resources such as memory and compute. Most state-of-the-art Dialog State Tracking (DST) systems make use of large pre-trained language models that require high resource computation, typically available on high-end servers. In contrast, on-device systems are memory-efficient, have reduced latency, preserve privacy, and do not rely on the network. A recent approach tries to reduce the latency by splitting the task of slot prediction into two subtasks: State Operation Prediction (SOP), which selects an action for each slot, and Slot Value Generation (SVG), which produces values for the identified slots. SVG, being computationally expensive, is performed only for the small subset of actions predicted by SOP. Motivated by this optimization technique, we build a similar system and use multi-task learning to achieve significant improvements in DST performance while optimizing resource consumption. We propose a quadruplet (Domain, Intent, Slot, and Slot Value) based DST, which significantly boosts the performance. We experiment with different techniques to fuse different layers of representations from the intent and slot prediction tasks. We obtain a best joint accuracy of 53.3% on the publicly available MultiWOZ 2.2 dataset, using BERT-medium along with a gating mechanism. We also compare the cost efficiency of our system with other large models and find that our system is best suited for an on-device production environment.
pdf
bib
abs
Emotion-guided Cross-domain Fake News Detection using Adversarial Domain Adaptation
Arjun Choudhry
|
Inder Khatri
|
Arkajyoti Chakraborty
|
Dinesh Vishwakarma
|
Mukesh Prasad
Recent works on fake news detection have shown the efficacy of using emotions as a feature or emotions-based features for improved performance. However, the impact of these emotion-guided features for fake news detection in cross-domain settings, where we face the problem of domain shift, is still largely unexplored. In this work, we evaluate the impact of emotion-guided features for cross-domain fake news detection, and further propose an emotion-guided, domain-adaptive approach using adversarial learning. We prove the efficacy of emotion-guided models in cross-domain settings for various combinations of source and target datasets from FakeNewsAMT, Celeb, Politifact and Gossipcop datasets.
pdf
bib
abs
Generalised Spherical Text Embedding
Souvik Banerjee
|
Bamdev Mishra
|
Pratik Jawanpuria
|
Manish Shrivastava
This paper aims to provide an unsupervised modelling approach that allows for a more flexible representation of text embeddings. It jointly encodes the words and the paragraphs as individual matrices of arbitrary column dimension with unit Frobenius norm. The representation is also linguistically motivated by the introduction of a metric for the ambient space in which we train the embeddings, which calculates the similarity between matrices with unequal numbers of columns. Thus, the proposed modelling and the novel similarity metric exploit the matrix structure of embeddings. We then show that the same matrices can be reshaped into vectors of unit norm, transforming our problem into an optimization problem on a spherical manifold for simplicity of optimization. Given the total number of matrices we are dealing with, which is equal to the vocabulary size plus the total number of documents in the corpus, this makes the training of an otherwise expensive non-linear model extremely efficient. We also quantitatively verify the quality of our text embeddings by showing that they demonstrate improved results in document classification, document clustering and semantic textual similarity benchmark tests.
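The reshaping step can be checked in a few lines of numpy: a matrix with unit Frobenius norm flattens to a vector of unit Euclidean norm, so the optimization can run on a sphere. The dimensions below are illustrative assumptions, not the paper’s settings.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(50, 4))      # a word embedding as a 50 x 4 matrix
M /= np.linalg.norm(M)            # unit Frobenius norm
v = M.reshape(-1)                 # the same numbers as a flat vector
assert np.isclose(np.linalg.norm(v), 1.0)  # lies on the unit sphere
```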
pdf
bib
abs
CNN-Transformer based Encoder-Decoder Model for Nepali Image Captioning
Bipesh Subedi
|
Bal Krishna Bal
Many image captioning tasks have been carried out in recent years, the majority of the work being for the English language. A few research works have also been carried out for Hindi and Bengali languages in the domain. Unfortunately, not much research emphasis seems to be given to the Nepali language in this direction. Furthermore, the datasets are also not publicly available in the Nepali language. The aim of this research is to prepare a dataset with Nepali captions and develop a deep learning model based on the Convolutional Neural Network (CNN) and Transformer combined model to automatically generate image captions in the Nepali language. The dataset for this work is prepared by applying different data preprocessing techniques on the Flickr8k dataset. The preprocessed data is then passed to the CNN-Transformer model to generate image captions. ResNet-101 and EfficientNetB0 are the two pre-trained CNN models employed for this work. We have achieved some promising results which can be further improved in the future.
pdf
bib
abs
Verb Phrase Anaphora: Do(ing) so with Heuristics
Sandhya Singh
|
Kushagra Shree
|
Sriparna Saha
|
Pushpak Bhattacharyya
|
Gladvin Chinnadurai
|
Manish Vatsa
Verb Phrase Anaphora (VPA) is a universal language phenomenon. It can occur in the form of the do so phrase, verb phrase ellipsis, etc. Resolving VPA can improve the performance of dialogue processing systems, Natural Language Generation (NLG), Question Answering (QA) and so on. In this paper, we present a novel computational approach to resolve the specific verb phrase anaphora appearing as the do so construct and its lexical variations for the English language. The approach follows a heuristic technique using a combination of parsing from classical NLP, a state-of-the-art (SOTA) Generative Pre-trained Transformer (GPT) language model, and a RoBERTa grammar correction model. The results indicate that our approach can resolve these specific verb phrase anaphora cases with an F1 score of 73.40. The dataset used for testing the specific verb phrase anaphora cases of do so and doing so is released for research purposes. This module has been used as the last module in a coreference resolution pipeline for a downstream QA task in the electronic home appliances sector.
pdf
bib
abs
Event Oriented Abstractive Summarization
Aafiya S Hussain
|
Talha Z Chafekar
|
Grishma Sharma
|
Deepak H Sharma
Abstractive summarization models are generally conditioned on the source article. This generates a summary with the central theme of the article, but it is not possible to generate a summary focusing on specific key areas of the article. To solve this problem, we introduce a novel method for abstractive summarization. We aim to use a transformer to generate summaries which are more tailored to the events in the text by using event information. We extract events from text, perform generalized pooling to get a representation for these events, and add an event attention block in the decoder to aid the transformer model in summarization. We carried out experiments on the CNN/Daily Mail dataset and the BBC Extreme Summarization dataset. We achieve comparable results on both these datasets, with less training and better inclusion of event information in the summaries, as shown by human evaluation scores.
pdf
bib
abs
Augmenting eBooks with recommended questions using contrastive fine-tuned T5
Shobhan Kumar
|
Arun Chauhan
|
Pavan Kumar
Recent advances in AI have made the generation of questions from natural language text possible; the approach completely excludes the human in the loop while generating appropriate questions, which improves students’ learning engagement. The ever-growing amount of educational content renders it increasingly difficult to manually generate sufficient practice or quiz questions to accompany it. Reading comprehension can be improved by asking the right questions, so this work offers a Transformer-based question generation model for autonomously producing quiz questions from educational information, such as eBooks. This work proposes a contrastive training approach for the ‘Text-to-Text Transfer Transformer’ (T5) model, where the model (T5-eQG) creates the summarized text for the input document and then automatically generates the questions. Our model shows promising results over earlier neural network-based and rule-based models for the question generation task on benchmark datasets and NCERT eBooks.
pdf
bib
abs
Reducing Inference Time of Biomedical NER Tasks using Multi-Task Learning
Mukund Chaudhry
|
Arman Kazmi
|
Shashank Jatav
|
Akhilesh Verma
|
Vishal Samal
|
Kristopher Paul
|
Ashutosh Modi
Recently, fine-tuned transformer-based models (e.g., PubMedBERT, BioBERT) have shown state-of-the-art performance on a number of BioNLP tasks, such as Named Entity Recognition (NER). However, transformer-based models are complex, have millions of parameters, and, consequently, are relatively slow during inference. In this paper, we address the time complexity limitations of BioNLP transformer models. In particular, we propose a Multi-Task Learning based framework for jointly learning three different biomedical NER tasks. Our experiments show a reduction in inference time by a factor of three without any reduction in prediction accuracy.
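A minimal sketch of such a joint setup: one shared encoder with a separate token-classification head per NER task, so a single forward pass serves all three tasks. The encoder checkpoint and label counts below are placeholders, not the paper’s choices.

```python
import torch.nn as nn
from transformers import AutoModel

class MultiTaskNER(nn.Module):
    def __init__(self, encoder_name="bert-base-uncased", labels_per_task=(5, 7, 3)):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)  # shared across tasks
        hidden = self.encoder.config.hidden_size
        # one lightweight tagging head per biomedical NER task
        self.heads = nn.ModuleList(nn.Linear(hidden, n) for n in labels_per_task)

    def forward(self, input_ids, attention_mask):
        states = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        return [head(states) for head in self.heads]  # per-task tag logits
```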
pdf
bib
abs
English To Indian Sign Language: Rule-Based Translation System Along With Multi-Word Expressions and Synonym Substitution
Abhigyan Ghosh
|
Radhika Mamidi
Hearing-challenged communities all over the world face difficulties communicating with others. Machine translation has been one of the prominent technologies to facilitate communication with the deaf and hard of hearing community worldwide. We have explored and formulated the fundamental rules of Indian Sign Language (ISL) and implemented them as a translation mechanism from English text to Indian Sign Language glosses. According to the formulated rules and sub-rules, the source text structure is identified and transferred to the target ISL gloss. This target language is such that it can be easily converted to videos using the Indian Sign Language dictionary. This research work also describes the intermediate phases of the transfer process and innovations in the process, such as Multi-Word Expression detection and synonym substitution, to handle the limited vocabulary size of Indian Sign Language while producing semantically accurate translations.
pdf
bib
abs
Improving Contextualized Topic Models with Negative Sampling
Suman Adhya
|
Avishek Lahiri
|
Debarshi Kumar Sanyal
|
Partha Pratim Das
Topic modeling has emerged as a dominant method for exploring large document collections. Recent approaches to topic modeling use large contextualized language models and variational autoencoders. In this paper, we propose a negative sampling mechanism for a contextualized topic model to improve the quality of the generated topics. In particular, during model training, we perturb the generated document-topic vector and use a triplet loss to encourage the document reconstructed from the correct document-topic vector to be similar to the input document and dissimilar to the document reconstructed from the perturbed vector. Experiments for different topic counts on three publicly available benchmark datasets show that in most cases, our approach leads to an increase in topic coherence over that of the baselines. Our model also achieves very high topic diversity.
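The training signal described here can be sketched as follows, assuming a decoder that reconstructs a bag-of-words vector from a document-topic vector; the specific perturbation (a random permutation of topic proportions) and the margin are our assumptions, not necessarily the paper’s.

```python
import torch
import torch.nn.functional as F

def negative_sampling_loss(decoder, doc_bow, theta, margin=1.0):
    theta_neg = theta[:, torch.randperm(theta.size(1))]  # perturbed document-topic vector
    recon_pos = decoder(theta)      # reconstruction from the correct topic vector
    recon_neg = decoder(theta_neg)  # reconstruction from the perturbed one
    # pull the correct reconstruction toward the input document,
    # push the perturbed reconstruction away
    return F.triplet_margin_loss(doc_bow, recon_pos, recon_neg, margin=margin)
```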
pdf
bib
abs
IMFinE: An Integrated BERT-CNN-BiGRU Model for Mental Health Detection in Financial Context on Textual Data
Ashraf Kamal
|
Padmapriya Mohankumar
|
Vishal K Singh
Nowadays, mental health is a global issue, and it is a pervasive phenomenon on online social network platforms. It is observed in varied categories, such as depression, suicide, and stress, on the Web. Hence, the mental health detection problem is receiving continuous attention among computational linguistics researchers. On the other hand, public emotions and reactions play a significant role in the financial domain, and the issue of mental health is directly associated with them. In this paper, we propose a new study to detect mental health in a financial context. It starts with a two-step data filtration process to prepare the mental health dataset in a financial context. A new model called IMFinE is introduced. It consists of an input layer, followed by two relevant BERT embedding layers, a convolutional neural network, a bidirectional gated recurrent unit, and finally, dense and output layers. The empirical evaluation of the proposed model is performed on Reddit datasets and shows impressive results in terms of precision, recall, and F-score. It also outperforms relevant state-of-the-art and baseline methods. To the best of our knowledge, this is the first study on mental health detection in a financial context.
pdf
bib
abs
Methods to Optimize Wav2Vec with Language Model for Automatic Speech Recognition in Resource Constrained Environment
Vaibhav Haswani
|
Padmapriya Mohankumar
Automatic Speech Recognition (ASR) in a resource-constrained environment is a complex task, since most state-of-the-art models are combinations of multilayered convolutional neural network (CNN) and Transformer models, which require huge resources such as GPUs or TPUs for training as well as inference. The accuracy of an ASR system depends upon the efficiency of the phoneme-to-word translation of the acoustic model and the context correction of the language model. However, inference is also an important performance aspect, and it mostly depends upon the available resources. Most ASR models use Transformers at their core, and one caveat of Transformers is that they can usually handle only a finite sequence length, either because they use position encodings or simply because the cost of attention is O(n²) in sequence length, meaning that very large sequence lengths explode in complexity and memory. Consequently, the system cannot run on finite hardware, even a very high-end GPU: inferring even a one-hour-long audio clip with Wav2Vec will crash the system. In this paper, we use several state-of-the-art methods to optimize the Wav2Vec model for better prediction accuracy in resource-constrained systems. In addition, we perform tests with other SOTA models, such as Citrinet and QuartzNet, for comparative analysis.
pdf
bib
abs
Knowledge Graph-based Thematic Similarity for Indian Legal Judgement Documents using Rhetorical Roles
Sheetal S
|
Veda N
|
Ramya Prabhu
|
Pruthv P
|
Mamatha H R
Automation in the legal domain promises to be vital in helping solve the backlog that currently affects the Indian judiciary. Any system developed to aid such a task must be informed by the choices that legal professionals make in the real world to achieve the same task, while also ensuring that biases are eliminated. The task of legal case similarity is accomplished in this paper by extracting the thematic similarity of the documents based on their rhetorical roles. The similarity scores between the documents are calculated keeping in mind the different amounts of influence each of these rhetorical roles has, in real-life practice, over determining the similarity between two documents. Knowledge graphs are used to capture this information in order to facilitate the use of this method in applications like information retrieval and recommendation systems.
pdf
bib
abs
SConE: Contextual Relevance based Significant CompoNent Extraction from Contracts
Hiranmai Adibhatla
|
Manish Shrivastava
Automatic extraction of “significant” components of a legal contract has the potential to simplify the end user’s comprehension. In essence, “significant” pieces of information comprise 1) information pertaining to material/practical details about a specific contract and 2) information that is novel or comes as a “surprise” for a specific type of contract. This indicates that the significance of a component may be defined at an individual contract level and at a contract-type level. A component, sentence, or paragraph may be considered significant at a contract level if it contains contract-specific information (CSI), like names, dates, or currency terms. At a contract-type level, components that deviate significantly from the norm for the type may be considered significant (type-specific information (TSI)). In this paper, we present approaches to extract “significant” components from a contract at both these levels. We attempt to do this by identifying patterns in a pool of documents of the same kind. Consequently, in our approach, the solution is formulated in two parts: identifying CSI using a BERT-based contract-specific information extractor, and identifying TSI by scoring sentences in a contract for their likelihood. In this paper, we also describe the annotated corpus of contract documents that we created as a first step toward the development of such a language-processing system. We also release a dataset of contract samples containing sentences belonging to CSI and TSI.
pdf
bib
abs
AniMOJity: Detecting Hate Comments in Indic languages and Analysing Bias against Content Creators
Rahul Khurana
|
Chaitanya Pandey
|
Priyanshi Gupta
|
Preeti Nagrath
Online platforms have dramatically changed how people communicate with one another, resulting in a 467 million increase in the number of Indians actively exchanging and distributing social data. This has caused an unexpected rise in harmful, racially, sexually, and religiously biased Internet content that humans cannot control. As a result, there is an urgent need to research automated computational strategies for identifying hostile content in academic forums. This paper presents our learning pipeline and novel model, which classifies multilingual text with a test F1-score of 88.6% on the Moj Multilingual Abusive Comment Identification dataset for hate speech detection in thirteen Indian regional languages. Our model, Animojity, incorporates transfer learning and SOTA pre- and post-processing techniques. We manually annotate 300 samples to investigate bias and provide insight into the hate directed towards creators.
pdf
bib
abs
Revisiting Anwesha: Enhancing Personalised and Natural Search in Bangla
Arup Das
|
Joyojyoti Acharya
|
Bibekananda Kundu
|
Sutanu Chakraborti
Bangla is a low-resource, highly agglutinative language. Thus it is challenging to facilitate an effective search of Bangla documents. We have created a gold standard dataset containing query document relevance pairs for evaluation purposes. We utilise Named Entities to improve the retrieval effectiveness of traditional Bangla search algorithms. We suggest a reasonable starting model for leveraging implicit preference feedback based on the user search behaviour to enhance the results retrieved by the Explicit Semantic Analysis (ESA) approach. We use contextual sentence embeddings obtained via Language-agnostic BERT Sentence Embedding (LaBSE) to rerank the candidate documents retrieved by the traditional search algorithms (tf-idf) based on the top sentences that are most relevant to the query. This paper presents our empirical findings across these directions and critically analyses the results.
pdf
bib
abs
KnowPAML: A Knowledge Enhanced Framework for Adaptable Personalized Dialogue Generation Using Meta-Learning
Aditya Shukla
|
Zishan Ahmad
|
Asif Ekbal
In order to provide personalized interactions in a conversational system, responses must be consistent with the user and agent persona while still being relevant to the context of the conversation. Existing personalized conversational systems increase the consistency of the generated response by leveraging persona descriptions, which sometimes tend to generate responses irrelevant to the context. To solve this problem, we propose to extend the persona-agnostic meta-learning (PAML) framework by adding knowledge from the ConceptNet knowledge graph with a multi-hop attention mechanism. Knowledge is represented as concepts in triple form that help the conversational flow. The multi-hop attention mechanism helps select the most appropriate triples with respect to the conversational context and persona description, as not all triples are beneficial for generating responses. The meta-learning (PAML) framework allows quick adaptation to different personas by utilizing only a few dialogue samples from the same user. Our experiments on the Persona-Chat dataset show that our method outperforms in terms of persona-adaptability, resulting in more persona-consistent responses, as evidenced by the entailment (Entl) score in automatic evaluation and the consistency (Con) score in human evaluation.
pdf
bib
abs
There is No Big Brother or Small Brother: Knowledge Infusion in Language Models for Link Prediction and Question Answering
Ankush Agarwal
|
Sakharam Gawade
|
Sachin Channabasavarajendra
|
Pushpak Bhattacharya
The integration of knowledge graphs with deep learning is thriving in improving the performance of various natural language processing (NLP) tasks. In this paper, we focus on knowledge-infused link prediction and question answering using language models, T5 and BLOOM, across three domains: Aviation, Movie, and Web. In this context, we infuse knowledge into large and small language models and study their performance, and find the performance to be similar. For the link prediction task on the Aviation Knowledge Graph, we obtain a 0.2 hits@1 score using T5-small, T5-base, T5-large, and BLOOM. Using template-based scripts, we create a set of 1 million synthetic factoid QA pairs in the aviation domain from National Transportation Safety Board (NTSB) reports. On our curated QA pairs, the three T5 models achieve a 0.7 hits@1 score. We validate our findings with the paired Student’s t-test and Cohen’s kappa scores. For link prediction on the Aviation Knowledge Graph using T5-small and T5-large, we obtain a Cohen’s kappa score of 0.76, showing substantial agreement between the models. Thus, we infer that small language models perform similarly to large language models with the infusion of knowledge.
pdf
bib
abs
Efficient Joint Learning for Clinical Named Entity Recognition and Relation Extraction Using Fourier Networks: A Use Case in Adverse Drug Events
Anthony Yazdani
|
Dimitrios Proios
|
Hossein Rouhizadeh
|
Douglas Teodoro
Current approaches for clinical information extraction are inefficient in terms of computational costs and memory consumption, hindering their application to large-scale electronic health records (EHRs). We propose an efficient end-to-end model, the Joint-NER-RE-Fourier (JNRF), to jointly learn the tasks of named entity recognition and relation extraction for documents of variable length. The architecture uses positional encoding and unitary batch sizes to process variable-length documents, and uses a weight-shared Fourier network layer for low-complexity token mixing. Finally, we reach the theoretical computational complexity lower bound for relation extraction using a selective pooling strategy and distance-aware attention weights with trainable polynomial distance functions. We evaluated the JNRF architecture using the 2018 N2C2 ADE benchmark to jointly extract medication-related entities and relations from variable-length EHR summaries. JNRF outperforms rolling-window BERT with selective pooling by 0.42%, while being twice as fast to train. Compared to state-of-the-art BiLSTM-CRF architectures on the N2C2 ADE benchmark, results show that the proposed approach trains 22 times faster and reduces GPU memory consumption by a factor of 1.75, with a reasonable performance tradeoff of 90%, without the use of external tools, hand-crafted rules or post-processing. Given the significant carbon footprint of deep learning models and the current energy crisis, these methods could support efficient and cleaner information extraction from EHRs and other types of large-scale document databases.
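The low-complexity token-mixing ingredient can be illustrated in a couple of lines, in the style of FNet-like Fourier layers where a parameter-free FFT stands in for self-attention. This is a generic sketch of the idea, not the JNRF architecture itself.

```python
import torch

def fourier_token_mixing(x):
    # x: (batch, tokens, hidden); mix along the token and hidden
    # dimensions with a 2D FFT and keep the real part, using no
    # learned parameters and O(n log n) cost in sequence length
    return torch.fft.fft2(x).real

out = fourier_token_mixing(torch.randn(2, 128, 64))
print(out.shape)  # torch.Size([2, 128, 64])
```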
pdf
bib
abs
Genre Transfer in NMT: Creating Synthetic Spoken Parallel Sentences using Written Parallel Data
Nalin Kumar
|
Ondrej Bojar
Text style transfer (TST) aims to control attributes in a given text without changing the content. The matter gets complicated when the boundary separating two styles gets blurred. We can notice similar difficulties in the case of parallel datasets in spoken and written genres. Genuine spoken features like filler words and repetitions in existing spoken-genre parallel datasets are often cleaned during transcription and translation, making the texts closer to written datasets. This poses several problems for spoken genre-specific tasks like simultaneous speech translation. This paper seeks to address the challenge of improving spoken language translations. We start by creating a genre classifier for individual sentences and then try two approaches for data augmentation using written examples: (1) a novel method that involves assembling and disassembling spoken and written neural machine translation (NMT) models, and (2) a rule-based method to inject spoken features. Though the observed results for (1) are not promising, we get some interesting insights into the solution. The model proposed in (1), fine-tuned on the synthesized data from (2), produces natural-looking spoken translations for written-to-spoken genre transfer in En-Hi translation systems. We use this system to produce a second-stage En-Hi synthetic corpus, which however lacks appropriate alignments of explicit spoken features across the languages. For the final evaluation, we fine-tune Hi-En spoken translation systems on the synthesized parallel corpora. We observe that the parallel corpus synthesized using our rule-based method produces the best results.
pdf
bib
abs
PACMAN: PArallel CodeMixed dAta generatioN for POS tagging
Arindam Chatterjee
|
Chhavi Sharma
|
Ayush Raj
|
Asif Ekbal
Code-mixing or code-switching is the mixing of languages in the same context, predominantly observed in multilingual societies. The existing code-mixed datasets are small and primarily contain social media text that does not adhere to standard spelling and grammar. Computational models built on such data fail to generalise to unseen code-mixed data. To address the unavailability of quality code-mixed annotated datasets, we explore the combined task of generating annotated code-mixed data and building computational models from this generated data, specifically for code-mixed Part-Of-Speech (POS) tagging. We introduce PACMAN (PArallel CodeMixed dAta generatioN), a synthetically generated code-mixed POS-tagged dataset with over 50K samples, which is the largest annotated code-mixed dataset. We build POS taggers using classical machine learning and deep learning based techniques on the generated data and report an F1-score of 98% (8% above the current state-of-the-art (SOTA)). To determine the efficacy of our data, we compare it against the existing benchmark in code-mixed POS tagging. PACMAN outperforms the benchmark, ratifying that our dataset and, subsequently, our POS tagging models are generalised and capable of handling even natural code-mixed and monolingual data.
pdf
bib
abs
Error Corpora for Different Informant Groups: Annotating and Analyzing Texts from L2 Speakers, People with Dyslexia and Children
Þórunn Arnardóttir
|
Isidora Glisic
|
Annika Simonsen
|
Lilja Stefánsdóttir
|
Anton Ingason
Error corpora are useful for many tasks, in particular for developing spell- and grammar-checking software and teaching material and tools. We present and compare three specialized Icelandic error corpora: the Icelandic L2 Error Corpus, the Icelandic Dyslexia Error Corpus, and the Icelandic Child Language Error Corpus. Each corpus contains texts written by speakers of a particular group: L2 speakers of Icelandic, people with dyslexia, and children aged 10 to 15. The corpora shed light on the errors made by these groups and their frequencies, and all errors are manually labeled according to an annotation scheme. The corpora vary in size, containing from 7,817 to 24,948 errors, and are published under a CC BY 4.0 license. In this paper, we describe the corpora and their annotation scheme, and draw comparisons between their errors and error frequencies.
pdf
bib
abs
Similarity Based Label Smoothing For Dialogue Generation
Sougata Saha
|
Souvik Das
|
Rohini Srihari
Generative neural conversational systems are typically trained by minimizing the entropy loss between the training “hard” targets and the predicted logits. Performance gains and improved generalization are often achieved by employing regularization techniques like label smoothing, which converts the training “hard” targets to soft targets. However, label smoothing enforces a data-independent uniform distribution on the incorrect training targets, leading to a false assumption of equiprobability. In this paper, we propose and experiment with incorporating data-dependent, word similarity-based weighting methods to transform the uniform distribution over the incorrect target probabilities in label smoothing into a more realistic distribution based on semantics. We introduce hyperparameters to control the incorrect target distribution and report significant performance gains over networks trained using a standard label smoothing-based loss on two standard open-domain dialogue corpora.
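A hedged sketch of this idea: spread the smoothing mass eps over the incorrect tokens in proportion to their embedding similarity to the gold token, instead of uniformly. The temperature tau and the reuse of the output embedding matrix for similarities are our assumptions.

```python
import torch
import torch.nn.functional as F

def similarity_smoothed_targets(gold_ids, embedding, eps=0.1, tau=1.0):
    emb = embedding.weight                                  # (vocab, dim)
    sims = emb[gold_ids] @ emb.t()                          # gold-token similarity to all tokens
    sims.scatter_(1, gold_ids.unsqueeze(1), float("-inf"))  # exclude the gold token itself
    soft = F.softmax(sims / tau, dim=-1) * eps              # eps mass, similarity-weighted
    return soft.scatter(1, gold_ids.unsqueeze(1), 1.0 - eps)  # gold keeps 1 - eps; rows sum to 1
```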
pdf
bib
abs
A Novel Approach towards Cross Lingual Sentiment Analysis using Transliteration and Character Embedding
Rajarshi Roychoudhury
|
Subhrajit Dey
|
Md Akhtar
|
Amitava Das
|
Sudip Naskar
Sentiment analysis with deep learning in resource-constrained languages is a challenging task. In this paper, we introduce a novel approach for sentiment analysis in resource-constrained scenarios using character embedding and cross-lingual sentiment analysis with transliteration. We use this method for the task of inducing the sentiment polarity of words and sentences and for aspect term sentiment analysis in the no-resource scenario. We formulate this task by taking a metalingual approach whereby we transliterate data from closely related languages and transform it into a meta language. We also demonstrate the efficacy of using character-level embedding for sentence representation. We experimented with four Indian languages (Bengali, Hindi, Tamil, and Telugu) and obtained encouraging results. We also present new state-of-the-art results on the Hindi sentiment analysis dataset, leveraging our metalingual character embeddings.
pdf
bib
abs
Normalization of Spelling Variations in Code-Mixed Data
Krishna Yadav
|
Md Akhtar
|
Tanmoy Chakraborty
Code-mixed text infused with a low resource language has always been a challenge for natural language understanding models. A significant problem in understanding such texts is the correlation between the syntactic and semantic arrangement of words. The phonemes of each character in a word dictate the spelling representation of a term in a low resource language. However, there is no universal protocol or alphabet mapping for code-mixing. In this paper, we highlight the impact of spelling variations in code-mixed data on training natural language understanding models. We emphasize the impact of using phonetics to neutralize this variation in spelling across different usages of a word with the same semantics. The proposed approach is a computationally inexpensive technique that improves the performance of state-of-the-art models for three dialog system tasks, viz. intent classification, slot-filling, and response generation. We propose a data pipeline for normalizing spelling variations irrespective of language.
pdf
bib
abs
A Method for Automatically Estimating the Informativeness of Peer Reviews
Prabhat Bharti
|
Tirthankar Ghosal
|
Mayank Agarwal
|
Asif Ekbal
Peer reviews are intended to give authors constructive and informative feedback. It is expected that the reviewers will make constructive suggestions on certain aspects, e.g., novelty, clarity, empirical and theoretical soundness, etc., and sections, e.g., problem definition/idea, datasets, methodology, experiments, results, etc., of the paper in a detailed manner. With this objective, we analyze the reviewer’s attitude towards the work. Aspects of the review are essential to determine how much weight the editor/chair should place on the review in making a decision. In this paper, we use the publicly available Peer Review Analyze dataset of peer review texts manually annotated at the sentence level (∼13.22k sentences) across two layers: Paper Section Correspondence and Paper Aspect Category. We transform these categorical annotations to derive an informativeness score for the review based on the review’s coverage across section correspondence, aspects of the paper, and reviewer-centric uncertainty associated with the review. We hope that our proposed methods, which are motivated towards automatically estimating the quality of peer reviews in the form of informativeness scores, will give editors an additional layer of confidence in the automatic judgment of review quality. We make our code available at
https://github.com/PrabhatkrBharti/informativeness.git.
pdf
bib
abs
Spellchecker for Sanskrit: The Road Less Taken
Prasanna S
A spellchecker is essential for any language for producing error-free content. While there exist advanced computational tools for Sanskrit, such as word segmenters, morphological analysers, sentential parsers, and machine translation, a fully functional spellchecker is not available. This paper presents a Sanskrit spellchecking dictionary for Hunspell, thereby creating a spellchecker that works across the numerous platforms Hunspell supports. The spellchecking rules are created based on the Paninian grammar, and the dictionary design follows the word-and-paradigm model, thus making it easily extendible for future improvements. The paper also presents an online spellchecking interface for Sanskrit, developed mainly for the platforms where Hunspell integration is not available yet.
pdf
bib
abs
TeQuAD: Telugu Question Answering Dataset
Rakesh Vemula
|
Mani Nuthi
|
Manish Srivastava
Recent state-of-the-art models and new datasets have advanced many Natural Language Processing areas; in particular, Machine Reading Comprehension (MRC) tasks have improved with the help of datasets like SQuAD (Stanford Question Answering Dataset). However, large high-quality datasets are still not a reality for low resource languages like Telugu, which hinders recording progress in MRC. In this paper, we present a Telugu Question Answering Dataset, TeQuAD, with 82k parallel triples created by translating triples from SQuAD. We also introduce a few methods to create similar Question Answering datasets for low resource languages. Then, we present the performance of our models, which outperform baseline models on Monolingual and Cross Lingual Machine Reading Comprehension (CLMRC) setups, the best of them resulting in an F1 score of 83% and an Exact Match (EM) score of 61%.
pdf
bib
abs
A Comprehensive Study of Mahabharat using Semantic and Sentiment Analysis
Srijeyarankesh J S
|
Aishwarya Kumaran
|
Nithyasri Lakshminarasimhan
|
Shanmuga Priya M
Indian epics have not been analyzed computationally to the extent that Greek epics have. In this paper, we show how interesting insights can be derived from the ancient epic Mahabharata by applying a variety of analytical techniques based on a combination of natural language processing methods like semantic analysis, sentiment analysis and Named Entity Recognition (NER). The key findings include the analysis of events and their importance in shaping the story, characters’ lives and their actions leading to consequences, and the change of emotions across the eighteen parvas of the story.
pdf
bib
abs
DeepADA: An Attention-Based Deep Learning Framework for Augmenting Imbalanced Textual Datasets
Amit Sah
|
Muhammad Abulaish
In this paper, we present an attention-based deep learning framework, DeepADA, which uses data augmentation to address the class imbalance problem in textual datasets. The proposed framework carries out the following functions: (i) using MPNET-based embeddings to extract keywords from documents of the minority class, (ii) making use of a CNN-BiLSTM architecture with parallel attention to learn the important contextual words associated with the minority-class documents’ keywords, providing them with word-level characteristics derived from their statistical and semantic features, (iii) using MPNET, replacing the key contextual terms derived from the oversampled documents that match a keyword with the contextual term that best fits the context, and finally (iv) oversampling the minority class dataset to produce a balanced dataset. Using a 2-layer stacked BiLSTM classifier, we assess the efficacy of the proposed framework on the original and oversampled versions of three Amazon reviews datasets. We contrast the proposed data augmentation approach with two state-of-the-art text data augmentation methods. The experimental results reveal that our method produces an oversampled dataset that is more useful and helps the classifier perform better than the other two state-of-the-art methods. Moreover, we find that the oversampled datasets outperform their original versions by a wide margin.
pdf
bib
abs
Compact Residual Learning with Frequency-Based Non-Square Kernels for Small Footprint Keyword Spotting
Muhammad Abulaish
|
Rahul Gulia
Enabling voice assistants on small embedded devices requires a keyword spotter with a small model size and adequate accuracy, and it is difficult to achieve a reasonable trade-off between a small footprint and high accuracy. Recent studies have demonstrated that convolutional neural networks are also effective in the audio domain. In this paper, taking into account the nature of spectrograms, we propose a compact ResNet architecture that uses frequency-based non-square kernels to extract the maximum number of timbral features for keyword spotting. The proposed architecture is approximately three-and-a-half times smaller than a comparable architecture with conventional square kernels. On Google’s Speech Commands dataset v1, it outperforms both Google’s convolutional neural networks and the equivalent ResNet architecture with square kernels. By implementing non-square kernels for spectrogram-like data, we can achieve a significant increase in accuracy with relatively few parameters, as compared to the conventional square kernels that are the default choice for every problem.
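The kernel-shape argument is easy to check: a 9x1 frequency-spanning kernel has exactly as many weights as a 3x3 square kernel while covering three times as many frequency bins. The channel count and spectrogram shape below are illustrative assumptions, not the paper’s configuration.

```python
import torch
import torch.nn as nn

spectrogram = torch.randn(1, 1, 40, 98)  # (batch, channel, frequency, time)
square = nn.Conv2d(1, 16, kernel_size=(3, 3), padding=(1, 1))
nonsquare = nn.Conv2d(1, 16, kernel_size=(9, 1), padding=(4, 0))  # spans frequency only

# identical parameter budgets: 16 filters x 9 weights + 16 biases = 160 each
print(sum(p.numel() for p in square.parameters()))     # 160
print(sum(p.numel() for p in nonsquare.parameters()))  # 160
print(square(spectrogram).shape, nonsquare(spectrogram).shape)  # same output shape
```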
pdf
bib
abs
Unsupervised Bengali Text Summarization Using Sentence Embedding and Spectral Clustering
Sohini Roychowdhury
|
Kamal Sarkar
|
Arka Maji
Single-document extractive text summarization produces a condensed version of a document by extracting salient sentences from it. The most significant and diverse information can be obtained from a document by breaking it into topical clusters of sentences. The spectral clustering method is useful in text summarization because it does not assume any fixed shape of the clusters, and the number of clusters can be inferred automatically using the eigengap method. In our approach, we use word embedding-based sentence representation and a spectral clustering algorithm to identify the various topics covered in a Bengali document, and generate an extractive summary by selecting salient sentences from the identified topics. We compare our Bengali summarization system with several baseline extractive summarization systems. The experimental results show that the proposed approach performs better than the baseline Bengali summarization systems it is compared to.
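The pipeline can be sketched compactly: embed sentences, spectral-cluster them into topics, and pick one salient sentence per topic. The generic embed function, the fixed cluster count (standing in for the eigengap estimate), and centroid similarity as the salience rule are our assumptions.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def summarize(sentences, embed, n_clusters=3):
    X = np.vstack([embed(s) for s in sentences])   # one embedding vector per sentence
    labels = SpectralClustering(n_clusters=n_clusters).fit_predict(X)
    summary = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]             # sentences in topical cluster c
        centroid = X[idx].mean(axis=0)
        best = idx[np.argmax(X[idx] @ centroid)]   # sentence closest to the topic centre
        summary.append(sentences[best])
    return summary
```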
pdf
bib
Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts
Bharathi Raja Chakravarthi
|
Abirami Murugappan
|
Dhivya Chinnappa
|
Adeep Hane
|
Prasanna Kumar Kumeresan
|
Rahul Ponnusamy
pdf
bib
abs
Language Identification at the Word Level in Code-Mixed Texts Using Character Sequence and Word Embedding
O. E. Ojo
|
A. Gelbukh
|
H. Calvo
|
A. Feldman
|
O. O. Adebanji
|
J. Armenta-Segura
People often switch languages in conversations or written communication in order to communicate thoughts on social media platforms. The languages in texts of this type, also known as code-mixed texts, can be mixed at the sentence, word, or even sub-word level. In this paper, we address the problem of identifying language at the word level in code-mixed texts using a sequence of characters and word embedding. We feed machine learning and deep neural networks with a range of character-based and word-based text features as input. The data for this experiment was created by combining YouTube video comments from code-mixed Kannada and English (Kn-En) texts. The texts were pre-processed, split into words, and categorized as ‘Kannada’, ‘English’, ‘Mixed-Language’, ‘Name’, ‘Location’, and ‘Other’. The proposed techniques were able to learn from these features and effectively identify the language of the words in the dataset. The proposed CK-Keras model with pre-trained Word2Vec embedding was our best-performing system, as it outperformed other methods when evaluated by F1 scores.
pdf
bib
abs
CoLI-Kanglish: Word-Level Language Identification in Code-Mixed Kannada-English Texts Shared Task using the Distilka model
Vajratiya Vajrobol
Due to the intercultural demographic of online users, code-mixed language is often used by them to express themselves on social media. Language support for such users depends on the ability of a system to identify the constituent languages of the code-mixed language. Therefore, language identification, which determines the language of individual textual entities in a code-mixed corpus, is a current and relevant classification problem. Code-mixed texts are difficult to interpret and analyze from an algorithmic perspective. However, highly complex transformer-based techniques can be used to analyze and identify the distinct languages of words in code-mixed texts. Kannada is one of the Dravidian languages, spoken and written in Karnataka, India. This study aims to identify the language of individual words in texts from a corpus of code-mixed Kannada-English texts using transformer-based techniques. The proposed Distilka model was developed by fine-tuning the DistilBERT model on the code-mixed corpus. This model performed best on the official test dataset, with a macro-averaged F1-score of 0.62 and a weighted precision score of 0.86. The proposed solution ranked first in the shared task.
pdf
bib
abs
BERT-based Language Identification in Code-Mix Kannada-English Text at the CoLI-Kanglish Shared Task@ICON 2022
Pritam Deka
|
Nayan Jyoti Kalita
|
Shikhar Kumar Sarma
Language identification has recently gained research interest in code-mixed languages due to the extensive use of social media among people. People who speak multiple languages tend to use code-mixed language when communicating with each other. It has become necessary to identify the languages in such code-mixed environments to detect hate speech, fake news, misinformation or disinformation, and for tasks such as sentiment analysis. In this work, we propose a BERT-based approach for language identification in the CoLI-Kanglish shared task at ICON 2022. Our approach achieved a weighted average F1-score of 86% and a macro average F1-score of 57% on the test set.
pdf
bib
abs
Transformer-based Model for Word Level Language Identification in Code-mixed Kannada-English Texts
Atnafu Lambebo Tonja
|
Mesay Gemeda Yigezu
|
Olga Kolesnikova
|
Moein Shahiki Tash
|
Grigori Sidorov
|
Alexander Gelbukh
This paper describes our system for the CoLI-Kanglish 2022 shared task on language identification at the word level in Kannada-English texts. The goal of this task is to identify the different languages used in the CoLI-Kanglish 2022 data. The dataset is distributed into different categories, including Kannada, English, Mixed-Language, Location, Name, and Others. This code-mixed corpus was compiled by the CoLI-Kanglish 2022 organizers from posts on social media. We use two classification techniques, KNN and SVM, and achieve an F1-score of 0.58, placing third out of nine competitors.
pdf
bib
abs
Word Level Language Identification in Code-mixed Kannada-English Texts using traditional machine learning algorithms
M. Shahiki Tash
|
Z. Ahani
|
A.l. Tonja
|
M. Gemeda
|
N. Hussain
|
O. Kolesnikova
This paper describes our system for the CoLI-Kanglish 2022 shared task on language identification at the word level in Kannada-English texts. The goal of this task is to identify the different languages used in the CoLI-Kanglish 2022 data. The dataset is distributed into different categories, including Kannada, English, Mixed-Language, Location, Name, and Others. This code-mixed corpus was compiled by the CoLI-Kanglish 2022 organizers from posts on social media. We use two classification techniques, KNN and SVM, and achieve an F1-score of 0.58 and place third out of nine competitors.
pdf
bib
abs
Word Level Language Identification in Code-mixed Kannada-English Texts using Deep Learning Approach
Mesay Gemeda Yigezu
|
Atnafu Lambebo Tonja
|
Olga Kolesnikova
|
Moein Shahiki Tash
|
Grigori Sidorov
|
Alexander Gelbukh
The goal of code-mixed language identification (LID) is to determine which language is spoken or written in a given segment of speech, a word, a sentence, or a document. Our task is to identify English, Kannada, and mixed-language words from the provided data. To train a model we used the CoLI-Kenglish dataset, which contains English, Kannada, and mixed-language words. In our work, we conducted several experiments in order to obtain the best performing model. We then implemented the best model using Bidirectional Long Short-Term Memory (Bi-LSTM), which outperformed the other trained models with an F1-score of 0.61.
pdf
bib
abs
BoNC: Bag of N-Characters Model for Word Level Language Identification
Shimaa Ismail
|
Mai K. Gallab
|
Hamada Nayel
This paper describes the model submitted by the NLP_BFCAI team for the Kanglish shared task held at ICON 2022. The proposed model uses a very simple approach based on word representation. Simple machine learning classification algorithms, namely Random Forests, Support Vector Machines, Stochastic Gradient Descent and Multi-Layer Perceptron, have been implemented. Our submission, RF, securely ranked fifth among all other submissions.
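In the spirit of this description, a bag-of-character-n-grams word representation feeding a Random Forest can be assembled with scikit-learn. The n-gram range and the toy words/labels below are illustrative assumptions, not the team’s actual configuration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# toy Romanised Kannada/English words with word-level language tags
words = ["tumba", "chennagide", "super", "video", "bengaluru"]
labels = ["kn", "kn", "en", "en", "location"]

model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 3)),  # bag of character n-grams
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(words, labels)
print(model.predict(["chennag", "nice"]))
```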
pdf
bib
abs
Overview of CoLI-Kanglish: Word Level Language Identification in Code-mixed Kannada-English Texts at ICON 2022
F. Balouchzahi
|
S. Butt
|
A. Hegde
|
N. Ashraf
|
H.l. Shashirekha
|
Grigori Sidorov
|
Alexander Gelbukh
The task of Language Identification (LI) in text processing refers to automatically identifying the languages used in a text document. The LI task has usually been studied at the document level, and often for high-resource languages, while giving less importance to low-resource languages. However, with the recent advancement in technologies, in a multilingual country like India, many low-resource language users post their comments using English and one or more other language(s) in the form of code-mixed texts. Kannada-English is one such code-mixed combination, mixing the two languages at various levels. To address word-level LI in code-mixed text, in the CoLI-Kanglish shared task, we have focused on open-sourcing a Kannada-English code-mixed dataset for word-level LI of Kannada, English and mixed-language words written in Roman script. The task includes classifying each word in the given text into one of six predefined categories, namely: Kannada (kn), English (en), Kannada-English (kn-en), Name (name), Location (location), and Other (other). Among the models submitted by all the participants, the best performing model obtained weighted-averaged and macro-averaged F1 scores of 0.86 and 0.62, respectively.