Radhika Mamidi - ACL Anthology

Radhika Mamidi

2026

TeluguEval: A Comprehensive Benchmark for Evaluating LLM Capabilities in Telugu
Revanth Kumar Gundam | Radhika Mamidi
Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026)

Large Language Models (LLMs) excel on English reasoning tasks but falter on morphologically rich, low-resource languages such as Telugu, Tamil, and Kannada. We present TeluguEval, a human-curated reasoning benchmark created by translating GSM8K (math), Winogrande (commonsense), ARC (science), CaseHOLD (law), and Hendrycks Ethics into Telugu. We evaluate eight models spanning global (Llama-3.1-8B, Llama-2-7B, Qwen-8B, Gemma-7B, Gemini-2.0) and regional (Telugu-Llama2-7B, Indic-Gemma-7B, Sarvam-m-24B) systems. While extremely strong models such as Gemini and Sarvam-m largely retain performance in Telugu, most English-centric models suffer severe accuracy drops, often exceeding 30 to 40 points, particularly on mathematical and scientific reasoning. We further observe systematic failure modes including script sensitivity, option-selection bias, repetition loops, and unintended code-switching. Our results demonstrate that surface-level Telugu fluency does not imply robust reasoning capability, underscoring the need for Telugu-specific data, tokenization, and pretraining. TeluguEval provides a standardized testbed to drive progress on reasoning in low-resource Indian languages.

2025

Voices of Dissent: A Multimodal Analysis of Protest Songs through Lyrics and Audio
Utsav Shekhar | Radhika Mamidi
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Music has long served as a vehicle for political expression, with protest songs playing a central role in articulating dissent and mobilizing collective action. Yet, despite their cultural significance, the linguistic and acoustic signatures that define protest music remain understudied. We present a multimodal computational analysis of protest and non-protest songs spanning multiple decades. Using NLP and audio analysis, we identify the linguistic and musical features that differentiate protest songs. Instead of focusing on classification performance, we treat classification as a diagnostic tool to investigate these features and reveal broader patterns. Protest songs are not just politically charged; they are acoustically and linguistically distinct, and we quantify how.

adithjrajeev at SemEval-2025 Task 10: Sequential Learning for Role Classification Using Entity-Centric News Summaries
Adith Rajeev | Radhika Mamidi
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Disinformation and manipulative narratives are highly prevalent in online news sources today, and verifying the integrity of online information is vital because online audiences are highly susceptible to such propaganda. Verifying online information is, however, a significant challenge. The task Multilingual Characterization and Extraction of Narratives from Online News therefore focuses on developing novel methods of analyzing news ecosystems and detecting manipulation attempts to address this challenge. As a part of this effort, we focus on the subtask of Entity Framing, which involves assigning named entities in news articles one of three main roles (Protagonist, Antagonist, and Innocent) with a further fine-grained role distinction. We propose a pipeline that summarizes the article, with the summary centered around the entity. The entity and its entity-centric summary are then used as input for a BERT-based classifier to carry out the final role classification. Finally, we experiment with different approaches in the steps of the pipeline and compare the results obtained by them.

Emotion-Aware Dysarthric Speech Reconstruction: LLMs and Multimodal Evaluation with MCDS
Kaushal Attaluri | Radhika Mamidi | Sireesha Chittepu | Anirudh Chebolu | Hitendra Sarma Thogarcheti
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Dysarthria, a motor speech disorder affecting over 46 million individuals globally, impairs both intelligibility and emotional expression in communication. This work introduces a novel framework for emotion-aware sentence reconstruction from dysarthric speech using Large Language Models (LLMs) fine-tuned with QLoRA, namely LLaMA 3.1 and Mistral 8x7B. Our pipeline integrates direct emotion recognition from raw audio and conditions textual reconstruction on this emotional context to enhance both semantic and affective fidelity. We propose the Multimodal Communication Dysarthria Score (MCDS), a holistic evaluation metric combining BLEU, semantic similarity, emotion consistency, and human ratings: $\mathrm{MCDS} = \alpha B + \beta E + \gamma S + \delta H$, where $\alpha + \beta + \gamma + \delta = 1$. On our extended TORGO+ dataset, our emotion-aware LLM model achieves an MCDS of 0.87 and BLEU of 72.4%, significantly outperforming traditional pipelines like Kaldi GMM-HMM (MCDS: 0.52, BLEU: 38.1%) and Whisper-based models. It also surpasses baseline LLM systems by 0.09 MCDS. This sets a new benchmark in emotionally intelligent dysarthric speech reconstruction, with future directions including multilingual support and real-time deployment.
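
The MCDS definition in the abstract above is a convex combination of four component scores. The following is a minimal sketch of that arithmetic only, not the authors' implementation. The function name `mcds`, the equal default weights, and the requirement that every component be normalized to [0, 1] are assumptions made for illustration; the abstract does not state the weight values or the normalization.

```python
def mcds(bleu, emotion, semantic, human, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum alpha*B + beta*E + gamma*S + delta*H.

    Assumption: every component score is already scaled to [0, 1], and the
    equal default weights are placeholders, not values from the paper.
    """
    if len(weights) != 4:
        raise ValueError("expected four weights (alpha, beta, gamma, delta)")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    scores = (bleu, emotion, semantic, human)
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("component scores must lie in [0, 1]")
    alpha, beta, gamma, delta = weights
    return alpha * bleu + beta * emotion + gamma * semantic + delta * human


if __name__ == "__main__":
    # Hypothetical component scores, chosen only to exercise the function.
    print(mcds(0.8, 0.6, 0.4, 0.2))
```

Because the weights sum to 1 and each component lies in [0, 1], the combined score also stays in [0, 1], which is what makes values such as 0.87 and 0.52 directly comparable.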

Aligning Text/Speech Representations from Multimodal Models with MEG Brain Activity During Listening
Padakanti Srijith | Khushbu Pahwa | Radhika Mamidi | Bapi Raju Surampudi | Manish Gupta | Subba Reddy Oota
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Although speech language models are expected to align well with brain language processing during speech comprehension, recent studies have found that they fail to capture brain-relevant semantics beyond low-level features. Surprisingly, text-based language models exhibit stronger alignment with brain language regions, as they better capture brain-relevant semantics. However, no prior work has examined the alignment effectiveness of text/speech representations from multimodal models. This raises several key questions: Can speech embeddings from such multimodal models capture brain-relevant semantics through cross-modal interactions? Which modality can take advantage of this synergistic multimodal understanding to improve alignment with brain language processing? Can text/speech representations from such multimodal models outperform unimodal models? To address these questions, we systematically analyze multiple multimodal models, extracting both text- and speech-based representations to assess their alignment with MEG brain recordings during naturalistic story listening. We find that text embeddings from both multimodal and unimodal models significantly outperform speech embeddings from these models. Specifically, multimodal text embeddings exhibit a peak around 200 ms, suggesting that they benefit from speech embeddings, with heightened activity during this time period. However, speech embeddings from these multimodal models still show a similar alignment compared to their unimodal counterparts, suggesting that they do not gain meaningful semantic benefits over text-based representations. These results highlight an asymmetry in cross-modal knowledge transfer, where the text modality benefits more from speech information, but not vice versa.

Choose Your Words Wisely: Domain-adaptive Masking Makes Language Models Learn Faster
Vanshpreet S. Kohli | Aaron Monis | Radhika Mamidi
Proceedings of the 10th Workshop on Representation Learning for NLP (RepL4NLP-2025)

Foundational Language Models perform significantly better on downstream tasks in specialised domains (such as law, computer science, and medical science) upon being further pre-trained on extensive domain-specific corpora, but this continual pre-training incurs heavy computational costs. Indeed, some of the most performant specialised language models such as BioBERT incur even higher computing costs during domain-specific training than the pre-training cost of the foundational models they are initialised from. In this paper, we argue that much of the extended pre-training is redundant, with models seemingly wasting valuable resources re-learning lexical and semantic patterns already well-represented in their foundational models such as BERT, T5 and GPT. Focusing on Masked Language Models, we introduce a novel domain-specific masking strategy that is designed to facilitate continual learning while minimizing the training cost. Using this approach, we train and present a BERT-based model trained on a biomedical corpus that matches or surpasses traditionally trained biomedical language models in performance across several downstream classification tasks while incurring up to 11 times lower training costs.

The Evolution of Gen Alpha Slang: Linguistic Patterns and AI Translation Challenges
Ishita | Radhika Mamidi
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Generation Alpha (born 2010-2024) is the first generation fully raised within the digital ecosystem. They exhibit unique linguistic behaviours influenced by rampant online communication and platform-specific cultures. This study examines the rapid evolution of Gen Alpha slang through a comparative analysis of Millennial and Gen Z vernacular. We identify three core linguistic patterns: extreme lexical compression, digital culture-driven semantic shifts and part-of-speech conversion. We construct a comprehensive slang corpus sourced from online platforms and evaluate the performance of four AI translation systems (viz. Google Translate, ChatGPT 4, Gemini 1.0, DeepSeek v3) on over 100 slang terms. Our results reveal significant translation challenges rooted in culturally-bound terms from gaming, meme culture, and mental health discourse. Most errors are the result of inadequate cultural contextualization, with literal translations dominating the error patterns. Our findings highlight the critical limitations in current language models and emphasize the need for adaptive, culturally sensitive and context-aware frameworks that can handle the dynamic lexicon of evolving youth vernacular.

Zero at SemEval-2025 Task 11: Multilingual Emotion Classification with BERT Variants: A Comparative Study
Revanth Gundam | Abhinav Marri | Radhika Mamidi
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Emotion detection in text plays a crucial role in NLP applications such as sentiment analysis and feedback analysis. This study tackles two tasks: multi-label emotion detection, where the goal is to classify text based on six emotions (joy, sadness, fear, anger, surprise, and disgust) in a multilingual setting, and emotion intensity prediction, which assigns an ordinal intensity score to each of these emotions. Using the BRIGHTER dataset, a multilingual corpus spanning 28 languages, the paper addresses issues like class imbalances by treating each emotion as an independent binary classification problem. The paper first explores static embeddings such as GloVe with logistic regression classifiers on top of them. To capture contextual nuances more effectively, we fine-tune transformer-based models such as BERT and RoBERTa. Our approach demonstrates the benefits of fine-tuning for improved emotion prediction, while also highlighting the challenges of multilingual and multi-label classification.
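
Treating each emotion as an independent binary classification problem, as the abstract above describes, amounts to splitting one multi-label dataset into six yes/no datasets. The sketch below shows only that data transformation under stated assumptions: the function name `to_binary_tasks`, the `(text, label_set)` input layout, and the example sentences are hypothetical and are not taken from the paper or the BRIGHTER dataset.

```python
EMOTIONS = ("joy", "sadness", "fear", "anger", "surprise", "disgust")


def to_binary_tasks(examples):
    """Split multi-label examples into one binary dataset per emotion.

    `examples` is a list of (text, labels) pairs, where `labels` is a
    collection of emotion names (an assumed layout, for illustration).
    Returns {emotion: [(text, 0 or 1), ...]}.
    """
    tasks = {emotion: [] for emotion in EMOTIONS}
    for text, labels in examples:
        labels = set(labels)
        unknown = labels - set(EMOTIONS)
        if unknown:
            raise ValueError(f"unknown emotion labels: {sorted(unknown)}")
        for emotion in EMOTIONS:
            tasks[emotion].append((text, int(emotion in labels)))
    return tasks


if __name__ == "__main__":
    # Made-up examples; a text may carry several emotions or none.
    data = [
        ("I passed the exam!", {"joy", "surprise"}),
        ("The bus was late again.", {"anger"}),
        ("It is Tuesday.", set()),
    ]
    for emotion, rows in to_binary_tasks(data).items():
        print(emotion, [label for _, label in rows])
```

Each per-emotion dataset can then be given its own classifier and its own handling of class imbalance, since positives for a rare emotion are no longer mixed with the labels of the others.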

Zero at SemEval-2025 Task 2: Entity-Aware Machine Translation: Fine-Tuning NLLB for Improved Named Entity Translation
Revanth Gundam | Abhinav Marri | Advaith Malladi | Radhika Mamidi
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Machine Translation (MT) is an essential tool for communication amongst people across different cultures, yet Named Entity (NE) translation remains a major challenge due to its rarity in occurrence and ambiguity. Traditional approaches, like using lexicons or parallel corpora, often fail to generalize to unseen entities, and hence do not perform well. To address this, we create a silver dataset using the Google Translate API and fine-tune the facebook/nllb200-distilled-600M model with LoRA (Low-Rank Adaptation) to enhance translation accuracy while also maintaining efficient memory use. Evaluated with metrics such as BLEU, COMET, and M-ETA, our results show that fine-tuning a specialized MT model improves NE translation without having to rely on large-scale general-purpose models.

Bridge the GAP: Multi-lingual Models For Ambiguous Pronominal Coreference Resolution in South Asian Languages
Rahothvarman P | Adith John Rajeev | Kaveri Anuranjana | Radhika Mamidi
Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025)

Coreference resolution, the process of determining what a referring expression (a pronoun or a noun phrase) refers to in discourse, is a critical aspect of natural language understanding. However, the development of computational models for coreference resolution in low-resource languages, such as the Dravidian (and more broadly all South Asian) languages, still remains a significant challenge due to the scarcity of annotated corpora in these languages. To address this data scarcity, we adopt a pipeline that translates the English GAP dataset into various South Asian languages, creating a multi-lingual coreference dataset mGAP. Our research aims to leverage this dataset and develop two novel models, namely the joint embedding model and the cross attention model, for coreference resolution with Dravidian languages in mind. We demonstrate that cross-attention captures pronoun-candidate relations better, leading to improved coreference resolution. We also harness the similarity across South Asian languages via transfer learning in order to use high-resource languages to learn coreference for low-resource languages.

IndicSentEval: How Effectively do Multilingual Transformer Models encode Linguistic Properties for Indic Languages?
Akhilesh Aravapalli | Mounika Marreddy | Radhika Mamidi | Manish Gupta | Subba Reddy Oota
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Transformer-based models have revolutionized the field of natural language processing. To understand why they perform so well and to assess their reliability, several studies have focused on questions such as: Which linguistic properties are encoded by these models, and to what extent? How robust are these models in encoding linguistic properties when faced with perturbations in the input text? However, these studies have mainly focused on BERT and the English language. In this paper, we investigate similar questions regarding encoding capability and robustness for 8 linguistic properties across 13 different perturbations in 6 Indic languages, using 9 multilingual Transformer models (7 universal and 2 Indic-specific). To conduct this study, we introduce a novel multilingual benchmark dataset, IndicSentEval, containing approximately 47K sentences. Our probing analysis of surface, syntactic, and semantic properties reveals that, while almost all multilingual models demonstrate consistent encoding performance for English, surprisingly, they show mixed results for Indic languages. As expected, Indic-specific multilingual models capture linguistic properties in Indic languages better than universal models. Intriguingly, universal models broadly exhibit better robustness compared to Indic-specific models, particularly under perturbations such as dropping both nouns and verbs, dropping only verbs, or keeping only nouns. Overall, this study provides valuable insights into probing and perturbation-specific strengths and weaknesses of popular multilingual Transformer-based models for different Indic languages.

2024

Maha Bhaashya at SemEval-2024 Task 6: Zero-Shot Multi-task Hallucination Detection
Patanjali Bhamidipati | Advaith Malladi | Manish Shrivastava | Radhika Mamidi
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

In recent studies, the extensive utilization of large language models has underscored the importance of robust evaluation methodologies for assessing text generation quality and relevance to specific tasks. This has revealed a prevalent issue known as hallucination, an emergent condition in the model where generated text lacks faithfulness to the source and deviates from the evaluation criteria. In this study, we formally define hallucination and propose a framework for its quantitative detection in a zero-shot setting, leveraging our definition and the assumption that model outputs entail task and sample specific inputs. In detecting hallucinations, our solution achieves an accuracy of 0.78 in a model-aware setting and 0.61 in a model-agnostic setting. Notably, our solution maintains computational efficiency, requiring far less computational resources than other SOTA approaches, aligning with the trend towards lightweight and compressed models.

Towards Efficient Audio-Text Keyword Spotting: Quantization and Multi-Scale Linear Attention with Foundation Models
Rahothvarman P | Radhika Mamidi
Proceedings of the 21st International Conference on Natural Language Processing (ICON)

Open Vocabulary Keyword Spotting is essential in numerous applications, from virtual assistants to security systems, as it allows systems to identify specific words or phrases in continuous speech. In this paper, we propose a novel end-to-end method for detecting user-defined open vocabulary keywords by leveraging linguistic patterns for the correlation between audio and text modalities. Our approach utilizes quantized pre-trained foundation models for robust audio embeddings and a unique lightweight Multi-Scale Linear Attention (MSLA) network that aligns speech and text representations for effective cross-modal agreement. We evaluate our method on two distinct datasets, comparing its performance against other baselines. The results highlight the effectiveness of our approach, achieving significant improvements over the Cross-Modality Correspondence Detector (CMCD) method, with a 16.08% increase in AUC and a 17.2% reduction in EER metrics on the Google Speech Commands dataset. These findings demonstrate the potential of our method to advance keyword spotting across various real-world applications.

Survey on Computational Approaches to Implicature
Kaveri Anuranjana | Srihitha Mallepally | Sriharshitha Mareddy | Amit Shukla | Radhika Mamidi
Proceedings of the 21st International Conference on Natural Language Processing (ICON)

This paper explores the concept of solving implicature in Natural Language Processing (NLP), highlighting its significance in understanding indirect communication. Drawing on foundational theories by Austin, Searle, and Grice, we discuss how implicature extends beyond literal language to convey nuanced meanings. We review existing datasets, including the Pragmatic Understanding Benchmark (PUB), that assess models’ capabilities in recognizing and interpreting implicatures. Despite recent advances in large language models (LLMs), challenges remain in effectively processing implicature due to limitations in training data and the complexities of contextual interpretation. We propose future directions for research, including the enhancement of datasets and the integration of pragmatic reasoning tasks, to improve LLMs’ understanding of implicature and facilitate better human-computer interaction.

Mast Kalandar at SemEval-2024 Task 8: On the Trail of Textual Origins: RoBERTa-BiLSTM Approach to Detect AI-Generated Text
Jainit Bafna | Hardik Mittal | Suyash Sethia | Manish Shrivastava | Radhika Mamidi
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

Large Language Models (LLMs) have showcased impressive abilities in generating fluent responses to diverse user queries. However, concerns regarding the potential misuse of such texts in journalism, educational, and academic contexts have surfaced. SemEval 2024 introduces the task of Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection, aiming to develop automated systems for identifying machine-generated text and detecting potential misuse. In this paper, we i) propose a RoBERTa-BiLSTM based classifier designed to classify text into two categories: AI-generated or human, and ii) conduct a comparative study of our model with baseline approaches to evaluate its effectiveness. This paper contributes to the advancement of automatic text detection systems in addressing the challenges posed by machine-generated text misuse. Our architecture ranked 46th on the official leaderboard with an accuracy of 80.83 among 125.

Context and WSD: Analysing Google Translate’s Sanskrit to English Output of Bhagavadgītā Verses for Word Meaning
Anagha Pradeep | Radhika Mamidi | Pavankumar Satuluri
Proceedings of the 7th International Sanskrit Computational Linguistics Symposium

Automating Humor: A Novel Approach to Joke Generation Using Template Extraction and Infilling
Mayank Goel | Parameswari Krishnamurthy | Radhika Mamidi
Proceedings of the 21st International Conference on Natural Language Processing (ICON)

This paper presents a novel approach to humor generation in natural language processing by automating the creation of jokes through template extraction and infilling. Traditional methods have relied on predefined templates or neural network models, which either lack complexity or fail to produce genuinely humorous content. Our method introduces a technique to extract templates from existing jokes based on semantic salience and BERT’s attention weights. We then infill these templates using advanced techniques, through BERT and large language models (LLMs) like GPT-4, to generate new jokes. Our results indicate that the generated jokes are novel and human-like, with BERT showing promise in generating funny content and GPT-4 excelling in creating clever jokes. The study contributes to a deeper understanding of humor generation and the potential of AI in creative domains.

Weighted Layer Averaging RoBERTa for Black-Box Machine-Generated Text Detection
Ayan Datta | Aryan Chandramania | Radhika Mamidi
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

We propose a novel approach for machine-generated text detection using a RoBERTa model with weighted layer averaging and AdaLoRA for parameter-efficient fine-tuning. Our method incorporates information from all model layers, capturing diverse linguistic cues beyond those accessible from the final layer alone. To mitigate potential overfitting and improve generalizability, we leverage AdaLoRA, which injects trainable low-rank matrices into each Transformer layer, significantly reducing the number of trainable parameters. Furthermore, we employ data mixing to ensure our model encounters text from various domains and generators during training, enhancing its ability to generalize to unseen data. This work highlights the potential of combining layer-wise information with parameter-efficient fine-tuning and data mixing for effective machine-generated text detection.
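
Weighted layer averaging, as named in the abstract above, combines the hidden states of all layers into a single representation using one scalar weight per layer. The following is a minimal pure-Python sketch of that idea on toy vectors, not the authors' code. Normalizing the per-layer scores with a softmax is an assumption (a common choice for scalar mixing); the function names and the toy inputs are hypothetical, and in a real model the scores would be trainable parameters.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_average(layer_vectors, layer_scores):
    """Mix per-layer vectors with softmax-normalized scalar weights.

    `layer_vectors`: one equal-length vector per layer (lists of floats).
    `layer_scores`: one raw score per layer; assumed to be learned in
    practice, but passed in directly here for illustration.
    """
    if not layer_vectors or len(layer_vectors) != len(layer_scores):
        raise ValueError("need exactly one score per layer vector")
    dim = len(layer_vectors[0])
    if any(len(vec) != dim for vec in layer_vectors):
        raise ValueError("all layer vectors must have the same length")
    weights = softmax(layer_scores)
    return [
        sum(w * vec[i] for w, vec in zip(weights, layer_vectors))
        for i in range(dim)
    ]


if __name__ == "__main__":
    # Two toy "layers" with equal scores reduce to a plain mean.
    print(weighted_layer_average([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0]))
```

With equal scores the mix is a simple mean of the layers; as one score grows relative to the others, the output approaches that layer's vector, so using only the final layer is a limiting case of the same scheme.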

Towards Enhancing Knowledge Accessibility for Low-Resource Indian Languages: A Template Based Approach
Srijith Padakanti | Akhilesh Aravapalli | Abhijith Chelpuri | Radhika Mamidi
Proceedings of the 21st International Conference on Natural Language Processing (ICON)

In today’s digital age, access to knowledge and information is crucial for societal growth. Although widespread resources like Wikipedia exist, there is still a linguistic barrier to break down for low-resource languages. In India, millions of individuals still lack access to reliable information from Wikipedia because they are only proficient in their regional language. To address this gap, our work focuses on enhancing the content and digital footprint of multiple Indian languages. The primary objective of our work is to improve knowledge accessibility by generating a substantial volume of high-quality Wikipedia articles in Telugu, a widely spoken language in India with around 95.7 million native speakers. Our work aims to create Wikipedia articles and to ensure that each article meets necessary quality standards such as a minimum word count, inclusion of images for reference, and an infobox. Our work also adheres to the five core principles of Wikipedia. We streamline our article generation process, leveraging NLP techniques such as translation, transliteration, and template generation and incorporating human intervention when necessary. Our contribution is a collection of 8,929 articles in the movie domain, now ready to be published on Telugu Wikipedia.

2023

Billy-Batson at SemEval-2023 Task 5: An Information Condensation based System for Clickbait Spoiling
Anubhav Sharma | Sagar Joshi | Tushar Abhishek | Radhika Mamidi | Vasudeva Varma
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

The Clickbait Challenge targets spoiling clickbait using short pieces of information, known as spoilers, to satisfy the curiosity induced by a clickbait post. The large context of the article associated with the clickbait and the differences in spoiler forms make the task challenging. Hence, to tackle the large context, we propose an Information Condensation-based approach, which prunes down the unnecessary context. Given an article, our filtering module, optimised with a contrastive learning objective, first selects the paragraphs that are the most relevant to the corresponding clickbait. The resulting condensed article is then fed to the two downstream tasks of spoiler type classification and spoiler generation. We demonstrate and analyze the gains from this approach on both tasks. Overall, we win the task of spoiler type classification and achieve competitive results on spoiler generation.

Blind Leading the Blind: A Social-Media Analysis of the Tech Industry
Tanishq Chaudhary | Pulak Malhotra | Radhika Mamidi | Ponnurangam Kumaraguru
Proceedings of the 20th International Conference on Natural Language Processing (ICON)

Online social networks (OSNs) have changed the way we perceive careers. A standard screening process for employees now involves profile checks on LinkedIn, X, and other platforms, with any negative opinions scrutinized. Blind, an anonymous social networking platform, aims to satisfy this growing need for taboo workplace discourse. In this paper, for the first time, we present a large-scale empirical text-based analysis of the Blind platform. We acquire and release two novel datasets: 63k Blind Company Reviews and 767k Blind Posts, containing over seven years of industry data. Using these, we analyze the Blind network, study drivers of engagement, and obtain insights into the last eventful years, preceding, during, and post-COVID-19, accounting for the modern phenomena of work-from-home, return-to-office, and the layoffs surrounding the crisis. Finally, we leverage the unique richness of the Blind content and propose a novel content classification pipeline to automatically retrieve and annotate relevant career and industry content across other platforms. We achieve an accuracy of 99.25% for filtering out relevant content, 78.41% for fine-grained annotation, and 98.29% for opinion mining, demonstrating the high practicality of our software.

Transformer-based Context Aware Morphological Analyzer for Telugu
Priyanka Dasari | Abhijith Chelpuri | Nagaraju Vuppala | Mounika Marreddy | Parameshwari Krishnamurthy | Radhika Mamidi
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages

This paper addresses the challenges faced by Indian languages in leveraging deep learning for natural language processing (NLP) due to limited resources, annotated datasets, and Transformer-based architectures. We specifically focus on Telugu and aim to construct a Telugu morph analyzer dataset comprising 10,000 sentences. Furthermore, we assess the performance of established multi-lingual Transformer models (m-Bert, XLM-R, IndicBERT) and mono-lingual Transformer models trained from scratch on an extensive Telugu corpus comprising 80,15,588 sentences (BERT-Te). Our findings demonstrate the efficacy of Transformer-based representations pretrained on Telugu data in improving the performance of the Telugu morph analyzer, surpassing existing multi-lingual approaches. This highlights the necessity of developing dedicated corpora, annotated datasets, and machine learning models in a mono-lingual setting. We present benchmark results for the Telugu morph analyzer achieved through simple fine-tuning on our dataset.

DAP-LeR-DAug: Techniques for enhanced Online Sexism Detection
Jayant Panwar | Radhika Mamidi
Proceedings of the 6th International Conference on Natural Language and Speech Processing (ICNLSP 2023)

Automatically Generating Hindi Wikipedia Pages Using Wikidata as a Knowledge Graph: A Domain-Specific Template Sentences Approach
Aditya Agarwal | Radhika Mamidi
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

This paper presents a method for automatically generating Wikipedia articles in the Hindi language, using Wikidata as a knowledge base. Our method extracts structured information from Wikidata, such as the names of entities, their properties, and their relationships, and then uses this information to generate natural language text that conforms to a set of templates designed for the domain of interest. We evaluate our method by generating articles about scientists, and we compare the resulting articles to machine-translated articles. Our results show that more than 70% of the generated articles using our method are better in terms of coherence, structure, and readability. Our approach has the potential to significantly reduce the time and effort required to create Wikipedia articles in Hindi and could be extended to other languages and domains as well.

Enhancing Code-mixed Text Generation Using Synthetic Data Filtering in Neural Machine Translation
Dama Sravani | Radhika Mamidi
Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)

Code-Mixing, the act of mixing two or more languages, is a common communicative phenomenon in multi-lingual societies. The lack of quality in code-mixed data is a bottleneck for NLP systems. On the other hand, monolingual systems perform well due to ample high-quality data. To bridge the gap, creating coherent translations of monolingual sentences to their code-mixed counterparts can improve accuracy in code-mixed settings for NLP downstream tasks. In this paper, we propose a neural machine translation approach to generate high-quality code-mixed sentences by leveraging human judgements. We train filters based on human judgements to identify natural code-mixed sentences from a larger synthetically generated code-mixed corpus, resulting in a three-way silver parallel corpus between monolingual English, monolingual Indian language and code-mixed English with an Indian language. Using these corpora, we fine-tune multi-lingual encoder-decoder models, viz. mT5 and mBART, for the translation task. Our results indicate that our approach of using filtered data for training outperforms the current systems for code-mixed generation in Hindi-English. Apart from Hindi-English, the approach performs well when applied to Telugu, a low-resource language, to generate Telugu-English code-mixed sentences.

Matt Bai at SemEval-2023 Task 5: Clickbait spoiler classification via BERT
Nukit Tailor | Radhika Mamidi
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

The Clickbait Spoiling shared task aims at tackling two aspects of spoiling: classifying the spoiler type based on its length and generating the spoiler. This paper focuses on the task of classifying the spoiler type. Better classification of the spoiler type would eventually help in generating a better spoiler for the post. We use BERT-base (cased) to classify the clickbait posts. The model achieves a balanced accuracy of 0.63 as we give only the post content as the input to our model instead of the concatenation of the post title and post content to find out the differences that the post title might be bringing in.
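For readers unfamiliar with the metric, balanced accuracy is the mean of per-class recall, which keeps a frequent class from dominating the score. The sketch below is a plain-Python illustration; the labels and predictions are toy values, not data from the shared task.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Toy labels (hypothetical): three spoiler types of unequal frequency.
y_true = ["phrase", "phrase", "phrase", "phrase", "passage", "passage", "multi"]
y_pred = ["phrase", "phrase", "phrase", "passage", "passage", "phrase", "multi"]

score = balanced_accuracy(y_true, y_pred)
print(score)
```

Here the per-class recalls are 3/4, 1/2, and 1/1, so the balanced accuracy is their mean, while plain accuracy would be 5/7.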

CoPara: The First Dravidian Paragraph-level n-way Aligned Corpus
Nikhil E | Mukund Choudhary | Radhika Mamidi
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages

We present CoPara, the first publicly available paragraph-level (n-way aligned) multilingual parallel corpus for Dravidian languages. The collection contains 2856 paragraph/passage pairs between English and four Dravidian languages. We source the parallel paragraphs from the New India Samachar magazine and align them with English as a pivot language. We do human and artificial evaluations to validate the high-quality alignment and richness of the parallel paragraphs of a range of lengths. To show one of the many ways this dataset can be wielded, we finetuned IndicBART, a seq2seq NMT model, on all XX-En pairs of languages in CoPara; the resulting model performs better than existing sentence-level models on standard benchmarks (like BLEU) on sentence-level translations and longer text too. We show how this dataset can enrich a model trained for a task like this, with more contextual cues and beyond-sentence understanding even in low-resource settings like that of Dravidian languages. Finally, the dataset and models are made available publicly at CoPara to help advance research in Dravidian NLP, parallel multilingual, and beyond sentence-level tasks like NMT, etc.

GSAC: A Gujarati Sentiment Analysis Corpus from Twitter
Monil Gokani | Radhika Mamidi
Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis

Sentiment Analysis is an important task for analysing online content across languages for tasks such as content moderation and opinion mining. Though a significant amount of resources are available for Sentiment Analysis in several Indian languages, there do not exist any large-scale, open-access corpora for Gujarati. Our paper presents and describes the Gujarati Sentiment Analysis Corpus (GSAC), which has been sourced from Twitter and manually annotated by native speakers of the language. We describe in detail our collection and annotation processes and conduct extensive experiments on our corpus to provide reliable baselines for future work using our dataset.

PanwarJayant at SemEval-2023 Task 10: Exploring the Effectiveness of Conventional Machine Learning Techniques for Online Sexism Detection
Jayant Panwar | Radhika Mamidi
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

The rapid growth of online communication using social media platforms has led to an increase in the presence of hate speech, especially in terms of sexist language online. The proliferation of such hate speech has a significant impact on the mental health and well-being of the users and hence the need for automated systems to detect and filter such texts. In this study, we explore the effectiveness of conventional machine learning techniques for detecting sexist text. We explore five conventional classifiers, namely, Logistic Regression, Decision Tree, XGBoost, Support Vector Machines, and Random Forest. The results show that different classifiers perform differently on each task due to their different inherent architectures which may be suited to a certain problem more. These models are trained on the shared task dataset, which includes both sexist and non-sexist texts. All in all, this study explores the potential of conventional machine learning techniques in detecting online sexist content. The results of this study highlight the strengths and weaknesses of all classifiers with respect to all subtasks. The results of this study will be useful for researchers and practitioners interested in developing systems for detecting or filtering online hate speech.

Text-2-Wiki: Summarization and Template-driven Article Generation
Jayant Panwar | Radhika Mamidi
Proceedings of the 20th International Conference on Natural Language Processing (ICON)

Users on Wikipedia collaborate in a structured and organized manner to publish and update articles on numerous topics, which makes Wikipedia a very rich source of knowledge. English Wikipedia has the most information available (more than 6.7 million articles); however, there are few good informative articles on Wikipedia in Indian languages. Hindi Wikipedia has only approximately 160k articles. The same article in Hindi can be vastly different from its English version and generally contains less information. This poses a problem for native Indian language speakers who are not proficient in English. Therefore, having the same amount of information in Indian languages will help promote knowledge among those who are not well-versed in English. Publishing the articles manually, like the usual process in Global English Wikipedia, is a time-consuming process. To bring the amount of information in native Indian languages up to speed with the amount of information in English, automating the whole article generation process is the best option. In this study, we present a stage-wise approach ranging from Data Collection to Summarization and Translation, and finally ending with Template Creation. This approach ensures the efficient generation of a large amount of content in Hindi Wikipedia in less time. With the help of this study, we were able to successfully generate more than a thousand articles in Hindi Wikipedia with ease.

Witcherses at SemEval-2023 Task 12: Ensemble Learning for African Sentiment Analysis
Monil Gokani | K V Aditya Srivatsa | Radhika Mamidi
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

This paper describes our system submission for SemEval-2023 Task 12 AfriSenti-SemEval: Sentiment Analysis for African Languages. We propose an XGBoost-based ensemble model trained on emoticon frequency-based features and the predictions of several statistical models such as SVMs, Logistic Regression, Random Forests, and BERT-based pre-trained language models such as AfriBERTa and AfroXLMR. We also report results from additional experiments not in the system. Our system achieves a mixed bag of results, achieving a best rank of 7th in three of the languages - Igbo, Twi, and Yoruba.

2022

English To Indian Sign Language:Rule-Based Translation System Along With Multi-Word Expressions and Synonym Substitution
Abhigyan Ghosh | Radhika Mamidi
Proceedings of the 19th International Conference on Natural Language Processing (ICON)

The hearing-challenged communities all over the world face difficulties in communicating with others. Machine translation has been one of the prominent technologies to facilitate communication with the deaf and hard of hearing community worldwide. We have explored and formulated the fundamental rules of Indian Sign Language (ISL) and implemented them as a translation mechanism of English Text to Indian Sign Language glosses. According to the formulated rules and sub-rules, the source text structure is identified and transferred to the target ISL gloss. This target language is such that it can be easily converted to videos using the Indian Sign Language dictionary. This research work also mentions the intermediate phases of the transfer process and innovations in the process such as Multi-Word Expression detection and synonym substitution to handle the limited vocabulary size of Indian Sign Language while producing semantically accurate translations.

DepressionOne@LT-EDI-ACL2022: Using Machine Learning with SMOTE and Random UnderSampling to Detect Signs of Depression on Social Media Text.
Suman Dowlagar | Radhika Mamidi
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

Depression is a common and serious medical illness that negatively affects how you feel, the way you think, and how you act. Detecting depression is essential as it must be treated early to avoid painful consequences. Nowadays, people are broadcasting how they feel via posts and comments. Using social media, we can extract many comments related to depression and use NLP techniques to train and detect depression. This work presents the submission of the DepressionOne team at LT-EDI-2022 for the shared task, detecting signs of depression from social media text. The depression data is small and unbalanced. Thus, we have used oversampling and undersampling methods such as SMOTE and RandomUnderSampler to represent the data. Later, we used machine learning methods to train and detect the signs of depression.
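The two resampling ideas named above can be sketched in plain Python. This is a simplified illustration, not the team's pipeline: the function names and toy points are hypothetical, and the oversampler uses only the single nearest neighbour, whereas SMOTE normally picks among k nearest neighbours.

```python
import random


def random_undersample(majority, target_size, rng):
    """Keep a random subset of the majority class."""
    return rng.sample(majority, target_size)


def smote_like_oversample(minority, target_size, rng):
    """Add synthetic points on the segment between a minority sample
    and its nearest minority neighbour (k = 1 for brevity)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = list(minority)
    while len(synthetic) < target_size:
        base = rng.choice(minority)
        neighbour = min((p for p in minority if p is not base),
                        key=lambda p: sq_dist(base, p))
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy 2-D feature vectors (hypothetical): 10 majority and 3 minority points.
majority = [(float(i), float(i % 3)) for i in range(10)]
minority = [(0.0, 5.0), (1.0, 6.0), (2.0, 5.5)]

rng = random.Random(0)
balanced_majority = random_undersample(majority, 6, rng)
balanced_minority = smote_like_oversample(minority, 6, rng)
print(len(balanced_majority), len(balanced_minority))
```

Because each synthetic point is an interpolation between two real minority points, it stays inside the region the minority class already occupies.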

Towards Detecting Political Bias in Hindi News Articles
Samyak Agrawal | Kshitij Gupta | Devansh Gautam | Radhika Mamidi
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Political propaganda in recent times has been amplified by media news portals through biased reporting, creating untruthful narratives on serious issues and causing misinformed public opinions, with the interest of siding with and helping a particular political party. This issue poses a challenging NLP task of detecting political bias in news articles. We propose a transformer-based transfer learning method to fine-tune the pre-trained network on our data for this bias detection. As the required dataset for this particular task was not available, we created our dataset comprising 1388 Hindi news articles and their headlines from various Hindi news media outlets. We marked them on whether they are biased towards, against, or neutral to BJP, a political party, and the current ruling party at the centre in India.

CMNEROne at SemEval-2022 Task 11: Code-Mixed Named Entity Recognition by leveraging multilingual data
Suman Dowlagar | Radhika Mamidi
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

Identifying named entities is, in general, a practical and challenging task in the field of Natural Language Processing. Named Entity Recognition on the code-mixed text is further challenging due to the linguistic complexity resulting from the nature of the mixing. This paper addresses the submission of team CMNEROne to the SEMEVAL 2022 shared task 11 MultiCoNER. The Code-mixed NER task aimed to identify named entities on the code-mixed dataset. Our work consists of Named Entity Recognition (NER) on the code-mixed dataset by leveraging the multilingual data. We achieved a weighted average F1 score of 0.7044, i.e., 6% greater than the NER baseline.

LastResort at SemEval-2022 Task 4: Towards Patronizing and Condescending Language Detection using Pre-trained Transformer Based Models Ensembles
Samyak Agrawal | Radhika Mamidi
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

This paper presents our solution systems for Task 4 at SemEval-2022: Patronizing and Condescending Language Detection. This shared task contains two sub-tasks. The first sub-task is a binary classification task whose goal is to predict whether a given paragraph contains any form of patronising or condescending language (PCL). For the second sub-task, given a paragraph, we have to find which PCL categories express the condescension. Here we have a total of 7 overlapping sub-categories for PCL. Our proposed solution uses BERT-based ensembled models with hard voting and techniques applied to take care of class imbalances. Our paper describes the system architecture of the submitted solution and other experiments that we conducted.
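Hard voting, as mentioned above, means each model casts one vote per example and the most frequent label wins. A minimal sketch follows; the three prediction lists are hypothetical stand-ins for the outputs of separately fine-tuned classifiers.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Return the majority label for each example across models."""
    voted = []
    for labels in zip(*predictions_per_model):
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


# Hypothetical binary predictions (1 = PCL, 0 = not PCL) from three models.
model_a = [1, 0, 1, 0]
model_b = [1, 1, 0, 0]
model_c = [0, 1, 1, 0]

ensemble = hard_vote([model_a, model_b, model_c])
print(ensemble)
```

Using an odd number of models avoids ties in the binary case; with ties, this sketch falls back to the label seen first.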

LastResort at SemEval-2022 Task 5: Towards Misogyny Identification using Visual Linguistic Model Ensembles And Task-Specific Pretraining
Samyak Agrawal | Radhika Mamidi
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

In current times, memes have become one of the most popular mediums to share jokes and information with the masses over the internet. Memes can also be used as tools to spread hatred and target women through degrading content disguised as humour. The task, Multimedia Automatic Misogyny Identification (MAMI), is to detect misogyny in these memes. This task is further divided into two sub-tasks: (A) Misogynous meme identification, where a meme should be categorized either as misogynous or not misogynous and (B) Categorizing these misogynous memes into potential overlapping subcategories. In this paper, we propose models leveraging task-specific pretraining with transfer learning on Visual Linguistic models. Our best performing models scored 0.686 and 0.691 on sub-tasks A and B respectively.

TeluguNER: Leveraging Multi-Domain Named Entity Recognition with Deep Transformers
Suma Reddy Duggenpudi | Subba Reddy Oota | Mounika Marreddy | Radhika Mamidi
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Named Entity Recognition (NER) is a successful and well-researched problem in English due to the availability of resources. The transformer models, specifically the masked-language models (MLM), have shown remarkable performance in NER during recent times. With growing data in different online platforms, there is a need for NER in other languages too. NER remains underexplored in Indian languages due to the lack of resources and tools. Our contributions in this paper include (i) two annotated NER datasets for the Telugu language in multiple domains: Newswire Dataset (ND) and Medical Dataset (MD), which we combined to form the Combined Dataset (CD); (ii) a comparison of the finetuned Telugu pretrained transformer models (BERT-Te, RoBERTa-Te, and ELECTRA-Te) with other baseline models (CRF, LSTM-CRF, and BiLSTM-CRF); (iii) further investigation of the performance of Telugu pretrained transformer models against the multilingual models mBERT, XLM-R, and IndicBERT. We find that pretrained Telugu language models (BERT-Te and RoBERTa) outperform the existing pretrained multilingual and baseline models in NER. On a large dataset (CD) of 38,363 sentences, the BERT-Te achieves a high F1-score of 0.80 (entity-level) and 0.75 (token-level). Further, these pretrained Telugu models have shown state-of-the-art performance on various existing Telugu NER datasets. We open-source our dataset, pretrained models, and code.

Sammaan@LT-EDI-ACL2022: Ensembled Transformers Against Homophobia and Transphobia
Ishan Sanjeev Upadhyay | Kv Aditya Srivatsa | Radhika Mamidi
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

Hateful and offensive content on social media platforms can have negative effects on users and can make online communities more hostile towards certain people and hamper equality, diversity and inclusion. In this paper, we describe our approach to classify homophobia and transphobia in social media comments. We used an ensemble of transformer-based models to build our classifier. Our model ranked 2nd for English, 8th for Tamil and 10th for Tamil-English.

Towards Toxic Positivity Detection
Ishan Sanjeev Upadhyay | KV Aditya Srivatsa | Radhika Mamidi
Proceedings of the Tenth International Workshop on Natural Language Processing for Social Media

Over the past few years, there has been a growing concern around toxic positivity on social media which is a phenomenon where positivity is used to minimize one’s emotional experience. In this paper, we create a dataset for toxic positivity classification from Twitter and an inspirational quote website. We then perform benchmarking experiments using various text classification models and show the suitability of these models for the task. We achieved a macro F1 score of 0.71 and a weighted F1 score of 0.85 by using an ensemble model. To the best of our knowledge, our dataset is the first such dataset created.
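The gap between the macro and weighted F1 scores reported above is typical of imbalanced data: macro F1 averages classes equally, while weighted F1 scales each class by its support. The sketch below uses toy labels (not the paper's data) to show how the two can diverge.

```python
def f1_scores(y_true, y_pred):
    """Return (macro F1, support-weighted F1) over the labels in y_true."""
    labels = sorted(set(y_true))
    per_class, support = {}, {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class[label] = 2 * tp / denom if denom else 0.0
        support[label] = sum(t == label for t in y_true)
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[l] * support[l] for l in labels) / len(y_true)
    return macro, weighted


# Toy imbalanced labels (hypothetical): 8 "other" examples, 2 "toxic" examples.
y_true = ["other"] * 8 + ["toxic"] * 2
y_pred = ["other"] * 8 + ["toxic", "other"]

macro_f1, weighted_f1 = f1_scores(y_true, y_pred)
print(round(macro_f1, 3), round(weighted_f1, 3))
```

The rare class has the lower F1 here, so it pulls the macro average down more than the weighted one.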

2021

Corpus Creation and Language Identification in Low-Resource Code-Mixed Telugu-English Text
Siva Subrahamanyam Varma Kusampudi | Anudeep Chaluvadi | Radhika Mamidi
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Code-Mixing (CM) is a common phenomenon in multilingual societies. CM plays a significant role in technology and medical fields where terminologies in the native language are not available or known. Language Identification (LID) of the CM data will help solve NLP tasks such as Spell Checking, Named Entity Recognition, Part-Of-Speech tagging, and Semantic Parsing. In the current era of machine learning, a common problem to the above-mentioned tasks is the availability of Learning data to train models. In this paper, we introduce two Telugu-English CM manually annotated datasets (Twitter dataset and Blog dataset). The Twitter dataset contains more romanization variability and misspelled words than the blog dataset. We compare across various classification models and perform extensive bench-marking using both Classical and Deep Learning Models for LID compared to existing models. We propose two architectures for language classification (Telugu and English) in CM data: (1) Word Level Classification (2) Sentence Level word-by-word Classification and compare these approaches presenting two strong baselines for LID on these datasets.

Towards Sentiment Analysis of Tobacco Products’ Usage in Social Media
Venkata Himakar Yanamandra | Kartikey Pant | Radhika Mamidi
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Contemporary tobacco-related studies are mostly concerned with a single social media platform while missing out on a broader audience. Moreover, they are heavily reliant on labeled datasets, which are expensive to make. In this work, we explore sentiment and product identification on tobacco-related text from two social media platforms. We release SentiSmoke-Twitter and SentiSmoke-Reddit datasets, along with a comprehensive annotation schema for identifying tobacco products’ sentiment. We then perform benchmarking text classification experiments using state-of-the-art models, including BERT, RoBERTa, and DistilBERT. Our experiments show F1 scores as high as 0.72 for sentiment identification in the Twitter dataset, 0.46 for sentiment identification, and 0.57 for product identification using semi-supervised learning for Reddit.

TEASER: Towards Efficient Aspect-based SEntiment Analysis and Recognition
Vaibhav Bajaj | Kartikey Pant | Ishan Upadhyay | Srinath Nair | Radhika Mamidi
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Sentiment analysis aims to detect the overall sentiment, i.e., the polarity of a sentence, paragraph, or text span, without considering the entities mentioned and their aspects. Aspect-based sentiment analysis aims to extract the aspects of the given target entities and their respective sentiments. Prior works formulate this as a sequence tagging problem or solve this task using a span-based extract-then-classify framework where first all the opinion targets are extracted from the sentence, and then with the help of span representations, the targets are classified as positive, negative, or neutral. The sequence tagging problem suffers from issues like sentiment inconsistency and colossal search space. Whereas, Span-based extract-then-classify framework suffers from issues such as half-word coverage and overlapping spans. To overcome this, we propose a similar span-based extract-then-classify framework with a novel and improved heuristic. Experiments on the three benchmark datasets (Restaurant14, Laptop14, Restaurant15) show our model consistently outperforms the current state-of-the-art. Moreover, we also present a novel supervised movie reviews dataset (Movie20) and a pseudo-labeled movie reviews dataset (moviesLarge) made explicitly for this task and report the results on the novel Movie20 dataset as well.

EDIOne@LT-EDI-EACL2021: Pre-trained Transformers with Convolutional Neural Networks for Hope Speech Detection.
Suman Dowlagar | Radhika Mamidi
Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion

Hope is an essential aspect of mental health stability and recovery in every individual in this fast-changing world. Any tools and methods developed for detection, analysis, and generation of hope speech will be beneficial. In this paper, we propose a model on hope-speech detection to automatically detect web content that may play a positive role in diffusing hostility on social media. We perform the experiments by taking advantage of pre-processing and transfer-learning models. We observed that the pre-trained multilingual-BERT model with convolution neural networks gave the best results. Our model ranked first, third, and fourth on the English, Malayalam-English, and Tamil-English code-mixed datasets, respectively.

Graph Convolutional Networks with Multi-headed Attention for Code-Mixed Sentiment Analysis
Suman Dowlagar | Radhika Mamidi
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages

Code-mixing is a frequently observed phenomenon in multilingual communities where a speaker uses multiple languages in an utterance or sentence. Code-mixed texts are abundant, especially in social media, and pose a problem for NLP tools as they are typically trained on monolingual corpora. Recently, finding the sentiment from code-mixed text has been attempted by some researchers in SentiMix SemEval 2020 and Dravidian-CodeMix FIRE 2020 shared tasks. Mostly, the attempts include traditional methods, long short term memory, convolutional neural networks, and transformer models for code-mixed sentiment analysis (CMSA). However, no study has explored graph convolutional neural networks on CMSA. In this paper, we propose the graph convolutional networks (GCN) for sentiment analysis on code-mixed text. We have used the datasets from the Dravidian-CodeMix FIRE 2020. Our experimental results on multiple CMSA datasets demonstrate that the GCN with multi-headed attention model has shown an improvement in classification metrics.

Autobots@LT-EDI-EACL2021: One World, One Family: Hope Speech Detection with BERT Transformer Model
Sunil Gundapu | Radhika Mamidi
Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion

The rapid rise of online social networks like YouTube, Facebook, Twitter allows people to express their views more widely online. However, at the same time, it can lead to an increase in conflict and hatred among consumers in the form of freedom of speech. Therefore, it is essential to take a positive strengthening method to research on encouraging, positive, helping, and supportive social media content. In this paper, we describe a Transformer-based BERT model for Hope speech detection for equality, diversity, and inclusion, submitted for LT-EDI-2021 Task 2. Our model achieves a weighted averaged f1-score of 0.93 on the test set.

How do different factors Impact the Inter-language Similarity? A Case Study on Indian languages
Sourav Kumar | Salil Aggarwal | Dipti Misra Sharma | Radhika Mamidi
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop

India is one of the most linguistically diverse nations of the world and is culturally very rich. Most of these languages are somewhat similar to each other on account of sharing a common ancestry or being in contact for a long period of time. Nowadays, researchers are constantly putting efforts in utilizing the language relatedness to improve the performance of various NLP systems such as cross lingual semantic search, machine translation, sentiment analysis systems, etc. So in this paper, we performed an extensive case study on similarity involving languages of the Indian subcontinent. Language similarity prediction is defined as the task of measuring how similar the two languages are on the basis of their lexical, morphological and syntactic features. In this study, we concentrate only on the approach to calculate lexical similarity between Indian languages by looking at various factors such as size and type of corpus, similarity algorithms, subword segmentation, etc. The main takeaways from our work are: (i) Relative order of the language similarities largely remain the same, regardless of the factors mentioned above, (ii) Similarity within the same language family is higher, (iii) Languages share more lexical features at the subword level.
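Takeaway (iii) can be illustrated by comparing vocabulary overlap at the word level and at the character n-gram level. The abstract does not name the similarity algorithms used, so the Jaccard measure, the trigram size, and the romanized toy word lists below are all assumptions chosen for illustration.

```python
def char_ngrams(word, n=3):
    """Set of character n-grams in a word (no padding)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)


def subword_vocab(words, n=3):
    """Union of character n-grams over a word list."""
    vocab = set()
    for word in words:
        vocab |= char_ngrams(word, n)
    return vocab


# Toy romanized word lists (hypothetical), standing in for two related languages.
lang_a = {"pani", "ghar", "kitab"}
lang_b = {"paani", "ghar", "pustak"}

word_sim = jaccard(lang_a, lang_b)
subword_sim = jaccard(subword_vocab(lang_a), subword_vocab(lang_b))
print(word_sim, subword_sim)
```

In this toy case only one whole word is shared, but near-cognates such as the first pair still share a trigram, so the subword overlap is higher.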

Sentiment Analysis in Code-Mixed Telugu-English Text with Unsupervised Data Normalization
Siva Subrahamanyam Varma Kusampudi | Preetham Sathineni | Radhika Mamidi
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

In a multilingual society, people communicate in more than one language, leading to Code-Mixed data. Sentiment analysis on Code-Mixed Telugu-English Text (CMTET) poses unique challenges. The unstructured nature of the Code-Mixed Data is due to the informal language, informal transliterations, and spelling errors. In this paper, we introduce an annotated dataset for Sentiment Analysis in CMTET. Also, we report an accuracy of 80.22% on this dataset using novel unsupervised data normalization with a Multilayer Perceptron (MLP) model. This proposed data normalization technique can be extended to any NLP task involving CMTET. Further, we report an increase of 2.53% accuracy due to this data normalization approach in our best model.

ViTA: Visual-Linguistic Translation by Aligning Object Tags
Kshitij Gupta | Devansh Gautam | Radhika Mamidi
Proceedings of the 8th Workshop on Asian Translation (WAT2021)

Multimodal Machine Translation (MMT) enriches the source text with visual information for translation. It has gained popularity in recent years, and several pipelines have been proposed in the same direction. Yet, the task lacks quality datasets to illustrate the contribution of visual modality in the translation systems. In this paper, we propose our system under the team name Volta for the Multimodal Translation Task of WAT 2021 from English to Hindi. We also participate in the textual-only subtask of the same language pair for which we use mBART, a pretrained multilingual sequence-to-sequence model. For multimodal translation, we propose to enhance the textual input by bringing the visual information to a textual domain by extracting object tags from the image. We also explore the robustness of our system by systematically degrading the source text. Finally, we achieve a BLEU score of 44.6 and 51.6 on the test set and challenge set of the multimodal task.

Developing Conversational Data and Detection of Conversational Humor in Telugu
Vaishnavi Pamulapati | Radhika Mamidi
Proceedings of the 2nd Workshop on Computational Approaches to Discourse

In the field of humor research, there has been a recent surge of interest in the sub-domain of Conversational Humor (CH). This study has two main objectives: (a) to develop a conversational (humorous and non-humorous) dataset in Telugu, and (b) to detect CH in the compiled dataset. In this paper, the challenges faced while collecting the data and the experiments carried out are elucidated. Transfer learning and non-transfer learning techniques are implemented by utilizing pre-trained models such as FastText word embeddings, BERT language models and Text GCN, which learns the word and document embeddings of the given corpus simultaneously. State-of-the-art results are observed with a 99.3% accuracy and a 98.5% f1 score achieved by BERT.

Jibes & Delights: A Dataset of Targeted Insults and Compliments to Tackle Online Abuse
Ravsimar Sodhi | Kartikey Pant | Radhika Mamidi
Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021)

Online abuse and offensive language on social media have become widespread problems in today’s digital age. In this paper, we contribute a Reddit-based dataset, consisting of 68,159 insults and 51,102 compliments targeted at individuals instead of targeting a particular community or race. Secondly, we benchmark multiple existing state-of-the-art models for both classification and unsupervised style transfer on the dataset. Finally, we analyse the experimental results and conclude that the transfer task is challenging, requiring the models to understand the high degree of creativity exhibited in the data.

A Pre-trained Transformer and CNN Model with Joint Language ID and Part-of-Speech Tagging for Code-Mixed Social-Media Text
Suman Dowlagar | Radhika Mamidi
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Code-mixing (CM) is a frequently observed phenomenon that uses multiple languages in an utterance or sentence. There are no strict grammatical constraints observed in code-mixing, and it consists of non-standard variations of spelling. The linguistic complexity resulting from the above factors made the computational analysis of the code-mixed language a challenging task. Language identification (LI) and part of speech (POS) tagging are the fundamental steps that help analyze the structure of the code-mixed text. Often, the LI and POS tagging tasks are interdependent in the code-mixing scenario. We project the problem of dealing with multilingualism and grammatical structure while analyzing the code-mixed sentence as a joint learning task. In this paper, we jointly train and optimize language detection and part of speech tagging models in the code-mixed scenario. We used a Transformer with convolutional neural network architecture. We train a joint learning method by combining POS tagging and LI models on code-mixed social media text obtained from the ICON shared task.

OFFLangOne@DravidianLangTech-EACL2021: Transformers with the Class Balanced Loss for Offensive Language Identification in Dravidian Code-Mixed text.
Suman Dowlagar | Radhika Mamidi
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages

The intensity of online abuse has increased in recent years. Automated tools are being developed to prevent the use of hate speech and offensive content. Most of the technologies use natural language and machine learning tools to identify offensive text. In a multilingual society, where code-mixing is a norm, the hate content would be delivered in a code-mixed form in social media, which makes offensive content identification even more challenging. In this work, we participated in the EACL task to detect offensive content in the code-mixed social media scenario. The methodology uses a transformer model with transliteration and class balancing loss for offensive content identification. In this task, our model has been ranked 2nd in Malayalam-English and 4th in Tamil-English code-mixed languages.
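One common formulation of a class-balanced loss reweights each class by the inverse of its "effective number" of samples, (1 - beta^n) / (1 - beta). The abstract does not spell out the exact variant used, so the sketch below is an assumption: the beta value, the class counts, and the function names are illustrative only.

```python
import math


def class_balanced_weights(counts, beta=0.999):
    """Per-class weights (1 - beta) / (1 - beta**n), rescaled to sum to the class count."""
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in counts]
    scale = len(counts) / sum(raw)
    return [w * scale for w in raw]


def weighted_cross_entropy(probs, label, weights):
    """Cross-entropy for one example, scaled by the weight of its true class."""
    return -weights[label] * math.log(probs[label])


# Hypothetical training counts: not offensive, offensive, rare offensive subtype.
counts = [9000, 500, 50]
weights = class_balanced_weights(counts)

loss_frequent = weighted_cross_entropy([0.5, 0.3, 0.2], 0, weights)
loss_rare = weighted_cross_entropy([0.2, 0.3, 0.5], 2, weights)
print([round(w, 3) for w in weights])
```

With equal predicted probability for the true class, a mistake on the rare class costs more than one on the frequent class, which is the intended effect of the reweighting.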

Volta at SemEval-2021 Task 6: Towards Detecting Persuasive Texts and Images using Textual and Multimodal Ensemble
Kshitij Gupta | Devansh Gautam | Radhika Mamidi
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

Memes are one of the most popular types of content used to spread information online. They can influence a large number of people through rhetorical and psychological techniques. The task, Detection of Persuasion Techniques in Texts and Images, is to detect these persuasive techniques in memes. It consists of three subtasks: (A) Multi-label classification using textual content, (B) Multi-label classification and span identification using textual content, and (C) Multi-label classification using visual and textual content. In this paper, we propose a transfer learning approach to fine-tune BERT-based models in different modalities. We also explore the effectiveness of ensembles of models trained in different modalities. We achieve an F1-score of 57.0, 48.2, and 52.1 in the corresponding subtasks.

Analyzing Curriculum Learning for Sentiment Analysis along Task Difficulty, Pacing and Visualization Axes
Anvesh Rao Vijjini | Kaveri Anuranjana | Radhika Mamidi
Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

While Curriculum Learning (CL) has recently gained traction in Natural Language Processing tasks, it is still not adequately analyzed. Previous works only show its effectiveness but fall short of fully explaining and interpreting its internal workings. In this paper, we analyze curriculum learning in sentiment analysis along multiple axes. Some of these axes have been proposed by earlier works that need more in-depth study. Such analysis requires understanding where curriculum learning works and where it does not. Our axes of analysis include Task difficulty on CL, comparing CL pacing techniques, and qualitative analysis by visualizing the movement of attention scores in the model as curriculum phases progress. We find that curriculum learning works best for difficult tasks and may even lead to a decrement in performance for tasks with higher performance without curriculum learning. We see that One-Pass curriculum strategies suffer from catastrophic forgetting and attention movement visualization within curriculum pacing. This shows that curriculum learning breaks down the challenging main task into easier sub-tasks solved sequentially.

Gated Convolutional Sequence to Sequence Based Learning for English-Hingilsh Code-Switched Machine Translation.
Suman Dowlagar | Radhika Mamidi
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching

Code-Switching is the embedding of linguistic units or phrases from two or more languages in a single sentence. This phenomenon is practiced in all multilingual communities and is prominent in social media. Consequently, there is a growing need to understand code-switched translations by translating the code-switched text into one of the standard languages or vice versa. Neural Machine Translation is a well-studied research problem for monolingual text. In this paper, we have used gated convolutional sequence-to-sequence networks for English-Hinglish translation. The convolutions in the model help to identify the compositional structure in the sequences more easily. The model relies on gating and performs multiple attention steps at encoder and decoder layers.

Political Discourse Analysis: A Case Study of Code Mixing and Code Switching in Political Speeches
Dama Sravani | Lalitha Kameswari | Radhika Mamidi
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching

Political discourse is one of the most interesting kinds of data for studying power relations in the framework of Critical Discourse Analysis. With the increase in the modes of textual and spoken forms of communication, politicians use language and linguistic mechanisms that contribute significantly to building their relationship with people, especially in a multilingual country like India with many political parties with different ideologies. This paper analyses code-mixing and code-switching in Telugu political speeches to determine the factors responsible for their usage levels in various social settings and communicative contexts. We also compile a detailed set of rules capturing dialectal variations between Standard and Telangana dialects of Telugu.

Automatic Learning Assistant in Telugu
Meghana Bommadi | Shreya Terupally | Radhika Mamidi
Proceedings of the 1st Workshop on Document-grounded Dialogue and Conversational Question Answering (DialDoc 2021)

This paper presents a learning assistant that tests one’s knowledge and gives feedback that helps a person learn at a faster pace. A learning assistant (based on automated question generation) has extensive uses in education, information websites, self-assessment, FAQs, testing ML agents, research, etc. Multiple researchers and companies have worked on virtual assistants, but mostly in English. We built our learning assistant for the Telugu language to help with teaching in the mother tongue, which is the most efficient way of learning. Our system is built primarily based on Question Generation in Telugu. Many experiments were conducted on Question Generation in English in multiple ways. We have built the first hybrid machine learning and rule-based solution in Telugu, which proves efficient for short stories or short passages in children’s books. Our work covers the fundamental question forms with question types: adjective, yes/no, adverb, verb, when, where, whose, quotative, and quantitative (how many/how much). We constructed rules for question generation using Part of Speech (POS) tags and Universal Dependency (UD) tags along with linguistic information of the surrounding relevant context of the word. We used keyword matching and multilingual sentence embeddings to evaluate the answer. Our system is primarily built on question generation in Telugu, and is also capable of evaluating the user’s answers to the generated questions.

Towards Quantifying Magnitude of Political Bias in News Articles Using a Novel Annotation Schema
Lalitha Kameswari | Radhika Mamidi
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Media bias is a predominant phenomenon present in most forms of print and electronic media such as news articles, blogs, tweets, etc. Since media plays a pivotal role in shaping public opinion towards political happenings, both political parties and media houses often use such sources as outlets to propagate their own prejudices to the public. There has been some research on detecting political bias in news articles. However, none of it attempts to analyse the nature of bias or quantify the magnitude of the bias in a given text. This paper presents a political bias annotated corpus viz. PoBiCo-21, which is annotated using a schema specifically designed with 10 labels to capture various techniques used to create political bias in news. We create a ranking of these techniques based on their contribution to bias. After validating the ranking, we propose methods to use it to quantify the magnitude of bias in political news articles.

Hopeful Men@LT-EDI-EACL2021: Hope Speech Detection Using Indic Transliteration and Transformers
Ishan Sanjeev Upadhyay | Nikhil E | Anshul Wadhawan | Radhika Mamidi
Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion

This paper aims to describe the approach we used to detect hope speech in the HopeEDI dataset. We experimented with two approaches. In the first approach, we used contextual embeddings to train classifiers using logistic regression, random forest, SVM, and LSTM based models. The second approach involved using a majority voting ensemble of 11 models which were obtained by fine-tuning pre-trained transformer models (BERT, ALBERT, RoBERTa, IndicBERT) after adding an output layer. We found that the second approach was superior for English, Tamil and Malayalam. Our solution got a weighted F1 score of 0.93, 0.75 and 0.49 for English, Malayalam and Tamil respectively. Our solution ranked 1st in English, 8th in Malayalam and 11th in Tamil.
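The majority-voting step described above can be sketched in a few lines. This is only an illustration of the general technique, not the authors' implementation: the function name, the label strings, and the three toy "models" are hypothetical, and the fine-tuned transformer predictions are assumed to have been computed already.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one list by majority vote.

    predictions: list of lists, one inner list per model, all of the
    same length. Ties go to the label that was seen first for that item.
    """
    n_items = len(predictions[0])
    voted = []
    for i in range(n_items):
        votes = Counter(model_preds[i] for model_preds in predictions)
        voted.append(votes.most_common(1)[0][0])
    return voted


# Three hypothetical models labelling two comments.
model_outputs = [
    ["hope", "not_hope"],
    ["hope", "hope"],
    ["not_hope", "not_hope"],
]
ensemble = majority_vote(model_outputs)
```

An odd number of models, such as the 11 mentioned in the abstract, avoids ties when there are only two labels. With more labels, ties are still possible.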

Efficient Multilingual Text Classification for Indian Languages
Salil Aggarwal | Sourav Kumar | Radhika Mamidi
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

India is one of the richest language hubs on the earth and is very diverse and multilingual. But apart from a few Indian languages, most of them are still considered to be resource poor. Since most of the NLP techniques either require linguistic knowledge that can only be developed by experts and native speakers of that language or they require a lot of labelled data which is again expensive to generate, the task of text classification becomes challenging for most of the Indian languages. The main objective of this paper is to see how one can benefit from the lexical similarity found in Indian languages in a multilingual scenario. Can a classification model trained on one Indian language be reused for other Indian languages? So, we performed zero-shot text classification via exploiting lexical similarity and we observed that our model performs best in those cases where the vocabulary overlap between the language datasets is maximum. Our experiments also confirm that a single multilingual model trained via exploiting language relatedness outperforms the baselines by significant margins.
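The abstract does not state how vocabulary overlap between language datasets is measured. A minimal sketch, assuming whitespace tokenization, text already mapped to a common script, and Jaccard similarity as the overlap measure (all three are assumptions, as are the function names and toy sentences):

```python
def vocabulary(sentences):
    """Set of lowercased whitespace tokens across all sentences."""
    return {tok for s in sentences for tok in s.lower().split()}


def vocab_overlap(corpus_a, corpus_b):
    """Jaccard overlap between the vocabularies of two corpora."""
    va, vb = vocabulary(corpus_a), vocabulary(corpus_b)
    if not va or not vb:
        return 0.0
    return len(va & vb) / len(va | vb)


# Toy romanized sentences standing in for two related-language datasets.
overlap = vocab_overlap(["x y z"], ["y z w"])
```

Under this assumption, a higher score would mark language pairs where zero-shot transfer is expected to work better.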

IIITH at SemEval-2021 Task 7: Leveraging transformer-based humourous and offensive text detection architectures using lexical and hurtlex features and task adaptive pretraining
Tathagata Raha | Ishan Sanjeev Upadhyay | Radhika Mamidi | Vasudeva Varma
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

This paper describes our approach (IIITH) for SemEval-2021 Task 7: HaHackathon: Detecting and Rating Humor and Offense. Our results focus on two major objectives: (i) the effect of task adaptive pretraining on the performance of transformer-based models, and (ii) how lexical and HurtLex features help in quantifying humour and offense. In this paper, we provide a detailed description of our approach along with the comparisons mentioned above.

2020

Enhancing Bias Detection in Political News Using Pragmatic Presupposition
Lalitha Kameswari | Dama Sravani | Radhika Mamidi
Proceedings of the Eighth International Workshop on Natural Language Processing for Social Media

Usage of presuppositions in social media and news discourse can be a powerful way to influence the readers as they usually tend to not examine the truth value of the hidden or indirectly expressed information. Fairclough and Wodak (1997) discuss presupposition at a discourse level where some implicit claims are taken for granted in the explicit meaning of a text or utterance. From the Gricean perspective, the presuppositions of a sentence determine the class of contexts in which the sentence could be felicitously uttered. This paper aims to correlate the type of knowledge presupposed in a news article to the bias present in it. We propose a set of guidelines to identify various kinds of presuppositions in news articles and present a dataset consisting of 1050 articles which are annotated for bias (positive, negative or neutral) and the magnitude of presupposition. We introduce a supervised classification approach for detecting bias in political news which significantly outperforms the existing systems.

Annotated Corpus for Sentiment Analysis in Odia Language
Gaurav Mohanty | Pruthwik Mishra | Radhika Mamidi
Proceedings of the Twelfth Language Resources and Evaluation Conference

Given the lack of an annotated corpus of non-traditional Odia literature which serves as the standard when it comes to sentiment analysis, we have created an annotated corpus of Odia sentences and made it publicly available to promote research in the field. Secondly, in order to test the usability of the currently available Odia sentiment lexicon, we experimented with various classifiers by training and testing on the sentiment annotated corpus while using identified affective words from the same as features. Annotation and classification are done at sentence level as the usage of sentiment lexicon is best suited to sentiment analysis at this level. The created corpus contains 2045 Odia sentences from the news domain annotated with sentiment labels using a well-defined annotation scheme. An inter-annotator agreement score of 0.79 is reported for the corpus.

Proceedings of the 17th International Conference on Natural Language Processing (ICON): TechDOfication 2020 Shared Task
Dipti Misra Sharma | Asif Ekbal | Karunesh Arora | Sudip Kumar Naskar | Dipankar Ganguly | Sobha L | Radhika Mamidi | Sunita Arora | Pruthwik Mishra | Vandan Mujadia
Proceedings of the 17th International Conference on Natural Language Processing (ICON): TechDOfication 2020 Shared Task

Detecting Sarcasm in Conversation Context Using Transformer-Based Models
Adithya Avvaru | Sanath Vobilisetty | Radhika Mamidi
Proceedings of the Second Workshop on Figurative Language Processing

Sarcasm detection, regarded as one of the sub-problems of sentiment analysis, is a very typical task because the introduction of sarcastic words can flip the sentiment of the sentence itself. To date, many research works revolve around detecting sarcasm in one single sentence and there is very limited research to detect sarcasm resulting from multiple sentences. Current models used Long Short Term Memory (LSTM) variants with or without attention to detect sarcasm in conversations. We showed that the models using state-of-the-art Bidirectional Encoder Representations from Transformers (BERT), to capture syntactic and semantic information across conversation sentences, performed better than the current models. Based on the data analysis, we estimated the number of sentences in the conversation that can contribute to the sarcasm, and the results agree with this estimation. We also perform a comparative study of different versions of our BERT-based model against other variants of the LSTM model and XLNet (both using the estimated number of conversation sentences) and find that BERT-based models outperformed them.

Question and Answer pair generation for Telugu short stories
Meghana Bommadi | Shreya Terupally | Radhika Mamidi
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

Question Answer pair generation is a task that has been worked upon by multiple researchers in many languages. It has been a topic of interest due to its extensive uses in different fields like self assessment, academics, business website FAQs etc. Many experiments were conducted on Question Answering pair generation in English, concentrating on basic Wh-questions with a rule-based approach. We have built the first hybrid machine learning and rule-based solution in Telugu which is efficient for short stories or short passages in children’s books. Our work covers the fundamental question forms with the question types: adjective, yes/no, adverb, verb, when, where, whose, quotative, and quantitative (how many/how much). We constructed rules for question generation using POS tags and UD tags along with linguistic information of the surrounding context of the word.

Proceedings of the 17th International Conference on Natural Language Processing (ICON): Adap-MT 2020 Shared Task
Dipti Misra Sharma | Asif Ekbal | Karunesh Arora | Sudip Kumar Naskar | Dipankar Ganguly | Sobha L | Radhika Mamidi | Sunita Arora | Pruthwik Mishra | Vandan Mujadia
Proceedings of the 17th International Conference on Natural Language Processing (ICON): Adap-MT 2020 Shared Task

Proceedings of the 17th International Conference on Natural Language Processing (ICON): TermTraction 2020 Shared Task
Dipti Misra Sharma | Asif Ekbal | Karunesh Arora | Sudip Kumar Naskar | Dipankar Ganguly | Sobha L | Radhika Mamidi | Sunita Arora | Pruthwik Mishra | Vandan Mujadia
Proceedings of the 17th International Conference on Natural Language Processing (ICON): TermTraction 2020 Shared Task

Does a Hybrid Neural Network based Feature Selection Model Improve Text Classification?
Suman Dowlagar | Radhika Mamidi
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

Text classification is a fundamental problem in the field of natural language processing. Text classification mainly focuses on giving more importance to all the relevant features that help classify the textual data. Apart from these, the text can have redundant or highly correlated features. These features increase the complexity of the classification algorithm. Thus, many dimensionality reduction methods were proposed with the traditional machine learning classifiers. The use of dimensionality reduction methods with machine learning classifiers has achieved good results. In this paper, we propose a hybrid feature selection method for obtaining relevant features by combining various filter-based feature selection methods and fastText classifier. We then present three ways of implementing a feature selection and neural network pipeline. We observed a reduction in training time when feature selection methods are used along with neural networks. We also observed a slight increase in accuracy on some datasets.

Leveraging Multilingual Resources for Language Invariant Sentiment Analysis
Allen Antony | Arghya Bhattacharya | Jaipal Goud | Radhika Mamidi
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

Sentiment analysis is a widely researched NLP problem with state-of-the-art solutions capable of attaining human-like accuracies for various languages. However, these methods rely heavily on large amounts of labeled data or sentiment weighted language-specific lexical resources that are unavailable for low-resource languages. Our work attempts to tackle this data scarcity issue by introducing a neural architecture for language invariant sentiment analysis capable of leveraging various monolingual datasets for training without any kind of cross-lingual supervision. The proposed architecture attempts to learn language agnostic sentiment features via adversarial training on multiple resource-rich languages which can then be leveraged for inferring sentiment information at a sentence level on a low resource language. Our model outperforms the current state-of-the-art methods on the Multilingual Amazon Review Text Classification dataset [REF] and achieves significant performance gains over prior work on the low resource Sentiraama corpus [REF]. A detailed analysis of our research highlights the ability of our architecture to perform significantly well in the presence of minimal amounts of training data for low resource languages.

Gundapusunil at SemEval-2020 Task 8: Multimodal Memotion Analysis
Sunil Gundapu | Radhika Mamidi
Proceedings of the Fourteenth Workshop on Semantic Evaluation

Recent technological advancements in the Internet and Social media usage have resulted in the evolution of faster and more efficient platforms of communication. These platforms include visual, textual and speech mediums and have brought a unique social phenomenon called Internet memes. Internet memes are in the form of images with witty, catchy, or sarcastic text descriptions. In this paper, we present a multi-modal sentiment analysis system using deep neural networks combining Computer Vision and Natural Language Processing. Our aim is different than the normal sentiment analysis goal of predicting whether a text expresses positive or negative sentiment; instead, we aim to classify the Internet meme as positive, negative, or neutral, identify the type of humor expressed and quantify the extent to which a particular effect is being expressed. Our system has been developed using CNN and LSTM and outperformed the baseline score.

Unsupervised Technical Domain Terms Extraction using Term Extractor
Suman Dowlagar | Radhika Mamidi
Proceedings of the 17th International Conference on Natural Language Processing (ICON): TermTraction 2020 Shared Task

Terminology extraction, also known as term extraction, is a subtask of information extraction. The goal of terminology extraction is to extract relevant words or phrases from a given corpus automatically. This paper focuses on the unsupervised automated domain term extraction method that considers chunking, preprocessing, and ranking domain-specific terms using relevance and cohesion functions for ICON 2020 shared task 2: TermTraction.

SUKHAN: Corpus of Hindi Shayaris annotated with Sentiment Polarity Information
Salil Aggarwal | Abhigyan Ghosh | Radhika Mamidi
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

Shayari is a form of poetry mainly popular in the Indian subcontinent, in which the poet expresses his emotions and feelings in a very poetic manner. It is one of the best ways to express our thoughts and opinions. Therefore, it is of prime importance to have an annotated corpus of Hindi shayaris for the task of sentiment analysis. In this paper, we introduce SUKHAN, a dataset consisting of Hindi shayaris along with sentiment polarity labels. To the best of our knowledge, this is the first corpus of Hindi shayaris annotated with sentiment polarity information. This corpus contains a total of 733 Hindi shayaris of various genres. Also, this dataset is of utmost value as all the annotation is done manually by five annotators and this makes it a very rich dataset for training purposes. This annotated corpus is also used to build baseline sentiment classification models using machine learning techniques.

Manovaad: A Novel Approach to Event Oriented Corpus Creation Capturing Subjectivity and Focus
Lalitha Kameswari | Radhika Mamidi
Proceedings of the Twelfth Language Resources and Evaluation Conference

In today’s era of globalisation, the increased outreach for every event across the world has been leading to conflicting opinions, arguments and disagreements, often reflected in print media and online social platforms. It is necessary to distinguish factual observations from personal judgements in news, as subjectivity in reporting can influence the audience’s perception of reality. Several studies conducted on the different styles of reporting in journalism are essential in understanding phenomena such as media bias and multiple interpretations of the same event. This domain finds applications in fields such as Media Studies, Discourse Analysis, Information Extraction, Sentiment Analysis, and Opinion Mining. We present an event corpus Manovaad-v1.0 consisting of 1035 news articles corresponding to 65 events from 3 levels of newspapers viz., Local, National, and International levels. Using this novel format, we correlate the trends in the degree of subjectivity with the geographical closeness of reporting using a Bi-RNN model. We also analyse the role of background and focus in event reporting and capture the focus shift patterns within a global discourse structure for an event. We do this across different levels of reporting and compare the results with the existing work on discourse processing.

Multichannel LSTM-CNN for Telugu Text Classification
Sunil Gundapu | Radhika Mamidi
Proceedings of the 17th International Conference on Natural Language Processing (ICON): TechDOfication 2020 Shared Task

With the instantaneous growth of text information, retrieving domain-oriented information from the text data has a broad range of applications in Information Retrieval and Natural Language Processing. Thematic keywords give a compressed representation of the text. Usually, Domain Identification plays a significant role in Machine Translation, Text Summarization, Question Answering, Information Extraction, and Sentiment Analysis. In this paper, we proposed the Multichannel LSTM-CNN methodology for Technical Domain Identification for Telugu. This architecture was used and evaluated in the context of the ICON shared task “TechDOfication 2020” (task h), and our system achieved an F1 score of 69.9% on the test dataset and 90.01% on the validation set.

A Novel Annotation Schema for Conversational Humor: Capturing the Cultural Nuances in Kanyasulkam
Vaishnavi Pamulapati | Gayatri Purigilla | Radhika Mamidi
Proceedings of the 14th Linguistic Annotation Workshop

Humor research is a multifaceted field that has led to a better understanding of humor’s psychological effects and the development of different theories of humor. This paper’s main objective is to develop a hierarchical schema for a fine-grained annotation of Conversational Humor. Based on the Benign Violation Theory, the benignity or non-benignity of the interlocutor’s intentions is included within the framework. Under the categories mentioned above, in addition to different types of humor, the techniques utilized by these types are identified. Furthermore, a prominent play from Telugu, Kanyasulkam, is annotated to substantiate the work across cultures at multiple levels. The inter-annotator agreement is calculated to assess the accuracy and validity of the dataset. An in-depth analysis of the disagreement is performed to understand the subjectivity of humor better.

Dataset Creation and Evaluation of Aspect Based Sentiment Analysis in Telugu, a Low Resource Language
Yashwanth Reddy Regatte | Rama Rohit Reddy Gangula | Radhika Mamidi
Proceedings of the Twelfth Language Resources and Evaluation Conference

In recent years, sentiment analysis has gained popularity as it is essential to moderate and analyse the information across the internet. It has various applications like opinion mining, social media monitoring, and market research. Aspect Based Sentiment Analysis (ABSA) is an area of sentiment analysis which deals with sentiment at a finer level. ABSA classifies sentiment with respect to each aspect to gain greater insights into the sentiment expressed. Significant contributions have been made in ABSA, but this progress is limited only to a few languages with adequate resources. Telugu lags behind in this area of research despite being one of the most spoken languages in India and an enormous amount of data being created each day. In this paper, we create a reliable resource for aspect based sentiment analysis in Telugu. The data is annotated for three tasks namely Aspect Term Extraction, Aspect Polarity Classification and Aspect Categorisation. Further, we develop baselines for the tasks using deep learning methods demonstrating the reliability and usefulness of the resource.

Multilingual Pre-Trained Transformers and Convolutional NN Classification Models for Technical Domain Identification
Suman Dowlagar | Radhika Mamidi
Proceedings of the 17th International Conference on Natural Language Processing (ICON): TechDOfication 2020 Shared Task

In this paper, we present a transfer learning system to perform technical domain identification on multilingual text data. We have submitted two runs, one using the transformer model BERT and the other using XLM-RoBERTa with the CNN model for text classification. These models allowed us to identify the domain of the given sentences for the ICON 2020 shared task, TechDOfication: Technical Domain Identification. Our system ranked the best for the subtasks 1d, 1g for the given TechDOfication dataset.

Gundapusunil at SemEval-2020 Task 9: Syntactic Semantic LSTM Architecture for SENTIment Analysis of Code-MIXed Data
Sunil Gundapu | Radhika Mamidi
Proceedings of the Fourteenth Workshop on Semantic Evaluation

The phenomenon of mixing the vocabulary and syntax of multiple languages within the same utterance is called Code-Mixing. This is more evident in multilingual societies. In this paper, we have developed a system for SemEval 2020: Task 9 on Sentiment Analysis of Hindi-English code-mixed social media text. Our system first generates two types of embeddings for the social media text. The first is character-level embeddings, which encode character-level information and handle out-of-vocabulary entries, and the second is FastText word embeddings, which capture morphology and semantics. These two embeddings were passed to the LSTM network and the system outperformed the baseline model.

2019

Detecting Political Bias in News Articles Using Headline Attention
Rama Rohit Reddy Gangula | Suma Reddy Duggenpudi | Radhika Mamidi
Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

Language is a powerful tool which can be used to state the facts as well as express our views and perceptions. Most of the time, we find a subtle bias towards or against someone or something. When it comes to politics, media houses and journalists are known to create bias by shrewd means such as misinterpreting reality and distorting viewpoints towards some parties. This misinterpretation on a large scale can lead to the production of biased news and conspiracy theories. Automating bias detection in newspaper articles could be a good challenge for research in NLP. We proposed a headline attention network for this bias detection. Our model has two distinctive characteristics: (i) it has a structure that mirrors a person’s way of reading a news article (ii) it has an attention mechanism applied on the article based on its headline, enabling it to attend to more critical content to predict bias. As the required datasets were not available, we created a dataset comprising 1329 news articles collected from various Telugu newspapers and marked them for bias towards a particular political party. The experiments conducted on it demonstrated that our model outperforms various baseline methods by a substantial margin.

Deep Learning Techniques for Humor Detection in Hindi-English Code-Mixed Tweets
Sushmitha Reddy Sane | Suraj Tripathi | Koushik Reddy Sane | Radhika Mamidi
Proceedings of the Tenth Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

We propose bilingual word embeddings based on word2vec and fastText models (CBOW and Skip-gram) to address the problem of Humor detection in Hindi-English code-mixed tweets in combination with deep learning architectures. We focus on deep learning approaches which are not widely used on code-mixed data and analyzed their performance by experimenting with three different neural network models. We propose convolution neural network (CNN) and bidirectional long-short term memory (biLSTM) (with and without Attention) models which take the generated bilingual embeddings as input. We make use of Twitter data to create bilingual word embeddings. All our proposed architectures outperform the state-of-the-art results, and Attention-based bidirectional LSTM model achieved an accuracy of 73.6% which is an increment of more than 4% compared to the current state-of-the-art results.

Samajh-Boojh: A Reading Comprehension system in Hindi
Shalaka Vaidya | Hiranmai Sri Adibhatla | Radhika Mamidi
Proceedings of the 16th International Conference on Natural Language Processing

This paper presents a novel approach designed to answer questions on a reading comprehension passage. It is an end-to-end system which first focuses on comprehending the given passage, wherein it converts the unstructured passage into structured data, and later proceeds to answer the questions related to the passage using solely the aforementioned structured data. To the best of our knowledge, the proposed design is the first of its kind which accounts for the entire process of comprehending the passage and then answering the questions associated with the passage. The comprehension stage converts the passage into a Discourse Collection that comprises the relations shared amongst logical sentences in the given passage along with the key characteristics of each sentence. This design has applications in the academic domain and in query comprehension in speech systems, among others.

Samvaadhana: A Telugu Dialogue System in Hospital Domain
Suma Reddy Duggenpudi | Kusampudi Siva Subrahamanyam Varma | Radhika Mamidi
Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019)

In this paper, a dialogue system for Hospital domain in Telugu, which is a resource-poor Dravidian language, has been built. It handles various hospital and doctor related queries. The main aim of this paper is to present an approach for modelling a dialogue system in a resource-poor language by combining linguistic and domain knowledge. Focusing on the question answering aspect of the dialogue system, we identified Question Classification and Query Processing as the two most important parts of the dialogue system. Our method combines deep learning techniques for question classification and computational rule-based analysis for query processing. Human evaluation of the system has been performed as there is no automated evaluation tool for dialogue systems in Telugu. Our system achieves a high overall rating along with a significantly accurate context-capturing method as shown in the results.

Stance Detection in Code-Mixed Hindi-English Social Media Data using Multi-Task Learning
Sushmitha Reddy Sane | Suraj Tripathi | Koushik Reddy Sane | Radhika Mamidi
Proceedings of the Tenth Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Social media sites like Facebook, Twitter, and other microblogging forums have emerged as platforms for people to express their opinions and views on different issues and events. It is often observed that people tend to take a stance (in favor, against, or neutral) towards a particular topic. The task of assessing the stance taken by an individual has become significantly more important with the growing use of online social platforms. An automatic stance detection system identifies the user’s stance by analyzing standalone texts against a target entity. Due to the limited contextual information a single sentence provides, it is challenging to solve this task effectively. In this paper, we introduce a Multi-Task Learning (MTL) based deep neural network architecture for automatically detecting the stance present in a code-mixed corpus. We apply our approach to a Hindi-English code-mixed corpus against the target entity “Demonetisation.” Our best model achieved a stance prediction accuracy of 63.2%, which is a 4.5% overall accuracy improvement compared to the current supervised classification systems developed using the benchmark dataset for code-mixed data stance detection.

SmokEng: Towards Fine-grained Classification of Tobacco-related Social Media Text
Kartikey Pant | Venkata Himakar Yanamandra | Alok Debnath | Radhika Mamidi
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

Contemporary datasets on tobacco consumption focus on one of two topics: either public health mentions and disease surveillance, or sentiment analysis on topical tobacco products and services. However, two primary considerations are not accounted for: the language of the demographic affected and a combination of the topics mentioned above in a fine-grained classification mechanism. In this paper, we create a dataset of 3144 tweets, which are selected based on the presence of colloquial slang related to smoking, and analyze it based on the semantics of the tweets. Each class is created and annotated based on the content of the tweets such that further hierarchical methods can be easily applied. Further, we show the efficacy of standard text classification methods on this dataset by designing experiments that perform both binary and multi-class classification. Our experiments tackle the identification of either a specific topic (such as tobacco product promotion), a general mention (cigarettes and related products) or a more fine-grained classification. This methodology paves the way for further analysis, such as understanding sentiment or style, which makes this dataset a vital contribution to both disease surveillance and tobacco use research.

2018

Automatic Spelling Correction for Resource-Scarce Languages using Deep Learning
Pravallika Etoori | Manoj Chinnakotla | Radhika Mamidi
Proceedings of ACL 2018, Student Research Workshop

Spelling correction is a well-known task in Natural Language Processing (NLP). Automatic spelling correction is important for many NLP applications like web search engines, text summarization, sentiment analysis etc. Most approaches use parallel data of noisy and correct word mappings from different sources as training data for automatic spelling correction. Indic languages are resource-scarce and do not have such parallel data due to low volume of queries and non-existence of such prior implementations. In this paper, we show how to build an automatic spelling corrector for resource-scarce languages. We propose a sequence-to-sequence deep learning model which trains end-to-end. We perform experiments on synthetic datasets created for Indic languages, Hindi and Telugu, by incorporating the spelling mistakes committed at character level. A comparative evaluation shows that our model is competitive with the existing spell checking and correction techniques for Indic languages.

BCSAT : A Benchmark Corpus for Sentiment Analysis in Telugu Using Word-level Annotations
Sreekavitha Parupalli | Vijjini Anvesh Rao | Radhika Mamidi
Proceedings of ACL 2018, Student Research Workshop

The presented work aims at generating a systematically annotated corpus that can support the enhancement of sentiment analysis tasks in Telugu using word-level sentiment annotations. From OntoSenseNet, we extracted 11,000 adjectives, 253 adverbs, and 8483 verbs; sentiment annotation is being done by language experts. We discuss the methodology followed for the polarity annotations and validate the developed resource. This work aims at developing a benchmark corpus, as an extension to SentiWordNet, and a baseline accuracy for a model where lexeme annotations are applied for sentiment predictions. The fundamental aim of this paper is to validate and study the possibility of utilizing machine learning algorithms and word-level sentiment annotations in the task of automated sentiment identification. Furthermore, accuracy is improved by annotating the bi-grams extracted from the target corpus.

Word Level Language Identification in English Telugu Code Mixed Data
Sunil Gundapu | Radhika Mamidi
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

Exploring Chunk Based Templates for Generating a subset of English Text
Nikhilesh Bhatnagar | Manish Shrivastava | Radhika Mamidi
Proceedings of ACL 2018, Student Research Workshop

Natural Language Generation (NLG) is a research task which addresses the automatic generation of natural language text representative of an input non-linguistic collection of knowledge. In this paper, we address the task of the generation of grammatical sentences in an isolated context given a partial bag-of-words which the generated sentence must contain. We view the task as a search problem (a problem of choice) involving combinations of smaller chunk based templates extracted from a training corpus to construct a complete sentence. To achieve that, we propose a fitness function which we use in conjunction with an evolutionary algorithm as the search procedure to arrive at a potentially grammatical sentence (modeled by the fitness score) which satisfies the input constraints.

Affect in Tweets using Experts Model
Subba Reddy Oota | Adithya Avvaru | Mounika Reddy Marreddy | Radhika Mamidi
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

Towards Automation of Sense-type Identification of Verbs in OntoSenseNet
Sreekavitha Parupalli | Vijjini Anvesh Rao | Radhika Mamidi
Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media

In this paper, we discuss the enrichment of a manually developed resource, OntoSenseNet for Telugu. OntoSenseNet is a sense-annotated resource that marks each verb of Telugu with a primary and a secondary sense. The area of research is relatively recent but has a large scope for development. We provide introductory work on enriching OntoSenseNet to promote further research in Telugu. Classifiers are adopted to learn the sense-relevant features of the words in the resource and also to automate the tagging of sense-types for verbs. We perform a comparative analysis of different classifiers applied to OntoSenseNet. The results of the experiment show that automated enrichment of the resource is effective using SVM classifiers and an AdaBoost ensemble.

Resource Creation Towards Automated Sentiment Analysis in Telugu (a low resource language) and Integrating Multiple Domain Sources to Enhance Sentiment Prediction
Rama Rohit Reddy Gangula | Radhika Mamidi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Towards Enhancing Lexical Resource and Using Sense-annotations of OntoSenseNet for Sentiment Analysis
Sreekavitha Parupalli | Vijjini Anvesh Rao | Radhika Mamidi
Proceedings of the Third Workshop on Semantic Deep Learning

This paper illustrates the interface of the tool we developed for crowdsourcing and explains the annotation procedure in detail. Our tool is named ‘పారుపల్లి పదజాలం’ (Parupalli Padajaalam), which means web of words by Parupalli. The aim of this tool is to populate OntoSenseNet, a sentiment-polarity-annotated Telugu resource. Recent works have shown the importance of word-level annotations for sentiment analysis. With this as a basis, we aim to analyze the importance of sense-annotations obtained from OntoSenseNet in performing the task of sentiment analysis. We explain the features extracted from OntoSenseNet (Telugu). Furthermore, we compute and explain the adverbial class distribution of verbs in OntoSenseNet. This task is known to aid in disambiguating word senses, which helps in enhancing the performance of word-sense disambiguation (WSD) task(s).

Political Discourse Analysis : A Case Study of 2014 Andhra Pradesh State Assembly Election of Interpersonal Speech Choices
Lalitha Kameswari | Radhika Mamidi
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

Syllables for Sentence Classification in Morphologically Rich Languages
Madhuri Tummalapalli | Radhika Mamidi
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

Predicting the Genre and Rating of a Movie Based on its Synopsis
Varshit Battu | Vishal Batchu | Rama Rohit Reddy Gangula | Mohana Murali Krishna Reddy Dakannagari | Radhika Mamidi
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

2017

Building a SentiWordNet for Odia
Gaurav Mohanty | Abishek Kannan | Radhika Mamidi
Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

As a discipline of Natural Language Processing, Sentiment Analysis is used to extract and analyze subjective information present in natural language data. The task of Sentiment Analysis has acquired wide commercial uses including social media monitoring tasks, survey responses, review systems, etc. Languages like English have several resources which aid in the task of Sentiment Analysis. SentiWordNet and Subjectivity WordList are examples of such tools and resources. With more data being available in native vernacular, language-specific SentiWordNet(s) have become essential. For resource-poor languages, creating such SentiWordNet(s) is a difficult task. One solution is to use available resources in English and translate the final source lexicon to the target lexicon via machine translation. Machine translation systems for the English-Odia language pair have not yet been developed. In this paper, we discuss a method to create a SentiWordNet for Odia, which is resource-poor, by only using resources which are currently available for Indian languages. The lexicon created would serve as a tool for Sentiment Analysis tasks specific to Odia data.

ACTSA: Annotated Corpus for Telugu Sentiment Analysis
Sandeep Sricharan Mukku | Radhika Mamidi
Proceedings of the First Workshop on Building Linguistically Generalizable NLP Systems

Sentiment analysis deals with the task of determining the polarity of a document or sentence and has received a lot of attention in recent years for the English language. With the rapid growth of social media these days, a lot of data is available in regional languages besides English. Telugu is one such regional language with abundant data available in social media, but it is hard to find labelled sentence data for Telugu Sentiment Analysis. In this paper, we describe an effort to build a gold-standard annotated corpus of Telugu sentences to support Telugu Sentiment Analysis. The corpus, named ACTSA (Annotated Corpus for Telugu Sentiment Analysis), is a collection of Telugu sentences taken from different sources, which were then pre-processed and manually annotated by native Telugu speakers using our annotation guidelines. In total, we have annotated 5457 sentences, which makes our corpus the largest resource currently available. The corpus and the annotation guidelines are made publicly available.

Automatic Generation of Jokes in Hindi
Srishti Aggarwal | Radhika Mamidi
Proceedings of ACL 2017, Student Research Workshop

Handling Multi-Sentence Queries in a Domain Independent Dialogue System
Prathyusha Jwalapuram | Radhika Mamidi
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

When does a compliment become sexist? Analysis and classification of ambivalent sexism using twitter data
Akshita Jha | Radhika Mamidi
Proceedings of the Second Workshop on NLP and Computational Social Science

Sexism is prevalent in today’s society, both offline and online, and poses a credible threat to social equality with respect to gender. According to ambivalent sexism theory (Glick and Fiske, 1996), it comes in two forms: Hostile and Benevolent. While hostile sexism is characterized by an explicitly negative attitude, benevolent sexism is more subtle. Previous works on computationally detecting sexism present online are restricted to identifying the hostile form. Our objective is to investigate the less pronounced form of sexism demonstrated online. We achieve this by creating and analyzing a dataset of tweets that exhibit benevolent sexism. Using Support Vector Machines (SVM), sequence-to-sequence models and a FastText classifier, we classify tweets into the ‘Hostile’, ‘Benevolent’ or ‘Others’ class depending on the kind of sexism they exhibit. We achieve an F1-score of 87.22% using the FastText classifier. Our work helps analyze and understand the highly prevalent ambivalent sexism in social media.

2016

Shallow Parsing Pipeline - Hindi-English Code-Mixed Social Media Text
Arnav Sharma | Sakshi Gupta | Raveesh Motlani | Piyush Bansal | Manish Shrivastava | Radhika Mamidi | Dipti M. Sharma
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Towards Building a SentiWordNet for Tamil
Abishek Kannan | Gaurav Mohanty | Radhika Mamidi
Proceedings of the 13th International Conference on Natural Language Processing

IIIT at SemEval-2016 Task 11: Complex Word Identification using Nearest Centroid Classification
Ashish Palakurthi | Radhika Mamidi
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2015

Statistical Sandhi Splitter and its Effect on NLP Applications
Prathyusha Kuncham | Kovida Nelakuditi | Radhika Mamidi
Proceedings of the International Conference Recent Advances in Natural Language Processing

Classification of Attributes in a Natural Language Query into Different SQL Clauses
Ashish Palakurthi | Ruthu S M | Arjun Akula | Radhika Mamidi
Proceedings of the International Conference Recent Advances in Natural Language Processing

Resolution of Pronominal Anaphora for Telugu Dialogues
Hemanth Reddy Jonnalagadda | Radhika Mamidi
Proceedings of the 12th International Conference on Natural Language Processing

A Semi Supervised Dialog Act Tagging for Telugu
Suman Dowlagar | Radhika Mamidi
Proceedings of the 12th International Conference on Natural Language Processing

2014

Identification of Karaka relations in an English sentence
Sai Kiran Gorthi | Ashish Palakurthi | Radhika Mamidi | Dipti Misra Sharma
Proceedings of the 11th International Conference on Natural Language Processing

Learning phrase-level vocabulary in second language using pictures/gestures and voice
Lavanya Prahallad | Prathyusha Danda | Radhika Mamidi
Proceedings of the 11th International Conference on Natural Language Processing

Statistical Morph Analyzer (SMA++) for Indian Languages
Saikrishna Srirampur | Ravi Chandibhamar | Radhika Mamidi
Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects

2013

A Novel Approach Towards Incorporating Context Processing Capabilities in NLIDB System
Arjun Akula | Rajeev Sangal | Radhika Mamidi
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Stance Classification in Online Debates by Recognizing Users’ Intentions
Sarvesh Ranade | Rajeev Sangal | Radhika Mamidi
Proceedings of the SIGDIAL 2013 Conference

2012

A template matching approach for detecting pronunciation mismatch
Lavanya Prahallad | Radhika Mamidi | Kishore Prahallad
Proceedings of the Workshop on Speech and Language Processing Tools in Education

Proceedings of the Workshop on Machine Translation and Parsing in Indian Languages
Dipti Misra Sharma | Prashanth Mannem | Joseph vanGenabith | Sobha Lalitha Devi | Radhika Mamidi | Ranjani Parthasarathi
Proceedings of the Workshop on Machine Translation and Parsing in Indian Languages

Proceedings of the Workshop on Speech and Language Processing Tools in Education
Radhika Mamidi | Kishore Prahallad
Proceedings of the Workshop on Speech and Language Processing Tools in Education

Co-authors

Pruthwik Mishra 4

Subba Reddy Oota 4

Kartikey Pant 4

Manish Shrivastava 4

Ishan Sanjeev Upadhyay 4

Salil Aggarwal 3

Samyak Agrawal 3

Kaveri Anuranjana 3

Vijjini Anvesh Rao 3

Karunesh Arora 3

Suma Reddy Duggenpudi 3

Dipankar Ganguly 3

Devansh Gautam 3

Kshitij Gupta 3

Mounika Marreddy 3

Gaurav Mohanty 3

Vandan Mujadia 3

Sudip Kumar Naskar 3

Ashish Palakurthi 3

Jayant Panwar 3

Sreekavitha Parupalli 3

Akhilesh Aravapalli 2

Adithya Avvaru 2

Meghana Bommadi 2

Abhijith Chelpuri 2

Abhigyan Ghosh 2

Revanth Gundam 2

Abishek Kannan 2

Siva Subrahamanyam Varma Kusampudi 2

Advaith Malladi 2

Abhinav Marri 2

Rahothvarman P 2

Vaishnavi Pamulapati 2

Kishore Prahallad 2

Lavanya Prahallad 2

Sushmitha Reddy Sane 2

Koushik Reddy Sane 2

Rajeev Sangal 2

Kv Aditya Srivatsa 2

Shreya Terupally 2

Suraj Tripathi 2

Vasudeva Varma 2

Venkata Himakar Yanamandra 2

Tushar Abhishek 1

Aditya Agarwal 1

Srishti Aggarwal 1

Kaushal Attaluri 1

Vaibhav Bajaj 1

Piyush Bansal 1

Vishal Batchu 1

Varshit Battu 1

Patanjali Bhamidipati 1

Nikhilesh Bhatnagar 1

Arghya Bhattacharya 1

Anudeep Chaluvadi 1

Ravi Chandibhamar 1

Aryan Chandramania 1

Tanishq Chaudhary 1

Anirudh Chebolu 1

Manoj Chinnakotla 1

Sireesha Chittepu 1

Mukund Choudhary 1

Mohana Murali Krishna Reddy Dakannagari 1

Prathyusha Danda 1

Priyanka Dasari 1

Pravallika Etoori 1

Sai Kiran Gorthi 1

Revanth Kumar Gundam 1

Hemanth Reddy Jonnalagadda 1

Prathyusha Jwalapuram 1

Vanshpreet S. Kohli 1

Parameshwari Krishnamurthy 1

Parameswari Krishnamurthy 1

Ponnurangam Kumaraguru 1

Prathyusha Kuncham 1

Sobha Lalitha Devi 1

Pulak Malhotra 1

Srihitha Mallepally 1

Prashanth Mannem 1

Sriharshitha Mareddy 1

Mounika Reddy Marreddy 1

Hardik Mittal 1

Raveesh Motlani 1

Sandeep Sricharan Mukku 1

Kovida Nelakuditi 1

Srijith Padakanti 1

Khushbu Pahwa 1

Ranjani Parthasarathi 1

Anagha Pradeep 1

Gayatri Purigilla 1

Tathagata Raha 1

Adith John Rajeev 1

Sarvesh Ranade 1

Yashwanth Reddy Regatte 1

Preetham Sathineni 1

Pavankumar Satuluri 1

Suyash Sethia 1

Anubhav Sharma 1

Utsav Shekhar 1

Kusampudi Siva Subrahamanyam Varma 1

Ravsimar Sodhi 1

Hiranmai Sri Adibhatla 1

Padakanti Srijith 1

Saikrishna Srirampur 1

K V Aditya Srivatsa 1

Bapi Raju Surampudi 1

Hitendra Sarma Thogarcheti 1

Madhuri Tummalapalli 1

Ishan Upadhyay 1

Shalaka Vaidya 1

Anvesh Rao Vijjini 1

Sanath Vobilisetty 1

Nagaraju Vuppala 1

Anshul Wadhawan 1

Joseph vanGenabith 1

Venues

DravidianLangTech 4