Parameswari Krishnamurthy

2025

Shared Task at Biolaysumm2025 : Extract then summarize approach Augmented with UMLS based Definition Retrieval for Lay Summary generation.
Aaradhya Gupta | Parameswari Krishnamurthy
Proceedings of the 24th Workshop on Biomedical Language Processing (Shared Tasks)

pdf bib abs

Sandhi Splitting in Tamil and Telugu: A Sequence-to-Sequence Approach Leveraging Transformer Models
Priyanka Dasari | Mupparapu Sohan Gupta | Nagaraju Vuppala | Pruthwik Mishra | Parameswari Krishnamurthy
Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025)

Dravidian languages like Tamil and Telugu are agglutinative languages, they form wordforms by combining two or more elements into a single string with morpho-phonemic changes at the point of concatenation, known as sandhi. This linguistic feature adds complexity to automatic language processing, making the pre-processing of sandhi words essential for NLP applications. We developed extensive sandhi-annotated corpora of 15K for Telugu and Tamil, focusing on the systematic application of sandhi rules which explains the word formation patterns by showing how lexical and functional categories combine to create composite non-compound words. We implemented compact sequence-to-sequence transformer networks for the automatic sandhi processing. To evaluate our models, we manually annotated Telugu and Tamil IN22-Conv Benchmark datasets with sandhi annotations. Our experiments aim to enhance the language processing tasks like machine translation in morphologically rich languages.

pdf bib abs

POS-Aware Neural Approaches for Word Alignment in Dravidian Languages
Antony Alexander James | Parameswari Krishnamurthy
Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025)

This research explores word alignment in low-resource languages, specifically focusing on Telugu and Tamil, two languages within the Dravidian language family. Traditional statistical models such as FastAlign, GIZA++, and Eflomal serve as baselines but are often limited in low-resource settings. Neural methods, including SimAlign and AWESOME-align, which leverage multilingual BERT, show promising results by achieving alignment without extensive parallel data. Applying these neural models to Telugu-Tamil and Tamil-Telugu alignments, we found that fine-tuning with POS-tagged data significantly improves alignment accuracy compared to untagged data, achieving an improvement of 6–7%. However, our combined embeddings approach, which merges word embeddings with POS tags, did not yield additional gains. Expanding the study, we included Tamil, Telugu, and English alignments to explore linguistic mappings between Dravidian and an Indo-European languages. Results demonstrate the comparative performance across models and language pairs, emphasizing both the benefits of POS-tag fine-tuning and the complexities of cross-linguistic alignment.

pdf bib abs

LTRC-IIITH at PerAnsSumm 2025: SpanSense - Perspective-specific span identification and Summarization
Sushvin Marimuthu | Parameswari Krishnamurthy
Proceedings of the Second Workshop on Patient-Oriented Language Processing (CL4Health)

Healthcare community question-answering (CQA) forums have become popular for users seeking medical advice, offering answers that range from personal experiences to factual information. Traditionally, CQA summarization relies on the best-voted answer as a reference summary. However, this approach overlooks the diverse perspectives across multiple responses. Structuring summaries by perspective could better meet users’ informational needs. The PerAnsSumm shared task addresses this by identifying and classifying perspective-specific spans (Task_A) and generating perspective-specific summaries from question-answer threads (Task_B). In this paper, we present our work on the PerAnsSumm shared task 2025 at the CL4Health Workshop, NAACL 2025. Our system leverages the RoBERTa-large model for identifying perspective-specific spans and the BART-large model for summarization. We achieved a Macro-F1 score of 0.9 (90%) and a Weighted-F1 score of 0.92 (92%) for classification. For span matching, our strict matching F1 score was 0.21 (21%), while proportional matching reached 0.68 (68%), resulting in an average Task A score of 0.6 (60%). For Task B, we achieved a ROUGE-1 score of 0.4 (40%), ROUGE-2 of 0.18 (18%), and ROUGE-L of 0.36 (36%). Additionally, we obtained a BERTScore of 0.84 (84%), METEOR of 0.37 (37%), and BLEU of 0.13 (13%), resulting in an average Task B score of 0.38 (38%). Combining both tasks, our system achieved an overall average score of 49% and ranked 6th on the official leaderboard for the shared task.

pdf bib abs

Family helps one another: Dravidian NLP suite for Natural Language Understanding
Abhinav P M | Priyanka Dasari | Nagaraju Vuppala | Parameswari Krishnamurthy
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Developing robust Natural Language Understanding (NLU) for morphologically rich Dravidian languages like Kannada, Malayalam, Tamil, and Telugu presents significant challenges due to their agglutinative nature and syntactic complexity. In this work, we present the Dravidian NLP Suite tackling five core tasks: Morphological Analysis (MA), POS Tagging (POS), Named Entity Recognition (NER), Dependency Parsing (DEP), and Coreference Resolution (CR), trained for monolingual models and multilingual models. To facilitate this, we present the Dravida dataset, meticulously annotated multilingual corpus for these tasks across all four languages. Our experiments demonstrate that a multilingual model, which utilizes shared linguistic features and cross-lingual patterns inherent to the Dravidian family, consistently outperforms its monolingual counterparts across all tasks. These findings suggest that multilingual learning is an effective approach for enhancing Natural Language Understanding (NLU) capabilities, particularly for languages belonging to the same family. To the best of our knowledge, this is the first work to jointly address all these core tasks on the Dravidian languages.

pdf bib abs

Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment
Saketh Reddy Vemula | Sandipan Dandapat | Dipti Sharma | Parameswari Krishnamurthy
The 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

The relationship between tokenizer algorithm (e.g., Byte-Pair Encoding (BPE), Unigram), morphological alignment, tokenization quality (e.g., compression efficiency), and downstream performance remains largely unclear, particularly for languages with complex morphology. In this paper, we conduct a comprehensive evaluation of tokenizers using small-sized BERT models—from pre-training through fine-tuning—for Telugu (agglutinative), along with preliminary evaluation in Hindi (primarily fusional with some agglutination) and English (fusional). To evaluate morphological alignment of tokenizers in Telugu, we create a dataset containing gold morpheme segmentations of 600 derivational and 7000 inflectional word forms.Our experiments reveal two key findings for Telugu. First, the choice of tokenizer algorithm is the most significant factor influencing performance, with Unigram-based tokenizers consistently outperforming BPE across most settings. Second, while better morphological alignment shows a moderate, positive correlation with performance on text classification and structure prediction tasks, its impact is secondary to the tokenizer algorithm. Notably, hybrid approaches that use morphological information for pre-segmentation significantly boost the performance of BPE, though not Unigram. Our results further showcase the need for comprehensive intrinsic evaluation metrics for tokenizers that could explain downstream performance trends consistently.

pdf bib abs

Towards Resource-Rich Mizo and Khasi in NLP: Resource Development, Synthetic Data Generation and Model Building
Soumyadip Ghosh | Henry Lalsiam | Dorothy Marbaniang | Gracious Mary Temsen | Rahul Mishra | Parameswari Krishnamurthy
Proceedings of the 19th Linguistic Annotation Workshop (LAW-XIX-2025)

In the rapidly evolving field of Natural Language Processing (NLP), Indian regional languages remain significantly underrepresented due to their limited digital presence and lack of annotated resources. This work presents the first comprehensive effort toward developing high quality linguistic datasets for two extremely low resource languages Mizo and Khasi. We introduce human annotated, gold standard datasets for three core NLP tasks: Part-of-Speech (POS) tagging, Named Entity Recognition (NER), and Keyword Identification. To overcome annotation bottlenecks in NER, we further explore a synthetic data generation pipeline involving translation from Hindi and cross lingual word alignment. For POS tagging, we adopt and subsequently modify the Universal Dependencies (UD) framework to better suit the linguistic characteristics of Mizo and Khasi, while custom annotation guidelines are developed for NER and Keyword Identification. The constructed datasets are evaluated using multilingual language models, demonstrating that structured resource development, coupled with gradual fine-tuning, yields significant improvements in performance. This work represents a critical step toward advancing linguistic resources and computational tools for Mizo and Khasi.

pdf bib abs

Progressive Perturbation with KTO for Enhanced Machine Translation of Indian Languages
Yash Bhaskar | Ketaki Shetye | Vandan Mujadia | Dipti Misra Sharma | Parameswari Krishnamurthy
Proceedings of Machine Translation Summit XX: Volume 1

This study addresses the critical challenge of data scarcity in machine translation for Indian languages, particularly given their morphological complexity and limited parallel data. We investigate an effective strategy to maximize the utility of existing data by generating negative samples from positive training instances using a progressive perturbation approach. This is used for aligning the model with preferential data using Kahneman-Tversky Optimization (KTO). Comparing it against traditional Supervised Fine-Tuning (SFT), we demonstrate how generating negative samples and leveraging KTO enhances data efficiency. By creating rejected samples through progressively perturbed translations from the available dataset, we fine-tune the Llama 3.1 Instruct 8B model using QLoRA across 16 language directions, including English, Hindi, Bangla, Tamil, Telugu, and Santali. Our results show that KTO-based preference alignment with progressive perturbation consistently outperforms SFT, achieving significant gains in translation quality with an average BLEU increase of 1.84 to 2.47 and CHRF increase of 2.85 to 4.01 compared to SFT for selected languages, while using the same positive training samples and under similar computational constraints. This highlights the potential of our negative sample generation strategy within KTO, especially in low resource scenarios.

pdf bib

pdf bib abs

This paper presents an overview of the Shared Task on Patient-Centric Question Answering, organized as part of the NLP-AI4Health workshop at IJCNLP. The task aims to bridge the digital divide in healthcare by developing inclusive systems for two critical domains: Head and Neck Cancer (HNC) and Cystic Fibrosis (CF). We introduce the NLP4Health-2025 Dataset, a novel, large-scale multilingual corpus consisting of more than 45,000 validated multi-turn dialogues between patients and healthcare providers across 10 languages: Assamese, Bangla, Dogri, English, Gujarati, Hindi, Kannada, Marathi, Tamil, and Telugu. Participants were challenged to develop lightweight models (< 3 billion parameters) to perform two core activities: (1) Clinical Summarization, encompassing both abstractive summaries and structured clinical extraction (SCE), and (2) Patient-Centric QA, generating empathetic, factually accurate answers in the dialogue native language. This paper details the hybrid human-agent dataset construction pipeline, task definitions, evaluation metrics, and analyzes the performance of 9 submissions from 6 teams. The results demonstrate the viability of small language models (SLMs) in low-resource medical settings when optimized via techniques like LoRA and RAG.

pdf bib

VIDAI: VIDukathAI Interpretation Through Analysis of In-context Reasoning in Tamil using LLMs
R S Mughil Srinivasan | Kesavan T | Abhijith Balan | Abhinav P M | Parameswari Krishnamurthy | Oswald C
Proceedings of the 39th Pacific Asia Conference on Language, Information and Computation

pdf bib abs

Do Multilingual Transformers Encode Paninian Grammatical Relations? A Layer-wise Probing Study
Akshit Kumar | Dipti Sharma | Parameswari Krishnamurthy
Proceedings of the Third Workshop on Quantitative Syntax (QUASY, SyntaxFest 2025)

Large multilingual transformers such as XLM-RoBERTa achieve impressive performance on diverse NLP benchmarks, but understanding how they internally encode grammatical information remains challenging. This study investigates the encoding of syntactic and morphological information derived from the Paninian grammatical framework—specifically designed for morphologically rich Indian languages—across model layers. Using diagnostic probing, we analyze the hidden representations of frozen XLM-RoBERTa-base, mBERT, and IndicBERT models across seven Indian languages (Hindi, Kannada, Malayalam, Marathi, Telugu, Urdu, Bengali). Probes are trained to predict Paninian dependency relations (by edge probing) and essential morphosyntactic features (UPOS tags, Vibhakti markers). We find that syntactic structure (dependencies) is primarily encoded in the middle-to-upper-middle layers (layers 6–9), while lexical features peak slightly earlier. Although the general layer-wise trends are shared across models, significant variations in absolute probing performance reflect differences in model capacity, pre-training data, and language-specific characteristics. These findings shed light on how theory-specific grammatical information emerges implicitly within multilingual transformer representations trained largely on unstructured raw text.

pdf bib abs

keepitsimple at SemEval-2025 Task 3: LLM-Uncertainty based Approach for Multilingual Hallucination Span Detection
Saketh Reddy Vemula | Parameswari Krishnamurthy
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Identification of hallucination spans in black-box language model generated text is essential for applications in the real world. A recent attempt at this direction is SemEval-2025 Task 3, Mu-SHROOM—a Multilingual Shared Task onHallucinations and Related Observable Over-generation Errors. In this work, we present our solution to this problem, which capitalizes on the variability of stochastically-sampled responses in order to identify hallucinated spans. Our hypothesis is that if a language model is certain of a fact, its sampled responses will be uniform, while hallucinated facts will yield different and conflicting results. We measure this divergence through entropy-based analysis, allowing for accurate identification of hallucinated segments. Our method is not dependent on additional training and hence is cost-effective and adaptable. In addition, we conduct extensive hyperparameter tuning and perform error analysis, giving us crucial insights into model behavior and where improvement can be made.

pdf bib abs

TableWise at SemEval-2025 Task 8: LLM Agents for TabQA
Harsh Bansal | Aman Raj | Akshit Sharma | Parameswari Krishnamurthy
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Tabular Question Answering (TabQA) is a challenging task that requires models to comprehend structured tabular data and generate accurate responses based on complex reasoning. In this paper, we present our approach to SemEval Task 8: Tabular Question Answering, where we develop a large language model (LLM)-based agent capable of understanding and reasoning over tabular inputs. Our agent leverages a hybrid retrieval and generation strategy, incorporating structured table parsing, semantic understanding, and reasoning mechanisms to enhance response accuracy. We fine-tune a pre-trained LLM on domain-specific tabular data, integrating chain-of-thought prompting and adaptive decoding to improve multi-hop reasoning over tables. Experimental results demonstrate that our approach achieves competitive performance, effectively handling numerical operations, entity linking, and logical inference. Our findings suggest that LLM-based agents, when properly adapted, can significantly advance the state of the art in tabular question answering.

2024

pdf bib abs

LTRC-IIITH at EHRSQL 2024: Enhancing Reliability of Text-to-SQL Systems through Abstention and Confidence Thresholding
Jerrin Thomas | Pruthwik Mishra | Dipti Sharma | Parameswari Krishnamurthy
Proceedings of the 6th Clinical Natural Language Processing Workshop

In this paper, we present our work in the EHRSQL 2024 shared task which tackles reliable text-to-SQL modeling on Electronic Health Records. Our proposed system tackles the task with three modules - abstention module, text-to-SQL generation module, and reliability module. The abstention module identifies whether the question is answerable given the database schema. If the question is answerable, the text-to-SQL generation module generates the SQL query and associated confidence score. The reliability module has two key components - confidence score thresholding, which rejects generations with confidence below a pre-defined level, and error filtering, which identifies and excludes SQL queries that result in execution errors. In the official leaderboard for the task, our system ranks 6th. We have also made the source code public.

pdf bib abs

LTRC-IIITH at MEDIQA-M3G 2024: Medical Visual Question Answering with Vision-Language Models
Jerrin Thomas | Sushvin Marimuthu | Parameswari Krishnamurthy
Proceedings of the 6th Clinical Natural Language Processing Workshop

In this paper, we present our work to the MEDIQA-M3G 2024 shared task, which tackles multilingual and multimodal medical answer generation. Our system consists of a lightweight Vision-and-Language Transformer (ViLT) model which is fine-tuned for the clinical dermatology visual question-answering task. In the official leaderboard for the task, our system ranks 6th. After the challenge, we experiment with training the ViLT model on more data. We also explore the capabilities of large Vision-Language Models (VLMs) such as Gemini and LLaVA.

pdf bib abs

Assessing Translation Capabilities of Large Language Models involving English and Indian Languages
Vandan Mujadia | Ashok Urlana | Yash Bhaskar | Penumalla Aditya Pavani | Kukkapalli Shravya | Parameswari Krishnamurthy | Dipti Sharma
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)

Generative Large Language Models (LLMs) have achieved remarkable advances in various NLP tasks. In this work, our aim is to explore the multilingual capabilities of large language models by using machine translation as a task involving English and 22 Indian languages. We first investigate the translation capabilities of raw large-language models, followed by exploring the in-context learning capabilities of the same raw models. We fine-tune these large language models using parameter-efficient fine-tuning methods such as LoRA and additionally with full fine-tuning. Through our study, we have identified the model that performs best among the large language models available for the translation task.Our results demonstrate significant progress, with average BLEU scores of 13.42, 15.93, 12.13, 12.30, and 12.07, as well as chrF scores of 43.98, 46.99, 42.55, 42.42, and 45.39, respectively, using two-stage fine-tuned LLaMA-13b for English to Indian languages on IN22 (conversational), IN22 (general), flores200-dev, flores200-devtest, and newstest2019 testsets. Similarly, for Indian languages to English, we achieved average BLEU scores of 14.03, 16.65, 16.17, 15.35 and 12.55 along with chrF scores of 36.71, 40.44, 40.26, 39.51, and 36.20, respectively, using fine-tuned LLaMA-13b on IN22 (conversational), IN22 (general), flores200-dev, flores200-devtest and newstest2019 testsets. Overall, our findings highlight the potential and strength of large language models for machine translation capabilities, including languages that are currently underrepresented in LLMs.

pdf bib abs

Automating Humor: A Novel Approach to Joke Generation Using Template Extraction and Infilling
Mayank Goel | Parameswari Krishnamurthy | Radhika Mamidi
Proceedings of the 21st International Conference on Natural Language Processing (ICON)

This paper presents a novel approach to humor generation in natural language processing by automating the creation of jokes through template extraction and infilling. Traditional methods have relied on predefined templates or neural network models, which either lack complexity or fail to produce genuinely humorous content. Our method introduces a technique to extract templates from existing jokes based on semantic salience and BERT’s attention weights. We then infill these templates using advanced techniques, through BERT and large language models (LLMs) like GPT-4, to generate new jokes. Our results indicate that the generated jokes are novel and human-like, with BERT showing promise in generating funny content and GPT-4 excelling in creating clever jokes. The study contributes to a deeper understanding of humor generation and the potential of AI in creative domains.

pdf bib abs

Decoding Fake Narratives in Spreading Hateful Stories: A Dual-Head RoBERTa Model with Multi-Task Learning
Yash Bhaskar | Sankalp Bahad | Parameswari Krishnamurthy
Proceedings of the 21st International Conference on Natural Language Processing (ICON): Shared Task on Decoding Fake Narratives in Spreading Hateful Stories (Faux-Hate)

Social media platforms, while enabling globalconnectivity, have become hubs for the rapidspread of harmful content, including hatespeech and fake narratives (Davidson et al.,2017; Shu et al., 2017). The Faux-Hateshared task focuses on detecting a specific phe-nomenon: the generation of hate speech drivenby fake narratives, termed Faux-Hate. Partici-pants are challenged to identify such instancesin code-mixed Hindi-English social media text.This paper describes our system developed forthe shared task, addressing two primary sub-tasks: (a) Binary Faux-Hate detection, involv-ing fake and hate speech classification, and(b) Target and Severity prediction, categoriz-ing the intended target and severity of hate-ful content. Our approach combines advancednatural language processing techniques withdomain-specific pretraining to enhance perfor-mance across both tasks. The system achievedcompetitive results, demonstrating the efficacyof leveraging multi-task learning for this com-plex problem.

pdf bib abs

Fine-tuning Pre-trained Named Entity Recognition Models For Indian Languages
Sankalp Bahad | Pruthwik Mishra | Parameswari Krishnamurthy | Dipti Sharma
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)

Named Entity Recognition (NER) is a use-ful component in Natural Language Process-ing (NLP) applications. It is used in varioustasks such as Machine Translation, Summa-rization, Information Retrieval, and Question-Answering systems. The research on NER iscentered around English and some other ma-jor languages, whereas limited attention hasbeen given to Indian languages. We analyze thechallenges and propose techniques that can betailored for Multilingual Named Entity Recog-nition for Indian Languages. We present a hu-man annotated named entity corpora of ∼40Ksentences for 4 Indian languages from two ofthe major Indian language families. Addition-ally, we show the transfer learning capabilitiesof pre-trained transformer models from a highresource language to multiple low resource lan-guages through a series of experiments. Wealso present a multilingual model fine-tunedon our dataset, which achieves an F1 score of∼0.80 on our dataset on average. We achievecomparable performance on completely unseenbenchmark datasets for Indian languages whichaffirms the usability of our model.

pdf bib abs

Noot Noot at SemEval-2024 Task 7: Numerical Reasoning and Headline Generation
Sankalp Bahad | Yash Bhaskar | Parameswari Krishnamurthy
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

Natural language processing (NLP) modelshave achieved remarkable progress in recentyears, particularly in tasks related to semanticanalysis. However, many existing benchmarksprimarily focus on lexical and syntactic un-derstanding, often overlooking the importanceof numerical reasoning abilities. In this pa-per, we argue for the necessity of incorporatingnumeral-awareness into NLP evaluations andpropose two distinct tasks to assess this capabil-ity: Numerical Reasoning and Headline Gener-ation. We present datasets curated for each taskand evaluate various approaches using both au-tomatic and human evaluation metrics. Ourresults demonstrate the diverse strategies em-ployed by participating teams and highlight thepromising performance of emerging modelslike Mixtral 8x7b instruct. We discuss the im-plications of our findings and suggest avenuesfor future research in advancing numeral-awarelanguage understanding and generation.

pdf bib abs

Fine-tuning Language Models for AI vs Human Generated Text detection
Sankalp Bahad | Yash Bhaskar | Parameswari Krishnamurthy
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

In this paper, we introduce a machine-generated text detection system designed totackle the challenges posed by the prolifera-tion of large language models (LLMs). Withthe rise of LLMs such as ChatGPT and GPT-4,there is a growing concern regarding the po-tential misuse of machine-generated content,including misinformation dissemination. Oursystem addresses this issue by automating theidentification of machine-generated text acrossmultiple subtasks: binary human-written vs.machine-generated text classification, multi-way machine-generated text classification, andhuman-machine mixed text detection. We em-ploy the RoBERTa Base model and fine-tuneit on a diverse dataset encompassing variousdomains, languages, and sources. Throughrigorous evaluation, we demonstrate the effec-tiveness of our system in accurately detectingmachine-generated text, contributing to effortsaimed at mitigating its potential misuse.

pdf bib abs

NootNoot At SemEval-2024 Task 6: Hallucinations and Related Observable Overgeneration Mistakes Detection
Sankalp Bahad | Yash Bhaskar | Parameswari Krishnamurthy
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

Semantic hallucinations in neural language gen-eration systems pose a significant challenge tothe reliability and accuracy of natural languageprocessing applications. Current neural mod-els often produce fluent but incorrect outputs,undermining the usefulness of generated text.In this study, we address the task of detectingsemantic hallucinations through the SHROOM(Semantic Hallucinations Real Or Mistakes)dataset, encompassing data from diverse NLGtasks such as definition modeling, machinetranslation, and paraphrase generation. We in-vestigate three methodologies: fine-tuning onlabelled training data, fine-tuning on labelledvalidation data, and a zero-shot approach usingthe Mixtral 8x7b instruct model. Our resultsdemonstrate the effectiveness of these method-ologies in identifying semantic hallucinations,with the zero-shot approach showing compet-itive performance without additional training.Our findings highlight the importance of robustdetection mechanisms for ensuring the accu-racy and reliability of neural language genera-tion systems.

pdf bib abs

Exploring News Summarization and Enrichment in a Highly Resource-Scarce Indian Language: A Case Study of Mizo
Abhinaba Bala | Ashok Urlana | Rahul Mishra | Parameswari Krishnamurthy
Proceedings of the 7th Workshop on Indian Language Data: Resources and Evaluation

Obtaining sufficient information in one’s mother tongue is crucial for satisfying the information needs of the users. While high-resource languages have abundant online resources, the situation is less than ideal for very low-resource languages. Moreover, the insufficient reporting of vital national and international events continues to be a worry, especially in languages with scarce resources, like Mizo. In this paper, we conduct a study to investigate the effectiveness of a simple methodology designed to generate a holistic summary for Mizo news articles, which leverages English-language news to supplement and enhance the information related to the corresponding news events. Furthermore, we make available 500 Mizo news articles and corresponding enriched holistic summaries. Human evaluation confirms that our approach significantly enhances the information coverage of Mizo news articles.

pdf bib abs

MTNLP-IIITH: Machine Translation for Low-Resource Indic Languages
Abhinav P M | Ketaki Shetye | Parameswari Krishnamurthy
Proceedings of the Ninth Conference on Machine Translation

Machine Translation for low-resource languages presents significant challenges, primarily due to limited data availability. We have a baseline model and a primary model. For the baseline model, we first fine-tune the mBART model (mbart-large-50-many-to-many-mmt) for the language pairs English-Khasi, Khasi-English, English-Manipuri, and Manipuri-English. We then augment the dataset by back-translating from Indic languages to English. To enhance data quality, we fine-tune the LaBSE model specifically for Khasi and Manipuri, generating sentence embeddings and applying a cosine similarity threshold of 0.84 to filter out low-quality back-translations. The filtered data is combined with the original training data and used to further fine-tune the mBART model, creating our primary model. The results show that the primary model slightly outperforms the baseline model, with the best performance achieved by the English-to-Khasi (en-kh) primary model, which recorded a BLEU score of 0.0492, a chrF score of 0.3316, and a METEOR score of 0.2589 (on a scale of 0 to 1), with similar results for other language pairs.

pdf bib abs

Yes-MT’s Submission to the Low-Resource Indic Language Translation Shared Task in WMT 2024
Yash Bhaskar | Parameswari Krishnamurthy
Proceedings of the Ninth Conference on Machine Translation

This paper presents the systems submitted by the Yes-MT team for the Low-Resource Indic Language Translation Shared Task at WMT 2024, focusing on translating between English and the Assamese, Mizo, Khasi, and Manipuri languages. The experiments explored various approaches, including fine-tuning pre-trained models like mT5 and IndicBart in both Multilingual and Monolingual settings, LoRA finetune IndicTrans2, zero-shot and few-shot prompting with large language models (LLMs) like Llama 3 and Mixtral 8x7b, LoRA Supervised Fine Tuning Llama 3, and training Transformers from scratch. The results were evaluated on the WMT23 Low-Resource Indic Language Translation Shared Task’s test data using SacreBLEU and CHRF highlighting the challenges of low-resource translation and show the potential of LLMs for these tasks, particularly with fine-tuning.

2023

pdf bib abs

AbhiPaw@DravidianLangTech: Multimodal Abusive Language Detection and Sentiment Analysis
Abhinaba Bala | Parameswari Krishnamurthy
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages

Detecting abusive language in multimodal videos has become a pressing need in ensuring a safe and inclusive online environment. This paper focuses on addressing this challenge through the development of a novel approach for multimodal abusive language detection in Tamil videos and sentiment analysis for Tamil/Malayalam videos. By leveraging state-of-the-art models such as Multiscale Vision Transformers (MViT) for video analysis, OpenL3 for audio analysis, and the bert-base-multilingual-cased model for textual analysis, our proposed framework integrates visual, auditory, and textual features. Through extensive experiments and evaluations, we demonstrate the effectiveness of our model in accurately detecting abusive content and predicting sentiment categories. The limited availability of effective tools for performing these tasks in Dravidian Languages has prompted a new avenue of research in these domains.

pdf bib abs

AbhiPaw@ DravidianLangTech: Abusive Comment Detection in Tamil and Telugu using Logistic Regression
Abhinaba Bala | Parameswari Krishnamurthy
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages

Abusive comments in online platforms have become a significant concern, necessitating the development of effective detection systems. However, limited work has been done in low resource languages, including Dravidian languages. This paper addresses this gap by focusing on abusive comment detection in a dataset containing Tamil, Tamil-English and Telugu-English code-mixed comments. Our methodology involves logistic regression and explores suitable embeddings to enhance the performance of the detection model. Through rigorous experimentation, we identify the most effective combination of logistic regression and embeddings. The results demonstrate the performance of our proposed model, which contributes to the development of robust abusive comment detection systems in low resource language settings. Keywords: Abusive comment detection, Dravidian languages, logistic regression, embeddings, low resource languages, code-mixed dataset.

pdf bib abs

AbhiPaw@ DravidianLangTech: Fake News Detection in Dravidian Languages using Multilingual BERT
Abhinaba Bala | Parameswari Krishnamurthy
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages

This study addresses the challenge of detecting fake news in Dravidian languages by leveraging Google’s MuRIL (Multilingual Representations for Indian Languages) model. Drawing upon previous research, we investigate the intricacies involved in identifying fake news and explore the potential of transformer-based models for linguistic analysis and contextual understanding. Through supervised learning, we fine-tune the “muril-base-cased” variant of MuRIL using a carefully curated dataset of labeled comments and posts in Dravidian languages, enabling the model to discern between original and fake news. During the inference phase, the fine-tuned MuRIL model analyzes new textual content, extracting contextual and semantic features to predict the content’s classification. We evaluate the model’s performance using standard metrics, highlighting the effectiveness of MuRIL in detecting fake news in Dravidian languages and contributing to the establishment of a safer digital ecosystem. Keywords: fake news detection, Dravidian languages, MuRIL, transformer-based models, linguistic analysis, contextual understanding.

2022

pdf bib

Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages
Bharathi Raja Chakravarthi | Ruba Priyadharshini | Anand Kumar Madasamy | Parameswari Krishnamurthy | Elizabeth Sherly | Sinnathamby Mahesan
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

pdf bib abs

This paper presents the overview of the shared task on emotional analysis in Tamil. The result of the shared task is presented at the workshop. This paper presents the dataset used in the shared task, task description, and the methodology used by the participants and the evaluation results of the submission. This task is organized as two Tasks. Task A is carried with 11 emotions annotated data for social media comments in Tamil and Task B is organized with 31 fine-grained emotion annotated data for social media comments in Tamil. For conducting experiments, training and development datasets were provided to the participants and results are evaluated for the unseen data. Totally we have received around 24 submissions from 13 teams. For evaluating the models, Precision, Recall, micro average metrics are used.

pdf bib abs

We present our findings from the first shared task on Multi-task Learning in Dravidian Languages at the second Workshop on Speech and Language Technologies for Dravidian Languages. In this task, a sentence in any of three Dravidian Languages is required to be classified into two closely related tasks namely Sentiment Analyis (SA) and Offensive Language Identification (OLI). The task spans over three Dravidian Languages, namely, Kannada, Malayalam, and Tamil. It is one of the first shared tasks that focuses on Multi-task Learning for closely related tasks, especially for a very low-resourced language family such as the Dravidian language family. In total, 55 people signed up to participate in the task, and due to the intricate nature of the task, especially in its first iteration, 3 submissions have been received.

2021

pdf bib

Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages
Bharathi Raja Chakravarthi | Ruba Priyadharshini | Anand Kumar M | Parameswari Krishnamurthy | Elizabeth Sherly
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages

pdf bib abs

This paper presents an overview of the shared task on machine translation of Dravidian languages. We presented the shared task results at the EACL 2021 workshop on Speech and Language Technologies for Dravidian Languages. This paper describes the datasets used, the methodology used for the evaluation of participants, and the experiments’ overall results. As a part of this shared task, we organized four sub-tasks corresponding to machine translation of the following language pairs: English to Tamil, English to Malayalam, English to Telugu and Tamil to Telugu which are available at https://competitions.codalab.org/competitions/27650. We provided the participants with training and development datasets to perform experiments, and the results were evaluated on unseen test data. In total, 46 research groups participated in the shared task and 7 experimental runs were submitted for evaluation. We used BLEU scores for assessment of the translations.

pdf bib abs

IIITK@DravidianLangTech-EACL2021: Offensive Language Identification and Meme Classification in Tamil, Malayalam and Kannada
Nikhil Ghanghor | Parameswari Krishnamurthy | Sajeetha Thavareesan | Ruba Priyadharshini | Bharathi Raja Chakravarthi
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages

This paper describes the IIITK team’s submissions to the offensive language identification, and troll memes classification shared tasks for Dravidian languages at DravidianLangTech 2021 workshop@EACL 2021. Our best configuration for Tamil troll meme classification achieved 0.55 weighted average F1 score, and for offensive language identification, our system achieved weighted F1 scores of 0.75 for Tamil, 0.95 for Malayalam, and 0.71 for Kannada. Our rank on Tamil troll meme classification is 2, and offensive language identification in Tamil, Malayalam and Kannada are 3, 3 and 4 respectively.

pdf bib

Proceedings of the First Workshop on Parsing and its Applications for Indian Languages
Kengatharaiyer Sarveswaran | Parameswari Krishnamurthy | Pruthwik Mishra
Proceedings of the First Workshop on Parsing and its Applications for Indian Languages

pdf bib abs

Parsing Subordinate Clauses in Telugu using Rule-based Dependency Parser
P Sangeetha | Parameswari Krishnamurthy | Amba Kulkarni
Proceedings of the First Workshop on Parsing and its Applications for Indian Languages

Parsing has been gaining popularity in recent years and attracted the interest of NLP researchers around the world. It is challenging when the language under study is a free-word order language that allows ellipsis like Telugu. In this paper, an attempt is made to parse subordinate clauses especially, non-finite verb clauses and relative clauses in Telugu which are highly productive and constitute a large chunk in parsing tasks. This study adopts a knowledge-driven approach to parse subordinate structures using linguistic cues as rules. Challenges faced in parsing ambiguous structures are elaborated alongside providing enhanced tags to handle them. Results are encouraging and this parser proves to be efficient for Telugu.

pdf bib

Towards Building a Modern Written Tamil Treebank
Parameswari Krishnamurthy | Kengatharaiyer Sarveswaran
Proceedings of the 20th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2021)

pdf bib abs

NITK-UoH: Tamil-Telugu Machine Translation Systems for the WMT21 Similar Language Translation Task
Richard Saldanha | Ananthanarayana V. S | Anand Kumar M | Parameswari Krishnamurthy
Proceedings of the Sixth Conference on Machine Translation

In this work, two Neural Machine Translation (NMT) systems have been developed and evaluated as part of the bidirectional Tamil-Telugu similar languages translation subtask in WMT21. The OpenNMT-py toolkit has been used to create quick prototypes of the systems, following which models have been trained on the training datasets containing the parallel corpus and finally the models have been evaluated on the dev datasets provided as part of the task. Both the systems have been trained on a DGX station with 4 -V100 GPUs. The first NMT system in this work is a Transformer based 6 layer encoder-decoder model, trained for 100000 training steps, whose configuration is similar to the one provided by OpenNMT-py and this is used to create a model for bidirectional translation. The second NMT system contains two unidirectional translation models with the same configuration as the first system, with the addition of utilizing Byte Pair Encoding (BPE) for subword tokenization through the pre-trained MultiBPEmb model. Based on the dev dataset evaluation metrics for both the systems, the first system i.e. the vanilla Transformer model has been submitted as the Primary system. Since there were no improvements in the metrics during training of the second system with BPE, it has been submitted as a contrastive system.