Ritesh Kumar

2025

Large Language Models (LLMs) can be used to convert natural language (NL) instructions into structured business process automation (BPA) process artifacts. This paper contributes (i) FLOW-BENCH, a high-quality dataset of paired NL instructions and business process definitions to evaluate NL-based BPA tools and support research in this area, and (ii) FLOW-GEN, our approach to utilize LLMs to translate NL into an intermediate Python representation that facilitates final conversion into widely adopted business process definition languages, such as BPMN and DMN. We bootstrap FLOW-BENCH by demonstrating how it can be used to evaluate the components of FLOW-GEN across eight LLMs. We hope that FLOW-GEN and FLOW-BENCH catalyze further research in BPA.

We present LiFE Suite as a “Field-to-Model” pipeline, designed to bridge community-centred data collection with scalable language model development. This paper describes the various tools integrated into the LiFE Suite that make this unified pipeline possible. Atekho, a mobile-first data collection platform, is designed to empower communities to assert their rights over their data. MATra-Lab, a web-based data processing and annotation tool, supports the management of field data and the creation of NLP-ready datasets with support from existing state-of-the-art NLP models. LiFE Model Studio, built on top of Hugging Face AutoTrain, offers a no-code solution for building scalable language models using the field data. This end-to-end integration ensures that every dataset collected in the field retains its linguistic, cultural, and metadata context, all the way through to deployable AI models and archive-ready datasets.

This paper deals with the dual task of developing a medical question answering (QA) system and generating concise summaries of medical dialogue data across nine languages (English and eight Indian languages). The medical dialogue data focuses on two critical health issues: Head and Neck Cancer (HNC) and Cystic Fibrosis (NLP AI4health shared task). The proposed framework utilises a dual approach: a fine-tuned small Multilingual Text-to-Text Transfer Transformer (mT5) model for the conversational summarisation component and a fine-tuned Retrieval Augmented Generation (RAG) system integrating the dense intfloat/e5-large language model for the language-independent QA component. The efficacy of the proposed approaches is demonstrated by promising precision in the QA task. Our framework achieved the highest F1 scores in QA for three Indian languages, with F1 scores of 0.3995 in Marathi, 0.7803 in Bangla, and 0.74759 in Hindi. We achieved the highest COMET score of 0.5626 on the Gujarati QA test set. For the dialogue summarisation task, our model registered the highest ROUGE-2 and ROUGE-L precision across all eight Indian languages, with English being the sole exception. These results confirm the potential of our approach to improve e-health in dialogue data for low-resource Indian languages.

pdf bib abs
Automated Telescope-Paper Linkage via Multi-Model Ensemble Learning
Ojaswa Ojaswa Varshney | Prashasti Vyas | Priyanka Goyal | Tarpita Singh | Ritesh Kumar | Mayank Singh
Proceedings of the Third Workshop for Artificial Intelligence for Scientific Publications

Automated linkage between scientific publications and telescope datasets is a cornerstone for scalable bibliometric analyses and for ensuring scientific reproducibility in astrophysics. We propose a multi-model ensemble architecture that integrates the transformer models DeBERTa and RoBERTa with TF-IDF logistic regression, tailored to the WASP-2025 shared task on telescope-paper classification. Our approach achieves a macro F1 score approaching 0.78 after extensive multi-seed ensembling and per-label threshold tuning, significantly outperforming baseline models. This paper presents a comprehensive methodology, ablation studies, and an in-depth discussion of challenges, establishing a robust benchmark for scientific bibliometric task automation.
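
The abstract above mentions ensembling and per-label threshold tuning without giving details. The following is a minimal sketch of that general idea, not the authors' implementation: the function names, the toy probabilities and the threshold grid are all assumptions made for illustration. It averages per-label probabilities from several models, then picks for each label the threshold that maximises F1 on validation data.

```python
from typing import List, Sequence


def average_probs(model_probs: Sequence[Sequence[Sequence[float]]]) -> List[List[float]]:
    """Average probabilities over models; model_probs[m][i][k] is the
    probability that model m assigns to label k for example i."""
    n_models = len(model_probs)
    n_examples = len(model_probs[0])
    n_labels = len(model_probs[0][0])
    return [
        [sum(model_probs[m][i][k] for m in range(n_models)) / n_models
         for k in range(n_labels)]
        for i in range(n_examples)
    ]


def f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 for one label; defined as 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_thresholds(probs, labels, grid) -> List[float]:
    """For each label, return the grid threshold with the best validation F1."""
    best = []
    for k in range(len(labels[0])):
        y_true = [row[k] for row in labels]
        scores = [row[k] for row in probs]
        best.append(max(grid, key=lambda t: f1(y_true, [int(s >= t) for s in scores])))
    return best


def macro_f1(probs, labels, thresholds) -> float:
    """Unweighted mean of the per-label F1 scores at the given thresholds."""
    per_label = []
    for k, t in enumerate(thresholds):
        y_true = [row[k] for row in labels]
        y_pred = [int(row[k] >= t) for row in probs]
        per_label.append(f1(y_true, y_pred))
    return sum(per_label) / len(per_label)


# Toy validation data (hypothetical): two models, four examples, two labels.
model_a = [[0.9, 0.5], [0.7, 0.3], [0.4, 0.1], [0.1, 0.2]]
model_b = [[0.7, 0.3], [0.5, 0.4], [0.2, 0.1], [0.3, 0.1]]
gold = [[1, 1], [1, 1], [0, 0], [0, 0]]

avg = average_probs([model_a, model_b])
tuned = tune_thresholds(avg, gold, grid=[0.25, 0.5, 0.75])
print("tuned thresholds:", tuned)
print("macro F1 at a fixed 0.5:", macro_f1(avg, gold, [0.5, 0.5]))
print("macro F1 after tuning:", macro_f1(avg, gold, tuned))
```

In this toy the thresholds are tuned and scored on the same examples; in practice the tuned thresholds would be applied to a separate test set.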

2024

pdf bib abs
HarmPot: An Annotation Framework for Evaluating Offline Harm Potential of Social Media Text
Ritesh Kumar | Ojaswee Bhalla | Madhu Vanthi | Shehlat Maknoon Wani | Siddharth Singh
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In this paper, we discuss the development of an annotation schema to build datasets for evaluating the offline harm potential of social media texts. We define “harm potential” as the potential for an online public post to cause real-world physical harm (i.e., violence). Understanding that real-world violence is often spurred by a web of triggers, often combining several online tactics and pre-existing intersectional fissures in the social milieu, to result in targeted physical violence, we do not focus on any single divisive aspect (i.e., caste, gender, religion, or other identities of the victim and perpetrators) nor do we focus on just hate speech or mis/dis-information. Rather, our understanding of the intersectional causes of such triggers focuses our attempt at measuring the harm potential of online content, irrespective of whether it is hateful or not. In this paper, we discuss the development of a framework/annotation schema that allows annotating the data with different aspects of the text including its socio-political grounding and intent of the speaker (as expressed through mood and modality) that together contribute to it being a trigger for offline harm. We also give a comparative analysis and mapping of our framework with some of the existing frameworks.

pdf bib
Proceedings of the Fourth Workshop on Threat, Aggression & Cyberbullying @ LREC-COLING-2024
Ritesh Kumar | Atul Kr. Ojha | Shervin Malmasi | Bharathi Raja Chakravarthi | Bornini Lahiri | Siddharth Singh | Shyam Ratan
Proceedings of the Fourth Workshop on Threat, Aggression & Cyberbullying @ LREC-COLING-2024

2023

pdf bib abs
An Open-source Web-based Application for Development of Resources and Technologies in Underresourced Languages
Siddharth Singh | Shyam Ratan | Neerav Mathur | Ritesh Kumar
Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)

The paper discusses the Linguistic Field Data Management and Analysis System (LiFE), a new open-source, web-based software that systematises storage, management, annotation, analysis and sharing of linguistic data gathered from the field as well as that crawled from various sources on the web such as YouTube, Twitter, Facebook, Instagram, blogs, newspapers, Wikipedia, etc. The app supports two broad workflows - (a) the field linguists’ workflow, in which data is collected directly from the speakers in the field and analysed further to produce grammatical descriptions, lexicons, educational materials and possibly language technologies; (b) the computational linguists’ workflow, in which data is collected from the web using automated crawlers or digitised using manual or semi-automatic means, annotated for various tasks and then used for developing different kinds of language technologies. In addition to supporting these workflows, the app provides some additional features - (a) it allows multiple users to collaboratively work on the same project via its granular access control and sharing option; (b) it allows the data to be exported to various formats including CSV, TSV, JSON, XLSX, PDF, Textgrid, RDF (different serialisation formats), etc., as appropriate; (c) it allows data import from various formats, viz. LIFT XML, XLSX, JSON, CSV, TSV, Textgrid, etc.; (d) it allows users to start working in the app at any stage of their work by giving the option to either create a new project from scratch or derive a new project from an existing project in the app. The app is currently available for use and testing on our server (http://life.unreal-tece.co.in/) and its source code has been released under AGPL license on our GitHub repository (https://github.com/unrealtecellp/life). It is licensed under separate, specific conditions for commercial usage.

pdf bib
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
Atul Kr. Ojha | A. Seza Doğruöz | Giovanni Da San Martino | Harish Tayyar Madabushi | Ritesh Kumar | Elisa Sartori
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

2022

pdf bib abs
Towards a Unified Tool for the Management of Data and Technologies in Field Linguistics and Computational Linguistics - LiFE
Siddharth Singh | Ritesh Kumar | Shyam Ratan | Sonal Sinha
Proceedings of the Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia within the 13th Language Resources and Evaluation Conference

The paper presents a new software - Linguistic Field Data Management and Analysis System - LiFE for endangered and low-resourced languages - an open-source, web-based linguistic data analysis and management application allowing systematic storage, management, usage and sharing of linguistic data collected from the field. The application enables users to store lexical items, sentences, paragraphs, audio-visual content including photographs, video clips, speech recordings, etc., with rich glossing and annotation. For field linguists, it provides facilities to generate interactive and print dictionaries; for NLP practitioners, it provides data storage and representation in standard formats such as RDF, JSON and CSV. The tool provides a one-click interface to train NLP models for various tasks using the data stored in the system and then use them for assistance in further storage of the data (especially for the field linguists). At the same time, the tool also provides the facility of using models trained outside of the tool for data storage, transcription, annotation and other tasks. The web-based application allows for seamless collaboration among multiple persons and sharing of the data, models, etc. with each other.

The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation, and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements on several fronts that were made in the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 66 new languages, including 24 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g., missing gender and macrons information. We have amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet.

In this paper, we discuss the development of a multilingual dataset annotated with a hierarchical, fine-grained tagset marking different types of aggression and the “context” in which they occur. The context, here, is defined by the conversational thread in which a specific comment occurs and also the “type” of discursive role that the comment is performing with respect to the previous comment. The initial dataset being discussed here consists of a total of 59,152 annotated comments in four languages - Meitei, Bangla, Hindi, and Indian English - collected from various social media platforms such as YouTube, Facebook, Twitter and Telegram. As is usual on social media websites, a large number of these comments are multilingual, mostly code-mixed with English. The paper gives a detailed description of the tagset being used for annotation and also the process of developing a multi-label, fine-grained tagset that has been used for marking comments with aggression and bias of various kinds including sexism (called gender bias in the tagset), religious intolerance (called communal bias in the tagset), class/caste bias and ethnic/racial bias. We also define and discuss the tags that have been used for marking the different discursive roles being performed through the comments, such as attack, defend, etc. Finally, we present a basic statistical analysis of the dataset. The dataset is being incrementally made publicly available on the project website.

pdf bib abs
IIT Dhanbad @LT-EDI-ACL2022- Hope Speech Detection for Equality, Diversity, and Inclusion
Vishesh Gupta | Ritesh Kumar | Rajendra Pamula
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

Hope is considered significant for the wellbeing, recuperation and restoration of human life by health professionals. Hope speech reflects the belief that one can discover pathways to their desired objectives and become roused to utilise those pathways. Hope speech offers support, reassurance, suggestions, inspiration and insight. Hate speech is a prevalent practice that society has to struggle with every day. The freedom of speech and ease of anonymity granted by social media has also resulted in incitement to hatred. In this paper, we work to identify and promote positive and supportive content on these platforms. We work with several machine learning models to classify social media comments as hope speech or non-hope speech in English. This paper portrays our work for the Shared Task on Hope Speech Detection for Equality, Diversity, and Inclusion at LT-EDI-ACL 2022.

pdf bib abs
IIT DHANBAD CODECHAMPS at SemEval-2022 Task 5: MAMI - Multimedia Automatic Misogyny Identification
Shubham Barnwal | Ritesh Kumar | Rajendra Pamula
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

With the growth of the internet, the use of image-based social media such as Twitter and Instagram has drastically increased. Women are highly active on these platforms: 75% of women use social media multiple times a day, compared with 65% of men. However, systematic inequality and discrimination offline are also replicated in online spaces in the form of memes. A meme is essentially an image characterized by pictorial content with an overlaying text a posteriori introduced by humans, with the main goal of being funny and/or ironic. Although most memes are created with the intent of making funny jokes, people soon started to use them as a form of hate and prejudice against women, leading to sexist and aggressive messages in online environments that subsequently amplify the sexual stereotyping and gender inequality of the offline world. This leads to the need for automatic detection of misogynous memes. Specifically, I describe the model submitted for the shared task on Multimedia Automatic Misogyny Identification (MAMI); my team name is IIT DHANBAD CODECHAMPS.

pdf bib
Proceedings of the Third Workshop on Threat, Aggression and Cyberbullying (TRAC 2022)
Ritesh Kumar | Atul Kr. Ojha | Marcos Zampieri | Shervin Malmasi | Daniel Kadar
Proceedings of the Third Workshop on Threat, Aggression and Cyberbullying (TRAC 2022)

2021

pdf bib abs
Multilingual Protest News Detection - Shared Task 1, CASE 2021
Ali Hürriyetoğlu | Osman Mutlu | Erdem Yörük | Farhana Ferdousi Liza | Ritesh Kumar | Shyam Ratan
Proceedings of the 4th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021)

Benchmarking state-of-the-art text classification and information extraction systems in multilingual, cross-lingual, few-shot, and zero-shot settings for socio-political event information collection is achieved in the scope of the shared task Socio-political and Crisis Events Detection at the workshop CASE @ ACL-IJCNLP 2021. Socio-political event data is utilized for national and international policy- and decision-making. Therefore, the reliability and validity of these datasets are of the utmost importance. We split the shared task into three parts to address the three aspects of data collection (Task 1), fine-grained semantic classification (Task 2), and evaluation (Task 3). Task 1, which is the focus of this report, is on multilingual protest news detection and comprises four subtasks: document classification (subtask 1), sentence classification (subtask 2), event sentence coreference identification (subtask 3), and event extraction (subtask 4). All subtasks had English, Portuguese, and Spanish for both training and evaluation data. Data in Hindi was available only for the evaluation of subtask 1. The majority of the submissions, 238 in total, were created using multi- and cross-lingual approaches. Best scores are above 77.27 F1-macro for subtask 1, above 85.32 F1-macro for subtask 2, above 84.23 CoNLL 2012 average score for subtask 3, and above 66.20 F1-macro for subtask 4 in all evaluation settings. The performance of the best system for subtask 4 is above 66.20 F1 for all available languages. Although there is still significant room for improvement in cross-lingual and zero-shot settings, the best submissions for each evaluation scenario yield remarkable results. Monolingual models outperformed the multilingual models in a few evaluation scenarios.

pdf bib abs
Demo of the Linguistic Field Data Management and Analysis System - LiFE
Siddharth Singh | Ritesh Kumar | Shyam Ratan | Sonal Sinha
Proceedings of the 18th International Conference on Natural Language Processing (ICON)

In the proposed demo, we will present a new software - Linguistic Field Data Management and Analysis System - LiFE - an open-source, web-based linguistic data management and analysis application that allows for systematic storage, management, sharing and usage of linguistic data collected from the field. The application allows users to store lexical items, sentences, paragraphs, audio-visual content including photographs, video clips, speech recordings, etc., along with rich glossing / annotation; generate interactive and print dictionaries; and also train and use natural language processing tools and models for various purposes using this data. Since it is a web-based application, it also allows for seamless collaboration among multiple persons and sharing of the data, models, etc. with each other. The system uses the Python-based Flask framework and MongoDB (as database) in the backend and HTML, CSS and JavaScript at the frontend. The interface allows creation of multiple projects that could be shared with the other users. At the backend, the application stores the data in RDF format so as to allow its release as Linked Data over the web using semantic web technologies - as of now it makes use of OntoLex-Lemon for storing the lexical data and Ligt for storing the interlinear glossed text, and then internally links it to other linked lexicons and databases such as DBpedia and WordNet. Furthermore, it provides support for training NLP systems using the scikit-learn and HuggingFace Transformers libraries as well as for making use of any model trained using these libraries - while the user interface itself provides limited options for tuning the system, an externally-trained model could be easily incorporated within the application; similarly, the dataset itself could be easily exported into a standard machine-readable format like JSON or CSV that could be consumed by other programs and pipelines.
The system is built as an online platform; however since we are making the source code available, it could be installed by users on their internal / personal servers as well.

pdf bib
Proceedings of the 18th International Conference on Natural Language Processing: Shared Task on Multilingual Gender Biased and Communal Language Identification
Ritesh Kumar | Siddharth Singh | Enakshi Nandi | Shyam Ratan | Laishram Niranjana Devi | Bornini Lahiri | Akanksha Bansal | Akash Bhagat | Yogesh Dawer
Proceedings of the 18th International Conference on Natural Language Processing: Shared Task on Multilingual Gender Biased and Communal Language Identification

pdf bib abs
ComMA@ICON: Multilingual Gender Biased and Communal Language Identification Task at ICON-2021
Ritesh Kumar | Shyam Ratan | Siddharth Singh | Enakshi Nandi | Laishram Niranjana Devi | Akash Bhagat | Yogesh Dawer | Bornini Lahiri | Akanksha Bansal
Proceedings of the 18th International Conference on Natural Language Processing: Shared Task on Multilingual Gender Biased and Communal Language Identification

This paper presents the findings of the ICON-2021 shared task on Multilingual Gender Biased and Communal Language Identification, which aims to identify aggression, gender bias, and communal bias in data presented in four languages: Meitei, Bangla, Hindi and English. The participants were presented the option of approaching the task as three separate classification tasks, a multi-label classification task or a structured classification task. If approached as three separate classification tasks, the task includes three sub-tasks: aggression identification (sub-task A), gender bias identification (sub-task B), and communal bias identification (sub-task C). For this task, the participating teams were provided with a total dataset of approximately 12,000 comments, with 3,000 comments in each of the four languages, sourced from popular social media sites such as YouTube, Twitter, Facebook and Telegram, and the three labels presented as a single tuple. For testing the systems, approximately 1,000 comments were provided in each language for every sub-task. We attracted a total of 54 registrations in the task, out of which 11 teams submitted their test runs. The best system obtained an overall instance-F1 of 0.371 on the multilingual test set (it was simply a combined test set of the instances in each individual language). In the individual sub-tasks, the best micro F1 scores are 0.539, 0.767 and 0.834 for sub-tasks A, B and C respectively. The best overall, averaged micro F1 is 0.713. The results show that while systems have managed to perform reasonably well in individual sub-tasks, especially the gender bias and communal bias tasks, it is substantially more difficult to do a 3-class classification of aggression level and even more difficult to build a system that classifies everything correctly.
It is only in slightly over 1/3 of the instances that most of the systems predicted the correct class across the board, despite the fact that there was a significant overlap across the three sub-tasks.
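
The gap described above, between per-sub-task scores and getting all three labels right at once, can be illustrated with a small sketch. It is not the shared task's official scorer: exact match is used here as a simple stand-in for instance-level correctness, and the label strings and toy predictions are assumptions. For a single-label sub-task where every instance receives exactly one prediction, micro F1 equals accuracy, which is what the per-task function computes.

```python
from typing import List, Sequence, Tuple

# One instance carries a tuple of three labels:
# (aggression, gender bias, communal bias). Label strings are illustrative.
Labels = Tuple[str, str, str]


def per_task_accuracy(gold: Sequence[Labels], pred: Sequence[Labels]) -> List[float]:
    """Accuracy of each sub-task taken on its own (A, B, C)."""
    n = len(gold)
    return [sum(g[k] == p[k] for g, p in zip(gold, pred)) / n for k in range(3)]


def exact_match(gold: Sequence[Labels], pred: Sequence[Labels]) -> float:
    """Share of instances where all three labels are correct together."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


gold = [
    ("OAG", "GEN", "COM"),
    ("NAG", "NGEN", "NCOM"),
    ("CAG", "NGEN", "NCOM"),
    ("NAG", "GEN", "NCOM"),
    ("OAG", "NGEN", "COM"),
    ("CAG", "NGEN", "NCOM"),
]
pred = [
    ("OAG", "GEN", "COM"),    # all three correct
    ("NAG", "NGEN", "NCOM"),  # all three correct
    ("NAG", "NGEN", "NCOM"),  # aggression wrong
    ("NAG", "NGEN", "NCOM"),  # gender bias wrong
    ("CAG", "NGEN", "COM"),   # aggression wrong
    ("CAG", "NGEN", "COM"),   # communal bias wrong
]

print("per-task accuracy (A, B, C):", per_task_accuracy(gold, pred))
print("all three correct:", exact_match(gold, pred))
```

Each sub-task looks reasonable in isolation in this toy, yet only a third of the instances are fully correct, because the errors fall on different instances.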

pdf bib abs
Developing Universal Dependencies Treebanks for Magahi and Braj
Mohit Raj | Shyam Ratan | Deepak Alok | Ritesh Kumar | Atul Kr. Ojha
Proceedings of the First Workshop on Parsing and its Applications for Indian Languages

In this paper, we discuss the development of treebanks for two low-resourced Indian languages - Magahi and Braj - based on the Universal Dependencies framework. The Magahi treebank contains 945 sentences and the Braj treebank around 500 sentences marked with their lemmas, part-of-speech, morphological features and universal dependencies. This paper gives a description of the different dependency relationships found in the two languages and gives some statistics of the two treebanks. The dataset will be made publicly available on the Universal Dependencies (UD) repository in the next (v2.10) release.

This year’s iteration of the SIGMORPHON Shared Task on morphological reinflection focuses on typological diversity and cross-lingual variation of morphosyntactic features. In terms of the task, we enrich UniMorph with new data for 32 languages from 13 language families, with most of them being under-resourced: Kunwinjku, Classical Syriac, Arabic (Modern Standard, Egyptian, Gulf), Hebrew, Amharic, Aymara, Magahi, Braj, Kurdish (Central, Northern, Southern), Polish, Karelian, Livvi, Ludic, Veps, Võro, Evenki, Xibe, Tuvan, Sakha, Turkish, Indonesian, Kodi, Seneca, Asháninka, Yanesha, Chukchi, Itelmen, Eibela. We evaluate six systems on the new data and conduct an extensive error analysis of the systems’ predictions. Transformer-based models generally demonstrate superior performance on the majority of languages, achieving >90% accuracy on 65% of them. The languages on which systems yielded low accuracy are mainly under-resourced, with a limited amount of data. Most errors made by the systems are due to allomorphy, honorificity, and form variation. In addition, we observe that systems especially struggle to inflect multiword lemmas. The systems also produce misspelled forms or end up in repetitive loops (e.g., RNN-based models). Finally, we report a large drop in systems’ performance on previously unseen lemmas.

While language identification is a fundamental speech and language processing task, for many languages and language families it remains a challenging task. For many low-resource and endangered languages this is in part due to resource availability: where larger datasets exist, they may be single-speaker or have different domains than desired application scenarios, demanding a need for domain and speaker-invariant language identification systems. This year’s shared task on robust spoken language identification sought to investigate just this scenario: systems were to be trained on largely single-speaker speech from one domain, but evaluated on data in other domains recorded from speakers under different recording circumstances, mimicking realistic low-resource scenarios. We see that domain and speaker mismatch proves very challenging for current methods which can perform above 95% accuracy in-domain, which domain adaptation can address to some degree, but that these conditions merit further investigation to make spoken language identification accessible in many scenarios.

pdf bib abs
Anlirika: An LSTM–CNN Flow Twister for Spoken Language Identification
Andreas Scherbakov | Liam Whittle | Ritesh Kumar | Siddharth Singh | Matthew Coleman | Ekaterina Vylomova
Proceedings of the Third Workshop on Computational Typology and Multilingual NLP

The paper presents Anlirika’s submission to the SIGTYP 2021 Shared Task on Robust Spoken Language Identification. The task aims at building a robust system that generalizes well across different domains and speakers. The training data is limited to a single domain with predominantly a single speaker per language, while the validation and test data samples are derived from diverse datasets and multiple speakers. We experiment with a neural system comprising a combination of dense, convolutional, and recurrent layers that are designed to generalize better and obtain speaker-invariant representations. We demonstrate that the task in its constrained form (without making use of external data or augmenting the training set with samples from the validation set) is still challenging. Our best system, trained on the data augmented with validation samples, achieves 29.9% accuracy on the test data.

2020

pdf bib abs
KMI-Panlingua-IITKGP @SIGTYP2020: Exploring rules and hybrid systems for automatic prediction of typological features
Ritesh Kumar | Deepak Alok | Akanksha Bansal | Bornini Lahiri | Atul Kr. Ojha
Proceedings of the Second Workshop on Computational Research in Linguistic Typology

This paper describes the work on the SigTyP 2020 Shared Task on the prediction of typological features carried out by the KMI-Panlingua-IITKGP team. The task entailed the prediction of missing feature values for a particular language, provided that the name of the language family, its genus, location (in terms of latitude and longitude coordinates and name of the country where it is spoken) and a set of feature-value pairs are available. In fulfillment of this task, the team submitted 3 kinds of system - 2 rule-based systems and one hybrid system. Of these 3, one rule-based system generated the best performance on the test set. All the systems were ‘constrained’ in the sense that no additional dataset or information, other than those provided by the organisers, was used for developing the systems.

pdf bib abs
Evaluating Aggression Identification in Social Media
Ritesh Kumar | Atul Kr. Ojha | Shervin Malmasi | Marcos Zampieri
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying

In this paper, we present the report and findings of the Shared Task on Aggression and Gendered Aggression Identification organised as part of the Second Workshop on Trolling, Aggression and Cyberbullying (TRAC - 2) at LREC 2020. The task consisted of two sub-tasks - aggression identification (sub-task A) and gendered aggression identification (sub-task B) - in three languages - Bangla, Hindi and English. For this task, the participants were provided with a dataset of approximately 5,000 instances from YouTube comments in each language. For testing, approximately 1,000 instances were provided in each language for each sub-task. A total of 70 teams registered to participate in the task and 19 teams submitted their test runs. The best system obtained a weighted F-score of approximately 0.80 in sub-task A and approximately 0.87 in sub-task B for all the three languages.

In this paper, we discuss the development of a multilingual annotated corpus of misogyny and aggression in Indian English, Hindi, and Indian Bangla as part of a project on studying and automatically identifying misogyny and communalism on social media (the ComMA Project). The dataset is collected from comments on YouTube videos and currently contains a total of over 20,000 comments. The comments are annotated at two levels - aggression (overtly aggressive, covertly aggressive, and non-aggressive) and misogyny (gendered and non-gendered). We describe the process of data collection, the tagset used for annotation, and issues and challenges faced during the process of annotation. Finally, we discuss the results of the baseline experiments conducted to develop a classifier for misogyny in the three languages.

pdf bib abs
NUIG-Panlingua-KMI Hindi-Marathi MT Systems for Similar Language Translation Task @ WMT 2020
Atul Kr. Ojha | Priya Rani | Akanksha Bansal | Bharathi Raja Chakravarthi | Ritesh Kumar | John P. McCrae
Proceedings of the Fifth Conference on Machine Translation

The NUIG-Panlingua-KMI submission to WMT 2020 seeks to push the state of the art in the Similar Language Translation Task for the Hindi↔Marathi language pair. As part of these efforts, we conducted a series of experiments to address the challenges of translation between similar languages. Among the 4 MT systems prepared under this task, 1 PBSMT system was prepared for each direction of Hindi↔Marathi and 1 NMT system was developed for each direction using Byte Pair Encoding (BPE) into subwords. The results show that NMT with different architectures could be an effective method for developing MT systems for closely related languages. Our Hindi-Marathi NMT system was ranked 8th among the 14 teams that participated and our Marathi-Hindi NMT system was ranked 8th among the 11 teams that participated in the task.
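
The abstract above refers to Byte Pair Encoding (BPE) into subwords but does not state the tooling or settings used. As a rough illustration of the technique only, the sketch below learns BPE merges from a toy word-frequency table and applies them to new words. The function names, the `</w>` end-of-word marker convention and the toy vocabulary are assumptions for this example, and ties between equally frequent pairs are broken arbitrarily but deterministically.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def merge_pair(symbols: Tuple[str, ...], pair: Pair) -> Tuple[str, ...]:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs: Dict[str, int], num_merges: int) -> List[Pair]:
    """Learn up to `num_merges` merges by repeatedly joining the most frequent pair."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges: List[Pair] = []
    for _ in range(num_merges):
        pairs: Counter = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Highest count wins; the pair itself is the deterministic tie-break.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def apply_bpe(word: str, merges: List[Pair]) -> List[str]:
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


toy_freqs = {"ab": 3, "abc": 2}  # hypothetical corpus counts
toy_merges = learn_bpe(toy_freqs, num_merges=2)
print(toy_merges)
print(apply_bpe("ab", toy_merges))
print(apply_bpe("abd", toy_merges))
```

Because an unseen word falls back to smaller units rather than to an unknown token, subword segmentation of this kind is often a good fit for closely related languages that share many word parts.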

2019

pdf bib abs
Predicting the Type and Target of Offensive Posts in Social Media
Marcos Zampieri | Shervin Malmasi | Preslav Nakov | Sara Rosenthal | Noura Farra | Ritesh Kumar
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

As offensive content has become pervasive in social media, there has been much research in identifying potentially offensive messages. However, previous work on this topic did not consider the problem as a whole, but rather focused on detecting very specific types of offensive content, e.g., hate speech, cyberbullying, or cyber-aggression. In contrast, here we target several different kinds of offensive content. In particular, we model the task hierarchically, identifying the type and the target of offensive messages in social media. For this purpose, we compiled the Offensive Language Identification Dataset (OLID), a new dataset with tweets annotated for offensive content using a fine-grained three-layer annotation scheme, which we make publicly available. We discuss the main similarities and differences between OLID and pre-existing datasets for hate speech identification, aggression detection, and similar tasks. We further experiment with and compare the performance of different machine learning models on OLID.

SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)
Marcos Zampieri | Shervin Malmasi | Preslav Nakov | Sara Rosenthal | Noura Farra | Ritesh Kumar
Proceedings of the 13th International Workshop on Semantic Evaluation

We present the results and the main findings of SemEval-2019 Task 6 on Identifying and Categorizing Offensive Language in Social Media (OffensEval). The task was based on a new dataset, the Offensive Language Identification Dataset (OLID), which contains over 14,000 English tweets, and it featured three sub-tasks. In sub-task A, systems were asked to discriminate between offensive and non-offensive posts. In sub-task B, systems had to identify the type of offensive content in the post. Finally, in sub-task C, systems had to detect the target of the offensive posts. OffensEval attracted a large number of participants and it was one of the most popular tasks in SemEval-2019. In total, nearly 800 teams signed up to participate in the task and 115 of them submitted results, which are presented and analyzed in this report.

bhanodaig at SemEval-2019 Task 6: Categorizing Offensive Language in social media
Ritesh Kumar | Guggilla Bhanodai | Rajendra Pamula | Maheswara Reddy Chennuru
Proceedings of the 13th International Workshop on Semantic Evaluation

This paper describes the work that our team bhanodaig did at the Indian Institute of Technology (ISM) towards OffensEval, i.e. identifying and categorizing offensive language in social media. Out of the three sub-tasks, we participated in sub-task B: automatic categorization of offensive types. We perform the task of categorizing offensive language according to whether the tweet is a targeted insult or untargeted. We use a linear Support Vector Machine for classification. The official ranking metric is macro-averaged F1. Our system achieves a score of 0.5282 with an accuracy of 0.8792. As a new entrant to the field, we find our scores encouraging enough to work for better results in future.

Panlingua-KMI MT System for Similar Language Translation Task at WMT 2019
Atul Kr. Ojha | Ritesh Kumar | Akanksha Bansal | Priya Rani
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

The present paper describes the development of the Panlingua-KMI Machine Translation (MT) systems for the Hindi ↔ Nepali language pair, designed as part of the Similar Language Translation Task at the WMT 2019 Shared Task. The Panlingua-KMI team conducted a series of experiments to explore both phrase-based statistical (PBSMT) and neural (NMT) methods. Among the 11 MT systems prepared under this task, 6 PBSMT systems were prepared for Nepali-Hindi, 3 PBSMT systems for Hindi-Nepali, and 2 NMT systems were developed for Nepali↔Hindi. The results show that PBSMT could be an effective method for developing MT systems for closely related languages. Our Hindi-Nepali PBSMT system was ranked 2nd among the 13 systems submitted for the pair and our Nepali-Hindi PBSMT system was ranked 4th among the 12 systems submitted for the task.

2018

Aggression-annotated Corpus of Hindi-English Code-mixed Data
Ritesh Kumar | Aishwarya N. Reganti | Akshit Bhatia | Tushar Maheshwari
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

We present the results and the findings of the Second VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects. The campaign was organized as part of the fifth edition of the VarDial workshop, co-located with COLING 2018. This year, the campaign included five shared tasks: two task re-runs, Arabic Dialect Identification (ADI) and German Dialect Identification (GDI), and three new tasks, Morphosyntactic Tagging of Tweets (MTT), Discriminating between Dutch and Flemish in Subtitles (DFS), and Indo-Aryan Language Identification (ILI). A total of 24 teams submitted runs across the five shared tasks and contributed 22 system description papers, which were included in the VarDial workshop proceedings and are referred to in this report.

Part-of-Speech Annotation of English-Assamese code-mixed texts: Two Approaches
Ritesh Kumar | Manas Jyoti Bora
Proceedings of the First International Workshop on Language Cognition and Computational Models

In this paper, we discuss the development of a part-of-speech tagger for English-Assamese code-mixed texts. We compare two approaches to annotating code-mixed data: a) annotation of the texts from the two languages using monolingual resources from each language, and b) annotation of the text through a different resource created specifically for code-mixed data. We present a comparative study of the effort required in each approach and the final performance of the system. Based on this, we argue that it might be better to develop new technologies using code-mixed data instead of monolingual, ‘clean’ data, especially for those languages for which significant tools and technologies are not yet available.

Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)
Ritesh Kumar | Atul Kr. Ojha | Marcos Zampieri | Shervin Malmasi
Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)

Benchmarking Aggression Identification in Social Media
Ritesh Kumar | Atul Kr. Ojha | Shervin Malmasi | Marcos Zampieri
Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)

In this paper, we present the report and findings of the Shared Task on Aggression Identification organised as part of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-1) at COLING 2018. The task was to develop a classifier that could discriminate between Overtly Aggressive, Covertly Aggressive, and Non-aggressive texts. For this task, the participants were provided with a dataset of 15,000 aggression-annotated Facebook posts and comments each in Hindi (in both Roman and Devanagari script) and English for training and validation. For testing, two different sets, one from Facebook and another from a different social media platform, were provided. A total of 130 teams registered to participate in the task, 30 teams submitted their test runs, and finally 20 teams also sent their system description papers, which are included in the TRAC workshop proceedings. The best system obtained a weighted F-score of 0.64 for both Hindi and English on the Facebook test sets, while the best scores on the surprise set were 0.60 and 0.50 for English and Hindi respectively. The results presented in this report show how challenging the task is. The positive response from the community and the high level of participation in the first edition of this shared task also highlight the interest in this topic.

TRAC-1 Shared Task on Aggression Identification: IIT(ISM)@COLING’18
Ritesh Kumar | Guggilla Bhanodai | Rajendra Pamula | Maheshwar Reddy Chennuru
Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)

This paper describes the work that our team bhanodaig did at the Indian Institute of Technology (ISM) towards the TRAC-1 Shared Task on Aggression Identification in Social Media for COLING 2018. In this paper we classify aggression into three categories: Overtly Aggressive, Covertly Aggressive and Non-aggressive. We train a model to differentiate between these categories and then analyze the results in order to better understand how we can distinguish between them. We participated in two different tasks, the English (Facebook) task and the English (Social Media) task. For the English (Facebook) task, System 05 was our best run (0.3572), above the Random Baseline (0.3535). For the English (Social Media) task, our System 02 scored 0.1960, below the Random Baseline (0.3477). For all of our runs we used a Long Short-Term Memory model. Overall, our performance is not satisfactory. However, as a new entrant to the field, we find our scores encouraging enough to work for better results in future.

2014

Developing Politeness Annotated Corpus of Hindi Blogs
Ritesh Kumar
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper I discuss the creation and annotation of a corpus of Hindi blogs. The corpus consists of a total of over 479,000 blog posts and blog comments. It is annotated with information about the politeness level of each blog post and blog comment. The annotation is carried out using four levels of politeness: neutral, appropriate, polite and impolite. For the annotation, three classifiers, maximum entropy (MaxEnt), Support Vector Machines (SVM) and C4.5, were trained and tested using around 30,000 manually annotated texts. Among these, C4.5 gave the best accuracy. It achieved an accuracy of around 78%, which is within 2% of the human accuracy during annotation. Consequently, this classifier is used to annotate the rest of the corpus.

2012

Challenges in the development of annotated corpora of computer-mediated communication in Indian Languages: A Case of Hindi
Ritesh Kumar
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The present paper describes an ongoing effort to compile and annotate a large corpus of computer-mediated communication (CMC) in Hindi. It describes the process of compiling the corpus, the basic structure of the corpus, its annotation, and the challenges faced in the creation of such a corpus. It also gives a description of the technologies developed for the processing of the data, addition of the metadata and annotation of the corpus. Since it is a corpus of written communication, it provides quite a distinctive challenge for the annotation process. Besides POS annotation, it will also be annotated at higher levels of representation. Once completely developed, it will be a very useful resource of Hindi for research in the areas of linguistics, NLP and other social sciences research related to communication, particularly computer-mediated communication. Besides this, the challenges discussed here and the way they are tackled could be taken as a model for developing corpora of computer-mediated communication in other Indian languages. Furthermore, the technologies developed for the construction of this corpus will also be made publicly available.

Developing a POS tagger for Magahi: A Comparative Study
Ritesh Kumar | Bornini Lahiri | Deepak Alok
Proceedings of the 10th Workshop on Asian Language Resources