Anand Kumar M

Also published as: Anand Kumar Madasamy, Anand Kumar M, Anand Kumar M.

2025

Integrating Graph based Algorithm and Transformer Models for Abstractive Summarization
Sayed Ayaan Ahmed Sha | Sangeetha Sivanesan | Anand Kumar Madasamy | Navya Binu
Proceedings of the 1st Workshop on NLP for Empowering Justice (JUST-NLP 2025)

Summarizing legal documents is a challenging and critical task in the field of Natural Language Processing (NLP). On top of that, generating abstractive summaries for legal judgments poses a significant challenge to researchers, as various language models limit the number of input tokens. In this paper, we experimented with two models for generating abstractive summaries of Indian legal documents as part of the JUSTNLP 2025 Shared Task on Legal Summarization: a BART base model fine-tuned on the CNN/DailyMail dataset and combined with TextRank, and pegasus_indian_legal, a version of legal-pegasus fine-tuned on Indian legal judgments. BART+TextRank outperformed pegasus_indian_legal with a score of 18.84.

Tutorial on Trustworthy Legal Text Processing with LLMs: Retrieval, Rhetorical Roles, Summarization, and Trustworthy Generation
Anand Kumar M | Sangeetha S | Manikandan R | Anjali R
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Tutorial Abstract

This half-day tutorial provides a comprehensive overview of Legal Natural Language Processing (NLP) with LLM for participants with a basic understanding of Computational Linguistics or NLP concepts. We introduce how NLP can help analyze and manage legal text by covering five key topics: legal text analysis with LLM insights, legal text retrieval, rhetorical role identification, legal text summarization, and addressing bias and hallucination in legal tasks. Our goals are to explain why these tasks matter for researchers in the legal domain, describe the challenges and open problems, and outline current solutions. This proposed tutorial blends lectures, live examples, and Q&A to help researchers and students see how language technology and LLMs can make legal information more understandable and efficient.

SCaLER@ALTA 2025: Hybrid and Bi-Encoder Approaches for Adverse Drug Event Mention Normalization
Shelke Akshay Babasaheb | Anand Kumar Madasamy
Proceedings of the 23rd Annual Workshop of the Australasian Language Technology Association

This paper describes the system developed by Team Scaler for the ALTA 2025 Shared Task on Adverse Drug Event (ADE) Mention Normalization. The task aims to normalize free-text mentions of adverse events to standardized MedDRA concepts. We present and compare two architectures: (1) a Hybrid Candidate Generation + Neural Reranker approach using a pretrained PubMedBERT model, and (2) a bi-encoder model based on SapBERT, fine-tuned to align ADE mentions with MedDRA concepts. The hybrid approach retrieves candidate terms through semantic similarity search and refines the ranking using a neural reranker, while the bi-encoder jointly embeds mentions and concepts into a shared semantic space. On the development set, the hybrid reranker achieves Accuracy@1 = 0.3840, outperforming the bi-encoder (Accuracy@1 = 0.3298). The bi-encoder system was used for official submission and ranked third overall in the competition. Our analysis highlights the complementary strengths of both retrieval-based and embedding-based normalization strategies.
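As a rough illustration of the Accuracy@1 metric reported above, the following sketch ranks candidate concepts by cosine similarity in a shared embedding space and counts how often the gold concept is ranked first. The two-dimensional vectors, the concept identifiers, and the function names are hypothetical placeholders, not the team's SapBERT embeddings or MedDRA codes.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_concepts(mention_vec, concept_vecs):
    """Return concept ids sorted by descending similarity to the mention."""
    return sorted(
        concept_vecs,
        key=lambda cid: cosine(mention_vec, concept_vecs[cid]),
        reverse=True,
    )

def accuracy_at_k(mention_vecs, gold_ids, concept_vecs, k=1):
    """Fraction of mentions whose gold concept appears in the top-k ranking."""
    hits = 0
    for vec, gold in zip(mention_vecs, gold_ids):
        if gold in rank_concepts(vec, concept_vecs)[:k]:
            hits += 1
    return hits / len(gold_ids)

# Toy data (assumed for illustration only).
concepts = {"C_headache": [1.0, 0.0], "C_nausea": [0.0, 1.0]}
mentions = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
gold = ["C_headache", "C_nausea", "C_nausea"]

print(accuracy_at_k(mentions, gold, concepts, k=1))
```

In this toy setup the third mention lies closer to the wrong concept, so it counts as a miss at k=1 but as a hit at k=2.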

SeqTNS: Sequential Tolerance-based Classifier for Identification of Rhetorical Roles in Indian Legal Documents
Arjun T D | Anand Kumar Madasamy | Sheela Ramanna
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Identifying rhetorical roles in legal judgments is a foundational step for automating legal reasoning, summarization, and retrieval. In this paper, we propose a novel Sequential Tolerance-based Classifier (SeqTNS) for rhetorical role classification in Indian legal documents. The proposed classifier leverages semantic similarity and contextual dependencies by using label-sequence-aware BiLSTMs on top of word embeddings from a fine-tuned InLegalBERT model. These enriched embeddings are clustered into tolerance classes via a tolerance relation using a cosine distance threshold, enabling the model to make flexible, similarity-based predictions. We evaluate SeqTNS on two benchmark datasets annotated with thirteen and seven rhetorical roles, respectively. The proposed method outperforms fine-tuned transformer baselines (LegalBERT, InLegalBERT) as well as the previously developed tolerance relation-based (TNS) model, achieving a weighted F1 score of 0.78 on the thirteen-class dataset and a macro F1 of 0.83 on the seven-class dataset, while reducing training time by 39-40% compared to state-of-the-art BiLSTM-CRF models. The larger of our two datasets is substantial, containing over 40,000 sentences and 1.3M tokens, and serves as a challenging real-world benchmark. Additionally, we use LIME for explainability and t-SNE to validate the coherence of tolerance-based clusters.
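The idea of a tolerance relation with a cosine distance threshold can be sketched as follows. This is a simplified neighbourhood-based variant written for illustration: two vectors are treated as related when their cosine distance is at most a threshold eps, and a new sentence takes the majority label of the training vectors it is related to. The threshold value, the toy vectors, the labels, and the nearest-neighbour fallback are all assumptions, and the paper's actual tolerance class construction may differ.

```python
import math
from collections import Counter

def cosine_distance(u, v):
    """1 - cosine similarity; 0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - (dot / norm if norm else 0.0)

def tolerance_neighbourhood(x, train_vecs, eps):
    """Indices of training vectors related to x under the tolerance relation."""
    return [i for i, v in enumerate(train_vecs) if cosine_distance(x, v) <= eps]

def predict(x, train_vecs, train_labels, eps):
    """Majority label within the tolerance neighbourhood of x."""
    idx = tolerance_neighbourhood(x, train_vecs, eps)
    if not idx:
        # Assumed fallback: use the single nearest training vector.
        idx = [min(range(len(train_vecs)),
                   key=lambda i: cosine_distance(x, train_vecs[i]))]
    return Counter(train_labels[i] for i in idx).most_common(1)[0][0]

# Toy sentence embeddings and rhetorical role labels (hypothetical).
train_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
train_labels = ["FACTS", "FACTS", "RULING"]

print(predict([0.95, 0.05], train_vecs, train_labels, eps=0.1))
print(predict([0.1, 0.9], train_vecs, train_labels, eps=0.1))
```

A larger eps makes neighbourhoods overlap more, which is what distinguishes a tolerance relation from a strict partition into equivalence classes.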

SCaLAR_NITK @ JUSTNLP Legal Summarization (L-SUMM) Shared Task
Arjun T D | Anand Kumar Madasamy
Proceedings of the 1st Workshop on NLP for Empowering Justice (JUST-NLP 2025)

This paper presents the systems we submitted to the JUST-NLP 2025 Shared Task on Legal Summarization (L-SUMM). Creating abstractive summaries of lengthy Indian court rulings is challenging due to transformer token limits. To address this problem, we compare three systems built on a fine-tuned Legal Pegasus model. System 1 (Baseline) applies a standard hierarchical framework that chunks long documents using naive token-based segmentation. System 2 (RR-Chunk) improves this approach by using a BERT-BiLSTM model to tag sentences with rhetorical roles (RR) and incorporating these tags (e.g., [Facts] ...) to enable structurally informed chunking for hierarchical summarization. System 3 (WRR-Tune) tests whether explicit importance cues help the model by assigning importance scores to each RR using the geometric mean of their distributional presence in judgments and human summaries, and fine-tuning a separate model on text augmented with these tags (e.g., [Facts, importance score 13.58]). A comparison of the three systems demonstrates the value of progressively adding structural and quantitative importance signals to the model’s input.
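One possible reading of the importance score in System 3 is sketched below: for each rhetorical role, take its share of sentences in judgments and its share in human summaries, and combine the two with a geometric mean. The percentage values, the function names, and this exact definition of "distributional presence" are assumptions made for illustration; only the tag format follows the example given in the abstract.

```python
import math

def importance_scores(judgment_share, summary_share):
    """Geometric mean of each role's share in judgments and in summaries."""
    return {
        role: math.sqrt(judgment_share[role] * summary_share[role])
        for role in judgment_share
    }

def tag_sentence(role, sentence, scores):
    """Prefix a sentence with its rhetorical role and importance score."""
    return f"[{role}, importance score {scores[role]:.2f}] {sentence}"

# Hypothetical shares (in percent) of sentences carrying each role.
judgment_share = {"Facts": 16.0, "Ruling": 4.0}
summary_share = {"Facts": 9.0, "Ruling": 25.0}

scores = importance_scores(judgment_share, summary_share)
print(tag_sentence("Facts", "The appellant filed a petition.", scores))
```

The geometric mean rewards roles that are present in both distributions: a role that is frequent in judgments but nearly absent from summaries receives a low score.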

2024

Leveraging Physical and Semantic Features of text item for Difficulty and Response Time Prediction of USMLE Questions
Gummuluri Venkata Ravi Ram | Ashinee Kesanam | Anand Kumar M
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)

This paper presents our system developed for the Shared Task on Automated Prediction of Item Difficulty and Item Response Time for USMLE questions, organized by the Association for Computational Linguistics (ACL) Special Interest Group for building Educational Applications (BEA SIGEDU). The Shared Task, held as a workshop at the North American Chapter of the Association for Computational Linguistics (NAACL) 2024 conference, aimed to advance the state of the art in predicting item characteristics directly from item text, with implications for the fairness and validity of standardized exams. We compared various methods, ranging from BERT for regression to Random Forest, Gradient Boosting (GB), Linear Regression, Support Vector Regressor (SVR), k-nearest neighbours (KNN) Regressor, and MultiLayer Perceptron (MLP) to a custom ANN using BioBERT and Word2Vec embeddings, and provided inferences on which performed better. This paper also explains the importance of data augmentation to balance the data in order to get better results. We also proposed five hypotheses regarding factors impacting difficulty and response time for a question and verified them, thereby helping researchers derive meaningful numerical attributes for accurate prediction. We achieved an RMSE score of 0.315 for Difficulty prediction and 26.945 for Response Time.

Findings of the First Shared Task on Offensive Span Identification from Code-Mixed Kannada-English Comments
Manikandan Ravikiran | Ratnavel Rajalakshmi | Bharathi Raja Chakravarthi | Anand Kumar Madasamy | Sajeetha Thavareesan
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Effectively managing offensive content is crucial on social media platforms to encourage positive online interactions. However, addressing offensive content in code-mixed Dravidian languages faces challenges, as current moderation methods focus on flagging entire comments rather than pinpointing specific offensive segments. This limitation stems from a lack of annotated data and accessible systems designed to identify offensive language sections. To address this, our shared task presents a dataset comprising Kannada-English code-mixed social comments, encompassing offensive comments. This paper outlines the dataset, the utilized algorithms, and the results obtained by systems participating in this shared task.

Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages
Bharathi Raja Chakravarthi | Ruba Priyadharshini | Anand Kumar Madasamy | Sajeetha Thavareesan | Elizabeth Sherly | Rajeswari Nadarajan | Manikandan Ravikiran
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

pdf bib abs

This paper offers a detailed overview of the first shared task on “Multitask Meme Classification - Unraveling Misogynistic and Trolls in Online Memes,” organized as part of the LT-EDI@EACL 2024 conference. The task was set to classify misogynistic content and troll memes within online platforms, focusing specifically on memes in Tamil and Malayalam languages. A total of 52 teams registered for the competition, with four submitting systems for the Tamil meme classification task and three for the Malayalam task. The outcomes of this shared task are significant, providing insights into the current state of misogynistic content in digital memes and highlighting the effectiveness of various computational approaches in identifying such detrimental content. The top-performing model got a macro F1 score of 0.73 in Tamil and 0.87 in Malayalam.

ScalarLab@TRAC2024: Exploring Machine Learning Techniques for Identifying Potential Offline Harm in Multilingual Commentaries
Anagha H C | Saatvik M. Krishna | Soumya Sangam Jha | Vartika T. Rao | Anand Kumar M
Proceedings of the Fourth Workshop on Threat, Aggression & Cyberbullying @ LREC-COLING-2024

The objective of the shared task, Offline Harm Potential Identification (HarmPot-ID), is to build models to predict the offline harm potential of social media texts. “Harm potential” is defined as the ability of an online post or comment to incite offline physical harm such as murder, arson, riot, rape, etc. The first subtask was to predict the level of harm potential, and the second was to identify the group at which this harm was directed. This paper details our submissions for the shared task, which include a cascaded SVM model, an XGBoost model, and a TF-IDF weighted Word2Vec embedding-supported SVM model. Several other models that were explored are also detailed.

Detecting Suicide Risk Patterns using Hierarchical Attention Networks with Large Language Models
Koushik L | Vishruth M | Anand Kumar M
Proceedings of the 9th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2024)

Suicide has become a major public health and social concern worldwide. This paper investigates a method that uses Large Language Models (LLMs) to extract the likely reason for a person's suicide attempt by analyzing their social media posts describing the event. From this data we can extract the reasons behind such a mental state, which can support suicide prevention. This submission presents our approach for the CLPsych 2024 Shared Task. Our model uses Hierarchical Attention Networks (HAN) and Llama2 to find supporting evidence about an individual’s suicide risk level.

2023

Multilingual Models for Sentiment and Abusive Language Detection for Dravidian Languages
Anand Kumar M
Proceedings of the Third Workshop on Language Technology for Equality, Diversity and Inclusion

This paper presents TF-IDF-based LSTM and Hierarchical Attention Networks (HAN) for code-mixed abusive comment detection and sentiment analysis for Dravidian languages. The traditional TF-IDF-based techniques have outperformed the Hierarchical Attention models in both the sentiment analysis and abusive language detection tasks. The Tulu sentiment analysis system demonstrated better performance for the Positive and Neutral classes, whereas the Tamil sentiment analysis system exhibited lower performance overall. This highlights the need for more balanced datasets and additional research to enhance the accuracy of sentiment analysis in the Tamil language. In terms of abusive language detection, the TF-IDF-LSTM models generally outperformed the Hierarchical Attention models. However, the mixed models displayed better performance for specific classes such as “Homophobia” and “Xenophobia.” This implies that considering both code-mixed and original script data can offer a different perspective for research in social media analysis.

NITK_LEGAL at SemEval-2023 Task 6: A Hierarchical based system for identification of Rhetorical Roles in legal judgements
Patchipulusu Sindhu | Diya Gupta | Sanjeevi Meghana | Anand Kumar M
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

The ability to automatically recognise the rhetorical roles of sentences in a legal case judgement is a crucial challenge to tackle, since it is useful for a number of downstream activities such as summarising legal judgements and performing legal searches. The task is exigent since legal case documents typically lack structure, and their rhetorical roles can be subjective. This paper describes our system for SemEval-2023 Task 6: LegalEval: Understanding Legal Texts, Sub-task A: Rhetorical Roles Prediction (RR). We propose a system to automatically predict the rhetorical roles of all the sentences in a legal case document using a Hierarchical Bi-LSTM CRF model and the RoBERTa transformer. We also showcase different techniques used to manipulate the dataset to generate a set of varying embeddings and train the Hierarchical Bi-LSTM CRF model to achieve better performance. Among all models, the one trained with the sent2vec embeddings concatenated with the handcrafted features performs best, with a micro F1-score of 0.74 on test data.

NITK-IT-NLP@DravidianLangTech: Impact of Focal Loss on Malayalam Fake News Detection using Transformers
Hariharan R L | Anand Kumar M
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages

Fake News Detection in Dravidian Languages is a shared task on classifying YouTube comments in the Malayalam language as fake or authentic news. In this work, we propose a transformer-based model trained with cross-entropy loss and focal loss, which classifies the comments into fake or authentic news. We experimented with different transformer-based models and modifications to the experimental setup; the fine-tuned MuRIL-based model with focal loss achieved the best overall macro F1-score of 0.87, placing second on the final leaderboard.
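To make the contrast between the two losses concrete, the sketch below computes cross-entropy and focal loss for a single example, given the probability the model assigns to the true class. It uses the standard focal loss form, -alpha * (1 - p)^gamma * log(p). The gamma and alpha values are illustrative assumptions, since the abstract does not state the settings used.

```python
import math

def cross_entropy(p_true):
    """Cross-entropy for one example, given the true-class probability."""
    return -math.log(p_true)

def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss: cross-entropy scaled down for well-classified examples."""
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)

# Compare an easy example (p=0.9) with a hard one (p=0.1).
for p in (0.9, 0.1):
    print(p, cross_entropy(p), focal_loss(p))
```

With gamma = 2, the easy example's loss is scaled by (1 - 0.9)^2 while the hard example's loss is scaled by (1 - 0.1)^2, so training focuses on misclassified comments, which can help on imbalanced data. Setting gamma = 0 recovers plain cross-entropy.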

Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages
Bharathi R. Chakravarthi | Ruba Priyadharshini | Anand Kumar M | Sajeetha Thavareesan | Elizabeth Sherly
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages

Interns@LT-EDI : Detecting Signs of Depression from Social Media Text
Koushik L | Hariharan R. L | Anand Kumar M
Proceedings of the Third Workshop on Language Technology for Equality, Diversity and Inclusion

This submission presents our approach for depression detection in social media text. The methodology includes data collection, preprocessing (SMOTE), feature extraction/selection (TF-IDF and GloVe), model development (SVM, CNN and Bi-LSTM), training, evaluation, optimisation, and validation. The proposed methodology aims to contribute to the accurate detection of depression.

Findings of the Second Shared Task on Offensive Span Identification from Code-Mixed Tamil-English Comments
Manikandan Ravikiran | Ananth Ganesh | Anand Kumar M | R Rajalakshmi | Bharathi Raja Chakravarthi
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages

Maintaining effective control over offensive content is essential on social media platforms to foster constructive online discussions. Yet, when it comes to code-mixed Dravidian languages, the current prevalence of offensive content moderation is restricted to categorizing entire comments, failing to identify specific portions that contribute to the offensiveness. Such limitation is primarily due to the lack of annotated data and open source systems for offensive spans. To alleviate this issue, in this shared task, we offer a collection of Tamil-English code-mixed social comments that include offensive comments. This paper provides an overview of the released dataset, the algorithms employed, and the outcomes achieved by the systems submitted for this task.

2022

Understanding the role of Emojis for emotion detection in Tamil
Ratnavel Rajalakshmi | Faerie Mattins R | Srivarshan Selvaraj | Antonette Shibani | Anand Kumar M | Bharathi Raja Chakravarthi
Proceedings of the First Workshop on Multimodal Machine Learning in Low-resource Languages

of expressing relevant idea through social media platforms and forums. At the same time, these memes are trolled by a person who tries to get identified from the other internet users like social media users, chat rooms and blogs. The memes contain both textual and visual information. Based on the content of memes, they are trolled in online community. There is no restriction for language usage in online media. The present work focuses on whether memes are trolled or not trolled. The proposed multi modal approach achieved considerably better weighted average F1 score of 0.5437 compared to Unimodal approaches. The other performance metrics like precision, recall, accuracy and macro average have also been studied to observe the proposed system.

Overview of the Shared Task on Machine Translation in Dravidian Languages
Anand Kumar Madasamy | Asha Hegde | Shubhanker Banerjee | Bharathi Raja Chakravarthi | Ruba Priyadharshini | Hosahalli Shashirekha | John Philip McCrae
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

This paper presents an outline of the shared task on translation of under-resourced Dravidian languages at DravidianLangTech-2022 workshop to be held jointly with ACL 2022. A description of the datasets used, approach taken for analysis of submissions and the results have been illustrated in this paper. Five sub-tasks organized as a part of the shared task include the following translation pairs: Kannada to Tamil, Kannada to Telugu, Kannada to Sanskrit, Kannada to Malayalam and Kannada to Tulu. Training, development and test datasets were provided to all participants and results were evaluated on the gold standard datasets. A total of 16 research groups participated in the shared task and a total of 12 submission runs were made for evaluation. Bilingual Evaluation Understudy (BLEU) score was used for evaluation of the translations.

Findings of the Shared Task on Offensive Span Identification from Code-Mixed Tamil-English Comments
Manikandan Ravikiran | Bharathi Raja Chakravarthi | Anand Kumar Madasamy | Sangeetha Sivanesan | Ratnavel Rajalakshmi | Sajeetha Thavareesan | Rahul Ponnusamy | Shankar Mahadevan
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

Offensive content moderation is vital in social media platforms to support healthy online discussions. However, their prevalence in code-mixed Dravidian languages is limited to classifying whole comments without identifying part of it contributing to offensiveness. Such limitation is primarily due to the lack of annotated data for offensive spans. Accordingly, in this shared task, we provide Tamil-English code-mixed social comments with offensive spans. This paper outlines the dataset so released, methods, and results of the submitted systems.

Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages
Bharathi Raja Chakravarthi | Ruba Priyadharshini | Anand Kumar Madasamy | Parameswari Krishnamurthy | Elizabeth Sherly | Sinnathamby Mahesan
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

NITK-IT_NLP@TamilNLP-ACL2022: Transformer based model for Toxic Span Identification in Tamil
Hariharan LekshmiAmmal | Manikandan Ravikiran | Anand Kumar Madasamy
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

Toxic span identification in Tamil is a shared task that focuses on identifying harmful content, contributing to offensiveness. In this work, we have built a model that can efficiently identify the span of text contributing to offensive content. We have used various transformer-based models to develop the system, out of which the fine-tuned MuRIL model was able to achieve the best overall character F1-score of 0.4489.

2021

NITK-UoH: Tamil-Telugu Machine Translation Systems for the WMT21 Similar Language Translation Task
Richard Saldanha | Ananthanarayana V. S | Anand Kumar M | Parameswari Krishnamurthy
Proceedings of the Sixth Conference on Machine Translation

In this work, two Neural Machine Translation (NMT) systems have been developed and evaluated as part of the bidirectional Tamil-Telugu similar languages translation subtask in WMT21. The OpenNMT-py toolkit has been used to create quick prototypes of the systems, following which models have been trained on the training datasets containing the parallel corpus and finally the models have been evaluated on the dev datasets provided as part of the task. Both the systems have been trained on a DGX station with 4 V100 GPUs. The first NMT system in this work is a Transformer-based 6-layer encoder-decoder model, trained for 100,000 training steps, whose configuration is similar to the one provided by OpenNMT-py and this is used to create a model for bidirectional translation. The second NMT system contains two unidirectional translation models with the same configuration as the first system, with the addition of utilizing Byte Pair Encoding (BPE) for subword tokenization through the pre-trained MultiBPEmb model. Based on the dev dataset evaluation metrics for both the systems, the first system, i.e. the vanilla Transformer model, has been submitted as the Primary system. Since there were no improvements in the metrics during training of the second system with BPE, it has been submitted as a contrastive system.

pdf bib abs

Detecting offensive language in social media in local languages is critical for moderating user-generated content. Thus, the field of offensive language identification in the under-resourced Tamil, Malayalam and Kannada languages is essential. As user-generated content is more code-mixed and not well studied for under-resourced languages, it is imperative to create resources and conduct benchmarking studies to encourage research in under-resourced Dravidian languages. We created a shared task on offensive language detection in Dravidian languages. We summarize here the dataset for this challenge, which is openly available at https://competitions.codalab.org/competitions/27654, and present an overview of the methods and the results of the competing systems.

Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages
Bharathi Raja Chakravarthi | Ruba Priyadharshini | Anand Kumar M | Parameswari Krishnamurthy | Elizabeth Sherly
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages

Classification of Censored Tweets in Chinese Language using XLNet
Shaikh Sahil Ahmed | Anand Kumar M.
Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda

With the growth of today’s world and advanced technology, social media networks play a significant role in impacting human lives. Censorship is the suppression of speech, public transmission, or other details that play a vast role in social media. The content may be considered harmful, sensitive, or inconvenient. Authorities like institutes, governments, and other organizations conduct censorship. This paper implements a model that classifies tweets as censored or uncensored, as a binary classification task. The paper describes our submission to the Censorship shared task of the NLP4IF 2021 workshop. We used various transformer-based pre-trained models, and XLNet achieved the best accuracy among them. We fine-tuned the model for better performance, achieved a reasonable accuracy, and calculated other performance metrics.

pdf bib abs

This paper presents an overview of the shared task on machine translation of Dravidian languages. We presented the shared task results at the EACL 2021 workshop on Speech and Language Technologies for Dravidian Languages. This paper describes the datasets used, the methodology used for the evaluation of participants, and the experiments’ overall results. As a part of this shared task, we organized four sub-tasks corresponding to machine translation of the following language pairs: English to Tamil, English to Malayalam, English to Telugu and Tamil to Telugu which are available at https://competitions.codalab.org/competitions/27650. We provided the participants with training and development datasets to perform experiments, and the results were evaluated on unseen test data. In total, 46 research groups participated in the shared task and 7 experimental runs were submitted for evaluation. We used BLEU scores for assessment of the translations.

2020

NITK NLP at FinCausal-2020 Task 1 Using BERT and Linear models.
Hariharan R L | Anand Kumar M
Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation

FinCausal-2020 is a shared task that focuses on causality detection in factual data for financial analysis. Financial data facts do not provide much explanation of the variability of these data. This paper proposes an efficient method to classify each text according to whether or not it has a financial cause. Many models were used to classify the data: the SVM model gave an F-Score of 0.9435, while BERT with specific fine-tuning achieved the best results with an F-Score of 0.9677.

2019

NITK-IT_NLP@NSURL2019: Transfer Learning based POS Tagger for Under Resourced Bhojpuri and Magahi Language
Anand Kumar M
Proceedings of the First International Workshop on NLP Solutions for Under Resourced Languages (NSURL 2019) co-located with ICNLSP 2019 - Short Papers

2018

TeamCEN at SemEval-2018 Task 1: Global Vectors Representation in Emotion Detection
Anon George | Barathi Ganesh H. B. | Anand Kumar M | Soman K P
Proceedings of the 12th International Workshop on Semantic Evaluation

Emotions are a way of expressing human sentiments. In the modern era, social media is a platform where we convey our emotions. These emotions can be joy, anger, sadness and fear. Understanding the emotions in written sentences is an interesting part of knowing about the writer. Of the digital language shared through social media, a considerable amount reflects sentiment or emotion towards some product, person or organization. Since these texts are from users with diverse social aspects, they can be used to enrich applications related to business intelligence. Beyond the sentiment itself, identifying the intensity of the sentiment will enrich the performance of the end application. In this paper, we treat intensity prediction as a text classification problem and evaluate a distributed representation of text obtained using the aggregated sum and dimensionality reduction of the GloVe vectors of the words present in the respective texts.

CENNLP at SemEval-2018 Task 2: Enhanced Distributed Representation of Text using Target Classes for Emoji Prediction Representation
Naveen J R | Hariharan V | Barathi Ganesh H. B. | Anand Kumar M | Soman K P
Proceedings of the 12th International Workshop on Semantic Evaluation

Emoji is one of the “fastest growing languages” in pop culture, especially on social media, and its usage is very unlikely to decrease. Emojis are generally used to bring an extra level of meaning to texts posted on social media platforms. Providing such added information gives more insight into the plain text, giving rise to hidden interpretations within the text. This paper explains our analysis of Task 2, the “Multilingual Emoji Prediction” shared task conducted by SemEval-2018. In the task, an emoji is predicted for a piece of Twitter text from among 20 different classes (the most commonly used emojis); these classes are learnt, and predictions are then made for unseen Twitter text. In this work, we have experimented with and analysed emoji prediction from Twitter text as a classification problem in which the accompanying emoji is considered the label for every individual text. We have implemented this using a distributed representation of text through fastText. We have also made an effort to demonstrate how the fastText framework can be useful for emoji prediction. The task is divided into two subtasks, based on datasets in two different languages, English and Spanish.

AmritaNLP at SemEval-2018 Task 10: Capturing discriminative attributes using convolution neural network over global vector representation.
Vivek Vinayan | Anand Kumar M | Soman K P
Proceedings of the 12th International Workshop on Semantic Evaluation

The “Capturing Discriminative Attributes” shared task is the tenth task of SemEval-2018. The task is to predict whether a word can capture distinguishing attributes of one word from another. We use GloVe word embeddings, pre-trained on an openly sourced corpus, for this task. A base representation is initially established over varied dimensions. These representations are evaluated based on validation scores over two models, first an SVM-based classifier and second a one-dimensional CNN model. The scores are used to further develop the representation with vector combinations, by considering various distance measures. These measures correspond to offset vectors which are concatenated as features, mainly to improve upon the F1 score, with the best accuracy. The features are then further tuned on the validation scores to achieve the highest F1 score. Our evaluation narrowed down to two representations, classified with CNN models, having a total dimension length of 1204 and 1203 for the final submissions. Of the two, the latter feature representation delivered our best F1 score of 0.658024 (as per result).

Amrita_student at SemEval-2018 Task 1: Distributed Representation of Social Media Text for Affects in Tweets
Nidhin A Unnithan | Shalini K. | Barathi Ganesh H. B. | Anand Kumar M | Soman K. P.
Proceedings of the 12th International Workshop on Semantic Evaluation

In this paper, we analyse “Affects in Tweets”, one of the tasks conducted by SemEval-2018. The task was to build a model able to perform regression and classification of different emotions from the given tweet dataset. We developed a base model for all the subtasks using a distributed representation (Doc2Vec) and applied machine learning techniques for classification and regression. Distributed representation is an unsupervised algorithm capable of learning fixed-length feature representations from variable-length texts. The machine learning technique used for regression is Linear Regression, while Random Forest is used for classification. Empirical results obtained by our model for all the subtasks are shown in this paper.

CENNLP at SemEval-2018 Task 1: Constrained Vector Space Model in Affects in Tweets
Naveen J R | Barathi Ganesh H. B. | Anand Kumar M | Soman K P
Proceedings of the 12th International Workshop on Semantic Evaluation

This paper discusses Task 1, the “Affect in Tweets” shared task conducted in SemEval-2018. This task comprises various subtasks, which required participants to analyse different emotions and sentiments based on the provided tweet data and also to measure the intensity of these emotions in subsequent subtasks. Our approach in these tasks was to build a model on a count-based representation and use machine learning techniques for the regression and classification tasks. In this work, we use a simple bag-of-words technique for a supervised text classification model, to show that it can still achieve significant accuracy even in comparison with more advanced distributed representation models. Further, by fine-tuning various parameters of the bag-of-words representation model, we acquired better scores than various other baseline models (Vinayan et al.) that participated in the shared task.
