Sudipta Kar


MultiCoNER: A Large-scale Multilingual Dataset for Complex Named Entity Recognition
Shervin Malmasi | Anjie Fang | Besnik Fetahu | Sudipta Kar | Oleg Rokhlenko
Proceedings of the 29th International Conference on Computational Linguistics

We present AnonData, a large multilingual dataset for Named Entity Recognition that covers 3 domains (Wiki sentences, questions, and search queries) across 11 languages, as well as multilingual and code-mixing subsets. This dataset is designed to represent contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities like movie titles, and long-tail entity distributions. The 26M token dataset is compiled from public resources using techniques such as heuristic-based sentence sampling, template extraction and slotting, and machine translation. We tested the performance of two NER models on our dataset: a baseline XLM-RoBERTa model, and a state-of-the-art NER GEMNET model that leverages gazetteers. The baseline achieves moderate performance (macro-F1=54%). GEMNET, which uses gazetteers, improves significantly (average improvement of macro-F1=+30%) and demonstrates the difficulty of our dataset. AnonData poses challenges even for large pre-trained language models, and we believe that it can help further research in building robust NER systems.

SemEval-2022 Task 11: Multilingual Complex Named Entity Recognition (MultiCoNER)
Shervin Malmasi | Anjie Fang | Besnik Fetahu | Sudipta Kar | Oleg Rokhlenko
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

We present the findings of SemEval-2022 Task 11 on Multilingual Complex Named Entity Recognition (MULTICONER). Divided into 13 tracks, the task focused on methods to identify complex named entities (like names of movies, products and groups) in 11 languages in both monolingual and multilingual scenarios. Eleven tracks required building monolingual NER models for individual languages, one track focused on multilingual models able to work on all languages, and the last track featured code-mixed texts within any of these languages. The task is based on the MULTICONER dataset, comprising 2.3 million instances in Bangla, Chinese, Dutch, English, Farsi, German, Hindi, Korean, Russian, Spanish, and Turkish. Results showed that methods fusing external knowledge into transformer models achieved the best results. However, identifying entities like creative works is still challenging even with external knowledge. MULTICONER was one of the most popular tasks in SemEval-2022 and it attracted 377 participants during the practice phase. 236 participants signed up for the final test phase and 55 teams submitted their systems.


SentNoB: A Dataset for Analysing Sentiment on Noisy Bangla Texts
Khondoker Ittehadul Islam | Sudipta Kar | Md Saiful Islam | Mohammad Ruhul Amin
Findings of the Association for Computational Linguistics: EMNLP 2021

In this paper, we propose an annotated sentiment analysis dataset made of informally written Bangla texts. This dataset comprises public comments on news and videos collected from social media covering 13 different domains, including politics, education, and agriculture. Each comment is labeled with one of three polarity labels: positive, negative, or neutral. One significant characteristic of the dataset is that each of the comments is noisy in terms of the mix of dialects and grammatical incorrectness. Our experiments to develop a benchmark classification system show that hand-crafted lexical features perform better than neural network and pretrained language models. We have made the dataset and accompanying models presented in this paper publicly available at

Syntax and Themes: How Context Free Grammar Rules and Semantic Word Association Influence Book Success
Henry Gorelick | Biddut Sarker Bijoy | Syeda Jannatus Saba | Sudipta Kar | Md Saiful Islam | Mohammad Ruhul Amin
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

In this paper, we attempt to improve upon the state-of-the-art in predicting a novel’s success by modeling the lexical semantic relationships of its contents. We created the largest dataset used in such a project, containing lexical data from 17,962 books from Project Gutenberg. We utilized domain-specific feature reduction techniques to implement the most accurate models to date for predicting book success, with our best model achieving an average accuracy of 94.0%. By analyzing the model parameters, we extracted the successful semantic relationships from books of 12 different genres. We finally mapped those semantic relations to a set of themes, as defined in Roget’s Thesaurus, and discovered the themes that successful books of a given genre prioritize. At the end of the paper, we further showed that our model demonstrates similar performance for book success prediction even when Goodreads rating was used instead of download count to measure success.


Age Suitability Rating: Predicting the MPAA Rating Based on Movie Dialogues
Mahsa Shafaei | Niloofar Safi Samghabadi | Sudipta Kar | Thamar Solorio
Proceedings of the Twelfth Language Resources and Evaluation Conference

Movies help us learn and inspire societal change. But they can also contain objectionable content that negatively affects the behaviour of viewers, especially children. In this paper, our goal is to predict the suitability of movie content for children and young adults based on scripts. The criterion that we use to measure suitability is the MPAA rating, which is specifically designed for this purpose. We create a corpus for movie MPAA ratings and propose an RNN-based architecture with attention that jointly models the genre and the emotions in the script to predict the MPAA rating. Our classification model achieves an 81% weighted F1-score, outperforming the traditional machine learning method by 7%.

LinCE: A Centralized Benchmark for Linguistic Code-switching Evaluation
Gustavo Aguilar | Sudipta Kar | Thamar Solorio
Proceedings of the Twelfth Language Resources and Evaluation Conference

Recent trends in NLP research have raised an interest in linguistic code-switching (CS); modern approaches have been proposed to solve a wide range of NLP tasks on multiple language pairs. Unfortunately, these proposed methods are hardly generalizable to different code-switched languages. In addition, it is unclear whether a model architecture is applicable for a different task while still being compatible with the code-switching setting. This is mainly because of the lack of a centralized benchmark and the sparse corpora that researchers employ based on their specific needs and interests. To facilitate research in this direction, we propose a centralized benchmark for Linguistic Code-switching Evaluation (LinCE) that combines eleven corpora covering four different code-switched language pairs (i.e., Spanish-English, Nepali-English, Hindi-English, and Modern Standard Arabic-Egyptian Arabic) and four tasks (i.e., language identification, named entity recognition, part-of-speech tagging, and sentiment analysis). As part of the benchmark centralization effort, we provide an online platform where researchers can submit their results while comparing with others in real-time. In addition, we provide the scores of different popular models, including LSTM, ELMo, and multilingual BERT so that the NLP community can compare against state-of-the-art systems. LinCE is a continuous effort, and we will expand it with more low-resource languages and tasks.

BanFakeNews: A Dataset for Detecting Fake News in Bangla
Md Zobaer Hossain | Md Ashraful Rahman | Md Saiful Islam | Sudipta Kar
Proceedings of the Twelfth Language Resources and Evaluation Conference

Observing the damage that can be done by the rapid propagation of fake news in various sectors like politics and finance, automatic identification of fake news using linguistic analysis has drawn the attention of the research community. However, such methods are largely being developed for English, while low-resource languages remain out of focus. But the risks spawned by fake and manipulative news are not confined by languages. In this work, we propose an annotated dataset of ≈ 50K news articles that can be used for building automated fake news detection systems for a low-resource language like Bangla. Additionally, we provide an analysis of the dataset and develop a benchmark system with state-of-the-art NLP techniques to identify Bangla fake news. To create this system, we explore traditional linguistic features and neural network based methods. We expect this dataset will be a valuable resource for building technologies to prevent the spreading of fake news and contribute to research on low-resource languages. The dataset and source code are publicly available at

SemEval-2020 Task 9: Overview of Sentiment Analysis of Code-Mixed Tweets
Parth Patwa | Gustavo Aguilar | Sudipta Kar | Suraj Pandey | Srinivas PYKL | Björn Gambäck | Tanmoy Chakraborty | Thamar Solorio | Amitava Das
Proceedings of the Fourteenth Workshop on Semantic Evaluation

In this paper, we present the results of the SemEval-2020 Task 9 on Sentiment Analysis of Code-Mixed Tweets (SentiMix 2020). We also release and describe our Hinglish (Hindi-English) and Spanglish (Spanish-English) corpora annotated with word-level language identification and sentence-level sentiment labels. These corpora comprise 20K and 19K examples, respectively. The sentiment labels are Positive, Negative, and Neutral. SentiMix attracted 89 submissions in total, including 61 teams that participated in the Hinglish contest and 28 that submitted systems to the Spanglish competition. The best performance achieved was 75.0% F1 score for Hinglish and 80.6% F1 for Spanglish. We observe that BERT-like models and ensemble methods are the most common and successful approaches among the participants.

Attending the Emotions to Detect Online Abusive Language
Niloofar Safi Samghabadi | Afsheen Hatami | Mahsa Shafaei | Sudipta Kar | Thamar Solorio
Proceedings of the Fourth Workshop on Online Abuse and Harms

In recent years, abusive behavior has become a serious issue in online social networks. In this paper, we present a new corpus for the task of abusive language detection that is collected from a semi-anonymous online platform, and unlike the majority of other available resources, is not created based on a specific list of bad words. We also develop computational models to incorporate emotions into textual cues to improve aggression identification. We evaluate our proposed methods on a set of corpora related to the task and show promising results with respect to abusive language detection.

Multi-view Story Characterization from Movie Plot Synopses and Reviews
Sudipta Kar | Gustavo Aguilar | Mirella Lapata | Thamar Solorio
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

This paper considers the problem of characterizing stories by inferring properties such as theme and style using written synopses and reviews of movies. We experiment with a multi-label dataset of movie synopses and a tagset representing various attributes of stories (e.g., genre, type of events). Our proposed multi-view model encodes the synopses and reviews using hierarchical attention and shows improvement over methods that only use synopses. Finally, we demonstrate how we can take advantage of such a model to extract a complementary set of story-attributes from reviews without direct supervision. We have made our dataset and source code publicly available at


Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop
Sudipta Kar | Farah Nadeem | Laura Burdick | Greg Durrett | Na-Rae Han
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop


RiTUAL-UH at TRAC 2018 Shared Task: Aggression Identification
Niloofar Safi Samghabadi | Deepthi Mave | Sudipta Kar | Thamar Solorio
Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)

This paper presents our system for “TRAC 2018 Shared Task on Aggression Identification”. Our best systems for the English dataset use a combination of lexical and semantic features. However, for Hindi data using only lexical features gave us the best results. We obtained weighted F1-measures of 0.5921 for the English Facebook task (ranked 12th), 0.5663 for the English Social Media task (ranked 6th), 0.6292 for the Hindi Facebook task (ranked 1st), and 0.4853 for the Hindi Social Media task (ranked 2nd).

Letting Emotions Flow: Success Prediction by Modeling the Flow of Emotions in Books
Suraj Maharjan | Sudipta Kar | Manuel Montes | Fabio A. González | Thamar Solorio
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Books have the power to make us feel happiness, sadness, pain, surprise, or sorrow. An author’s dexterity in the use of these emotions captivates readers and makes it difficult for them to put the book down. In this paper, we model the flow of emotions over a book using recurrent neural networks and quantify its usefulness in predicting success in books. We obtained the best weighted F1-score of 69% for predicting books’ success in a multitask setting (simultaneously predicting success and genre of books).

MPST: A Corpus of Movie Plot Synopses with Tags
Sudipta Kar | Suraj Maharjan | A. Pastor López-Monroy | Thamar Solorio
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Folksonomication: Predicting Tags for Movies from Plot Synopses using Emotion Flow Encoded Neural Network
Sudipta Kar | Suraj Maharjan | Thamar Solorio
Proceedings of the 27th International Conference on Computational Linguistics

Folksonomy of movies covers a wide range of heterogeneous information about movies, like the genre, plot structure, visual experiences, soundtracks, metadata, and emotional experiences from watching a movie. Being able to automatically generate or predict tags for movies can help recommendation engines improve retrieval of similar movies, and help viewers know what to expect from a movie in advance. In this work, we explore the problem of creating tags for movies from plot synopses. We propose a novel neural network model that merges information from synopses and emotion flows throughout the plots to predict a set of tags for movies. We compare our system with multiple baselines and find that the addition of emotion flows boosts the performance of the network by learning ≈18% more tags than a traditional machine learning system.


RiTUAL-UH at SemEval-2017 Task 5: Sentiment Analysis on Financial Data Using Neural Networks
Sudipta Kar | Suraj Maharjan | Thamar Solorio
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

In this paper, we present our systems for the “SemEval-2017 Task-5 on Fine-Grained Sentiment Analysis on Financial Microblogs and News”. In our system, we combined hand-engineered lexical, sentiment and metadata features with the representations learned from Convolutional Neural Networks (CNN) and a Bidirectional Gated Recurrent Unit (Bi-GRU) with an Attention model applied on top. With this architecture we obtained weighted cosine similarity scores of 0.72 and 0.74 for subtask-1 and subtask-2, respectively. Using the official scoring system, our system ranked second for subtask-2 and eighth for subtask-1. Under an alternate scoring system, it ranked first for both subtasks.


UH-PRHLT at SemEval-2016 Task 3: Combining Lexical and Semantic-based Features for Community Question Answering
Marc Franco-Salvador | Sudipta Kar | Thamar Solorio | Paolo Rosso
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)