Dipankar Das - ACL Anthology

Dipankar Das

2025

CheckSent-BN: A Bengali Multi-Task Dataset for Claim Checkworthiness and Sentiment Classification from News Headlines
Pritam Pal | Dipankar Das
Proceedings of the Second Workshop on Bangla Language Processing (BLP-2025)

This paper presents **CheckSent-BN** (Claim **Check**worthiness and **Sen**timent Classification in **B**engali **N**ews Headline), a novel multi-task dataset in Bengali comprising approximately 11.5K news headlines annotated for two critical natural language processing (NLP) tasks: claim checkworthiness detection and sentiment classification. To address the lack of high-quality annotated resources in Bengali, we employ a cost-effective annotation strategy that utilizes three large language models (GPT-4o-mini, GPT-4.1-mini, and Llama-4), followed by majority voting and manual verification to ensure label consistency. We provide benchmark results using multilingual and Bengali-focused transformer models under both single-task and multi-task learning (MTL) frameworks. Experimental results show that IndicBERTv2, BanglaBERT, and mDeBERTa model-based frameworks outperform other multilingual models, with IndicBERTv2 achieving the best overall performance in the MTL setting. CheckSent-BN establishes the first comprehensive benchmark for joint claim checkworthiness and sentiment classification in Bengali news headlines, offering a valuable resource for advancing misinformation detection and sentiment-aware analysis in low-resource languages. The CheckSent-BN dataset is available at: https://github.com/pritampal98/check-sent-bn

JU_NLP at BLP-2025 Task 2: Leveraging Zero-Shot Prompting for Bangla Natural Language to Python Code Generation
Pritam Pal | Dipankar Das
Proceedings of the Second Workshop on Bangla Language Processing (BLP-2025)

Code synthesis from natural language problem statements has recently gained popularity with the use of large language models (LLMs). Most of the available systems and benchmarks, however, are developed for English or other high-resource languages, and a gap exists for low-resource languages such as Bangla. Addressing the gap, the Bangla Language Processing (BLP) Workshop at AACL-IJCNLP 2025 featured a shared task on Bangla-to-Python code generation. Participants were asked to design systems that consume Bangla problem statements and generate executable Python programs. A benchmark data set of training, development, and test splits was provided, and evaluation utilized the Pass@1 metric through hidden test cases. We present here a system we developed, using the state-of-the-art LLMs through a zero-shot prompting setup. We report outcomes on several models, including variants of GPT-4 and Llama-4, and specify their relative strengths and weaknesses. Our best-performing system, based on GPT-4.1, achieved a Pass@1 score of 78.6% over the test dataset. We address the challenges of Bangla code generation, morphological richness, cross-lingual understanding, and functional correctness, and outline the potential for future work in multilingual program synthesis.

LLMForum-RAG: A Multilingual, Multi-domain Framework for Factual Reasoning via Weighted Retrieval and LLM Collaboration
Soham Chaudhuri | Dipanjan Saha | Dipankar Das
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

LLMs have emerged as a transformative technology, enabling a wide range of tasks such as text generation, summarization, question answering, and more. The use of RAG with LLM is on the rise to provide deeper knowledge bases of various domains. In the present study, we propose a RAG framework that employs weighted Rocchio mechanism for retrieval and LLM collaborative forum with supervision for generation. Our framework is evaluated in two downstream tasks: a biomedical question answering (BioASQ-QA) and a multilingual claim verification (e.g. in English, Hindi, and Bengali) to showcase its adaptability across various domains and languages. The proposed retriever is capable to achieve substantial improvement over BM25 of +8% (BioASQ-QA), +15% (English), +5% (Hindi), and +20% (Bengali) for Recall@5. In veracity classification, our framework achieves an average answer correctness of 0.78 on BioASQ-QA while achieving F1-score of 0.59, 0.56, and 0.41 for English, Hindi and Bengali languages, respectively. These results demonstrate the effectiveness and robustness of our framework for retrieval and generation in multilingual and multi-domain settings.

IndicClaimBuster: A Multilingual Claim Verification Dataset
Pritam Pal | Shyamal Krishna Jana | Dipankar Das
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

The present article introduces **IndicClaimBuster**, a novel multilingual claim verification dataset comprising ≈ 9K claims and their corresponding evidence in English, Hindi, Bengali, and Hindi-English CodeMixed texts. The data set covers three key domains: politics, law and order, and health, to address the challenges of verifiable facts. Each claim was sourced from reputable Indian news portals and is accompanied by three pieces of evidence, two LLM-generated and one manually curated. Additionally, a separate attempt was conducted to generate refuted claims by employing an LLM. We further develop two frameworks: an unsupervised baseline and a two-stage pipeline that comprises evidence retrieval and veracity prediction modules. For retrieval, we fine-tuned SBERT models, with e5-base demonstrating superior average performance across languages, whereas for veracity prediction, multilingual transformers (mBERT, XLM-R, MuRIL, IndicBERTv2) were fine-tuned. Results indicate MuRIL and IndicBERTv2 excel in Indian languages, while XLM-R performs the best for CodeMix. Our work contributes a high-quality multilingual dataset and strong baseline methodologies, offering valuable resources for advancing automated claim verification in linguistically diverse and low-resource settings for Indian languages. The IndicClaimBuster dataset is available at: https://github.com/pritampal98/indic-claim-buster

IWSLT 2025 Indic Track System Description Paper: Speech-to-Text Translation from Low-Resource Indian Languages (Bengali and Tamil) to English
Sayan Das | Soham Chaudhuri | Dipanjan Saha | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025)

Multi-language Speech-to-Text Translation (ST) plays a crucial role in breaking linguistic barriers, particularly in multilingual regions like India. This paper focuses on building a robust ST system for low resource Indian languages, with a special emphasis on Bengali and Tamil. These languages represent the Indo-Aryan and Dravidian families, respectively. The dataset used in this work comprises spoken content from TED Talks and conferences, paired with transcriptions in English and their translations in Bengali and Tamil. Our work specifically addresses the translation of Bengali and Tamil speech to English text, a critical area given the scarcity of annotated speech data. To enhance translation quality and model robustness, we leverage cross-lingual resources and word level translation strategies. The ultimate goal is to develop an end-to-end ST model capable of real-world deployment for under represented languages.

Identifying Severity of Depression in Forum Posts using Zero-Shot Classifier and DistilBERT Model
Zafar Sarif | Sannidhya Das | Dr. Abhishek Das | Md Fahin Parvej | Dipankar Das
Proceedings of the First on Natural Language Processing and Language Models for Digital Humanities

This paper presents our approach to the RANLP 2025 Shared Task on “Identification of the Severity of Depression in Forum Posts.” The objective of the task is to classify user-generated posts into one of four severity levels of depression: subthreshold, mild, moderate, or severe. A key challenge in the task was the absence of annotated training data. To address this, we employed a two-stage pipeline: first, we used zero-shot classification with facebook/bart-large-mnli to generate pseudo-labels for the unlabeled training set. Next, we fine-tuned a DistilBERT model on the pseudo-labeled data for multi-class classification. Our system achieved an internal accuracy of 0.92 on the pseudo-labeled test set and an accuracy of 0.289 on the official blind evaluation set. These results demonstrate the feasibility of leveraging zero-shot learning and weak supervision for mental health classification tasks, even in the absence of gold-standard annotations.

JUNLP@LT-EDI-2025: Efficient Low-Rank Adaptation of Whisper for Inclusive Tamil Speech Recognition Targeting Vulnerable Populations
Priyobroto Acharya | Soham Chaudhuri | Sayan Das | Dipanjan Saha | Dipankar Das
Proceedings of the 5th Conference on Language, Data and Knowledge: Fifth Workshop on Language Technology for Equality, Diversity, Inclusion

Speech recognition has received extensive research attention in recent years. It becomes much more challenging when the speaker’s age, gender and other factors introduce variations in the speech. In this work, we propose a fine-tuned automatic speech recognition model derived from OpenAI’s whisperlarge-v2. Though we experimented with both Whisper-large and Wav2vec2-XLSR-large, the reduced WER of whisper-large proved to be a superior model. We secured 4th rank in the LT-EDI-2025 shared task. Our implementation details and code are available at our GitHub repository1.

NLP4Health: Multilingual Clinical Dialogue Summarization and QA with mT5 and LoRA
Moutushi Roy | Dipankar Das
NLP-AI4Health

In this work, we present NLP4Health, a unified and reproducible pipeline to accomplish the tasks of multilingual clinical dialogue summarization and question answering (QA). Our system fine-tunes the multilingual sequence-to-sequence model google/mt5-base along with parameter-efficient Low-Rank Adaptation (LoRA) modules to support ten Indian languages. For each clinical dialogue, the model produces (1) a free-text English summary, (2) an English structured key–value (KnV) JSON summary, and (3) QA responses in the dialogue’s original language. We conducted preprocessing, fine-tuning, and inference, and evaluated across QA, textual, and structured metrics, analyzing performance in low-resource settings. The adapter weights, tokenizer, and inference scripts are publicly released to promote transparency and reproducibility.

Hybrid Classical-Quantum Framework for Sentiment Classification and Claim Check-Worthiness Identification in Bengali
Pritam Pal | Dipankar Das | Anup Kumar Kolya | Siddhartha Bhattacharyya
Proceedings of the QuantumNLP{:} Integrating Quantum Computing with Natural Language Processing

Traditional machine learning and deep learning models have demonstrated remarkable performance across various NLP tasks in multiple languages. However, these conventional models often struggle with languages with complex linguistic structures and nuanced contexts, such as Bengali. Recent advancements in quantum computing offer promising solutions for tackling complex, computationally challenging problems, providing faster, more efficient processing than classical systems. This research aims to address the challenges posed by the intricate linguistic structure of the less-resourced Bengali language by developing a quantum-enhanced framework for sentiment classification and claim-checkworthiness identification. We created a classical LSTM framework and proposed novel 2-qubit and 4-qubit classical-quantum frameworks, evaluating their effectiveness for sentiment classification and claim-checkworthiness identification tasks in Bengali. An entirely new dataset comprising ≈3K samples was developed by curating Bengali news headlines from prominent sources. We tagged these headlines with sentiment and claim checkworthy labels using state-of-the-art LLMs. Our findings indicate that the quantum-enhanced frameworks outperform the traditional models in both tasks. Notably, the 4-qubit-based framework achieved the highest F1-score in sentiment classification, while the 2-qubit-based framework demonstrated the best F1-score in claim checkworthiness identification.

Top Ten from Lakhs: A Transformer-based Retrieval System for Identifying Previously Fact-Checked Claims across Multiple Languages
Srijani Debnath | Pritam Pal | Dipankar Das
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

The efficient identification of previously fact-checked claims across multiple languages is a challenging task. It can be time-consuming for professional fact-checkers even within a single language. It becomes much more difficult to perform manually when the claim and the fact-check may be in different languages. This paper presents a systematic approach for the retrieval of top-k relevant fact-checks for a given post in a monolingual and cross-lingual setup using two transformer-based fact-checked claim retrieval frameworks that share a common preprocessing pipeline but differ in their underlying encoder implementations: TIDE, a TensorFlow-based custom dual encoder applied to english-translated data, and PTEX, a PyTorch-based encoder operating on both english-translated and original-language inputs, and introduces a lightweight post-processing technique based on a textual feature: Keyword Overlap Count applied via reranking on top of the transformer-based frameworks. Training and evaluation on a large multilingual corpus show that the fine-tuned E5-Large-v2 model in the PTEX framework yields the best monolingual track performance, achieving an average Success@10 score of 0.8846 and the same framework model with post-processing technique achieves an average Success@10 score of 0.7393 which is the best performance in crosslingual track.

Toward Quantum-Enhanced Natural Language Understanding: Sarcasm and Claim Detection with QLSTM
Pritam Pal | Dipankar Das
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

Traditional machine learning (ML) and deep learning (DL) models have shown effectiveness in natural language processing (NLP) tasks, such as sentiment analysis. However, they often struggle with complex linguistic structures, such as sarcasm and implicit claims. This paper introduces a Quantum Long Short-Term Memory (QLSTM) framework for detecting sarcasm and identifying claims in text, aiming to enhance the analysis of complex sentences. We evaluate four approaches: (1) classical LSTM, (2) quantum framework using QLSTM, (3) voting ensemble combining classical and quantum LSTMs, and (4) hybrid framework integrating both types. The experimental results indicate that the QLSTM approach excels in sarcasm detection, while the voting framework performs best in claim identification.

Generating and Analyzing Disfluency in a Code-Mixed Setting
Aryan Paul | Tapabrata Mondal | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

This work explores the intersection of code-mixing and disfluency in bilingual speech and text, with a focus on understanding how large language models (LLMs) handle code-mixed disfluent utterances. One of the primary objectives is to explore LLMs’ ability to generate code-mixed disfluent sentences and to address the lack of high-quality code-mixed disfluent corpora, particularly for Indic languages. We aim to compare the performance of LLM-based approaches with traditional disfluency detection methods and to develop novel metrics for quantitatively assessing disfluency phenomena. Additionally, we investigate the relationship between code-mixing and disfluency, exploring how factors such as switching frequency and direction influence the occurrence of disfluencies. By analyzing these intriguing dynamics, we seek to gain a deeper understanding of the mutual influence between code-mixing and disfluency in multilingual speech.

Enhancing Textual Understanding: Automated Claim Span Identification in English, Hindi, Bengali, and CodeMix
Rudra Roy | Pritam Pal | Dipankar Das | Saptarshi Ghosh | Biswajit Paul
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

Claim span identification, a crucial task in Natural Language Processing (NLP), aims to extract specific claims from texts. Such claim spans can be further utilized in various critical NLP applications, such as claim verification, fact-checking, and opinion mining, among others. The present work proposes a multilingual claim span identification framework for handling social media data in English, Hindi, Bengali, and CodeMixed texts, leveraging the strengths and knowledge of transformer-based pre-trained models. Our proposed framework efficiently identifies the contextual relationships between words and precisely detects claim spans across all languages, achieving a high F1 score and Jaccard score. The source code and datasets are available at: https://github.com/pritampal98/claim-span-multilingual

Trans-Sent at SemEval-2025 Task 11: Text-based Multi-label Emotion Detection using Pre-Trained BERT Transformer Models
Zafar Sarif | Md Sharib Akhtar | Abhishek Das | Dipankar Das
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

We have introduced Trans-Sent, a Transformer-based model designed for multi-label emotion classification in SemEval-2025 Task 11. The model predicts perceived emotions such as joy, sadness, anger, fear, surprise, and disgust from text across seven languages, including Amharic, German, English, Hindi, Marathi, Russian, and Romanian. To handle data imbalance, the system incorporates preprocessing techniques, SMOTE oversampling, and feature engineering to enhance classification accuracy. The model was trained using the BRIGHTER and EthioEmo datasets, which contain diverse textual sources, such as social media, news, literature, and personal narratives. Traditional machine learning models, including Logistic Regression and Decision Trees, were tested but proved inadequate for multi-label classification due to their limited ability to capture contextual and semantic meaning. Fine-tuned BERT models demonstrated superior performance, with Russian achieving the highest ranking (9th overall), while languages with complex grammar, such as German and Amharic, performed lower. Future enhancements may include advanced data augmentation, cross-lingual learning, and multimodal emotion analysis to improve classification across different languages. Trans-Sent contributes to NLP by advancing multi-label emotion detection, particularly in underrepresented languages.

JU-CSE-NLP’25 at SemEval-2025 Task 4: Learning to Unlearn LLMs
Arkajyoti Naskar | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Large Language Models (LLMs) have achieved enormous success recently due to their ability to understand and solve various non-trivial tasks in natural language. However, they have been shown to memorize their training data which, among other concerns, increases the risk of the model regurgitating creative or private content, potentially leading to legal issues for the model developer and/or vendors. Such issues are often discovered post-model training during testing or red teaming. While unlearning has been studied for some time in classification problems, it is still a relatively underdeveloped area of study in LLM research since the latter operates in a potentially unbounded output label space. Specifically, robust evaluation frameworks are lacking to assess the accuracy of these unlearning strategies. In this challenge, we aim to bridge this gap by developing a comprehensive evaluation challenge for unlearning sensitive datasets in LLMs.

JU_NLP at SemEval-2025 Task 7: Leveraging Transformer-Based Models for Multilingual & Crosslingual Fact-Checked Claim Retrieval
Atanu Nayak | Srijani Debnath | Arpan Majumdar | Pritam Pal | Dipankar Das
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Fact-checkers are often hampered by the sheer amount of online content that needs to be fact-checked. NLP can help them by retrieving already existing fact-checks relevant to the content being investigated. This paper presents a systematic approach for the retrieval of top-k relevant fact-checks for a given post in a monolingual and cross-lingual setup using transformer-based pre-trained models fine-tuned with a dual encoder architecture. By training and evaluating the shared task test dataset, our proposed best-performing framework achieved an average success@10 score of 0.79 and 0.62 for the retrieval of 10 fact-checks from the fact-check corpus against a post in monolingual and crosslingual track respectively.

JUNLP_Sarika at SemEval-2025 Task 11: Bridging Contextual Gaps in Text-Based Emotion Detection using Transformer Models
Sarika Khatun | Dipanjan Saha | Dipankar Das
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Because language is subjective, it can be difficult to infer human emotions from textual data. This work investigates the categorization of emotions using BERT, classifying five emotions—angry, fearful, joyful, sad, and surprised—by utilizing its contextual embeddings. Preprocessing techniques like tokenization and stop-word removal are used on the dataset, which comes from social media and personal tales. With a weighted F1-score of 0.75, our model was trained using a multi-label classification strategy. BERT has the lowest F1-score when it comes to rage, but it does well when it comes to identifying fear and surprise. The findings demonstrate the difficulties presented by unbalanced datasets while also highlighting the promise of transformer-based models for text-based emotion identification. Future research will use data augmentation methods, domain-adapted BERT models, and other methods to improve classification performance.

JU-NLP: Improving Low-Resource Indic Translation System with Efficient LoRA-Based Adaptation
Priyobroto Acharya | Haranath Mondal | Dipanjan Saha | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the Tenth Conference on Machine Translation

Low-resource Indic languages such as Assamese, Manipuri, Mizo, and Bodo face persistent challenges in NMT due to limited parallel data, diverse scripts, and complex morphology. We address these issues in the WMT $2025$ shared task by introducing a unified multilingual NMT framework that combines rigorous language-specific preprocessing with parameter-efficient adaptation of large-scale models. Our pipeline integrates the NLLB-$200$ and IndicTrans$2$ architectures, fine-tuned using LoRA and DoRA, reducing trainable parameters by over 90% without degrading translation quality. A comprehensive preprocessing suite, including Unicode normalization, semantic filtering, transliteration, and noise reduction, ensures high-quality inputs, while script-aware post-processing mitigates evaluation bias from orthographic mismatches. Experiments across English-Indic directions demonstrate that NLLB-$200$ achieves superior results for Assamese, Manipuri, and Mizo, whereas IndicTrans$2$ excels in English-Bodo. Evaluated using BLEU, chrF, METEOR, ROUGE-L, and TER, our approach yields consistent improvements over baselines, underscoring the effectiveness of combining efficient fine-tuning with linguistically informed preprocessing for low-resource Indic MT.

SpeechEE@XLLM25: End-to-End Structured Event Extraction from Speech
Soham Chaudhuri | Diganta Biswas | Dipanjan Saha | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 1st Joint Workshop on Large Language Models and Structure Modeling (XLLM 2025)

Event extraction from text is a complex taskthat involves the identification of event triggersand their supporting arguments. Whenapplied to speech, this task becomes evenmore challenging due to the continuous natureof audio signals and the need for robustAutomatic Speech Recognition (ASR). Thispaper proposes an approach that integratesASR with event extraction by utilizing theWhisper model for speech recognition and aText2Event2 Transformer for extracting eventsfrom English audio samples. The Whispermodel is used to generate transcripts from audio,which are then fed into the Text2Event2Transformer to identify event triggers and theirarguments. This approach combines two difficulttasks into one, streamlining the processof extracting structured event information directlyfrom audio. Our approach leverages arobust ASR system (Whisper) followed by aparameter-efficient transformer (Text2Event2fine-tuned via LoRA) to extract structuredevents from raw speech. Unlike prior worktrained on gold textual input, our pipeline istrained end-to-end on noisy ASR outputs. Despitesignificant resource constraints and datanoise, our system ranked first in the ACL 2025XLLM Shared Task II.

2024

Human vs Machine: An Automated Machine-Generated Text Detection Approach
Urwah Jawaid | Rudra Roy | Pritam Pal | Srijani Debnath | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 21st International Conference on Natural Language Processing (ICON)

With the advancement of natural language processing (NLP) and sophisticated Large Language Models (LLMs), distinguishing between human-written texts and machine-generated texts is quite difficult nowadays. This paper presents a systematic approach to classifying machine-generated text from human-written text with a combination of the transformer-based model and textual feature-based post-processing technique. We extracted five textual features: readability score, stop word score, spelling and grammatical error count, unique word score and human phrase count from both human-written and machine-generated texts separately and trained three machine learning models (SVM, Random Forest and XGBoost) with these scores. Along with exploring traditional machine-learning models, we explored the BiLSTM and transformer-based distilBERT models to enhance the classification performance. By training and evaluating with a large dataset containing both human-written and machine-generated text, our best-performing framework achieves an accuracy of 87.5%.

Unveiling the Truth: A Deep Dive into Claim Identification Methods
Shankha Shubhra Das | Pritam Pal | Dipankar Das
Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation

HIJLI_JU at SemEval-2024 Task 7: Enhancing Quantitative Question Answering Using Fine-tuned BERT Models
Partha Sengupta | Sandip Sarkar | Dipankar Das
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

In data and numerical analysis, Quantitative Question Answering (QQA) becomes a crucial instrument that provides deep insights for analyzing large datasets and helps make well-informed decisions in industries such as finance, healthcare, and business. This paper explores the “HIJLI_JU” team’s involvement in NumEval Task 1 within SemEval 2024, with a particular emphasis on quantitative comprehension. Specifically, our method addresses numerical complexities by fine-tuning a BERT model for sophisticated multiple-choice question answering, leveraging the Hugging Face ecosystem. The effectiveness of our QQA model is assessed using a variety of metrics, with an emphasis on the f1_score() of the scikit-learn library. Thorough analysis of the macro-F1, micro-F1, weighted-F1, average, and binary-F1 scores yields detailed insights into the model’s performance in a range of question formats.

AlphaIntellect at SemEval-2024 Task 6: Detection of Hallucinations in Generated Text
Sohan Choudhury | Priyam Saha | Subharthi Ray | Shankha Das | Dipankar Das
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

One major issue in natural language generation (NLG) models is detecting hallucinations (semantically inaccurate outputs). This study investigates a hallucination detection system designed for three distinct NLG tasks: definition modeling, paraphrase generation, and machine translation. The system uses feedforward neural networks for classification and SentenceTransformer models for similarity scores and sentence embeddings. Even though the SemEval-2024 benchmark shows good results, there is still room for improvement. Promising paths toward improving performance include considering multi-task learning methods, including strategies for handling out-of-domain data minimizing bias, and investigating sophisticated architectures.

Magnum JUCSE at SemEval-2024 Task 4: Multilingual Detection of Persuasion Techniques in Memes
Adnan Khurshid | Dipankar Das
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

This paper focuses on the task of detecting persuasion techniques organised in a hierarchy within meme text in multiple languages like English, North Macedonian, Arabic and Bulgarian, exploring the ways in which textual elements contribute to the dissemination of persuasive messages.The main strategy of the system is to train a binary classifier for each node in the hierarchy and predict labels in a top down fashion by seeing the confidence value of the prediction at any node. For each unique label in the hierarchy, a dataset is created from the original dataset which is then used to train the binary classifier for that label

2023

Identifying Intent-Sentiment Co-reference from Legal Utterances
Pinaki Karkun | Dipankar Das
Proceedings of the 20th International Conference on Natural Language Processing (ICON)

Co-reference is always treated as one of challenging tasks under natural language processing and has been explored only in the domain of anaphora resolution to an extent. However, the benefit of it to identify the relations between multiple entities in a single context can be explored better while we aim to identify intent and sentiment from the utterances of a dialogue or conversation. The utilization of co-reference becomes more elegant while tracking users’ intents with respect to their corresponding sentiments explored in a specialized domain like judiciary. Thus, in the present attempt, we have identified not only intent and sentiment expressions at token level in an individual manner, we also classified the utterances and identified the co-reference between intent and sentiment entities in utterance level context. Last but not the least, the deep learning algorithms have shown improvements over traditional machine learning in all cases.

Mytho-Annotator: An Annotation tool for Indian Hindu Mythology
Apurba Paul | Anupam Mondal | Sainik Kumar Mahata | Srijan Seal | Prasun Sarkar | Dipankar Das
Proceedings of the 20th International Conference on Natural Language Processing (ICON)

Mythology is a collection of myths, especially one belonging to a particular religious or cultural tradition. We observed that an annotation tool is essential to identify important and complex information from any mythological texts or corpora. Additionally, obtaining highquality annotated corpora for complex information extraction including labeled text segments is an expensive and timeconsuming process. Hence, in this paper, we have designed and deployed an annotation tool for Hindu mythology which is presented as Mytho-Annotator. Its easy-to-use web-based text annotation tool is powered by Natural Language Processing (NLP). This tool primarily labels three different categories such as named entities, relationships, and event entities. This annotation tool offers a comprehensive and adaptable annotation paradigm.

Transfer learning in low-resourced MT: An empirical study
Sainik Kumar Mahata | Dipanjan Saha | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 20th International Conference on Natural Language Processing (ICON)

Translation systems rely on a large and goodquality parallel corpus for producing reliable translations. However, obtaining such a corpus for low-resourced languages is a challenge. New research has shown that transfer learning can mitigate this issue by augmenting lowresourced MT systems with high-resourced ones. In this work, we explore two types of transfer learning techniques, namely, crosslingual transfer learning and multilingual training, both with information augmentation, to examine the degree of performance improvement following the augmentation. Furthermore, we use languages of the same family (Romanic, in our case), to investigate the role of the shared linguistic property, in producing dependable translations.

2022

Can Unsupervised Knowledge Transfer from Social Discussions Help Argument Mining?
Subhabrata Dutta | Jeevesh Juneja | Dipankar Das | Tanmoy Chakraborty
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Identifying argument components from unstructured texts and predicting the relationships expressed among them are two primary steps of argument mining. The intrinsic complexity of these tasks demands powerful learning models. While pretrained Transformer-based Language Models (LM) have been shown to provide state-of-the-art results over different NLP tasks, the scarcity of manually annotated data and the highly domain-dependent nature of argumentation restrict the capabilities of such models. In this work, we propose a novel transfer learning strategy to overcome these challenges. We utilize argumentation-rich social discussions from the ChangeMyView subreddit as a source of unsupervised, argumentative discourse-aware knowledge by finetuning pretrained LMs on a selectively masked language modeling task. Furthermore, we introduce a novel prompt-based strategy for inter-component relation prediction that compliments our proposed finetuning method while leveraging on the discourse context. Exhaustive experiments show the generalization capability of our method on these two tasks over within-domain as well as out-of-domain datasets, outperforming several existing and employed strong baselines.

JU_NLP at HinglishEval: Quality Evaluation of the Low-Resource Code-Mixed Hinglish Text
Prantik Guha | Rudra Dhar | Dipankar Das
Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges

In this paper we describe a system submitted to the INLG 2022 Generation Challenge (GenChal) on Quality Evaluation of the Low-Resource Synthetically Generated Code-Mixed Hinglish Text. We implement a Bi-LSTM-based neural network model to predict the Average rating score and Disagreement score of the synthetic Hinglish dataset. In our models, we used word embeddings for English and Hindi data, and one hot encodings for Hinglish data. We achieved a F1 score of 0.11, and mean squared error of 6.0 in the average rating score prediction task. In the task of Disagreement score prediction, we achieve a F1 score of 0.18, and mean squared error of 5.0.

2021

Sentiment Classification of Code-Mixed Tweets using Bi-Directional RNN and Language Tags
Sainik Mahata | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages

Sentiment analysis tools and models have been developed extensively throughout the years, for European languages. In contrast, similar tools for Indian Languages are scarce. This is because, state-of-the-art pre-processing tools like POS tagger, shallow parsers, etc., are not readily available for Indian languages. Although, such working tools for Indian languages, like Hindi and Bengali, that are spoken by the majority of the population, are available, finding the same for less spoken languages like, Tamil, Telugu, and Malayalam, is difficult. Moreover, due to the advent of social media, the multi-lingual population of India, who are comfortable with both English ad their regional language, prefer to communicate by mixing both languages. This gives rise to massive code-mixed content and automatically annotating them with their respective sentiment labels becomes a challenging task. In this work, we take up a similar challenge of developing a sentiment analysis model that can work with English-Tamil code-mixed data. The proposed work tries to solve this by using bi-directional LSTMs along with language tagging. Other traditional methods, based on classical machine learning algorithms have also been discussed in the literature, and they also act as the baseline systems to which we will compare our Neural Network based model. The performance of the developed algorithm, based on Neural Network architecture, garnered precision, recall, and F1 scores of 0.59, 0.66, and 0.58 respectively.

Leveraging Expectation Maximization for Identifying Claims in Low Resource Indian Languages
Rudra Dhar | Dipankar Das
Proceedings of the 18th International Conference on Natural Language Processing (ICON)

Identification of the checkable claims is one of the important prior tasks while dealing with infinite amount of data streaming from social web and the task becomes a compulsory one when we analyze them on behalf of a multilingual country like India that contains more than 1 billion people. In the present work, we describe our system which is made for detecting check-worthy claim sentences in resource scarce Indian languages (e.g., Bengali and Hindi). Firstly, we collected sentences from various sources in Bengali and Hindi and vectorized them with several NLP features. We labeled a small portion of them for check-worthy claims manually. However, in order to label rest amount of data in a semi-supervised fashion, we employed the Expectation Maximization (EM) algorithm tuned with the Multivariate Gaussian Mixture Model (GMM) to assign weakly labels. The optimal number of Gaussians in this algorithm is traced by using Logistic Regression. Furthermore, we used different ratios of manually labeled data and weakly labeled data to train our various machine learning models. We tabulated and plotted the performances of the models along with the stepwise decrement in proportion of manually labeled data. The experimental results were at par with our theoretical understanding, and we conclude that the weakly labeling of check-worthy claim sentences in low resource languages with EM algorithm has true potential.

Studies Towards Language Independent Fake News Detection
Soumayan Majumder | Dipankar Das
Proceedings of the 18th International Conference on Natural Language Processing (ICON)

We have studied that fake news is currently one of the trending topic and it causes problem to many people and organization. We use COVID19 domain and 7 languages to work on. We collect our data from twitter. We build two types of model one is language dependent and other one is language independent. We get better result in language independent model for English, Hindi and Bengali language. Results of European languages like German, Italian, French and Spanish are comparable in both language dependent and independent model.

Classification of COVID19 tweets using Machine Learning Approaches
Anupam Mondal | Sainik Mahata | Monalisa Dey | Dipankar Das
Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task

The reported work is a description of our participation in the “Classification of COVID19 tweets containing symptoms” shared task, organized by the “Social Media Mining for Health Applications (SMM4H)” workshop. The literature describes two machine learning approaches that were used to build a three class classification system, that categorizes tweets related to COVID19, into three classes, viz., self-reports, non-personal reports, and literature/news mentions. The steps for pre-processing tweets, feature extraction, and the development of the machine learning models, are described extensively in the documentation. Both the developed learning models, when evaluated by the organizers, garnered F1 scores of 0.93 and 0.92 respectively.

Classifying Emotional Utterances by Employing Multi-modal Speech Emotion Recognition
Dipankar Das
Proceedings of the Workshop on Speech and Music Processing 2021

Deep learning methods are being applied to several speech processing problems in recent years. In the present work, we have explored different deep learning models for speech emotion recognition. We have employed normal deep feedforward neural network (FFNN) and convolutional neural network (CNN) to classify audio files according to their emotional content. Comparative study indicates that CNN model outperforms FFNN in case of emotions as well as gender classification. It was observed that the sole audio based models can capture the emotions up to a certain limit. Thus, we attempted a multi-modal framework by combining the benefits of the audio and text features and employed them into a recurrent encoder. Finally, the audio and text encoders are merged to provide the desired impact on various datasets. In addition, a database consists of emotional utterances of several words has also been developed as a part of this work. It contains same word in different emotional utterances. Though the size of the database is not that large but this database is ideally supposed to contain all the English words that exist in an English dictionary.

2020

JUNLP@ICON2020: Low Resourced Machine Translation for Indic Languages
Sainik Mahata | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 17th International Conference on Natural Language Processing (ICON): Adap-MT 2020 Shared Task

In the current work, we present the description of the systems submitted to a machine translation shared task organized by ICON 2020: 17th International Conference on Natural Language Processing. The systems were developed to show the capability of general domain machine translation when translating into Indic languages, English-Hindi, in our case. The paper shows the training process and quantifies the performance of two state-of-the-art translation systems, viz., Statistical Machine Translation and Neural Machine Translation. While Statistical Machine Translation systems work better in a low-resource setting, Neural Machine Translation systems are able to generate sentences that are fluent in nature. Since both these systems have contrasting advantages, a hybrid system, incorporating both, was also developed to leverage all the strong points. The submitted systems garnered BLEU scores of 8.701943312, 0.6361336198, and 11.78873307 respectively and the scores of the hybrid system helped us to the fourth spot in the competition leaderboard.

JUNLP at SemEval-2020 Task 9: Sentiment Analysis of Hindi-English Code Mixed Data Using Grid Search Cross Validation
Avishek Garain | Sainik Mahata | Dipankar Das
Proceedings of the Fourteenth Workshop on Semantic Evaluation

Code-mixing is a phenomenon which arises mainly in multilingual societies. Multilingual people, who are well versed in their native languages and also English speakers, tend to code-mix using English-based phonetic typing and the insertion of anglicisms in their main language. This linguistic phenomenon poses a great challenge to conventional NLP domains such as Sentiment Analysis, Machine Translation, and Text Summarization, to name a few. In this work, we focus on working out a plausible solution to the domain of Code-Mixed Sentiment Analysis. This work was done as participation in the SemEval-2020 Sentimix Task, where we focused on the sentiment analysis of English-Hindi code-mixed sentences. our username for the submission was “sainik.mahata” and team name was “JUNLP”. We used feature extraction algorithms in conjunction with traditional machine learning algorithms such as SVR and Grid Search in an attempt to solve the task. Our approach garnered an f1-score of 66.2% when tested using metrics prepared by the organizers of the task.

2019

Development of POS tagger for English-Bengali Code-Mixed data
Tathagata Raha | Sainik Mahata | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 16th International Conference on Natural Language Processing

Code-mixed texts are widespread nowadays due to the advent of social media. Since these texts combine two languages to formulate a sentence, it gives rise to various research problems related to Natural Language Processing. In this paper, we try to excavate one such problem, namely, Parts of Speech tagging of code-mixed texts. We have built a system that can POS tag English-Bengali code-mixed data where the Bengali words were written in Roman script. Our approach initially involves the collection and cleaning of English-Bengali code-mixed tweets. These tweets were used as a development dataset for building our system. The proposed system is a modular approach that starts by tagging individual tokens with their respective languages and then passes them to different POS taggers, designed for different languages (English and Bengali, in our case). Tags given by the two systems are later joined together and the final result is then mapped to a universal POS tag set. Our system was checked using 100 manually POS tagged code-mixed sentences and it returned an accuracy of 75.29%.

NLP at SemEval-2019 Task 6: Detecting Offensive language using Neural Networks
Prashant Kapil | Asif Ekbal | Dipankar Das
Proceedings of the 13th International Workshop on Semantic Evaluation

In this paper we built several deep learning architectures to participate in shared task OffensEval: Identifying and categorizing Offensive language in Social media by semEval-2019. The dataset was annotated with three level annotation schemes and task was to detect between offensive and not offensive, categorization and target identification in offensive contents. Deep learning models with POS information as feature were also leveraged for classification. The three best models that performed best on individual sub tasks are stacking of CNN-Bi-LSTM with Attention, BiLSTM with POS information added with word features and Bi-LSTM for third task. Our models achieved a Macro F1 score of 0.7594, 0.5378 and 0.4588 in Task(A,B,C) respectively with rank of 33rd, 54th and 52nd out of 103, 75 and 65 submissions. The three best models that performed best on individual sub task are using Neural Networks.

JUMT at WMT2019 News Translation Task: A Hybrid Approach to Machine Translation for Lithuanian to English
Sainik Kumar Mahata | Avishek Garain | Adityar Rayala | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

In the current work, we present a description of the system submitted to WMT 2019 News Translation Shared task. The system was created to translate news text from Lithuanian to English. To accomplish the given task, our system used a Word Embedding based Neural Machine Translation model to post edit the outputs generated by a Statistical Machine Translation model. The current paper documents the architecture of our model, descriptions of the various modules and the results produced using the same. Our system garnered a BLEU score of 17.6.

2018

WME 3.0: An Enhanced and Validated Lexicon of Medical Concepts
Anupam Mondal | Dipankar Das | Erik Cambria | Sivaji Bandyopadhyay
Proceedings of the 9th Global Wordnet Conference

Information extraction in the medical domain is laborious and time-consuming due to the insufficient number of domain-specific lexicons and lack of involvement of domain experts such as doctors and medical practitioners. Thus, in the present work, we are motivated to design a new lexicon, WME 3.0 (WordNet of Medical Events), which contains over 10,000 medical concepts along with their part of speech, gloss (descriptive explanations), polarity score, sentiment, similar sentiment words, category, affinity score and gravity score features. In addition, the manual annotators help to validate the overall as well as individual category level of medical concepts of WME 3.0 using Cohen’s Kappa agreement metric. The agreement score indicates almost correct identification of medical concepts and their assigned features in WME 3.0.

Summarization of Table Citations from Text
Monalisa Dey | Salma Mandi | Dipankar Das
Proceedings of the 15th International Conference on Natural Language Processing

A Content-based Recommendation System for Medical Concepts: Disease and Symptom
Anupam Mondal | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 15th International Conference on Natural Language Processing

SMT vs NMT: A Comparison over Hindi and Bengali Simple Sentences
Sainik Kumar Mahata | Soumil Mandal | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 15th International Conference on Natural Language Processing

JUCBNMT at WMT2018 News Translation Task: Character Based Neural Machine Translation of Finnish to English
Sainik Kumar Mahata | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

In the current work, we present a description of the system submitted to WMT 2018 News Translation Shared task. The system was created to translate news text from Finnish to English. The system used a Character Based Neural Machine Translation model to accomplish the given task. The current paper documents the preprocessing steps, the description of the submitted system and the results produced using the same. Our system garnered a BLEU score of 12.9.

2017

JUNLP at IJCNLP-2017 Task 3: A Rank Prediction Model for Review Opinion Diversification
Monalisa Dey | Anupam Mondal | Dipankar Das
Proceedings of the IJCNLP 2017, Shared Tasks

IJCNLP-17 Review Opinion Diversification (RevOpiD-2017) task has been designed for ranking the top-k reviews of a product from a set of reviews, which assists in identifying a summarized output to express the opinion of the entire review set. The task is divided into three independent subtasks as subtask-A,subtask-B, and subtask-C. Each of these three subtasks selects the top-k reviews based on helpfulness, representativeness, and exhaustiveness of the opinions expressed in the review set individually. In order to develop the modules and predict the rank of reviews for all three subtasks, we have employed two well-known supervised classifiers namely, Naïve Bayes and Logistic Regression on the top of several extracted features such as the number of nouns, number of verbs, and number of sentiment words etc from the provided datasets. Finally, the organizers have helped to validate the predicted outputs for all three subtasks by using their evaluation metrics. The metrics provide the scores of list size 5 as (0.80 (mth)) for subtask-A, (0.86 (cos), 0.87 (cos d), 0.71 (cpr), 4.98 (a-dcg), and 556.94 (wt)) for subtask B, and (10.94 (unwt) and 0.67 (recall)) for subtask C individually.

NITMZ-JU at IJCNLP-2017 Task 4: Customer Feedback Analysis
Somnath Banerjee | Partha Pakray | Riyanka Manna | Dipankar Das | Alexander Gelbukh
Proceedings of the IJCNLP 2017, Shared Tasks

In this paper, we describe a deep learning framework for analyzing the customer feedback as part of our participation in the shared task on Customer Feedback Analysis at the 8th International Joint Conference on Natural Language Processing (IJCNLP 2017). A Convolutional Neural Network (CNN) based deep neural network model was employed for the customer feedback task. The proposed system was evaluated on two languages, namely, English and French.

JU NITM at IJCNLP-2017 Task 5: A Classification Approach for Answer Selection in Multi-choice Question Answering System
Sandip Sarkar | Dipankar Das | Partha Pakray
Proceedings of the IJCNLP 2017, Shared Tasks

This paper describes the participation of the JU NITM team in IJCNLP-2017 Task 5: “Multi-choice Question Answering in Examinations”. The main aim of this shared task is to choose the correct option for each multi-choice question. Our proposed model includes vector representations as feature and machine learning for classification. At first we represent question and answer in vector space and after that find the cosine similarity between those two vectors. Finally we apply classification approach to find the correct answer. Our system was only developed for the English language, and it obtained an accuracy of 40.07% for test dataset and 40.06% for valid dataset.

Identification of Character Adjectives from Mahabharata
Apurba Paul | Dipankar Das
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

The present paper describes the identification of prominent characters and their adjectives from Indian mythological epic, Mahabharata, written in English texts. However, in contrast to the tra-ditional approaches of named entity identifica-tion, the present system extracts hidden attributes associated with each of the characters (e.g., character adjectives). We observed distinct phrase level linguistic patterns that hint the pres-ence of characters in different text spans. Such six patterns were used in order to extract the cha-racters. On the other hand, a distinguishing set of novel features (e.g., multi-word expression, nodes and paths of parse tree, immediate ancestors etc.) was employed. Further, the correlation of the features is also measured in order to identify the important features. Finally, we applied various machine learning algorithms (e.g., Naive Bayes, KNN, Logistic Regression, Decision Tree, Random Forest etc.) along with deep learning to classify the patterns as characters or non-characters in order to achieve decent accuracy. Evaluation shows that phrase level linguistic patterns as well as the adopted features are highly active in capturing characters and their adjectives.

JU CSE NLP @ SemEval 2017 Task 7: Employing Rules to Detect and Interpret English Puns
Aniket Pramanick | Dipankar Das
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

System description. Implementation of HMM and Cyclic Dependency Network.

BUCC2017: A Hybrid Approach for Identifying Parallel Sentences in Comparable Corpora
Sainik Mahata | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 10th Workshop on Building and Using Comparable Corpora

A Statistical Machine Translation (SMT) system is always trained using large parallel corpus to produce effective translation. Not only is the corpus scarce, it also involves a lot of manual labor and cost. Parallel corpus can be prepared by employing comparable corpora where a pair of corpora is in two different languages pointing to the same domain. In the present work, we try to build a parallel corpus for French-English language pair from a given comparable corpus. The data and the problem set are provided as part of the shared task organized by BUCC 2017. We have proposed a system that first translates the sentences by heavily relying on Moses and then group the sentences based on sentence length similarity. Finally, the one to one sentence selection was done based on Cosine Similarity algorithm.

Relationship Extraction based on Category of Medical Concepts from Lexical Contexts
Anupam Mondal | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

Retrieving Similar Lyrics for Music Recommendation System
Braja Gopal Patra | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

Developing Lexicon and Classifier for Personality Identification in Texts
Kumar Gourav Das | Dipankar Das
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

A Deep Dive into Identification of Characters from Mahabharata
Apurba Paul | Dipankar Das
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

2016

WME: Sense, Polarity and Affinity based Concept Resource for Medical Events
Anupam Mondal | Dipankar Das | Erik Cambria | Sivaji Bandyopadhyay
Proceedings of the 8th Global WordNet Conference (GWC)

In order to overcome the lack of medical corpora, we have developed a WordNet for Medical Events (WME) for identifying medical terms and their sense related information using a seed list. The initial WME resource contains 1654 medical terms or concepts. In the present research, we have reported the enhancement of WME with 6415 number of medical concepts along with their conceptual features viz. Parts-of-Speech (POS), gloss, semantics, polarity, sense and affinity. Several polarity lexicons viz. SentiWordNet, SenticNet, Bing Liu’s subjectivity list and Taboda’s adjective list were introduced with WordNet synonyms and hyponyms for expansion. The semantics feature guided us to build a semantic co-reference relation based network between the related medical concepts. These features help to prepare a medical concept network for better sense relation based visualization. Finally, we evaluated with respect to Adaptive Lesk Algorithm and conducted an agreement analysis for validating the expanded WME resource.

Multimodal Mood Classification - A Case Study of Differences in Hindi and Western Songs
Braja Gopal Patra | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Music information retrieval has emerged as a mainstream research area in the past two decades. Experiments on music mood classification have been performed mainly on Western music based on audio, lyrics and a combination of both. Unfortunately, due to the scarcity of digitalized resources, Indian music fares poorly in music mood retrieval research. In this paper, we identified the mood taxonomy and prepared multimodal mood annotated datasets for Hindi and Western songs. We identified important audio and lyric features using correlation based feature selection technique. Finally, we developed mood classification systems using Support Vector Machines and Feed Forward Neural Networks based on the features collected from audio, lyrics, and a combination of both. The best performing multimodal systems achieved F-measures of 75.1 and 83.5 for classifying the moods of the Hindi and Western songs respectively using Feed Forward Neural Networks. A comparative analysis indicates that the selected features work well for mood classification of the Western songs and produces better results as compared to the mood classification systems for Hindi songs.

JU_NLP at SemEval-2016 Task 6: Detecting Stance in Tweets using Support Vector Machines
Braja Gopal Patra | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

JUNITMZ at SemEval-2016 Task 1: Identifying Semantic Similarity Using Levenshtein Ratio
Sandip Sarkar | Dipankar Das | Partha Pakray | Alexander Gelbukh
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

JU_NLP at SemEval-2016 Task 11: Identifying Complex Words in a Sentence
Niloy Mukherjee | Braja Gopal Patra | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

JUNLP at SemEval-2016 Task 13: A Language Independent Approach for Hypernym Identification
Promita Maitra | Dipankar Das
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

WMT2016: A Hybrid Approach to Bilingual Document Alignment
Sainik Mahata | Dipankar Das | Santanu Pal
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

Unraveling the English-Bengali Code-Mixing Phenomenon
Arunavha Chanda | Dipankar Das | Chandan Mazumdar
Proceedings of the Second Workshop on Computational Approaches to Code Switching

Part-of-speech Tagging of Code-Mixed Social Media Text
Souvick Ghosh | Satanu Ghosh | Dipankar Das
Proceedings of the Second Workshop on Computational Approaches to Code Switching

Columbia-Jadavpur submission for EMNLP 2016 Code-Switching Workshop Shared Task: System description
Arunavha Chanda | Dipankar Das | Chandan Mazumdar
Proceedings of the Second Workshop on Computational Approaches to Code Switching

2015

Identification and Classification of Emotional Key Phrases from Psychological Texts
Apurba Paul | Dipankar Das
Proceedings of the ACL 2015 Workshop on Novel Computational Approaches to Keyphrase Extraction

Mood Classification of Hindi Songs based on Lyrics
Braja Gopal Patra | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 12th International Conference on Natural Language Processing

2014

JU_CSE: A Conditional Random Field (CRF) Based Approach to Aspect Based Sentiment Analysis
Braja Gopal Patra | Soumik Mandal | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

How Sentiment Analysis Can Help Machine Translation
Santanu Pal | Braja Gopal Patra | Dipankar Das | Sudip Kumar Naskar | Sivaji Bandyopadhyay | Josef van Genabith
Proceedings of the 11th International Conference on Natural Language Processing

2013

Construction of Emotional Lexicon Using Potts Model
Braja Gopal Patra | Hiroya Takamura | Dipankar Das | Manabu Okumura | Sivaji Bandyopadhyay
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Emotion Co-referencing - Emotional Expression, Holder, and Topic
Dipankar Das | Sivaji Bandyopadhyay
International Journal of Computational Linguistics & Chinese Language Processing, Volume 18, Number 1, March 2013

Automatic Music Mood Classification of Hindi Songs
Braja Gopal Patra | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 3rd Workshop on Sentiment Analysis where AI meets Psychology

2012

Part of Speech (POS) Tagger for Kokborok
Braja Gopal Patra | Khumbar Debbarma | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of COLING 2012: Posters

A Light Weight Stemmer in Kokborok
Braja Gopal Patra | Khumbar Debbarma | Swapan Debbarma | Dipankar Das | Amitava Das | Sivaji Bandyopadhyay
Proceedings of the 24th Conference on Computational Linguistics and Speech Processing (ROCLING 2012)

Morphological Analyzer for Kokborok
Khumbar Debbarma | Braja Gopal Patra | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing

Classification of Interviews - A Case Study on Cancer Patients
Braja Gopal Patra | Amitava Kundu | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 2nd Workshop on Sentiment Analysis where AI meets Psychology

2011

Semantic Clustering: an Attempt to Identify Multiword Expressions in Bengali
Tanmoy Chakraborty | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World

Identifying Event-Sentiment Association using Lexical Equivalence and Co-reference Approaches
Anup Kolya | Dipankar Das | Asif Ekbal | Sivaji Bandyopadhyay
Proceedings of the ACL 2011 Workshop on Relational Models of Semantics

Developing Japanese WordNet Affect for Analyzing Emotions
Yoshimitsu Torii | Dipankar Das | Sivaji Bandyopadhyay | Manabu Okumura
Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2.011)

Analyzing Emotional Statements – Roles of General and Physiological Variables
Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the Workshop on Sentiment Analysis where AI meets Psychology (SAAIP 2011)

2010

Discerning Emotions of Bloggers based on Topics – a Supervised Coreference Approach in Bengali
Dipankar Das | Sivaji Bandyopadhyay
ROCLING 2010 Poster Papers

JU: A Supervised Approach to Identify Semantic Relations from Paired Nominals
Santanu Pal | Partha Pakray | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 5th International Workshop on Semantic Evaluation

Labeling Emotion in Bengali Blog Corpus – A Fine Grained Tagging at Sentence Level
Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the Eighth Workshop on Asian Language Resouces

Automatic Extraction of Complex Predicates in Bengali
Dipankar Das | Santanu Pal | Tapabrata Mondal | Tanmoy Chakraborty | Sivaji Bandyopadhyay
Proceedings of the 2010 Workshop on Multiword Expressions: from Theory to Applications

Identifying Emotional Expressions, Intensities and Sentence Level Emotion Tags Using a Supervised Framework
Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation

Finding Emotion Holder from Bengali Blog Texts—An Unsupervised Syntactic Approach
Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation

2009

Word to Sentence Level Emotion Tagging for Bengali Blogs
Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

Bengali Verb Subcategorization Frame Acquisition - A Baseline Model
Somnath Banerjee | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 7th Workshop on Asian Language Resources (ALR7)

Co-authors

Dipanjan Saha 7

Soham Chaudhuri 4

Partha Pakray 4

Tanmoy Chakraborty 3

Khumbar Debbarma 3

Srijani Debnath 3

Sandip Sarkar 3

Priyobroto Acharya 2

Somnath Banerjee 2

Arunavha Chanda 2

Avishek Garain 2

Alexander Gelbukh 2

Anup Kumar Kolya 2

Soumil Mandal 2

Chandan Mazumdar 2

Tapabrata Mondal 2

Manabu Okumura 2

Md Sharib Akhtar 1

Siddhartha Bhattacharyya 1

Diganta Biswas 1

Sohan Choudhury 1

Shankha Shubhra Das 1

Sannidhya Das 1

Dr. Abhishek Das 1

Kumar Gourav Das 1

Swapan Debbarma 1

Subhabrata Dutta 1

Saptarshi Ghosh 1

Souvick Ghosh 1

Shyamal Krishna Jana 1

Jeevesh Juneja 1

Prashant Kapil 1

Pinaki Karkun 1

Sarika Khatun 1

Adnan Khurshid 1

Amitava Kundu 1

Promita Maitra 1

Arpan Majumdar 1

Soumayan Majumder 1

Riyanka Manna 1

Haranath Mondal 1

Niloy Mukherjee 1

Arkajyoti Naskar 1

Sudip Kumar Naskar 1

Md Fahin Parvej 1

Biswajit Paul 1

Aniket Pramanick 1

Tathagata Raha 1

Subharthi Ray 1

Adityar Rayala 1

Prasun Sarkar 1

Partha Sengupta 1

Hiroya Takamura 1

Yoshimitsu Torii 1

Josef van Genabith 1

Venues

dravidianlangtech1