Tirthankar Dasgupta - ACL Anthology

Tirthankar Dasgupta

2026

RegNLI: Detecting Online Product Misbranding through Legal and Linguistic Alignment
Diya Saha | Abhishek Bharadwaj Varanasi | Tirthankar Dasgupta | Manjira Sinha
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track)

Misbranding of health-related products poses significant risks to public safety and regulatory compliance. Existing approaches to claim verification largely rely on keyword matching or generic text classification, failing to capture the nuanced reasoning required to align product claims with legal statutes. In this work, we introduce RegNLI, a novel framework that formulates misbranding detection as an inference task between product claims and regulatory provisions. Leveraging a curated dataset of FDA warning letters, we construct structured representations of claims and statutes. Our model integrates a regulation-aware gating mechanism with a contrastive alignment objective to jointly optimize misbranding classification and statute mapping. Experiments on the FDA-Misbrand dataset demonstrate that RegNLI significantly outperforms strong baselines across accuracy, F1-score, and regulation alignment metrics, while providing interpretable attention patterns that highlight critical linguistic cues. This work establishes a foundation for compliance-aware NLP systems and opens new directions for integrating formal reasoning with neural architectures in regulatory domains.

DisGraph-RP: Graph-Augmented Temporal Modeling with Aspect-Based Contrastive Encoding of Discharge Summary for Readmission Prediction
Sudeshna Jana | Tirthankar Dasgupta | Manjira Sinha | Pabitra Mitra
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track)

Predicting hospital readmissions is a critical clinical task with substantial implications for patient outcomes and healthcare cost management. We propose DisGraph-RP, a graph-augmented temporal modeling framework that integrates structured discourse-aware text representation with cross-admission relational reasoning. Our approach introduces a Section-Aware Contrastive Encoder that leverages section segmentation and aspect-based supervision to produce fine-grained representations of discharge summaries. These representations are then composed over time using a Graph-Based temporal module that encodes inter-visit dependencies through learned edge relations, enabling the model to capture disease progression, treatment history, and recurrent risk signals. Experiments on multiple real-world datasets demonstrate that DisGraph-RP achieves significant improvements over strong baselines, including transformer-based clinical models and prompting-based LLM approaches. Our findings highlight the importance of combining discourse-informed text encoding with temporal graph reasoning for robust clinical outcome prediction.

A Graph-Augmented Liquid Neural Network for Extracting Food Hazards and Disease Outbreaks
Tirthankar Dasgupta | Manjira Sinha | Sudeshna Jana | Diya Saha | Ishan Verma | Vaishali Aggarwal
Proceedings of the 1st Workshop on Linguistic Analysis for Health (HeaLing 2026)

The increasing frequency of foodborne illnesses, safety hazards, and disease outbreaks in the food supply chain demands urgent attention to protect public health. These incidents, ranging from contamination to intentional adulteration of food and feed, pose serious risks to consumers, leading to poisoning and disease outbreaks that result in product recalls. Identifying and tracking the sources and pathways of contamination is essential for timely intervention and prevention. This paper explores the use of social media and regulatory news reports to detect food safety issues and disease outbreaks. We present an automated approach leveraging a multi-task sequence labeling and sequence classification model that uses a liquid time-constant neural network augmented with a graph convolution network to extract and analyze relevant information from social media posts and official reports. Our methodology includes the creation of annotated datasets of social media content and regulatory documents, enabling the model to identify foodborne infections and safety hazards in real time. Preliminary results demonstrate that our model outperforms baseline models, including advanced large language models like LLAMA-3 and Mistral-7B, in terms of accuracy and efficiency. The integration of liquid neural networks significantly reduces computational and memory requirements, achieving superior performance with just 1.2 × e⁶ bytes of memory, compared to the 20.3 GB of GPU memory needed by traditional transformer-based models. This approach offers a promising solution for leveraging social media data in monitoring and mitigating food safety risks and public health threats.

2025

Self-State Evidence Extraction and Well-Being Prediction from Social Media Timelines
Suchandra Chakraborty | Sudeshna Jana | Manjira Sinha | Tirthankar Dasgupta
Proceedings of the 10th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2025)

This study explores the application of Large Language Models (LLMs) and supervised learning to analyze social media posts from Reddit users, addressing two key objectives: first, to extract adaptive and maladaptive self-state evidence that supports psychological assessment (Task A1); and second, to predict a well-being score that reflects the user’s mental state (Task A2). We propose i) a fine-tuned RoBERTa (Liu et al., 2019) model for Task A1 to identify self-state evidence spans and ii) evaluate two approaches for Task A2: a retrieval-augmented DeepSeek-7B (DeepSeek-AI et al., 2025) model and a Random Forest regression model trained on sentence embeddings. While LLM-based prompting utilizes contextual reasoning, our findings indicate that supervised learning provides more reliable numerical predictions. The RoBERTa model achieves the highest recall (0.602) for Task A1, and Random Forest regression outperforms DeepSeek-7B for Task A2 (MSE: 2.994 vs. 6.610). These results highlight the strengths and limitations of generative vs. supervised methods in mental health NLP, contributing to the development of privacy-conscious, resource-efficient approaches for psychological assessment. This work is part of the CLPsych 2025 shared task (Tseriotou et al., 2025).

Benchmarking Bangla Causality: A Dataset of Implicit and Explicit Causal Sentences and Cause-Effect Relations
Diya Saha | Sudeshna Jana | Manjira Sinha | Tirthankar Dasgupta
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Causal reasoning is central to language understanding, yet remains under-resourced in Bangla. In this paper, we introduce the first large-scale dataset for causal inference in Bangla, consisting of over 11663 sentences annotated for causal sentence types (explicit, implicit, non-causal) and token-level spans for causes, effects, and connectives. The dataset captures both simple and complex causal structures across diverse domains such as news, education, and health. We further benchmark a suite of state-of-the-art instruction-tuned large language models, including LLaMA 3.3 70B, Gemma 2 9B, Qwen 32B, and DeepSeek, under zero-shot and three-shot prompting conditions. Our analysis reveals that while LLMs demonstrate moderate success in explicit causality detection, their performance drops significantly on implicit and span-level extraction tasks. This work establishes a foundational resource for Bangla causal understanding and highlights key challenges in adapting multilingual LLMs for structured reasoning in low-resource languages.

Predicting ICU Length of Stay for Patients using Latent Categorization of Health Conditions
Tirthankar Dasgupta | Manjira Sinha | Sudeshna Jana
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)

Predicting the duration of a patient’s stay in an Intensive Care Unit (ICU) is a critical challenge for healthcare administrators, as it impacts resource allocation, staffing, and patient care strategies. Traditional approaches often rely on structured clinical data, but recent developments in language models offer significant potential to utilize unstructured text data such as nursing notes, discharge summaries, and clinical reports for ICU length-of-stay (LoS) predictions. In this study, we introduce a method for analyzing nursing notes to predict the remaining ICU stay duration of patients. Our approach leverages a joint model of latent note categorization, which identifies key health-related patterns and disease severity factors from unstructured text data. This latent categorization enables the model to derive high-level insights that influence patient care planning. We evaluate our model on the widely used MIMIC-III dataset, and our preliminary findings show that it significantly outperforms existing baselines, suggesting promising industrial applications for resource optimization and operational efficiency in healthcare settings.

Cross-Linguistic Phonological Similarity Analysis in Sign Languages Using HamNoSys
Abhishek Bharadwaj Varanasi | Manjira Sinha | Tirthankar Dasgupta
Proceedings of the Workshop on Sign Language Processing (WSLP)

This paper presents a cross-linguistic analysis of phonological similarity in sign languages using symbolic representations from the Hamburg Notation System (HamNoSys). We construct a dataset of 1000 signs each from British Sign Language (BSL), German Sign Language (DGS), French Sign Language (LSF), and Greek Sign Language (GSL), and compute pairwise phonological similarity using normalized edit distance over HamNoSys strings. Our analysis reveals both universal and language-specific patterns in handshape usage, movement dynamics, non-manual features, and spatial articulation. We explore intra- and inter-language similarity distributions, phonological clustering, and co-occurrence structures across feature types. The findings offer insights into the structural organization of sign language phonology and highlight typological variation shaped by linguistic and cultural factors.
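
The similarity measure named in this abstract can be illustrated with a minimal sketch. The abstract does not say how the edit distance is normalized, so dividing by the length of the longer string is an assumption here, and the function names are hypothetical. The example strings are ordinary Latin letters standing in for HamNoSys symbol sequences, not real transcriptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_similarity(a, b):
    """Similarity in [0, 1]; normalizing by the longer length is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_distance(a, b) / longest


# Placeholder strings, not actual HamNoSys notation.
print(normalized_similarity("kitten", "sitting"))
```

Applying `normalized_similarity` to every pair of signs gives a pairwise similarity matrix of the kind that clustering or distribution analyses could start from.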

2024

EnClaim: A Style Augmented Transformer Architecture for Environmental Claim Detection
Diya Saha | Manjira Sinha | Tirthankar Dasgupta
Proceedings of the 1st Workshop on Natural Language Processing Meets Climate Change (ClimateNLP 2024)

Across countries, a noteworthy paradigm shift towards a more sustainable and environmentally responsible economy is underway. However, this positive transition is accompanied by an upsurge in greenwashing, where companies make exaggerated claims about their environmental commitments. To address this challenge and protect consumers, initiatives have emerged to substantiate green claims. With the proliferation of environmental and scientific assertions, a critical need arises for automated methods to detect and validate these claims at scale. In this paper, we introduce EnClaim, a transformer network architecture augmented with stylistic features for automatically detecting claims from open web documents or social media posts. The proposed model considers various linguistic stylistic features in conjunction with language models to predict whether a given statement constitutes a claim. We have rigorously evaluated the model using multiple open datasets. Our initial findings indicate that incorporating stylistic vectors alongside the BERT-based language model enhances the overall effectiveness of environmental claim detection.

PollCardioKG: A Dynamic Knowledge Graph of Interaction Between Pollution and Cardiovascular Diseases
Sudeshna Jana | Anunak Roy | Manjira Sinha | Tirthankar Dasgupta
Proceedings of the 21st International Conference on Natural Language Processing (ICON)

In recent decades, environmental pollution has become a pressing global health concern. According to the World Health Organization (WHO), a significant portion of the population is exposed to air pollutant levels exceeding safety guidelines. Cardiovascular diseases (CVDs) — including coronary artery disease, heart attacks, and strokes — are particularly significant health effects of this exposure. In this paper, we investigate the effects of air pollution on cardiovascular health by constructing a dynamic knowledge graph based on extensive biomedical literature. This paper provides a comprehensive exploration of entity identification and relation extraction, leveraging advanced language models. Additionally, we demonstrate how in-context learning with large language models can enhance the accuracy and efficiency of the extraction process. The constructed knowledge graph enables us to analyze the relationships between pollutants and cardiovascular diseases over the years, providing deeper insights into the long-term impact of cumulative exposure, underlying causal mechanisms, vulnerable populations, and the role of emerging contaminants in worsening various cardiac outcomes.

Linguistically Informed Transformers for Text to American Sign Language Translation
Abhishek Bharadwaj Varanasi | Manjira Sinha | Tirthankar Dasgupta | Charudatta Jadhav
Proceedings of the Seventh Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2024)

In this paper we propose a framework for automatic translation of English text to American Sign Language (ASL) which leverages a linguistically informed transformer model to translate English sentences into ASL gloss sequences. These glosses are then associated with respective ASL videos, effectively representing English text in ASL. To facilitate experimentation, we create an English-ASL parallel dataset on the banking domain. Our preliminary results demonstrate that the linguistically informed transformer model achieves a 97.83% ROUGE-L score for text-to-gloss translation on the ASLG-PC12 dataset. Furthermore, the model yields an 89.47% ROUGE-L score on the banking domain dataset when fine-tuned on the ASLG-PC12 + banking domain dataset. These results demonstrate the effectiveness of the linguistically informed model for both general and domain-specific translations. To facilitate parallel dataset generation in the banking domain, we choose ASL despite its limited benchmarks and data corpora compared to some of the other sign languages.

FORCE: A Benchmark Dataset for Foodborne Disease Outbreak and Recall Event Extraction from News
Sudeshna Jana | Manjira Sinha | Tirthankar Dasgupta
Proceedings of the 9th Social Media Mining for Health Research and Applications (SMM4H 2024) Workshop and Shared Tasks

The escalating prevalence of food safety incidents within the food supply chain necessitates immediate action to protect consumers. These incidents encompass a spectrum of issues, including food product contamination and deliberate food and feed adulteration for economic gain, leading to outbreaks and recalls. Understanding the origins and pathways of contamination is imperative for prevention and mitigation. In this paper, we introduce FORCE (Foodborne disease Outbreak and ReCall Event extraction from openweb). Our proposed model leverages a multi-tasking sequence labeling architecture in conjunction with transformer-based document embeddings. We have compiled a substantial annotated corpus comprising relevant articles published between 2011 and 2023 to train and evaluate the model. The dataset will be publicly released with the paper. The event detection model demonstrates fair accuracy in identifying food-related incidents and outbreaks associated with organizations, as assessed through cross-validation techniques.

Exploring Language Models to Analyze Market Demand Sentiments from News
Tirthankar Dasgupta | Manjira Sinha
Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis

Obtaining demand trends for products is an essential aspect of supply chain planning. It helps in generating scenarios for simulation before actual demands start pouring in. Presently, experts obtain this number manually from different news sources. In this paper, we present methods that can automate the information acquisition process. We present a joint framework that performs information extraction and sentiment analysis to acquire demand-related information from business text documents. The proposed system leverages a TwinBERT-based deep neural network model to first extract the product information with which demand is associated and then identify the respective sentiment polarity. The articles are also subjected to causal analytics; together, these yield rich contextual information about the reasons for the rise or fall in demand for various products. The enriched information is targeted at decision-makers, analysts and knowledge workers. We have exhaustively evaluated our proposed models with datasets curated and annotated for two different domains, namely the automobile sector and housing. The proposed model outperforms the existing baseline systems.

2023

Dy-poThon: A Bangla Sentence-Learning System for Children with Dyslexia
Dipshikha Podder | Manjira Sinha | Tirthankar Dasgupta | Anupam Basu
Proceedings of the 20th International Conference on Natural Language Processing (ICON)

The number of assistive technologies available for dyslexia in Bangla is low and most of them do not use multisensory teaching methods. As a solution, a computer-based audio-visual system Dy-poThon is proposed to teach sentence reading in Bangla. It incorporates the multisensory teaching method through three activities, listening, reading, and writing, checks the reading and writing ability of the user and tracks the response time. A criteria-based evaluation was conducted with 28 special educators to evaluate Dy-poThon. Content, efficiency, ease of use and aesthetics are evaluated using a standardised questionnaire. The result suggests that Dy-poThon is useful for teaching Bangla sentence-reading.

2022

TCS WITM 2022@FinSim4-ESG: Augmenting BERT with Linguistic and Semantic features for ESG data classification
Tushar Goel | Vipul Chauhan | Suyash Sangwan | Ishan Verma | Tirthankar Dasgupta | Lipika Dey
Proceedings of the Fourth Workshop on Financial Technology and Natural Language Processing (FinNLP)

Advanced neural network architectures have provided several opportunities to develop systems to automatically capture information from domain-specific unstructured text sources. The FinSim4-ESG shared task, collocated with the FinNLP workshop, proposed two sub-tasks. In sub-task1, the challenge was to design systems that could utilize contextual word embeddings along with sustainability resources to elaborate an ESG taxonomy. In the second sub-task, participants were asked to design a system that could classify sentences into sustainable or unsustainable sentences. In this paper, we utilize semantic similarity features along with BERT embeddings to segregate domain terms into a fixed number of class labels. The proposed model not only considers the contextual BERT embeddings but also incorporates Word2Vec, cosine, and Jaccard similarity which gives word-level importance to the model. For sentence classification, several linguistic elements along with BERT embeddings were used as classification features. We have shown a detailed ablation study for the proposed models.

ATL at FinCausal 2022: Transformer Based Architecture for Automatic Causal Sentence Detection and Cause-Effect Extraction
Abir Naskar | Tirthankar Dasgupta | Sudeshna Jana | Lipika Dey
Proceedings of the 4th Financial Narrative Processing Workshop @LREC2022

Automatic extraction of cause-effect relationships from natural language texts is a challenging open problem in Artificial Intelligence. Most of the early attempts at its solution used manually constructed linguistic and syntactic rules on restricted domain data sets. With the advent of big data, and the recent popularization of deep learning, the paradigm to tackle this problem has slowly shifted. In this work we propose a transformer-based architecture to automatically detect causal sentences from textual mentions and then identify the corresponding cause-effect relations. We describe our submission to the FinCausal 2022 shared task based on this method. Our model achieves an F1-score of 0.99 for Task-1 and an F1-score of 0.60 for Task-2 on the shared task data set on financial documents.

Proceedings of the First Workshop on NLP in Agriculture and Livestock Management
Manjira Sinha | Tirthankar Dasgupta | Sanjay Chatterjee
Proceedings of the First Workshop on NLP in Agriculture and Livestock Management

2020

Extracting Semantic Aspects for Structured Representation of Clinical Trial Eligibility Criteria
Tirthankar Dasgupta | Ishani Mondal | Abir Naskar | Lipika Dey
Proceedings of the 3rd Clinical Natural Language Processing Workshop

Eligibility criteria in the clinical trials specify the characteristics that a patient must or must not possess in order to be treated according to a standard clinical care guideline. As the process of manual eligibility determination is time-consuming, automatic structuring of the eligibility criteria into various semantic categories or aspects is the need of the hour. Existing methods use hand-crafted rules and feature-based statistical machine learning methods to dynamically induce semantic aspects. However, in order to deal with paucity of aspect-annotated clinical trials data, we propose a novel weakly-supervised co-training based method which can exploit a large pool of unlabeled criteria sentences to augment the limited supervised training data, and consequently enhance the performance. Experiments with 0.2M criteria sentences show that the proposed approach outperforms the competitive supervised baselines by 12% in terms of micro-averaged F1 score for all the aspects. Probing deeper into analysis, we observe domain-specific information boosts up the performance by a significant margin.

Learning Domain Terms - Empirical Methods to Enhance Enterprise Text Analytics Performance
Gargi Roy | Lipika Dey | Mohammad Shakir | Tirthankar Dasgupta
Proceedings of the 28th International Conference on Computational Linguistics: Industry Track

Performance of standard text analytics algorithms is known to be substantially degraded on consumer-generated data, which are often very noisy. These algorithms also do not work well on enterprise data, which has a very different nature from news repositories, storybooks or Wikipedia data. Text cleaning is a mandatory step which aims at noise removal and correction to improve performance. However, enterprise data needs special cleaning methods since it contains many domain terms which appear to be noise against a standard dictionary, but in reality are not so. In this work we present a detailed analysis of the characteristics of enterprise data and suggest unsupervised methods for cleaning these repositories after domain terms have been automatically segregated from true noise terms. Noise terms are thereafter corrected in a contextual fashion. The effectiveness of the method is established through careful manual evaluation of error corrections over several standard data sets, including those available for hate speech detection, where there is deliberate distortion to avoid detection. We also share results to show enhancement in classification accuracy after noise correction.

2018

Automatic Curation and Visualization of Crime Related Information from Incrementally Crawled Multi-source News Reports
Tirthankar Dasgupta | Lipika Dey | Rupsa Saha | Abir Naskar
Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations

In this paper, we demonstrate a system for the automatic extraction and curation of crime-related information from multi-source digitally published news articles collected over a period of five years. We have leveraged a deep convolution recurrent neural network model to analyze crime articles and extract different crime-related entities and events. The proposed methods are not restricted to detecting known crimes only but contribute actively towards maintaining an updated crime ontology. We have done experiments with a collection of 5000 crime-reporting news articles spanning time and multiple sources. The end-product of our experiments is a crime register that contains details of crimes committed across geographies and time. This register can be further utilized for analytical and reporting purposes.

Augmenting Textual Qualitative Features in Deep Convolution Recurrent Neural Network for Automatic Essay Scoring
Tirthankar Dasgupta | Abir Naskar | Lipika Dey | Rupsa Saha
Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications

In this paper we present a qualitatively enhanced deep convolution recurrent neural network for computing the quality of a text in an automatic essay scoring task. The novelty of the work lies in the fact that instead of considering only the word and sentence representation of a text, we try to augment the different complex linguistic, cognitive and psychological features associated with a text document along with a hierarchical convolution recurrent neural network framework. Our preliminary investigation shows that incorporation of such qualitative feature vectors along with standard word/sentence embeddings can give us a better understanding of how to improve the overall evaluation of the input essays.

Proceedings of the First International Workshop on Language Cognition and Computational Models
Manjira Sinha | Tirthankar Dasgupta
Proceedings of the First International Workshop on Language Cognition and Computational Models

Automatic Extraction of Causal Relations from Text using Linguistically Informed Deep Neural Networks
Tirthankar Dasgupta | Rupsa Saha | Lipika Dey | Abir Naskar
Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue

In this paper we have proposed a linguistically informed recursive neural network architecture for automatic extraction of cause-effect relations from text. These relations can be expressed in arbitrarily complex ways. The architecture uses word-level embeddings and other linguistic features to detect causal events and their effects mentioned within a sentence. The extracted events and their relations are used to build a causal graph after clustering and appropriate generalization, which is then used for predictive purposes. We have evaluated the performance of the proposed extraction model with respect to two baseline systems, one a rule-based classifier, and the other a conditional random field (CRF) based supervised model. We have also compared our results with related work reported in the past by other authors on the SEMEVAL data set, and found that the proposed bi-directional LSTM model enhanced with an additional linguistic layer performs better. We have also worked extensively on creating new annotated datasets from publicly available data, which we are willing to share with the community.

Leveraging Web Based Evidence Gathering for Drug Information Identification from Tweets
Rupsa Saha | Abir Naskar | Tirthankar Dasgupta | Lipika Dey
Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task

In this paper, we have explored web-based evidence gathering and different linguistic features to automatically extract drug names from tweets and further classify such tweets as reporting Adverse Drug Events or not. We have evaluated our proposed models with the datasets released by the SMM4H workshop for shared Task-1 and Task-3 respectively. Our evaluation results show that the proposed model achieved good results, with Precision, Recall and F-scores of 78.5%, 88% and 82.9% respectively for Task-1 and 33.2%, 54.7% and 41.3% for Task-3.
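
The relationship between the precision, recall and F-score figures quoted above can be checked with a short sketch. It assumes the F-score is the balanced F1 measure (the harmonic mean of precision and recall), which the abstract does not state explicitly. The helper name `f_score` is hypothetical.

```python
def f_score(precision, recall):
    """Balanced F1: harmonic mean of precision and recall (assumed metric)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall values as quoted in the abstract, in percent.
task1 = f_score(78.5, 88.0)
task3 = f_score(33.2, 54.7)
print(f"Task-1 F1: {task1:.2f}  Task-3 F1: {task3:.2f}")
```

Under that assumption, the computed values agree with the reported 82.9% and 41.3% to within about 0.1 percentage point. The small gap is what rounding of the published precision and recall would be expected to produce.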

2017

Study on Visual Word Recognition in Bangla across Different Reader Groups
Manjira Sinha | Tirthankar Dasgupta
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

2016

A Framework for Mining Enterprise Risk and Risk Factors from News Documents
Tirthankar Dasgupta | Lipika Dey | Prasenjit Dey | Rupsa Saha
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

Any real-world event or trend that can affect a company’s growth trajectory can be considered a risk. There has been a growing need to automatically identify, extract and analyze risk-related statements from news events. In this demonstration, we will present a risk analytics framework that processes enterprise project management reports in the form of textual data and news documents and classifies them into valid and invalid risk categories. The framework also extracts information from the text pertaining to the different categories of risks, such as their possible causes and impacts. Accordingly, we have used machine learning based techniques and studied different linguistic features like n-gram, POS, dependency, future timing, uncertainty factors in texts and their various combinations. A manual annotation study from management experts using risk descriptions collected for a specific organization was conducted to evaluate the framework. The evaluation showed promising results for automated risk analysis and identification.

Effect of Syntactic Features in Bangla Sentence Comprehension
Manjira Sinha | Tirthankar Dasgupta | Anupam Basu
Proceedings of the 13th International Conference on Natural Language Processing

2015

Compositionality in Bangla Compound Verbs and their Processing in the Mental Lexicon
Tirthankar Dasgupta | Manjira Sinha | Anupam Basu
Proceedings of the 12th International Conference on Natural Language Processing

2014

Influence of Target Reader Background and Text Features on Text Readability in Bangla: A Computational Approach
Manjira Sinha | Tirthankar Dasgupta | Anupam Basu
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Design and Development of an Online Computational Framework to Facilitate Language Comprehension Research on Indian Languages
Manjira Sinha | Tirthankar Dasgupta | Anupam Basu
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we have developed an open-source online computational framework that can be used by different research groups to conduct reading research on Indian language texts. The framework can be used to develop a large annotated Indian language text comprehension dataset from different user-based experiments. The novelty of this framework lies in the fact that it brings different empirical data-collection techniques for text comprehension under one roof. The framework has been customized specifically to address language particularities of Indian languages. It will also offer many types of automatic analysis on the data at different levels, such as full text, sentence and word level. To address the subjectivity of text difficulty perception, the framework allows user background to be captured against multiple factors. The assimilated data can be automatically cross-referenced against varying strata of readers.

Text Readability in Hindi: A Comparative Study of Feature Performances Using Support Vectors
Manjira Sinha | Tirthankar Dasgupta | Anupam Basu
Proceedings of the 11th International Conference on Natural Language Processing

2013

Psycholinguistically Motivated Computational Models on the Organization and Processing of Morphologically Complex Words
Tirthankar Dasgupta
51st Annual Meeting of the Association for Computational Linguistics Proceedings of the Student Research Workshop

2012

Modelling the Organization and Processing of Bangla Polymorphemic Words in the Mental Lexicon: A Computational Approach
Tirthankar Dasgupta | Manjira Sinha | Anupam Basu
Proceedings of COLING 2012: Posters

New Readability Measures for Bangla and Hindi Texts
Manjira Sinha | Sakshi Sharma | Tirthankar Dasgupta | Anupam Basu
Proceedings of COLING 2012: Posters

Forward Transliteration of Dzongkha Text to Braille
Tirthankar Dasgupta | Manjira Sinha | Anupam Basu
Proceedings of the Second Workshop on Advances in Text Input Methods

Automatic Extraction of Compound Verbs from Bangla Corpora
Sibanshu Mukhopadhayay | Tirthankar Dasgupta | Manjira Sinha | Anupam Basu
Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing

A New Semantic Lexicon and Similarity Measure in Bangla
Manjira Sinha | Abhik Jana | Tirthankar Dasgupta | Anupam Basu
Proceedings of the 3rd Workshop on Cognitive Aspects of the Lexicon

2010

Resource Creation for Training and Testing of Transliteration Systems for Indian Languages
Sowmya V. B. | Monojit Choudhury | Kalika Bali | Tirthankar Dasgupta | Anupam Basu
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Machine transliteration is used in a number of NLP applications ranging from machine translation and information retrieval to input mechanisms for non-Roman scripts. Many popular Input Method Editors for Indian languages, like Baraha, Akshara, Quillpad, etc., use back-transliteration as a mechanism to allow users to input text in a number of Indian languages. The lack of a standard dataset to evaluate these systems makes it difficult to make any meaningful comparisons of their relative accuracies. In this paper, we describe the methodology for the creation of a dataset of ~2500 transliterated sentence pairs each in Bangla, Hindi and Telugu. The data was collected across three different modes from a total of 60 users. We believe that this dataset will prove useful not only for the evaluation and training of back-transliteration systems but also help in the linguistic analysis of the process of transliterating Indian languages from native scripts to Roman.

2008

Prototype Machine Translation System From Text-To-Indian Sign Language
Tirthankar Dasgupta | Sandipan Dandpat | Anupam Basu
Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages

A Multilingual Multimedia Indian Sign Language Dictionary Tool
Tirthankar Dasgupta | Sambit Shukla | Sandeep Kumar | Synny Diwakar | Anupam Basu
Proceedings of the 6th Workshop on Asian Language Resources
