Imon Mukherjee


2024

Few-TK: A Dataset for Few-shot Scientific Typed Keyphrase Recognition
Avishek Lahiri | Pratyay Sarkar | Medha Sen | Debarshi Sanyal | Imon Mukherjee
Findings of the Association for Computational Linguistics: NAACL 2024

Scientific texts differ from ordinary texts in several respects, such as their vocabulary and discourse structure. Consequently, Information Extraction (IE) tasks for scientific texts come with their own set of challenges. The classical definition of Named Entities is too narrow to cover all scientific terms, which is why previous works have used the terms Named Entities and Keyphrases interchangeably. We suggest renaming Named Entities in the scientific domain as Typed Keyphrases (TK), broadening their scope. We advocate exploring this task in the few-shot setting because labeled scientific IE data is scarce. Currently, no dataset exists for few-shot scientific Typed Keyphrase Recognition. To address this gap, we develop an annotation schema and present Few-TK, a dataset in the AI/ML field that includes scientific Typed Keyphrase annotations on the abstracts of 500 research papers. To the best of our knowledge, this is the first few-shot Typed Keyphrase Recognition dataset and only the second dataset structured specifically for few-shot NER, after Few-NERD. We report the results of several few-shot sequence-labelling models applied to our dataset. The data and code are available at https://github.com/AvishekLahiri/Few_TK.git

2023

Combating Hallucination and Misinformation: Factual Information Generation with Tokenized Generative Transformer
Sourav Das | Sanjay Chatterji | Imon Mukherjee
Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages

Large language models (LLMs) have risen rapidly in prominence. With that prominence, hallucination and the generation of misinformation have become serious problems as well. To combat this issue, we propose a contextual topic modeling approach for generative transformers called Co-LDA. It is based on Latent Dirichlet Allocation and is designed for accurate sentence-level information generation. This method extracts cohesive topics from COVID-19 research literature, grouping them into relevant categories. These contextually rich topic words serve as masked tokens in our proposed Tokenized Generative Transformer, a modified Generative Pre-Trained Transformer for generating accurate information on any designated topic. Our approach addresses micro hallucination and incorrect information issues in experimentation with LLMs. We also introduce a Perplexity-Similarity Score system to measure semantic similarity between generated and original documents, giving a measure of the accuracy and authenticity of generated texts. Evaluation on benchmark datasets, including question answering, language understanding, and language similarity, demonstrates the effectiveness of our text generation method, which surpasses some state-of-the-art transformer models.

2022

A custom CNN model for detection of rice disease under complex environment
Chiranjit Pal | Sanjoy Pratihar | Imon Mukherjee
Proceedings of the First Workshop on NLP in Agriculture and Livestock Management

This paper designs an image-based rice disease detection framework that takes a rice plant image as input and identifies whether BrownSpot disease is present in it. A CNN-based disease detection scheme performs the binary classification task on our custom dataset, which contains 2223 images of healthy and unhealthy classes under complex environments. Experimental results show that our system achieves consistently satisfactory results on disease detection tasks. Furthermore, the CNN disease detection model is compared with state-of-the-art works and achieves an accuracy of 96.8%.