Shubham Gupta


2025

Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems
Rafael Teixeira de Lima | Shubham Gupta | Cesar Berrospi Ramis | Lokesh Mishra | Michele Dolfi | Peter Staar | Panagiotis Vagenas
Proceedings of the 31st International Conference on Computational Linguistics: Industry Track

Retrieval Augmented Generation (RAG) systems are a widespread application of Large Language Models (LLMs) in the industry. While many tools exist that empower developers to build their own systems, measuring their performance locally, with datasets reflective of the system’s use cases, is a technological challenge. Solutions to this problem range from non-specific and cheap (most public datasets) to specific and costly (generating data from local documents). In this paper, we show that using public question and answer (Q&A) datasets to assess retrieval performance can lead to suboptimal system design, and that common tools for RAG dataset generation can lead to unbalanced data. We propose solutions to these issues based on the characterization of RAG datasets through labels and through label-targeted data generation. Finally, we show that fine-tuned small LLMs can efficiently generate Q&A datasets. We believe that these observations are invaluable to the know-your-data step of RAG systems development.
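
The abstract does not include code; as a loose illustration of the label-targeted generation idea, the sketch below balances a Q&A dataset across a hypothetical question-type taxonomy. The labels, prompt template, and `complete` helper are all assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of label-targeted Q&A generation for RAG evaluation.
# The label taxonomy, prompt template, and `complete` helper are
# hypothetical stand-ins, not the paper's actual method.
import random

# Hypothetical question-type labels used to balance the dataset.
QUESTION_LABELS = ["factoid", "multi-hop", "comparison", "aggregation"]

PROMPT_TEMPLATE = (
    "Given the passage below, write one {label} question and its answer.\n"
    "Passage:\n{passage}\n"
    "Return as: Q: <question>\nA: <answer>"
)

def complete(prompt: str) -> str:
    """Placeholder for a call to any LLM completion API."""
    raise NotImplementedError

def generate_balanced_qa(passages, per_label=10):
    """Target each label explicitly so no question type dominates."""
    dataset = []
    for label in QUESTION_LABELS:
        for passage in random.sample(passages, per_label):
            raw = complete(PROMPT_TEMPLATE.format(label=label, passage=passage))
            question, _, answer = raw.partition("\nA:")
            dataset.append({
                "label": label,
                "question": question.removeprefix("Q:").strip(),
                "answer": answer.strip(),
                "source": passage,
            })
    return dataset
```

Generating per label, rather than letting a single generic prompt pick question types freely, is one direct way to avoid the unbalanced data the abstract warns about.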

Mirror Minds: An Empirical Study on Detecting LLM-Generated Text via LLMs
Josh Baradia | Shubham Gupta | Suman Kundu
Proceedings of the 1st Workshop on GenAI Content Detection (GenAIDetect)

The use of large language models (LLMs) in text generation is now pervasive: LLMs are steadily replacing search engines and have become the de facto choice for conversation, knowledge extraction, and brainstorming. This study focuses on one question: ‘Can we utilize the generative capabilities of LLMs to detect AI-generated content?’ We present a methodology and empirical results on four publicly available datasets. The results show that a zero-shot detector utilizing multiple LLMs can detect AI-generated content with 90% accuracy.
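
As a rough sketch of what a zero-shot, multi-LLM detector could look like, the snippet below takes a majority vote across several judge models. The judging prompt, the model names, and the `ask` helper are hypothetical; the paper's exact prompts and models may differ.

```python
# Minimal sketch of a zero-shot detector that polls multiple LLMs.
# Prompt, model names, and the `ask` helper are illustrative assumptions.
JUDGE_PROMPT = (
    "Was the following text written by an AI language model? "
    "Answer strictly YES or NO.\n\nText:\n{text}"
)

def ask(model: str, prompt: str) -> str:
    """Placeholder for a call to the given LLM."""
    raise NotImplementedError

def is_ai_generated(text: str, models=("model-a", "model-b", "model-c")) -> bool:
    """Majority vote across several LLMs acting as zero-shot judges."""
    votes = [
        ask(m, JUDGE_PROMPT.format(text=text)).strip().upper().startswith("YES")
        for m in models
    ]
    return sum(votes) > len(votes) // 2
```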

2024

Statements: Universal Information Extraction from Tables with Large Language Models for ESG KPIs
Lokesh Mishra | Sohayl Dhibi | Yusik Kim | Cesar Berrospi Ramis | Shubham Gupta | Michele Dolfi | Peter Staar
Proceedings of the 1st Workshop on Natural Language Processing Meets Climate Change (ClimateNLP 2024)

Environment, Social, and Governance (ESG) KPIs assess an organization’s performance on issues such as climate change, greenhouse gas emissions, water consumption, waste management, human rights, diversity, and policies. ESG reports convey this valuable quantitative information through tables. Unfortunately, extracting this information is difficult due to high variability in both table structure and content. We propose Statements, a novel domain-agnostic data structure for extracting quantitative facts and related information. We propose translating tables to statements as a new supervised deep-learning universal information extraction task. We introduce SemTabNet, a dataset of over 100K annotated tables. Investigating a family of T5-based Statement Extraction Models, our best model generates statements which are 82% similar to the ground truth (compared to a baseline of 21%). We demonstrate the advantages of statements by applying our model to over 2700 tables from ESG reports. The homogeneous nature of statements permits exploratory data analysis on expansive information found in large collections of ESG reports.
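
To make the seq2seq framing concrete, here is a minimal sketch of table-to-statement extraction with a T5 model. It uses the public t5-small checkpoint and a naive row linearization purely for illustration; the paper's fine-tuned Statement Extraction Models and their exact input encoding are not reproduced here.

```python
# Minimal sketch: treat table-to-statement extraction as seq2seq
# translation. t5-small (untuned) and the linearization scheme are
# illustrative assumptions, not the paper's models.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def linearize(table):
    """Flatten a table (list of rows) into a single text sequence."""
    return " | ".join(" ; ".join(str(cell) for cell in row) for row in table)

def extract_statements(table):
    """Translate a linearized table into statement text."""
    inputs = tokenizer("extract statements: " + linearize(table),
                       return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

table = [["KPI", "2022", "2023"],
         ["Scope 1 emissions (tCO2e)", "1200", "1100"]]
print(extract_statements(table))
```

The appeal of this framing is that the output schema is fixed regardless of the input table's layout, which is what makes the downstream exploratory analysis across thousands of heterogeneous ESG tables possible.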

2022

Incorporating Linguistic Knowledge for Abstractive Multi-document Summarization
Congbo Ma | Wei Emma Zhang | Hu Wang | Shubham Gupta | Mingyu Guo
Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation

2021

Active² Learning: Actively reducing redundancies in Active Learning methods for Sequence Tagging and Machine Translation
Rishi Hazra | Parag Dutta | Shubham Gupta | Mohammed Abdul Qaathir | Ambedkar Dukkipati
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

While deep learning is a powerful tool for natural language processing (NLP) problems, successful solutions to these problems rely heavily on large amounts of annotated samples. However, manually annotating data is expensive and time-consuming. Active Learning (AL) strategies reduce the need for huge volumes of labeled data by iteratively selecting a small number of examples for manual annotation based on their estimated utility in training the given model. In this paper, we argue that since AL strategies choose examples independently, they may potentially select similar examples, all of which may not contribute significantly to the learning process. Our proposed approach, Active² Learning (A²L), actively adapts to the deep learning model being trained to eliminate such redundant examples chosen by an AL strategy. We show that A²L is widely applicable by using it in conjunction with several different AL strategies and NLP tasks. We empirically demonstrate that the proposed approach is further able to reduce the data requirements of state-of-the-art AL strategies by 3-25% on an absolute scale on multiple NLP tasks while achieving the same performance with virtually no additional computation overhead.
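
One simple way to realize the redundancy-elimination idea is to cluster the AL strategy's candidate batch in the task model's own representation space and annotate only one example per cluster. The sketch below does exactly that; the `embed` helper and the choice of k-means are illustrative assumptions, not necessarily the paper's formulation.

```python
# Minimal sketch of redundancy reduction in an AL batch: cluster the
# candidates selected by an AL strategy and keep one per cluster.
# `embed` and the use of k-means are illustrative choices.
import numpy as np
from sklearn.cluster import KMeans

def embed(model, examples) -> np.ndarray:
    """Placeholder: encode examples with the model being trained."""
    raise NotImplementedError

def deduplicate_batch(model, candidates, budget: int):
    """Pick `budget` mutually dissimilar examples from AL candidates."""
    vectors = embed(model, candidates)
    clusters = KMeans(n_clusters=budget, n_init=10).fit(vectors)
    selected = []
    for k in range(budget):
        members = np.flatnonzero(clusters.labels_ == k)
        # Keep the candidate closest to each cluster centroid.
        center = clusters.cluster_centers_[k]
        best = members[np.argmin(
            np.linalg.norm(vectors[members] - center, axis=1))]
        selected.append(candidates[best])
    return selected
```

Because the embeddings come from the model currently being trained, the notion of "similar" adapts as training progresses, which matches the abstract's point that A²L actively adapts to the model rather than filtering on fixed surface features.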