Lokesh Mishra

2025

Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems
Rafael Teixeira de Lima | Shubham Gupta | Cesar Berrospi Ramis | Lokesh Mishra | Michele Dolfi | Peter Staar | Panagiotis Vagenas
Proceedings of the 31st International Conference on Computational Linguistics: Industry Track

Retrieval Augmented Generation (RAG) systems are a widespread application of Large Language Models (LLMs) in the industry. While many tools exist empowering developers to build their own systems, measuring their performance locally, with datasets reflective of the system’s use cases, is a technological challenge. Solutions to this problem range from non-specific and cheap (most public datasets) to specific and costly (generating data from local documents). In this paper, we show that using public question and answer (Q&A) datasets to assess retrieval performance can lead to non-optimal systems design, and that common tools for RAG dataset generation can lead to unbalanced data. We propose solutions to these issues based on the characterization of RAG datasets through labels and through label-targeted data generation. Finally, we show that fine-tuned small LLMs can efficiently generate Q&A datasets. We believe that these observations are invaluable to the know-your-data step of RAG systems development.

2024

pdf bib abs

Statements: Universal Information Extraction from Tables with Large Language Models for ESG KPIs
Lokesh Mishra | Sohayl Dhibi | Yusik Kim | Cesar Berrospi Ramis | Shubham Gupta | Michele Dolfi | Peter Staar
Proceedings of the 1st Workshop on Natural Language Processing Meets Climate Change (ClimateNLP 2024)

Environment, Social, and Governance (ESG) KPIs assess an organization’s performance on issues such as climate change, greenhouse gas emissions, water consumption, waste management, human rights, diversity, and policies. ESG reports convey this valuable quantitative information through tables. Unfortunately, extracting this information is difficult due to high variability in the table structure as well as content. We propose Statements, a novel domain agnostic data structure for extracting quantitative facts and related information. We propose translating tables to statements as a new supervised deep-learning universal information extraction task. We introduce SemTabNet - a dataset of over 100K annotated tables. Investigating a family of T5-based Statement Extraction Models, our best model generates statements which are 82% similar to the ground-truth (compared to baseline of 21%). We demonstrate the advantages of statements by applying our model to over 2700 tables from ESG reports. The homogeneous nature of statements permits exploratory data analysis on expansive information found in large collections of ESG reports.

Co-authors

Sohayl Dhibi 1

Yusik Kim 1

Panagiotis Vagenas 1

Venues

Fix author