Sandeep Tata

2025

SUMIE: A Synthetic Benchmark for Incremental Entity Summarization
Eunjeong Hwang | Yichao Zhou | Beliz Gunel | James Bradley Wendt | Sandeep Tata
Proceedings of the 31st International Conference on Computational Linguistics

No existing dataset adequately tests how well language models can incrementally update entity summaries – a crucial ability as these models rapidly advance. The Incremental Entity Summarization (IES) task is vital for maintaining accurate, up-to-date knowledge. To address this, we introduce SUMIE, a fully synthetic dataset designed to expose real-world IES challenges. This dataset addresses issues like incorrect entity association and incomplete information, capturing real-world complexity by generating diverse attributes, summaries, and unstructured paragraphs with 99% alignment accuracy between generated summaries and paragraphs. Extensive experiments demonstrate the dataset’s difficulty – state-of-the-art LLMs struggle to update summaries with an F1 higher than 80.4%. We will open-source the benchmark and the evaluation metrics to help the community make progress on IES tasks.

PRISM: Efficient Long-Range Reasoning With Short-Context LLMs
Dulhan Jayalath | James Bradley Wendt | Nicholas Monath | Sandeep Tata | Beliz Gunel
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Long-range tasks demand reasoning over long inputs. However, existing solutions are limited, e.g., long-context models require large compute budgets, parameter-efficient fine-tuning (PEFT) needs training data, and retrieval-augmented generation (RAG) entails complex task-specific designs. Though in-context approaches overcome many of these issues, methods with short-context LLMs are inefficient, trading context for processing more tokens. We introduce **PRISM**, a highly token-efficient in-context method based on structured schemas that outperforms baselines on diverse tasks with **4x shorter contexts**. This approach produces concise outputs and efficiently leverages key-value (KV) caches to **reduce costs by up to 54%**. PRISM scales down to tiny contexts without increasing costs or sacrificing quality, and generalizes to new tasks with minimal effort by generating schemas from task descriptions.

2024

Large language models (LLMs) often struggle with processing extensive input contexts, which can lead to redundant, inaccurate, or incoherent summaries. Recent methods have used unstructured memory to incrementally process these contexts, but they still suffer from information overload due to the volume of unstructured data handled. In our study, we introduce structured knowledge representations (GU_json), which significantly improve summarization performance by 40% and 14% across two public datasets. Most notably, we propose the Chain-of-Key strategy (CoK_json) that dynamically updates or augments these representations with new information, rather than recreating the structured memory for each new source. This method further enhances performance by 7% and 4% on the datasets.

2023

Selective Labeling: How to Radically Lower Data-Labeling Costs for Document Extraction Models
Yichao Zhou | James Bradley Wendt | Navneet Potti | Jing Xie | Sandeep Tata
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Building automatic extraction models for visually rich documents like invoices, receipts, bills, tax forms, etc. has received significant attention lately. A key bottleneck in developing extraction models for new document types is the cost of acquiring the several thousand high-quality labeled documents that are needed to train a model with acceptable accuracy. In this paper, we propose selective labeling as a solution to this problem. The key insight is to simplify the labeling task to provide “yes/no” labels for candidate extractions predicted by a model trained on partially labeled documents. We combine this with a custom active learning strategy to find the predictions that the model is most uncertain about. We show through experiments on document types drawn from 3 different domains that selective labeling can reduce the cost of acquiring labeled data by 10× with a negligible loss in accuracy.

2020

Representation Learning for Information Extraction from Form-like Documents
Bodhisattwa Prasad Majumder | Navneet Potti | Sandeep Tata | James Bradley Wendt | Qi Zhao | Marc Najork
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We propose a novel approach using representation learning for tackling the problem of extracting structured information from form-like document images. We propose an extraction system that uses knowledge of the types of the target fields to generate extraction candidates and a neural network architecture that learns a dense representation of each candidate based on neighboring words in the document. These learned representations are not only useful in solving the extraction task for unseen document templates from two different domains but are also interpretable, as we show using loss cases.

Co-authors

Jing Xie 2

Dulhan Jayalath 1

Bodhisattwa Prasad Majumder 1

Nicholas Monath 1

Marc Najork 1

Nguyen Vo 1

Qi Zhao 1
