Petr Babkin

2025

The field of visually rich document understanding (VRDU) aims to solve a multitude of well-researched NLP tasks in the multi-modal domain. Several datasets exist for research on specific tasks of VRDU, such as document classification (DC), key entity extraction (KEE), entity linking, and visual question answering (VQA), among others. These datasets cover documents like invoices and receipts with sparse annotations such that they support one or two co-related tasks (e.g., entity extraction and entity linking). Unfortunately, focusing on only a single type of document or task is not representative of how documents often need to be processed in the wild, where variety in style and requirements is expected. In this paper, we introduce BuDDIE: Business Document Dataset for Information Extraction, the first multi-task dataset of 1665 real-world business documents that contains rich and dense annotations for DC, KEE, and VQA. Our dataset consists of publicly available business entity documents from US state government websites. The documents are structured and vary in their style and layout across states and types (e.g., forms, certificates, reports). We provide data variety and quality metrics for BuDDIE as well as a series of baselines for each task. Our baselines cover traditional textual, multi-modal, and large language model approaches to VRDU.

2024

Enterprise documents, such as forms, receipts, reports, and other such records, often carry rich semantics at the intersection of textual and spatial modalities. The visual cues offered by their complex layouts play a crucial role in comprehending these documents effectively. In this paper, we present DocLLM, a lightweight extension to traditional large language models (LLMs) for reasoning over visual documents, taking into account both textual semantics and spatial layout. Our model differs from existing multimodal LLMs in that it avoids expensive image encoders and focuses exclusively on bounding box information to incorporate the spatial layout structure. Specifically, the cross-alignment between text and spatial modalities is captured by decomposing the attention mechanism in classical transformers into a set of disentangled matrices. Furthermore, we devise a pre-training objective that learns to infill text segments. This approach allows us to address irregular layouts and heterogeneous content frequently encountered in visual documents. The pre-trained model is fine-tuned using a large-scale instruction dataset, covering four core document intelligence tasks. We demonstrate that our solution outperforms SotA LLMs on 14 out of 16 datasets across all tasks, and generalizes well to 4 out of 5 previously unseen datasets.

ReportGPT: Human-in-the-loop Verifiable Table-to-Text Generation
Lucas Cecchi | Petr Babkin
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track

Recent developments in the quality and accessibility of large language models have precipitated a surge in user-facing tools for content generation. Motivated by a necessity for human quality control of these systems, we introduce ReportGPT: a pipeline framework for verifiable human-in-the-loop table-to-text generation. ReportGPT is based on a domain-specific language, which acts as a proof mechanism for generating verifiable commentary. This allows users to quickly check the relevancy and factuality of model outputs. User selections then become few-shot examples for improving the performance of the pipeline. We configure 3 approaches to our pipeline, and find that usage of language models in ReportGPT’s components trades off precision for more insightful downstream commentary. Furthermore, ReportGPT learns from human feedback in real-time, needing only a few samples to improve performance.

2017

Fast Forward Through Opportunistic Incremental Meaning Representation Construction
Petr Babkin | Sergei Nirenburg
Proceedings of ACL 2017, Student Research Workshop

2016

Detection and Resolution of Verb Phrase Ellipsis
Marjorie McShane | Petr Babkin
Linguistic Issues in Language Technology, Volume 13, 2016

Verb phrase (VP) ellipsis is the omission of a verb phrase whose meaning can be reconstructed from the linguistic or real-world context. It is licensed in English by auxiliary verbs, often modal auxiliaries: She can go to Hawaii but he can’t [e]. This paper describes a system called ViPER (VP Ellipsis Resolver) that detects and resolves VP ellipsis, relying on linguistic principles such as syntactic parallelism, modality correlations, and the delineation of core vs. peripheral sentence constituents. The key insight guiding the work is that not all cases of ellipsis are equally difficult: some can be detected and resolved with high confidence even before we are able to build systems with human-level semantic and pragmatic understanding of text.

2014

Nominal Compound Interpretation by Intelligent Agents
Marjorie McShane | Stephen Beale | Petr Babkin
Linguistic Issues in Language Technology, Volume 10, 2014

This paper presents a cognitively-inspired algorithm for the semantic analysis of nominal compounds by intelligent agents. The agents, modeled within the OntoAgent environment, are tasked to compute a full context-sensitive semantic interpretation of each compound using a battery of engines that rely on a high-quality computational lexicon and ontology. Rather than being treated as an isolated “task”, as in many NLP approaches, nominal compound analysis in OntoAgent represents a minimal extension to the core process of semantic analysis. We hypothesize that seeking similarities across language analysis tasks reflects the spirit of how people approach language interpretation, and that this approach will make feasible the long-term development of truly sophisticated, human-like intelligent agents. The initial evaluation of our approach suggests that many nominal compounds are fixed expressions, requiring individual semantic specification at the lexical level.