PaperMage: A Unified Toolkit for Processing, Representing, and Manipulating Visually-Rich Scientific Documents



Introduction
Research papers and textbooks are central to the scientific enterprise, and there is increasing interest in developing new tools for extracting knowledge from these visually-rich documents. Recent research has explored, for example, AI-powered reading support for math symbol definitions (Head et al., 2021), in-situ passage explanations or summaries (August et al., 2023; Rachatasumrit et al., 2022; Kim et al., 2023), automatic span highlighting (Chang et al., 2023; Fok et al., 2023b), interactive clipping and synthesis (Kang et al., 2022, 2023), and more. Further, extracting clean, properly-structured scientific text from PDF documents (Lo et al., 2020; Wang et al., 2020) forms a critical first step in pretraining language models of science (Beltagy et al., 2019; Lee et al., 2019; Gu et al., 2020; Luo et al., 2022; Taylor et al., 2022; Trewartha et al., 2022; Hong et al., 2023), automatic generation of more accessible paper formats (Wang et al., 2021), and developing datasets for scientific natural language processing (NLP) tasks over structured full text (Jain et al., 2020; Subramanian et al., 2020; Dasigi et al., 2021; Lee et al., 2023).

Figure 1: papermage's document creation and representation. (A) Recipes are turn-key methods for processing a PDF. (B) They compose models operating across different data modalities and machine learning frameworks to extract document structure, which we conceptualize as layers of annotation that store textual and visual information. (C) Users can access and manipulate layers.
However, this type of NLP research on scientific corpora is difficult because the documents come in difficult-to-use formats like PDF, and existing tools for working with them are limited. Typically, the first step in scientific document processing is to invoke a parser on a document file to convert it into a sequence of tokens and bounding boxes in inferred reading order. Parsers extract only the raw document content; obtaining richer document structure (e.g., titles, authors, figures) or linguistic structure and semantics (e.g., sentences, discourse units, scientific claims) requires sending the token sequence through downstream models.
Unlike more mature parsers (§2.1), these downstream models are often research prototypes (§2.2) that extract only a subset of the structures needed for one's research (e.g., the same model may not provide both sentence splits and figure detection). As a result, users must write extensive custom code that strings together pipelines of multiple models. Research projects using models of different modalities (e.g., combining an image-based formula detector with a text-based definition extractor) can require hundreds of lines of code.

We introduce papermage, an open-source Python toolkit for processing scientific documents. Its contributions include (1) magelib, a library of primitives and methods for representing and manipulating visually-rich documents as multimodal constructs, (2) Predictors, a set of implementations that integrate different state-of-the-art scientific document analysis models into a unified interface, even if individual models are written in different frameworks or operate on different modalities, and (3) Recipes, which provide turn-key access to well-tested combinations of individual (often single-modality) modules to form sophisticated, extensible multimodal pipelines.

Turn-key software for scientific documents
Processing visually-rich documents like scientific papers requires a joint understanding of both visual and textual information. In practice, this often means combining different models into complex processing pipelines. For example, GROBID (Grobid, 2008-2023), a widely-adopted software tool for scientific document processing, uses twelve interdependent sequence labeling models [3] to perform its full-text extraction. Other similar tools include CERMINE (Tkaczyk et al., 2015) and ParsCit (Councill et al., 2008). While such software is often an ideal choice for off-the-shelf processing, it is not necessarily designed for easy extension or integration with newer research models. [4]

Models for scientific document processing
While the aforementioned software tools use CRF- or BiLSTM-based models, Transformer-based models have seen wide adoption among NLP researchers for their powerful processing capabilities. Recent years have seen the rise of layout-infused Transformers (Xu et al., 2019; Shen et al., 2022; Xu et al., 2021; Huang et al., 2022b; Chen et al., 2023) for processing visually-rich documents, including recovering the logical structure (e.g., titles, abstracts) of scientific papers (Huang et al., 2022a). Similarly, computer vision (CV) researchers have shown impressive capabilities of CNN-based object detection models (Ren et al., 2015; Tan et al., 2020) for segmenting visually-rich documents based on their layout. While these research models are powerful and extensible for research purposes, it often requires significant "glue" code and stitching of software tools to build robust processing pipelines around them. For example, Lincker et al. (2023) bootstrap a sophisticated processing pipeline around a research model for processing children's textbooks.

Combining models and pipelines
papermage's use case lies between that of turn-key software and a framework for supporting research. Similar to the Transformers library (Wolf et al., 2020), which integrates different research models into standard interfaces, others have done the same for the visually-rich document domain. LayoutParser (Shen et al., 2021) provides models for visually-rich documents and supports the creation of document processing pipelines. papermage, in fact, depends on LayoutParser for access to vision models, but is designed to also integrate text models, which LayoutParser omits. To allow models of different modalities to work well together, we also developed the magelib library (§3.1).

[3] https://grobid.readthedocs.io/en/latest/Training-the-models-of-Grobid/#models
[4] Most research in NLP requires that a researcher be able to manipulate models within Python. Yet Grobid requires users to manage a separate service process and send PDFs through a client. In performing the evaluation in §3.3, we also found it difficult to run the model components in isolation from Grobid's PDF utilities, which makes comparison with other research models challenging without significant "glue" code.
Design of papermage
papermage has three parts: (1) magelib, a library for intuitively representing and manipulating visually-rich documents, (2) Predictors, implementations of models for analyzing scientific papers that unify disparate machine learning frameworks under a common interface, and (3) Recipes, combinations of Predictors that form multimodal pipelines.

Representing and manipulating visually-rich documents with magelib
In this section, we use code snippets to show how our library's abstractions and syntax are tailored for the visually-rich document problem domain.
Data Classes. magelib provides three base data classes for representing the fundamental elements of visually-rich, structured documents: Document, Layers, and Entities. First, a Document might minimally store text as a string of symbols:

    >>> from papermage import Document
    >>> doc.symbols
    "Revolt: Collaborative Crowdsourcing ..."

But visually-rich documents are more than a linearized string. For example, analyzing a scientific paper requires access to its visuospatial layout (e.g., pages, blocks, lines), logical structure (e.g., title, abstract, figures, tables, footnotes, sections), semantic units (e.g., paragraphs, sentences, tokens), and more (e.g., citations, terms). In practice, this means different parts of doc.symbols can correspond to different paragraphs, sentences, tokens, etc. in the Document, each with its own set of coordinates representing its visual position on a page. magelib represents structure using Layers that can be accessed as attributes of a Document (e.g., doc.sentences, doc.figures, doc.tokens) (Figure 1). Each Layer is a sequence of content units, called Entities, which store both textual (e.g., spans, strings) and visuospatial (e.g., bounding boxes, pixel arrays) information. See Figure 2 for an example of how sentences in a scientific document are represented as Entities. Section §3.2 explains in more detail how a user can generate Entities.
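To make these abstractions concrete, the following is a minimal, self-contained sketch of the Document/Layer/Entity data model in plain Python. The class bodies are illustrative stand-ins, not magelib's actual implementation; only the names (Document, Entity, spans, boxes) mirror those in the text, and the text_of helper is a hypothetical convenience added here for demonstration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Span:
    start: int  # index into Document.symbols
    end: int

@dataclass
class Box:
    left: float
    top: float
    width: float
    height: float
    page: int

@dataclass
class Entity:
    # An Entity carries both textual (spans) and visuospatial (boxes) info.
    spans: List[Span] = field(default_factory=list)
    boxes: List[Box] = field(default_factory=list)

class Document:
    def __init__(self, symbols: str):
        self.symbols = symbols
        self._layers = {}

    def annotate(self, **layers):
        # Each named layer becomes an attribute (e.g., doc.tokens).
        for name, entities in layers.items():
            self._layers[name] = entities
            setattr(self, name, entities)

    def text_of(self, entity: Entity) -> str:
        # Recover an entity's text by slicing the symbols with its spans.
        return " ".join(self.symbols[s.start:s.end] for s in entity.spans)

doc = Document("Revolt: Collaborative Crowdsourcing ...")
tokens = [Entity(spans=[Span(0, 7)]), Entity(spans=[Span(8, 21)])]
doc.annotate(tokens=tokens)
print(doc.text_of(doc.tokens[0]))  # -> "Revolt:"
```

Note how entities store only character offsets, not copies of the text: the symbols string remains the single source of truth, and layers merely index into it.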
Methods. magelib also provides a set of functions for building and interacting with data: augmenting a Document with additional Layers, traversing and spatially searching for matching Entities in one Layer, and cross-referencing between Layers (see Figure 3).
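Cross-referencing between layers can be understood as finding entities in one layer whose character spans overlap those of an entity in another layer. The sketch below illustrates that idea with plain dicts and a simplified overlap test; papermage exposes this through attribute access (e.g., the sentence's tokens), so the helper function here is a hypothetical stand-in, not the library's API.

```python
def overlaps(a, b):
    """Two (start, end) character spans overlap if they share any symbol."""
    return a[0] < b[1] and b[0] < a[1]

# Toy layers over a shared symbols string, represented as span dicts.
sentences = [{"span": (0, 40)}, {"span": (41, 90)}]
tokens = [{"span": (0, 6)}, {"span": (7, 20)}, {"span": (41, 50)}]

def tokens_in(sentence):
    # Cross-reference: collect all tokens overlapping this sentence.
    return [t for t in tokens if overlaps(sentence["span"], t["span"])]

print(len(tokens_in(sentences[0])))  # 2 tokens fall inside the first sentence
```

Because overlap, not containment, is the underlying relation, this style of lookup also works when two layers do not strictly nest (see §A.1).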
A Document that only contains doc.symbols can be augmented with additional Layers:

    >>> paragraphs = Layer(...)
    >>> sentences = Layer(...)
    >>> tokens = Layer(...)
Protocols and Utilities. To instantiate a Document, magelib provides protocols and utilities like Parsers and Rasterizers, which hook into off-the-shelf PDF processing tools. For example, papermage runs PDF2TextParser (using pdfplumber) to extract the textual information from a PDF file.
Then it runs PDF2ImageRasterizer (using pdf2image) to update the same Document with images of its pages.
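The two-step protocol described above can be sketched as follows. The class names mirror those in the text (PDF2TextParser, PDF2ImageRasterizer), but the bodies here are illustrative stand-ins rather than papermage's code: a real parser would call pdfplumber and a real rasterizer would call pdf2image.

```python
class PDF2TextParser:
    """Stand-in for a parser that extracts symbols from a PDF."""
    def parse(self, pdf_path):
        # In papermage this delegates to pdfplumber; mocked here.
        return {"symbols": "Revolt: Collaborative Crowdsourcing ...", "images": None}

class PDF2ImageRasterizer:
    """Stand-in for a rasterizer that renders each page to an image."""
    def rasterize(self, pdf_path, dpi=72):
        # In papermage this delegates to pdf2image; mocked here.
        return ["<page-1 image>", "<page-2 image>"]

parser, rasterizer = PDF2TextParser(), PDF2ImageRasterizer()
doc = parser.parse("paper.pdf")               # step 1: textual information
doc["images"] = rasterizer.rasterize("paper.pdf")  # step 2: visual information
print(len(doc["images"]))  # 2
```

The design point is that parsing and rasterizing are separate protocols updating one shared document, so either can be swapped independently (e.g., GrobidParser in §A.1).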

Interfacing with models for scientific document analysis through Predictors
In §3.1, we described how users create Layers by assembling collections of Entities. But how would they make Entities in the first place? For example, to identify multimodal structures in visually-rich documents, researchers might want to build complex pipelines that run and combine output from many different models (e.g., computer vision models for extracting figures, NLP models for classifying body text). papermage provides a unified interface, called Predictors, to ensure models produce Entities that are compatible with the Document. papermage includes several ready-to-use Predictors that leverage state-of-the-art models to extract specific document structures (Table 1). While magelib's abstractions are general to visually-rich documents, Predictors are optimized for parsing scientific documents. They are designed to (1) be compatible with models from many different machine learning frameworks, (2) support inference with text-only, vision-only, and multimodal models, and (3) support both adaptation of off-the-shelf, pretrained models as well as development of new ones from scratch.

[Table 1 fragment: e.g., Linguistic/Semantic Predictors segment a document into text units often used for downstream models.]
As many practitioners depend on prompting a model through an API call, we implement APIPredictor, which interfaces with external APIs, such as GPT-3 (Brown et al., 2020), to perform tasks like question answering over a structured Document. We also implement SnippetRetrievalPredictor, which wraps models like Contriever (Izacard et al., 2022) to perform top-k within-document snippet retrieval. See §4 for how these two can be combined.

Similarly to the Transformers library, a Predictor's implementation is typically independent from its configuration, allowing users to customize each Predictor by tweaking hyperparameters or loading a different set of weights. Below, we showcase how a vision model and two text models (both neural and symbolic) can be applied in succession to a single Document. See Table 1 for a summary of supported Predictors.
Predictors return a list of Entities, which can be grouped via group_by() based on predicted label value (e.g., tokens classified as "title" or "authors"). Finally, these predictions are passed to doc.annotate() to be added to the Document.
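The grouping step can be sketched as below; the entities are plain dicts rather than papermage Entities, and this group_by is a simplified stand-in for the behavior the text describes.

```python
from collections import defaultdict

# Toy predictor output: entities labeled by a token classification model.
predictions = [
    {"label": "title", "span": (0, 30)},
    {"label": "authors", "span": (31, 60)},
    {"label": "authors", "span": (61, 80)},
]

def group_by(entities, key="label"):
    """Bucket predicted entities by their label value."""
    groups = defaultdict(list)
    for e in entities:
        groups[e[key]].append(e)
    return dict(groups)

groups = group_by(predictions)
print(sorted(groups))          # ['authors', 'title']
print(len(groups["authors"]))  # 2
```

Each resulting group can then be annotated onto the document as its own layer (e.g., doc.titles, doc.authors).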

End-to-end processing with Recipes
Finally, papermage provides predefined combinations of Predictors, called Recipes, for users seeking high-quality options for turn-key processing of visually-rich documents. Recipes can also be flexibly modified to support development. For example, our current default combines the pdfplumber PDF parsing utility with the I-VILA (Shen et al., 2022) research model. We show in Table 2 an evaluation comparing this against the same recipe configured to (1) swap I-VILA for a RoBERTa model, and (2) swap both for Grobid API calls. We expect Recipes to appeal to two groups of users: end-to-end consumers, and developers of high-level applications. The former comprises developers and researchers looking for a one-step solution to multimodal scientific document analysis. The latter are developers and researchers looking to combine document structure primitives to build a complex application (see example in §4).

A case study: extending papermage
Lucy is studying how language models can be used to resolve questions that arise while reading a paper (e.g., What does this mean? or What does this refer to?). In her prototype interface, a user can highlight a passage in a PDF and ask a question about it. A retrieval model then finds relevant passages from the rest of the paper. The prototype then uses the text of the retrieved passages along with the user question to prompt a language model to generate an answer.
When presenting the answer to the user, the prototype also visually highlights the retrieved passages as supporting evidence for the generated answer.
Getting started quickly. As a researcher proficient in Python, it takes Lucy only minutes to install papermage using pip and successfully process a local PDF file by following the example code snippet for CoreRecipe in §3.2. In an interactive session, she familiarizes herself with the provided Layers by following the traversal, cross-referencing, and querying examples in §3.1. She makes sure she can serialize and re-instantiate her Document (§A.2).
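A Recipe, conceptually, is an ordered pipeline of predictors that each annotate a new layer onto a shared document. The sketch below is a toy stand-in for papermage's CoreRecipe, using a dict-based "document" and lambda "predictors"; none of it is the library's actual implementation.

```python
class Recipe:
    """Illustrative pipeline: apply predictors in order, each adding a layer."""
    def __init__(self, predictors):
        self.predictors = predictors  # ordered (layer_name, predictor) pairs

    def run(self, doc):
        for name, predictor in self.predictors:
            doc[name] = predictor(doc)  # annotate doc with a new layer
        return doc

# Stand-in predictors operating on a dict-based "document".
tokenize = lambda doc: doc["symbols"].split()
count_tokens = lambda doc: len(doc["tokens"])  # depends on the earlier layer

recipe = Recipe([("tokens", tokenize), ("n_tokens", count_tokens)])
doc = recipe.run({"symbols": "Revolt : Collaborative Crowdsourcing"})
print(doc["n_tokens"])  # 4
```

Because later predictors read layers produced by earlier ones, swapping one component (as in the Table 2 evaluation) leaves the rest of the pipeline untouched.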
Formatting input. Before using papermage, Lucy had prior experience building QA pipelines, but had only dealt with documents as sentence-split text data (e.g., List[str]). Lucy realizes that she can reuse her prior text-only code with papermage by implementing a couple of wrappers to gain additional capabilities. First, she converts a user's highlighted passage from a visual selection to text, following the example in Figure 3F.
Next, she converts the Document to her required text format by following the traversal examples in §3.1 (e.g., using [s.text for s in doc.sentences]). Within a few lines of code, Lucy has everything she needs for text-only input to her QA pipeline.
Formatting output. Lucy runs her QA system on her newly acquired text data and now has (1) a model-generated answer and (2) several retrieved evidence passages. She realizes that she already has access to the evidence passages' bounding boxes via a call similar to how she defined the model input context (e.g., [s.boxes for s in doc.sentences]). She can easily pass these to the user interface to enable linking to and highlighting of those passages.
Defining a Predictor. The pattern Lucy has followed is used in many of our Predictor implementations: (1) gain access to text by traversing Layers (e.g., sentences), (2) perform all the usual NLP computation on that text, and (3) format model output as Entities. This simple pattern allows users to reuse familiar models in existing frameworks and eschews lengthy onboarding to papermage. Lucy wraps her prompting and retrieval code in new classes: APIPredictor and SnippetRetrievalPredictor (see Table 1).
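The three-step pattern can be sketched as follows. Here a toy keyword matcher stands in for Lucy's retrieval and prompting code, and the Document and entity shapes are simplified mocks, not papermage's classes.

```python
class Document:
    """Mock document exposing a sentences layer."""
    def __init__(self, sentences):
        self.sentences = sentences  # each: {"text": ..., "span": ...}

class KeywordPredictor:
    """(1) traverse a layer, (2) run NLP on its text, (3) emit entities."""
    def __init__(self, keyword):
        self.keyword = keyword

    def predict(self, doc):
        entities = []
        for s in doc.sentences:                       # step 1: traverse
            if self.keyword in s["text"].lower():     # step 2: compute on text
                entities.append({"span": s["span"], "label": "match"})  # step 3
        return entities

doc = Document([
    {"text": "Crowdsourcing provides a scalable way to label data.", "span": (0, 52)},
    {"text": "We introduce Revolt.", "span": (53, 73)},
])
print(len(KeywordPredictor("revolt").predict(doc)))  # 1
```

Steps 1 and 3 are the only papermage-specific code; step 2 is whatever model or framework the user already has.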
Fast iterations. Leveraging the bounding box data from papermage to visually highlight the retrieved passages, Lucy suspects the retrieval component is underperforming. She makes a simple edit from doc.sentences to doc.paragraphs and evaluates system performance under different input granularities. She also notices the system often retrieves content outside the main body text. She restricts her traversal to filter out paragraphs that overlap with footnotes ([p.text for p in doc.paragraphs if len(p.footnotes) == 0]), making clever use of the cross-referencing functionality to detect when a paragraph actually comes from a footnote. This example demonstrates the versatility of the affordances provided by magelib.

Conclusion
In this work, we introduced papermage, an open-source Python toolkit for processing scientific documents. papermage was developed to supply high-quality data and reduce friction for research prototype development at Semantic Scholar. Today, it is used in the production PDF processing pipeline to provide data for both the literature graph (Ammar et al., 2018; Kinney et al., 2023) and the paper-reading interface (Lo et al., 2023). It has also been used in working research prototypes that have since contributed to research publications (Fok et al., 2023b; Kim et al., 2023). We open-source papermage in the hope that it will simplify research workflows that depend on scientific documents and promote extensions to other visually-rich documents like textbooks (Lincker et al., 2023) and digitized print media (Lee et al., 2020).

Ethical Considerations
As a toolkit primarily designed to process scientific documents, there are two areas where papermage could cause harm or have unintended effects.
Extraction of bibliographic information. papermage could be used to parse author names, affiliations, and emails from scientific documents. Like any software, this extraction can be noisy, leading to incorrect parsing and thus mis-attribution of manuscripts. Further, since papermage relies on static PDF documents rather than metadata dynamically retrieved from publishers, users of papermage need to consider how and when extracted names should no longer be associated with authors; failing to do so risks deadnaming, a harmful practice (Queer in AI et al., 2023). We recommend that papermage users exercise caution when using our toolkit to extract metadata, cross-reference extracted content with other sources when possible, and design systems such that authors have the ability to manually edit any data about themselves.
Misrepresentation or fabrication of information in documents. In §3, we discussed how papermage can be easily extended to support high-level applications. Such applications might include question-answering chatbots or AI summarizers that perform information synthesis over one or more papermage documents. These applications typically rely on generative models to produce their output, which might fabricate incorrect information or misstate claims. Developers should be vigilant when integrating papermage output into any downstream application, especially in systems that purport to represent information gathered from scientific publications.

A.1 Comparison and Compatibility with XML
One can view Layers as capturing content hierarchy (e.g., tokens vs. sentences) similar to that of other structured document representations, like TEI XML trees. We note, however, that Layers are stored as unordered attributes and don't require nesting. This allows for cross-layer referencing operations that don't adhere to strict nesting relationships. For example:

    for sentence in doc.sentences:
        for line in sentence.lines:
            ...

Recall that a sentence can begin or end midway through a line and cross multiple lines (§3.1). Similarly, not all lines are exactly contained within the boundaries of a sentence. As such, sentences and lines are not strictly nested within each other. This relationship would be difficult to encode in an XML format adhering to a document tree structure.
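The non-nesting relationship is easy to see with character spans: a sentence can cross multiple lines, and a line can touch multiple sentences, so neither layer contains the other. A minimal sketch, using toy spans rather than papermage objects:

```python
# Character spans for two toy layers over the same symbols string.
sentences = [(0, 50), (51, 120)]
lines = [(0, 30), (31, 80), (81, 120)]

def overlapping(span, layer):
    """All spans in `layer` that share at least one symbol with `span`."""
    return [o for o in layer if span[0] < o[1] and o[0] < span[1]]

# Sentence 0 crosses lines 0 and 1; line 1 touches both sentences.
print(len(overlapping(sentences[0], lines)))  # 2
print(len(overlapping(lines[1], sentences)))  # 2
```

A strict XML tree would have to pick one of the two layers as the parent, losing the other's boundaries; overlap-based cross-referencing keeps both intact.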
Regardless, the way we represent structure in documents is highly versatile. We demonstrate this by also implementing GrobidParser as an alternative to the PDF2TextParser in §3.1. GrobidParser invokes Grobid to process PDFs and reads the resulting TEI XML file generated by Grobid, converting each XML tag of a common level into an Entity in its own Layer. We use this to perform the evaluation in Table 2.
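The tag-to-layer conversion can be sketched with the standard library's XML parser: each tag name at a common level becomes its own layer of entities. The TEI snippet and the dict-based entities below are illustrative stand-ins, not GrobidParser's actual output format.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Toy TEI-like XML standing in for Grobid's output.
tei = "<TEI><p>First paragraph.</p><p>Second.</p><note>A footnote.</note></TEI>"

# Group each child element by tag name: one layer per tag.
layers = defaultdict(list)
for elem in ET.fromstring(tei):
    layers[elem.tag].append({"text": elem.text})

print(sorted(layers))    # ['note', 'p']
print(len(layers["p"]))  # 2
```

Once the tags are flattened into layers, the rest of the pipeline (cross-referencing, evaluation) can treat Grobid's output exactly like any other parser's.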

A.2 Additional magelib Protocols and Utilities
Serialization. Any Document and all of its Layers can be exported to a JSON format and perfectly reconstructed:

    import json
    with open("....json", "w") as f_out:
        json.dump(doc.to_json(), f_out)

A.3 Evaluation Details
Here, we detail how we performed the evaluation reported in §3.3 (Table 2). We also provide a full breakdown by category in Table 3. As described earlier in the paper, Grobid is quite difficult to evaluate because it is developed with tight coupling between its PDF parser (pdfalto) and the models it employs to perform logical structure recovery over the resulting token stream. As such, there is no straightforward way to run just the model components of Grobid on an alternative token stream like that provided in the S2-VL (Shen et al., 2022) dataset.
To perform this baseline evaluation, we ran the original PDFs that were annotated for S2-VL through our GrobidParser using Grobid v0.7.3. Grobid also returns bounding boxes for some predicted categories (e.g., authors, abstract, paragraphs). We use these bounding boxes to create Entities that we annotate onto a Document constructed manually from S2-VL data. Using magelib cross-layer referencing, we were able to match Grobid predictions to S2-VL data to perform this evaluation.
We found, however, that for certain categories bounding box information was either not available (e.g., titles) or Grobid simply did not return that output (e.g., figure text extraction). These are represented by zeros in Table 3, which contributes to the lower scores in Table 2 after macro-averaging. For a more apples-to-apples comparison, we also included a "Grobid Subset" evaluation restricted to just the categories in S2-VL for which Grobid produced bounding box information.
In addition to Grobid, we evaluate two of our provided Transformer-based models. The RoBERTa-large (Liu et al., 2019) model is a Transformers token classification model that we finetuned on the S2-VL training set. The I-VILA model is a layout-infused Transformer model pretrained by Shen et al. (2022) on the S2-VL training set. As with Grobid, we ran our CoreRecipe using these two models on the original PDFs in S2-VL and performed a similar token mapping operation, since our PDF2TextParser also produces a different token stream than that provided in S2-VL.
Overall, the Transformer-based models performed better at this task than Grobid. This is unsurprising given the expected improvements of a Transformer model over a CRF or BiLSTM; the Transformer models were also trained on S2-VL data, which gave them an advantage over Grobid. This evaluation was intended to show how papermage enables cross-system comparisons even in the face of token stream incompatibility, and to illustrate an upper bound on the performance left on the table by existing software systems that don't make use of state-of-the-art models.

Figure 2: Entities are multimodal content units. Here, spans of a sentence are used to identify its text among all symbols, while boxes map its visual coordinates on a page. spans and boxes can include non-contiguous units, giving Entities great flexibility in handling layout nuances: a sentence split across columns/pages and interrupted by floating figures/footnotes would require multiple spans and bounding boxes to represent.

Figure 3: (The figure reproduces content from the example Revolt paper; the caption of Revolt's embedded Figure 1 reads:) Revolt creates labels for unanimously labeled "certain" items (e.g., cats and not cats), and surfaces categories of "uncertain" items enriched with crowd feedback (e.g., cats and dogs and cartoon cats in the dotted middle region are annotated with crowd explanations). Rich structures allow label requesters to better understand concepts in the data and make post-hoc decisions on label boundaries (e.g., assigning cats and dogs to the cats label and cartoon cats to the not cats label) rather than providing crowdworkers with a priori label guidelines.

Table 1: Types of Predictors implemented in papermage.