PHD: Pixel-Based Language Modeling of Historical Documents

The digitisation of historical documents has provided historians with unprecedented research opportunities. Yet, the conventional approach to analysing historical documents involves converting them from images to text using OCR, a process that overlooks the potential benefits of treating them as images and introduces high levels of noise. To bridge this gap, we take advantage of recent advancements in pixel-based language models trained to reconstruct masked patches of pixels instead of predicting token distributions. Due to the scarcity of real historical scans, we propose a novel method for generating synthetic scans to resemble real historical documents. We then pre-train our model, PHD, on a combination of synthetic scans and real historical newspapers from the 1700-1900 period. Through our experiments, we demonstrate that PHD exhibits high proficiency in reconstructing masked image patches and provide evidence of our model's noteworthy language understanding capabilities. Notably, we successfully apply our model to a historical QA task, highlighting its usefulness in this domain.


Introduction
Recent years have seen a boom in efforts to digitise historical documents in numerous languages and sources (Chadwyck, 1998; Groesen, 2015; Moss, 2009), leading to a transformation in the way historians work. Researchers are now able to expedite the analysis process of vast historical corpora using NLP tools, thereby enabling them to focus on interpretation instead of the arduous task of evidence collection (Laite, 2020; Gerritsen, 2012).
The primary step in most NLP tools tailored for historical analysis involves Optical Character Recognition (OCR). However, this approach poses several challenges and drawbacks. First, OCR strips away any valuable contextual meaning embedded within non-textual elements, such as page layout, fonts, and figures.1 Moreover, historical documents present numerous challenges to OCR systems, ranging from deteriorated pages, archaic fonts and language, the presence of non-textual elements, and occasional deficiencies in scan quality (e.g., blurriness), all of which introduce additional noise. Consequently, the extracted text is often riddled with errors at the character level (Robertson and Goldwater, 2018; Bollmann, 2019), which most large language models (LLMs) are not tuned to process. Token-based LLMs are especially sensitive to this, as the discrete structure of their input space cannot handle well the abundance of out-of-vocabulary words that characterise OCRed historical documents (Rust et al., 2023). Therefore, while LLMs have proven remarkably successful in modern domains, their performance is considerably weaker when applied to historical texts (Manjavacas and Fonteyn, 2022; Baptiste et al., 2021, inter alia). Finally, for many languages, OCR systems either do not exist or perform particularly poorly. As training new OCR models is laborious and expensive (Li et al., 2021a), the application of NLP tools to historical documents in these languages is limited.

*This paper shows dataset samples that are racist in nature.
This work addresses these limitations by taking advantage of recent advancements in pixel-based language modelling, with the goal of constructing a general-purpose, image-based and OCR-free language encoder of historical documents. Specifically, we adapt PIXEL (Rust et al., 2023), a language model that renders text as images and is trained to reconstruct masked patches instead of predicting a distribution over tokens. PIXEL's training methodology is highly suitable for the historical domain, as (unlike other pixel-based language models) it does not rely on a pretraining dataset composed of instances where the image and text are aligned. Fig 1 visualises our proposed training approach. Given the paucity of large, high-quality datasets comprising historical scans, we pretrain our model using a combination of 1) synthetic scans designed to faithfully resemble historical documents, produced using a novel method we propose for synthetic scan generation; and 2) real historical English newspapers published in the Caribbean in the 18th and 19th centuries. The resulting pixel-based language encoder, PHD (Pixel-based model for Historical Documents), is subsequently evaluated based on its comprehension of natural language and its effectiveness in performing question answering on historical documents.
We discover that PHD displays impressive reconstruction capabilities, being able to correctly predict both the form and content of masked patches of historical newspapers (§4.4). We also note the challenges concerning quantitatively evaluating these predictions. We provide evidence of our model's noteworthy language understanding capabilities, while exhibiting impressive resilience to noise. Finally, we demonstrate the usefulness of the model when applied to the historical QA task (§5.4).
To facilitate future research, we provide the dataset, models, and code at https://github.com/nadavborenstein/pixel-bw.

NLP for Historical Texts
Considerable efforts have been invested in improving both OCR accuracy (Li et al., 2021a; Smith, 2023) and text normalisation techniques for historical documents (Drobac et al., 2017; Robertson and Goldwater, 2018; Bollmann et al., 2018; Bollmann, 2019; Lyu et al., 2021), with the aim of aligning historical texts with their modern counterparts. However, these methods are not without flaws (Robertson and Goldwater, 2018; Bollmann, 2019), and any errors introduced during these preprocessing stages can propagate to downstream tasks (Robertson and Goldwater, 2018; Hill and Hengchen, 2019). As a result, historical texts remain a persistently challenging domain for NLP research (Lai et al., 2021; De Toni et al., 2022; Borenstein et al., 2023b). Here, we propose a novel approach to overcome the challenges associated with OCR in historical material, by employing an image-based language model capable of directly processing historical document scans, effectively bypassing the OCR stage.

Pixel-based Models for NLU
Extensive research has been conducted on models for processing text embedded in images. Most existing approaches incorporate OCR systems as an integral part of their inference pipeline (Appalaraju et al., 2021; Li et al., 2021b; Delteil et al., 2022). These approaches employ multimodal architectures where the input consists of both the image and the output generated by an OCR system.
Recent years have also witnessed the emergence of OCR-free approaches for pixel-based language understanding. Kim et al. (2022) introduce Donut, an image-encoder-text-decoder model for document comprehension. Donut is pretrained with the objective of extracting text from scans, a task they refer to as "pseudo-OCR". Subsequently, it is finetuned on various text generation tasks, reminiscent of T5 (Roberts et al., 2020). While architecturally similar to Donut, Dessurt (Davis et al., 2023) and Pix2Struct (Lee et al., 2022) were pretrained by masking image regions and predicting the text in both masked and unmasked image regions. Unlike our method, all the above-mentioned models predict in the text space rather than the pixel space. This presupposes access to a pretraining dataset composed of instances where the image and text are aligned. However, this assumption cannot hold for historical NLP, since OCR-independent ground-truth text for historical scans is often unprocurable and cannot be used for training purposes.
Text-free models that operate at the pixel level for language understanding are relatively uncommon. One notable exception is Li et al. (2022), which utilises Masked Image Modeling for pretraining on document patches. Nevertheless, their focus lies primarily on tasks that do not necessitate robust language understanding, such as table detection, document classification, and layout analysis. PIXEL (Rust et al., 2023), conversely, is a text-free pixel-based language model that exhibits strong language understanding capabilities, making it the ideal choice for our research. The subsequent section delves into a more detailed discussion of PIXEL and how we adapt it to our task.

Model
PIXEL We base PHD on PIXEL, a pretrained pixel-based encoder of language. PIXEL has three main components: a text renderer that draws texts as images, a pixel-based encoder, and a pixel-based decoder. The training of PIXEL is analogous to BERT (Devlin et al., 2019). During pretraining, input strings are rendered as images, and the encoder and the decoder are trained jointly to reconstruct randomly masked image regions from the unmasked context. During finetuning, the decoder is replaced with a suitable classification head, and no masking is performed. The encoder and decoder are based on the ViT-MAE architecture (He et al., 2022) and work at the patch level. That is, the encoder breaks the input image into patches of 16 × 16 pixels and outputs an embedding for each patch. The decoder then decodes these patch embeddings back into pixels. Therefore, random masking is also performed at the patch level.
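The patch decomposition described above can be sketched in a few lines. This is a minimal illustration assuming a single-channel image stored as nested lists (the real model operates on RGB tensors); `patchify` is a hypothetical helper name, not part of the PIXEL codebase:

```python
PATCH = 16  # PIXEL and PHD operate on 16 x 16 pixel patches

def patchify(image, patch=PATCH):
    """Split an H x W pixel grid (nested lists) into a row-major list of
    patch-sized blocks, mirroring how a ViT-style encoder tokenises its
    input. H and W must be multiples of the patch size."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0
    patches = []
    for py in range(0, h, patch):
        for px in range(0, w, patch):
            block = [row[px:px + patch] for row in image[py:py + patch]]
            patches.append(block)
    return patches
```

For PHD's 368 × 368 input this yields 23 × 23 = 529 patches, each receiving one embedding from the encoder.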
PHD We follow the same approach as PIXEL's pretraining and finetuning schemes. However, PIXEL's intended use is to process texts, not natural images; that is, the expected input to PIXEL is a string, not an image file. In contrast, we aim to use the model to encode real document scans. Therefore, we make several adaptations to PIXEL's training and data processing procedures to make it compatible with our use case (§4 and §5).
Most crucially, we alter the dimensions of the model's input: the text renderer of PIXEL renders strings as a long and narrow image with a resolution of 16 × 8464 pixels (corresponding to 1 × 529 patches), such that the resulting image resembles a ribbon with text. Each input character is no taller than 16 pixels and occupies roughly one patch. However, real document scans cannot be represented this way, as they have a natural two-dimensional structure and irregular fonts, as Fig 1a demonstrates (compare to Fig 17a in App C). Therefore, we set the input size of PHD to 368 × 368 pixels (or 23 × 23 patches).

Training a Pixel-Based Historical LM
We design PHD to serve as a general-purpose, pixel-based language encoder of historical documents. Ideally, PHD should be pretrained on a large dataset of scanned documents from various historical periods and locations. However, large, high-quality datasets of historical scans are not easily obtainable. Therefore, we propose a novel method for generating historical-looking artificial data from modern corpora (see subsection 4.1). We adapt our model to the historical domain by continuously pretraining it on a medium-sized corpus of real historical documents. Below, we describe the datasets and the pretraining process of the model.

Artificially Generated Pretraining Data
Our pretraining dataset consists of artificially generated scans of texts from the same sources that BERT used, namely the BookCorpus (Zhu et al., 2015) and the English Wikipedia.2 We generate the scans as follows.
We generate dataset samples on-the-fly, adopting a similar approach to Davis et al. (2023). First, we split the text corpora into paragraphs, using the new-line character as a delimiter. From a paragraph chosen at random, we pick a random spot and keep the text spanning from that spot to the paragraph's end. We also sample a random font and font size from a pre-defined list of fonts (from Davis et al. (2023)). The text span and the font are then embedded within an HTML template using the Python package Jinja,3 set to generate a Web page with dimensions that match the input dimension of the model.
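The sampling step above can be sketched as follows. The font names, size range, and template are hypothetical stand-ins (the paper uses Jinja templates and its own font list; this sketch uses Python's stdlib `string.Template` to stay self-contained):

```python
import random
from string import Template

# Hypothetical stand-ins for the pre-defined font list and HTML template.
FONTS = ["OldStandardTT", "IMFellEnglish", "EBGaramond"]
PAGE = Template(
    "<html><body style=\"width:368px;height:368px;"
    "font-family:'$font';font-size:${size}px\">$text</body></html>"
)

def make_page(paragraph, rng=random):
    """Pick a random start inside a paragraph, keep the span to its end,
    sample a random font and size, and embed the span in an HTML template
    sized to match the model's 368 x 368 input."""
    start = rng.randrange(len(paragraph))
    span = paragraph[start:]
    font = rng.choice(FONTS)
    size = rng.randint(12, 24)  # hypothetical font-size range
    return PAGE.substitute(font=font, size=size, text=span)
```

The resulting HTML page would then be rendered to a PNG (the paper uses WeasyPrint for this step) and degraded with the augmentations listed in App A.2.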

Real Historical Scans
We adapt PHD to the historical domain by continuously pretraining it on a medium-sized corpus of real historical newspapers published in the Caribbean in the 18th and 19th centuries (see App A.2 for how the newspaper pages are processed into model inputs).

Pretraining Procedure
Like PIXEL, the pretraining objective of PHD is to reconstruct the pixels in masked image patches. We randomly occlude 28% of the input patches with 2D rectangular masks, uniformly sampling their width and height from [2, 6] and [2, 4] patches, respectively.
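A minimal sketch of this 2D masking scheme follows. The grid size, mask ratio, and rectangle size ranges come from the text; the loop-until-target policy is an assumption, as the paper does not specify how the ratio is enforced:

```python
import random

GRID = 23        # PHD input is 23 x 23 patches (368 x 368 pixels)
MASK_RATIO = 0.28

def random_span_mask(grid=GRID, ratio=MASK_RATIO, seed=None):
    """Place random 2D rectangles (width in [2, 6], height in [2, 4]
    patches) until roughly `ratio` of the patch grid is occluded.
    Returns a grid x grid binary mask."""
    rng = random.Random(seed)
    mask = [[0] * grid for _ in range(grid)]
    target = int(ratio * grid * grid)
    masked = 0
    while masked < target:
        w, h = rng.randint(2, 6), rng.randint(2, 4)
        x, y = rng.randint(0, grid - w), rng.randint(0, grid - h)
        for r in range(y, y + h):
            for c in range(x, x + w):
                if not mask[r][c]:
                    mask[r][c] = 1
                    masked += 1
    return mask
```

Because whole rectangles are added at a time, the final coverage can overshoot the target by at most one rectangle (23 patches).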

Pretraining Results
Qualitative Evaluation. We begin by conducting a qualitative examination of the predictions made by our model. Unsurprisingly, prediction quality is high and the results are sharp for smaller masks and when words are only partially obscured. However, as the completions become longer, the text quality deteriorates, resulting in blurry text. It is important to note that evaluating these blurry completions presents a significant challenge. Unlike token-based models, where the presence of multiple words with high, similar likelihood can easily be detected by examining the discrete distribution, this is impossible with pixel-based models. In pixel-based completions, high-likelihood words may overlay and produce a blurry completion. Clear completions are only observed when a single word has a significantly higher probability than the others. This limitation is an area that we leave for future work. We now move to analyse PHD's ability to fill in single masked words. We randomly sample test scans and OCR them using Tesseract.7 Next, we randomly select a single word from the OCRed text and use Tesseract's word-to-image location functionality to (heuristically) mask the word from the image. Results are presented in Fig 4. Similar to our earlier findings, the reconstruction quality of single-word completions varies. Some completions are sharp and precise, while others appear blurry. In a few cases, the model produces a sharp reconstruction of an incorrect word (Fig 4c). Unfortunately, due to the blurry nature of many of the results (regardless of their correctness), a quantitative analysis of these results (e.g., OCRing the reconstructed patch and comparing it to the OCR output of the original patch) is unattainable.
Semantic Search. A possible useful application of PHD is semantic search, that is, searching a corpus for historical documents that are semantically similar to a concept of interest. We now analyse PHD's ability to assign similar historical scans similar embeddings. We start by taking a random sample of 1,000 images from our test set and embed them by averaging the patch embeddings of the final layer of the model. We then reduce the dimensionality of the embeddings with

Training for Downstream NLU Tasks
After obtaining a pretrained pixel-based language model adapted to the historical domain (§4), we now move to evaluate its understanding of natural language and its usefulness in addressing historically-oriented NLP tasks. Below, we describe the datasets we use for this and the experimental settings.

Language Understanding
We adapt the commonly used GLUE benchmark (Wang et al., 2018) to gauge our model's understanding of language. We convert GLUE instances into images, similar to the process described in §4.1. Given a GLUE instance with sentences s1 and s2 (s2 can be empty), we embed s1 and s2 into an HTML template, introducing a line break between the sentences. We then render the HTML files as images.
We generate two versions of this visual GLUE dataset: clean and noisy. The former is rendered using a single pre-defined font without applying degradations or augmentations, whereas the latter is generated with random fonts and degradations. Fig 6 presents a sample of each of the two dataset versions. While the first version allows us to measure PHD's understanding of language in "sterile" settings, we can use the second version to estimate the robustness of the model to noise common in historical scans.

Historical Question Answering
QA applied to historical datasets can be immensely valuable for historians (Borenstein et al., 2023a). Therefore, we assess PHD's potential for assisting historians with this important NLP task. We finetune the model on two novel datasets. The first is an adaptation of the classical SQuAD-v2 dataset (Rajpurkar et al., 2016), while the second is a genuine historical QA dataset.
SQuAD Dataset We formulate SQuAD-v2 as a patch classification task, as illustrated in Fig 11 in App C. Given a SQuAD instance with question q, context c, and answer a that is a span in c, we render c as an image, I (Fig 11a). Then, each patch of I is labelled with 1 if it contains a part of a, or 0 otherwise. This generates a binary label mask M for I, which our model tries to predict (Fig 11b). If any degradations or augmentations are later applied to I, we ensure that M is affected accordingly. Finally, similarly to Lee et al. (2022), we concatenate to I a rendering of q and crop the resulting image to the appropriate input size (Fig 11c).
Generating the binary mask M is not straightforward, as we do not know where a is located inside the generated image I. For this purpose, we first use Tesseract to OCR I, yielding ĉ. Next, we use fuzzy string matching to search for a within ĉ. If a match â ∈ ĉ is found, we use Tesseract to find the pixel coordinates of â within I. We then map the pixel coordinates to patch coordinates and label all the patches containing â with 1. In about 15% of cases, Tesseract fails to OCR I properly, and â cannot be found in ĉ, resulting in a higher proportion of SQuAD samples without an answer compared to the text-based version.
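The pixel-to-patch mapping in this labelling step can be sketched as follows; `answer_patch_mask` is a hypothetical helper name, and the bounding box is assumed to be in the (left, top, right, bottom) pixel format that Tesseract's word-location output can be converted to:

```python
PATCH = 16
GRID = 23  # 368 / 16

def answer_patch_mask(box, patch=PATCH, grid=GRID):
    """Given the answer's pixel bounding box (left, top, right, bottom),
    label with 1 every 16 x 16 patch that the box touches.
    Returns a grid x grid binary label mask."""
    left, top, right, bottom = box
    mask = [[0] * grid for _ in range(grid)]
    for row in range(top // patch, min((bottom - 1) // patch + 1, grid)):
        for col in range(left // patch, min((right - 1) // patch + 1, grid)):
            mask[row][col] = 1
    return mask
```

Note that a box ending exactly on a patch boundary (e.g., right = 32) should not spill into the next patch, hence the `(coord - 1) // patch + 1` upper bound.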
As with GLUE, we generate two versions of visual SQuAD, which we use to evaluate PHD's performance in both sterile and historical settings.
Historical QA Dataset Finally, we finetune PHD for a real historical QA task. For this, we use the English dataset scraped from the website of the Runaways Slaves in Britain project, a searchable database of over 800 newspaper adverts printed between 1700 and 1780, placed by enslavers who wanted to capture enslaved people who had self-liberated (Newman et al., 2019). Each ad was manually transcribed and annotated with more than 50 different attributes, such as the described gender and age, what clothes the enslaved person wore, and their physical description.
Following Borenstein et al. (2023a), we convert this dataset to match the SQuAD format: given an ad and an annotated attribute, we define the transcribed ad as the context c, the attribute as the answer a, and manually compose an appropriate question q. We process the resulting dataset similarly to how SQuAD is processed, with one key difference: instead of rendering the transcribed ad c as an image, we use the original ad scan. Therefore, we also do not introduce any noise to the images. See Figure 7 for an example instance. We reserve 20% of the dataset for testing.

Training Procedure
Similar to BERT, PHD is finetuned for downstream tasks by replacing the decoder with a suitable head. Tab 4 in App A.1 details the hyperparameters used to train PHD on the different GLUE tasks. We use the standard GLUE metrics to evaluate our model. Since GLUE is designed for models of modern English, we use this benchmark to evaluate a checkpoint of our model obtained after training on the artificial modern scans, but before training on the real historical scans. The same checkpoint is also used to evaluate PHD on SQuAD. Conversely, we use the final model checkpoint (after introducing the historical data) to finetune on the historical QA dataset: first, we train the model on the noisy SQuAD and subsequently finetune it on the Runaways dataset (see App A.1 for training details).
To evaluate our model's performance on the QA datasets, we employ several metrics. The primary metric is binary accuracy, which indicates whether the model agrees with the ground truth regarding the presence of an answer in the context. Additionally, we utilise patch-based accuracy, which measures the ratio of overlapping answer patches between the ground truth mask M and the predicted mask M̂, averaged over all dataset instances for which an answer exists. Finally, we measure how often a predicted answer and the ground truth overlap by at least a single patch. We balance the test sets to contain an equal number of examples with and without an answer.
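The three metrics can be made concrete with a short sketch over flattened 0/1 patch masks; `qa_metrics` is a hypothetical helper name, and the exact averaging conventions in the paper may differ:

```python
def qa_metrics(gold_masks, pred_masks):
    """Compute the three QA metrics over patch-level masks (flat 0/1 lists).
    binary:  does the model agree with the gold about answer existence?
    patch:   fraction of gold answer patches also predicted, averaged
             over instances that have an answer.
    overlap: fraction of answered instances sharing >= 1 patch."""
    binary = patch = overlap = answered = 0
    for gold, pred in zip(gold_masks, pred_masks):
        binary += (any(gold) == any(pred))
        if any(gold):
            answered += 1
            hits = sum(g and p for g, p in zip(gold, pred))
            patch += hits / sum(gold)
            overlap += (hits > 0)
    n = len(gold_masks)
    return {"binary": binary / n,
            "patch": patch / answered if answered else 0.0,
            "overlap": overlap / answered if answered else 0.0}
```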

Results
Baselines We compare PHD's performance on GLUE to a variety of strong baselines, covering both OCR-free and OCR-based methods. First, we use CLIP with a ViT-L/14 image encoder in the linear probing setting. We also finetune BERT and PIXEL on the OCR output of the rendered images (OCR-then-finetune). Although BERT has been shown to be overall more effective on standard GLUE than PIXEL, PIXEL is more robust to orthographic noise (Rust et al., 2023). Finally, to obtain an empirical upper limit for our model, we finetune BERT and PIXEL on a standard, non-OCRed version of GLUE. Likewise, for the QA tasks, we compare PHD to BERT trained on a non-OCRed version of the datasets (the Runaways dataset was manually transcribed). We describe all baseline setups in App B.
GLUE Tab 2 summarises the performance of PHD on GLUE. Our model demonstrates noteworthy results, achieving scores above 80 on five of the nine GLUE tasks. These results serve as evidence of our model's language understanding capabilities. Although our model falls short of text-based BERT by 13 absolute points on average, it achieves competitive results compared to the OCR-then-finetune baselines. Moreover, PHD outperforms other pixel-based models by more than 10 absolute points on average, highlighting the efficacy of our methodology.
Question Answering According to Tab 3, our model achieves above guess-level accuracies on these highly challenging tasks, further strengthening the indications that PHD has obtained impressive language comprehension skills. Although the binary accuracy on SQuAD is low, hovering around 60% compared to BERT's 72%, the relatively high "at least one overlap" score of above 40 indicates that PHD has gained the ability to correctly locate the answer within the scan. Furthermore, PHD displays impressive robustness to noise, with only a marginal decline in performance observed between the clean and noisy versions of the SQuAD dataset, indicating its potential for handling the highly noisy historical domain. The model's performance on the Runaways Slaves dataset is particularly noteworthy, reaching a binary accuracy score of nearly 75% compared to BERT's 78%, demonstrating the usefulness of the model when applied to historically-oriented NLP tasks. We believe that the higher metrics reported for this dataset compared to standard SQuAD might stem from the fact that Runaways Slaves in Britain contains repeated questions (with different contexts), which might render the task more tractable for our model.
Saliency Maps Our patch-based QA approach can also produce visual saliency maps, allowing for a more fine-grained interpretation of model predictions and capabilities (Das et al., 2017).

Conclusion
In this study, we introduce PHD, an OCR-free language encoder specifically designed for analysing historical documents at the pixel level. We present a novel pretraining method involving a combination of synthetic scans that closely resemble historical documents and real historical newspapers published in the Caribbean during the 18th and 19th centuries. Through our experiments, we observe that PHD exhibits high proficiency in reconstructing masked image patches, and provide evidence of our model's noteworthy language understanding capabilities. Notably, we successfully apply our model to a historical QA task, achieving a binary accuracy score of nearly 75%, highlighting its usefulness in this domain. Finally, we note that better evaluation methods are needed to further drive progress in this domain.

Limitations
We see several limitations regarding our work. First, we focus on the English language only, a high-resource language with strong OCR systems developed for it. By doing so, we neglect low-resource languages for which our model could potentially be more impactful.
On the same note, we opted to pretrain our model on a single (albeit diverse) historical corpus of newspapers, and its robustness in handling other historical sources is yet to be proven. To address this limitation, we plan to extend our historical corpora in future research. Expanding the range of the historical training data would not only alleviate this concern but also tackle another limitation: while our model was designed for historical document analysis, most of its pretraining corpora consist of modern texts, due to the insufficient availability of large historical datasets.
We also see limitations in the evaluation of PHD. As mentioned in Section 4.4, it is unclear how to empirically quantify the quality of the model's reconstruction of masked image regions, necessitating reliance on qualitative evaluation. This qualitative approach may result in a suboptimal model for downstream tasks. Furthermore, the evaluation tasks used to assess our model's language understanding capabilities are limited in their scope. Considering our emphasis on historical language modelling, it is worth noting that the evaluation datasets predominantly cater to models trained on modern language, and we rely on a single historical dataset to evaluate our model's performance.
Lastly, due to limited computational resources, we were constrained to training a relatively small-scale model for a limited number of steps, potentially impeding its ability to develop the capabilities needed to address this challenging task. Insufficient computational capacity also prevented us from conducting comprehensive hyperparameter searches for the downstream tasks, restricting our ability to optimise the model's performance to its full potential. Addressing these constraints could enhance our performance metrics and allow PHD to achieve more competitive results on GLUE and higher absolute numbers on SQuAD.

A.1 Training
Pretraining We pretrain PHD for 1M steps with the artificial dataset, using a batch size of 176 (the maximal batch size that fits our system) and the AdamW optimizer (Kingma and Ba, 2014; Loshchilov and Hutter, 2017), with a linear warmup over the first 50k steps to a peak learning rate of 1.5e−4 and a cosine decay to a minimum learning rate of 1e−5. We then train PHD for an additional 100k steps on the real historical scans, using the same hyperparameters but without warm-up. Pretraining took 10 days on 2 × 80GB Nvidia A100 GPUs.
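The warmup-then-cosine schedule described above can be written out explicitly. This is a sketch of the first (1M-step) phase only, assuming the standard warmup-plus-cosine form; the exact implementation in the training code may differ:

```python
import math

PEAK, FLOOR = 1.5e-4, 1e-5       # peak and minimum learning rates
WARMUP, TOTAL = 50_000, 1_000_000  # warmup steps and total steps

def lr_at(step):
    """Linear warmup to PEAK over the first 50k steps, then cosine
    decay down to FLOOR at step 1M."""
    if step < WARMUP:
        return PEAK * step / WARMUP
    progress = (step - WARMUP) / (TOTAL - WARMUP)
    return FLOOR + 0.5 * (PEAK - FLOOR) * (1 + math.cos(math.pi * progress))
```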
GLUE Table 4 contains the hyperparameters used to finetune PHD on the GLUE benchmark. We did not run a comprehensive hyperparameter search due to compute limitations; these settings were manually selected based on a small number of preliminary runs.
SQuAD To finetune PHD on SQuAD, we used a learning rate of 6.75e−6, a batch size of 128, a dropout probability of 0.0, and a weight decay of 1e−5. We train the model for 50,000 steps.

Runaways Slaves in Britain
To finetune PHD on the Runaways Slaves in Britain dataset, we first trained the model on SQuAD using the hyperparameters mentioned above. Then, we finetuned the resulting model for an additional 1,000 steps on the Runaways Slaves in Britain dataset. The only hyperparameter we changed between the two runs is the dropout probability, which we increased to 0.2.

A.2 Dataset Generation
List of dataset augmentations To generate the synthetic dataset described in Section 4.1, we applied the following transformations to the rendered images: a text bleed-through effect; addition of random horizontal and vertical lines; salt-and-pepper noise; Gaussian blurring; water-stain effects; a "holes-in-image" effect; colour jitter of the image background; and random rotations.
Converting the Caribbean Newspapers dataset into 368 × 368 scans We convert full newspaper pages into a collection of 368 × 368 pixel crops using the following process. First, we extract the layout of the page using the Python package Eynollah. This package provides the location of every paragraph on the page, as well as their reading order.
As newspapers tend to be multi-columned, we "linearise" the page into a single-column document.
We crop each paragraph and resize it such that its width equals 368 pixels. We then concatenate all the resized paragraphs with respect to their reading order to generate a long, single-column document with a width of 368 pixels. Finally, we use a sliding-window approach to split the linearised page into 368 × 368 crops, applying a stride of 128 pixels. We reserve 5% of newspaper issues for validation, using the rest for training.
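The sliding-window step can be sketched by computing the top offsets of the crops; `crop_offsets` is a hypothetical helper name, and clamping the final window to the page bottom is an assumption (the paper does not state how the last partial window is handled):

```python
WIDTH, STRIDE = 368, 128

def crop_offsets(page_height, size=WIDTH, stride=STRIDE):
    """Top pixel offsets of the 368 x 368 sliding-window crops taken
    from a linearised single-column page, with stride 128. The final
    window is clamped so the page bottom is always covered."""
    if page_height <= size:
        return [0]
    offsets = list(range(0, page_height - size + 1, stride))
    if offsets[-1] + size < page_height:
        offsets.append(page_height - size)
    return offsets
```

Each offset then yields one square crop of the full 368-pixel-wide column.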

B Historical GLUE Baselines
For all baselines below, we compute and average scores over 5 random initializations.
OCR + BERT/PIXEL For each GLUE task, we first generate 5 epochs of noisy training data and run Tesseract on it to obtain noisy text datasets. Similarly, but without oversampling, we obtain noisy versions of our fixed validation sets. We then finetune BERT-base and PIXEL-base in the same way as Rust et al. (2023), with one main difference: the noisy OCR output prevents us from separating the first and second sentences in sentence-level tasks. Therefore, we treat each sentence pair as a single sequence and leave it to the models to identify sentence boundaries themselves, similar to how PHD has to identify sentence boundaries in the images. We use the codebase and training setup from Rust et al. (2023).9

CLIP We run linear probing on CLIP using an adaptation of OpenAI's official codebase.10 We first extract image features from the ViT-L/14 CLIP model and then train a logistic regression model with the L-BFGS solver for all classification tasks, and an ordinary least squares linear regression model for the regression task (only STS-B).
Donut We finetune Donut-base using an adaptation of ClovaAI's official codebase.11 We frame each of the GLUE tasks as an image-to-text task: the model receives the (noisy) input image and is trained to produce an output text sequence such as <s_glue><s_class><positive/></s_class></s>. In this example, taken from SST-2, the <X> tags are new vocabulary items added to Donut, and the label is an added vocabulary item for the positive sentiment class. All classification tasks in GLUE can be represented in this way. For STS-B, where the label is a floating-point value denoting the similarity score between two sentences, we follow Raffel et al. (2020) to round and convert the floats into strings.12 We finetune with a batch size of 32 and a learning rate between 1e−5 and 3e−5 for a maximum of 30 epochs or 15,000 steps, on images resized to a resolution of 320 × 320 pixels.
OCR-free BERT/PIXEL For GLUE, we take the results reported in Rust et al. (2021). For SQuAD, we take a BERT model finetuned on SQuAD-v2,13 and evaluate it on the validation set of SQuAD-v2, after balancing it for the existence of an answer. For the Runaways Slaves in Britain dataset, we finetune a BERT-base-cased model14 on a manually transcribed version of the dataset. We use the default SQuAD-v2 hyperparameters reported in the official Huggingface repository for training on SQuAD-v2.15 We then evaluate the model on a balanced test set containing 20% of the ads.

C Additional Material
Figure 9: Additional examples from our artificially generated dataset.
Figure 10: Sample scans from the real historical dataset, as described in Section 4.2.
Figure 11: The process of generating the Visual SQuAD dataset. We first render the context as an image (a), generate a patch-level label mask highlighting the answer (b), and add noise and concatenate the question (c).
Figure 12: Additional examples of PHD's completions over test set samples.
Figure 13: Dimensionality reduction of embeddings calculated by our model on historical scans. We see that scans are clustered based on visual similarity and page structure. However, further investigation is required to determine whether scans are also clustered based on semantic similarity.
Figure 16: Examples of shipping ads in newspapers. Newspapers in the Caribbean region routinely reported on passenger and cargo ships porting at and departing the islands. These ads are usually well-structured and contain information such as relevant dates, the ship's captain, route, and cargo.
Figure 17: Input samples for PIXEL. The images are rolled, i.e., the actual input resolution is 16 × 8464 pixels. The grid represents the 16 × 16 patches that the inputs are broken into.
Figure 18: An example of a full newspaper page downloaded from the "Caribbean project". Section 4.2 details the way full newspaper pages are processed so that they can be inputted to our model.

Figure 1: Our proposed model, PHD. The model is trained to reconstruct the original image (a) from the masked image (b), resulting in (c). The grid represents the 16 × 16 pixel patches that the inputs are broken into.
composed of instances where the image and text are aligned. Fig 1 visualises our proposed training approach. Given the paucity of large, high-quality datasets comprising historical scans, we pretrain our model using a combination of 1) synthetic scans designed to resemble historical documents faithfully, produced using a novel method we propose for synthetic scan generation; and 2) real historical English newspapers published in the Caribbean in the 18th and 19th centuries. The resulting pixel-based language encoder, PHD (Pixel-based model for Historical Documents), is subsequently evaluated based on its comprehension of natural language and its effectiveness in performing question answering on historical documents. We discover that PHD displays impressive reconstruction capabilities, being able to correctly predict both the form and content of masked patches of historical newspapers (§4.4). We also note the challenges of quantitatively evaluating these predictions. We provide evidence of our model's noteworthy language understanding capabilities, alongside an impressive resilience to noise. Finally, we demonstrate the usefulness of the model when applied to the historical QA task (§5.4). To facilitate future research, we provide the dataset, models, and code at https://github.com/nadavborenstein/pixel-bw.
as Fig 1a demonstrates (compare to Fig 17a in App C). Therefore, we set the input size of PHD to be 368 × 368 pixels (or 23 × 23 patches).
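The patch arithmetic behind these numbers is straightforward; a minimal sketch (the helper name is ours):

```python
def patch_grid(height: int, width: int, patch: int = 16):
    """Rows and columns of non-overlapping patches for a ViT-style input.
    Assumes the image dimensions are exact multiples of the patch size."""
    assert height % patch == 0 and width % patch == 0
    return height // patch, width // patch


# PHD's square input: 368 x 368 pixels -> a 23 x 23 grid (529 patches).
# PIXEL's rolled input (Fig 17): 16 x 8464 pixels -> a 1 x 529 grid,
# i.e. the same number of patches laid out as a single long row.
```

The two models therefore process the same number of patches per input; only the spatial arrangement differs.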

Figure 2: Process of generating a single artificial scan. Refer to §4.1 for detailed explanations.
Finally, we use the Python package WeasyPrint to render the HTML file as a PNG image. Fig 2a visualises the outcome of this process. In some cases, if the text span is short or the selected font is small, the resulting image contains a large empty space (as in Fig 2a). When the empty space within an image exceeds 10%, a new image is generated to fill the vacant area. We create the new image by randomly choosing one of two options. In 80% of the cases, we retain the font of the original image and select the next paragraph. In 20% of the cases, a new paragraph and font are sampled. This pertains to the common case where a historical scan depicts a transition of context or font (e.g., Fig 1a). This process can repeat multiple times, resulting in images akin to Fig 2b. Finally, to simulate the effects of scanning ageing historical documents, we degrade the image by adding various types of noise, such as blurring, rotations, salt-and-pepper noise, and a bleed-through effect (see Fig 2c and Fig 9 in App C for examples). App A.2 enumerates the full list of degradations and augmentations we use.
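The empty-space check and the 80/20 continuation rule can be sketched as follows. This is a simplified illustration of ours; the near-white threshold and the nested-list grayscale representation are our assumptions, not the paper's implementation:

```python
import random


def empty_fraction(img, white=250):
    """Fraction of near-white pixels in a grayscale image given as
    nested lists of 0-255 intensities."""
    pixels = [p for row in img for p in row]
    return sum(p >= white for p in pixels) / len(pixels)


def choose_refill(current_font, fonts, paragraphs, next_paragraph, rng=None):
    """The 80/20 rule: keep the current font and continue with the next
    paragraph 80% of the time; otherwise sample a fresh font and
    paragraph, simulating a transition of context or font."""
    rng = rng or random.Random(0)
    if rng.random() < 0.8:
        return current_font, next_paragraph
    return rng.choice(fonts), rng.choice(paragraphs)
```

An image whose `empty_fraction` exceeds 0.10 would trigger `choose_refill` and a re-render, repeating until the page is sufficiently full.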
] patches, respectively, and then place them in random image locations (see Fig 1b for an example). Training hyperparameters can be found in App A.1.
Fig 3 presents a visual representation of the model's completions.

Figure 3: Examples of some image completions made by PHD. Masked regions are marked by dark outlines.

Figure 4: Single-word completions made by our model. Figure captions depict the missing word. Fig (a) depicts a successful reconstruction, whereas Figs (b) and (c) represent failure cases.

Figure 5: Semantic search using our model. (a) is the target of the search, and (b) shows scans retrieved from the newspaper corpus.
Fig 13, however, does not provide insights regarding the semantic properties of the clusters. Therefore, we also directly use the model in a semantic search setting. Specifically, we search our newspaper corpus for scans that are semantically similar to instances of the Runaways Slaves in Britain dataset, as well as for scans containing shipping ads (see Fig 16 in App C for examples). To do so, we embed 1M random scans from the corpus. We then calculate the cosine similarity between these embeddings and the embeddings of samples from the Runaways Slaves in Britain dataset and embeddings of shipping ads. Finally, we manually examine the ten scans most similar to each sample. Our results (Fig 5 and Fig 14 in App C) are encouraging, indicating that the embeddings capture not only structural and visual information but also the semantic content of the scans. However, the results are still far from perfect, and many retrieved scans are not semantically similar to the search's target. It is highly plausible that additional specialised finetuning (e.g., SentenceBERT's (Reimers and Gurevych, 2019) training scheme) is necessary to produce more semantically meaningful embeddings.
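The retrieval step can be sketched as follows. This is a plain-Python illustration over embedding lists; in practice the embeddings would come from PHD's encoder, and the corpus would be the 1M embedded scans:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query_emb, corpus_embs, k=10):
    """Indices of the k corpus scans most similar to the query embedding."""
    scores = [(cosine(query_emb, emb), i) for i, emb in enumerate(corpus_embs)]
    return [i for _, i in sorted(scores, reverse=True)[:k]]
```

With `k=10`, `top_k` yields the ten candidates that were manually examined for each query sample.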

Figure 6: Samples from the clean and noisy visual GLUE datasets.

Figure 7: An example from the Runaways Slaves in Britain dataset, rendered as a visual question answering task. The gray overlay marks the patches containing the answer.

Figure 8: Saliency maps of PHD finetuned on the Runaways Slaves in Britain dataset. The ground-truth label is shown in a grey box. The figures were cropped in post-processing.
Fig 8 presents two such saliency maps, produced by applying the model to test samples from the Runaways Slaves in Britain dataset, including a failure case (Fig 8a) and a successful prediction (Fig 8b). More examples can be found in Fig 15 in App C.
See Fig 10 in App C for dataset examples.


Figure 9: Samples of our artificially generated dataset; compare to Figure 10.
Figure 10: Sample scans from the real historical dataset, as described in Section 4.2.
(a) Rendering context c as an image I. (b) Generating a label mask M. (c) Adding q and degradations.

Figure 11: Process of generating the Visual SQuAD dataset. We first render the context as an image (a), generate a patch-level label mask highlighting the answer (b), and add noise and concatenate the question (c).
Figure 12: Additional examples of PHD's completions over test set samples.

Figure 13: Dimensionality reduction of embeddings calculated by our model on historical scans.

Figure 14: Semantic search using our model. (a) is the target of the search, and (b) shows scans retrieved from the newspaper corpus.

Figure 15: Additional examples of PHD's saliency maps for samples from the test set of the Runaways Slaves in Britain dataset.
Figure 16: Examples of shipping ads in newspapers. Newspapers in the Caribbean region routinely reported on passenger and cargo ships porting at and departing the islands. These ads are usually well-structured and contain information such as relevant dates, the ship's captain, route, and cargo.

Figure 17: Input samples for PIXEL. The images are rolled, i.e., the actual input resolution is 16 × 8464 pixels. The grid represents the 16 × 16 patches that the inputs are broken into.

Table 1: Statistics of the newspapers dataset.

Table 2: Results for PHD finetuned on GLUE. The metrics are F1 score for QQP and MRPC, Matthews correlation for COLA, Spearman's ρ for STS-B, and accuracy for the remaining datasets. Bold values indicate the best model in each category (noisy/clean), while underlined values indicate the best pixel-based model.
∼200M parameters and is the closest and strongest OCR-free alternative to PHD. Moreover, we finetune BERT and PIXEL on the OCR output of Tesseract. Both BERT and PIXEL are comparable in size and compute budget to PHD.

Table 4: The hyperparameters used to train PHD on GLUE tasks.