Entities, Dates, and Languages: Zero-Shot on Historical Texts with T0

In this work, we explore whether the recently demonstrated zero-shot abilities of the T0 model extend to Named Entity Recognition for out-of-distribution languages and time periods. Using a historical newspaper corpus in three languages as a test bed, we use prompts to extract possible named entities. Our results show that a naive approach to prompt-based zero-shot multilingual Named Entity Recognition is error-prone, but they also highlight the potential of such an approach for historical languages lacking labeled datasets. We also find that T0-like models can be probed to predict the publication date and language of a document, which could be very relevant for the study of historical texts.


Introduction
This paper lies at the focal point of three orthogonal advances. First, the recent surge in digitisation efforts led by the GLAM sector (galleries, libraries, archives, and museums) (Terras, 2011), open citizen science (Haklay et al., 2021) and the expansive commodification of data (Hey and Trefethen, 2003) has enabled a new mode of historical inquiry that capitalises on the 'big data of the past' (Kaplan and Di Lenardo, 2017). Second, the 2017 breakthrough that was the transformer architecture (Vaswani et al., 2017) has led to the so-called ImageNet moment of Natural Language Processing (Ruder, 2018) and brought about unprecedented progress in transfer learning (Raffel et al., 2020), few-shot learning (Schick and Schütze, 2021), zero-shot learning (Sanh et al., 2021), and prompt-based learning (Le Scao and Rush, 2021) for natural language. Third, the growing popularity of prompt-based methods (Liu et al., 2021) has resulted in a new paradigm for training and fine-tuning Large Language Models (LLMs), as well as in novel applications to Named Entity Recognition (NER) (Liu et al., 2022).

NER for historical texts has been the focus of a growing body of research, most recently surveyed by Ehrmann et al. (2021). Both NER and the related task of Entity Linking can enhance our ability to search and navigate digitised historical materials (Neudecker et al., 2014; Kim and Cassidy, 2015). However, applying NER to historical texts poses a number of challenges, including errors introduced by Optical Character Recognition (OCR) (Ehrmann et al., 2021; Hamdi et al., 2019; Boros et al., 2020) and domain transfer (Baptiste et al., 2021). To advance research in this area, an increasing number of datasets have been created to support the development and evaluation of NER approaches on historical text (Neudecker, 2016; Ehrmann et al., 2020, 2022).

In this paper, we examine the zero-shot abilities of T0, a prompt-based LLM developed as part of the BigScience project for open research (Sanh et al., 2021), on the challenging task of historical NER. This endeavour faced two main hurdles: (1) the model was neither trained to recognize entities, nor was it ever tested on that task; (2) our evaluation dataset was out-of-distribution, containing both multilingual and historical data. To better contextualize the results of our experiments, we also run zero-shot prompt-based probing (Zhong et al., 2021) to assess T0's broader ability to extract factual knowledge about two key factors in our experiment: language variation and historical variation in the dataset.

We mix the original training and validation sets of the CLEF-HIPE 2020 dataset to constitute our test set, and we split this new set by language and date, using 20-year time intervals (see Table 1). Each language subset is relatively balanced between 1810 and 1910: English contains between 2,202 and 4,697 tokens per split, with the exception of the 1850-1870 split, for which there are no tokens; German contains between 6,735 and 12,829 tokens, and French between 8,550 and 16,874 tokens. The later periods contain on average more tokens for German and French. Overall, the dataset contains 3.8% named entities (from 1.9% to 5.6%, depending on time period and language). The most balanced dataset across time periods is the French one (between 3.8% and 4.6% named entities).
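To make the splitting scheme concrete, the sketch below bins a document table into the 20-year periods used in Table 1. It is purely illustrative: the column names (language, date) are hypothetical, not those of the CLEF-HIPE 2020 release.

    import pandas as pd

    # Toy stand-in for the corpus; in practice each row would be a document
    # (or sentence) from the CLEF-HIPE 2020 training and validation sets.
    docs = pd.DataFrame({
        "language": ["en", "fr", "de", "fr"],
        "date": [1832, 1871, 1905, 1868],
        "text": ["...", "...", "...", "..."],
    })

    # 20-year bins covering 1810-1910 (left-closed: 1810-1830, 1830-1850, ...).
    bins = list(range(1810, 1931, 20))
    docs["period"] = pd.cut(docs["date"], bins=bins, right=False)

    # Token counts per (language, period) split would then be tabulated as in Table 1.
    print(docs.groupby(["language", "period"], observed=True).size())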

Model description
In our experiments, we use the T0++ variant of the T0 language model (Sanh et al., 2021), which is based on the LM-adapted T5 model (Lester et al., 2021), itself a variant of T5 (Raffel et al., 2020) that further pretrains the original encoder-decoder architecture with an autoregressive language modeling objective. Crucially, this pretraining is done using a prompt-based training setup, in which training examples are transformed into prompts using a variety of crowd-sourced prompt templates. This setup allows T0 to perform few-shot and zero-shot learning when presented with new prompts for a previously unseen task.
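For concreteness, the snippet below shows how T0++ can be queried zero-shot through the Hugging Face transformers library. This is a minimal sketch, not the authors' exact setup: the generation settings are our own choices, and the example sentence is invented.

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    # The public T0++ checkpoint; at ~11B parameters it requires substantial memory.
    tokenizer = AutoTokenizer.from_pretrained("bigscience/T0pp")
    model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0pp")

    # A zero-shot prompt in the style of Table 2.
    prompt = ("Input: Mr. Keller arrived in Geneva on Tuesday.\n"
              "In input, what are the names of person? Separate answers with commas.")
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))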

Experiments
Our goal in this paper is to see if and how state-of-the-art language models can be used for historical NLP tasks with minimal modifications and fine-tuning. As such, we choose a 'naive' approach: we directly ask the model which named entities a given sentence contains. To do so, we first design prompts for each named entity type (see Table 2). For each sentence in the dataset, we then (1) use all the generation prompts to determine whether the sentence contains named entities of each entity type; (2) filter the model's answer to keep only tokens that are actually in the input sentence, keeping the entity covering the longest span in case of nested entities; and (3) ask a disambiguation question if needed (i.e. if a token was assigned to multiple entity types by the model). Results are stored at each step. A sketch of this procedure is given below.
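In the sketch, query_model stands in for a call to T0 (see the loading snippet above), and the templates paraphrase those of Table 2; the details are illustrative rather than an exact reimplementation.

    # Generation prompts per entity type, paraphrasing Table 2.
    ENTITY_PROMPTS = {
        "PERS": "Input: {s}\nIn input, what are the names of person? Separate answers with commas.",
        "LOC": "Input: {s}\nIn input, what are the names of location? Separate answers with commas.",
        "PROD": "Input: {s}\nIn input, what are the names of media or doctrine? Separate answers with commas.",
    }

    def extract_entities(sentence, query_model):
        """Step 1: query T0 with every generation prompt.
        Step 2: keep only answers that actually occur in the input sentence.
        Step 3 (disambiguation, not shown) would ask a follow-up question
        whenever a token is assigned to several entity types."""
        candidates = {}
        for ent_type, template in ENTITY_PROMPTS.items():
            answer = query_model(template.format(s=sentence))
            kept = [a.strip() for a in answer.split(",")
                    if a.strip() and a.strip() in sentence]
            # For nested matches, prefer the candidate covering the longest span.
            candidates[ent_type] = sorted(kept, key=len, reverse=True)
        return candidates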
We then evaluate the results and conduct two additional experiments to better understand the impact of the dataset language and time period on the performance of the LM.

Limitations
Results reveal limitations in our proposed approach. First, T0 exhibits a clear tendency to produce non-empty outputs regardless of the presence or absence of named entities in the input: none of the prompts generates an empty answer. This is especially visible for the entity type PROD, for which T0 answers over 55% of the queries with the name of the entity type itself (e.g. either 'media' or 'doctrine') rather than with any other token from the input sentence.

Second, adequately matching T0's output with tokens in the input sentence proved difficult. Even when T0 generates an answer semantically very close to the correct token in the sentence, differences in spelling prevent the algorithm from correctly associating T0's answer with said token. This problem is inherent to the nature of our dataset: frequent OCR errors generate unpredictable variations in 'gold' word spelling (including spacing between words and letters, or diacritics variation), which T0 automatically corrects in its predictions (e.g. respelling words that were garbled by noisy OCR); this negatively affects our ability to automatically match its answers with the corresponding tokens in the sentence. In other instances, the model translated words from French and German into English. Further experiments might mitigate this language variety by adding input text to the prompt, to help the model correctly assess the language in which it must answer. Because exact matching treats all such predictions as strictly incorrect, the algorithm never enters its disambiguation phase; we therefore analyse non-disambiguated results.

Evaluation
To evaluate proximity between predictions and gold annotations, we compare 'gold' tokens with predicted tokens using a normalized Levenshtein distance (normalization was done with respect to the length of the longer of the two tokens, and results were kept below a threshold; we tried thresholds of 0.0, 0.1, 0.2, 0.3, 0.4 and 0.5). We use this metric as a proxy to identify the best predictions for each entity query in each sentence. For a given example, we define (1) the true positive as the prediction with the shortest Levenshtein distance from the gold token; (2) false positives as predictions of entities that are not actually present in the input sentence; and (3) false negatives as predictions with a longer Levenshtein distance to the gold tokens (i.e. predictions that would have failed to identify entity tokens in the sentence). Precision and F1-score are relatively low, especially for PROD entities, which were the most difficult to define in terms of text prompts. Higher values for recall are due to the fact that increasing the Levenshtein threshold makes it more likely to find an acceptable answer among those generated by T0. Unsurprisingly, the highest increase is found for TIME entities: dates have fixed formats, which makes it more likely to find an acceptable distance between predictions and correct tokens. Precision scores for each entity type are shown in Figure 1 (see Fig. 3 in the Appendix for recall and F1-score). The results of our experiment suggest that, although T0 struggles to return exact matches of the entities in the input sentence, it is still capable of generating answers that are semantically close to the correct tokens.
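A minimal sketch of this matching step, assuming the python-Levenshtein package; the normalization by the longer token's length and the thresholding follow the description above.

    import Levenshtein  # pip install python-Levenshtein

    def normalized_distance(pred, gold):
        # Normalize by the length of the longer of the two tokens.
        return Levenshtein.distance(pred, gold) / max(len(pred), len(gold), 1)

    def best_match(predictions, gold, threshold=0.4):
        """Return the prediction closest to the gold token if its normalized
        distance falls under the threshold (a true positive), else None."""
        if not predictions:
            return None
        best = min(predictions, key=lambda p: normalized_distance(p, gold))
        return best if normalized_distance(best, gold) <= threshold else None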
After manually inspecting the dataset and its numerous OCR artifacts, we choose 0.4 as a reasonable heuristic for close semantic similarity between T0's output and gold tokens. We find that a threshold of 0.4 prevents the appearance of false positives, and we therefore use it to analyze differences between languages and between historical periods within the dataset. With respect to variation across languages, we observe that the precision of predictions in English does not have a clear edge over precision in French and German (Fig. 2; see also Fig. 4 in the Appendix). This is unexpected, as T0 should display a considerable bias towards English, which constitutes most of its training data. With respect to variation across periods, we observe an improvement in precision (and F1-score) for PERS and LOC entities in English texts from the 1850s onwards (Fig. 3; for recall and F1-score, see Fig. 5 in the Appendix), whereas for other entities and languages, precision and F1-score are either stable or show a downward trend (e.g. LOC in German). Variations in recall cannot be reduced to clear trends, but they are particularly erratic in English texts. A possible explanation could be that T0 is more sensitive to English text inputs, and therefore outputs a higher or lower number of irrelevant answers depending on the specific content of each input sentence.
A baseline comparison with the results of the HIPE 2020 evaluation campaign confirms that our implementation of zero-shot NER with T0 is below state-of-the-art performance. As baselines, we considered the micro precision, recall and F1-score of coarse NER (literal sense) with fuzzy boundary matching from HIPE 2020 (see Table 3).
All the scores from our experiments with T0 are below the best results from HIPE 2020. We observe no significant improvement in precision and F1-score compared to the results of our experiments on the combined training and validation sets. We observe some improvements in recall, especially for English and for TIME entities, with recall reaching 1.0 for some combinations of language, entity and time period. However, we believe that this improvement is not significant and is due to our choice of Levenshtein threshold, as explained above.

Prompt-based factual probing
In addition to our main experiment on NER, we run two further experiments to assess T0's ability to do inference in a multilingual setting and to identify historical variation in textual corpora.
Probing for language. To gauge T0's ability to reason in a multilingual setting, we test the model's language identification ability. To that end, we use a trilingual subset (French, German, and English; 1,000 sentences each) of the WiLI-2018 Wikipedia Language Identification dataset (Thoma, 2018) and prompt the model to identify the language of each input (see Table 2).

Probing for date. Table 4 shows the date prediction errors. Subtle language change can occur in a measurable way in as short a period as a decade (Juola, 2003), and a median absolute error of 30 years therefore suggests that T0 is good at predicting publication dates. We notice some variation in performance between languages, with French performing slightly worse on both metrics (possibly because, unlike German, it belongs to a different language family from English).
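The probing setup can be sketched as follows; the exact prompt wordings are hypothetical (the originals are in Table 2), and query_model is the same stand-in for T0 as before.

    import statistics

    def probe_language(sentence, query_model):
        # Ask the model to name the language of the input.
        return query_model(f"Input: {sentence}\nWhat language is the input written in?")

    def probe_date(sentence, query_model):
        # Ask the model for a publication year, then parse the answer.
        answer = query_model(f"Input: {sentence}\nIn which year was the input published?")
        digits = "".join(ch for ch in answer if ch.isdigit())
        return int(digits[:4]) if len(digits) >= 4 else None

    def median_absolute_error(pred_years, gold_years):
        # The headline date-prediction metric reported in Table 4.
        return statistics.median(abs(p - g) for p, g in zip(pred_years, gold_years))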

Conclusion
We have presented our experiments evaluating T0 on zero-shot historical NER, as well as on the prediction of the language and publication date of historical texts. Our results show that historical texts present additional challenges for zero-shot NER (especially because historical datasets often include noisy OCR), but that T0 can nevertheless be used as-is for language and date prediction. Next steps include experimenting with different prompts and matching methods, as well as testing few-shot NER.

Broader Impacts Statement
In this paper, we take exploratory first steps toward instrumentalising the T0 large language model for the task of historical NER. We deem it appropriate to briefly discuss the ethical considerations implied by such a usage. First, if a model can be used in a context for which it was not explicitly intended, it stands to reason that it can be misused in that same context: while recognizing entities in historical texts might at first glance seem innocuous, numerous studies focused on BIPOC representation in history have shown that this is not the case, as some marginalized groups tend to suffer from erasure from history (Kellow, 1999; Ram, 2020; Stanley, 2021). Second, the automation and scaling of historical inquiry could potentially lead to unreflected (mis)interpretations of the past (Gibbs and Owens, 2013; Gibbs, 2016). Third, the experimental nature of prompt-based inference could lead to a considerable carbon footprint, owing to the trial-and-error nature of manual prompt calibration, though this cost would still be lower than that of training a new model from scratch or fine-tuning an existing LLM.

Figure 1: Precision for the different languages at different Levenshtein distance thresholds. Languages are distinguished by line color.

Figure 2: Precision for the different languages at Levenshtein threshold 0.4 across periods. Languages are distinguished by both line color and dot type.

Table 1: Data description: splits by date and language of the CLEF-HIPE 2020 dataset.

Table 2: Example prompts for generation and disambiguation (Sec. 2.3), as well as factual probing (Sec. 4).

PERS: Input: <sentence>\n In input, what are the names of person? Separate answers with commas.
LOC: Input: <sentence>\n In input, what are the names of location? Separate answers with commas.
PROD: Input: <sentence>\n In input, what are the names of media or doctrine? Separate answers with commas.

Table 4: Date prediction results.