Empirical Error Modeling Improves Robustness of Noisy Neural Sequence Labeling

Despite recent advances, standard sequence labeling systems often fail when processing noisy user-generated text or consuming the output of an Optical Character Recognition (OCR) process. In this paper, we improve the noise-aware training method by proposing an empirical error generation approach that employs a sequence-to-sequence model trained to perform translation from error-free to erroneous text. Using an OCR engine, we generated a large parallel text corpus for training and produced several real-world noisy sequence labeling benchmarks for evaluation. Moreover, to overcome the data sparsity problem, which is exacerbated by imperfect textual input, we learned noisy language model-based embeddings. Our approach outperformed the baseline noise generation and error correction techniques on the erroneous sequence labeling data sets. To facilitate future research on robustness, we make our code, embeddings, and data conversion scripts publicly available.


Introduction
Deep learning models have already surpassed human-level performance in many Natural Language Processing (NLP) tasks 1 . Sequence labeling systems have also reached extremely high accuracy (Akbik et al., 2019; Heinzerling and Strube, 2019). Still, NLP models often fail in scenarios where non-standard text is given as input (Heigold et al., 2018; Belinkov and Bisk, 2018).
NLP algorithms are predominantly trained on error-free textual data but are also employed to process user-generated text (Baldwin et al., 2013; Derczynski et al., 2013) or consume the output of prior Optical Character Recognition (OCR) or Automatic Speech Recognition (ASR) processes (Miller et al., 2000). Errors that occur in any upstream component of an NLP system deteriorate the accuracy of the target downstream task (Alex and Burns, 2014).

Figure 1: Our modification of the NAT approach (green boxes). We propose a learnable seq2seq-based error generator and re-train FLAIR embeddings using noisy text to improve the accuracy of noisy neural sequence labeling. Γ is a process that induces noise in the input x, producing the erroneous x̃. E(x) is an embedding matrix. F(x) is a sequence labeling model. e(x) and e(x̃) are the embeddings of x and x̃, respectively. y(x) and y(x̃) are the outputs of the model for x and x̃, respectively.
In this paper, we focus on the problem of performing sequence labeling on the text produced by an OCR engine. Moreover, we study the transferability of the methods learned to model OCR noise to the distribution of the human-generated errors. Both misrecognized and mistyped text pose a challenge for the standard models trained using error-free data (Namysl et al., 2020).
We make the following contributions (Figure 1):
• We propose a noise generation method for OCR that employs a sequence-to-sequence (seq2seq) model trained to translate from error-free to erroneous text ( §4.1). Our approach improves the accuracy of noisy neural sequence labeling compared to prior work ( §6.1).
• We present an unsupervised parallel training data generation method that utilizes an OCR engine ( §4.2). Similarly, realistic noisy versions of popular sequence labeling data sets can be synthesized for evaluation ( §5.5).
• We exploit erroneous text to perform Noisy Language Modeling (NLM; §4.5). Our NLM embeddings further improve the accuracy of noisy neural sequence labeling ( §6.3), also in the case of the human-generated errors ( §6.4).
• To facilitate future research on robustness, we integrate our methods into the Noise-Aware Training (NAT) framework (Namysl et al., 2020) and make our code, embeddings, and data conversion scripts publicly available. 2

Related Work
Errors of OCR, ASR, and other text generators always pose a challenge to the downstream NLP systems (Lopresti, 2009;Packer et al., 2010;Ruiz et al., 2017). Hence, methods for improving robustness are becoming increasingly popular.
Data Augmentation A widely adopted method of providing robustness to non-standard input is to augment the training data with examples perturbed using a model that mimics the error distribution to be encountered at test time (Cubuk et al., 2019). However, the exact modeling of noise might be impractical or even impossible; thus, methods that employ randomized error patterns have recently gained popularity (Heigold et al., 2018; Lakshmi Narayan et al., 2019). Although trained using synthetic errors, these methods are often able to achieve moderate improvements on data from natural sources of noise (Belinkov and Bisk, 2018; Karpukhin et al., 2019).
Spelling- and OCR Post-correction The most widely used method of handling noisy text is to apply error correction to the input produced by human writers (spelling correction) or to the output of an upstream OCR component (OCR post-correction).
A popular approach applies monotone seq2seq modeling to the correction task (Schnober et al., 2016). For instance, Hämäläinen and Hengchen (2019) proposed Natas, an OCR post-correction method that uses character-level Neural Machine Translation (NMT). They extracted parallel training data using embeddings learned from the erroneous text and used it as input to their translation model. Grammatical Error Correction (GEC; Ng et al., 2013, 2014; Bryant et al., 2019) aims to automatically correct ungrammatical text. GEC can be approached as a translation from an ungrammatical to a grammatical language, which enabled NMT seq2seq models to be applied to this task (Yuan and Briscoe, 2016). However, due to the limited size of human-annotated GEC corpora, NMT models could not be trained effectively (Lichtarge et al., 2019).

Grammatical Error Correction
Several studies investigated generating realistic erroneous sentences from grammatically correct text to boost training data (Kasewa et al., 2018;Grundkiewicz et al., 2019;Choe et al., 2019;Qiu and Park, 2019). Inspired by back-translation (Sennrich et al., 2016;Edunov et al., 2018), Artificial Error Generation (AEG) approaches (Rei et al., 2017;Xie et al., 2018) train an intermediate model in reverse order-to translate correct sentences to erroneous ones. Following AEG, we generate a large corpus of clean and noisy sentences and train a seq2seq model to produce rich and diverse errors resembling the natural noise distribution ( §3.3, 4.2).

Noise-Invariant Latent Representations
Robustness can also be improved by encouraging the models to learn a similar latent representation for both the error-free and the erroneous input. Zheng et al. (2016) introduced stability training, a general method used to stabilize predictions against small input perturbations. Piktus et al. (2019) proposed Misspelling Oblivious Embeddings that embed the misspelled words close to their error-free counterparts. Jones et al. (2020) developed robust encodings that balance stability (consistent predictions across various perturbations) and fidelity (accuracy on unperturbed input) by mapping sentences to a smaller discrete space of encodings. Although their model improved robustness against small perturbations, it decreased accuracy on the error-free input. Recently, Namysl et al. (2020) proposed the Noise-Aware Training method that employs stability training and data augmentation objectives. They exploited both the error-free and the noisy samples for training and used a confusion matrix-based error model to imitate the errors. In contrast to their approach, we employ a more realistic empirical error distribution during training ( §3.3) and observe improved accuracy at test time ( §6.1).

Noisy Neural Sequence Labeling

Namysl et al. (2020) pointed out that standard NLP systems are generally trained using error-free textual input, which causes a discrepancy between the training and the test conditions. These systems are thus more susceptible to non-standard, corrupted, or adversarial input. To model this phenomenon, they formulated the noisy neural sequence labeling problem, assuming that every input sentence might be subjected to some unknown token-level noising process Γ = P(x̃_i | x_i), where x_i is the original i-th token, and x̃_i is its distorted equivalent. As a solution, they proposed the NAT framework, which trains the sequence labeling model using auxiliary objectives that exploit both the original sentences and their copies corrupted using a noising process that imitates the naturally occurring errors (Figure 1).

Confusion Matrix-Based Error Model
Namysl et al. (2020) used a confusion matrix-based method to model insertions, deletions, and substitutions of characters. Given a corpus of paired noisy and manually corrected sentences P, they estimated the natural error distribution by calculating the alignments between the pairs (x, x̃) ∈ P of noisy and clean sentences using the Levenshtein distance metric (Levenshtein, 1966).
Moreover, as P is usually laborious to obtain, they proposed a vanilla error model, which assumes that all types of edit operations are equally likely, i.e., the perturbed character c̃ is drawn uniformly, where c and c̃ are the original and the perturbed characters, respectively, Σ is an alphabet, and ε is a symbol introduced to model insertions and deletions.
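As a rough illustration, such a vanilla error model can be simulated with a uniform character-level noiser. The sketch below is ours, not the exact formulation of Namysl et al. (2020): the function name `vanilla_noise`, the single `p_edit` parameter, and uniform sampling over the alphabet are assumptions.

```python
import random

def vanilla_noise(text, alphabet, p_edit=0.1, rng=None):
    """Corrupt `text` character by character: with probability `p_edit`,
    apply an insertion, deletion, or substitution chosen uniformly
    (a sketch of a 'vanilla' error model with equally likely edits)."""
    rng = rng or random.Random()
    out = []
    for c in text:
        if rng.random() < p_edit:
            op = rng.choice(("insert", "delete", "substitute"))
            if op == "insert":
                out.append(c)
                out.append(rng.choice(alphabet))
            elif op == "substitute":
                out.append(rng.choice(alphabet))
            # "delete": emit nothing for this character
        else:
            out.append(c)
    return "".join(out)
```

Note that an empirically estimated confusion matrix would replace the uniform `rng.choice` calls with draws from the observed distribution P(c̃ | c).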

Realistic Empirical Error Modeling
Namysl et al. (2020) compared the NAT models that used the vanilla and the empirically estimated confusion matrix-based error models and observed no advantages of exploiting the test-time error distribution during training. Would we make the same observation given a more realistic error model?
Even though the methods that used randomized error patterns were often successful, we argue that leveraging the empirical noise distribution for training would be beneficial, providing additional accuracy improvements. The data produced by the naïve noise generation methods may not resemble naturally occurring errors, which could lead the downstream models to learn misleading patterns.

Figure 2: Distributions of the token error rates of sentences produced by the proposed and the baseline error models. For comparison, we plot the distribution of error rates in the text that contains naturally occurring errors. Each value n is the percentage of sentences with a token error rate in [n − 10, n). (Curves shown: digitized text, this work, the OCR-aware baseline model, and the vanilla baseline error model.)
In Figure 2, we compare the distributions of error rates of sentences produced by the proposed and the prior noise models with the distribution of errors in the digitized text. We can observe that the distribution of naturally occurring errors follows Zipf's law, while the baseline noise models produce Bell-shaped curves. Interestingly, both the vanilla and the empirical models exhibit similar characteristics, which could explain the observations from the prior work. In practice, the error rate is not uniform throughout the text. Some passages are recognized perfectly, while others can barely be deciphered. Our objective is thus to develop a noise model that produces a smoother distribution, imitating the errors encountered at test time more precisely (cf. This work in Figure 2).
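Token error rate distributions of the kind compared in this section can be computed from a parallel corpus roughly as follows. This is a sketch under our own assumptions: the helper names and the 1:1 whitespace-token alignment are ours.

```python
def token_error_rate(clean_tokens, noisy_tokens):
    """Percentage of tokens that differ between the clean and the noisy
    version of a sentence (assumes a 1:1 token alignment)."""
    wrong = sum(c != n for c, n in zip(clean_tokens, noisy_tokens))
    return 100.0 * wrong / max(len(clean_tokens), 1)

def error_rate_histogram(pairs, bucket=10):
    """Percentage of sentences whose token error rate falls in each
    [n - bucket, n) interval, keyed by the upper bound n."""
    counts = {}
    for clean, noisy in pairs:
        rate = token_error_rate(clean.split(), noisy.split())
        upper = bucket * (int(rate // bucket) + 1)
        counts[upper] = counts.get(upper, 0) + 1
    total = len(pairs)
    return {n: 100.0 * c / total for n, c in sorted(counts.items())}
```

A heavy head at the lowest bucket with a long tail would indicate a Zipf-like distribution, whereas a peak at an intermediate bucket would indicate a bell-shaped one.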
Moreover, although the exact noise distribution in the test data cannot always be known beforehand, the noising process used to provide the input, e.g., an OCR engine, can often be identified. We can thus take advantage of such prior knowledge to improve the performance of the downstream task.

Data Sparsity of Natural Language
Embeddings pre-trained on a large corpus of monolingual text are ubiquitous in NLP (Mikolov et al., 2013;Peters et al., 2018;Devlin et al., 2019). They capture syntactic and semantic textual features that can be exploited to solve higher-level NLP tasks.
Embeddings are generally trained using corpora that contain error-free text. Due to the data sparsity problem that arises from the large vocabulary sizes and the exponential number of feasible contexts, the majority of possible word sequences do not appear in the input data. Even though increasing the size of the training corpora was shown to improve the performance of language processing tasks (Brown et al., 2020), most of the misrecognized or mistyped tokens would still be unobserved and therefore poorly modeled when using the error-free text only. Would it be beneficial to pre-train the embeddings on data that includes realistic erroneous sentences?

The Flaws of Error Correction
Furthermore, we believe that the correction methods, although widely adopted, can only reliably manage moderately perturbed text (Flor et al., 2019). OCR post-correction has been reported to be challenging in the case of historical books that exhibit high OCR error rates (Rigaud et al., 2019).
We note that correction methods have no information about the downstream task to be performed. Moreover, in the automatic correction setting, they only provide the best guess for each token. Comparing their performance with the NAT approach in the context of sequence labeling would be informative.

Empirical Error Modeling
Figure 1 presents our modifications of the NAT framework. Firstly, we propose to replace the confusion matrix-based noising process ( §3.2) with a noise induction method that generates a more realistic error distribution ( §4.1-4.4). Secondly, to overcome the data sparsity problem ( §3.4), we train language model-based embeddings using digitized text and use them as a substitute for the pre-trained embeddings used in prior work ( §4.5).

Sequence-to-Sequence Error Generator
Motivated by the AEG approaches (Rei et al., 2017; Xie et al., 2018), we propose a learnable error generation method that employs a character-level seq2seq model to perform monotone string translation (Schnober et al., 2016). It directly models the conditional probability p(x̃|x) of mapping error-free text x into erroneous text x̃ using an attention-based encoder-decoder framework (Bahdanau et al., 2015). The encoder computes the representation h = {h_1, . . . , h_n} of x, where n is the length of x. The decoder generates x̃ one token at a time, conditioned on a context vector c = f_attn({h_1, . . . , h_n}) generated from h, where f_attn is an attention function.
Our models are trained to maximize the likelihood of the training data. At inference time, we randomly sample the subsequent tokens from the learned conditional language model.
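The random sampling step at inference time can be sketched as follows. This is an illustrative helper of our own: `sample_step` draws one output character index from a hypothetical decoder's per-step distribution, and the temperature parameter is our addition, not part of the described method.

```python
import math
import random

def sample_step(logits, temperature=1.0, rng=None):
    """Draw one output index from softmax(logits / temperature).
    As temperature approaches 0, this approaches greedy (argmax)
    decoding; at 1.0 it samples the learned distribution directly."""
    rng = rng or random.Random()
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()                      # inverse-CDF sampling
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1
```

Repeating this step until an end-of-sequence symbol is produced yields one sampled erroneous string x̃; a correction model would instead use beam search over the same distributions.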

Figure 3: Schematic visualization of the error generation (blue arrows) and the error correction (green arrows) methods. The parallel data can be utilized to train seq2seq models for both tasks.
Note that our approach reverses the standard seq2seq error correction pipeline, which uses the erroneous text as input and trains the model to produce the corresponding error-free string ( Figure 3). By interchanging the input and the output data, we can also readily train sentence correction models. One difference is that at inference time we would prefer to perform beam search and select the best decoding result rather than sampling subsequent characters from the learned distribution.

Unsupervised Parallel Data Generation
To train our error generation model ( §4.1), we need a large parallel corpus P of error-free and erroneous sentences. AEG approaches use seed GEC corpora to learn the inverse models directly. Unfortunately, we are not aware of any comparably large resources for digitized text that could be used for this task.
To address this issue, we propose an unsupervised sentence-level parallel data generation approach for OCR ( Figure 4). First, we collect a large seed corpus T that contains the error-free text. We then render each sentence and subsequently run text recognition on the rendered images using an OCR engine. Moreover, to increase the variation in training data, we sample different fonts for rendering. Furthermore, to simulate the distortions and degradation of the printed material, we induce pixel-level noise to the images before recognition.
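The render-then-recognize loop described above can be sketched as follows. The `render` and `ocr` arguments are hypothetical stand-ins, injected as callables, for a text renderer (which would apply the sampled font and pixel-level distortions) and an OCR engine; the function name and signature are our assumptions.

```python
import random

def build_parallel_corpus(sentences, render, ocr, fonts, rng=None):
    """For each error-free sentence, render it with a randomly sampled
    font and recognize the resulting image, yielding (clean, noisy)
    sentence pairs for seq2seq training."""
    rng = rng or random.Random()
    corpus = []
    for clean in sentences:
        image = render(clean, font=rng.choice(fonts))  # + pixel noise inside render
        noisy = ocr(image)
        corpus.append((clean, noisy))
    return corpus
```

In practice, `render` and `ocr` would wrap an image-synthesis package and an engine such as Tesseract; swapping the pair of callables (e.g., for a Text-to-Speech engine and an ASR system) gives the task-agnostic variant mentioned below.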
Note that our approach is universal and could be used to generate parallel data sets for other tasks, e.g., an ASR system could be trained on samples from a Text-to-Speech engine (Wang et al., 2018b).

Sentence- and Word-Level Modeling
We note that the sequence labeling problem is formulated at the word level, i.e., each word has a class label assigned to it. To employ our method in this scenario, we develop (i) a sentence-level and (ii) a token-level variant of our error generator.

Figure 4: Our parallel data generation method for OCR. We render sentences extracted from a text corpus. Subsequently, an OCR engine recognizes the text depicted in the rendered images. Finally, the pairs of original and recognized sentences are gathered together to form a parallel corpus used to train translation models.

Our sentence-level error generator uses a seq2seq model trained to translate from error-free to erroneous sentences. It can potentially utilize contextual information from surrounding words, which may improve the quality of the results. During the training of a NAT model, a learned seq2seq model translates the original input x to generate x̃. Subsequently, we use an alignment algorithm ( §4.4) to transfer the word-level annotations from x to x̃.
Our token-level error generator uses a seq2seq model trained to translate from error-free to erroneous words. It relies exclusively on the input and the output words. We use the alignment algorithm to build a training set for this task, i.e., to extract word-level parallel data from the corpus of parallel sentences ( §4.2). During the training of a NAT model, a learned generator translates each word x_i from x to produce the erroneous sentence x̃.

Word-Level Sentence Alignment

Figure 5 illustrates the alignment procedure, which we developed to extract word-level parallel training data for our token-level generator and to transfer the labels to the erroneous sentences for the sentence-level generator in the sequence labeling scenario. To this end, we align each pair of error-free and noisy sentences at the word level using the Levenshtein distance algorithm. Our alignment procedure produces pairs of aligned words. The annotations for words are transferred accordingly.

Noisy Language Modeling
Recently, Xie et al. (2017) drew a connection between input noising in neural network language models and smoothing in n-gram models. We believe that data noising could be an effective technique for regularizing neural language models that could help to overcome the data sparsity problem of imperfect natural language text and enable learning meaningful representations of erroneous tokens.

Figure 5: Our sentence alignment procedure. We align the original and the recognized sentences (x and x̃, respectively) using the sequence of edit operations a, which include insertions "i", deletions "d", and substitutions "s" of characters. We use "¬" and "¦" as placeholders for the insertion and the deletion operations, respectively. Matched characters are marked with "-". The alignment procedure produces a list of paired error-free and possibly erroneous words with class labels (optional).
To this end, we propose to include the data from noisy sources in the corpora used to train LM-based embeddings. Specifically, in this work, we learn a noisy language model using the output of an OCR engine ( §4.2), which captures the characteristics of OCR errors. Any other noisy source could be readily used to model related domains, e.g., ASR transcripts or ungrammatical text.

Sequence-to-Sequence Error Generator
To learn our error generators ( §4.1), we utilize the OpenNMT 3 toolkit (Klein et al., 2017). 4 We encode the input sentence at the character-level before feeding it to the seq2seq model. Subsequently, the output produced by the seq2seq model is decoded back to the original form ( Figure 6).
Figure 6: Sentence encoding-decoding schema. The whitespace characters are first replaced with a placeholder symbol "¬". The sentences are tokenized at the character level by adding whitespace between every pair of characters. Decoding reverses this process.
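The encoding-decoding schema of Figure 6 amounts to two small string transformations, sketched here (the function names are ours):

```python
def encode(sentence, placeholder="¬"):
    """Replace whitespace with a placeholder, then separate every
    character with a space for character-level seq2seq input."""
    return " ".join(sentence.replace(" ", placeholder))

def decode(tokens, placeholder="¬"):
    """Reverse of `encode`: drop the separating spaces, then restore
    the original whitespace from the placeholder."""
    return tokens.replace(" ", "").replace(placeholder, " ")
```

For example, `encode("Sailing is a passion.")` begins with `S a i l i n g ¬ i s`, and decoding the model output recovers a plain sentence.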

Unsupervised Parallel Data Generation
Following the approach from §4.2, we generated a large parallel corpus P to train our error generation and correction models. We sampled 10 million sentences 5 from the English part of the 1 Billion Word Language Model Benchmark 6 and used them as the source of error-free text, i.e., the seed corpus T . We rendered each sentence as an image using the Text Recognition Data Generator package 7 . We used 90 different fonts for rendering and applied random distortions to the rendered images. Subsequently, we performed OCR on each image of text using a Python wrapper 8 for Tesseract-OCR 9 (Smith, 2007). We present the distribution of error rates in our noisy corpus in Figure 2 (cf. the digitized text plot).

Sequence Labeling
Training Setup We employed the NAT framework 10 (Figure 1) to study the robustness of sequence labeling systems. Following Akbik et al. (2018), we used a combination of FLAIR and GloVe embeddings in all experiments. 11 We employed the data augmentation (L_AUGM) and the stability training (L_STAB) objectives with default weights (α = 1.0), as proposed by Namysl et al. (2020). Consistent with prior work, erroneous sentences x̃ were generated dynamically in every epoch.
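For reference, the two auxiliary objectives can be written roughly as follows. This is our paraphrase of the formulation in Namysl et al. (2020), with t denoting the gold label sequence; the exact weighting and similarity measure may differ from the original.

```latex
% Sketch of the NAT auxiliary objectives (after Namysl et al., 2020):
% data augmentation adds a standard loss term on the noisy copy,
% stability training penalizes divergence between the two outputs.
\mathcal{L}_{\mathrm{AUGM}} = \mathcal{L}_{0}\bigl(y(x), t\bigr)
    + \alpha \, \mathcal{L}_{0}\bigl(y(\tilde{x}), t\bigr)
\mathcal{L}_{\mathrm{STAB}} = \mathcal{L}_{0}\bigl(y(x), t\bigr)
    + \alpha \, \mathcal{L}_{\mathrm{sim}}\bigl(y(x), y(\tilde{x})\bigr)
```

With α = 1.0, the clean and the auxiliary terms contribute equally to the total loss.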

Tasks
We experimented with the Named Entity Recognition (NER) and Part-of-Speech Tagging (POST) tasks. NER aims to locate all named entity mentions in text and classify them into predefined classes, e.g., person names, locations, and organizations. POST is the process of tagging each word in the text with the corresponding part of speech.

Evaluation Setup
The evaluation pipeline is shown in Figure 7. Following Akbik et al. (2018), we report the entity-level micro-average F1 score for NER and the accuracy for POST.
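Entity-level scoring can be illustrated with a small sketch that extracts labeled spans from BIO tags and scores exact span matches. This is a simplification of ours: for the micro-average over a test set, the TP/FP/FN counts would be pooled across all sentences before computing F1.

```python
def bio_spans(tags):
    """Extract labeled entity spans (type, start, end) from a BIO tag
    sequence; an I- tag with a mismatched type is dropped (simplification)."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        boundary = tag == "O" or tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != etype)
        if boundary:
            if start is not None:
                spans.append((etype, start, i))
            start, etype = (i, tag[2:]) if tag.startswith("B-") else (None, None)
    return spans

def entity_f1(gold_tags, pred_tags):
    """Entity-level F1: an entity counts as correct only if both its
    type and its exact span boundaries match the gold annotation."""
    gold, pred = set(bio_spans(gold_tags)), set(bio_spans(pred_tags))
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For POST, plain token-level accuracy replaces this span-based scoring.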

Baselines
Error Generation We compared our error generator with the OCR-aware noise model from Namysl et al. (2020). We used the noisy part of the parallel corpus P to estimate the confusion matrix employed by this baseline.

Error Correction To evaluate error correction, we trained the sequence labeling models using the standard objective (L_0) and employed the text correction method on the erroneous input before feeding it to the network (Figure 7).
We examined Natas 12 , the seq2seq OCR post-correction method proposed by Hämäläinen and Hengchen (2019). We trained context-free error correction models compatible with Natas using our parallel corpus ( §5.2). Moreover, we also employed the widely adopted spell checker Hunspell 13 .

Data Sets
Original Benchmarks For NER, we employed the CoNLL 2003 data set (Tjong Kim Sang and De Meulder, 2003). To evaluate POST, we utilized the Universal Dependency Treebank (UD English EWT; Silveira et al., 2014). We present the detailed statistics of both data sets in Table 5.

Noisy Benchmarks Unfortunately, we did not find any publicly available noisy sequence labeling data set that could be used to benchmark different methods for improving robustness. To this end, we generated several noisy versions of the original sequence labeling data sets (Table 1). We extracted the sentences from each original benchmark and applied the procedure described in §4.2. 14 We transferred the word-level annotations as described in §4.4. Finally, we produced the data in the CoNLL format (Table 7). Moreover, to evaluate the transferability of error generators, we followed Namysl et al. (2020) and produced versions of the original test sets with synthetically induced misspellings. 15

Empirical Noise Generation Approaches
In this experiment, we compared the NAT models that employed either our seq2seq noise generators or the baseline error models (Table 2). In this evaluation scenario, we do not employ C(x) (Figure 7). Our error generators outperformed the OCR-aware confusion matrix-based model on the noisy benchmarks generated using the Tesseract 4 engine. The advantage of our method was less emphasized in the case of the Tesseract 3 ♣ data sets. The token-level translation method performed better than the sentence-level variant, while the latter was more efficient when the error rate of the input was lower (cf. the original data and the Tesseract 4 ♠ columns), although it often struggled with translating long sentences. Moreover, data augmentation generally outperformed stability training, which is consistent with the observation from Namysl et al. (2020).

14 We directly applied both Tesseract v3.04 and v4.0. We used different sets of distortions and image backgrounds than those employed to generate parallel training data. 15 We merged both sets of misspellings for evaluation.
Furthermore, we observe a slight decrease in accuracy on the original UD English EWT with both auxiliary objectives. We believe that this was caused by the different proportions of the tokens that were perturbed during training by our seq2seq error generators (e.g., 18% and 19.5% in the case of our token-level model for CoNLL2003 and UD English EWT, respectively). The trade-off between accuracy for clean and noisy data has thus been shifted towards the latter. We also notice a greater advantage of the seq2seq method over the baseline  on the noisy UD English EWT data sets. Additionally, in §A, we analyze the relationship between the size of the parallel corpus used for training and the F1 score of the NER task.

Error Generation vs. Error Correction
We compared the NAT approach with the baseline correction methods ( §5.4). Preliminary experiments revealed that these baselines underperformed due to the overcorrection problem. To make them more competitive, we extended their default dictionaries by adding all tokens from the corresponding test sets for evaluation. Although the vocabulary of a test set could rarely be entirely determined, this setting would simulate a scenario where accurate in-domain vocabularies could be exploited. Table 2 includes the results of this experiment. As expected, although more general, error correction techniques were outperformed by the NAT approach regardless of the noising method used. Surprisingly, Hunspell performed better than Natas on CoNLL 2003. We carried out a thorough inspection of the results of both methods and found out that Natas, although generally more accurate, had problems with recognizing tokens that were a part of entities. This behavior could be a flaw of data-driven error correction methods, as the entities are relatively rare in written text and are often out-of-vocabulary tokens (Alex and Burns, 2014).

Noisy Language Modeling
FLAIR (Akbik et al., 2018) learns a bidirectional LM to represent sequences of characters. We used the target side of our parallel data corpus ( §5.2) to re-train FLAIR embeddings on the noisy digitized text. 16 Subsequently, we compared the accuracy of the vanilla NAT models ( §3.2) that employed either the pre-trained or our NLM embeddings. Moreover, we do not use C(x) in this scenario (Figure 7).
Note that the noise model and the embeddings are two distinct components of the NAT architecture (Γ and E(x) in Figure 1, respectively) and therefore they could be easily combined. However, in this work, we do not mix our NLM with empirically estimated error models to avoid the twofold empirical error modeling effect. We leave the evaluation of this combination to future work. Table 3 summarizes the results of this experiment. Our method significantly improved the accuracy across all training objectives, even when we employed exclusively the standard training objective for the sequence labeling task (L 0 ). Surprisingly, we also achieved evident improvements for the noisy data set generated using the Tesseract 3 engine, which confirms that NLM embeddings can model the features of erroneous tokens even in the out-of-domain scenarios. On the other hand, the NLM slightly decreased the accuracy on the original data for the standard training objective. We plan to investigate this effect in future work by eliminating possible differences in the pre-training procedure and comparing our NLM against a model trained on the original error-free text corpus instead of using the embeddings from Akbik et al. (2018).

Human-Generated Errors
In this experiment, we evaluated the utility of our seq2seq error generators learned to model OCR noise ( §6.1) and our NLM embeddings ( §6.3) in a scenario where the input contains human-generated errors. For evaluation, we used the noisy data sets with synthetically induced misspellings ( §5.5). We do not employ C(x) in this scenario (Figure 7).

16 The hyper-parameters were consistent with prior work.

Table 4: Transferability of the methods learned to model OCR noise to the distribution of the human-generated errors ( §6.4): (a) Comparison of the NAT approach with and without our NLM embeddings on the English CoNLL 2003 test set with human-generated errors. (b) Comparison of empirical error generation approaches on the English CoNLL 2003 and the UD English EWT test sets with human-generated errors. We report mean and standard deviation F1 scores (CoNLL 2003) and accuracies (UD English EWT) over five runs with different random initialization. L_0, L_AUGM, and L_STAB are the standard, the data augmentation, and the stability objectives, respectively (Namysl et al., 2020). The NLM column indicates whether the model employed our NLM embeddings. Bold values indicate top results (within the models trained using the same objective) that are statistically inseparable (Welch's t-test; p < 0.05).

Table 4 summarizes the results of this experiment. The models with our NLM embeddings outperformed the baselines for all training objectives (Table 4a). The seq2seq error generation approach performed on par with the confusion matrix-based models on the CoNLL 2003 data set, while the latter achieved better accuracy on the UD English EWT data set (Table 4b).
We believe that this difference was caused by the discrepancy between the data distributions. Note that although the data used in this experiment reflects the patterns of human-generated errors, the distribution of these errors does not necessarily follow the natural distribution of human-generated errors, as it was synthetically generated using a fixed replacement probability that was uniform across all candidates. 17 Nevertheless, our methods proved to be beneficial in this scenario, which would suggest that the errors made by human writers and by the text recognition engines have common characteristics that were exploited by our method.

Conclusions
In this work, we studied the task of performing sequence labeling on noisy digitized and human-generated text. We extended the NAT approach and proposed the empirical error generator that performs the translation from error-free to erroneous text ( §4.1). To train our generator, we developed an unsupervised parallel data synthesis method ( §4.2). Analogously, we produced several realistic noisy evaluation benchmarks ( §5.5). Moreover, we introduced the NLM embeddings ( §4.5) that overcome the data sparsity problem of natural language.

17 For comparison, we visualized the error distributions of our noisy benchmarks in Figure 9.
Our approach outperformed the baseline noise induction and error correction methods, improving the accuracy of the noisy neural sequence labeling task ( §6.1-6.3). Furthermore, we demonstrated that our methods are transferable to out-of-domain scenarios: human-generated errors ( §6.4) and the noise induced by a different OCR engine ( §6.1, 6.3). We incorporated our approach into the NAT framework and make the code, embeddings, and scripts from our experiments publicly available.
Grundkiewicz and Junczys-Dowmunt (2019) showed that unsupervised systems benefit from domain adaptation on authentic labeled data. For future work, we plan to fine-tune NAT models pre-trained on synthetic samples using the labeled data generated directly by the noising process.

A Relationship with the Corpus Size
Empirical error generators are especially beneficial when we can approximate the noise distribution to be encountered at test time. In this experiment, we aimed to answer how much parallel training data is required to train a solid seq2seq error generation model. Figure 8 shows that the NAT models that used our seq2seq error generator performed better than those employing the baseline vanilla error model proposed by Namysl et al. (2020) on all noisy benchmarks generated with the Tesseract 4 OCR engine. The improvements were observed even when we used as few as 1,000 parallel training sentences. Our method also outperformed the baseline on the original CoNLL 2003 benchmark. In contrast, the accuracy of the models trained using our generator fell slightly behind the baseline on the Tesseract 3 ♣ and Typos data sets.

B Sequence Labeling Data Sets
Original Benchmarks Table 5 presents the detailed statistics of the original sequence labeling benchmarks used in our experiments. For NER, we employed CoNLL 2003 18 (Tjong Kim Sang and De Meulder, 2003). To evaluate POST, we utilized the Universal Dependency Treebank (UD English EWT 19; Silveira et al., 2014).

Table 6 presents the error rates and the correction accuracies of the Natas and Hunspell methods calculated on the test sets of the noisy sequence labeling benchmarks. Moreover, Table 7 shows an excerpt from a noisy sequence labeling data set generated for evaluation. Furthermore, Figure 9 presents the distribution of token error rates in relation to the percentage of sentences in our noisy data sets. For comparison, we also included the distributions obtained by applying different noise generation methods: the vanilla and the OCR-aware confusion matrix-based error models.

We note that the error distribution of our noisy data sets is closer to a Zipf distribution, in contrast to the results of the prior methods, which exhibit a bell-curve pattern. Note that the Typos data set was generated by randomly sampling possible lexical replacement candidates from the lookup tables; hence, its distribution exhibits slightly different characteristics than the noisy data sets generated by directly applying the OCR engine to the rendered text images. Based on the above results, we believe that our noisy data sets are better suited for evaluating the robustness of sequence labeling models than the data generated by the prior approaches.
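The per-sentence statistic underlying these distributions can be computed as in the following sketch. The bucketing convention mirrors the one used in Figure 9 (a rate in (40, 50] falls into the 50 bucket); the function names are our own:

```python
import math

def token_error_rate(clean_tokens, noisy_tokens):
    """Percentage of aligned tokens that differ between the clean
    and the noisy variant of a sentence."""
    errors = sum(c != n for c, n in zip(clean_tokens, noisy_tokens))
    return 100.0 * errors / len(clean_tokens)

def error_rate_distribution(pairs, step=10):
    """Bucket each sentence's token error rate into ranges of `step`
    percentage points, keyed by the upper bound of the range.
    Returns the percentage of sentences that falls into each bucket."""
    buckets = {}
    for clean, noisy in pairs:
        rate = token_error_rate(clean, noisy)
        upper = 0 if rate == 0 else step * math.ceil(rate / step)
        buckets[upper] = buckets.get(upper, 0) + 1
    return {k: 100.0 * v / len(pairs) for k, v in sorted(buckets.items())}
```

Plotting the bucket percentages against the bucket upper bounds yields curves of the kind shown in Figure 9.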

Noisy Benchmarks
Data Conversion Scripts For licensing and copyright reasons, we did not submit the noisy data sets directly. Instead, our code includes scripts for converting the original benchmarks into their noisy variants. For reference, we added excerpts of the noisy UD English EWT data set in the supplementary materials.
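A minimal conversion script along these lines might look as follows, assuming a whitespace-separated, one-token-per-line CoNLL-style format with sentence boundaries marked by blank lines; the noising function is supplied by the caller:

```python
def convert_to_noisy(lines, noising_fn):
    """Create a noisy variant of a CoNLL-style benchmark: noise the
    surface token (first column) and keep the remaining columns, so
    the labels stay aligned with the noisy tokens."""
    noisy_lines = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:  # blank line = sentence boundary
            noisy_lines.append("")
            continue
        fields = line.split()
        fields[0] = noising_fn(fields[0])
        noisy_lines.append(" ".join(fields))
    return noisy_lines
```

Keeping the labels untouched is what makes the converted files directly usable as noisy evaluation benchmarks.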

C Reproducibility
In this section, we present additional information that could facilitate reproducibility.
Hyper-parameters To train our seq2seq translation models, we generally used the default hyper-parameters of the OpenNMT toolkit. We list all non-default values in Table 8. Moreover, we decayed the learning rate eight times during training for all models. Furthermore, we utilized copy attention (See et al., 2017) for our error generation models and global attention (Luong et al., 2015) for the error correction model. Table 9 summarizes the validation accuracy of our seq2seq models for error generation. We trained the sentence-level models for 1.6×10⁴ and the token-level models for […] steps, and achieved 96.9% accuracy on the validation set of 5000 sentences.

Figure 9: Distributions of the token error rates of sentences in our noisy sequence labeling data sets (Tesseract 3 ♣, Tesseract 4 ♦, Tesseract 4 ♠, and Typos). For comparison, we include the error distributions obtained by applying our seq2seq token-level error generator and the baseline confusion matrix-based error models (Namysl et al., 2020) to the sentences extracted from the original benchmark. η CER is the character-level noising factor used by the vanilla error model. Each point is the percentage of sentences with a token error rate that falls into a specific range, i.e., the value of 50 corresponds to sentences with a token error rate greater than 40 and lower than or equal to 50.

Table 7: Example of a sentence from the noisy CoNLL 2003 data set. The first and the second column contain the noisy and the error-free tokens, respectively. The third column denotes the class label in BIO format.
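The learning-rate schedule mentioned above (eight decays over the course of training) can be illustrated with the following sketch. The halving factor and the evenly spaced decay points are assumptions for illustration, not values reported here:

```python
def decayed_lr(initial_lr, step, total_steps, num_decays=8, factor=0.5):
    """Step-wise schedule that multiplies the learning rate by `factor`
    at `num_decays` evenly spaced points over training."""
    interval = total_steps // (num_decays + 1)
    decays_applied = min(step // interval, num_decays)
    return initial_lr * factor ** decays_applied
```

With `num_decays=8`, the final learning rate is `initial_lr * factor**8`, i.e., roughly 0.4% of the initial value when halving.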

Learnable Parameters
The number of parameters in our sequence labeling models was constant across different models, as we used the same architecture in all experiments. The number of all model parameters was 60.3 million (including embeddings that were fixed during training), and the number of trainable parameters was 25.5 million. Moreover, all our seq2seq error generation and correction models had about 7.7 million parameters.

Table 9: Validation accuracy of the seq2seq models for error generation. We trained both the token-level and the sentence-level variants. The first and the second column show the number of parallel sentences used for training and validation, respectively.
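The split between total and trainable parameters (the frozen embeddings account for the difference) can be computed as in this sketch; the shape/flag representation is a simplification of what a framework such as PyTorch would provide:

```python
def count_parameters(param_specs):
    """param_specs: iterable of (shape, trainable) pairs, where shape
    is a tuple of dimensions. Returns (total, trainable) counts."""
    total = trainable = 0
    for shape, is_trainable in param_specs:
        n = 1
        for dim in shape:
            n *= dim
        total += n  # frozen parameters still count toward the total
        if is_trainable:
            trainable += n
    return total, trainable

# Toy model: a frozen embedding matrix plus a trainable output layer.
specs = [((1000, 50), False), ((50, 10), True), ((10,), True)]
```

Reporting both numbers, as done above, makes clear how much of the model capacity is actually updated during training.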
Average Runtime Evaluating the complete test set took 7 and 10 seconds on average for UD English EWT and English CoNLL 2003, respectively. The runtime did not depend on the training method used. However, when we employed a correction method, the runtime increased significantly, e.g., it took almost 3 minutes to evaluate a model that employed the Natas correction method on English CoNLL 2003.
Computing Architecture The evaluation was performed on a workstation equipped with an Intel Xeon CPU with 10 cores and an Nvidia Quadro RTX 6000 graphics card with 24GB of memory.