Tokenization Consistency Matters for Generative Models on Extractive NLP Tasks

Generative models have been widely applied to solve extractive tasks, where parts of the input are extracted to form the desired output, and have achieved significant success. For example, in extractive question answering (QA), generative models have consistently yielded state-of-the-art results. In this work, we identify the issue of tokenization inconsistency that is commonly neglected in training these models. This issue damages the extractive nature of these tasks when the input and output are tokenized inconsistently by the tokenizer, and thus leads to performance drops as well as hallucination. We propose a simple yet effective fix to this issue and conduct a case study on extractive QA. We show that, with consistent tokenization, the model performs better on both in-domain and out-of-domain datasets, with a notable average gain of +1.7 F1 when a BART model is trained on SQuAD and evaluated on eight QA datasets. Further, the model converges faster and becomes less likely to generate out-of-context answers. With these findings, we would like to call for more attention to how tokenization is done when solving extractive tasks, and recommend applying consistent tokenization during training.


Introduction
Pretrained sequence-to-sequence (seq2seq) models have achieved remarkable success in a wide range of tasks (Lewis et al., 2020; Raffel et al., 2020). As an important component of these models, the tokenizer has been frequently discussed, including different tokenization methods and the model's robustness to different tokenizations (Provilkov et al., 2020).
In our work, we identify an issue of tokenization consistency that can affect the performance of seq2seq models. More specifically, when a seq2seq task has an extractive nature, i.e., parts of the output text are extracted from the input text, the desired output can be tokenized differently from how it is tokenized in the input, leading to tokenization inconsistency during model training. For example, in the extractive question answering (QA) task, which takes a context-question pair as input and outputs a span of the context as the answer, the answer might be tokenized differently from how it appears in the given context (Figure 1). This seemingly minor difference in tokenization at training time can have a notable impact on model performance during inference. We use extractive QA as a case study to identify when tokenization inconsistency happens and propose a simple and effective approach to mitigate this issue: extracting the tokenized answer from the context for training. We discover that, when fine-tuned with consistently tokenized instances, the model 1) achieves better in-domain and out-of-domain performance (+0.9% F1 in-domain and +2.0% zero-shot out-of-domain), 2) converges faster during training, and 3) is less likely to generate out-of-context output (i.e., less likely to hallucinate textually). To our knowledge, this is the first work that identifies the issue of tokenization consistency in extractive tasks and proposes a simple yet effective solution. We note that inconsistent tokenization can affect tasks beyond QA that can be cast as seq2seq tasks with an extractive nature, e.g., coreference resolution and summarization, and we call on researchers and practitioners to consider applying the consistent tokenization technique when working on these tasks.

Related Work
Byte-Pair-Encoding (BPE) (Sennrich et al., 2016), language-model-based segmentation (Kudo, 2018), and their variants (Kudo and Richardson, 2018) are commonly used as the tokenizers for NLP models due to their simplicity and universality.
Recent research has identified these tokenization approaches as sources of poor model generalization. For example, BPE has been shown to produce less than ideal morphological segmentation (Bostrom and Durrett, 2020; Provilkov et al., 2020), and the same text can be tokenized differently when different BPE rules are applied (Kudo, 2018). Provilkov et al. (2020) propose BPE-Dropout, which stochastically corrupts the segmentation so that it is not produced deterministically. He et al. (2020) further propose Dynamic Programming Encoding (DPE), which marginalizes over the output segmentation as a latent variable to create better segmentation. Vilar and Federico (2021) extend BPE to generate morphologically advantageous segmentation for machine translation (MT). These approaches improve model robustness and generalization either by using stochastic inputs during training or by modifying how BPE segmentation is done. In contrast, we propose a simple, deterministic, and effective approach that does not alter the segmentation method in any way.
Besides these generic approaches, researchers have also investigated domain-specific approaches to improve segmentation on particular kinds of text. Petrak et al. (2022) propose a new pre-training approach aimed at overcoming the inconsistency of tokenizing numbers and enhancing the model's numerical reasoning ability. Rosenbaum et al. (2022) use "space-joined tokens" to resolve string-matching anomalies after tokenization that lead to unfair evaluation in semantic parsing. In contrast, we look at inconsistency from a more general perspective: not only can numbers and preceding spaces be tokenized inconsistently, but so can punctuation and its compounds.
In addition to subword segmentation methods, another line of research focuses on character- and byte-level modeling as token-free procedures (Graves, 2013; Al-Rfou et al., 2019; Xue et al., 2022; Tay et al., 2022). Li et al. (2021) argue that character-level models better capture morphological phenomena and rare words in low-resource machine translation, while Gaido et al. (2021) show that character-based segmentation can help alleviate gender bias during translation. While tokenization-free methods show potential to surpass subword segmentation approaches, we keep our focus on the consistency issue of subword tokenization.

Consistent Tokenization: What It Is and How To Achieve It
Consistent Tokenization Consider a seq2seq task that takes text x = (x_1, ..., x_n) as input and outputs y = (y_1, ..., y_m). When the task is extractive, there exist two sets of indices, I and J, such that x_I = y_J. In the example of Figure 1, x is the context together with the question, y is the answer "1912", and x_I = y_J correspond to the string "1912". Let tokenization be a function T that maps text into a sequence of token ids. Suppose x_I maps to T(x)_I and y_J maps to T(y)_J, where I here denotes the positions of the tokens corresponding to x_I in the tokenized input, and J denotes the positions of y_J in the tokenized output. Note that T(x)_I = T(y)_J is not always true: in Figure 1, x_I is mapped to the id of "Ġ1912", while y_J is mapped to the ids of "19" and "12", because there is no preceding space in y_J. We call it inconsistent tokenization when T(x)_I ≠ T(y)_J; analogously, tokenization is consistent when T(x)_I = T(y)_J. Inconsistent tokenization can emerge due to preceding spaces, numbers, or punctuation when using the BPE tokenizer. SentencePiece is less subject to inconsistency caused by preceding spaces, although it also shows inconsistency when tokenizing numbers and punctuation.
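To make the definition concrete, the snippet below is a minimal sketch, assuming the HuggingFace transformers library and the facebook/bart-base checkpoint (a byte-level BPE tokenizer), of how the same answer string can receive different token ids in isolation and inside the context; the example context and the exact subword splits shown in the comments are illustrative, not taken from the paper.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/bart-base")

context = "Alan Turing was born in 1912 in London."  # illustrative context
answer = "1912"                                      # gold answer span

# In the context the year follows a space, so BPE may keep it as one token
# (e.g. 'Ġ1912'); in isolation it may split differently (e.g. '19', '12').
print(tok.tokenize(context))
print(tok.tokenize(answer))

ctx_ids = tok(context, add_special_tokens=False)["input_ids"]
ans_ids = tok(answer, add_special_tokens=False)["input_ids"]

# Inconsistent tokenization: the answer ids are not a contiguous subsequence
# of the context ids, even though the answer text occurs in the context.
is_consistent = any(
    ctx_ids[i:i + len(ans_ids)] == ans_ids
    for i in range(len(ctx_ids) - len(ans_ids) + 1)
)
print("consistent tokenization:", is_consistent)
```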
Consistent Tokenization Training When the input and output are tokenized inconsistently, the task can no longer be solved by simply extracting the output from the input token ids; it requires an additional step from the model to "paraphrase" the input ids into output token ids that do not exist in the input. Therefore, learning from inconsistently tokenized instances is inherently harder for the model than learning from their consistently tokenized counterparts.
Instead of tokenizing the output on its own, we propose to retrieve T(y)_J from T(x)_I so that tokenization is always consistent across all (x, y) pairs. Compared to proposing a new tokenization method that is immune to inconsistency, this is a simple yet effective fix that every researcher can implement with trivial effort and without the need to pretrain the model again.
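A minimal sketch of this fix, assuming a HuggingFace fast tokenizer with character-offset mapping (the helper name and the overlap heuristic are ours, not the authors' exact implementation): given the character span of the gold answer in the context, slice the answer's token ids directly out of the tokenized context.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/bart-base")

def consistent_answer_ids(context: str, answer_start: int, answer_end: int) -> list[int]:
    """Return the gold answer's token ids as they appear in the tokenized context.

    answer_start / answer_end are character offsets of the answer span in `context`
    (end exclusive). Tokens whose character span overlaps the answer span are kept,
    so the target ids are exactly the ids the answer has inside the context.
    """
    enc = tok(context, add_special_tokens=False, return_offsets_mapping=True)
    return [
        token_id
        for token_id, (start, end) in zip(enc["input_ids"], enc["offset_mapping"])
        if start < answer_end and end > answer_start
    ]
```

Targets built this way are by construction a contiguous subsequence of the tokenized context, so the answer text never needs to be re-tokenized on its own.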

Case Study: Extractive QA with Generative Models
In this work, we use extractive QA as a representative extractive task. Extractive QA is an ideal candidate for studying the effect of consistent tokenization, since its output is always a substring of the input. Recent work has demonstrated that applying generative models to this task leads to strong performance gains (Lewis et al., 2020; Raffel et al., 2020; Brown et al., 2020; Izacard and Grave, 2021) and greater potential to unify different input formats (Khashabi et al., 2020).

Task Description
In extractive QA, the model is given a question with a context and is expected to output a span (substring) of the context as the answer. Extractive models are typically configured with the question and context as input and trained to return start and end indices that indicate the location of the predicted answer in the context. To apply a generative model to this task, we simply replace the index prediction task with the task of directly predicting the answer string from the decoder of a seq2seq model, hence the need to tokenize the answer string separately from the context.
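Reusing `tok` and `consistent_answer_ids` from the earlier sketches, one way a training instance for the generative model could be assembled is shown below; the `question + "\n" + context` concatenation and the field names are illustrative assumptions rather than the paper's exact input format.

```python
def build_training_example(question: str, context: str,
                           answer_start: int, answer_end: int) -> dict:
    # Encoder input: question and context concatenated into one source string.
    source_ids = tok(question + "\n" + context, truncation=True)["input_ids"]
    # Decoder target: the answer's ids sliced out of the tokenized context
    # (consistent tokenization), followed by the end-of-sequence token.
    target_ids = consistent_answer_ids(context, answer_start, answer_end) + [tok.eos_token_id]
    return {"input_ids": source_ids, "labels": target_ids}
```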

Experimental Setup
Data We use extractive QA datasets from the MRQA shared task (Fisch et al., 2019); models are fine-tuned on SQuAD, TriviaQA, and NewsQA and evaluated both in-domain and on out-of-domain datasets.

Model Among widely used generative models, BART-base (Lewis et al., 2020) is used for the experiments. Compared to T5, which uses SentencePiece (Kudo and Richardson, 2018) as its tokenizer, the BART tokenizer is more likely to produce inconsistent tokenization (Table 1), and can therefore provide us with more illustrative results. For each dataset, we fine-tune two variants of the model: the first variant (denoted original) tokenizes gold answers separately from the contexts, and the second variant (denoted consistent) applies our method to guarantee consistent tokenization.


Results

We report F1 scores in Table 2 and EM scores in Appendix A.

Consistent tokenization training improves in-domain QA performance. Overall, we observe a statistically significant improvement in F1 with the consistent variants on all of SQuAD, TriviaQA, and NewsQA, with in-domain performance gains of 1.0, 1.1, and 0.6 for the three datasets, respectively (marked by shaded cells). A similar observation is made with the EM results. One potential explanation for the improvement is that the task becomes inherently simpler with consistent tokenization. We hypothesize that consistent tokenization allows the model to simply extract answers from the context instead of also needing to learn to paraphrase the answer into something that does not exist in the context, as discussed in Section 3. We validate this hypothesis below when we examine the convergence speed of models trained with and without consistent tokenization.
Consistent tokenization also improves zero-shot QA performance on out-of-domain datasets. We examine zero-shot model performance on unseen domains in Table 2. We find that on all the out-of-domain datasets, the F1 of the consistent model is either comparable to or higher than that of the original model, with the largest gain exceeding 4%. This implies that training with consistent tokenization systematically improves model generalization rather than merely fitting the training domain better. The improvement on unseen domains can also be explained by the reduction in difficulty: training with consistent tokenization provides the model with the information that the answer must exist within the input, so the model is no longer required to extensively search the entire vocabulary on a previously unknown domain.
Training with consistent tokenization leads to faster model convergence and improved model confidence in the gold answer. We present the training curves of models fine-tuned on SQuAD with both versions of tokenization in Figure 2, and include training curves for the other datasets in Appendix B. Across all the datasets, the consistent models converge faster than the original models. This finding aligns with our discussion that extractive QA is easier when it can be solved by simply extracting the answers at the token level.
To further validate this, we also examine the log perplexity of the gold answer for each instance and compare the distributional difference between the consistent and original models (Figure 3). The overall distribution of the log perplexity difference leans to the left, suggesting that the model is more confident in generating gold answers when tokenization consistency is enforced during training.
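As a rough sketch of how this measurement could be reproduced (reusing `tok` from the earlier sketches; the checkpoint name is a placeholder for a fine-tuned model, not the paper's released weights), the per-instance log perplexity of the gold answer can be read off the seq2seq loss, which HuggingFace models average over the label tokens.

```python
import torch
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")  # placeholder checkpoint
model.eval()

def answer_log_perplexity(source: str, answer_ids: list[int]) -> float:
    """Mean negative log-likelihood of the gold answer tokens given the source."""
    enc = tok(source, return_tensors="pt", truncation=True)
    labels = torch.tensor([answer_ids + [tok.eos_token_id]])
    with torch.no_grad():
        loss = model(input_ids=enc["input_ids"],
                     attention_mask=enc["attention_mask"],
                     labels=labels).loss
    return loss.item()
```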
Training with consistent tokenization leads to less textual hallucination. An important angle from which to examine generative models is hallucination. It is worth noting that the general meaning of hallucination, or more precisely factual hallucination, refers to the problem of a model producing factual content that is either inconsistent with or ungrounded in the input. Here, in the context of extractive QA, we instead refer to a more specific notion of textual hallucination, where the QA model produces an out-of-context answer sequence that does not align with any span of text in the input. We show the percentage of instances on which models generate out-of-context answers for different datasets in Figure 4. An example of such out-of-context answers can be found in Appendix E.
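For reference, a very simple way to flag such outputs (our reading of textual hallucination; the exact matching rule used for Figure 4 may differ, e.g. in whitespace normalization) is to check whether the generated string occurs verbatim in the context.

```python
def is_out_of_context(prediction: str, context: str) -> bool:
    """True if the generated answer does not appear verbatim in the context."""
    return prediction.strip() not in context

# E.g. the hallucinated prediction "Dorfos" from Appendix E does not occur in its
# context, whereas the gold answer "Doritos" does.
```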
We find that when trained with consistent tokenization, the model is less likely to textually hallucinate. We conjecture that this is because, with inconsistently tokenized training data, the model is undesirably exposed to a large number of context-answer pairs that are not directly aligned at the token level. This misalignment between the tokenized answer and its inconsistent form in the tokenized context likely leads to the higher hallucination rate.

Conclusion
We identify the issue of tokenization consistency in extractive tasks, propose an easy-to-implement method to guarantee consistent tokenization, and show that the model benefits in several respects when trained with consistent tokenization. We find that models show clear improvements in in-domain performance, convergence speed, out-of-domain adaptation, and textual hallucination when trained with consistently tokenized instances. It is worth noting that inconsistent tokenization may affect any extractive task. Based on these findings, we recommend applying consistent tokenization to inputs and outputs whenever researchers or practitioners are tackling extractive tasks.
In this work, we only investigate consistent tokenization in fine-tuning. Future work might consider the effect of consistent tokenization in in-context learning. In addition, it is possible that fine-tuning with consistent tokenization does not align with the model's pre-training objective, but the role of the pre-training objective is not explored in this work. Furthermore, besides BPE tokenizers, there are other tokenizers that do not produce encodings by merging subwords; whether these tokenizers produce a non-negligible amount of inconsistent tokenizations is unknown. We leave these questions for future study.

A In-domain and Out-of-domain Exact Match Accuracy
The exact match accuracy is shown in Table 3. Similar to what we observe in Section 5, the model obtains better performance when trained with consistent tokenization.

B Learning Curve on Other Training Sets
The learning curves of models trained on TriviaQA and NewsQA are shown in Figures 5 and 6, in both of which the consistent model converges faster than the original model.

C Log Perplexity Difference on Out-of-Domain Datasets
Figure 3 shows the log perplexity difference between the consistent and original models on the in-domain dataset. We present an example of the log perplexity difference on an out-of-domain dataset, BioASQ, in Figure 7.

E Example of Out-of-Context Generation
Table 4 presents examples of out-of-context answers generated by the model.

Context:
CBS set the base rate for a 30-second advertisement at $5,000,000, a record high price for a Super Bowl ad. As of January 26, the advertisements had not yet sold out. CBS mandated that all advertisers purchase a package covering time on both the television and digital broadcasts of the game, meaning that for the first time, digital streams of the game would carry all national advertising in pattern with the television broadcast. This would be the final year in a multi-year contract with Anheuser-Busch InBev that allowed the beer manufacturer to air multiple advertisements during the game at a steep discount. It was also the final year that Doritos, a longtime sponsor of the game, held its "Crash the Super Bowl" contest that allowed viewers to create their own Doritos ads for a chance to have it aired during the game. Nintendo and The Pokémon Company also made their Super Bowl debut, promoting the 20th anniversary of the Pokémon video game and media franchise.

Question:
Which company has held contests for fans to create their own ad for the company?
Gold Answer: Doritos
Prediction: Dorfos

Context:
LONDON, England (CNN) - Israeli military action in Gaza is comparable to that of German soldiers during the Holocaust, a Jewish UK lawmaker whose family suffered at the hands of the Nazis has claimed. A protester confronts police in London last weekend at a demonstration against Israeli action in Gaza. Gerald Kaufman, a member of the UK's ruling Labour Party, also called for an arms embargo on Israel, currently fighting militant Palestinian group Hamas, during the ...

Question:
What does the lawmaker say?
Gold Answer: Israeli military action in Gaza is comparable to that of German soldiers during the Holocaust
Prediction: Nazi soldiers during the Holocaust

Figure 1 :
Figure 1: An example of tokenization inconsistency in training a BART model (which uses the BPE tokenizer) for extractive QA. The number "1912" is tokenized differently alone (blue) and in context (green), because, unlike in the context, the answer is often provided without a preceding space, which triggers different BPE merging rules during tokenization. We propose to extract the tokenized answer from the context (green) for training.

Figure 2 :
Figure 2: Learning curve of BART with original tokenization and consistent tokenization.

Figure 3 :
Figure 3: Distribution of instances with respect to LP(consistent) - LP(original), where LP denotes log perplexity. The model is trained on SQuAD. When an instance lies to the left of the dotted line (LP difference less than zero), the consistent model is more confident in generating the gold answer than the original model.

Figure 4 :
Figure 4: Percentage of instances for which the model generates out-of-context answers during inference. The models are trained on SQuAD and the numbers are averaged over three random seeds. Appendix D reports the percentage of out-of-context answers when the models are trained on TriviaQA and NewsQA.

Figure 6 :
Figure 6: Learning curve of BART with original tokenization and consistent tokenization. The models are trained on NewsQA.

Figures 9 and 8 show the textual hallucination rate when the models are trained on NewsQA and TriviaQA, respectively.

Figure 7 :
Figure 7: Distribution of instances with respect to LP(consistent) - LP(original) on the BioASQ development set, where LP denotes log perplexity. The model is trained on SQuAD. When an instance lies to the left of the dotted line (LP difference less than zero), the consistent model is more confident in generating the gold answer than the original model. Compared to Figure 3, this figure shows an example of the log perplexity difference on an out-of-domain dataset.

Table 4 :
Table 4: Examples of out-of-context answers generated by the model.

Figure 9 :
Figure 9: Percentage of instances for which the model generates out-of-context answers during inference. The models are trained on NewsQA and the numbers are averaged over three random seeds.

Table 1 :
Percentage of training instances whose tokenized gold answers do not exist verbatim in the tokenized context, using the BART and T5 tokenizers. The BART tokenizer exhibits a more significant inconsistency issue than the SentencePiece tokenizer used by T5.

Table 2 :
F1 of BART QA models fine-tuned on different datasets (first column) and evaluated on in-domain and out-of-domain datasets. Original denotes models fine-tuned with original tokenization and consistent denotes models fine-tuned with consistent tokenization (our method). Shaded cells indicate in-domain evaluation results. All results are averaged over three random seeds. * marks results with a statistically significant improvement (p < 0.05) over the other model variant on the same dataset.

Table 3 :
EM of BART models fine-tuned on different datasets (first column) and evaluated on in-domain and out-of-domain datasets. Original denotes models fine-tuned with original tokenization and consistent denotes models fine-tuned with consistent tokenization. All results are averaged over three random seeds.