Improving Tokenisation by Alternative Treatment of Spaces

Tokenisation is the first step in almost all NLP tasks, and state-of-the-art transformer-based language models all use subword tokenisation algorithms to process input text. Existing algorithms have problems, often producing tokenisations of limited linguistic validity, and representing equivalent strings differently depending on their position within a word. We hypothesise that these problems hinder the ability of transformer-based models to handle complex words, and suggest that these problems are a result of allowing tokens to include spaces. We thus experiment with an alternative tokenisation approach where spaces are always treated as individual tokens. Specifically, we apply this modification to the BPE and Unigram algorithms. We find that our modified algorithms lead to improved performance on downstream NLP tasks that involve handling complex words, whilst having no detrimental effect on performance in general natural language understanding tasks. Intrinsically, we find our modified algorithms give more morphologically correct tokenisations, in particular when handling prefixes. Given the results of our experiments, we advocate for always treating spaces as individual tokens as an improved tokenisation method.


Introduction
Tokenisation is a key initial step in processing natural language, as it identifies the linguistic units to be processed, converting them to numerical IDs which can then be vectorised and manipulated by mathematical operations.
Earlier NLP approaches used simple string-searching techniques with regular expressions to tokenise text; however, these pattern-matching tokenisation methods have drawbacks: they require large vocabulary sizes to cover the training data, they cannot handle out-of-vocabulary words, and they do not work for languages without spaces as word boundaries. To address these issues, subword tokenisation was introduced. The first explicit mention (and popularisation) of this approach was by Sennrich et al. (2015), though it was indirectly introduced earlier by Schuster and Nakajima (2012). This method works by learning from training data to build a vocabulary (of a fixed size) and then tokenising text at inference time using this vocabulary (and possibly other learnt parameters). More frequent words are represented as single tokens, with rare words being broken down into multiple subword tokens, possibly down to the character level.
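To make this behaviour concrete, here is a toy greedy longest-match tokeniser over a hypothetical fixed vocabulary (our own illustration, not any specific production algorithm): frequent words stay whole, while rare words fall back to subword pieces.

```python
# Toy subword vocabulary (hypothetical): frequent words and common pieces,
# plus single characters as a fallback.
VOCAB = {"the", "jump", "ing", "un", "believ", "able", "a", "b", "l", "e",
         "j", "u", "m", "p", "i", "n", "g", "t", "h"}

def greedy_tokenise(word, vocab):
    """Left-to-right longest-match tokenisation over a fixed vocabulary."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest substring first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"character {word[i]!r} not in vocabulary")
    return tokens

print(greedy_tokenise("the", VOCAB))           # ['the']: frequent word, one token
print(greedy_tokenise("jumping", VOCAB))       # ['jump', 'ing']
print(greedy_tokenise("unbelievable", VOCAB))  # ['un', 'believ', 'able']
```

Real algorithms differ in how the vocabulary is learnt and how the segmentation is chosen (BPE applies learnt merges; Unigram maximises a language-model score), but the frequent-whole/rare-split behaviour is common to all of them.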
The BPE and Unigram algorithms are implemented in the SentencePiece library (Kudo and Richardson, 2018). There is a lack of clarity regarding SentencePiece in the literature, with it being erroneously considered as its own algorithm rather than an implementation of other algorithms. For example, in the paper introducing T5 (Raffel et al., 2019), the authors state that they "use SentencePiece to encode text as WordPiece tokens", but WordPiece is not in fact implemented in SentencePiece. Looking at their code, we find that they use the default SentencePiece implementation, which is Unigram. XLNET (Yang et al., 2019) say they tokenise with SentencePiece, but do not say which algorithm they use; again, looking at their code, we find they use the default of Unigram. Equivalently, ALBERT (Lan et al., 2019) say that they tokenise with SentencePiece as for XLNET, meaning they again use Unigram.
Despite their ubiquity, existing tokenisation algorithms have problems, which we hypothesise hinder the ability of language models to handle complex words (Section 2). We suggest that these problems are pervasive across all existing subword tokenisation algorithms due to a shared fundamental design choice of allowing tokens to include spaces, and thus experiment with an alternative treatment of spaces where they are always taken as individual tokens. We implement this approach by making simple modifications to the existing WordPiece, BPE, and Unigram algorithms (Section 3). We first evaluate our modified algorithms intrinsically (Section 4), quantitatively finding that they improve morphological correctness, in particular when handling prefixes. Qualitatively, we take examples from previous papers critiquing existing tokenisation algorithms, and show how our modified algorithms are able to alleviate the discussed issues. We then evaluate our modified algorithms extrinsically by pretraining and finetuning transformer-based models (Section 5), showing that they give improved performance on NLP tasks that require handling complex words with no detrimental effect on performance in general natural language understanding tasks.

Problems with Existing Tokenisation Algorithms
Existing tokenisation algorithms often produce unintuitive tokenisations for complex words, incorrectly splitting prefixes and producing unmeaningful subword tokens, problems that have been discussed in previous works. Church (2020) looks at the BERT (WordPiece) tokenisations of complex words, highlighting the many unnatural tokenisations that arise, with tokens often splitting up morphemes and digraphs. Nayak et al. (2020) also discuss the issues with BERT's tokeniser, specifically highlighting problems with the splitting of prefixes, and they show that poor tokenisation leads to weak semantic representations. Hofmann et al. (2021) find that BERT performs poorly on classifying complex words containing prefixes, performing much better on suffixes. They suggest that a reason is that BERT's tokeniser is seldom accurate when splitting prefixes, but is much more often correct when splitting suffixes. Schick and Schütze (2020) argue that a reason BERT struggles to understand rare words is suboptimal tokenisation of these words. Here we give a few of our own examples of BERT tokenisations that illustrate the problems. We see that the prefixed words are tokenised poorly: either the prefix is incorrectly split, as in "disjointed" and "unisex", or the prefix is correctly split but the rest of the word is tokenised differently from the standalone case, as in "untrue" and "overestimate". We note that suffixes are handled better than prefixes, which is due to spaces being prepended rather than appended to words (see Section 3).
For these latter examples, there is a second problem: even if the base were tokenised as a single token, the addition of the space symbol means that there would be no explicit link between the prefixed word and the standalone base. As an example, we cherry-pick a rare example of a morphologically correct tokenisation by BERT of a word containing a prefix, showing both strings and token IDs:

beatable → _beat, able (3786, 3085)
unbeatable → _un, beat, able (4895, 19442, 3085)

We can see that, even though these tokenisations are reasonable, the subword "beat" is assigned different IDs in the two cases due to the prepending of the special space symbol.
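The ID mismatch can be made concrete with a short sketch. The conventional vocabulary entries and IDs below are BERT's, taken from the example above; the modified vocabulary and its IDs are a hypothetical illustration of the proposed treatment.

```python
# Conventional treatment: "beat" word-initially (with the space symbol) and
# word-internally are unrelated vocabulary entries with unrelated IDs.
conventional_vocab = {"_beat": 3786, "able": 3085, "_un": 4895, "beat": 19442}
assert conventional_vocab["_beat"] != conventional_vocab["beat"]

# Proposed treatment: spaces are their own tokens, so a single entry for
# "beat" serves both positions (IDs here are made up for illustration).
modified_vocab = {"_": 0, "beat": 1, "able": 2, "un": 3}
beatable   = [modified_vocab[t] for t in ["_", "beat", "able"]]        # [0, 1, 2]
unbeatable = [modified_vocab[t] for t in ["_", "un", "beat", "able"]]  # [0, 3, 1, 2]
```

Under the modified scheme the suffix of the ID sequence for "unbeatable" is identical to that of "beatable", making the shared base explicit to the model.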
We hypothesise that both of these problems hinder the ability of existing language models (such as BERT) to deal with complex words. Regarding the first problem, we argue that the morphological correctness of a tokeniser is a metric which will correlate with the ability of language models to deal with complex words: correctly splitting affixes means morphologically related words (those sharing a common base) are given related tokenisations. The splitting of prefixes is particularly important, as prefixes always have a semantic function, unlike suffixes, which can have both syntactic and semantic functions (Giraudo and Grainger, 2003). Also, tokenisations made up of meaningful subword tokens (morphemes or groups of morphemes) will allow language models to build stronger representations with less data, since the representations of complex words can be computed from the representations of the subwords. Regarding the second problem, the fact that base forms are represented differently depending on their position within a word means a reduction in relevant training instances and hence a further weakening of representations for complex words.

Our Modified Algorithms
We suggest that the problems discussed in Section 2 arise as a result of how spaces are handled by existing algorithms: all subword tokenisation algorithms currently used by transformer-based models allow tokens to include space symbols as the first character. This means equivalent strings are treated differently depending on whether they appear at the start of a word or not. This difference occurs when training these tokenisers, which leads to suboptimal tokenisations of prefixed words. It also occurs when using these tokenisers in NLP models, leading to equivalent strings being assigned different tokens depending on whether they occur at the start of a word or not.
Thus, to attempt to alleviate these issues, and hence improve the handling of complex words by language models, we propose an alternative treatment of spaces where they are always assigned individual tokens. This simple modification can be made to any existing subword tokenisation algorithm, though for brevity we focus our attention on BPE and Unigram; the modification can also be made to the WordPiece algorithm, and we see similar (intrinsic) performance improvements from doing so. In Section 4, we perform a qualitative analysis of our modified WordPiece algorithm and also include the default WordPiece algorithm in our quantitative evaluation for comparison. Our modified algorithms and the defaults are shown in Figure 1 and Figure 2 for BPE and Unigram, respectively.
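The core of the modification is the pre-tokenisation step: instead of attaching each space to the following word, every space becomes a standalone token, so word-initial and word-internal strings look identical to the subword algorithm trained on top. A minimal sketch (function and symbol names are ours, and only plain spaces are handled):

```python
import re

SPACE = "▁"  # stands in for the special space symbol

def pretokenise(text):
    """Split text into words and standalone space tokens."""
    out = []
    for piece in re.split(r"( +)", text):  # capturing group keeps the spaces
        if not piece:
            continue
        if piece.startswith(" "):
            out.extend(SPACE for _ in piece)  # one token per space character
        else:
            out.append(piece)
    return out

print(pretokenise("This is an input sentence."))
# ['This', '▁', 'is', '▁', 'an', '▁', 'input', '▁', 'sentence.']
```

The subword algorithm then only ever merges characters within the non-space pieces, which is what guarantees that no learnt token contains a space.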
In the following sections, we compare our modified tokenisation algorithms to the defaults by evaluating them intrinsically (Section 4) and extrinsically (Section 5).

Intrinsic Evaluation: Morphological Correctness
Given our hypothesis that the morphological correctness of a tokeniser, especially when handling prefixes, correlates with the performance of language models in dealing with complex words (Section 2), we perform a controlled intrinsic evaluation of our tokenisers using this metric. We train our modified algorithms and the defaults for BPE and Unigram on 1 million sentences from English Wikipedia, with a fixed vocabulary size of 16,000, and then run evaluation on four morphological datasets: LADEC, MorphoLex, MorphyNet, and DagoBERT. The LADEC dataset (Gagné et al., 2019) consists of 7,804 noun compounds with a unique morphological parse (we exclude those with multiple parses). MorphoLex (Sánchez-Gutiérrez et al., 2018) provides derivational morphology for 68,624 entries from the English Lexicon Project (Balota et al., 2007). Here we only consider those with a concatenative parse (i.e. no overlapping tokens), resulting in 12,028 entries. MorphyNet (Batsuren et al., 2021) provides derivational and inflectional morphology for words across 15 languages, expanding the UniMorph dataset (McCarthy et al., 2020). Taking only those derivational morphology entries in English with a concatenative parse gives 193,945 entries. The DagoBERT dataset (Hofmann et al., 2020) comprises 279,443 words containing low-frequency derivatives, taken from Reddit posts. Again, we take those with a concatenative parse, giving 268,513 entries.
We evaluate a tokeniser on these datasets using the evaluation method introduced by Creutz et al. (2004), which produces metrics by comparing the boundaries of a generated tokenisation with a gold standard reference: false negatives are boundaries appearing in the reference but not in the generated tokenisation, whilst false positives are boundaries appearing in the generated tokenisation but not in the reference. Because it makes sense to store common words as single tokens in the vocabulary, even if they can be decomposed into morphemes, we report precision along with F1 as a potentially more meaningful metric, since this allows undersegmentation whilst penalising oversegmentation. We also compute the mean sequence length (number of tokens) for each tokeniser across each dataset. Results are shown in Table 1. Here, and throughout, the prime symbol (′) denotes the given algorithm modified to always treat spaces as individual tokens.
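This boundary comparison can be sketched as a short re-implementation of the metric as described (function names are our own):

```python
def boundary_scores(predicted, gold):
    """Precision/recall/F1 over internal token boundaries, following the
    boundary-comparison scheme of Creutz et al. (2004). Both arguments are
    token lists for the same word; boundaries are character offsets."""
    def boundaries(tokens):
        cuts, pos = set(), 0
        for tok in tokens[:-1]:           # no boundary after the final token
            pos += len(tok)
            cuts.add(pos)
        return cuts

    pred, ref = boundaries(predicted), boundaries(gold)
    tp = len(pred & ref)                  # boundaries in both
    fp = len(pred - ref)                  # over-segmentation
    fn = len(ref - pred)                  # under-segmentation
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall    = tp / (tp + fn) if tp + fn else 1.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Gold parse un|beat|able vs a poor prediction u|nbeat|able:
print(boundary_scores(["u", "nbeat", "able"], ["un", "beat", "able"]))
# (0.5, 0.5, 0.5)
```

Note how an unsegmented word (a single token) incurs only false negatives, so it hurts recall and F1 but not precision, which is why precision tolerates undersegmentation whilst penalising oversegmentation.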
The general trend is that Unigram outperforms BPE (consistent with findings by Bostrom and Durrett 2020 and Hofmann et al. 2022), with the modified algorithms performing better than their default counterparts: the average F1 scores across the four datasets are 43.0, 50.9, 59.7, and 62.4 for BPE, BPE′, Unigram, and Unigram′, respectively. On the MorphoLex dataset, however, the default Unigram algorithm performs the best. This is also the only dataset where default Unigram gives a shorter mean sequence length than Unigram′. To investigate this further, we evaluate on the subsets of the data containing only prefixed and only suffixed entries, shown in Table 2. We can see that Unigram′ performs best on prefixed entries, but worse than default Unigram on suffixed entries. Since the dataset consists of many more entries containing suffixes than prefixes (7,422 vs 2,692), this could explain the performance difference. Because the correct tokenisation of prefixed words is particularly important (Section 2), we believe that this performance trade-off is beneficial. In Section 5, we confirm this through evaluation on downstream tasks.
Interestingly, BPE′ gives the shortest sequence length on three of the four datasets, but not the most morphologically correct tokenisations. Since BPE was developed as a compression algorithm, the short sequence lengths are perhaps expected, but here we see only a weak correlation between sequence length and morphological correctness.
For a qualitative analysis, we take examples from papers that highlight problems with existing tokenisers (Section 2) and generate the output from the default and modified algorithms for BPE and Unigram, shown in Table 3. These examples illustrate how our modified algorithms are able to generate improved tokenisations for complex words. For example, whereas the default Unigram algorithm tokenises "unicycle" into "_un", "i", "cycle", which is misleading as the string "un" does not have its typical semantic role, our modified Unigram algorithm tokenises it more meaningfully into "uni", "cycle". Also, the modified algorithms explicitly create links between words containing prefixes and their bases. For the words "accessible" and "unaccessible", the modified algorithms tokenise the subword "accessible" identically in both cases. The default Unigram and BPE algorithms do correctly split the prefix "un", but the rest of the word is tokenised differently, which is problematic; even if the tokenisation were equivalent, the inclusion of the space symbol would mean there is no link between these forms (Section 2). We note that our modified algorithms are not immune to oversegmentation, with Unigram′ tokenising "responsiveness" into seven tokens, although this is arguably inevitable with a limited vocabulary size.
In Table 4, we show the same qualitative analysis between the default and modified WordPiece algorithms, finding parallels with default and modified BPE.
We investigate the vocabularies of the default and modified algorithms, shown in Table 5. We remove the tokens "[CLS]", "[SEP]", and "[UNK]" from the vocabularies. For the default algorithms, we also remove tokens that are duplicates apart from prepended space symbols, and we find that there is significant vocabulary degeneracy (8.7% and 9.1% for BPE and Unigram, respectively). We also find that a large percentage of the vocabulary is transferred over from the default to the modified algorithm (90.0% and 90.1% for BPE and Unigram, respectively). Additionally, we see that all of the algorithms have a similar number of prefixes in their vocabularies, which suggests the tokenisation algorithm plays an important role, as performance differences on handling prefixes are large (Table 2) despite similar vocabularies. This is supported by work by Hofmann et al. (2021), who find that employing a fixed vocabulary in a morphologically correct way leads to performance improvements. We also see, however, that Unigram′ has fewer suffixes in its vocabulary than default Unigram, which reflects the performance difference seen in Table 2.
We note that an interesting result of our modifications is an improvement at word segmentation. As an example, the outputs of the default and modified Unigram algorithms when passed the concatenated sentence "thisisasentencethatneedstobesegmented" are:

Unigram: _this, isa, s, ent, ence, that, ne, ed, s, to, be, s, eg, ment, ed
Unigram′: this, is, a, sentence, that, needs, to, be, segment, e, d
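This segmentation behaviour comes from the Viterbi decoding step used at Unigram tokenisation time, which finds the maximum-probability split of the input under the learnt unigram language model. A minimal sketch with a toy log-probability table (vocabulary, values, and function names are ours):

```python
import math

def viterbi_segment(text, logprob):
    """Return the highest-scoring segmentation of `text` under a unigram
    log-probability table `logprob` (single characters act as a fallback)."""
    n = len(text)
    best = [0.0] + [-math.inf] * n        # best score for each prefix of text
    back = [0] * (n + 1)                  # split point achieving that score
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logprob and best[start] + logprob[piece] > best[end]:
                best[end] = best[start] + logprob[piece]
                back[end] = start
    tokens, end = [], n
    while end > 0:                        # recover the tokens by backtracking
        tokens.append(text[back[end]:end])
        end = back[end]
    return tokens[::-1]

lp = {"this": -2.0, "is": -2.0, "a": -3.0, "test": -2.0,
      "t": -6.0, "h": -6.0, "i": -6.0, "s": -6.0, "e": -6.0}
print(viterbi_segment("thisisatest", lp))   # ['this', 'is', 'a', 'test']
```

Because whole words carry far higher probability than character sequences, the decoder recovers word boundaries even when the input contains no spaces, which is exactly the effect shown above.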

Extrinsic Evaluation: Pretrain-Finetune
Given the improved intrinsic performance of our algorithms, we wish to evaluate how this impacts the extrinsic performance of NLP models, both in general and in particular on tasks involving complex words. As in Section 4, we train the default and modified BPE and Unigram algorithms on 1 million sentences from English Wikipedia, with a fixed vocabulary size of 16,000, but we also implement a variant of our modified algorithm that removes spaces as a post-processing step. The reasoning behind this is that it reduces the sequence length significantly with minimal information loss, and more closely mirrors existing models, which have no explicit space information. Example tokenisations for the Unigram algorithms given the input "This is an input sentence." are:

Unigram: _This, _is, _an, _input, _sentence, .
Unigram′: This, _, is, _, an, _, input, _, sentence, .
Unigram′ no spaces: This, is, an, input, sentence, .
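The "no spaces" variant is a simple filter applied after tokenisation: standalone space tokens are dropped, shortening the sequence while word boundaries remain implicit. A minimal sketch (token strings illustrative):

```python
SPACE = "▁"  # the standalone space token produced by the modified tokeniser

def drop_spaces(tokens):
    """Post-processing for the 'no spaces' variant: remove space tokens."""
    return [t for t in tokens if t != SPACE]

tokens = ["This", "▁", "is", "▁", "an", "▁", "input", "▁", "sentence", "."]
print(drop_spaces(tokens))
# ['This', 'is', 'an', 'input', 'sentence', '.']
```

Note that this makes the tokenisation lossy: the original spacing cannot be reconstructed from the output, a limitation returned to in the conclusion.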
For each of the tokenisers, we pretrain RoBERTa (base) on the full text of English Wikipedia, and then finetune on downstream tasks, keeping all hyperparameters fixed and changing only the tokenisation algorithm used. For evaluation of the models in a general domain, we use the GLUE benchmark (Wang et al., 2018), excluding WNLI. For evaluation specifically on handling complex words, we use the two Superbizarre topicality tasks (Hofmann et al., 2021), which require the binary classification of derivationally complex English words.
Over the whole of the English Wikipedia data, the total sequence length differs considerably between the tokenisation approaches; for example, Unigram′ without spaces produces 3.67e+09 tokens in total. (We do not consider the Superbizarre sentiment task due to a higher proportion of uninformative words.)
As in the evaluation in Table 1, the modified models without spaces give shorter sequences than their default counterparts, with BPE′ without spaces giving the shortest mean sequence length. The difference in sequence lengths between the models means a difference in the number of updates per epoch during pretraining. Hence, fixing the number of updates (and thus training time) will advantage models with shorter sequence lengths, especially disadvantaging the models that include spaces. Because of this, we perform two evaluations: one fixing the number of pretraining updates, and one fixing the number of pretraining epochs.
Due to computational constraints, we only ran pretraining once for each model. For finetuning, we ran each experiment with 10 different seeds, reporting the mean development result and standard deviation. Results are shown in Table 6 and Table 7 for fixed updates and fixed epochs, respectively. The full training procedure is given in Appendix A.
On the Superbizarre datasets, we can see that Unigram outperforms BPE, with Unigram′ no spaces performing significantly better than all other models under a Welch's t-test (p < 0.05); see Appendix C. Note that DelBERT (Hofmann et al., 2021), a model which is passed the input segmented by a morphological algorithm, achieves 73.1 on the Arxiv dev set and 72.3 on the Arxiv test set, both worse than our (unsupervised) model, although DelBERT outperforms our best models on the Reddit task, achieving 69.6 and 70.1 on the dev and test sets, respectively.
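For reference, the Welch's t statistic used for these comparisons is straightforward to compute; a pure-Python sketch follows (`scipy.stats.ttest_ind` with `equal_var=False` computes the same statistic).

```python
import math

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two samples with
    possibly unequal variances (no pooled-variance assumption)."""
    def mean(x):
        return sum(x) / len(x)
    def var(x):  # unbiased sample variance
        m = mean(x)
        return sum((v - m) ** 2 for v in x) / (len(x) - 1)
    va, vb = var(a) / len(a), var(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

t, df = welch_t([1, 2, 3], [2, 3, 4])
print(round(t, 4), round(df, 2))   # -1.2247 4.0
```

Welch's test is appropriate here because the per-seed score distributions of different models need not share a variance, unlike the assumption behind Student's pooled t-test.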
On the mean GLUE benchmark, the modified models without spaces perform as well as or better than their default counterparts, with Unigram′ performing the best when both updates and epochs are fixed. However, this result is not statistically significant (see Appendix C), and over the individual GLUE tasks the best-performing models vary, with high variances across seeds on some tasks due to the small dataset sizes (see Appendix B). Since the GLUE tasks do not rely on handling complex words, a significant performance difference is probably not expected, but we see no drop in performance with the modified algorithms.
The modified models that include spaces perform poorly on the GLUE benchmark, even when the number of epochs is fixed rather than updates, meaning they are trained for ∼65% more updates than the modified models without spaces. This suggests that this method of including spaces as additional tokens is suboptimal for general language tasks, though interestingly Unigram′ with spaces is the second-best performing model across all Superbizarre datasets. The tokenisers themselves perform splitting on spaces as a first step, so additionally including spaces may simply be passing noise to the model for the masked language modelling task, especially due to the high frequency of spaces. This means the pretraining loss decreases rapidly due to space prediction, but plateaus earlier (see Appendix A). Due to the much greater sequence lengths, the models that include spaces also discard examples that are too long during finetuning, which could lead to worse results.

Related Work
There are previous works that have performed controlled extrinsic comparisons of existing subword tokenisation algorithms (BPE, Unigram, and WordPiece), and have provided results which we relate here to our own findings. Gallé (2019) investigates various compression algorithms for tokenisation, including BPE, and finds an inverse link between mean tokens per sentence and translation quality, hypothesising that the compression capability of BPE leads to its effectiveness in NLP tasks. In our experiments we find that Unigram′ outperforms BPE′ on the complex words tasks, and that there is no significant difference between them on the general language understanding (GLUE) tasks. This is despite Unigram′ having a longer sequence length, suggesting this factor is not wholly indicative of model performance. However, if we look at the results for fixed pretraining updates, we do see a slight negative correlation between sequence length and performance on the Superbizarre datasets, and a very strong negative correlation on the GLUE benchmark (Pearson correlations between -0.157 and -0.224 for the Superbizarre datasets, and -0.985 for GLUE), though this is skewed by the models including spaces performing very poorly. Intrinsically, we see a correlation (albeit weak) between sequence length and morphological correctness (Section 4). Bostrom and Durrett (2020) compare Unigram and BPE, finding that Unigram generates more morphologically correct tokenisations and gives improved downstream task performance. Whilst we saw similar improvements in intrinsic performance, we were unable to replicate the performance difference on MNLI that they found, finding no significant difference in performance (see Appendix B). We did not perform evaluation on the other two English datasets they used. Hofmann et al. (2022) corroborate these intrinsic results, additionally finding the morphological quality of WordPiece to lie in between that of BPE and Unigram, reflecting our own findings (Section 4). Wei et al. (2021) perform a comparison between byte-level BPE and byte-level Unigram, finding BPE to perform better than Unigram across seven languages on the XNLI dataset, which is contrary to our findings and those of Bostrom and Durrett (2020) and Hofmann et al. (2022).
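The sequence-length/performance correlations quoted above are ordinary Pearson correlation coefficients; a minimal pure-Python sketch (function name ours):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly linearly related samples give a coefficient of 1.
print(pearson([1, 2, 3], [2, 4, 6]))
```

As noted in the text, a coefficient like -0.985 can be dominated by a cluster of outliers (here, the models including spaces), so it should be read alongside the per-model results rather than on its own.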
There have also been some recent attempts to develop improved subword tokenisation methods. Hofmann et al. (2021) introduce DelBERT, which takes input words tokenised according to gold standard morphological references, with an unchanged vocabulary. They find that this improves performance on their Superbizarre datasets (Section 5). Hofmann et al. (2022) also introduce FLOTA (Few Longest Token Approximation), which improves the performance of BERT, GPT-2, and XLNET at classifying ArXiv papers into their subareas from the title. Yehezkel and Pinter (2022) introduce a context-aware tokeniser, SaGe, which they find improves performance over BPE on GLUE tasks, the Turkish subset of XNLI, and NER in both Turkish and English. There are also alternative subword tokenisation algorithms which have a history of use in machine translation tasks, including Morfessor (Creutz and Lagus, 2002) and its successors (Virpioja et al. 2013, Grönroos et al. 2020), and Dynamic Programming Encoding (DPE) (He et al., 2020b). (See Mielke et al. 2021 for a more extensive review.) For all of these approaches, spaces still occur as the first character of start-of-word tokens, and we believe this hinders performance: our alternative treatment of spaces could be combined with these algorithms, and the impact on performance investigated.
Finally, we note that Wei et al. (2021) experiment with different methods of handling spaces within their byte-level BPE algorithm which appear similar to those implemented here, although they find these alternatives perform worse than the default on XNLI. They do not release code for their experiments, so unfortunately we are unable to make a controlled comparison.

Conclusion and Future Work
We hypothesise that problems with current tokenisation algorithms arise from allowing tokens to include spaces, and thus experiment with an alternative tokenisation approach where spaces are always treated as individual tokens. We find that this leads to improved performance on NLP tasks involving complex words, whilst having no detrimental effect on performance in general natural language understanding tasks. Whilst our work focuses on BPE and Unigram, our modifications can be applied to any existing subword tokenisation algorithm, including WordPiece, and hence to any transformer-based model. Also, although our experiments have only been in English, the algorithms used are unsupervised and language-independent, and our results should extend to other languages.
Our best-performing models use lossy tokenisation (removing the space tokens as a post-processing step), which may not be ideal for all tasks. We did not perform evaluation on sequence-to-sequence tasks, and indeed the subword tokenisation algorithms discussed here were introduced in the field of NMT, where space information needs to be generated in the output. Future work could thus look at alternative methods for including space information that maintain the performance gains seen here whilst keeping tokenisation lossless.
(a) Default BPE

Training
input: training data T, vocabulary size s
output: vocabulary V
1 Replace whitespace in T with the space symbol
2 Prepend the space symbol to the first word of every sentence in T
3 Initialise vocabulary V as all characters
4 while |V| < s do
5     Find the most frequently occurring bigram in T, only allowing spaces as the first character
6     Apply the merge operation on the bigram to make a new token
7     Add the merge operation to V
8 end

Tokenisation
input: text T, vocabulary V
output: tokens τ
1 Replace whitespace in T with the space symbol
2 Prepend the space symbol to the first word of every sentence in T
3 Apply the merge operations from V in order to T

(b) Modified BPE (BPE′)

Training
input: training data T, vocabulary size s
output: vocabulary V
1 Replace whitespace in T with the space symbol
2 Initialise vocabulary V as all characters
3 while |V| < s do
4     Find the most frequently occurring bigram in T that does not include spaces
5     Apply the merge operation on the bigram to make a new token
6     Add the merge operation to V
7 end

Tokenisation
input: text T, vocabulary V
output: tokens τ
1 Replace whitespace in T with the space symbol
2 Apply the merge operations from V in order to T

Figure 1: Default and modified BPE algorithms. Red, bold text is removed from the default algorithm, whilst green, italic text is added.
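The default and modified BPE training loops in Figure 1 differ only in which bigrams are eligible for merging. A runnable sketch of that loop, parameterised by the merge restriction (toy corpus and our own function names, not the paper's implementation):

```python
from collections import Counter

SPACE = "▁"  # the special space symbol

def train_bpe(words, num_merges, allow_space_prefix):
    """Learn BPE merge operations from pre-tokenised words.
    allow_space_prefix=True  -> default BPE (a token may start with the space symbol)
    allow_space_prefix=False -> BPE' (no token may contain the space symbol)."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for toks in corpus:
            for a, b in zip(toks, toks[1:]):
                if SPACE in b:
                    continue  # a space symbol never continues a token
                if SPACE in a and not (allow_space_prefix and a.startswith(SPACE)):
                    continue  # BPE' forbids spaces inside tokens entirely
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]  # most frequent eligible bigram
        merges.append((a, b))
        for toks in corpus:                  # apply the merge everywhere
            i = 0
            while i < len(toks) - 1:
                if toks[i] == a and toks[i + 1] == b:
                    toks[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges

words = ["▁low", "▁low", "▁lot", "low", "low"]
print(train_bpe(words, 3, allow_space_prefix=True))   # final merge: ('▁', 'low')
print(train_bpe(words, 3, allow_space_prefix=False))  # final merge: ('lo', 't')
```

On this toy corpus, default BPE spends its third merge building the space-prefixed token "▁low", duplicating the word-internal "low", whereas BPE′ spends it on new word-internal material; this is the vocabulary-degeneracy effect discussed in Section 4.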

Figure 2: Default and modified Unigram algorithms. Red, bold text is removed from the default algorithm, whilst green, italic text is added.

Figure 3: Pretraining loss curves for the six models.

Table 1: Performance of the tokenisation algorithms across four morphological datasets, showing the average sequence length, precision, and F1 score, generated following the standard introduced by Creutz et al. (2004). Best results are shown in bold.

Table 2: Performance of the tokenisation algorithms on subsets of the MorphoLex dataset with entries containing only prefixes and only suffixes. Best results are shown in bold.

Table 5: Vocabularies of the models, showing size, number of unique elements, and numbers of prefixes and suffixes.

Table 6: Results on the Superbizarre datasets (Hofmann et al., 2021) after pretraining for 100,000 updates. Shown are mean results across 10 seeds. Results that are significantly better than all others using a Welch's t-test (p < 0.05) are shown in bold. More detailed results are given in Appendix B. We include DelBERT (Hofmann et al., 2021) as a supervised baseline, where the models are passed a morphological parse of the input.

Table 7: Results on the Superbizarre datasets (Hofmann et al., 2021) after pretraining for 30 epochs. Shown are mean results across 10 seeds. Results that are significantly better than all others using a Welch's t-test (p < 0.05) are shown in bold. More detailed results are given in Appendix B. We include DelBERT (Hofmann et al., 2021) as a supervised baseline, where the models are passed a morphological parse of the input.

Table 11: Full finetuning results after pretraining for 100,000 updates. Shown are mean dev set results across 10 seeds, with standard deviations in parentheses.

Table 12: Full finetuning results after pretraining for 30 epochs. Shown are mean dev set results across 10 seeds, with standard deviations in parentheses.