You should evaluate your language model on marginal likelihood over tokenisations

Neural language models typically tokenise input text into sub-word units to achieve an open vocabulary. The standard approach is to use a single canonical tokenisation at both train and test time. We suggest that this approach is unsatisfactory and may bottleneck our evaluation of language model performance. Using only the one-best tokenisation ignores tokeniser uncertainty over alternative tokenisations, which may hurt model out-of-domain performance. In this paper, we argue that instead, language models should be evaluated on their marginal likelihood over tokenisations. We compare different estimators for the marginal likelihood based on sampling, and show that it is feasible to estimate the marginal likelihood with a manageable number of samples. We then evaluate a pretrained language model on both the one-best-tokenisation and marginal perplexities, and show that the marginal perplexity can be significantly better than the one-best perplexity, especially on out-of-domain data. We link this difference in perplexity to the tokeniser uncertainty as measured by tokeniser entropy. We discuss some implications of our results for language model training and evaluation, particularly with regard to tokenisation robustness.


Introduction
Neural end-to-end language models have largely done away with traditional pipeline approaches towards building NLP systems. However, one component which stubbornly remains is the tokenisation step, used right at the start of preprocessing. At the time of writing, the most widely used tokenisers, such as BPE (Sennrich et al., 2016) and unigram (Kudo, 2018), break up the input text into subword units, potentially backing off to character-level segmentation if necessary. This allows for coverage of every possible input sequence; on the downside, a single input sequence may now have multiple possible tokenisations.
Typically, language models are trained and evaluated using a single canonical tokenisation out of the multitude of possible ones, but this tokenisation may be suboptimal (Bostrom and Durrett, 2020) for many reasons. For example, different tokenisations (that is, different surface segmentations) can reveal different morphological analyses of the word in question (think un-ion-izeable vs. union-izable), and committing to a particular analysis can discard useful information, particularly if the best analysis from the tokeniser is erroneous (Dyer, 2010).
Further, tokenisers themselves are trained using an objective which optimises the likelihood of the data. This can be explicit (the unigram tokeniser of Kudo (2018) optimises a unigram language modelling objective) or implicit (BPE aims to minimise the description length of the training data, which has close connections to probabilistic methods; MacKay 2003). In this sense they are also language models, albeit far less powerful than the neural language models we train on their outputs. This raises a difficult question: to what extent are our large language models bottlenecked by the tokenisers that we use to train them?
We argue that rather than evaluating language models using the one-best tokenisation from the tokeniser, one should evaluate language models using the marginal likelihood over all possible tokenisations of an input. This divorces language model performance from the performance of the tokenisation model, and we believe this gives a better indicator of the intrinsic quality of the language model.
In this paper, we take a language model pretrained using a single tokenisation, and estimate the marginal likelihood of the model on test data, taking multiple tokenisations of each input into account. While summing exactly over exponentially many tokenisations is intractable, we can estimate the marginal likelihood using importance sampling. One contribution of this paper is to showcase low-variance estimators of the marginal likelihood based on sampling without replacement. We cast the tokeniser as the proposal distribution for our importance sampling estimator, which clearly delimits the role of the tokeniser. Indeed, as the number of samples we consider increases, the language model becomes less and less coupled to the tokeniser, and our evaluation becomes more intrinsic to the language model itself, rather than the language model + tokeniser combination.
We demonstrate that there can be a significant difference, which we call the marginal gap, between the marginal likelihood and the one-best tokenisation likelihood, especially on out-of-domain evaluation sets. This suggests that the tokeniser is failing to generalise well to out-of-domain data, and is therefore a significant bottleneck to the generalisation capability of the language model. Thus, taking the one-best tokenisation likelihood is a poor proxy for the true language model performance.
We next show that there is a correlation between the uncertainty of the tokeniser (as measured by the entropy of the segmentation lattice) and the marginal gap. We give an efficient dynamic program to calculate the entropy of the segmentation lattice, and show that this entropy is predictive of how poorly the tokeniser generalises. This suggests that measuring tokeniser entropy can be a useful signal for adding additional samples to our estimate of the marginal likelihood. We also use our sampled tokenisations to demonstrate that language models are particularly sensitive to variations in tokenisation, a challenge that must be mitigated for marginal likelihood evaluation.
Finally, we investigate how many samples are necessary to obtain an accurate estimate of the marginal likelihood. We show that many samples are necessary, but only relatively few samples contribute significantly to this estimate. This shows that the tokeniser distribution over tokenisations differs significantly from the language model posterior distribution over tokenisations; indeed, taking only the best tokenisation from the samples can recover most of the performance increase obtained by marginalisation. This gives weight to our finding that tokenisers generalise poorly, and that the one-best tokenisation can often be suboptimal.
We conclude by discussing some implications of our results, particularly for languages with richer morphology than English. Finally, we sketch potential future directions to bridge this gap by using sampled tokenisations at training time, and how this might improve language model robustness.

Taking multiple tokenisations into consideration
We denote by $D$ (for document) a string of text whose score we would like to calculate. Given a vocabulary $V$ of sub-word tokens (which is usually induced by the tokeniser), we denote by $T_i$ potential tokenisations of $D$, i.e. sequences of tokens $t_1 t_2 \ldots t_{n_i}$ such that each $t_j \in V$ and the sequence detokenises to $D$. An autoregressive neural language model (with parameters $\theta$) is a model which decomposes the probability of the full sequence into a series of left-to-right predictions: $P_\theta(T, D) = \prod_{j=1}^{n} P_\theta(t_j \mid t_{<j})$. Crucially, neural language models $P_\theta$ do not score $D$ directly, but rather token sequences $P_\theta(T, D)$. For any input document $D$, a tokeniser will define a canonical tokenisation $T^*$, and one usually approximates $P_\theta(D)$ with $P_\theta(T^*, D)$.
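To make the set of tokenisations $T_i$ of a string concrete, the following minimal sketch enumerates them by brute force. The toy vocabulary and the helper names (`all_substrings`, `tokenisations`) are assumptions made for illustration and are not part of the paper's setup; exhaustive enumeration is only feasible for very short strings.

```python
from typing import Iterator, List, Set


def all_substrings(text: str) -> Set[str]:
    """Toy vocabulary: every contiguous substring of `text` (illustration only)."""
    return {text[i:j] for i in range(len(text)) for j in range(i + 1, len(text) + 1)}


def tokenisations(text: str, vocab: Set[str]) -> Iterator[List[str]]:
    """Yield every sequence of vocabulary tokens that concatenates to `text`."""
    if not text:
        yield []  # exactly one way to tokenise the empty string
        return
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            for rest in tokenisations(text[end:], vocab):
                yield [piece] + rest


if __name__ == "__main__":
    # Hypothetical vocabulary, chosen only to show competing segmentations.
    vocab = {"un", "union", "ion", "iz", "ize", "able", "izable", "a", "b", "l", "e"}
    for toks in tokenisations("unionizable", vocab):
        print(" ".join(toks))
```

When every substring is in the vocabulary, a string of length n has 2^(n-1) tokenisations (one binary choice per internal boundary), which is why exact marginalisation does not scale.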
We believe, on the other hand, that it is more principled to marginalise over all possible tokenisations; that is, calculate $\sum_T P_\theta(T, D)$ directly. There could be significant tokeniser uncertainty over the correct tokenisation; we can view the uncertainty as either caused by ambiguity in local context imposed by the strong independence assumptions made by tokenisers, or because of inherent tokeniser uncertainty when confronted with out-of-domain input. In either case, incorporating additional analyses in the form of extra tokenisations can give the language model extra information compared to the one-best tokenisation. We believe that the marginal likelihood better represents the true capability of the language model, without the constraint of the tokeniser.
However, exactly calculating the marginal likelihood is infeasible, as the number of possible tokenisations is exponential in the length of the input text. Whenever calculating a marginal exactly is infeasible, the classical approach is to approximate it using samples. The best distribution to sample from would be the model posterior distribution over tokenisations given text, as this gives the lowest variance estimator; unfortunately, we are unaware of any methods that would let us sample directly from this distribution. Therefore, to estimate the marginal language model likelihood, we turn to importance sampling. Given some proposal distribution $Q(T \mid D)$ of possible tokenisations, we can use the importance sampling estimator
$$P_\theta(D) = \sum_T P_\theta(T, D) = \mathbb{E}_{T \sim Q(\cdot \mid D)}\left[\frac{P_\theta(T, D)}{Q(T \mid D)}\right]. \qquad (1)$$
Now, it remains to find a suitable proposal distribution $Q(T \mid D)$. In this paper, we use the unigram tokeniser of Kudo (2018), as this is the only probabilistic tokeniser that we are aware of. This tokeniser first constructs a lattice of all possible tokenisations given an input and a lexicon of word pieces. Distinct tokenisations of the input correspond to paths through this lattice, and the score of a tokenisation is the sum of the scores of the tokens along the path. As the score decomposes along lattice segments, many interesting quantities, such as $Q(D)$ (the marginal likelihood of an input text under the tokeniser), are exactly calculable. This allows not only for sampling from the lattice of possible tokenisations, but also for calculating the score of a given tokenisation (i.e. $Q(T \mid D) = Q(T, D)/Q(D)$), which is necessary to compute the importance weight.
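The lattice computations described above can be sketched as follows: a forward pass gives $\log Q(D)$, a backward walk draws $T \sim Q(T \mid D)$ together with $\log Q(T \mid D)$, and the importance weights are averaged in probability space. This is an illustrative sketch rather than the authors' implementation: the `scores` dictionary of token log-scores and the `log_p` callable standing in for the language model are hypothetical, and the input is assumed to be tokenisable.

```python
import math
import random
from typing import Callable, Dict, List, Tuple


def forward(text: str, scores: Dict[str, float]) -> List[float]:
    """alpha[i] = log of the total score of all tokenisations of text[:i]."""
    alpha = [-math.inf] * (len(text) + 1)
    alpha[0] = 0.0
    for i in range(1, len(text) + 1):
        terms = [alpha[j] + scores[text[j:i]] for j in range(i)
                 if text[j:i] in scores and alpha[j] > -math.inf]
        if terms:
            m = max(terms)
            alpha[i] = m + math.log(sum(math.exp(t - m) for t in terms))
    return alpha


def sample(text: str, scores: Dict[str, float], alpha: List[float],
           rng: random.Random) -> Tuple[List[str], float]:
    """Draw T ~ Q(T|D) by walking the lattice right to left; return (T, log Q(T|D))."""
    tokens, log_q, i = [], 0.0, len(text)
    while i > 0:
        starts = [j for j in range(i) if text[j:i] in scores and alpha[j] > -math.inf]
        # Probability of each incoming edge given that a path ends at position i.
        log_probs = [alpha[j] + scores[text[j:i]] - alpha[i] for j in starts]
        idx = rng.choices(range(len(starts)), weights=[math.exp(lp) for lp in log_probs])[0]
        log_q += log_probs[idx]
        tokens.append(text[starts[idx]:i])
        i = starts[idx]
    return tokens[::-1], log_q


def log_marginal_estimate(text: str, scores: Dict[str, float],
                          log_p: Callable[[List[str]], float],
                          n: int, rng: random.Random) -> float:
    """log of (1/n) * sum_i P(T_i, D) / Q(T_i | D), computed stably in log space."""
    alpha = forward(text, scores)
    log_w = []
    for _ in range(n):
        tokens, log_q = sample(text, scores, alpha, rng)
        log_w.append(log_p(tokens) - log_q)
    m = max(log_w)
    return m + math.log(sum(math.exp(w - m) for w in log_w) / n)
```

A useful sanity check is that when the stand-in language model scores a tokenisation exactly as the tokeniser does, every importance weight equals $Q(D)$ and the estimate has zero variance.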
Tokenising consistently. There is prior evidence (Lazaridou et al., 2021) to suggest that Transformer language models are able to effectively leverage memory, and that perplexities of repeated words in a document can be much lower than the perplexity of the first occurrence of that word. We show in Section 4.3 that this copying ability is tied to the exact tokenisation of that word: if a word reoccurs in a document with a different tokenisation, its perplexity is much higher than if it reappears with the same tokenisation.
Armed with this insight, we design an alternative proposal distribution which samples a single tokenisation for each unique whitespace-delimited type in a document, and then shares that tokenisation for each token of that type in the document. We note that it is possible to adapt a pre-trained unigram tokeniser to do this, by passing in only the unique whitespace types in a document to the tokeniser and reconstructing the document from the sampled tokenisations. This is possible because the unigram tokeniser does not consider context when tokenising, and whitespace tokens are tokenised independently. We note that this two-stage word generation process, where first we generate the vocabulary for a document, and then generate the document from that vocabulary, has close connections to the two-stage language models proposed in Goldwater et al. (2011). The problem of tokenising consistently only arises when sampling from the tokeniser; the one-best tokenisation of an input from the unigram tokeniser will always tokenise each occurrence of a type identically.
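The consistent proposal can be sketched as a thin wrapper around any per-word sampler. In the sketch below, `sample_word` is a hypothetical stand-in for a call into a unigram tokeniser's sampling routine, and `random_split` is a toy sampler used only so that the example runs; real tokenisers also attach whitespace markers to tokens, which is ignored here.

```python
import random
from typing import Callable, Dict, List


def tokenise_consistently(document: str,
                          sample_word: Callable[[str], List[str]]) -> List[List[str]]:
    """Sample one tokenisation per unique whitespace-delimited type and reuse it."""
    chosen: Dict[str, List[str]] = {}
    output = []
    for word in document.split():
        if word not in chosen:
            chosen[word] = sample_word(word)  # sampled once per type
        output.append(chosen[word])           # shared by every later occurrence
    return output


def random_split(word: str, rng: random.Random) -> List[str]:
    """Hypothetical stand-in sampler: cut at each internal boundary with probability 0.5."""
    pieces, start = [], 0
    for i in range(1, len(word)):
        if rng.random() < 0.5:
            pieces.append(word[start:i])
            start = i
    pieces.append(word[start:])
    return pieces


if __name__ == "__main__":
    rng = random.Random(0)
    doc = "tokenisation matters because tokenisation varies"
    print(tokenise_consistently(doc, lambda w: random_split(w, rng)))
```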

Lowering the variance of the estimator
A naive approach to estimating the marginal likelihood using Equation 1 would be to sample $n$ tokenisations $T_1, \ldots, T_n$ at random from $Q(T \mid D)$, score the resulting tokenisations using the language model $P_\theta(T_i, D)$, and average the resulting importance-weighted scores in log space. However, due to Jensen's inequality, this is only a lower bound of the true marginal likelihood. We can obtain a tighter bound with the same number of samples by taking the average in probability space rather than log space (as in Burda et al. (2016)):
$$\log P_\theta(D) \approx \log \frac{1}{n} \sum_{i=1}^{n} \frac{P_\theta(T_i, D)}{Q(T_i \mid D)}.$$
Changing the sampling procedure. Taking $n$ independent samples from $Q$ can result in high-variance estimates if the entropy of $Q$ is low and it assigns low probability to tokenisations with high posterior probability under the language model $P_\theta$. In this case, one would expect to see multiple repeated samples, which do not sufficiently explore the sample space. One option to lower the variance of the estimate is to instead sample without replacement (WOR). By enforcing that all samples are distinct, we can explore the sample space better. However, sampling without replacement without exactly enumerating all possible sample outcomes is tricky. Kool et al. (2019) show how to sample without replacement for sequence models using stochastic beam search (SBS). Unfortunately, the segmentation lattice used in the unigram tokeniser is not locally normalised, and we cannot naively use SBS. We therefore adapt the SBS algorithm by first running the forward algorithm on the segmentation lattice to calculate the normalising constant at each point of the lattice; we can then combine Viterbi backwards n-best search with the constrained Gumbel-max trick used in SBS to exactly sample $n$ tokenisations WOR.
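The idea behind sampling without replacement with Gumbel noise can be illustrated on an explicitly enumerated candidate set. Note that this sketch is the plain Gumbel-top-k trick, not the lattice-based adaptation of stochastic beam search described above; it assumes the candidate tokenisations and their normalised log-probabilities `log_q` are small enough to list, and all names are illustrative.

```python
import math
import random
from typing import Dict, List, Tuple


def gumbel(rng: random.Random) -> float:
    """Standard Gumbel noise, -log(-log(U)) with U uniform on (0, 1)."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_top_k(log_q: Dict[str, float], k: int,
                 rng: random.Random) -> Tuple[List[str], float]:
    """Return k distinct items sampled WOR from Q, plus the (k+1)-th perturbed score.

    Perturbing each log-probability with independent Gumbel noise and keeping
    the k largest values yields a sample without replacement from Q.
    kappa is -inf when fewer than k + 1 candidates exist.
    """
    perturbed = sorted(((lq + gumbel(rng), t) for t, lq in log_q.items()), reverse=True)
    sample = [t for _, t in perturbed[:k]]
    kappa = perturbed[k][0] if len(perturbed) > k else -math.inf
    return sample, kappa


if __name__ == "__main__":
    # Hypothetical tokeniser distribution over three tokenisations of one string.
    log_q = {"a b c": math.log(0.6), "ab c": math.log(0.3), "abc": math.log(0.1)}
    print(gumbel_top_k(log_q, 2, random.Random(0)))
```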
If we sample without replacement, the inclusion probability of a tokenisation $T_i$ is no longer equal to $Q(T_i \mid D)$. Kool et al. (2019) show that, for the expectation of a function $f$ under a distribution $Q$, an unbiased estimator using a set of $k$ samples without replacement is given by
$$\mathbb{E}_{T \sim Q}[f(T)] \approx \sum_{i=1}^{k} \frac{Q(T_i)}{q_\kappa(T_i)} f(T_i),$$
where $\kappa$ is the perturbed score of the $(k+1)$th item during search and $q_\kappa(T) = 1 - \exp(-\exp(\log Q(T) - \kappa))$ is the probability that a Gumbel variable with location $\log Q(T)$ takes a value greater than $\kappa$. In our case, $f(T) = P_\theta(T, D)/Q(T)$, and if we calculate this sum before taking the logarithm to obtain a tighter bound, then the $Q(T)$ terms cancel and we obtain the following estimator for the marginal likelihood of a document:
$$\log P_\theta(D) \approx \log \sum_{i=1}^{k} \frac{P_\theta(T_i, D)}{q_\kappa(T_i)}.$$
Including the best tokenisation. To lower the variance of the estimate further (at the cost of introducing some bias), we can always include the best tokenisation from the tokeniser in our set of samples (Botev et al., 2017). This method decomposes the marginal as $\sum_T P_\theta(T, D) = P_\theta(T^*, D) + \sum_{T \neq T^*} P_\theta(T, D)$. We can then estimate the sum over the remaining tokenisations using exactly the same methods as before, using the new distribution $Q^*$ which places 0 mass on $T^*$ and renormalises the resulting probabilities for other tokenisations. It remains to simulate samples from $Q^*$ using samples from $Q$. We note that for sampling with replacement, a simple technique to sample from $Q^*$ is rejection sampling, where we discard any sample from $Q$ that equals $T^*$. However, if $Q(T)$ is particularly peaked around $T^*$, then this procedure may require many rejection steps. Therefore, we do not investigate this estimator further.
When sampling without replacement, we have to be a little more careful. We note that the following scheme samples $k$ times exactly without replacement from $Q^*$: (1) take $k + 1$ items WOR from $Q$; (2) if any item is equal to $T^*$, discard it from the sample; (3) otherwise, discard the $(k+1)$th item. We also note (by conditioning on the event that $T^*$ appears in the sample) that the inclusion probabilities are easily calculated (if $T^*$ appears in the sample, take $\kappa$ to be the perturbed score of the $(k+2)$th item; otherwise take it to be the perturbed score of the $(k+1)$th item).
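A minimal sketch of the "WOR with 1-best" estimator on an enumerated candidate set is given below. It rests on assumptions that should be read as such: the sampling scheme (draw $k + 1$ items WOR from $Q$, drop $T^*$ if present, otherwise drop the last item) is inferred from the rule for $\kappa$ stated above, the candidates are listed explicitly instead of being drawn from the lattice, and `log_q` and `log_p` are hypothetical dictionaries of $\log Q(T \mid D)$ and $\log P_\theta(T, D)$.

```python
import math
import random
from typing import Dict


def gumbel(rng: random.Random) -> float:
    """Standard Gumbel noise, -log(-log(U)) with U uniform on (0, 1)."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return -math.log(-math.log(u))


def inclusion_prob(log_q_t: float, kappa: float) -> float:
    """q_kappa(T) = 1 - exp(-exp(log Q(T) - kappa))."""
    if kappa == -math.inf:
        return 1.0  # every candidate was sampled
    x = min(log_q_t - kappa, 50.0)  # beyond this the result is 1.0 in floating point
    return -math.expm1(-math.exp(x))


def wor_one_best_log_marginal(log_q: Dict[str, float], log_p: Dict[str, float],
                              k: int, rng: random.Random) -> float:
    """log( P(T*, D) + sum_i P(T_i, D) / q_kappa(T_i) ) with k samples WOR from Q*."""
    best = max(log_q, key=log_q.get)  # T*, the one-best tokenisation under Q
    perturbed = sorted(((lq + gumbel(rng), t) for t, lq in log_q.items()), reverse=True)
    top = perturbed[:k + 1]
    if any(t == best for _, t in top):
        # T* was drawn: discard it; the threshold is the (k+2)-th perturbed score.
        sample = [t for _, t in top if t != best]
        kappa = perturbed[k + 1][0] if len(perturbed) > k + 1 else -math.inf
    else:
        # T* was not drawn: discard the (k+1)-th item and use its score as threshold.
        sample = [t for _, t in top[:k]]
        kappa = top[k][0]
    terms = [log_p[best]]
    terms += [log_p[t] - math.log(inclusion_prob(log_q[t], kappa)) for t in sample]
    m = max(terms)
    return m + math.log(sum(math.exp(v - m) for v in terms))
```

Because renormalising $Q$ to $Q^*$ shifts every remaining log-probability and the threshold by the same constant, the inclusion probabilities can be computed with the original $\log Q$ values, which is what the sketch does. When $k$ covers every remaining candidate, the estimate reduces to the exact sum.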

Algorithm 1: Recursive algorithm for lattice entropy
Result: entropy H_n of segmentation lattice
init H_0 = 0 (and H_i = 0 for i > 0), α[i] the forward marginals;
for i = 1 to n do
    for w token terminating at position i do
        j = start position of w;
        // ϕ(w) is the score of token w
        p(w) = exp(α[j] + ϕ(w) − α[i]);
        H_i = H_i + p(w) (H_j − log p(w));
return H_n

Summing over the n-best tokenisations
An alternative approach to estimating $\sum_T P_\theta(T, D)$ is to restrict the sum to a smaller set of suitable candidates. As the unigram tokenisation objective decomposes over segments, one can use Viterbi search to find exactly the $n$ highest scoring tokenisations from the tokeniser. We then score each tokenisation using the language model, and sum the contribution of each estimate to obtain a (lower bound) estimate of the marginal likelihood. This estimator is high-bias and low-variance compared to the sampling-based estimators; we show in Section 4.1 that, although the n-best estimator performs well, it is possible to tune the sample-based estimators to perform better by trading bias for variance.
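The n-best estimator can be sketched with a standard dynamic program that keeps the $n$ highest-scoring partial tokenisations at each lattice position. This is one common way to obtain exact n-best paths and is not necessarily the search used by the authors; `scores` (token log-scores) and `log_p` (a stand-in for the language model) are hypothetical, and the input is assumed to be tokenisable.

```python
import heapq
import math
from typing import Callable, Dict, List, Tuple


def n_best(text: str, scores: Dict[str, float], n: int) -> List[Tuple[float, List[str]]]:
    """Return the n highest-scoring tokenisations of `text` as (score, tokens) pairs."""
    best: List[List[Tuple[float, List[str]]]] = [[] for _ in range(len(text) + 1)]
    best[0] = [(0.0, [])]
    for i in range(1, len(text) + 1):
        candidates = []
        for j in range(i):
            w = text[j:i]
            if w in scores:
                # Extend each of the n best prefixes ending at j with token w.
                candidates.extend((s + scores[w], toks + [w]) for s, toks in best[j])
        best[i] = heapq.nlargest(n, candidates, key=lambda c: c[0])
    return best[len(text)]


def n_best_log_marginal(text: str, scores: Dict[str, float], n: int,
                        log_p: Callable[[List[str]], float]) -> float:
    """Lower bound on log P(D): log-sum-exp of log P(T, D) over the n-best list."""
    terms = [log_p(toks) for _, toks in n_best(text, scores, n)]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))
```

Since each added tokenisation contributes a non-negative probability, the bound can only tighten as $n$ grows.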

Measuring segmentation lattice entropy
We believe that the entropy of the tokeniser segmentation lattice is an important quantity to measure. The entropy quantifies the uncertainty of the tokeniser, and has a nice interpretation as the (logarithm of the) size of the set of alternatives the tokeniser is choosing uniformly over. While the entropy over hidden states of other structured models like HMMs and CRFs has previously been published (Hernando et al., 2005; Mann and McCallum, 2007; Ilic, 2011), and a uniform treatment in terms of expectation semirings is given in Li and Eisner (2009), we are unaware of previous elementary derivations of the entropy of a segmentation lattice. We give the algorithm in Algorithm 1. Note that the recursion has a particularly nice interpretation in terms of information theory. Recall that the entropy of a random variable can be thought of as the necessary number of bits to transmit the random variable. The recursion states that, to transmit the lattice up to position $i$ (which takes $H_i$ bits), we can transmit a prefix of the lattice (using $H_j$ bits), and then transmit the token $w$ that goes from $j$ to $i$ (using $-\log P(w)$ bits). The total number of bits necessary is then the weighted sum of all possible ways of doing this, where the weights are given by the probability of that particular decomposition.
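The recursion described above can be written out as a short dynamic program. The sketch follows the prose description (each incoming token $w$ from $j$ to $i$ contributes $p(w)(H_j - \log p(w))$, with $p(w)$ obtained from the forward marginals) and is a reading of Algorithm 1 under that description; the `scores` dictionary of token log-scores is hypothetical, and entropies are returned in nats.

```python
import math
from typing import Dict


def lattice_entropy(text: str, scores: Dict[str, float]) -> float:
    """Entropy (in nats) of the distribution over tokenisations of `text`."""
    n = len(text)
    alpha = [-math.inf] * (n + 1)  # forward log-marginals
    alpha[0] = 0.0
    entropy = [0.0] * (n + 1)      # entropy[i] = H_i, entropy of paths ending at i
    for i in range(1, n + 1):
        edges = [(j, scores[text[j:i]]) for j in range(i)
                 if text[j:i] in scores and alpha[j] > -math.inf]
        if not edges:
            continue  # position i is unreachable
        m = max(alpha[j] + phi for j, phi in edges)
        alpha[i] = m + math.log(sum(math.exp(alpha[j] + phi - m) for j, phi in edges))
        for j, phi in edges:
            # Probability that a path ending at i uses this token as its last edge.
            log_p = alpha[j] + phi - alpha[i]
            entropy[i] += math.exp(log_p) * (entropy[j] - log_p)
    return entropy[n]


if __name__ == "__main__":
    # Hypothetical token scores: two tokenisations of "ab" with probabilities 1/3 and 2/3.
    toy = {"a": math.log(0.5), "b": math.log(0.5), "ab": math.log(0.5)}
    print(lattice_entropy("ab", toy))
```

The cost is linear in the number of lattice edges, in contrast to enumerating every path.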

Experiments
For our experiments, we first pretrain language models using one-best tokenisations from a tokeniser using WMT news shared task data (Barrault et al., 2020). We train models on both English and German data up to September 2017, reserving the rest of the 2017 data for validation and model selection. We use a Transformer-XL (Dai et al., 2019) model with 18 layers and a hidden size of 1024. During evaluation time, we do not use Transformer-XL memory, due to the interaction of batching and sampled tokenisation. While this may depress our results, we are not interested in absolute model performance per se, but rather in the relative performance of the marginal likelihood vs. the one-best likelihood.
The tokeniser we use at both training and evaluation time is a unigram tokeniser as implemented in the SentencePiece package (Kudo, 2018), with a vocabulary size of 50529. We train the tokeniser on the same training set, with a random sample of 100 million sentences for English, and 10 million documents for German.

Measuring the marginal likelihood
For both English and German, we use 500 documents sampled randomly from the WMT train and test data and 500 randomly sampled Wikipedia documents (WIKI). For English, we also use 500 documents from the CUSTOMNEWS and arXiv abstracts (ARXIV) datasets of Lazaridou et al. (2021), and for German, we additionally use 200 documents from the MC4 dataset in Xue et al. (2020).
For each method outlined in Section 2, we sample 128 different tokenisations of each document, and calculate $P_\theta(T_i, D)$ for each sample, before aggregating the sample scores into an estimate of the marginal likelihood. We parallelise evaluating all the samples for a document on a multi-host TPU setup; each dataset takes 15-30 minutes to evaluate.
Figure 1: The effect of temperature scaling on the estimated perplexity on all English datasets, using WOR 1-best. The y-axis is the percentage difference in perplexity relative to the n-best baseline (lower is better). Note the x-axis is scaled as $1/\tau$, rather than $\tau$.
Further, to ensure results are comparable across different tokenisations with potentially different numbers of tokens, we calculate perplexity by normalising the total log-likelihood across all documents by the total number of whitespace-delimited tokens. We present our results in Table 1.
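The normalisation described above can be sketched in a few lines. The function name and the assumption that log-likelihoods are natural logarithms are illustrative choices, not details given in the paper.

```python
import math
from typing import Sequence


def perplexity_per_word(log_likelihoods: Sequence[float],
                        documents: Sequence[str]) -> float:
    """Perplexity normalised by whitespace-delimited tokens, not sub-word tokens.

    `log_likelihoods[i]` is the (natural-log) likelihood assigned to
    `documents[i]`, whether from the one-best tokenisation or a marginal estimate.
    """
    total_log_likelihood = sum(log_likelihoods)
    total_words = sum(len(doc.split()) for doc in documents)
    return math.exp(-total_log_likelihood / total_words)
```

Because the denominator depends only on the raw text, the one-best and marginal perplexities of the same documents remain directly comparable even when the sampled tokenisations contain different numbers of sub-word tokens.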
Our results show that there can be a significant difference between the one-best tokenisation likelihood and the marginal likelihood, particularly as one moves further away from the training data domain. Indeed, the relative perplexity improvement reaches up to 1.9% on EN-ARXIV, and 0.9% on DE-MC4. Further, tokenising words consistently in a document has a large impact on the marginal likelihood estimation. We investigate this effect further in Section 4.3. While the n-best estimator appears to perform the best in this comparison, we show in the next section that by tuning the sampling temperature of the WOR 1-best estimator, it is possible to obtain even better estimates of the marginal likelihood.
The effect of sampling temperature. We also investigate sharpening the tokeniser distribution before sampling by multiplying the log-probability of each tokenisation by a factor of $1/\tau$ before sampling. Using $\tau < 1$ has often been shown to give improved results in various tasks (Kool et al., 2019; Melis et al., 2019; Adlam et al., 2020), and can be understood as a way of tuning the bias-variance tradeoff, with the n-best estimator at the high-bias, low-variance end and independent sampling at the other. We compare the WOR with 1-best estimator at a range of temperatures on our English datasets, and show the results in Figure 1. One can see that it is possible to improve on the n-best estimator by trading some bias for variance, and this can result in a better estimate of the marginal, especially for out-of-domain datasets.
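Temperature scaling of the proposal can be sketched on an enumerated set of tokenisations as below. The enumeration and the names are assumptions for illustration; on a segmentation lattice the same effect follows from multiplying every token score by $1/\tau$ before running the forward algorithm, since a path score is a sum of token scores.

```python
import math
from typing import Dict


def sharpen(log_q: Dict[str, float], tau: float) -> Dict[str, float]:
    """Scale log-probabilities by 1/tau and renormalise; tau < 1 sharpens Q."""
    scaled = {t: lq / tau for t, lq in log_q.items()}
    m = max(scaled.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in scaled.values()))
    return {t: v - log_z for t, v in scaled.items()}


if __name__ == "__main__":
    # Hypothetical tokeniser distribution over three tokenisations.
    log_q = {"a b c": math.log(0.6), "ab c": math.log(0.3), "abc": math.log(0.1)}
    for tau in (1.0, 0.5, 0.25):
        probs = {t: round(math.exp(v), 3) for t, v in sharpen(log_q, tau).items()}
        print(tau, probs)
```

As $\tau$ decreases, mass concentrates on the highest-scoring tokenisations, so sampling without replacement approaches the n-best list. Presumably the sharpened distribution, and not the original $Q$, then plays the role of the proposal when computing inclusion probabilities.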

Tokeniser entropy and the marginal gap
Next, we investigate what causes the gap between marginal likelihood and one-best likelihood, and whether there are easily measurable factors that might predict this difference. We hypothesise that, the more uncertain the tokeniser is, the bigger this gap becomes. We pool together the documents in all our evaluation sets, and test whether there is a correlation between tokeniser entropy and marginal gap. Our results, shown in Figure 2, demonstrate that there is a correlation between entropy and the marginal gap (Spearman r = 0.57 for English, 0.49 for German); interestingly, it appears that high tokeniser entropy is predictive of a bigger marginal gap, but large marginal gaps are possible even if the tokeniser has low entropy.

Analysing the caching behaviour of language models
Our results show that tokenising word types consistently within a document leads to significantly tighter estimates of the marginal likelihood compared to independently tokenising input tokens. We analyse this phenomenon in this section, by investigating the loss language models assign to repeated tokens in a document, conditioned on whether the token appears in the same tokenised form or not. Concretely, let $w_1, \ldots, w_m$ be the whitespace-delimited words in a document $D$, and let $T_1, \ldots, T_n$ be the sampled tokenisations of the document. Each word $w_i$ appears as a token sequence $T_{w_i} = t^1_{w_i} \ldots t^{n_i}_{w_i}$, and each sampled tokenisation $T_j$ can assign a different token sequence $T^j_{w_i}$ to the same underlying word. We look for words $w_k \in (w_1, \ldots, w_m)$ such that:
1. For some tokenisation $T_i$, for some $l < k$, $w_l = w_k$ and $T^i_{w_k} = T^i_{w_l}$ (the word has appeared before with the same tokenisation).
2. For some other tokenisation $T_j$, for all $l < k$ such that $w_l = w_k$, $T^j_{w_k} \neq T^j_{w_l}$ (all previous occurrences of this word in the document were tokenised differently).

Table 2: Investigating the caching ability of language models. For words which appear multiple times with different tokenisations, we show the average loss of the first occurrence of that word, of subsequent occurrences of that word with the same tokenisation (1), and subsequent occurrences of that word in a different tokenisation (2). WMT Tr and WMT Te are the WMT training and test evaluation sets respectively. (Column groups: All words; Multi-token words. Columns within each group: First, (1), (2).)
We then calculate $P_\theta(w_k \mid w_{<k})$ for each tokenisation $T_i$ (by summing the scores of the tokens in $w_k$), and microaverage separately the loss for tokenisations which fulfill condition (1) and condition (2). The microaveraged loss for (1) represents the language model being able to copy the word as a sequence of tokens from its memory, while the microaveraged loss for (2) represents the model having to generate the word afresh as a new sequence of tokens. By comparing the loss of words paired in this way, we can control for extra confounding factors (such as token unigram probability), and isolate the ability of the language model to recognise whether different token sequences correspond to the same underlying form.
Figure 2: The correlation between entropy per token and the marginal gap per token in nats (not in perplexity), categorised by evaluation dataset. Some data points which extend beyond the right of the graph are truncated; they follow the same trend.
We show our results for our various datasets, together with selected subsets of words, in Table 2. We see that, if the language model sees a word after already seeing it in the same tokenisation, its loss is significantly lower than the loss associated with the first time the word is seen (as was also reported in Lazaridou et al. (2021)). However, this ability is strongly tied to the exact tokenisation of the word: if it appears again, but in a different tokenisation, then its loss can in fact be even greater.
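The bookkeeping behind this analysis can be sketched as follows. The sketch is a simplification: it labels occurrences within a single sampled tokenisation, whereas the analysis above pairs the same word position across different sampled tokenisations. The function names and the label strings are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set, Tuple


def classify_occurrences(tokenised_words: Sequence[Tuple[str, Sequence[str]]]) -> List[str]:
    """Label each word occurrence as 'first', 'same' (condition 1) or 'different' (condition 2)."""
    seen: Dict[str, Set[Tuple[str, ...]]] = {}
    labels = []
    for word, tokens in tokenised_words:
        key = tuple(tokens)
        if word not in seen:
            labels.append("first")
        elif key in seen[word]:
            labels.append("same")       # appeared before with this exact tokenisation
        else:
            labels.append("different")  # every earlier occurrence was tokenised differently
        seen.setdefault(word, set()).add(key)
    return labels


def micro_average(labels: Sequence[str], losses: Sequence[float]) -> Dict[str, float]:
    """Average the per-word losses separately for each label."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for label, loss in zip(labels, losses):
        totals[label] += loss
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}
```

Here the per-word loss would be the sum of the language model losses of the tokens making up that word, as described above.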

How many samples are necessary?
Next, we investigate how many samples are necessary to obtain an accurate estimate of the marginal likelihood. We experiment on the EN-ARXIV dataset, as this showed the biggest relative improvement between the marginal likelihood and the one-best likelihood. We take the samples from our n-best estimator with $n = 128$, and incrementally sum the samples (which are given in decreasing order of likelihood under the tokeniser) to simulate having smaller $n$. As an oracle experiment to see how many samples contribute significantly to the marginal likelihood, we also order the samples by their language model scores (i.e. we order according to $P_\theta(T, D)$ rather than $Q(T \mid D)$) before taking the incremental sum. We show the results in Figure 3. Our results show that, although ostensibly many samples are necessary to estimate the marginal likelihood accurately, only very few samples (on the order of 5) actually contribute significantly.
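The incremental-sum experiment can be sketched as two running log-sum-exp curves over the same set of scored tokenisations, one ordered by the tokeniser and one by the language model (the oracle). The function names and the list-based inputs are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple


def running_log_sum(log_ps: Sequence[float]) -> List[float]:
    """out[n-1] = log of the sum of the first n probabilities, computed stably."""
    out, acc = [], -math.inf
    for lp in log_ps:
        if acc == -math.inf:
            acc = lp
        else:
            m = max(acc, lp)
            acc = m + math.log(math.exp(acc - m) + math.exp(lp - m))
        out.append(acc)
    return out


def incremental_curves(log_q: Sequence[float],
                       log_p: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Running estimates when samples are ordered by Q (tokeniser) and by P (oracle)."""
    by_q = sorted(range(len(log_q)), key=lambda i: log_q[i], reverse=True)
    by_p = sorted(range(len(log_p)), key=lambda i: log_p[i], reverse=True)
    return (running_log_sum([log_p[i] for i in by_q]),
            running_log_sum([log_p[i] for i in by_p]))
```

Both curves end at the same value; the oracle curve is never below the tokeniser-ordered curve, and the distance between them at small $n$ indicates how far the tokeniser's ranking is from the language model's.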
In practical terms, our results suggest that one needs to take many samples with current tokenisers to accurately estimate the marginal likelihood, but that many of these samples are not effective. We therefore believe that a prerequisite for more widespread adoption of marginal likelihood as an evaluation metric is tokenisers that better fit the language model posterior over tokenisations. Current tokenisers make very strong independence assumptions to make learning and inference tractable, and we believe there is significant scope to design tokenisers which relax these assumptions.

Tokenisation and segmentation
Unsupervised word segmentation has a long and illustrious history. The earliest motivations came from information retrieval, where collapsing a set of related query terms might help smooth counts over each of those terms individually and result in better retrieval results. The earliest approaches, such as the Porter stemmer (Porter, 1997), were rule-based. However, the power of data-driven statistical methods quickly became apparent, and tools such as Morfessor (Virpioja et al., 2013) used likelihood-based objectives, typically with Bayesian smoothing methods (see also Goldwater et al. (2011)), to induce segmentations. Sennrich et al. (2016) used a different algorithm to induce segmentations: byte-pair encoding (Gage, 1994). Originally designed as a data compression algorithm, BPE is now among the most widely used tokenisation methods. Alternative approaches, such as WordPiece (Schuster and Nakajima, 2012) and SentencePiece (Kudo, 2018), explicitly use a language modelling objective to induce a token lexicon. Previous methods have used train-time tokenisation randomisation as a regularisation aid (Kudo, 2018; Provilkov et al., 2020), but still use the one-best tokenisation at test time.
Another strand of work has investigated whether tokenisers that capture linguistic morphology can improve language models. Bostrom and Durrett (2020) showed that unigram and BPE tokenisers for English and Japanese have low recall on recovering linguistic segments, since many morphologically complex words are treated as a single token. Linguistically aligned tokenisers have been shown to result in better language model perplexity (Schwartz et al., 2020; Park et al., 2021) and better downstream task performance (Alkaoud and Syed, 2020), especially for morphologically rich languages. These experiments also use one-best tokenisation at test time.
Rather than considering one-best or stochastic samples of tokenisations, one can use entire segmentation lattices as input to a model. This approach has been considered for morphological tagging, parsing (Goldberg and Tsarfaty, 2008), and spoken intent recognition (Ladhak et al., 2016), among others.

Tokenisation-free approaches
An alternative approach to inducing a tokenisation is to decompose input sequences into well-defined orthographic units, such as characters. These approaches circumvent the problem of inducing a lexicon, and have been used for text classification (Conneau et al., 2017), language modelling (Al-Rfou et al., 2019), machine translation (Lee et al., 2017), and word representation (Cao and Rei, 2016). One downside is that dependency lengths become longer at the character level, and lexical information has to be memorised by the compositional machinery of the model. For this reason, fully character-based approaches have traditionally not performed as well as their token-level counterparts, although recent progress suggests this may change soon (Clark et al., 2021). There also exist approaches which mix character-level and segment-level approaches (Buckman and Neubig, 2018; Kawakami et al., 2019; He et al., 2020), although these segmental language models require more complex inference procedures.

Conclusion
In this paper, we argue for using model marginal likelihood over tokenisations as an evaluation metric for language models, rather than one-best tokenisation likelihood. We introduce practical low-variance estimators for measuring the marginal likelihood, and demonstrate that there can be a significant difference between the marginal and the one-best likelihoods, particularly on strongly out-of-domain evaluation sets. Evaluating with marginal likelihood thus goes some way toward loosening the bottleneck imposed by tokeniser quality in the currently dominant language modelling paradigm, and our results suggest that the field may be underestimating the generalisation capability of modern language models. We further demonstrate that tokeniser entropy is a good predictor of this "marginal gap", suggesting that tokeniser entropy, especially when out-of-domain, can be a guide to the number of samples needed for evaluation.
More broadly, our experiments suggest that the field should continue seeking better ways to incorporate tokenisation into end-to-end language modelling. Sampling from the tokeniser during training is an obvious possibility; alternatively, one could incorporate the segmentation lattice into the model directly, which has been beneficial for parsing morphologically rich languages (Goldberg and Tsarfaty, 2008). Further, developing more contextual tokenisers which make fewer independence assumptions can also result in both better language models trained on their one-best tokenisation, and better evaluation estimates of the marginal likelihood with fewer samples.
We conduct experiments on German and English corpora in this paper. However, these two languages are only a small sample in the full space of language typology. English is a morphologically impoverished language, and while German compounding and inflection offer some additional challenges, many languages have more complex patterns of word formation and inflection. We believe that estimating marginal likelihood will be important for morphologically richer languages, where tokenisation makes a bigger difference (Gerz et al., 2018; Mielke et al., 2019).
Finally, improved understanding of the interaction between tokenisation and language modelling has implications for evaluating language models on both downstream tasks and language generation tasks. Evidence has shown that gains in language modelling, as measured in perplexity, often lead to improvements in downstream task performance (Radford et al., 2019). It would be instructive to extend our marginal likelihood approach to downstream task evaluation. On generation tasks, since the tokeniser affects language model training but is only implicitly used when sampling (via the tokeniser vocabulary), the effect of tokenisation algorithms requires careful investigation.