Koala: An Index for Quantifying Overlaps with Pre-training Corpora

In recent years, increasing attention has been placed on probing the role of pre-training data in the downstream behaviour of Large Language Models (LLMs). Despite its importance, there is no public tool that supports such analysis of pre-training corpora at large scale. To help research in this space, we launch Koala, a searchable index over large pre-training corpora built on compressed suffix arrays, with a highly efficient compression rate and fast search support. In its first release we index the public portion of the OPT 175B pre-training data. Koala provides a framework for forensic analysis of current and future benchmarks, as well as for assessing the degree of memorization in the output of LLMs. Koala is available for public use at https://koala-index.erc.monash.edu/.


Introduction
Large Language Models (LLMs) have achieved state-of-the-art results in NLP, and on many benchmarks they have reached the performance ceiling (Chowdhery et al., 2022). This ever-growing success has been facilitated by algorithmic and computational progress in scaling up model sizes (Wei et al., 2022a; Chowdhery et al., 2022; Zhang et al., 2022; Brown et al., 2020), integrating human feedback (Ouyang et al., 2022), adopting modes of instructional inference in both zero- and few-shot settings (Chen et al., 2022; Kojima et al., 2022; Wei et al., 2022b; Nye et al., 2021), as well as by feeding them massive volumes of free text during pre-training.
Recent works exhibit various cases highlighting the sensitivity of the downstream behaviour of LLMs (and their smaller variants) to the frequency of observed overlap between pre-training corpora and test sets (Carlini et al., 2022; Tänzer et al., 2022; Razeghi et al., 2022; Magar and Schwartz, 2022; Lewis et al., 2020). In the generative setting, several issues such as hallucination (Dziri et al., 2022), undesired biases (Feng et al., 2023; Kirk et al., 2021), or toxicity (Gehman et al., 2020) have been attributed partly or fully to the characteristics of the pre-training data, while a parallel line of work has emphasised the positive role of filtering the pre-training data for safety and factual grounding (Thoppilan et al., 2022).
The above observations are not a comprehensive list, but they echo the undeniable role of pre-training data in how these models function in practice. Understanding the limitations imposed by pre-training data would also lead to more informed algorithmic and computational innovations (Collier et al., 2022). However, such forensic studies are done either at a small scale or by using surrogate sources such as web search hit counts, mainly due to the absence of reliable tools supporting deeper analyses in this space at large scale. Our work attempts to fill this gap.
We launch the Koala project, a service backed by lossless compressed suffix arrays (CSAs) (Navarro and Mäkinen, 2007) with an efficient compression rate and query support. Koala contains a searchable index over the public portion of the pre-training corpora of several existing pre-trained language models, from OPT 175B (Zhang et al., 2022) to BERT (Devlin et al., 2019a). Koala is intended to provide various overlap statistics for text query files provided by researchers. We foresee several areas of impact for Koala: (i) as a tool to measure data leakage between existing benchmarks and the pre-training corpora of LLMs, (ii) to evaluate the degree of memorisation or creativity in generative models' output, and (iii) to support the design of harder benchmarks by reducing their overlap with pre-training corpora. We present an overview of the Koala pipeline for pre-processing and constructing the index, and provide examples of the types of analyses that can be done via Koala by looking at a few commonly used test benchmarks.
2 Pre-processing and Corpora Coverage

2.1 Pre-processing Steps

Our pre-processing pipeline includes three main steps: cleaning, deduplication, and tokenization. The cleaning step varies according to the pre-training corpus and is described in Section 2.2, where we introduce the corpora covered by Koala. In this section, we describe the deduplication and tokenization steps, which are shared across all pre-training corpora.
In the deduplication step we use MinHashLSH (Rajaraman and Ullman, 2011, Chapter 3), a widely adopted duplicate detection method for large-scale datasets. Documents are first converted into a set of unigram tokens (shingling) and then hashed into a short signature, the minhash, such that the similarity among documents is preserved. MinHash is a permutation-based hashing algorithm that generates random hashes to approximate the Jaccard similarity (Broder, 1997; Cohen et al., 2001). We generate the minhashes with 100 permutations. Finally, the locality-sensitive hashes (LSH) of the minhash values are calculated to detect candidate duplicate pairs. We follow Zhang et al. (2022) and remove documents with Jaccard similarity scores above a 0.95 threshold. Our deduplication implementation is based on the datasketch library. To scale the deduplication process to large corpora, we first perform deduplication in small batches and gradually merge the deduplicated batches. Deduplication proved to be by far the most time-consuming step of our pre-processing, taking 2-3 orders of magnitude longer than the indexing itself. We only applied deduplication to a corpus if the models trained on that corpus did so as well (i.e., according to their corresponding published details).
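The core of the minhash estimate can be sketched in a few lines of pure Python. Our actual pipeline uses the datasketch library; the hash-salting scheme and the toy documents below are illustrative assumptions, not the production code:

```python
import hashlib

def minhash_signature(tokens, num_perm=100):
    """Approximate a set of unigram shingles with `num_perm` minhash values.
    Each of the random permutations is simulated by salting a hash
    function with a different seed (illustrative scheme only)."""
    return [
        min(int.from_bytes(hashlib.md5(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in tokens)
        for seed in range(num_perm)
    ]

def estimate_jaccard(sig_a, sig_b):
    """The fraction of agreeing minhash slots estimates the Jaccard
    similarity of the two underlying shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# toy near-duplicate documents (invented for illustration)
doc_a = set("the cat sat on the mat".split())
doc_b = set("the cat sat on a mat".split())
sig_a, sig_b = minhash_signature(doc_a), minhash_signature(doc_b)
print(estimate_jaccard(sig_a, sig_b))  # close to the true Jaccard of 5/6
```

In the full method, the LSH step then bands these signatures so that only documents agreeing on at least one band become candidate pairs, avoiding an all-pairs comparison.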
The deduplicated corpus is then tokenized with Moses (Koehn et al., 2007) to normalize punctuation and remove non-printing characters.
While most existing LLMs use more sophisticated forms of tokenization (e.g., BytePiece, SentencePiece), we choose Moses tokenization as measuring data overlap under token boundaries is a more interpretable and intuitive metric.

2.2 Corpora Coverage
The latest version of Koala at the time of writing covers the following corpora.

BookCorpus (Zhu et al., 2015) is a large-scale dataset of text derived from books across various genres and topics. We obtained this corpus from Hugging Face. The dataset has been used in pre-training multiple large language models such as BERT (Devlin et al., 2019b), RoBERTa (Liu et al., 2019), GPT-3 (Brown et al., 2020), and OPT (Zhang et al., 2022).
Pushshift Reddit is a project that collects and provides access to Reddit data for research and analysis. We used langdetect to detect and extract the English comments and submissions posted from 2005 to 2019. We followed the pre-processing procedure of Roller et al. (2021) to remove posts from known non-English subreddits and bots, as well as comments longer than 2048 characters, containing URLs, or at depth larger than 7 in a thread. The dataset constitutes a substantial portion of the pre-training data for OPT (Zhang et al., 2022).
Table 1 reports the size of each corpus in its raw and deduplicated (if applicable) versions.
3 Pipeline and Features of Koala

3.1 Data Structure of Koala
Our index construction is inspired by the language models of Shareghi et al. (2015), which leverage compressed data structures for building language models on large text corpora. In this subsection we provide a brief overview of the data structures behind Koala, and refer the reader to Shareghi et al. (2016) for further details on the compression framework.
A Suffix Array (SA) (Manber and Myers, 1993) of a string T over an alphabet σ is an array of its lexicographically sorted suffixes: a cell SA[i] stores the starting position in T of the i-th smallest suffix. Using a suffix array, searching for any sequence u in T translates into a binary search for the range that spans all suffixes having u as their prefix, and takes O(|u| log |T|) time. However, an SA occupies 4-8|T| bytes in practice, making it impractical for large data.
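The SA count query described above can be sketched as follows. This uses a naive O(|T|² log |T|) construction for illustration only; production indexers such as Koala's rely on compressed structures and far more efficient construction algorithms:

```python
def build_suffix_array(text):
    """Return the starting positions of all suffixes of `text`,
    sorted lexicographically (naive construction, for illustration)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def count_occurrences(text, sa, pattern):
    """Count occurrences of `pattern` in `text` via binary search on the SA.
    Each probe compares up to |pattern| characters: O(|pattern| log |text|)."""
    m = len(pattern)
    # leftmost suffix whose length-m prefix is >= pattern
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # leftmost suffix whose length-m prefix is > pattern
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - start  # width of the SA range matching `pattern`

text = "banana"
sa = build_suffix_array(text)
print(count_occurrences(text, sa, "ana"))  # → 2 (positions 1 and 3)
```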
To support search on large collections, the Compressed Suffix Array (CSA) exploits the compressibility of T while providing the same functionality as the SA, in space comparable in practice to that of bzip2-compressed T. We follow Shareghi et al. (2016) and use the FM-Index (Ferragina et al., 2008), which utilises the lossless compressibility of the text via its Burrows-Wheeler transform (BWT) (Burrows and Wheeler, 1994). The BWT is defined as BWT[i] = T[SA[i] - 1] (taken modulo |T|), i.e., the character preceding each suffix in sorted order. Searching for a sequence u over the BWT is done in reverse order (backward search) and requires O(|u| log |σ|) time. For more details on the BWT and backward search, we refer to Navarro and Mäkinen (2007).
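As a toy illustration of backward search over the BWT (using naive linear-scan rank queries in place of the compressed rank structures a real FM-Index relies on, and an assumed `$` sentinel):

```python
def bwt(text):
    """Burrows-Wheeler transform: last column of the sorted rotations
    of text terminated with a unique `$` sentinel."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)

def backward_count(bwt_str, pattern):
    """FM-Index-style backward search: count occurrences of `pattern`
    using only the C[] table and rank queries on the BWT string."""
    # C[c] = number of characters in the text strictly smaller than c
    sorted_chars = sorted(bwt_str)
    C = {c: sorted_chars.index(c) for c in set(bwt_str)}

    def rank(c, i):  # occurrences of c in bwt_str[:i] (naive O(i) scan)
        return bwt_str[:i].count(c)

    lo, hi = 0, len(bwt_str)      # current SA interval [lo, hi)
    for c in reversed(pattern):   # match the pattern right-to-left
        if c not in C:
            return 0
        lo = C[c] + rank(c, lo)
        hi = C[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

b = bwt("banana")                 # "annb$aa"
print(backward_count(b, "ana"))   # → 2
```

Each character of the pattern shrinks the suffix-array interval with two rank queries, which is what yields the O(|u| log |σ|) bound when ranks are answered by a wavelet tree instead of a scan.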
The CSA is at the core of Koala's index and search backbone. We used the SDSL library (Gog et al., 2014) to implement our corpus indexer, indexing each corpus separately. Once a corpus is indexed, its constructed index sits on disk and can be queried through the Koala web interface (introduced shortly). Each query is launched against the indexed collection of corpora and returns the hit counts of the query in the corresponding corpus. Table 1 reports the time and memory usage for constructing the indexes.

3.2 n-gram Overlap Statistics of Koala
Given a text query, Koala can provide its count statistics in several pre-training corpora by querying the constructed indexes. An example of the raw count output for the phrase plastic bags floating in the ocean over the OPT 175B pre-training corpora is shown in Table 2. Meaningful insights can be derived from these raw statistics. Figure 1 illustrates two high-level statistics built on top of the n-gram counts for two question answering benchmark test sets, PIQA (Bisk et al., 2020) and OpenBookQA (Mihaylov et al., 2018), highlighting the amount of leakage or overlap that exists between these test sets and the entire pre-training data collection indexed in Koala. We first introduce how these statistics are calculated per instance, noting that Figure 1 reports them as an average across all instances in each test set. The high-level statistics are defined as follows.

Per Instance k-gram hit ratio measures

hit-ratio(x; k, t) = |M^{k,t}_x| / |N^k_x|,

where N^k_x is the set of all k-grams of instance x, and M^{k,t}_x is the subset of N^k_x containing only the k-grams with frequency at or above a pre-set threshold t (e.g., ≥ 1, ≥ 10, ≥ 100, ≥ 1k, ≥ 10k, ≥ 100k, ≥ 1M).

Per Instance k-gram hit length ratio measures

hit-length-ratio(x; l, t) = |M^{l,t}_x| / |N^l_x|,

where N^l_x is the set of all substrings of instance x that fall within the length bin l (e.g., l = [0.75, 1.00] means all substrings whose lengths are 3/4 of the length of x or more), and M^{l,t}_x is the subset of N^l_x containing only the substrings with frequency at or above a pre-set threshold t (e.g., ≥ 1, ..., ≥ 1M). In this illustration we considered 4 length bins: [0, 0.25), [0.25, 0.50), [0.50, 0.75), and [0.75, 1]. While a deep dive into exploring the dependence between data overlap, model size, and model performance requires a separate work, here we unpack some highlights from the figures.

Highlights from Figure 1 (Left Panel): The top-left panel highlights that for OpenBookQA above 75% of the unigrams and bigrams of the test set occur at least once (≥ 1) in the pre-training data, while this drops to below 50% at a higher threshold (≥ 1k). We observe that above 25% of trigrams occur at least 100 times in the pre-training data. Looking at the bottom-left panel for PIQA, we see a much stronger indication of data overlap; for instance, above 55% of bigrams occur at least 100 times in the pre-training data. Comparing the two datasets at the extreme frequency threshold of ≥ 1M, we observe that above 50% of PIQA unigrams occur at least 1M times in the pre-training data, while this is roughly 30% for OpenBookQA.
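Given per-n-gram corpus counts, the per-instance k-gram hit ratio reduces to a few lines. The `count` lookup below is a hypothetical stand-in for a Koala index query, and the toy frequencies are invented for illustration:

```python
def kgram_hit_ratio(tokens, k, t, count):
    """|M^{k,t}_x| / |N^k_x|: the fraction of an instance's k-grams whose
    corpus frequency (looked up via `count`) is at least the threshold t."""
    ngrams = [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    if not ngrams:
        return 0.0
    return sum(1 for g in ngrams if count(g) >= t) / len(ngrams)

# hypothetical corpus frequencies standing in for a Koala index lookup
freq = {("the", "cat"): 1200, ("cat", "sat"): 40, ("sat", "down"): 3}
count = lambda g: freq.get(g, 0)

tokens = "the cat sat down".split()
print(kgram_hit_ratio(tokens, 2, 10, count))  # 2 of 3 bigrams pass t=10
```

The hit length ratio is computed analogously, but enumerating substrings whose lengths fall in a bin relative to the instance length instead of fixed-size k-grams.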
Highlights from Figure 1 (Right Panel): Note that the average answer lengths in the PIQA and OpenBookQA test sets are 101 and 20 tokens, respectively. This means that the [0.25, 0.50) length bin covers sequences of roughly 25-50 tokens for PIQA, while it covers roughly 5-10 tokens for OpenBookQA. For OpenBookQA (top-right), we observe from the red bars that above 25% of test instances (roughly 125 cases out of the 500 test instances in OpenBookQA) are covered in the pre-training data over [75%, 100%] of their length at least 100 times (≥ 100); this corresponds to matches of 15-20 words. Looking at PIQA (bottom-right), although the coverage with respect to the full length is not as pronounced as for OpenBookQA, matches in each corresponding length bin of PIQA are roughly 4× longer than for OpenBookQA. For instance, about 5% of the test instances of PIQA (roughly 90 cases out of the 1838 test instances in PIQA) have a matching substring of 25-50 words which occurs at least 1k times in the pre-training data (see the yellow bar for ≥ 1k).
The performance ceilings obtained by the GPT-3 and OPT models on these two benchmarks (the numbers reported in Appendix A of Zhang et al. (2022) indicate that the largest variants of both models achieve roughly 80% accuracy on PIQA and above 57% accuracy on OpenBookQA), together with our highlighted findings, suggest a positive correlation between the amount of data overlap and the task performance ceiling of LLMs trained on the same pre-training corpora. As a future direction of analysis, it would be interesting to leverage Koala to analyse the interdependence of the amount of data overlap, model size, and task performance.

Interface of Koala
In this section, we give an overview of the interface of Koala; Figures 2a and 2b demonstrate some of its features. In addition to reporting raw counts, Koala provides an interface to upload an n-gram file and to visualize different hit ratio statistics (§3.2). The n-gram file is a plain text file where each line is an n-gram whose overlap statistics will be computed. Figure 2a shows the output of this feature. We also provide interactive versions of the ratio plots (e.g., Figure 1) for 3 question answering benchmarks, HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), and OpenBookQA (Mihaylov et al., 2018), where overlap and memorization are critical in evaluation.
For resource management, we limit the live demo queries to n-gram files below 2MB. For larger files and more comprehensive statistics, we provide a form for users to submit the data and queue the computation. Upon completion (within 72 hours, depending on the queuing load), a JSON file is returned to the user with overlap breakdowns per pre-training corpus for various n-gram lengths. The query files and the JSON file are kept for only 72 hours, after which we permanently delete them from the server.
Another use case of the overlap statistics is to provide a measure of creativity for generative LLMs, i.e., whether the generated text is novel or a memorization of the pre-training corpora. Koala implements a tool to verify the novelty of the output of a generative LLM given a prompt. Figure 2b shows an example of this feature, which provides the count statistics of the n-grams in the generated text and highlights the overlapping n-grams.

Conclusion and Future Work
We presented Koala, a web-based service powered by a compressed data structure backbone that facilitates efficient search over large collections of text. Koala is a tool for comprehensive overlap analysis, with potential use cases including, but not limited to, assessing the leakage of test benchmarks and measuring the degree of memorization in the outputs of generative LLMs. Koala not only provides a public tool for forensic analysis of these phenomena, but could also help benchmark designers construct more challenging testbeds for LLMs.
Table 2: The n-gram hit statistics per corpus for the correct answer (plastic bags floating in the ocean) to the query "Which of these situations is an example of pollutants?", with choices: [plastic bags floating in the ocean, mallard ducks floating on a lake, cottonwood seeds floating in the air, cirrus clouds floating in the sky]. This is a sample from the OpenBookQA benchmark.

Figure 1: Visualisations of n-gram overlap statistics for the OpenBookQA and PIQA test sets, answer side. Top: OpenBookQA answer set; Bottom: PIQA answer set. Left: average per-instance k-gram hit ratio (a k-gram hit ratio of 1 means 100% of the k-grams in one instance were a hit); Right: average per-instance k-gram hit length ratio with respect to the instance length (1 means the instance was fully covered, 0.75 means it was 3/4 covered, etc.). The PIQA test set size is 1838; the OpenBookQA test set size is 500.
(a) Various n-gram statistics, available both through the interface and in the JSON result files. (b) Count statistics of various n-grams in the generated text, with overlapping n-grams highlighted.

Figure 2: Snapshots of a few of the Koala webpage features.

Table 1: Statistics of the corpora, the deduplication step, and the index construction. Indexing is done on a single CPU core of a 2.70 GHz Intel Xeon Gold 6150 and requires RAM equal to roughly 2.5× the index size.