GAIA Search: Hugging Face and Pyserini Interoperability for NLP Training Data Exploration

Noticing the urgent need for tools enabling fast and user-friendly qualitative analysis of the large-scale textual corpora of modern NLP, we propose to turn to mature, well-tested methods from the domain of Information Retrieval (IR) - a research field with a long history of tackling TB-scale document collections. We discuss how Pyserini - a widely used toolkit for reproducible IR research - can be integrated with the Hugging Face ecosystem of open-source AI libraries and artifacts. We leverage the existing functionalities of both platforms while proposing novel features that further facilitate their integration. Our goal is to give NLP researchers tools that will allow them to develop retrieval-based instrumentation for their data analytics needs with ease and agility. We include a Jupyter Notebook-based walkthrough of the core interoperability features, available on GitHub: https://github.com/huggingface/gaia. We then demonstrate how the ideas we present can be operationalized to create a powerful tool for qualitative data analysis in NLP. We present GAIA Search - a search engine built following the previously laid out principles, giving access to four popular large-scale text collections. GAIA serves the dual purpose of illustrating the potential of the methodologies we discuss while also standing alone as a qualitative analysis tool that NLP researchers can leverage to understand datasets prior to using them in training. GAIA is hosted live on Hugging Face Spaces: https://huggingface.co/spaces/spacerini/gaia.


Introduction
Training large language models, or LLMs (Brown et al., 2020; Lieber et al., 2021; Rae et al., 2021; Smith et al., 2022; Le Scao et al., 2022; Chowdhery et al., 2022; Touvron et al., 2023), has established itself as the central task of modern Natural Language Processing (NLP) research. Attempts to understand the scaling laws of LLMs have led researchers to believe that simply increasing the number of parameters may not bring the desired improvements without a simultaneous increase in the size of the LLM training data (Kaplan et al., 2020; Hoffmann et al., 2022). These observations have only intensified an already pressing need for massive textual datasets, fueling the proliferation of TB-scale Web-based corpora created with varying levels of curation and quality control.
Rather than investing in scraping the Web on their own, dataset creators typically turn to Common Crawl 1 as the main source of text to include in their corpora. A repository of Web snapshots dating back to 2011, Common Crawl contains various types of low-quality text (Luccioni and Viviano, 2021). Pre-processing steps commonly introduced by dataset creators aiming to filter out undesired content include removing documents with words matching a pre-defined, static blacklist, as in the case of C4 (Raffel et al., 2020), perplexity-based filtering, as in CCNet and ROOTS (Wenzek et al., 2019; Laurençon et al., 2022), removing malformed text via simple text statistics, as in the case of OSCAR (Abadji et al., 2022), or deduplication, studied extensively by Lee et al. (2022). However, the resulting artifacts still tend to contain a multitude of worrying phenomena, such as synthetic data (Dodge et al., 2021), private and copyrighted data (Huang et al., 2022), or incorrect language codes and translations (Kreutzer et al., 2022). A lack of representation of diversity, along with socio-cultural and socio-economic biases, constitutes another big challenge of Common Crawl and the datasets derived from it (Bender et al.; Blodgett et al., 2020; Field et al., 2021; Stanczak and Augenstein, 2021; Beaulieu and Leonelli, 2021).
Aware of the mounting problems with training data for modern LLMs, and appreciating the value of data exploration for better modeling in general, we focus our current work on building tools that can facilitate the qualitative analysis of NLP datasets. We propose to leverage the extensive experience of the Information Retrieval community in building relevance-based search indices for large-scale document collections and put it into practice in the context of NLP data exploration work. We follow with a demonstration of ways in which the interoperability between Pyserini (Lin et al., 2021), a leading toolkit for reproducible IR research, on one side, and Hugging Face (https://huggingface.co/), a platform for open AI research, on the other, can be leveraged to build tools for easy and effective analysis of textual data. To facilitate the adoption of the proposed methods, we provide a collection of Jupyter Notebooks with step-by-step explanations of the explored functionalities, available on GitHub.
Finally, we release GAIA - a simple yet powerful search engine providing a relevance-based interface to four popular, large-scale textual datasets, namely C4 (Raffel et al., 2020), the Pile (Gao et al., 2021; Biderman et al., 2022), ROOTS (Laurençon et al., 2022), and captions from LAION-2B-en (Schuhmann et al., 2022). All the considered datasets rely to a large extent on data mined from Common Crawl. GAIA benefits from the interoperability between Pyserini and Hugging Face that we discuss in the first part of the paper, while also constituting a standalone contribution that can benefit the NLP research community by making it easy to study leading corpora qualitatively. GAIA is available online at hf.co/spaces/spacerini/gaia.

Background
The ability to analyze large collections of textual data is core to multiple research and engineering disciplines. While the industrial standard is to rely on robust, scalable database and data analytics infrastructure, in research environments we typically resort to more local, granular, and flexible, if ad-hoc, solutions which leverage toolkits such as NumPy (Harris et al., 2020), Pandas (pandas development team, 2020; Wes McKinney, 2010), SciPy (Virtanen et al., 2020), and others. A common research approach to data analytics involves using one of the aforementioned packages in combination with Jupyter Notebooks 3 . Notebooks make it easy to deploy and share analyses; however, they typically remain essentially non-interactive, requiring at least a basic understanding of programming to work with efficiently. With the commodification of AI, and NLP in particular, and the expansion of NLP technologies into research areas beyond AI (Yang et al., 2022; Smith et al., 2015; Bhardwaj et al., 2017; Niezni et al., 2022), a need arises for easy-to-use, no-code tools for understanding AI artifacts. This need is partly addressed by Python packages such as Streamlit 4 and Gradio 5 , designed to facilitate the creation of interactive Machine Learning (ML) demos. As the authors of the Gradio white paper (Abid et al., 2019) point out, the accessibility and ease of use of analysis tools is critical if we want to build an understanding of AI and trust in it. The Hugging Face Spaces platform, providing free hosting of Streamlit, Gradio, and Docker-based applications, serves this exact purpose. However, it puts emphasis on demonstrating the capabilities of models while paying less attention to the datasets used to train them.

Even more so than in NLP, the evaluation of IR systems is heavily dependent on the implementation details of the retrieval systems serving the search indices being evaluated. The lack of standardisation of IR evaluation was the main motivator behind creating Anserini (Yang et al., 2017), a Lucene 6 -based toolkit for reproducible IR research, and the follow-up Pyserini (Lin et al., 2021) - a convenient Python API to the underlying Java-based implementation of Anserini. While it is relatively easy to build and serve search indices backed by Pyserini and Lucene, the task of building and deploying interactive user interfaces generally comes with a higher engineering barrier to entry.
Relevance-based search interfaces have been previously explored in the context of NLP - e.g. in the C4 analysis (Dodge et al., 2021), in COVID-related datasets (Zhang et al., 2020), and in news quotes (Vuković et al., 2022). Rather than focusing only on providing finished artifacts, however, we intend our current work to serve as a reference and inspiration for NLP researchers looking to develop and deploy similar applications themselves.
We attempt to bring together the power of Pyserini-backed retrieval and the agility of ML demo development within the Hugging Face ecosystem to serve the goal of building intuitive data exploration tools. We believe that the resulting applications will make a great difference for NLP researchers trying to study their data qualitatively, as well as for non-technical researchers looking for tools that allow them to perform dataset analysis in a no-code fashion. We propose our search engine GAIA as a compelling case in point.

Pyserini and Hugging Face: From Data to Search
In this section we discuss the core components which need to be considered when building a search application for textual datasets. We focus on how each step can be facilitated by the use of Pyserini, Hugging Face, or a combination of the two. We also provide hands-on tutorials covering basic concepts and search engine building blocks such as data loading and indexing, tokenization, search, and index analysis. We further release the preprocessing, backend, and frontend code that allowed us to index 3.5 billion documents - chunked into 5.8 billion snippets - and serve 5.55 TB worth of BM25 indexes.

Data Access
The Hugging Face Hub is a repository of over 20,000 datasets from across AI domains. This includes the most popular large-scale text corpora in NLP - for example, all the datasets we consider in GAIA (see Table 1). The accompanying Datasets library provides convenient and parallelizable APIs for downloading and processing the data. Memory-mapping is supported by default and uses the efficient Apache Arrow format, 7 making it possible to seamlessly handle datasets surpassing the RAM constraints of a given machine. Datasets also provides a streaming functionality which dispenses with downloading data to disk, making it possible to work with larger-than-disk datasets.
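The streaming access pattern described above can be sketched in plain Python. With the Datasets library the stream would come from `load_dataset(..., streaming=True)`; the `fake_stream` generator below is a hypothetical in-memory stand-in for that iterator of records, so the consumption pattern is self-contained and runnable without a network connection.

```python
from itertools import islice

# With the Hugging Face Datasets library, streaming avoids a full download:
#   from datasets import load_dataset
#   stream = load_dataset("c4", "en", split="train", streaming=True)
# The returned object behaves as an iterator of dicts; we emulate that
# shape below with a small in-memory stand-in.

def fake_stream():
    """Stand-in for a streamed dataset: an iterator of {'id', 'text'} records."""
    corpus = ["first document", "second document", "third document"]
    for i, text in enumerate(corpus):
        yield {"id": i, "text": text}

def take_batch(stream, batch_size):
    """Pull a fixed-size batch off the stream without materializing the rest."""
    return list(islice(stream, batch_size))

stream = fake_stream()
batch = take_batch(stream, 2)
print([doc["text"] for doc in batch])  # ['first document', 'second document']
```

Because `islice` only advances the iterator as far as requested, the remainder of the stream is never loaded, which is what makes larger-than-disk corpora tractable.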

Tokenization
Tokenization is a crucial pre-processing step in both NLP and Information Retrieval. In the context of IR, this process typically includes removing stop words, stemming, lemmatization, and removing non-alphanumeric characters. By default, Pyserini uses Lucene analyzers - heuristics-based algorithms designed for various languages and use cases - to tokenize text. The drawback of this approach is that only some languages have dedicated analyzers, while the others have to resort to simply breaking on whitespace, which inadvertently leads to suboptimal performance. An alternative to whitespace tokenization that has shown promise in Information Retrieval and is a mainstay in NLP is subword tokenization (Mielke et al., 2021), a process which splits words into smaller units based on their frequency in the corpus. Hugging Face provides a range of tokenizers that are specifically designed to work with its pretrained transformer language models, as well as the means to train such tokenizers (MOI et al., 2022).
Recently, Pyserini gained the ability to leverage Hugging Face pre-trained subword tokenizers to improve indexing and searching for multiple languages. Pre-trained tokenizers from Hugging Face can serve as drop-in replacements for Lucene analyzers, improving retrieval effectiveness, particularly in low-resource languages (Ogundepo et al., 2022). This interoperability between Hugging Face and Pyserini makes it easy for researchers to incorporate deep learning-based language models into their information retrieval workflows and opens up new avenues for research in the field.
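To build intuition for why subword tokenization helps where no dedicated analyzer exists, here is a toy, self-contained sketch. The `greedy_subword_tokenize` function is a hypothetical WordPiece-style greedy longest-match segmenter over an assumed tiny vocabulary - it is not the Hugging Face tokenizer API (in practice one would load a pre-trained tokenizer from the `tokenizers` or `transformers` libraries), but it shows how unseen words decompose into known units instead of becoming out-of-vocabulary tokens.

```python
def whitespace_tokenize(text):
    """Baseline: the fallback behavior for languages without a dedicated analyzer."""
    return text.lower().split()

def greedy_subword_tokenize(word, vocab):
    """Toy WordPiece-style greedy longest-match segmentation: repeatedly take
    the longest prefix of the remaining word that appears in the vocabulary."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # no vocabulary match: fall back to a single character
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces

# A hypothetical subword vocabulary learned from corpus frequencies.
vocab = {"token", "iza", "tion", "search", "ing"}
print(whitespace_tokenize("Tokenization Searching"))   # ['tokenization', 'searching']
print(greedy_subword_tokenize("tokenization", vocab))  # ['token', 'iza', 'tion']
print(greedy_subword_tokenize("searching", vocab))     # ['search', 'ing']
```

The whitespace baseline treats each surface form as an atomic unit, while the subword pass produces units shared across morphologically related words - the property that benefits retrieval in low-resource settings.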

Building the Index
Indexing constitutes the core functionality of Pyserini. The library enables experiments with bag-of-words sparse retrieval using Lucene, and dense vector retrieval using Faiss (Johnson et al., 2019).

Offline Indexing. Arrow-backed Hugging Face datasets readily lend themselves to being indexed by Pyserini's standard Lucene indexer. In principle, one can build an index of a Hugging Face dataset simply by downloading it locally and then passing the file path to the Pyserini indexer via a command-line argument. The scenario where a pre-processing step is required between the data download and the indexing step - as with the document segmentation which we discuss later in Section 4 - can be realised straightforwardly for smaller datasets, which fit both on disk and in RAM. Larger-than-RAM datasets which fit on disk can easily be sharded into any of the on-disk text formats supported by Pyserini (these include CSV, TSV, JSON, and JSONL) and processed concurrently within RAM limits, to then be passed to the indexer.
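The sharding step above can be sketched as follows. The `{"id", "contents"}` record layout matches Pyserini's JsonCollection format; the indexer invocation shown in the trailing comment is the `pyserini.index.lucene` command-line module, though exact flags may vary across library versions. Shard size, file naming, and the toy documents are illustrative choices.

```python
import json
import tempfile
from pathlib import Path

def write_jsonl_shards(docs, out_dir, shard_size):
    """Write documents as JSONL shards in the {'id', 'contents'} layout
    accepted by Pyserini's JsonCollection, keeping each shard RAM-friendly."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for shard_no, start in enumerate(range(0, len(docs), shard_size)):
        path = out_dir / f"docs-{shard_no:05d}.jsonl"
        with path.open("w", encoding="utf-8") as f:
            for doc in docs[start:start + shard_size]:
                f.write(json.dumps({"id": doc["id"], "contents": doc["text"]}) + "\n")
        paths.append(path)
    return paths

docs = [{"id": f"doc{i}", "text": f"body of document {i}"} for i in range(5)]
out_dir = Path(tempfile.mkdtemp())
shards = write_jsonl_shards(docs, out_dir, shard_size=2)
print([p.name for p in shards])  # three shards: 2 + 2 + 1 documents

# The shard directory can then be handed to Pyserini's Lucene indexer, e.g.:
#   python -m pyserini.index.lucene --collection JsonCollection \
#     --input <out_dir> --index <index_dir> \
#     --generator DefaultLuceneDocumentGenerator --threads 8
```

Since each shard is written and closed independently, the pre-processing that produces `docs` can itself run shard by shard, keeping peak memory bounded by a single shard.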
Datasets Streaming. Recently, it has also become possible to index datasets which do not fit on disk. 8 This new addition to Pyserini - one that resulted from our current collaboration - allows users to stream text into the index directly; in other words, to build an index on the fly from a text stream rather than from a static file saved on disk. As a result, larger-than-disk collections can be streamed from the Hugging Face Hub directly into the local indexing process. Data streaming can also improve experimental agility for smaller datasets by removing the data download step from the Hugging Face dataset-to-Pyserini index pipeline.

Backend: Custom Pyserini Server
Once the data index is ready, we need a way to host it and serve the search functionality to clients. We propose a simple Python-based Pyserini server implementation for GAIA, which can easily be generalized to other use cases. The server code can be accessed on GitHub.
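The actual server code lives in the GitHub repository; as an illustration of its general shape only, the minimal stdlib sketch below exposes a JSON search endpoint over HTTP, with an in-memory substring matcher standing in for the Pyserini searcher a real deployment would call. Handler names, the toy corpus, and the query parameter are assumptions made for the example.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs
from urllib.request import urlopen

# In-memory stand-in for a Lucene-backed searcher; a real server would
# delegate to a Pyserini searcher over the index here instead.
DOCS = {"d1": "common crawl web text", "d2": "language model training data"}

def search(query, k=10):
    """Return ids of documents containing the query string (toy matcher)."""
    hits = [doc_id for doc_id, text in DOCS.items() if query.lower() in text]
    return hits[:k]

class SearchHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        params = parse_qs(urlparse(self.path).query)
        query = params.get("q", [""])[0]
        body = json.dumps({"query": query, "hits": search(query)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), SearchHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

port = server.server_address[1]
with urlopen(f"http://127.0.0.1:{port}/search?q=crawl") as resp:
    print(json.loads(resp.read()))  # {'query': 'crawl', 'hits': ['d1']}
server.shutdown()
```

A frontend such as a Streamlit app then only needs to issue the same HTTP request, which is what keeps the UI and the index-serving machinery decoupled.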

Frontend: Interactive Demos
Providing interactive demos which enable the exploration of AI artifacts is crucial for collaborating across research disciplines and sharing results with colleagues without imposing on them the burden of setting up their own engineering stack. By offering hosting of Gradio and Streamlit applications, Hugging Face Spaces meets this need perfectly. We encourage readers to follow the implementation of GAIA for an example of how to build a simple UI for a search tool.

Case Study: GAIA Search
Relevance-based search tools stand to have the largest impact on massive-scale datasets, which are common in modern NLP. Unlike smaller data collections, where simpler investigation strategies, e.g. via a combination of Pandas and Jupyter Notebooks, may be feasible, huge datasets are generally too cumbersome to process this way. Another big benefit of search engines in the form that we propose is that, once set up, they require no engineering skills or extensive computing resources to operate, expanding the community of potential users. We demonstrate this with GAIA Search, available online at hf.co/spaces/spacerini/gaia.

Included Datasets
GAIA provides a simple interface to four large-scale textual datasets - C4, The Pile, ROOTS, and captions from LAION-2B-en. The reader may consult Table 1 for details on the respective datasets. All of the datasets included in GAIA are sourced at least partly from Common Crawl. The users of the tool are therefore bound by the Common Crawl terms of use 9 with respect to the content contained in the datasets. Additionally, in order to respect data subjects' rights (Jernite et al., 2022), we refrain from presenting full documents in the tool and instead include snippets of at most 256 words. We redact personally identifiable information (PII) in all search results on the backend side, using the PII redaction script open-sourced alongside the BigScience 10 language model BLOOM (Le Scao et al., 2022). Below we discuss details of the respective datasets' pre-processing.
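To give a feel for what backend-side PII redaction involves, here is a minimal, purely illustrative sketch covering only email addresses and phone-like digit runs. The regexes and placeholder tokens are assumptions for the example - the production pipeline uses the far more thorough PII redaction script released with BLOOM.

```python
import re

# Minimal illustration only: the script open-sourced alongside BLOOM covers
# many more entity types and edge cases than these two toy patterns.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s()-]{7,}\d")

def redact_pii(text):
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

sample = "Contact jane.doe@example.org or call +1 555 123 4567 for details."
print(redact_pii(sample))
# Contact [EMAIL] or call [PHONE] for details.
```

Running redaction at serving time, rather than at indexing time, keeps the index faithful to the underlying corpus while ensuring no PII reaches the user-facing results.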
C4. This dataset is fully sourced from Common Crawl. We index the variant of the English split of the dataset available on the Hugging Face Hub. C4 has been used to train T5 (Raffel et al., 2020), a major encoder-decoder model with a multitude of downstream applications; parts of it have also contributed to the training of other LLMs, e.g. LaMDA (Thoppilan et al., 2022) and Chinchilla (Hoffmann et al., 2022), which makes it a compelling dataset to study.
The Pile. This corpus has been a standard dataset for many English LLM releases from various organizations (Biderman et al., 2023; Black et al., 2021; Wang and Komatsuzaki, 2021; Black et al., 2022; Smith et al., 2022; Tang, 2021; Lieber et al., 2021), so we believe it is important to expose its contents to public view. The Pile is an English-only corpus containing multiple sub-corpora from various sources (Biderman et al., 2022). We use a variant of The Pile which has been deduplicated with MinHashLSH at a threshold of 0.87, following the advice of Lee et al. (2022). Notably, this variant of The Pile has also been used to train LLMs (Biderman et al., 2023). We hope that providing the search interface will allow further investigation of the subjective differences between deduplicated and unprocessed corpora. Both the canonical variant of The Pile and its deduplicated counterpart are available on the Hugging Face Hub.
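For intuition about the deduplication procedure, the following self-contained sketch estimates Jaccard similarity between two near-duplicate documents using MinHash over word shingles - the quantity that MinHashLSH thresholds (0.87 for the variant above). The shingle size, hash function, and signature length here are illustrative choices, not those of the actual pipeline.

```python
import hashlib

def shingles(text, n=3):
    """Word n-gram shingles forming the document's set representation."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity between two sets."""
    return len(a & b) / len(a | b)

def minhash_signature(items, num_hashes=128):
    """MinHash: for each seeded hash, keep the minimum value over the set.
    The fraction of agreeing signature positions estimates Jaccard similarity."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{x}".encode(), digest_size=8).digest(),
                "big")
            for x in items))
    return sig

def estimate_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
sa, sb = shingles(doc_a), shingles(doc_b)
exact = jaccard(sa, sb)
est = estimate_jaccard(minhash_signature(sa), minhash_signature(sb))
print(f"exact={exact:.2f} estimated={est:.2f}")  # estimate tracks the exact value
```

In the full LSH scheme, signatures are further banded so that only candidate pairs likely to exceed the threshold are compared, avoiding the quadratic all-pairs comparison.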

ROOTS. Developed for the purpose of training BLOOM (Le Scao et al., 2022), this is the only multilingual dataset available in GAIA. We therefore create independent indices for each language or language group in the corpus, resulting in 13 separate indices - Arabic, Catalan, Code (comprising all programming languages included in the corpus), English, Spanish, Basque, French, Indonesian, Indic and Niger-Congo (language groups), Portuguese, Vietnamese, and Chinese. We return results from each index when issuing queries in the tool.
LAION-2B-en. LAION is a dataset of image caption-image URL pairs scraped from the Web. It has been used to train Stable Diffusion (Rombach et al., 2021), a textual-prompt-based image generation model constituting an open-source counterpart to OpenAI's DALL-E 2 (Ramesh et al., 2022). We use LAION-2B-en, the subset of the original dataset with captions in English, as the starting point for further pre-processing. We start by deduplicating captions, which yields clusters of image URLs with identical captions (the deduplication code is available on GitHub). We then index the unique captions. For textual queries to our tool, we return results consisting of the relevant captions; alongside each result, we include the list of associated image URLs.
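The caption clustering step might be sketched as follows. The lowercasing and whitespace normalization is an assumption made for illustration; the actual deduplication code is the one released on GitHub.

```python
from collections import defaultdict

def cluster_by_caption(pairs):
    """Group image URLs under their normalized caption, so each unique
    caption is indexed once and carries its list of associated URLs."""
    clusters = defaultdict(list)
    for caption, url in pairs:
        clusters[caption.strip().lower()].append(url)
    return dict(clusters)

pairs = [
    ("A cat on a sofa", "http://img.example/1.jpg"),
    ("a cat on a sofa", "http://img.example/2.jpg"),
    ("Sunset over the sea", "http://img.example/3.jpg"),
]
clusters = cluster_by_caption(pairs)
print(len(clusters))                    # 2 unique captions
print(clusters["a cat on a sofa"])     # two URLs sharing one caption
```

Indexing one document per cluster key keeps the index size proportional to the number of unique captions rather than the number of images, while the URL lists are retained as result metadata.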

Implementation and Functionality
The implementation of GAIA makes use of a variety of the interoperability features we discussed in Section 3. As detailed in Table 1, all of the considered datasets are available on the Hugging Face Hub. We download and segment them locally; the segmented datasets are then provided as input to a Pyserini indexer. We leverage Streamlit to build the user interface for our tool and host it on Hugging Face Spaces. On the backend side, the indices are served from Hugging Face-provisioned machines. We open-source the helper functions for segmenting long documents and the backend server code at github.com/huggingface/gaia.
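A plain word-window version of the segmentation step could look like this. The released helpers may differ in detail (for example by respecting sentence boundaries), so treat this only as a sketch of the 256-word snippet constraint described earlier.

```python
def segment_document(text, max_words=256):
    """Split a long document into consecutive snippets of at most
    `max_words` words each, mirroring the snippet length served by GAIA."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

# A synthetic 600-word document splits into 256 + 256 + 88 words.
doc = " ".join(f"w{i}" for i in range(600))
snippets = segment_document(doc)
print([len(s.split()) for s in snippets])  # [256, 256, 88]
```

Each snippet then becomes its own indexable document, which is how 3.5 billion source documents expand into the 5.8 billion indexed snippets reported in Section 3.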

Limitations and Future Plans
A major area for consideration when developing data access tools is that of data governance, privacy, and data ownership (Jernite et al., 2022; Carlini et al., 2020). In our current work we focus on the technical aspects of giving access to large data collections; however, we urge users to consider data governance principles when designing their own tools. In terms of infrastructure, the cost and complexity of hosting the retrieval index falls on the creator of the tool, which is easy to manage for small datasets but becomes more problematic in the realm of TB-scale corpora. We are currently investigating a parallel workstream that could address this limitation at least partly.

Conclusions
We showcase the interoperability between Hugging Face and Pyserini and provide value to the NLP community by demonstrating easy ways to perform high-quality, large-scale retrieval with open-source tools. We also introduce GAIA - a search engine for the retrieval-based exploration of four major textual datasets. We encourage NLP and IR practitioners to follow our examples and build their own tools to explore both large and smaller-scale textual datasets.

Acknowledgements
The authors would like to thank Carlos Muñoz Ferrandis, Daniel van Strien, Katie Link, and Quentin Lhoest for valuable tips and suggestions. This research was also supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada.

Impact Statement
As mentioned in Section 5, accessing large-scale, web-scraped textual corpora comes with a variety of ethical considerations pertaining to the protection of the rights of data owners and of people whose privacy or copyright might be infringed upon. We build guardrails into the GAIA Search design, namely PII redaction and the segmentation of documents into short snippets, which prevents the reconstruction of full documents or full corpora. We strongly encourage researchers aiming to build similar tools to do the same. Overall, many of these problems arise because we are proposing the tool only after the datasets have been created and models trained on them. The workflow we envision for future research projects would involve building data exploration tools prior to the release of the datasets, so that core problems can be observed, studied, and addressed before datasets reach an external audience.