PrimeQA: The Prime Repository for State-of-the-Art Multilingual Question Answering Research and Development

The field of Question Answering (QA) has made remarkable progress in recent years, thanks to the advent of large pre-trained language models, newer realistic benchmark datasets with leaderboards, and novel algorithms for key components such as retrievers and readers. In this paper, we introduce PrimeQA: a one-stop and open-source QA repository with an aim to democratize QA research and facilitate easy replication of state-of-the-art (SOTA) QA methods. PrimeQA supports core QA functionalities like retrieval and reading comprehension as well as auxiliary capabilities such as question generation. It has been designed as an end-to-end toolkit for various use cases: building front-end applications, replicating SOTA methods on public benchmarks, and expanding pre-existing methods. PrimeQA is available at: https://github.com/primeqa.


Introduction
Question Answering (QA) is a major area of interest in Natural Language Processing (NLP), consisting primarily of two subtasks: information retrieval (IR) (Schütze et al., 2008) and machine reading comprehension (MRC) (Rajpurkar et al., 2016, 2018; Kwiatkowski et al., 2019a; Chakravarti et al., 2020). IR and MRC systems, also referred to as retrievers and readers, respectively, are commonly assembled in an end-to-end open-retrieval QA pipeline (henceforth, OpenQA) (Chen et al., 2017; Lee et al., 2019; Karpukhin et al., 2020; Santhanam et al., 2022b) that accepts a query and a large document collection as its input and provides an answer as output. The specific role of the retriever is to identify documents or passages (i.e., contexts) that contain information relevant to the query, while the reader component extracts a precise answer from such contexts.

While QA as a field has advanced rapidly, software to perform and replicate QA experiments has mostly been written in silos. At the time of this writing, no central repository exists that facilitates the training, analysis and augmentation of state-of-the-art (SOTA) models for different QA tasks at scale. In view of the above, and with an aim to democratize QA research by providing easy replicability, we propose PRIMEQA: an open-source repository (https://github.com/primeqa) designed as an end-to-end toolkit, with all the necessary tools to easily and quickly create a custom QA application. The main repository contains easy-to-use scripts for retrieval, machine reading comprehension, and question generation, with the ability to perform training, inference, and performance evaluation. Additionally, several sibling repositories provide features for connecting various retrievers and readers and creating a front-end user interface (UI) for end users.
PRIMEQA has been designed as a platform for QA development and research and encourages collaboration from everyone in the field from beginners to experts. PRIMEQA already has a growing developer base with contributions from major academic institutions.
Our paper makes several major contributions:
• We present PRIMEQA, a first-of-its-kind repository for comprehensive QA research. It is free to use, well documented, easy to contribute to, and license friendly (Apache 2.0) for both academic and commercial usage.
• PRIMEQA provides mechanisms, via accompanying repositories, to create custom OpenQA applications containing both retrievers and readers for industrial deployment, including a front-end UI.
• We provide easy-to-use implementations of SOTA retrievers and readers that sit at the top of major QA leaderboards, with capabilities for training, inference and performance evaluation of these models.
• PRIMEQA models are built on top of Transformers (Wolf et al., 2020).


Related Work

Among the most widely used repositories for NLP is Transformers (Wolf et al., 2020). However, despite its broad adoption by the community, it lacks a distinct focus on QA: unlike PRIMEQA, it supports only one general script for extractive QA and several stand-alone Python scripts for retrievers. Similarly, FairSeq and AllenNLP (Gardner et al., 2018) focus on a wide array of generic NLP tasks and hence do not offer a dedicated QA repository that lets anyone plug and play components for their custom search application. Several toolkits exist for building customer-specific search applications (NVIDIA, 2022; Deepset, 2021) or search-based virtual assistants (IBM, 2020). However, while they provide a good foundation for software deployment, unlike PRIMEQA they lack a focus on replicating (and extending) the latest SOTA for QA on academic benchmarks, which is essential for making rapid progress in this field.

PRIMEQA
PRIMEQA is a comprehensive open-source resource for cutting-edge QA research and development, governed by the following design principles:
• Customizable: We allow users to customize and extend SOTA models for their own applications. This often entails fine-tuning on users' custom data, which they can provide through one of several supported data formats, or process on their own by writing a custom data processor.
• Reusable: We aim to make it straightforward for developers to quickly deploy pre-trained, off-the-shelf PRIMEQA models for their QA applications, requiring minimal code changes.
• Accessible: We provide easy integration with Hugging Face Datasets and the Model Hub (https://huggingface.co/PrimeQA), allowing users to quickly plug in a range of datasets and models, as shown in Table 1.
PRIMEQA in its entirety is a collection of four different repositories: a primary research and replicability repository and three accompanying repositories for industrial deployment. Figure 1 shows a diagram of the PRIMEQA repository. It provides several entry points, supporting the needs of different user groups. The repository is centered around three core components: a retriever, a reader, and a question generator for data augmentation. These components can be used as individual modules or assembled into an end-to-end QA pipeline. All components are implemented on top of existing AI libraries.

The Core Components
Each of the three core PRIMEQA components supports different flavors of the task it has been built for, as we detail in this section.

Retriever: run_ir.py
Retrievers predict documents (or passages) from a document collection that are relevant to an input question. PRIMEQA provides both sparse and SOTA dense retrievers, along with their extensions, as shown in Table 1. A single Python script, run_ir.py, can be passed arguments to switch between the different retriever algorithms.

Sparse: BM25 (Robertson and Zaragoza, 2009) is one of the most popular sparse retrieval methods, thanks to its simplicity, efficiency and robustness. Our Python-based implementation of BM25 is powered by the open-source library PySerini.

Dense: Modern neural retrievers use dense question and passage representations to achieve SOTA performance on various benchmarks, while needing GPUs for efficiency. We currently support ColBERT (Santhanam et al., 2022b) and DPR (Karpukhin et al., 2020).
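As a rough illustration of the scoring behind the sparse option (the actual computation is delegated to PySerini), the classic BM25 ranking function can be sketched in a few lines of self-contained Python; the parameter defaults `k1` and `b` below are conventional textbook values, not PRIMEQA's settings:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`
    using the BM25 ranking function (Robertson and Zaragoza, 2009)."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency of each query term across the collection.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            # Term-frequency saturation and document-length normalization.
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores
```

Ranking a collection then amounts to sorting documents by these scores; production implementations such as PySerini additionally build inverted indexes so that only documents containing query terms are scored.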

Reader: run_mrc.py
Given a question and a retrieved passage, also called the context, a reader predicts an answer that is either extracted directly from the context or generated based on it. PRIMEQA supports training and inference of both extractive and generative readers through a single Python script: run_mrc.py. It works out of the box with different QA models extended from the Transformers library (Wolf et al., 2020).

Extractive: PRIMEQA's general extractive reader is a pointer network that predicts the start and end of the answer span from the input context (Mathew et al., 2021). Examples of several extractive readers, along with their extensions, are provided in Table 1.
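The decoding step of such a pointer-style reader can be sketched as follows; this is a simplified illustration of selecting the highest-scoring (start, end) pair from the model's logits, not PRIMEQA's actual inference code:

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) token indices maximizing
    start_logits[s] + end_logits[e], subject to s <= e < s + max_answer_len."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        # Only consider end positions at or after the start, within the cap.
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best
```

A real reader additionally maps token indices back to character offsets in the context and handles the no-answer case (e.g., for SQuAD 2.0-style unanswerable questions).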
Generative: PRIMEQA provides generative readers based on the popular Fusion-in-Decoder (FiD) (Izacard and Grave, 2020) algorithm. Currently, it supports easy initialization with large pre-trained sequence-to-sequence (henceforth, seq2seq) models (Raffel et al., 2022). With FiD, the question and the retrieved passages are used to generate relatively long and complex multi-sentence answers, providing support for long-form question answering tasks, e.g., ELI5 (Petroni et al., 2021; Fan et al., 2019).
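To make the FiD input scheme concrete, the sketch below builds one encoder input per retrieved passage; the decoder then attends over the concatenation of all encoded passages. The marker strings follow the convention in Izacard and Grave (2020); the helper itself is illustrative and not part of PRIMEQA:

```python
def fid_encoder_inputs(question, passages):
    """Format (question, passage) pairs for a Fusion-in-Decoder model.

    Each retrieved passage is encoded independently together with the
    question; fusion across passages happens in the decoder's attention.
    """
    return [
        f"question: {question} title: {p['title']} context: {p['text']}"
        for p in passages
    ]
```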

Entry Points
We cater to different user groups in the QA community by providing different entry points to PRIMEQA, as shown in Figure 1.
• Top-level Scripts: Researchers can use the top-level scripts, run_ir/mrc/qg.py, to reproduce published results, as well as train, fine-tune and evaluate the associated models on their own custom data.
• Jupyter Notebooks: These demonstrate how to use built-in classes to run the different PRIMEQA components and perform the corresponding tasks. These are useful for developers and researchers who want to reuse and extend PRIMEQA functionalities.
• Inference APIs: The Inference APIs are primarily meant for developers, allowing them to use PRIMEQA components on their own data with only a few lines of code. These APIs can be initialized with the pre-trained PRIMEQA models provided in the HuggingFace hub, or with a custom model that has been trained for a specific use case.
• Service Layer: The service layer helps developers set up an end-to-end QA system quickly by providing a wrapper around the core components that exposes an endpoint and an API.
• UI: The UI is for end users, including non-technical users who want to use PRIMEQA services interactively to ask questions and get answers.

Pipelines for OpenQA
PRIMEQA core components and entry points make it intuitive for users to build an OpenQA pipeline and configure it to use any of the PRIMEQA retrievers and readers. This is facilitated through a lightweight wrapper built around each core component, which implements a training and an inference API. The retrieval component of the pipeline predicts relevant passages/contexts for an input question, and the reader predicts an answer from the retrieved contexts. PRIMEQA pipelines are easy to construct using the pre-trained models in the model hub and our inference APIs.
An example of such a pipeline connects a ColBERT retriever to an FiD reader to construct a long-form QA (LFQA) system. This pipeline uses the retriever to obtain supporting passages that are subsequently used by the reader to generate complex multi-sentence answers. A different pipeline can instead be instantiated with an extractive reader available through our model hub.
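The retriever-reader composition described above can be sketched with a minimal wrapper; the class and method names here are illustrative stand-ins, not PRIMEQA's actual pipeline API:

```python
class QAPipeline:
    """Chain a retriever and a reader into a simple OpenQA pipeline."""

    def __init__(self, retriever, reader, top_k=3):
        self.retriever = retriever
        self.reader = reader
        self.top_k = top_k  # number of retrieved contexts passed to the reader

    def answer(self, question):
        # Retrieve candidate contexts, then let the reader produce the answer.
        contexts = self.retriever.retrieve(question)[: self.top_k]
        return self.reader.read(question, contexts)
```

Swapping an FiD reader for an extractive one then only changes the object passed as `reader`; the pipeline itself is unchanged.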

Services and Deployment
Industrial deployment often necessitates running complex models and processes at scale. We use Docker to package these components into micro-services that interact with each other and can be ported to servers with different hardware capabilities (e.g., GPUs, CPUs, memory). The use of Docker makes the addition, replacement or removal of services easy and scalable. All components in the PRIMEQA repository are available via REST and/or gRPC micro-services. Our Docker containers are available on the public DockerHub and can be deployed using technologies such as OpenShift and Kubernetes.
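As a hypothetical sketch of such a deployment (service names, image names and ports below are invented for illustration and are not PRIMEQA's published containers), a compose file might wire the micro-services together along these lines:

```yaml
# Illustrative only: image names and ports are hypothetical.
services:
  retriever:
    image: example/primeqa-retriever:latest
    ports:
      - "50051:50051"   # gRPC endpoint
  reader:
    image: example/primeqa-reader:latest
    ports:
      - "50052:50052"   # gRPC endpoint
  orchestrator:
    image: example/primeqa-orchestrator:latest
    environment:
      RETRIEVER_URL: retriever:50051
      READER_URL: reader:50052
    ports:
      - "8080:8080"     # REST endpoint consumed by the UI
```

Because each component is its own service, a retriever can be scheduled on a GPU node while the orchestrator and UI run on inexpensive CPU nodes.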
In addition to the main PRIMEQA repository, we provide three sibling repositories for application deployment. primeqa-ui is the front-end UI; users can personalize it by adding custom organization logos or changing the display fonts. primeqa-orchestrator is a REST server and the central hub for integrating PRIMEQA services and external components and for executing a pipeline. For instance, the orchestrator can be configured to search a document collection with either a retriever from PRIMEQA, such as ColBERT, or an external search engine such as Watson Discovery (https://www.ibm.com/cloud/watson-discovery). create-primeqa-app provides the scripts to launch the demo application by starting the orchestrator and UI services.

Figure 2 illustrates how to deploy a QA application at scale using the core PRIMEQA services (e.g., reader and retriever) and our three sibling repositories. We provide this end-to-end deployment for our demo; however, users can also use PRIMEQA as an application with their own orchestrator or UI. Figure 3 shows an OpenQA demo application built with PRIMEQA components. The demo application provides a mechanism to collect user feedback: the thumbs-up/down icons next to each result enable a user to record feedback, which is then stored in a database. The feedback data can later be retrieved and used as additional training data to further improve the retriever and reader models.

Community Contributions
While relatively new, PRIMEQA has already garnered positive attention from the QA community and is receiving steady contributions from both academia and industry via GitHub pull requests. We describe some instances here and encourage further contributions from the community. We provide support for those interested in contributing through a dedicated Slack channel, GitHub issues and PR reviews.

Neural Retrievers: ColBERT, one of our core neural retrievers, was contributed by Stanford NLP. Since PRIMEQA provides very easy entry points into its core library, they were able to integrate their software into the retriever script run_ir.py independently. Their contribution provides the QA community with the ability to obtain SOTA performance on OpenQA benchmark datasets by performing 'late interaction' search on a variety of datasets. They also contributed ColBERTv2 (Santhanam et al., 2022b) and its PLAID (Santhanam et al., 2022a) variant: the former reduces the index size by 10x over its predecessor, while the latter makes search faster by almost 7x on GPUs and 45x on CPUs.

Few-shot learning: The SunLab from Ohio State University added the ability to easily perform few-shot learning in PRIMEQA. Their first contribution, ReasonBERT (Deng et al., 2021), provides a pre-training methodology that augments language models with the ability to reason over long-range relations. Under the few-shot setting, ReasonBERT in PRIMEQA substantially outperforms a RoBERTa baseline on the extractive QA task. PRIMEQA gives any researcher or developer the capability to easily integrate this component into their custom search application, e.g., a DPR retriever paired with a ReasonBERT reader.

Table QA: PRIMEQA also includes contributed readers for QA over tables, such as Tapex, supporting datasets like WikiSQL (Zhong et al., 2017a) and WikiTableQuestions (Pasupat and Liang, 2015a). This contribution reused the seq2seq trainer from Transformers (Wolf et al., 2020) for a seamless integration into PRIMEQA.
Another contribution comes from LTI CMU's NeuLab, which integrated OmniTab (Jiang et al., 2022b). OmniTab proposes an efficient pre-training strategy combining natural and synthetic pre-training data. This integration happened organically, as OmniTab builds on top of Tapex in PRIMEQA. Currently, their model produces the best few-shot performance on WikiTableQuestions, making it suitable for domain-adaptation experiments for anyone using PRIMEQA.

Custom search app for Earth Science: Joint work between NASA and the University of Alabama in Huntsville involved creating a custom search application over scientific abstracts and papers related to Earth Science. First, using the top-level scripts in PRIMEQA, they trained an OpenQA system on over 100k abstracts by training a ColBERT retriever and an extractive reader. Then they were able to quickly deploy the search application using create-primeqa-app and make it publicly available.

Conclusion
PRIMEQA is an open-source library designed by QA researchers and developers to facilitate the reproducibility and reusability of existing and future work. This is an important contribution to the community, as it makes these models easily accessible to researchers and end users in the rapidly progressing field of QA. Our library also provides a service layer that allows developers to take pre-trained PRIMEQA models and deploy them for their custom search applications. PRIMEQA is built on top of the largest open-source NLP libraries and tools and provides simple Python scripts as entry points to easily reuse its core components across different applications. This ease of access and reuse has already garnered significant positive traction and enables PRIMEQA to grow organically as an important resource for rapid state-of-the-art progress within the QA community.