UKP-SQuARE v3: A Platform for Multi-Agent QA Research

The continuous development of Question Answering (QA) datasets has drawn the research community’s attention toward multi-domain models. A popular approach is to use multi-dataset models, which are models trained on multiple datasets to learn their regularities and prevent overfitting to a single dataset. However, with the proliferation of QA models in online repositories such as GitHub or Hugging Face, an alternative is becoming viable. Recent works have demonstrated that combining expert agents can yield large performance gains over multi-dataset models. To ease research in multi-agent models, we extend UKP-SQuARE, an online platform for QA research, to support three families of multi-agent systems: i) agent selection, ii) early-fusion of agents, and iii) late-fusion of agents. We conduct experiments to evaluate their inference speed and discuss the performance vs. speed trade-off compared to multi-dataset models. UKP-SQuARE is open-source and publicly available.


Introduction
The current high-speed development of Artificial Intelligence yields thousands of datasets and trained models in repositories such as GitHub and Hugging Face (Rogers et al., 2023). These models create new research and application opportunities, such as high-performing Question Answering (QA) skills in chatbots (Burtsev et al., 2018; Miller et al., 2017). Comparing and analyzing these models usually requires learning libraries, writing code to run the models, and unifying their formats, which makes this process time-consuming and not scalable.
UKP-SQuARE (Baumgärtner et al., 2022b; Sachdeva et al., 2022) addresses this challenge, providing the first online platform that offers an ecosystem for QA research enabling reproducibility, analysis, and comparison of QA models through a standardized interface and from multiple angles (i.e., general behavior, explainability, adversarial attacks, and behavioral tests).
The large variety of tasks and domains in QA datasets is pushing the research community towards creating models that generalize across domains (Fisch et al., 2019; Talmor and Berant, 2019; Khashabi et al., 2020). Currently, there are two main approaches to achieve this: i) multi-dataset models and ii) multi-agent models. While the former trains a model on multiple datasets (Talmor and Berant, 2019; Khashabi et al., 2020), the latter combines multiple expert agents (Geigle et al., 2021; Friedman et al., 2021; Puerto et al., 2021). Concurrently, large language models (LLMs) such as GPT-3 (Brown et al., 2020) are emerging as powerful new systems for multi-task and multi-domain NLP applications. These LLMs are complementary to the focus of our work, multi-agent systems. While LLMs show impressive performance, they are extremely expensive to run and can usually only be accessed through APIs or deployed with substantial hardware resources. Multi-agent systems, on the other hand, offer a way to create multi-domain models by reusing available pretrained models that run on more modest hardware, an important requirement in settings where data cannot be sent to third parties.
Multi-agent models are particularly promising due to the thousands of models readily available on online model hubs and their current exponential growth. This growth in the number of models is increasing the community's interest in multi-agent model research (Wang et al., 2020; Matena and Raffel, 2021; Geigle et al., 2021; Friedman et al., 2021; Puerto et al., 2021; Wortsman et al., 2022; Jin et al., 2022). However, most existing works focus on individual models, disregarding the possibility of combining them to make systems modular and multi-domain. This is a severe limitation, as Puerto et al. (2021) showed that combining several QA models can yield performance gains of over 10 percentage points with respect to multi-dataset models (i.e., a single model trained on multiple datasets). Therefore, we extend UKP-SQuARE to democratize access to multi-agent models and research on them. In particular, we add support for the three main methods to combine agents (Skills in UKP-SQuARE): i) Skill selection, ii) early-fusion of Skills, and iii) late-fusion of Skills. The first consists of identifying the Skill with the highest likelihood of giving the correct answer and then routing the input to that Skill. We deploy TWEAC (Geigle et al., 2021) as an example of this method. The second combines multiple models' weights to obtain a new model with the distributional knowledge of the source weights. We deploy MADE (Friedman et al., 2021) as an example of this method. Lastly, the late-fusion of models consists of running multiple models to get their predictions and then combining them. This creates a system that can combine heterogeneous expert agents without reducing their performance in each domain. We provide MetaQA (Puerto et al., 2021) as an example of this method.
UKP-SQuARE facilitates research on multi-agent QA systems by offering a platform equipped with dozens of agents and three methods to combine them. This upgrade is particularly significant as the number of QA models created annually is increasing exponentially. UKP-SQuARE enables users to run, compare, and evaluate the strengths and weaknesses of multi-agent models, and to compare them with multi-dataset models.

Related Work
The best-known types of multi-agent systems are Mixture of Experts (MoE) and ensemble methods. MoE consists of a gating mechanism that routes the input to a set of agents (Jacobs et al., 1991), while ensemble methods aggregate the outputs of multiple experts through a voting mechanism (Breiman, 1996; Freund and Schapire, 1996). Much work has been done to simplify the training of these multi-agent systems (Pedregosa et al., 2011; Chen and Guestrin, 2016; He et al., 2021; Hwang et al., 2022). However, as far as we know, there are no online platforms to run and compare them.
The works most similar to ours are online model hubs such as Hugging Face's Model Hub and AdapterHub (Pfeiffer et al., 2020a). Both offer a large number of models to download. In addition, Hugging Face's Model Hub allows running models through Spaces. However, this requires implementing the Space, which can be non-trivial for complex scenarios such as ours (i.e., deploying and comparing multi-agent systems). UKP-SQuARE removes these technical barriers and allows researchers to deploy multi-agent systems through a user-friendly interface.
Transformer (Vaswani et al., 2017) models using adapters (Houlsby et al., 2019) can also be seen as a type of multi-agent system. For this type of architecture, AdapterHub (Pfeiffer et al., 2020a) is a well-established library. In addition to simplifying the training of adapter-based models, it allows composing adapters (i.e., agents) with methods such as AdapterFusion (Pfeiffer et al., 2021) or stacking (Pfeiffer et al., 2020b). However, this library is not an online platform for analyzing models such as UKP-SQuARE; its focus is to offer tools to create models based on adapters.

UKP-SQuARE
UKP-SQuARE (Baumgärtner et al., 2022b; Sachdeva et al., 2022) is the first online platform that offers an ecosystem for QA research. Its goal is to provide a common place to share, run, compare, and analyze QA models from multiple angles, such as explainability, adversarial attacks, behavioral tests, and I/O behaviors. The platform follows a flexible and scalable microservice architecture containing five main services:

• Datastores: Provide access to collections of unstructured text, such as Wikipedia, and Knowledge Graphs, such as ConceptNet (Speer and Havasi, 2012).
• Models: Enable the dynamic deployment and inference of any Transformer model that implements a Hugging Face pipeline (Wolf et al., 2020), including models that use the adapter-transformers (Pfeiffer et al., 2020a) or sentence-transformers (Reimers and Gurevych, 2019) frameworks.
• Skills: The central entity of UKP-SQuARE. Skills specify a configurable QA pipeline (e.g., extractive, multiple-choice, and open-domain QA) leveraging Datastores and Models. Users interact with Skills since the platform's goal is to remove technical barriers and focus on QA research (i.e., the QA pipeline).
• Explainability: Provides saliency maps, behavioral tests, and, for graph-based models, graph visualizations that explain the outputs of a Skill.
• Adversarial Attacks: Creates modified versions of the input to expose vulnerabilities of the Skills.
All these services allow UKP-SQuARE to offer an ecosystem of tools to analyze Skills through a user-friendly interface without writing any code or complex configurations. UKP-SQuARE helps researchers identify the models' strengths and weaknesses to push the boundaries of QA research.
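To make the interplay of these services concrete, the following is a minimal sketch of how a Skill composes a Datastore and a Model into a QA pipeline. The function names (`make_skill`, `retrieve`, `read`) and the toy corpus are illustrative stand-ins, not the platform's actual API.

```python
def make_skill(retrieve, read):
    """Compose a Skill from a Datastore lookup and a Model call."""
    def skill(question):
        passages = retrieve(question)      # Datastore: fetch evidence
        return read(question, passages)    # Model: produce the answer
    return skill

# Toy stand-ins for a Datastore and an extractive reader Model.
corpus = {"capital": "Paris is the capital of France."}
retrieve = lambda q: [text for key, text in corpus.items() if key in q.lower()]
read = lambda q, passages: passages[0].split()[0] if passages else "unknown"

open_qa = make_skill(retrieve, read)
print(open_qa("What is the capital of France?"))  # Paris
```

Keeping the Datastore and Model behind simple interfaces is what lets the platform swap components (e.g., a different retriever or reader) without changing the Skill itself.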

Target Users and Scenarios
This new update of UKP-SQuARE targets researchers working on multi-agent and multi-dataset systems. These users can use the platform as a showcase for their systems. The dozens of Skills already available in UKP-SQuARE simplify the deployment of multi-agent systems since users can employ our user-friendly interface to select the Skills they want to combine using the three families of methods we deploy. Furthermore, researchers can deploy their new multi-skill methods through a pull request in our repository. The platform can also be used to analyze and compare multiple multi-agent systems from efficiency (i.e., inference time) and effectiveness (i.e., performance) points of view, as well as to compare multi-agent with multi-dataset systems. Lastly, UKP-SQuARE can also be used for teaching QA. The ecosystem of QA tools can help students understand explainability, adversarial attacks, multi-dataset, and multi-agent models through interactive explanations with examples. Our platform can also be used to design homework assignments where students train QA models and analyze them with the aforementioned QA tools.

Multi-Agent Systems
Multi-agent systems are a type of multi-domain system that aggregates multiple expert agents from different domains into a unified system; that is, their focus is on the agents (Skills in UKP-SQuARE). In contrast, multi-dataset systems aim to learn a unified model from multiple data distributions to create a single, general agent. For example, UnifiedQA (Khashabi et al., 2020) is a single QA model trained on multiple datasets using a generative model to overcome format boundaries.
However, Raffel et al. (2020) show that a model trained on multiple datasets may underperform the same architecture trained on a single dataset, i.e., multi-dataset models may underfit certain distributions. Based on this observation, Puerto et al. (2021) show that multi-agent models can avoid this limitation while being data-efficient to train, and can even outperform multi-dataset models by large margins in both in-domain and out-of-domain scenarios. This is possible because, instead of using a very general architecture to solve multiple tasks, they use a set of expert agents with specific architectures designed to solve those tasks (i.e., SOTA agents) and establish a collaboration between these agents. However, this performance comes at a cost: the inference time is higher because more than one model needs to run (at least one expert agent and one answer aggregator).
Therefore, we extend UKP-SQuARE to add support to the three main approaches for multi-agent systems, which we refer to as Meta-Skills on the platform: i) Skill Selection ( §4.1), ii) Early-Fusion of Skills ( §4.2), and iii) Late-Fusion of Skills ( §4.3). An overview of the different architectures is illustrated in Figure 1.

Skill Selection
Skill selection is the simplest method of the three. It aims to identify the Skill with the highest likelihood of returning the correct answer to the input question and then route the input to that Skill. More formally, it defines a function f : Q → S that maps any question in Q to an available Skill in S. Geigle et al. (2021) follow this approach and propose a Transformer model with a classification head for each Skill that maps questions to Skills. However, instead of predicting Skills directly, they predict datasets, i.e., they identify the dataset from which the input question comes and then select a Skill trained on that dataset. Using this method, they report a Skill prediction accuracy higher than 90% across ten different QA types.
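The routing function f described above can be sketched as follows. This is a minimal illustration, assuming each Skill is a callable and the router returns a score per Skill; the hand-crafted `router` is a stand-in for TWEAC's Transformer classifier.

```python
from typing import Callable, Dict

def select_skill(
    question: str,
    skills: Dict[str, Callable[[str], str]],
    router: Callable[[str], Dict[str, float]],
) -> str:
    """Route the question to the Skill the router scores highest."""
    scores = router(question)           # Skill name -> likelihood score
    best = max(scores, key=scores.get)  # argmax over available Skills
    return skills[best](question)       # run only the selected Skill

# Toy usage with dummy Skills and hand-crafted router scores:
skills = {
    "squad": lambda q: "extractive answer",
    "boolq": lambda q: "yes",
}
router = lambda q: {"squad": 0.9, "boolq": 0.1}
print(select_skill("Who wrote Hamlet?", skills, router))  # extractive answer
```

Note that only one Skill runs per question, which is why this family of methods stays cheap at inference time.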
We train TWEAC on 16 datasets (shown in Appendix 5), obtaining an accuracy of 79%, and deploy it in UKP-SQuARE. The cause of the accuracy difference is the selection of the datasets. While the authors experiment on widely different QA tasks such as SQuAD, CommunityQA, and Weather Report, we use the most popular QA datasets, including the 2019 MRQA Shared Task (Fisch et al., 2019), which are more similar to each other; the task thus becomes more challenging since the question types are harder to distinguish. We deploy two TWEAC Skills on UKP-SQuARE: one for extractive QA and another for multiple-choice QA. Figure 2 shows an extractive QA TWEAC that identifies the question as SQuAD-like and routes it to two Skills trained on SQuAD.

Early-Fusion of Skills
This method combines the weights of multiple models to create a new model that generalizes across all the input models. Friedman et al. (2021) propose to average adapters fine-tuned on multiple datasets to create a multi-agent system that achieves better zero-shot and few-shot generalization than multi-dataset models.
We extend UKP-SQuARE to allow the creation of Skills that average the weights of a series of adapters. To do this, on the Skill creation page (Figure 3), users are prompted to select whether they wish to combine adapters and, if so, which ones to average.
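The weight averaging above can be sketched as an element-wise mean over the adapters' parameters. This is an illustration, not MADE's actual code: it assumes the adapters share an architecture (identical parameter names and shapes), and parameters are shown as plain lists of floats rather than tensors for brevity.

```python
def average_state_dicts(state_dicts):
    """Element-wise mean of each parameter across all adapters."""
    n = len(state_dicts)
    return {
        key: [sum(vals) / n for vals in zip(*(sd[key] for sd in state_dicts))]
        for key in state_dicts[0]
    }

# Toy usage with two tiny "adapters" holding a single parameter "w":
adapter_a = {"w": [1.0, 3.0]}
adapter_b = {"w": [3.0, 5.0]}
fused = average_state_dicts([adapter_a, adapter_b])
print(fused["w"])  # [2.0, 4.0]
```

The fused parameters define a single new model, so inference cost is the same as running one agent, which is the appeal of early-fusion.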

Late-Fusion of Skills
Lastly, Puerto et al. (2021) propose MetaQA, a system that combines 18 heterogeneous expert agents across multiple formats. This system yields significant gains over multi-dataset models because some tasks require particular architectures to solve them, such as DROP (Dua et al., 2019), which requires numerical reasoning. Thus, while a one-size-fits-all architecture cannot learn such a wide variety of distributions, a multi-agent system that combines predictions can use expert agents to solve these datasets and yield a higher-performing model overall. Figure 4 shows MetaQA answering a question from the DuoRC dataset: it selects an out-of-domain (OOD) agent instead of the in-domain one, which returns a wrong answer. Thanks to the interface provided by UKP-SQuARE, it is easy to analyze the collaboration between the Skills established by MetaQA.
One limitation of this type of system is its need to run multiple models, which makes it more expensive than the previous two approaches. To alleviate this limitation, we run the expert agents in parallel. In this way, the inference time of MetaQA remains close to the other multi-agent systems, as shown in Table 1.
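The parallel late-fusion scheme above can be sketched as follows. This is a minimal illustration, assuming each expert is a callable returning an (answer, confidence) pair; the confidence-based aggregator is a stand-in for MetaQA's trained answer selector, and a thread pool stands in for the platform's asynchronous parallel engine.

```python
from concurrent.futures import ThreadPoolExecutor

def late_fusion(question, experts, aggregate):
    """Run all expert agents in parallel, then aggregate their predictions.

    experts: list of callables returning (answer, confidence) tuples.
    """
    with ThreadPoolExecutor() as pool:
        predictions = list(pool.map(lambda expert: expert(question), experts))
    return aggregate(predictions)

# Toy usage: two dummy experts, aggregator picks the most confident one.
experts = [
    lambda q: ("Paris", 0.7),
    lambda q: ("Lyon", 0.4),
]
most_confident = lambda preds: max(preds, key=lambda p: p[1])[0]
answer = late_fusion("What is the capital of France?", experts, most_confident)
print(answer)  # Paris
```

Because the experts run concurrently, the wall-clock latency is roughly that of the slowest expert plus the aggregator, rather than the sum of all experts.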

Comparison of Multi-Skill Models
In this section, we compare the inference time of the deployed multi-skill systems (i.e., MetaQA, TWEAC, and MADE) and UnifiedQA as a representative of the multi-dataset models. We extract 20 random questions from each of the six datasets of the MRQA 2019 Shared Task (Fisch et al., 2019), yielding a total of 120 questions, and measure the time needed by each Skill to solve them. We repeat this process with five different random seeds and show the means and standard deviations in Table 1. Each model has 8 CPUs assigned to it and runs behind an asynchronous API.
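The timing protocol above can be sketched as follows. This is an illustrative harness, not the platform's benchmarking code: the no-op Skill and question strings are dummies, and real measurements would call the deployed Skills over the API.

```python
import random
import statistics
import time

def benchmark(skill, questions, seeds=(0, 1, 2, 3, 4), k=20):
    """Time `skill` on k questions sampled per seed; return mean/stdev."""
    runtimes = []
    for seed in seeds:
        rng = random.Random(seed)          # reproducible sample per seed
        sample = rng.sample(questions, k)
        start = time.perf_counter()
        for question in sample:
            skill(question)
        runtimes.append(time.perf_counter() - start)
    return statistics.mean(runtimes), statistics.stdev(runtimes)

# Toy usage with a no-op Skill and dummy questions:
mean, std = benchmark(lambda q: q.upper(), [f"question {i}" for i in range(120)])
print(f"{mean:.6f}s ± {std:.6f}s")
```

Repeating over several seeds, as in Table 1, separates genuine speed differences between Skills from sampling noise.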
As shown in Table 1, MetaQA is the slowest model. This is expected since it needs to run all the expert agents to get their predictions. However, its inference time is remarkably close to both MADE and TWEAC. TWEAC is surprisingly as fast as MADE, considering that TWEAC has to run at least two models (router and expert agent), while MADE only runs one. We conjecture that MADE is not faster because the adapter layers increase the depth of the transformer stack. UnifiedQA is the fastest model, as expected, since it is a multi-dataset model and hence does not need to combine multiple agents.
Beyond inference, training time and cost are also interesting factors to take into account. TWEAC and MetaQA are cheap to train assuming the existence of pretrained agents on online model hubs such as the Hugging Face Model Hub: the only components they train are a small router or answer aggregator. On the other hand, MADE and UnifiedQA require training a neural network from scratch on the task of question answering, which is much more challenging than simply routing questions or aggregating answers. Therefore, MADE and UnifiedQA need more training data than TWEAC and MetaQA, making them more expensive.

Table 1 shows the trade-off between performance, training, and inference efficiency. Although MetaQA is the slowest Skill to run, its inference time is very close to the other models' thanks to the parallel inference of the expert agents offered by UKP-SQuARE (cf. Figure 1). Furthermore, it is cheap to train, has almost the highest performance, and is compatible with any QA format. This makes it interesting for scenarios where model updating, performance, and flexibility are vital. TWEAC is also cheap to train and as flexible as MetaQA, although it is significantly worse than MetaQA on extractive QA datasets. This makes TWEAC ideal in the same scenarios as MetaQA but where running the expert agents in parallel is difficult (i.e., when MetaQA cannot be used). MADE has the highest performance and is as fast as TWEAC. However, it is more expensive to train than MetaQA and TWEAC, and it is less flexible since it cannot be used for multiple formats simultaneously. Therefore, it should be used when inference speed, performance, and simple deployment are vital, while the model is not expected to need re-training (i.e., updates) often and is not required to be compatible with multiple QA formats at the same time. Lastly, UnifiedQA is compatible with any text-based QA format but has lower (although competitive) results.
Although it is the fastest to run, it is more expensive to train than TWEAC and MetaQA. Thus, its ideal use case is a scenario where a simple deployment is needed while remaining flexible enough to process any text-based QA format. In summary, this small study suggests that in scenarios where new domains are introduced often, multi-agent systems such as MetaQA might be more suitable, whereas in scenarios where inference speed or simple deployment are needed, MADE and UnifiedQA might be more appropriate.

Conclusions and Discussions
In this work, we have extended UKP-SQuARE to support multi-agent models. In particular, we deployed a routing system, TWEAC (Geigle et al., 2021); a method to combine adapter weights, MADE (Friedman et al., 2021); and a model that combines the predictions of multiple Skills, MetaQA (Puerto et al., 2021). We have conducted experiments on these three models and UnifiedQA (Khashabi et al., 2020), a multi-dataset system, to analyze the trade-off between the performance, efficiency, and flexibility of these systems. We showed that in scenarios where new domains or expertise are often needed, MetaQA provides the best trade-off since its performance is close to the best model, it is compatible with any QA format, it is cheap to train, and its inference runtime is close to TWEAC and MADE when using the parallel engine provided by UKP-SQuARE. However, when simple deployment is needed or the model is not expected to be updated, MADE and UnifiedQA might be more appropriate.
This update of UKP-SQuARE is of utmost importance due to the current pace of QA model development, which produces thousands of models per year. Our platform eases the deployment, running, comparison, and analysis of QA Skills. With this update, we also facilitate the aggregation of these Skills into Multi-Skills, simplifying research on multi-agent systems. We leave as future work the comparison of these modular systems with prompting-based QA in large language models (Brown et al., 2020; Zhong et al., 2022).
Limitations

UKP-SQuARE v3 does not aim to provide all existing multi-skill systems off the shelf. Instead, we deploy three different approaches and encourage the community to share, deploy, and compare their multi-skill systems. Using the modular Skill system of UKP-SQuARE and the reference implementations, users can reconfigure the existing multi-skill pipelines or implement and deploy their own through a streamlined pull request. Another limitation is that the multi-skill systems deployed in this paper have only been shown to work effectively with up to 20 Skills. Hence, the effectiveness of multi-skill systems remains unknown for a larger number of Skills. We hope that UKP-SQuARE v3 can help shed light on this topic.
Lastly, since multi-skill systems combine several models, it is feasible that the resulting system can inherit biases and unfair behaviors. Although the Skills we used are not intended to exhibit any bias or unfairness, users should use them at their own discretion.

Ethics Statement
Intended Use The intended use of UKP-SQuARE v3 is deploying, running, comparing, analyzing, and combining Skills. Our platform provides dozens of Skills readily available to be combined using the implemented multi-agent systems or new systems to be created by the community. This simplifies the analysis of these systems and thus fosters multi-agent QA research.
Potential Misuse A malicious user could train multiple Skills with biased and unfair behaviors, such as a QA system that returns harmful answers, and combine them with the deployed methods available in UKP-SQuARE. UKP-SQuARE does not provide any Skill with such an intended behavior, but the community is free to upload any model to our platform. Therefore, we encourage the community not to publicly upload such models unless there is a clear research intention with a discussion of the ethics of such research, and in that case, to make the Skills private so that nobody can use them in an unintended way. We are not liable for errors, false, biased, offensive, or any other unintended behavior of the Skills. Users should use them at their own discretion.
Environmental Impact The use of UKP-SQuARE can reduce the computational cost of reproducing prior research since it prevents the community from training models that are already trained.