Chain-of-Skills: A Configurable Model for Open-Domain Question Answering

The retrieval model is an indispensable component for real-world knowledge-intensive tasks, e.g., open-domain question answering (ODQA). As separate retrieval skills are annotated for different datasets, recent work focuses on customized methods, limiting model transferability and scalability. In this work, we propose a modular retriever where individual modules correspond to key skills that can be reused across datasets. Our approach supports flexible skill configurations based on the target domain to boost performance. To mitigate task interference, we design a novel modularization parameterization inspired by sparse Transformers. We demonstrate that our model can benefit from self-supervised pretraining on Wikipedia and fine-tuning using multiple ODQA datasets, both in a multi-task fashion. Our approach outperforms recent self-supervised retrievers in zero-shot evaluations and achieves state-of-the-art fine-tuned retrieval performance on NQ, HotpotQA and OTT-QA.


Introduction
Gathering supportive evidence from external knowledge sources is critical for knowledge-intensive tasks, such as open-domain question answering (ODQA; Lee et al., 2019) and fact verification (Thorne et al., 2018). Since different ODQA datasets focus on different information-seeking goals, this task is typically handled by customized retrieval models (Karpukhin et al., 2020; Yang et al., 2018; Wu et al., 2020; Ma et al., 2022a). However, this dataset-specific paradigm has limited model scalability and transferability. For example, augmented training with single-hop data hurts multi-hop retrieval (Xiong et al., 2021b). Further, as new information needs constantly emerge, dataset-specific models are hard to reuse.
Figure 1: Comparison of dense retrievers in terms of considered query type and supported skill configuration: [a] (Karpukhin et al., 2020), [b] (Xiong et al., 2021b), [c] (Wu et al., 2020). Each box represents a skill (single retrieval, expanded retrieval, linking, reranking) and the arrows represent the order of execution. In our case, we can flexibly combine and chain the skills at inference time for different tasks to achieve optimal performance.
In this work, we propose Chain-of-Skills (COS), a modular retriever based on the Transformer (Vaswani et al., 2017), where each module implements a reusable skill that can be used for different ODQA datasets. Here, we identify a set of such retrieval reasoning skills: single retrieval, expanded query retrieval, entity span proposal, entity linking and reranking (§2). As shown in Figure 1, recent work has only explored certain skill configurations. We instead consider jointly learning all skills in a multi-task contrastive learning fashion. Besides the benefit of solving multiple ODQA datasets, our multi-skill formulation provides unexplored ways to chain skills for individual use cases. In other words, it allows flexible configuration search according to the target domain, which can potentially lead to better retrieval performance (§4).
For multi-task learning, one popular approach is to use a shared text encoder (Liu et al., 2019a), i.e., sharing representations from the Transformer and only learning extra task-specific headers atop. However, this method suffers from undesirable task interference, i.e., negative transfer among retrieval skills.
To address this, we propose a new modularization parameterization inspired by the recent mixture-of-experts approach in sparse Transformers (Fedus et al., 2021a), i.e., mixing specialized and shared representations. Based on recent analyses of the Transformer (Meng et al., 2022), we design an attention-based alternative that is more effective in mitigating task interference (§5). Further, we develop multi-task pretraining using self-supervision on Wikipedia, so that the pretrained COS can be directly used for retrieval without dataset-specific supervision.
To validate the effectiveness of COS, we consider zero-shot and fine-tuning evaluations with regard to the model's in-domain and cross-dataset generalization. Six representative ODQA datasets are used: Natural Questions (NQ; Kwiatkowski et al., 2019), WebQuestions (WebQ; Berant et al., 2013), SQuAD (Rajpurkar et al., 2016), EntityQuestions (Sciavolino et al., 2021), HotpotQA (Yang et al., 2018) and OTT-QA (Chen et al., 2021a), where the last two are multi-hop datasets. Experiments show that our multi-task pretrained retriever achieves superior zero-shot performance compared to recent state-of-the-art (SOTA) self-supervised dense retrievers and BM25 (Robertson and Zaragoza, 2009). When fine-tuned using multiple datasets jointly, COS can further benefit from high-quality supervision effectively, leading to new SOTA retrieval results across the board. Further analyses show the benefits of our modularization parameterization for multi-task pretraining and fine-tuning, as well as flexible skill configuration via Chain-of-Skills inference.

Background
We consider five retrieval reasoning skills: single retrieval, expanded query retrieval, entity linking, entity span proposal and reranking. Conventionally, each dataset provides annotations on a different combination of skills (see Table A1). Hence, we can potentially obtain training signals for individual skills from multiple datasets. Below we provide some background for these skills.

Single Retrieval

Many ODQA datasets (e.g., NQ; Kwiatkowski et al., 2019) concern simple/single-hop queries. Using the original question as input (Figure 2 bottom-left), single retrieval gathers isolated supportive passages/tables from target sources in one shot (Karpukhin et al., 2020).

Expanded Query Retrieval

Answering complex multi-hop questions typically requires evidence chains of two or more separate passages (e.g., HotpotQA; Yang et al., 2018) or tables (e.g., OTT-QA; Chen et al., 2021a). Thus, follow-up rounds of retrieval are necessary after the initial single retrieval. Expanded query retrieval (Xiong et al., 2021b) takes an expanded query as input, where the question is expanded with the previous-hop evidence (Figure 2 bottom-center). The iterative retrieval process generally shares the same target source.

Entity Span Proposal

Since many questions concern entities, detecting those salient spans in the question or retrieved evidence is useful. The task is related to named entity recognition (NER), except that it requires only binary predictions, i.e., whether a span corresponds to an entity. It is a prerequisite for generating entity-centric queries (context with target entities highlighted; Figure 2 bottom-right), where targeted entity information can be gathered via downstream entity linking.

Entity Linking

Mapping detected entities to the correct entries in a database is crucial for analyzing factoid questions. Following Wu et al. (2020), we consider an entity-retrieval approach, i.e., using the entity-centric query to retrieve its corresponding Wikipedia entity description.

Reranking

Previous work often uses a reranker to improve the evidence recall in the top-ranked candidates. Typically, the question together with a complete evidence chain is used for reranking.

Approach
In this work, we consider a holistic approach to gathering supportive evidence for ODQA, i.e., the evidence set contains both singular tables/passages (from single retrieval) and connected evidence chains (via expanded query retrieval/entity linking).As shown in Figure 2, COS supports flexible skill configurations, e.g., expanded query retriever and the entity linker can build upon the single-retrieval results.As all retrieval skill tasks are based on contrastive learning, we start with the basics for our multi-task formulation.We then introduce our modularization parameterization for reducing task interference.Lastly, we discuss ways to use selfsupervision for pretraining and inference strategies.

Reasoning Skill Modules
All reasoning skills use text encoders based on the Transformer (Vaswani et al., 2017). In particular, only BERT-base (Devlin et al., 2019) is considered unless specified otherwise. Text inputs are prepended with a special token [CLS] and different segments are separated by the special token [SEP]. The bi-encoder architecture (Karpukhin et al., 2020) is used for single retrieval, expanded query retrieval, and entity linking. We use the dot product for sim(·, ·).

Retrieval

As single retrieval and expanded query retrieval only differ in their query inputs, these two skills are discussed together here. Specifically, both skills involve examples of a question Q and a positive document P+. Two text encoders are used, i.e., a query encoder for questions and a context passage encoder for documents. For the expanded query case (Figure 2 bottom-center), the question is expanded with the previous-hop evidence to form the query. Both skills are trained with the contrastive loss

L_ret = -log [ exp(sim(q, p+)) / ( exp(sim(q, p+)) + Σ_{p'∈P} exp(sim(q, p')) ) ],  (1)

where q, p are the query and document vectors respectively and P is the set of negative documents.

Entity Span Proposal

To achieve a multi-task formulation, we model entity span proposal based on recent contrastive NER work (Zhang et al., 2022a). Specifically, for an input sequence with N tokens, x_1, ..., x_N, we encode it with a text encoder into a sequence of vectors h^m_1, ..., h^m_N ∈ R^d. We then build the span representations using the span start and end token vectors, m_(i,j) = tanh((h^m_i ⊕ h^m_j) W_a), where i and j are the start and end positions respectively, ⊕ denotes concatenation, tanh is the activation function, and W_a ∈ R^{2d×d} are learnable weights. For negative instances, we randomly sample spans up to a maximum length of 10 from the same input which do not correspond to any entity. Then we use a learned anchor vector s ∈ R^d for contrastive learning, i.e., pushing it close to the entity spans and away from negative spans:

L_pos = -log [ exp(sim(s, m+)) / ( exp(sim(s, m+)) + Σ_{m'∈M} exp(sim(s, m')) ) ],  (2)
where m+ is a positive entity span representation and M is the negative span set, which always contains a special span corresponding to [CLS], m_[CLS] = h^m_0. However, the above objective alone cannot separate predicted entity spans from null cases at test time. To address this, we further train the model with an extra objective to learn a dynamic threshold using m_[CLS]:

L_cls = -log [ exp(sim(s, m_[CLS])) / Σ_{m'∈M} exp(sim(s, m')) ].  (3)

The overall entity span proposal loss is computed as L_span = (L_pos + L_cls)/2. At test time, spans with scores higher than the threshold sim(s, m_[CLS]) are predicted as positive.
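The span scoring just described can be sketched in a few lines (a minimal NumPy illustration with our own variable names, not the released implementation): candidate spans are scored against the learned anchor vector s, and at test time the [CLS] token vector serves as the dynamic threshold.

```python
import numpy as np

def span_scores(h, spans, W_a, s):
    # h: (N, d) token vectors from the text encoder
    # spans: list of (start, end) index pairs; W_a: (2d, d); s: (d,) anchor
    reps = np.stack([np.tanh(np.concatenate([h[i], h[j]]) @ W_a)
                     for i, j in spans])          # m_(i,j) for each span
    return reps @ s                               # sim(s, m_(i,j))

def predict_entities(h, spans, W_a, s):
    # Spans scoring above the dynamic [CLS] threshold (m_[CLS] = h[0])
    # are predicted as entities.
    scores = span_scores(h, spans, W_a, s)
    threshold = h[0] @ s
    return [sp for sp, sc in zip(spans, scores) if sc > threshold]
```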
Entity Linking

Unlike Wu et al. (2020), where entity markers are inserted into the entity mention context (the entity mention with surrounding context), we use the raw input sequence as in the entity span proposal task. For the entity mention context, we pass the input tokens x_1, ..., x_N through the entity query encoder to get h^e_1, ..., h^e_N ∈ R^d. Then we compute the entity vector based on its start position i and end position j, i.e., e = (h^e_i + h^e_j)/2. For entity descriptions, we encode them with the entity description encoder and use the [CLS] vector p_e as the representation. The model is trained to match the entity vector with its entity description vector:

L_link = -log [ exp(sim(e, p^+_e)) / Σ_{p'∈P_e∪{p^+_e}} exp(sim(e, p')) ],  (4)

where p^+_e is the linked description vector and P_e is the negative entity description set.

Reranking

Given a question Q and a passage P, we concatenate them in the expanded query retrieval format [CLS] Q [SEP] P [SEP] and encode the result using another text encoder. We use the pair consisting of the [CLS] vector h^r_[CLS] and the first [SEP] vector h^r_[SEP] from the output for reranking. The model is trained with a contrastive loss over pair scores:

L_rank = -log [ exp(s(Q, P+)) / Σ_{P'∈P_r∪{P+}} exp(s(Q, P')) ],  (5)

where s(Q, P) = sim(h^r_[CLS], h^r_[SEP]) for the encoded pair and P_r is the set of negative passages concatenated with the same question. Intuitively, our formulation encourages h^r_[CLS] to capture more information about the question and h^r_[SEP] to focus more on the evidence. The positive pair, where the evidence is supportive, likely has a higher similarity than the negative ones. Our formulation thus spares the need for an extra task-specific header. As the model only learns to rerank single passages, we compute the score for each passage separately in multi-hop cases.
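All of the objectives above share the same softmax contrastive form over dot-product similarities. A minimal sketch, assuming NumPy and a single positive with a set of negatives (batched PyTorch in practice):

```python
import numpy as np

def contrastive_loss(q, p_pos, p_negs):
    # q: (d,) query vector; p_pos: (d,) positive vector; p_negs: (k, d) negatives
    # Score is the dot product sim(q, p); loss is -log softmax of the positive.
    logits = np.concatenate([[q @ p_pos], p_negs @ q])
    logits = logits - logits.max()        # numerical stability
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))
```

The same function form covers all the skills by plugging in the appropriate query/positive/negative vectors (e.g., the span anchor s with span representations, or the entity vector e with description vectors).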

Modular Skill Specialization
Implementing all the aforementioned modules using separate models is clearly inefficient. As recent work finds that parameter sharing improves the bi-encoder retriever (Xiong et al., 2021b), we focus on a multi-task learning approach.
One popular choice is to share the text encoder parameters across all modules (Liu et al., 2019a). However, this approach suffers from task interference, resulting in degraded performance compared with skill-specific models (§5.1). We attribute the cause to competition for model capacity, i.e., conflicting signals from different skills require attention to individual syntactic/semantic patterns. For example, the text encoder for entity-centric queries likely focuses on the local context around the entity, while the expanded query encoder tends to represent latent information based on the relation between the query and the previous-hop evidence.
Motivated by recent modular approaches for sparse Transformer LMs (Fedus et al., 2021b), we propose to mitigate task interference by mixing skill-specific Transformer blocks with shared ones. A typical Transformer encoder is built with a stack of regular Transformer blocks, each consisting of a multi-head self-attention (MHA) sub-layer and a feed-forward network (FFN) sub-layer, with residual connections (He et al., 2015) and layer normalization (Ba et al., 2016) applied to both sub-layers. The shared Transformer block is identical to a regular Transformer block, i.e., all skill inputs are passed through the same MHA and FFN functions.
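The idea of mixing shared and skill-specific sub-layers can be illustrated with a toy routing wrapper; for brevity this sketch uses simple residual linear maps in place of real MHA/FFN sub-layers and routes by an explicit skill id (all names here are illustrative, not the paper's code):

```python
import numpy as np

class SkillSpecificSublayer:
    # One parameter set per skill; the input's skill id selects the expert.
    def __init__(self, d, skills, rng):
        self.experts = {sk: rng.normal(scale=0.02, size=(d, d)) for sk in skills}

    def __call__(self, x, skill):
        return x + x @ self.experts[skill]    # residual connection

class SharedSublayer:
    # A regular sub-layer: every skill's input goes through the same weights.
    def __init__(self, d, rng):
        self.W = rng.normal(scale=0.02, size=(d, d))

    def __call__(self, x, skill=None):
        return x + x @ self.W
```

A COS-style encoder would stack blocks of both kinds, so skills share most of the network while diverging only in the specialized sub-layers.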
As shown in Figure 2, for skill-specific Transformer blocks, we select a specialized sub-layer from a pool of I parallel sub-layers based on the input, i.e., different skill inputs are processed independently. One option is to specialize the FFN expert sub-layer for individual skills, which is widely used by recent mixture-of-experts models (Fedus et al., 2021b; Cheng et al., 2022). As the FFN sub-layer is found to be important for factual associations (Meng et al., 2022), we hypothesize that using the popular FFN expert is sub-optimal. Since most reasoning skills require similar world knowledge, specializing FFN sub-layers likely hinders knowledge sharing. Instead, different skills typically require the model to attend to distinct input parts. Thus, we investigate a more parameter-efficient alternative, i.e., MHA specialization. In our experiments, we find it to be more effective in reducing task interference (§5.1).

Expert Configuration

Regarding the modularization, a naive setup is to route various task inputs to their dedicated sub-layers (experts), i.e., two experts for each bi-encoder task (single retrieval, expanded query retrieval and entity linking) and one expert for each cross-encoder task (entity span proposal and reranking), leading to eight experts in total. To save computation, we make the following adjustments. Given that single and expanded query retrievers share the same set of target passages, we merge the context expert for both cases. Due to data sparsity, we find that routing the expanded queries and the very similar reranker inputs to separate experts is problematic (§5.1). Thus, we merge the expert for expanded queries and reranker inputs. During self-supervised pretraining with three bi-encoder tasks, we further share the expert for single and expanded queries for efficiency. The overall expert configuration is shown in Figure 3.

Multi-task Self-supervision

Inspired by the recent success of Izacard et al. (2021), we also use self-supervision on Wikipedia for pretraining. Here, we only consider pretraining for bi-encoder skills (i.e., single retrieval, expanded query retrieval, and entity linking), where abundant self-supervision is available. Unlike prior work focusing only on single-type pretraining, we consider a multi-task setting using individual pages and the hyperlink relations among them. Specifically, we follow Izacard et al. (2021) and Wu et al. (2020) to construct examples for single retrieval and entity linking, respectively. For single retrieval, a pair of randomly cropped views of a passage is used as a positive example. For entity linking, a short text snippet with a hyperlinked entity (entity mention context) is used as the query, and the first paragraph of its linked Wikipedia page is treated as the target (entity description). For expanded query retrieval, given a page, we construct an expanded query using a randomly sampled short text snippet together with the page's first paragraph, and use the first paragraph of a linked page as the target.
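The three kinds of pretraining examples can be sketched roughly as follows (a simplified illustration; the function names and cropping details are our assumptions, and real preprocessing operates on tokenized Wikipedia text):

```python
import random

def single_retrieval_example(tokens, min_len=5, rng=random):
    # Two independent random crops of one passage form a positive pair.
    def crop():
        length = rng.randint(min_len, len(tokens))
        start = rng.randint(0, len(tokens) - length)
        return tokens[start:start + length]
    return crop(), crop()

def entity_linking_example(mention_context, linked_first_paragraph):
    # Query: snippet containing a hyperlinked entity; target: the first
    # paragraph of the linked page (its entity description).
    return {"query": mention_context, "target": linked_first_paragraph}

def expanded_query_example(snippet, page_first_paragraph, linked_first_paragraph):
    # Pseudo expanded query: sampled snippet + the page's first paragraph;
    # target: the first paragraph of a page it links to.
    return {"query": snippet + " " + page_first_paragraph,
            "target": linked_first_paragraph}
```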

Inference
During inference, different skills can be flexibly combined to boost retrieval accuracy. The studied configurations are illustrated in Figure 1. To consolidate the evidence set obtained by different skills, we first align the linking scores based on the same-step retrieval scores (single or expanded query retrieval) for sorting. Documents returned by multiple skills are considered more relevant and thus promoted in the ranking. More details with running examples are provided in Appendix A.

Evaluation Settings
We evaluate our model in three scenarios.

Zero-shot Evaluation

Similar to recent self-supervised dense retrievers on Wikipedia, we conduct zero-shot evaluations using the retrieval skill from our pretrained model on NQ, WebQ, EntityQuestions and HotpotQA. To assess the model's ability to handle expanded query retrieval, we design an oracle second-hop retrieval setting (gold first-hop evidence is used) based on HotpotQA. Following Izacard et al. (2021), we report top-k retrieval accuracy, i.e., the percentage of questions for which the answer string is found in the top-k passages.
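Top-k retrieval accuracy can be computed with a simple containment check (a sketch; official evaluation scripts normalize answer strings more carefully):

```python
def topk_accuracy(results, k):
    # results: list of (answers, ranked_passages) per question, where answers
    # is a list of acceptable strings and ranked_passages is sorted by score.
    hits = 0
    for answers, passages in results:
        if any(ans.lower() in p.lower() for p in passages[:k] for ans in answers):
            hits += 1
    return hits / len(results)
```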

Supervised In-domain Evaluation
We further fine-tune our pretrained model with two extra skills (entity span proposal and reranking) on NQ, HotpotQA and OTT-QA, again in a multi-task fashion. Unlike the multi-hop data with supervision for all skills, only single retrieval and reranking data is available for NQ. During training, all datasets are treated equally without any loss balancing. Different from previous retrieval-only work, we explore Chain-of-Skills retrieval by using different skill configurations. Specifically, we use the skill configurations for tasks A, B and C shown in Figure 1 for NQ, OTT-QA and HotpotQA, respectively. We again report top-k retrieval accuracy for NQ and OTT-QA following previous work. For HotpotQA, we follow the literature and use the top-1 pair of evidence accuracy (passage EM). For reference, the passage EM of the HotpotQA baselines is: MDR (Xiong et al., 2021b) 81.20, Baleen (Khattab et al., 2021) 86.10, IRRR (Qi et al., 2021) 84.10, TPRR (Zhang et al., 2021a) 86.19.

Results
Zero-shot Results

For zero-shot evaluations, we use two recent self-supervised dense retrievers, Contriever (Izacard et al., 2021) and Spider (Ram et al., 2022), and BM25 as baselines. The results are presented in Table 1. As we can see, BM25 is a strong baseline, matching the average retrieval performance of Spider and Contriever over the considered datasets. COS achieves similar results on NQ and WebQ compared with the self-supervised dense methods. On the other hand, we observe significant gains on HotpotQA and EntityQuestions, where both dense retrievers are lacking. In summary, our model shows superior zero-shot performance in terms of average answer recall across the board, surpassing BM25 with the largest gains, which indicates the benefit of our multi-task pretraining.

Supervised In-domain Results
As various customized retrievers have been developed for NQ, OTT-QA and HotpotQA, we compare COS with different dataset-specific baselines separately. For NQ, we report two types of baselines: 1) bi-encoders with multi-dataset training and 2) models with augmented pretraining. For the first type, we have DPR-multi (Karpukhin et al., 2020) and ANCE-multi (Xiong et al., 2021a), where the DPR model is initialized from BERT-base and ANCE is initialized from DPR. For the second type, DPR-PAQ (Oguz et al., 2022) is initialized from the RoBERTa-large model (Liu et al., 2019b) with pretraining using synthetic queries (the PAQ corpus; Lewis et al., 2021), and co-Condenser (Gao and Callan, 2022) uses additional corpus-aware contrastive pretraining. Prior work has shown that scaling up the retriever from base to large size only provides limited gains after pretraining. Moreover, DPR-PAQ only learns a single retrieval skill, whereas COS can combine multiple skills for inference. We defer the analysis of the advantages of Chain-of-Skills inference to later sections (§5.2).
For OTT-QA, we only compare with the SOTA model CORE (Ma et al., 2022a), because other OTT-QA-specific retrievers are not directly comparable, as they use extra customized knowledge sources. As CORE also uses multiple skills to find evidence chains, we include a baseline where the inference follows the CORE skill configuration but uses modules from COS. For HotpotQA, we compare against three types of baselines: dense retrievers focused on expanded query retrieval, MDR (Xiong et al., 2021b) and Baleen (Khattab et al., 2021); sparse retrieval combined with query reformulation, IRRR (Qi et al., 2021) and TPRR (Zhang et al., 2021a); and ensembles of dense, sparse and hyperlink retrieval, HopRetriever (Li et al., 2021) and AISO (Zhu et al., 2021). The results on OTT-QA and HotpotQA are summarized in Table 3 and Table 4. It is easy to see that COS outperforms all the baselines here, again showing the advantage of our configurable multi-skill model over multiple types of ODQA tasks. Later, our analyses show that both Chain-of-Skills inference and pretraining contribute to the observed gains.

Cross-data Results
Given that both EntityQuestions and SQuAD are single-hop, we use baselines on NQ with improved robustness for comparison. In particular, SPAR-wiki is an ensemble of two dense models, one pretrained using BM25 supervision on Wikipedia and the other fine-tuned on NQ. BM25 is included here, as it is found to achieve better performance than its dense counterparts on those two datasets. The evaluation results are shown in Table 5. Overall, our model achieves the largest gains over BM25 on both datasets, indicating that our multi-task fine-tuned model with Chain-of-Skills inference is more robust than previous retrieval-only approaches.

Task Interference
We conduct ablation studies on HotpotQA to compare different ways of implementing skill-specific specialization (discussed in §3.2) and their effects on task interference. As MHA experts are used for our model, we consider two variants for comparison: 1) the no-expert model, where all tasks share one encoder, and 2) the FFN expert model, where specialized FFN sub-layers are used. We also compare the proposed expert configuration with a variant where the expanded query retrieval inputs share the same expert as single retrieval, denoted as the naive setting. The results are shown in the upper half of Table 6. Compared with the no-expert model, both FFN and MHA experts can effectively reduce task interference, with the MHA expert being more effective overall. Our proposed expert configuration further helps.

Benefit of Chain-of-Skills Inference
Here we explore the benefits of chained skill inference over the retrieval-only version. We additionally train a multi-hop retriever following Xiong et al. (2021b), and compare it with the two MHA expert models using the same two rounds of retrieval-only inference. The comparison is shown in the lower part of Table 6. As we can see, retrieval-only inference suffers large drops in performance. Although our proposed and naive MHA expert configurations have similar performance using Chain-of-Skills inference, the naive configuration shows severe degradation caused by task interference compared with the multi-hop retriever, validating the effectiveness of our proposed model. We further compare our Chain-of-Skills inference with the retrieval-only inference on NQ, EntityQuestions and SQuAD in Figure 4. It is easy to see that our pretraining can benefit the retrieval-only version. However, using better skill configurations via Chain-of-Skills inference yields further improvements, particularly on those unseen datasets.

Effect of Pretraining
To further demonstrate the benefit of our proposed multi-task pretraining, we fine-tune another multi-task model without pretraining. Comparing the two (Figure 4), we find that COS consistently outperforms the multi-task model without pretraining across all considered datasets using Chain-of-Skills inference. The pretrained model achieves improvements across the board, especially on out-of-domain datasets, which validates the benefits of our multi-task pretraining.

Swapping Experts
To understand whether different experts in our model learn different specialized knowledge, we experiment with swapping experts for different inputs on HotpotQA. In particular, we feed the single query input and the expanded query input to different query experts and then retrieve from either the context passage index or the entity description index. For single query input, we measure whether the model can retrieve one of the positive passages. For expanded query input, we compute the recall for the other positive passage as done in §4.3. The results are shown in Table 7. Although both the single query expert and the expanded query expert learn to retrieve evidence using the [CLS] token, swapping the expert for either of these input types leads to a significant decrease in performance. Also, switching to the entity query expert and retrieving from the entity description index results in a large drop for both types of inputs. This implies that each specialized expert acquires distinct knowledge and cannot be substituted for one another.

Question Answering Experiments
Here, we conduct end-to-end question-answering experiments on NQ, OTT-QA and HotpotQA, using retrieval results from COS.Following the literature, we report exact match (EM) accuracy and F1 score.
For NQ and OTT-QA, we re-implement the Fusion-in-Encoder (FiE) model (Kedia et al., 2022) because of its superior performance on NQ. For NQ, the model reads the top-100 passages returned by COS, and for OTT-QA, the model reads the top-50 evidence chains, in order to be comparable with previous work. Here, separate models are trained for each dataset independently. Due to space constraints, we only present the results on OTT-QA and leave the NQ results to Table A2. The OTT-QA results are summarized in Table 8. Our model, when coupled with FiE, outperforms the previous baselines by large margins on OTT-QA, and we can see that the superior performance is mainly due to COS.
Finally, for HotpotQA, since the task requires the model to predict supporting sentences in addition to the answer span, we follow Zhu et al. (2021) and train a separate reader model to learn answer prediction and supporting sentence prediction jointly. Due to space constraints, we leave the full results to Table A3. Overall, our method achieves competitive QA performance against the previous SOTA with improved exact match accuracy.

Related Work
Dense retrievers are widely used in the recent ODQA literature (Lee et al., 2019; Karpukhin et al., 2020). While most previous work focuses on single retrieval (Xiong et al., 2021a; Qu et al., 2021), some efforts have also been made towards better handling of other query types. Xiong et al. (2021b) propose a joint model to handle both single retrieval and expanded query retrieval. Chen et al. (2021b) train a dense model to learn salient phrase retrieval. Ma et al. (2022a) build an entity linker to handle multi-hop retrieval. Nevertheless, all those models are still customized for specific datasets, e.g., only a subset of query types is considered or separate models are used, making them hard to reuse and computationally intensive. We address these problems by pinning down a set of functional skills that enables joint learning over multiple datasets.
Mixture-of-experts models have also become popular recently (Fedus et al., 2021b). Methods like gated routing (Lepikhin et al., 2020) or stochastic routing of experts (Zuo et al., 2021) do not differentiate the knowledge learned by different experts. Instead, our work builds expert modules that learn reusable skills which can be flexibly combined for different use cases.
Another line of work focuses on unsupervised dense retrievers using self-supervised data constructed from the inverse cloze task (Lee et al., 2019), random croppings (Izacard et al., 2021), truncation of passages with the same span (Ram et al., 2022), hyperlink-induced passages (Zhou et al., 2022) or synthetic QA pairs (Oguz et al., 2022). Other model architecture adjustments to the Transformer for retrieval have also been proposed (Gao and Callan, 2021, 2022). Our work can be viewed as a synergy of both: our multi-task pretrained model can perform better zero-shot retrieval, and our modular retriever can be further fine-tuned in a multi-task fashion to achieve better performance.

Conclusions
In this work, we propose a modular model, Chain-of-Skills (COS), that learns five reusable skills for ODQA via multi-task learning. To reduce task interference, we design a new parameterization for skill modules. We also show that the skills learned by COS can be flexibly chained together to better fit the target task. COS can directly perform superior zero-shot retrieval using multi-task self-supervision on Wikipedia. When fine-tuned on multiple datasets, COS achieves SOTA results across the board. For future work, we are interested in scaling up our method and exploring other scenarios, e.g., commonsense reasoning (Talmor et al., 2022) and biomedical retrieval (Nentidis et al., 2020; Zhang et al., 2022b).

Acknowledgments

We thank the group at Microsoft Research for their helpful discussions and the anonymous reviewers for their valuable suggestions on this paper.

Limitations
We identify the following limitations of our work.
Our current COS reranking expert only learns to rerank single-step results. Thus it cannot model the interaction between documents in the case of multi-passage evidence chains, which might lead to sub-optimal performance, e.g., when we need to rerank the full evidence path for HotpotQA. At the same time, we hypothesize that the capacity of the small model used in our experiments is insufficient for modeling evidence chain reranking. We leave the exploration of learning a full path reranker for future work. Also, our current pretraining setup only includes the three bi-encoder tasks, and thus we cannot use the pretrained model out of the box to solve tasks like end-to-end entity linking. Consequently, the skills learned from self-supervision cannot be chained together to perform configurable zero-shot retrieval. It would be interesting to also include the entity span proposal skill in the pretraining stage, which could unleash the full potential of Chain-of-Skills inference for zero-shot scenarios.

A Inference Pipeline
At inference time, our model uses the retrieving skill, the linking skill, or both in parallel to gather evidence at every reasoning step. When both skills are used, one problem is that the scores associated with the evidence found by different skills are not aligned, i.e., naively sorting the retrieved documents and linked documents together may cause one pool of documents to dominate the other. Thus, we propose to align the linking scores based on the same-step retrieval scores:

ls'_i = ls_i · max({rs}) / max({ls})  if max({ls}) > max({rs}),  else  ls'_i = ls_i,

where ls_i represents the linking score of document i, and {ls}, {rs} represent the sets of linking scores and retrieving scores for the top-K documents from each skill. Effectively, if the top raw linking score is larger than the top retrieving score, we align the top-1 document from each set. On the other hand, if the raw linking score is smaller, it does not get scaled. The reason is that certain common entities, e.g., United States, may also be detected and linked by our model, but they usually do not contribute to answer reasoning, so we do not want to encourage their presence.
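The alignment rule described above can be sketched as follows (illustrative Python; the released implementation may differ in details):

```python
def align_linking_scores(link_scores, retr_scores):
    # Scale linking scores down so the top linked document aligns with the
    # top retrieved one; never scale them up, so that common linked entities
    # (e.g., "United States") are not promoted.
    if not link_scores or not retr_scores:
        return list(link_scores)
    top_ls, top_rs = max(link_scores), max(retr_scores)
    if top_ls > top_rs:
        return [ls * top_rs / top_ls for ls in link_scores]
    return list(link_scores)
```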
In the case of a document being discovered by both skills, we promote its ranking in the final list. To do so, we take the max of the individual scores (after alignment) and then multiply by a coefficient α, which is a hyper-parameter.
Finally, we use the reranking skill to compute a new set of scores for the merged evidence set, and then sort the documents using a combination of the retrieving/linking score and the reranking score:

s_i = s^{ret/link}_i + β · s^{rerank}_i,

where β is another hyper-parameter. For multi-hop questions, the same scoring process is conducted for the second-hop evidence documents, and the two-hop scores are then aggregated to sort the reasoning chains. The inference pipeline is also illustrated in Figure A1.
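Putting the steps together, a simplified version of the final single-hop scoring might look like this (the α and β values here are illustrative placeholders, not tuned hyper-parameters):

```python
def final_ranking(base_scores, rerank_scores, found_by_both, alpha=1.5, beta=0.5):
    # base_scores: {doc_id: aligned retrieval-or-linking score}
    # rerank_scores: {doc_id: reranker score}
    # found_by_both: doc ids returned by both skills, promoted by alpha
    combined = {}
    for doc, score in base_scores.items():
        if doc in found_by_both:
            score *= alpha
        combined[doc] = score + beta * rerank_scores.get(doc, 0.0)
    return sorted(combined, key=combined.get, reverse=True)
```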

B.1 Data Statistics
The detailed data statistics are shown in Table A1.
Pretraining We follow Izacard et al. (2021) and Wu et al. (2020) to construct examples for single retrieval and entity linking, respectively. For single retrieval, a pair of randomly cropped views of a passage is treated as a positive example. Similar to Spider (Ram et al., 2022), we also use the processed DPR passage corpus based on the English Wikipedia dump from 2018/12/20. For entity linking, we directly use the preprocessed data released by BLINK (Wu et al., 2020), based on the English Wikipedia dump from 2019/08/01. For expanded query retrieval, we construct the pseudo query from a short text snippet together with the first passage of the same page, and treat the first passage of a linked page as the target. As no hyperlink information is preserved in the DPR passage corpus, we use the English Wikipedia dump from 2022/06/01 for this data construction. In each Wikipedia page, we randomly sample 30 passages with hyperlinks (if there are fewer than 30 passages with hyperlinks, we take all of them). Each sampled passage, together with the first passage of the page, forms a pseudo query. Then, in each sampled passage, we randomly pick an anchor entity and take the first passage of its associated Wikipedia page as the target. To avoid redundancy, once an anchor entity has been used 10 times in a source page, we no longer pick it for that source. If the query and the target together exceed 512 tokens, we truncate the longer of the two by randomly dropping its first or last token.
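The pseudo-query construction for expanded query retrieval can be sketched as below. The `page` data layout, the `get_first_passage` callable, and the function name are illustrative assumptions; the token-level truncation step is omitted:

```python
import random

def build_expanded_query_examples(page, get_first_passage,
                                  max_passages=30, max_anchor_uses=10):
    """Sketch of expanded query retrieval pretraining data construction.

    `page` is assumed to look like:
      {"first_passage": str,
       "passages": [{"text": str, "anchors": [entity_id, ...]}, ...]}
    `get_first_passage(entity_id)` (hypothetical) returns the first passage
    of the linked entity's page. Returns (pseudo_query, target) pairs.
    """
    with_links = [p for p in page["passages"] if p["anchors"]]
    sampled = random.sample(with_links, min(max_passages, len(with_links)))
    anchor_uses, examples = {}, []
    for passage in sampled:
        # skip anchors already used 10 times in this source page
        candidates = [a for a in passage["anchors"]
                      if anchor_uses.get(a, 0) < max_anchor_uses]
        if not candidates:
            continue
        anchor = random.choice(candidates)
        anchor_uses[anchor] = anchor_uses.get(anchor, 0) + 1
        query = page["first_passage"] + " " + passage["text"]
        examples.append((query, get_first_passage(anchor)))
    return examples
```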
Finetuning For NQ, we adopted the retriever training data released by Ma et al. (2022b) and further used it for the reranking skill. For HotpotQA, we adopted the single retrieval and expanded query retrieval data released by Xiong et al. (2021b). For question entity linking data, we heuristically matched entity spans in the question with the gold passages' titles to construct positive pairs, and we use the same set of negative passages as in single retrieval. For passage entity linking, we collected all unique gold passages in the training set and their corresponding hyperlinks to build positives, and mined negatives using BM25. Finally, the reranking data is the same as the single retrieval data.
For OTT-QA, we adopt the single retrieval and table entity linking data released by Ma et al. (2022a). For expanded query retrieval, we concatenate the question with the table title, header, and the row that links to the answer-containing passage as the query, and the corresponding passage is treated as the positive target. The negatives are mined with BM25. Finally, the reranking data is the same copy as in single retrieval, except that we further break tables down into rows and train the model to rank rows. This is because we want to make reranking and expanded query retrieval more compatible.

Figure A1: The reasoning pipeline of Chain-of-Skills (COS). Given a question, COS first identifies salient spans in the question; then the retrieving and linking skills are both used to find first-hop evidence, using the [CLS] token and entity mention representations respectively. We then merge all the evidence through score alignment and the reranking skill. For top-ranked evidence documents, we concatenate each of them with the question and perform another round of retrieving and linking. The second-hop evidence is merged and reranked in the same fashion. Finally, the reasoning paths are sorted based on both hops' scores.
Since iterative training has been shown to be an effective strategy by previous work (Xiong et al., 2021a; Ma et al., 2022b), we further mined harder negatives for the HotpotQA and OTT-QA skill training data. Specifically, we train models using the same configuration as in pretraining (four task-specific experts, with no reranking data or span proposal data) for HotpotQA and OTT-QA respectively (models are initialized from BERT-base-uncased). Then we mined harder negatives for each of the data types using the converged model. The reranking and entity span proposal skills are excluded in this round because reranking already benefits from the harder negatives for single retrieval (as the two skills share the same data) and entity span proposal does not need to search through a large index. Finally, the data splits coupled with harder negatives are used to train our main Chain-of-Skills (COS) model and to conduct ablation studies.
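The hard-negative mining step with a converged bi-encoder can be sketched as follows; the function name and the flat vector representation are illustrative assumptions (in practice the search runs over a large dense index):

```python
def mine_hard_negatives(query_vec, corpus, gold_ids, k=30):
    """Sketch of hard-negative mining with a converged bi-encoder model.

    corpus: list of (doc_id, doc_vec) pairs. The top-scoring documents
    under the converged model that are NOT gold passages become the
    harder negatives for this query.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    ranked = sorted(corpus, key=lambda d: dot(query_vec, d[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked if doc_id not in gold_ids][:k]
```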

B.2 Training Details
Pretraining Similar to Contriever (Izacard et al., 2021), we adopt a continual pretraining setup based on the uncased BERT-base architecture, with our model initialized from the Contriever weights. We train the model for 20 epochs with a batch size of 1024 and a max sequence length of 256. Here, we only use in-batch negatives for contrastive learning. The model is optimized using Adam with an initial learning rate of 1e-4. The final checkpoint is used for later fine-tuning.

Finetuning When initializing from pretrained COS, the weight mappings for the first 5 experts are illustrated in Figure 3, and the last expert is initialized from BERT-base-uncased. For all experiments, we train models for 40 epochs with a batch size of 192, a learning rate of 2e-5, and a max sequence length of 256. During training, each batch only contains training data for one skill from one dataset, so the model can effectively benefit from in-batch negatives. To train the entity span proposal skill, we use the same data as entity linking.
In particular, we route the data to the span proposal experts 20% of the time; otherwise the data go through the entity linking experts.
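This stochastic routing can be sketched with a couple of lines; the expert names and function signature are illustrative:

```python
import random

def route_entity_example(example, p_span=0.2, rng=random):
    """Route an entity-linking training example to one of two experts.

    With probability p_span (20% in our setup) the example trains the
    entity span proposal expert; otherwise it trains entity linking.
    Expert names here are illustrative labels, not real identifiers.
    """
    return "span_proposal" if rng.random() < p_span else "entity_linking"
```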

B.3 Inference Details
Zero-shot evaluation We directly use the single retrieval skill to find the top-100 documents and compute the results in Table 1.

Supervised and Cross-dataset For NQ, EntityQuestions and SQuAD, the reasoning path has a length of 1, i.e., only single passages. We use both the single retrieval and linking skills to find a total of top-1000 passages first, and then reduce the set to the top 100 using the reranking skill.
Both HotpotQA and OTT-QA have reasoning paths with a max length of 2. For OTT-QA, we first find the top 100 tables using the single retrieval skill, following Ma et al. (2022a). Then we break tables down into rows and use the reranking skill to keep only the top 200 rows. For each row, the expanded query retrieval and linking skills are used to find second-hop passages, where we keep the top 10 passages from every expanded query retrieval and the top 1 passage from every linked entity. Finally, we apply the same heuristics as Ma et al. (2022a) to construct the final top 100 evidence chains.
For HotpotQA, single retrieval and linking are used jointly to find the first-hop passages, where we keep the top 200 passages from single retrieval and the top 5 passages from each linked question entity. The combined set is then reranked to keep the top 30 first-hop passages. Next, expanded query retrieval and passage entity linking are applied to these 30 passages, where we keep the top 50 passages from expanded query retrieval and the top 2 passages from every linked passage entity. Another round of reranking is then performed on the newly collected passages, and we sort the evidence passage chains based on the final aggregated score, keeping the top 100 chains. Since all of the baselines on HotpotQA adopt a large passage path reranker, we also trained such a model following Zhu et al. (2021). The hyperparameters for OTT-QA and HotpotQA inference are selected such that the total number of evidence chains is comparable to previous works (Ma et al., 2022a; Xiong et al., 2021b).
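The two-hop HotpotQA procedure can be sketched as below, with the top-k values from the text. All skill callables (`retrieve`, `link_q`, `rerank`, `expand`, `link_p`) are hypothetical stand-ins returning lists of `(passage_id, score)` pairs; the additive score aggregation is an assumption:

```python
def hotpotqa_inference(question, retrieve, link_q, rerank, expand, link_p):
    """Two-hop HotpotQA inference sketch (all skill callables hypothetical).

    Hop 1: top 200 retrieved + top 5 per linked question entity, reranked to 30.
    Hop 2: top 50 expanded + top 2 per linked passage entity, reranked.
    Chains are sorted by the aggregated two-hop score; top 100 are kept.
    """
    def dedupe(passages):
        best = {}
        for pid, score in passages:
            best[pid] = max(score, best.get(pid, float("-inf")))
        return list(best.items())

    hop1 = retrieve(question)[:200] + [p for ent in link_q(question) for p in ent[:5]]
    hop1 = sorted(rerank(question, dedupe(hop1)),
                  key=lambda x: x[1], reverse=True)[:30]
    chains = []
    for pid1, s1 in hop1:
        hop2 = expand(question, pid1)[:50] + [p for ent in link_p(pid1) for p in ent[:2]]
        for pid2, s2 in rerank(question, dedupe(hop2)):
            chains.append(((pid1, pid2), s1 + s2))  # aggregate two-hop scores
    return [c for c, _ in sorted(chains, key=lambda x: x[1], reverse=True)[:100]]
```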

C.1 Training Details
We follow the descriptions in Kedia et al. (2022) for our re-implementation of the FiE model, which is initialized from Electra-large (Clark et al., 2020). For NQ, we train the model for 5,000 steps with an effective batch size of 64, a learning rate of 5e-5, a layer-wise learning rate decay of 0.9, a max answer length of 15, a max question length of 28, a max sequence length of 250, and 10 global tokens. Note that although Kedia et al. (2022) report that training for 15,000 steps leads to better performance, we found it to perform the same as 5,000 steps; thus we train with fewer steps to save computation. For OTT-QA, we use the same set of hyperparameters except that the max sequence length is changed to 500.
For the HotpotQA path reranker and reader, we prepare the input sequence as follows: "[CLS] Q [SEP] yes no [P] P1 [P] P2 [SEP]", where [P] is a special token denoting the start of a passage. The input sequence is encoded by the model, and we extract the passage start token representations $p_1, \ldots, p_m$ and the averaged sentence embeddings $s_1, \ldots, s_n$ for every sentence in the input to represent passages and sentences respectively. The path reranker is trained with three objectives: passage ranking, supporting sentence prediction, and answer span extraction, as we found the latter two objectives also aid passage ranking training. For answer extraction, the model is trained to predict the start and end token indices, as commonly done in recent literature (Xiong et al., 2021b; Zhu et al., 2021). For both passage ranking and supporting sentence prediction, the model is trained with the ListMLE loss (Xia et al., 2008). In particular, every positive passage in the sequence is assigned a label of 1, and every negative passage a label of 0. To learn a dynamic threshold, we also use the [CLS] token $p_0$ to represent a pseudo passage and assign it a label of 0.5. Finally, the loss is computed as follows:

$$\mathcal{L}_{rank} = -\sum_{i} \log \frac{\exp(W_p^\top p_i)}{\exp(W_p^\top p_i) + \sum_{p_j \in P_i} \exp(W_p^\top p_j)} \quad (9)$$

where $P_i$ contains all passage representations that have labels smaller than $p_i$'s, and $W_p \in \mathbb{R}^d$ are learnable weights with $d$ the hidden size. In other words, the model learns to assign scores such that positive passages > threshold > negative passages. Supporting sentence prediction is also trained using Equation 9. Overall, the training loss is a weighted combination of the passage ranking loss, the answer extraction loss $\mathcal{L}_a$, and the supporting sentence prediction loss $\mathcal{L}_s$.
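The ListMLE-style ranking loss with the pseudo-threshold passage can be sketched in plain Python; this is an illustrative re-implementation with hypothetical names, not the actual training code, and it scores each passage against the suffix of lower-labeled passages:

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def listmle_with_threshold(scores, labels):
    """Sketch of the ListMLE ranking loss (Xia et al., 2008) with a
    pseudo-threshold passage.

    scores: per-passage scores (conceptually W_p^T p_i).
    labels: 1.0 = positive, 0.5 = threshold ([CLS] pseudo passage), 0.0 = negative.
    Each passage competes against all passages with smaller labels, so the
    model learns: positive > threshold > negative.
    """
    order = sorted(range(len(scores)), key=lambda i: -labels[i])
    s = [scores[i] for i in order]
    # negative log-likelihood of the label-induced ordering
    return -sum(s[i] - logsumexp(s[i:]) for i in range(len(s))) / len(s)
```

A well-ordered scoring (positives above the threshold, negatives below) yields a lower loss than an inverted one, which is what drives the dynamic threshold to separate the two classes.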
During training, we sample 0-2 positive passages and 0-2 negative passages from the top 100 chains returned by COS, and the model encodes at most 3 passages, i.e., the passage chain structure is not preserved and the passages are sampled independently. We train the model for 20,000 steps with a batch size of 128, a learning rate of 5e-5, a layer-wise learning rate decay of 0.9, a max answer length of 30, a max question length of 64, and a max sequence length of 512. For inference, the model ranks the top 100 passage chains with structure preserved: we sum the scores of the two passages in every chain, subtract the dynamic threshold score, and sort the chains by this final score.
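The inference-time chain scoring rule can be sketched as follows; `rank_chains` is a hypothetical helper, and applying a single threshold score to all chains is a simplifying assumption:

```python
def rank_chains(chains, threshold_score):
    """Rank evidence chains at inference time.

    chains: list of (chain_id, [score_p1, score_p2]) pairs.
    The final score sums the two passage scores and subtracts the dynamic
    threshold score predicted by the path reranker.
    """
    scored = [(cid, sum(scores) - threshold_score) for cid, scores in chains]
    return sorted(scored, key=lambda x: x[1], reverse=True)
```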
Next, we train a reader model that only learns answer extraction and supporting sentence prediction. We train this model using only the two gold passages, with the following loss weighting:

$$\mathcal{L}_{reader} = \mathcal{L}_a + 0.5 \times \mathcal{L}_s \quad (11)$$

The model uses the same set of hyperparameters as the path reranker, except that the batch size is reduced to 32. At inference time, the model directly reads the top 1 prediction returned by the path reranker. Both models here are initialized from Electra-large.

C.2 Results
The NQ results are presented in Table A2. Overall, our model achieves performance similar to our own FiE baseline. The FiE baseline uses the reader data released by the FiD-KD model, which has an R100 of 89.3 (vs. 90.2 for COS). Considering that the gap between our method and the FiD-KD model's top-100 retrieval recall is relatively small, this result is not surprising.
The HotpotQA results are shown in Table A3. Overall, our results are similar to previous SOTA methods on the dev set. At the time of paper submission, we had not yet received the test set results from the leaderboard. We adopted the DPR evaluation scripts for all retrieval evaluations and the MDR evaluation scripts for all reader evaluations.

Figure 2: Chain-of-Skills (COS) model architecture with three different query types. The left blue box indicates the single retrieval query input. The middle green box is the expanded query retrieval input based on the single retrieval results. The right orange box is the entity-centric query with "deep learning" as the targeted entity.
bottom-center), we concatenate Q with the previous-hop evidence as done in Xiong et al. (2021b), i.e., [CLS] Q [SEP] $P_1^+$ [SEP]. Following the literature, the [CLS] vectors from both encoders are used to represent the questions and documents respectively. The training objective is the contrastive loss with in-batch negatives:

$$\mathcal{L} = -\log \frac{\exp(q^\top p^+)}{\sum_{p \in \mathcal{B}} \exp(q^\top p)}$$

where $q$ and $p$ are the [CLS] representations of the query and documents, and $\mathcal{B}$ contains the positive document $p^+$ and the in-batch negatives.

Figure 3: Expert configuration for COS at pretraining and fine-tuning. Each numbered box is a skill-specific expert. The lines denote input routing, where solid lines also indicate weight initialization mappings. Green lines highlight the expanded query routing, which differs between pretraining and fine-tuning.

Figure 5: Comparison of the effect of pretraining, using top-100 retrieval accuracy with COS inference.

Table 1: Zero-shot top-k accuracy on test sets for NQ, WebQ and EntityQuestions, and the dev set for HotpotQA. Following Ram et al. (2022), we report top-k retrieval accuracy (answer recall).

Table 2: Supervised top-k accuracy on NQ test.

Table 3: Supervised top-k accuracy on OTT-QA dev.
(Sciavolino et al., 2021) compared with BM25. In particular, we are interested to see whether Chain-of-Skills retrieval is more robust. Again, top-k retrieval accuracy is used.
Table 2), COS outperforms all baselines with or without pretraining. It is particularly encouraging that, despite being a smaller model, COS achieves performance superior to DPR-PAQ. The reasons are two-fold: Oguz et al. (

Table 5: Cross-dataset top-k accuracy on test sets.

Table 6: Ablation results on HotpotQA dev using top-k retrieval accuracy. All models are initialized from BERT-base and trained on HotpotQA only.

Table 7: Results of feeding the inputs to different experts, where the first two columns represent the query expert id and document expert id. * denotes the proposed setup.

task model following the same training protocol as COS, but BERT model weights are used for initialization. Both COS and the model without pretraining then use the same skill configuration for inference. The results are illustrated in Figure 5. Similar to the retrieval-only version (Figure

Table A1: Statistics of the datasets used in our experiments. Columns 2-4 represent the number of questions in each split; the last two columns contain the type of training data and the corresponding number of instances.

Table A2: End-to-end QA Exact Match score on NQ.