Strong and Efficient Baselines for Open Domain Conversational Question Answering

Unlike the Open Domain Question Answering (ODQA) setting, the conversational (ODConvQA) domain has received limited attention when it comes to reevaluating baselines for both efficiency and effectiveness. In this paper, we study the State-of-the-Art (SotA) Dense Passage Retrieval (DPR) retriever and Fusion-in-Decoder (FiD) reader pipeline, and show that it significantly underperforms when applied to ODConvQA tasks due to various limitations. We then propose and evaluate strong yet simple and efficient baselines, by introducing a fast reranking component between the retriever and the reader, and by performing targeted finetuning steps. Experiments on two ODConvQA tasks, namely TopiOCQA and OR-QuAC, show that our method improves the SotA results, while reducing the reader's latency by 60%. Finally, we provide new and valuable insights into the development of challenging baselines that serve as a reference for future, more intricate approaches, including those that leverage Large Language Models (LLMs).


Introduction
In an automated information-seeking conversation between two parties, a human questioner asks a series of questions and expects to receive relevant responses from the answering system (Oddy, 1977). Current State-of-the-Art (SotA) systems implement the answerer via two neural models, the Dense Passage Retrieval (DPR) (Karpukhin et al., 2020) retriever and the Fusion-in-Decoder (FiD) (Izacard and Grave, 2021b) reader. Their success stems from their ability to overcome certain limitations of their sparse and extractive counterparts, such as the reliance on lexical retrieval heuristics or the extraction of spans as responses (Chen et al., 2017; Yang et al., 2019; Lee et al., 2019; McCallum et al., 2019; Guu et al., 2020; Lewis et al., 2020; Shen et al., 2023).

* Work was done during an internship at Amazon Science.
Unlike the Open Domain Question Answering (ODQA) setting, a reassessment of the baselines in terms of both efficiency and effectiveness appears to be under-explored in the conversational (ODConvQA) domain. In this paper, we focus on the typical DPR retriever and FiD reader (DPR+FiD) pipeline, and show its limitations when applied to the ODConvQA setting. Despite its popularity, we find that this baseline significantly underperforms when finetuned on downstream tasks. We show that simple improvements in the training, architecture, and inference setups of the DPR+FiD pipeline provide a strong and efficient baseline that exceeds the performance of SotA models on two common ODConvQA datasets: TopiOCQA (Adlakha et al., 2022) and ORConvQA (OR-QuAC) (Qu et al., 2020).
We point out several limitations of the pipeline: 1) the reader's susceptibility to noisy input, 2) the retriever's reduced coverage, 3) the retriever's lack of cross semantic encoding between the conversation and the retrieved passages, and 4) the reader's latency being heavily impacted by the number of input passages. To mitigate these, we propose and evaluate a simple and effective approach that introduces a fast reranking component between the retriever and the reader, and performs targeted finetuning steps. The proposed Retriever-Reranker-Reader finetuning (R3FINE) strategy leads to baseline models with a better latency/performance trade-off. These baselines, which are simple and easy to replicate, serve as a reference point for comparing new and more complex models and determining their effectiveness. Our contributions are the following:
• We identify and address several limitations of the typical pipeline used in ODConvQA.
• We propose the R3FINE strategy, which improves SotA results on two common datasets and reduces pipeline's latency by 60%.
• We provide new and valuable insights for creating simple and efficient baselines, which serve as a reference point for future comparisons with new, more complex approaches.
End-to-End Baselines for ODConvQA

This section provides a brief introduction to the pipeline on which this work focuses. Figure 1 shows the typical pipeline used within the ODConvQA setting, featuring an additional reranker component. A conversation history is input to the DPR retriever. This module exploits a dual encoder based on the BERT (Devlin et al., 2019) model. First, it encodes the conversation history via the ConversationEncoder component, which takes as input the text of the conversation history c_1, c_2, ..., c_i and outputs a dense representation h_c. Next, this representation is used to perform a dense search to retrieve the most relevant passages, i.e., text blocks that serve as basic retrieval units, from an external knowledge source (e.g., Wikipedia). The latter contains dense representations of the passages that have been encoded via the PassageEncoder component, which takes as input the j-th passage of a given text length N, i.e., p^j_1, p^j_2, ..., p^j_N, and outputs a dense representation h_{p_j}. The dense search is performed via the Maximum Inner-Product Search (MIPS) function, which scores each passage as h_c^T · h_{p_j}. Once the top-k relevant passages have been retrieved, their text is appended to the conversation history and subsequently passed to the FiD reader, which is based on the T5 (Raffel et al., 2020) model. The newly created textual sequences of length S are then encoded in parallel via the Encoder component, which outputs a dense representation h = {h_1, ..., h_S}. As a final step, the dense representations of the entire list of k input passages are concatenated into a single sequence h^1 ⊕ h^2 ⊕ ... ⊕ h^k that forms the input to the Decoder component responsible for generating the answer a.
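The retrieval step described above can be sketched as follows. This is a minimal illustration: the random vectors stand in for ConversationEncoder/PassageEncoder outputs, and the exhaustive inner-product search stands in for an actual MIPS index (e.g., a FAISS flat index).

```python
import numpy as np

def mips_top_k(h_c, passage_index, k):
    """Return indices and scores of the top-k passages by the
    inner-product score h_c^T . h_{p_j} over a pre-encoded index."""
    scores = passage_index @ h_c           # (num_passages,) inner products
    top = np.argsort(-scores)[:k]          # highest-scoring passages first
    return top, scores[top]

# Placeholder embeddings standing in for the BERT-based dual encoder outputs.
rng = np.random.default_rng(0)
h_c = rng.standard_normal(768)                    # conversation representation
passage_index = rng.standard_normal((1000, 768))  # external knowledge source
top_idx, top_scores = mips_top_k(h_c, passage_index, k=10)
```

The text of the passages at `top_idx` would then be appended to the conversation history and passed to the FiD reader.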

Strong Baseline Models
This work focuses on two main datasets. TOPIOCQA (Adlakha et al., 2022) is a large-scale open-domain information-seeking conversational dataset that contains a challenging phenomenon in the form of topic switching. OR-QuAC (Qu et al., 2020) leverages CANARD's (Elgohary et al., 2019) context-independent question rewrites of the QuAC (Choi et al., 2018) dataset and adapts it to the open-domain setting. Further details regarding the datasets are provided in Appendix A.
We outline a number of limitations of the typical DPR+FiD pipeline, along with suggestions on how to mitigate them. While some of these relate, at different levels, to efforts made in the ODQA domain (Balachandran et al., 2021; Yu et al., 2022), our goal is to offer a perspective on the ODConvQA setting.

Current Limitations and Bottlenecks
Reader's susceptibility to noisy input. Previous findings have shown that the FiD reader performance significantly improves when increasing the number of retrieved passages (Izacard and Grave, 2021b). While confirming this finding, in Table 1 we also present a different perspective on it. We show that when the same reader model is provided with the relevant (i.e., gold) passage in input, the performance decreases as the number of retrieved passages increases. This suggests that there is a balance to strike when presenting input to the reader: if the retriever manages to retrieve the gold passage, a small relevant list is best; otherwise, a larger list is better.
Retriever's reduced coverage. Current solutions impose a hard top-k limit on the number of passages returned by the DPR retriever and assume that the relevant ones are present within this limit.
Table 1 shows that coverage is key during the retrieval phase for the reader to perform well. To improve it, we suggest introducing a simple and efficient Transformer-based (Vaswani et al., 2017) reranker component after the retriever. This component, shown in Figure 1 and described in the next paragraph, is designed to reconsider a larger pool of passages returned by the DPR and to provide the FiD with a reduced and improved list of passages. Since this module operates at the semantic level, we refer to it as the SemanticReranker. Table 2 shows the potential coverage margins and the retrieval results obtained after the introduction of such a module, when a larger number of passages (50 vs 1000) is considered.
Retriever's lack of cross semantic encoding between the conversation and the retrieved passages. The DPR retriever performs independent encoding of the passages via the PassageEncoder function. This means that it is not able to exploit the semantic relationship among them. This can be mitigated via the introduction of the previously mentioned SemanticReranker component. This new module is based on the TransformerEncoder and applies the following function:

ĥ_c, ĥ_{p_1}, ..., ĥ_{p_k} = TransformerEncoder(h_c, h_{p_1}, ..., h_{p_k})

where each element of the input attends to both the conversation dense representation h_c and the passage dense representations h_{p_i}. Reranking is performed over the new output sequence ĥ_c, ĥ_{p_1}, ..., ĥ_{p_k} via the previously mentioned MIPS function.

Reader's latency is heavily impacted by the number of input passages. Figure 2 shows that the latency of the reader can be significantly reduced by decreasing the number of input passages. However, a trivial restriction to the top-k considerably degrades the performance of the module, thus leading to an inevitable trade-off. The task of the SemanticReranker is to push relevant passages into the top-k list, allowing a low k value to be set.
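The joint re-encoding idea can be sketched as below. This is only an illustrative stand-in: a single-head self-attention layer with randomly initialised weights replaces the trained TransformerEncoder, so the scores are not meaningful, but the data flow (conversation and passage representations attending to one another, then MIPS over the refined vectors) matches the description above.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def semantic_rerank(h_c, h_p, k, rng):
    """Jointly re-encode [h_c, h_p1, ..., h_pK] with one self-attention layer,
    then rerank passages by the inner product of the refined representations."""
    d = h_c.shape[0]
    x = np.vstack([h_c[None, :], h_p])        # (K+1, d): conversation + passages
    # Random projections stand in for the trained layer weights (assumption).
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    attn = softmax((x @ Wq) @ (x @ Wk).T / np.sqrt(d), axis=-1)
    x_hat = x + attn @ (x @ Wv)               # residual connection
    scores = x_hat[1:] @ x_hat[0]             # MIPS over refined vectors
    order = np.argsort(-scores)[:k]           # new, reduced top-k list
    return order, scores[order]

rng = np.random.default_rng(0)
h_c = rng.standard_normal(16)                 # refined conversation embedding
h_p = rng.standard_normal((20, 16))           # 20 candidate passages
order, scores = semantic_rerank(h_c, h_p, k=5, rng=rng)
```

In the actual pipeline the candidate pool would be the 1000 passages returned by the DPR, reduced to a top-10 list for the reader.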

A Strong and Efficient Baseline
Based on the findings above, we introduce the Retriever-Reranker-Reader finetuning (R3FINE) strategy, which can be used to design strong and efficient baselines for ODConvQA. First, we increase the number of passages returned by the DPR from the initial 50 to 1000. Then, we add the SemanticReranker component, which corresponds to a single TransformerEncoder layer. We train/finetune the SemanticReranker along with the ConversationEncoder while keeping the PassageEncoder frozen. Guided by the intuition that fewer but more relevant passages are beneficial to FiD, as reported in Table 1, we finally perform an additional finetuning step by leveraging the new top-10 list of passages returned by the SemanticReranker.
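The freezing scheme of the R3FINE retriever finetuning step can be illustrated as follows. The parameter-name prefixes (`passage_encoder.` etc.) are hypothetical and would depend on the actual model implementation; real code would apply the same split to `model.named_parameters()` and set `requires_grad` accordingly.

```python
def r3fine_param_split(param_names):
    """Partition parameter names for the R3FINE finetuning step:
    train the SemanticReranker and ConversationEncoder,
    keep the PassageEncoder (and its pre-built index) frozen."""
    trainable, frozen = [], []
    for name in param_names:
        if name.startswith("passage_encoder."):  # hypothetical prefix
            frozen.append(name)
        else:
            trainable.append(name)
    return trainable, frozen

names = [
    "passage_encoder.layer.0.weight",
    "conversation_encoder.layer.0.weight",
    "semantic_reranker.attn.weight",
]
trainable, frozen = r3fine_param_split(names)
```

Freezing the PassageEncoder avoids re-encoding the external knowledge source, so the pre-computed passage index remains valid throughout finetuning.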

Experiments and Results
This section shows the impact that the introduction of the SemanticReranker module has on the pipeline, as well as the finetuning steps we followed to make the pipeline more efficient without compromising its performance.
Experimental Setup. As the starting point of our experiments, we used the DPR and FiD models provided with the TOPIOCQA dataset. Currently, only the train and dev splits are made available for this dataset. We followed the same experimental setup and exploited TOPIOCQA's DPR module for both datasets. Unlike TOPIOCQA, OR-QuAC is of extractive type, and for this reason we trained the FiD module from scratch by following the same training configuration as TOPIOCQA.
End-to-End Results. R3FINE achieves an F1 score of 59 points on TOPIOCQA and 32.9 on OR-QuAC, which are 3.9 and 3 points higher than the best models proposed in the original papers. It is worth noting that these large improvements are achieved with simple adjustments in the training, architecture, and inference setups of the well-established DPR+FiD baseline, and not via the introduction of new, heavier, and more complex models.
To further support our R3FINE strategy, in Table 3 we present an ablation study which quantifies its impact on the DPR+FiD pipeline. We note that introducing the SemanticReranker (w/ SR) always outperforms the DPR+FiD baseline (w/o SR), and at the same time it allows for a 5-fold input size reduction (top-10) while obtaining on-par or better results. In addition, a further finetuning step of the FiD (w/ SR + FT) outperforms the results obtained by the SemanticReranker (w/ SR) by 1.7 and 2.9 F1 points on TOPIOCQA and OR-QuAC, respectively. Further experiments and ablation studies are provided in Appendix A.
Finally, in Figure 2 it can also be observed that using top-10 instead of top-50 can reduce FiD's latency by 60% on average across the two datasets. We conducted a latency measurement to evaluate the impact of the SemanticReranker and its associated parameters, with detailed information available in Appendix A. Given that the SemanticReranker consists of a single TransformerEncoder layer, its parameters are negligible when compared to both the DPR and FiD. Moreover, the SemanticReranker accounts for only 0.34% of the overall latency of the FiD reader, adding an additional 2.4ms per example on top of the 710ms taken by FiD. It is important to note that this impact is only considered in relation to FiD, as the retrieval phase remains constant regardless of the inclusion of the SemanticReranker.

Conclusions
In this paper, we identified several limitations of the typical Dense Passage Retrieval (DPR) retriever and Fusion-in-Decoder (FiD) reader pipeline when applied in an ODConvQA setting. We proposed and evaluated an improved approach by including a fast reranking component between these two modules and by performing targeted finetuning steps. The proposed R3FINE strategy leads to a better latency/performance trade-off. The new baseline has proven to be both strong and efficient when compared to previous baselines, thus making it suitable for future comparisons of new approaches.

Limitations
The study presented in this work aimed to identify and address various limitations of the commonly used ODConvQA pipeline. While our approach may not be technically groundbreaking, the work's novelty lies in the presented findings on how to design strong and efficient baselines for ODConvQA. It should be noted that further research is needed to compare the performance of the proposed R3FINE strategy with other rerankers on non-conversational QA datasets, which would provide valuable insights into how effective the R3FINE approach is in different contexts.

Table 8 shows the dev split retrieval coverage before/after the introduction of the SemanticReranker (w/o SR) when a larger number of passages is considered (50 vs 1000). For TOPIOCQA, we report the presence of the gold passage within the top-k limit; for OR-QuAC, the presence of the gold answer. Table 9 shows the OR-QuAC test split retrieval coverage before/after the introduction of the SemanticReranker (w/o SR) when a larger number of passages is considered (50 vs 1000). We report the presence of the gold answer within the top-k limit.

A.5 Reader is susceptible to noisy input
Table 13 shows the FiD reader performance on the TOPIOCQA dev split, with/without the gold passage (w/o gold) in the top-k limit. This analysis is limited to the TOPIOCQA dataset as it is the only one to provide information about the gold passage for the dev split.

A.6 Further reader study
To better understand the impact that the introduction of the SemanticReranker has on the FiD reader, Table 14, Table 15, and Table 16 show the results obtained after taking a non-finetuned FiD and training it on the top-10 passages returned by the initial DPR retriever and on the top-10 passages returned by the SemanticReranker. On both datasets, we followed the same training configuration as the one used for TOPIOCQA.

A.7 Latency measurement
Latency measurement (see Figure 2) has been performed on the same NVIDIA V100 16GB GPU, by following the FiD's test_reader.py script provided with the TOPIOCQA dataset. We set the per_gpu_batch_size parameter to 4 in all runs and chose the value of the n_context parameter from 1, 3, 5, 10, 20, 30, and 50, based on the number of input passages. For each value, we report the latency relative to the maximum n_context parameter value, i.e., 50. We used CUDA event synchronization markers to measure the elapsed time for the preprocessing and evaluation of TOPIOCQA's dev split and OR-QuAC's test split.
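The relative-latency computation can be sketched as follows. Here `run_eval` is a hypothetical stand-in for one evaluation pass of the reader at a given n_context, and `time.perf_counter` replaces the CUDA-event timing used on GPU in this sketch.

```python
import time

def relative_latency(run_eval, n_context_values):
    """Time run_eval(n) for each candidate n_context value and report each
    latency relative to the largest value (as done for Figure 2)."""
    latencies = {}
    for n in n_context_values:
        start = time.perf_counter()
        run_eval(n)                               # one evaluation pass
        latencies[n] = time.perf_counter() - start
    base = latencies[max(n_context_values)]       # e.g., top-50 as reference
    return {n: lat / base for n, lat in latencies.items()}

# Dummy workload whose cost grows with n, standing in for the FiD reader.
rel = relative_latency(lambda n: sum(range(n * 10000)), [1, 10, 50])
```

By construction the reference value (n_context = 50) maps to 1.0, and smaller n_context values yield fractions of it.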

Figure 2 :
Figure 2: FiD reader performance (F1 score and latency) on the TOPIOCQA dev split and OR-QuAC test split, with varying top-k input passages. Latency is relative to the top-50 (top-k vs top-50).

Table 1 :
FiD reader performance (Exact Match and F1 scores) on the TOPIOCQA dev split, with/without the gold passage (w/o gold) in the top-k limit.

Table 3 :
FiD reader performance (Exact Match and F1 scores) on the TOPIOCQA dev split and OR-QuAC test split before/after the introduction of the SemanticReranker (w/o SR), together with the results obtained after a further reader finetuning step with top-10 output by the SR (w/ SR + FT). Underlined values indicate the results obtained by the DPR+FiD pipeline. Bold values indicate the results obtained after the introduction of the SR together with targeted finetuning steps.
Table 4: TOPIOCQA dev split baselines performance (Exact Match and F1 scores) comparison between sparse/dense retrievers (BM25/DPR Retriever) and extractive/generative readers (DPR Reader/FiD).
Table 4 and Table 5 compare our R3FINE strategy with previous baselines.

Table 7 :
Train split retrieval coverage before/after the introduction of the SemanticReranker (w/o SR) when a larger number of passages is considered (50 vs 1000).

Table 10 ,
Table 11, and Table 12 show the impact the introduction of the SemanticReranker has on the FiD reader. The input to the FiD reader is either the passages returned by the initial DPR retriever (w/o SR) or the passages returned by the SemanticReranker (w/ SR).

Table 8 :
Dev split retrieval coverage before/after the introduction of the SemanticReranker (w/o SR) when a larger number of passages is considered (50 vs 1000).

Table 10 :
FiD reader performance (Exact Match and F1 scores) on the TOPIOCQA dev split before/after the introduction of the SemanticReranker (w/o SR).

Table 11 :
FiD reader performance (Exact Match and F1 scores) on the OR-QuAC dev split before/after the introduction of the SemanticReranker (w/o SR).

Table 12 :
FiD reader performance (Exact Match and F1 scores) on the OR-QuAC test split before/after the introduction of the SemanticReranker (w/o SR).

Table 13 :
FiD reader performance (Exact Match and F1 scores) on the TOPIOCQA dev split, with/without the gold passage (w/o gold) in the top-k limit.

Table 14 :
FiD reader performance (Exact Match and F1 scores) on the TOPIOCQA dev split after taking a non-finetuned FiD and training it on the top-10 passages returned by the initial DPR retriever (w/o SR) and on the top-10 passages returned by the SemanticReranker (w/ SR).

Table 17 ,
Table 18, and Table 19 show instead the results obtained after taking an already finetuned FiD reader and further finetuning it on the top-10 passages returned by the initial DPR retriever and on the top-10 passages returned by the SemanticReranker. On both datasets, the amount of finetuning steps is equal to that used for training the already finetuned FiD reader.

Table 17 :
FiD reader performance (Exact Match and F1 scores) on the TOPIOCQA dev split after taking an already finetuned FiD and further finetuning it on the top-10 returned by the initial DPR retriever (w/o SR) and on the top-10 returned by the SemanticReranker (w/ SR).

Table 18 :
FiD reader performance (Exact Match and F1 scores) on the OR-QuAC dev split after taking an already finetuned FiD and further finetuning it on the top-10 passages returned by the initial DPR retriever (w/o SR) and on the top-10 passages returned by the SemanticReranker (w/ SR).

Table 19 :
FiD reader performance (Exact Match and F1 scores) on the OR-QuAC test split after taking an already finetuned FiD and further finetuning it on the top-10 passages returned by the initial DPR retriever (w/o SR) and on the top-10 passages returned by the SemanticReranker (w/ SR).