Training Adaptive Computation for Open-Domain Question Answering with Computational Constraints

Adaptive Computation (AC) has been shown to be effective in improving the efficiency of Open-Domain Question Answering (ODQA) systems. However, current AC approaches require tuning all model parameters, and training state-of-the-art ODQA models requires significant computational resources that may not be available to most researchers. We propose the Adaptive Passage Encoder, an AC method that can be applied to an existing ODQA model and can be trained efficiently on a single GPU. It keeps the parameters of the base ODQA model fixed, but overrides the default layer-by-layer computation of the encoder with an AC policy that is trained to optimise the computational efficiency of the model. Our experimental results show that our method improves upon a state-of-the-art model on two datasets, and is also more accurate than previous AC methods thanks to the stronger base ODQA model. All source code and datasets are available at https://github.com/uclnlp/APE.


Introduction
Open-Domain Question Answering (ODQA) requires finding relevant information for a given question and aggregating that information to produce an answer. The retriever-reader architecture, popularised by Chen et al. (2017), has shown great success in this task. The retriever acquires a set of documents from external sources (e.g., Wikipedia) and the reader extracts the answer spans from these documents (Clark and Gardner, 2018; Yang et al., 2019; Wang et al., 2019; Min et al., 2019; Asai et al., 2020). Recently, Lewis et al. (2020b) and Izacard and Grave (2020b) showed that generative reader models that exploit an encoder-decoder architecture can significantly outperform previous extractive models, thanks to their better capability in aggregating and combining evidence from multiple passages. However, these generative models are much more computationally expensive than extractive models and often need to be trained with a large number of passages, making it hard for most researchers to train them (Schwartz et al., 2020a). Wu et al. (2020) show that Adaptive Computation (AC) can significantly improve the efficiency of extractive ODQA models at inference time. However, their approach requires fine-tuning all model parameters with a multi-task learning objective, making it computationally challenging to apply to current state-of-the-art models.
In this work, we explore an efficient approach for applying adaptive computation to large generative ODQA models. We introduce the Adaptive Passage Encoder (APE), a module that can be added to the encoder of an existing ODQA model and has the following features: 1) it efficiently reuses the encoder's hidden representations for calculating the AC priorities; 2) it does not require tuning the base model and hence allows efficient training under limited resources; 3) it does not require confidence calibration. Our experimental results on NaturalQuestions and TriviaQA show that our method improves the performance of the state-of-the-art model FiD (Izacard and Grave, 2020b), while also producing more accurate results (by 12.4% EM) than the AC method proposed by Wu et al. (2020).

Related Work
Open Domain Question Answering ODQA is a task that aims to answer a factoid question given a document corpus. Most works in this domain follow the retriever-reader design first proposed by Chen et al. (2017). The retriever collects a set of relevant passages, then the reader comprehends and aggregates the information from multiple passages to produce the answer. Depending on the design of the reader model, these systems can be further categorised into extractive models and generative models. Extractive models (Min et al., 2019; Yang et al., 2019; Wang et al., 2019; Asai et al., 2020; Karpukhin et al., 2020) exploit an answer extraction model to predict the probabilities of answer spans, and use global normalisation (Clark and Gardner, 2018) to aggregate the answer probabilities across multiple passages.
However, thanks to recent advances in sequence-to-sequence pretrained language models (Raffel et al., 2020; Lewis et al., 2020a), generative ODQA models (Lewis et al., 2020b; Izacard and Grave, 2020b) achieve significant improvements upon extractive models, demonstrating stronger capability in combining evidence from multiple passages. We focus on generative models in this work.

Passage Retrieval Traditional retrieval methods rely on sparse term-based representations such as BM25 for ranking passages (Robertson, 2004). Recently, Karpukhin et al. (2020); Lewis et al. (2020b); Izacard and Grave (2020a) achieved substantial increases in retrieval performance using dense representations. Our work is based on the retrieval results from a dense retriever (Izacard and Grave, 2020b), but we show that the proposed method can still improve the quality of the support passages despite the strong retrieval performance. Nogueira and Cho (2019); Qiao et al. (2019); Mao et al. (2021) show that adding a separate cross-encoder re-ranker can improve performance, but at the cost of a significant increase in computation at training or inference time. Although our proposed adaptive passage encoder can be viewed as an encoder with an integrated re-ranker, the focus of our work is computational efficiency, namely enhancing performance without a substantial increase in computation.
Adaptive Computation Adaptive computation allows a model to condition its computational cost on the input. For example, Schwartz et al. (2020b); Xin et al. (2020) propose models that can dynamically decide to exit early at intermediate layers when the confidence at a layer exceeds a threshold. They show that adaptive early exiting can significantly reduce the computational cost of various sequence classification tasks. Closest to our work, Wu et al. (2020) introduced adaptive computation for extractive ODQA models. We extend adaptive computation to generative ODQA models, and our approach can be incorporated into existing generative ODQA models without fine-tuning the base model.

Method
In this section, we introduce the base model and describe how our proposed adaptive passage encoder works with it.

Base Model
Large generative ODQA models (Lewis et al., 2020b; Izacard and Grave, 2020b) share a similar encoder-decoder architecture. They first concatenate the question with each retrieved passage. Then the encoder encodes all passages and produces their hidden representations $h_1^L, \cdots, h_N^L$, where $L$ is the number of encoder layers and $N$ is the number of retrieved passages. We denote the hidden representation of the $i$-th passage at its $j$-th encoder layer as $h_i^j$. The decoder attends to these hidden representations and generates the answer tokens sequentially.
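To make the encoding scheme concrete, the following toy sketch mimics the per-passage, layer-by-layer encoding and the concatenation of passage representations that the decoder attends over. All shapes, the random inputs, and the `encoder_layer` stand-in are illustrative assumptions, not FiD's actual implementation.

```python
import numpy as np

# Layers, passages, tokens per passage, hidden size: toy values only.
L, N, T, D = 4, 3, 16, 8
rng = np.random.default_rng(0)

def encoder_layer(h):
    # Stand-in for one transformer encoder layer (a real layer applies
    # self-attention and a feed-forward block).
    return np.tanh(h)

# Each (question + passage) sequence is encoded independently,
# layer by layer, producing h_i^1 ... h_i^L.
passages = [rng.normal(size=(T, D)) for _ in range(N)]
hidden = []
for h in passages:
    for _ in range(L):
        h = encoder_layer(h)
    hidden.append(h)  # h_i^L

# The decoder attends over the concatenation of all final passage
# representations and generates answer tokens sequentially.
decoder_memory = np.concatenate(hidden, axis=0)
print(decoder_memory.shape)  # (48, 8)
```

Because each passage is encoded independently, the layer-by-layer loop above is exactly the computation that the adaptive passage encoder will later reschedule per passage.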

Adaptive Passage Encoder
As shown in Fig. 1, the adaptive passage encoder overrides the layer-by-layer computation of the encoder of the base model with an adaptive computation policy. It adds two components on top of the base encoder to define the policy: an answerability prediction model HasAnswer and a scheduler.
The HasAnswer model predicts the probability that a passage contains an answer to the question, given its hidden representation $h_i^j$. It first pools the hidden representation $h_i^j$ into a vector, then feeds the pooled representation to a multi-layer perceptron to produce the probability $p_i^j$. The scheduler is then responsible for the selection and prioritisation of passages that are likely to contain the answer. As shown by the blue arrows in Fig. 1, the scheduler learns a scheduling policy to allocate encoder layer computation to passages. The scheduler exits at early layers for spurious passages, while allocating more layers to the ones that it finds promising.
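A minimal sketch of such an answerability head, assuming max-pooling over token states (the pooling variant used in our experiments) followed by a one-hidden-layer MLP; the layer sizes and weights here are hypothetical placeholders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def has_answer_prob(h, W1, b1, w2, b2):
    """Hypothetical HasAnswer head: max-pool the token states of a
    passage, then apply a one-hidden-layer MLP to get a probability."""
    pooled = h.max(axis=0)         # max-pooling over tokens: (T, D) -> (D,)
    z = np.tanh(pooled @ W1 + b1)  # hidden layer
    return sigmoid(z @ w2 + b2)    # p_i^j in (0, 1)

rng = np.random.default_rng(0)
T, D, H = 16, 8, 4
p = has_answer_prob(rng.normal(size=(T, D)),
                    rng.normal(size=(D, H)), np.zeros(H),
                    rng.normal(size=H), 0.0)
print(0.0 < float(p) < 1.0)  # True
```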
To achieve this goal, the scheduler produces a priority score $q_n$ for each passage:

$$q_n = \sigma\big(g(p_n^{l_n}, n, l_n)\big)\, p_n^{l_n} + f(p_n^{l_n}, n, l_n) \qquad (1)$$

where $n$ is the passage's rank assigned by the retriever, $l_n$ is the index of its current encoder layer, and $g$ and $f$ are two multi-layer perceptrons that learn the weight and bias, respectively. Starting at the initial layer for all passages, the scheduler selects the passage with the maximum priority, forwards it through one encoder layer ($l_n \leftarrow l_n + 1$), and updates its priority $q_n$ using the new hidden representation $h_n^{l_n}$ and has-answer probability $p_n^{l_n}$. This process iterates for $B$ (budget) steps, and only the $k$ passages with the most layers computed are retained in the end.
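The scheduling loop above can be sketched with a priority queue; in this sketch, `priority_fn(n, l)` stands in for Eq. (1) and the actual encoder-layer forward pass is elided:

```python
import heapq

def schedule(priority_fn, N, L, B, k):
    """Sketch of the adaptive scheduling loop: repeatedly advance the
    highest-priority passage by one encoder layer until the budget B is
    spent, then keep the k passages with the most layers computed.
    priority_fn(n, l) stands in for the priority score q_n after passage
    n has been forwarded through l encoder layers."""
    layers = {n: 1 for n in range(N)}                    # all passages start at layer 1
    heap = [(-priority_fn(n, 1), n) for n in range(N)]   # max-heap via negated scores
    heapq.heapify(heap)
    for _ in range(B):
        if not heap:
            break                                        # every passage fully encoded
        _, n = heapq.heappop(heap)                       # passage with maximum priority
        layers[n] += 1                                   # forward one encoder layer
        if layers[n] < L:
            heapq.heappush(heap, (-priority_fn(n, layers[n]), n))
    # retain the k passages with the most layers computed
    return sorted(layers, key=layers.get, reverse=True)[:k]

# Toy priorities: passages 0 and 2 look promising, so they receive most
# of the layer budget and are the ones retained.
prios = {0: 0.9, 1: 0.1, 2: 0.8, 3: 0.2}
top = schedule(lambda n, l: prios[n], N=4, L=6, B=12, k=2)
print(top)  # [0, 2]
```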

Training the Adaptive Passage Encoder
Unlike Wu et al. (2020), our method does not require tuning the underlying base model. Since the number of parameters introduced by the HasAnswer model and the scheduler is less than 4% of that of the base model, APE can be trained very efficiently. The HasAnswer model is first trained with a cross-entropy loss, supervised by the has-answer labels of the passages. Then we fix HasAnswer and train the scheduler with the REINFORCE algorithm (Williams, 1992) to maximise the expected return, which is defined to encourage the selection and prioritisation of passages that contain the answer. A selection action gains a positive reward $(1 - c)$ if it selects a relevant passage, and a negative reward $-c$ otherwise. Since the weight $g$ and bias $f$ in Eq. (1) are automatically learned during the training of the scheduler, our method does not require confidence calibration of the HasAnswer model, unlike the method proposed by Wu et al. (2020).
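The reward scheme above yields discounted returns that REINFORCE uses to weight the scheduler's log-probabilities; a minimal sketch (the values of c and the discount factor match the ones reported later in our setup, but the episode itself is a toy example):

```python
def episode_returns(selected_relevant, c=0.1, gamma=0.8):
    """Discounted returns for one scheduling episode under the reward
    scheme described above: a selection earns (1 - c) if the chosen
    passage contains the answer, and -c otherwise. selected_relevant is
    the per-step sequence of booleans for one episode."""
    rewards = [(1 - c) if rel else -c for rel in selected_relevant]
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G  # G_t = r_t + gamma * G_{t+1}
        returns.append(G)
    return list(reversed(returns))

# REINFORCE then ascends sum_t G_t * grad log pi(a_t | s_t),
# making high-return selections more likely.
rets = episode_returns([True, False, True])
print([round(g, 3) for g in rets])  # [1.396, 0.62, 0.9]
```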

Experimental Setup
Datasets Following Izacard and Grave (2020b), we evaluate our method on NaturalQuestions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017), whose statistics are shown in Table 1.

Evaluation Metrics Following Wu et al. (2020), we conduct the evaluation under different computational costs at inference time. Since the number of passages k is almost linearly correlated with memory consumption and the number of operations, we evaluate performance with a varying number of passages k ∈ {5, 10, 20}. To evaluate the end performance of ODQA models, we use the standard Exact Match (EM) score, i.e., the proportion of questions whose predicted answer matches the ground truth exactly. We also include an unrestricted setting to compare the best performance of the different models.

Technical Details We use FiD (Izacard and Grave, 2020b) as our base model. FiD-base and FiD-large contain L = 12 and 24 layers respectively, and we set the budget B = Lk. For the pooling operation in the HasAnswer model, we found that max-pooling works better than mean-pooling or the [CLS] token, so max-pooling is used in all our experiments. We use a discount factor γ = 0.8 and step penalty c = 0.1 during the REINFORCE training of the scheduler. More hyperparameters are presented in Appendix A.1.
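The Exact Match score is typically computed with SQuAD-style answer normalisation; the sketch below shows one common implementation (the exact normalisation rules are an assumption, not necessarily the evaluation script used here):

```python
import re
import string

def normalize(s):
    """SQuAD-style answer normalisation: lowercase, drop punctuation,
    drop articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, golds):
    """1 if the prediction equals any ground-truth answer after
    normalisation, else 0; the reported EM score is the mean over all
    questions."""
    return int(any(normalize(prediction) == normalize(g) for g in golds))

print(exact_match("The Eiffel Tower", ["Eiffel Tower"]))  # 1
print(exact_match("Paris", ["Eiffel Tower"]))             # 0
```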
Computational Feasibility Tuning a FiD-base model with k = 20 or a FiD-large model with k = 10 (batch size 1) would yield out-of-memory errors on a V100 (16GB) GPU. Hence, it is infeasible to train FiD with the previous AC method (Wu et al., 2020) in our setting. However, training with our proposed approach can be done in the same setting with a batch size of 4 or larger within 8-15 hours.

Experimental Results
As shown in Table 2 under restricted top-k, our proposed method improves upon the FiD model on both datasets, and by a statistically significant margin on TriviaQA. It also outperforms the previous AC method (Wu et al., 2020) by 12.4% EM when k = 10, owing to the stronger base model. The addition of APE allows FiD to significantly outperform RAG (Lewis et al., 2020b) on NaturalQuestions when k ∈ {10, 20}. Previous adaptive computation methods (Schwartz et al., 2020b) were reported to plateau or degrade in performance in the unrestricted setting. However, Table 2 shows that our approach does not suffer from this issue.

Analysis of Passage Quality
To understand how APE outperforms the baselines, we analyse the quality of the final top-k passages retained by APE. Table 3 reports their top-k retrieval accuracy. The results show that the top-k accuracy of the collection of documents selected by APE is significantly better than that of BM25, DPR, and FiD, which are strong retrieval baselines for ODQA. Combined with Table 2, this indicates that the better passage quality yielded by APE helps to improve the end ODQA performance of the model.

Conclusions
In this work, we explore an adaptive computation method that can be efficiently applied to an existing generative ODQA model. We find that, by replacing the encoder of a generative ODQA model with our proposed adaptive passage encoder, we can train an effective adaptive computation policy without tuning the base model. This makes it possible to apply adaptive computation to large state-of-the-art generative models, which was previously computationally challenging. Our experimental results show that our method produces more accurate results than a state-of-the-art generative model on both NaturalQuestions and TriviaQA, and that it outperforms the previous AC method by a large margin. Our analysis also shows that our approach achieves better passage quality, which leads to improvements in ODQA performance.