Ensemble Transformer for Efficient and Accurate Ranking Tasks: an Application to Question Answering Systems

Large transformer models can substantially improve Answer Sentence Selection (AS2), but their high computational costs prevent their use in many real-world applications. In this paper, we explore the following research question: how can we make AS2 models more accurate without significantly increasing their model complexity? To address this question, we propose a Multiple Heads Student architecture (named CERBERUS), an efficient neural network designed to distill an ensemble of large transformers into a single smaller model. CERBERUS consists of two components: a stack of transformer layers that encodes inputs, and a set of ranking heads; unlike traditional distillation techniques, each head is trained by distilling a different large transformer architecture in a way that preserves the diversity of the ensemble members. The resulting model captures the knowledge of heterogeneous transformer models using just a few extra parameters. We show the effectiveness of CERBERUS on three English datasets for AS2: our proposed approach outperforms all single-model distillations we consider, rivaling state-of-the-art large AS2 models that have 2.7x more parameters and run 2.5x slower. Code for our model is available at https://github.com/amazon-research/wqa-cerberus


Introduction
Answer Sentence Selection (AS2) is a core task for designing efficient retrieval-based Web QA systems: given a question and a set of answer sentence candidates (e.g., retrieved by a search engine), AS2 models select the sentence that correctly answers the question with the highest probability.
AS2 research originated from the TREC competitions (Wang et al., 2007), which targeted large amounts of unstructured text. AS2 models are very efficient, and can enable Web-powered question answering systems of real-world virtual assistants such as Alexa, Google Home, Siri, and others.
As in most research areas in text processing and retrieval, AS2 has been dominated by the use of ever larger transformer architectures (Vaswani et al., 2017). These models are typically pre-trained using language modeling tasks on large amounts of text (Devlin et al., 2019; Liu et al., 2019; Conneau et al., 2019), and then fine-tuned on specific downstream tasks (Wang et al., 2018, 2019; Hu et al., 2020). Garg et al. (2020) achieved an impressive accuracy by fine-tuning pre-trained transformers to the AS2 task on the target datasets. They established the new state-of-the-art performance for AS2 using a RoBERTa LARGE model.
Unfortunately, larger transformer models come at a cost: they require large computing resources, consume a lot of energy (critically impacting the environment (Strubell et al., 2019)), and may have unacceptable latency and/or memory usage. These downsides are critical for AS2 applications, where, for any given query, a model is required to score hundreds or thousands of candidates to select the top-k answers. Therefore, in this work, we investigate how AS2 models can be made more accurate without significantly increasing their complexity.
Previous work has addressed the general problem of the high computational cost of transformer models by developing techniques for reducing their overall size while maintaining most of their performance (Polino et al., 2018; Liu et al., 2018; Li et al., 2020). In particular, Knowledge Distillation (KD) techniques have been shown to be particularly effective (Sanh et al., 2019; Turc et al., 2019; Sun et al., 2019, 2020; Yang et al., 2020; Jiao et al., 2020). KD techniques use a larger model, known as a teacher, to obtain a smaller and thus more efficient model, known as a student (Hinton et al., 2015). The student is trained to mimic the output of the teacher. However, we empirically show that, at least for AS2, BASE models trained through distillation are still significantly behind the state of the art, i.e., models based on LARGE transformers.
In this paper, we introduce a new transformer model for AS2 that matches the state of the art while being dramatically more efficient. Our main idea is based on the following considerations: first, in recent years, several transformer model families have been introduced, each pretrained using different datasets and modeling techniques (Rogers et al., 2021). Second, ensembling several diverse models has been shown to be an effective way to improve performance in many question answering and ranking tasks (Xu et al., 2020; Zhang et al., 2020; Liu et al., 2020; Lin and Durrett, 2020). Our contribution lies in a new approach to approximate a computationally expensive ranking ensemble with a single efficient architecture for AS2 tasks.
More specifically, our investigation proceeds as follows. First, we optimize ranking architectures for AS2 by training k student models to replicate k unique teacher architectures. When ensembled, we show that they achieve better performance than any standalone model, at the cost of an increased computational burden. Then, to preserve the accuracy of this ensemble while achieving lower complexity, we propose a new Multiple Heads Student architecture, which we refer to as CERBERUS. As shown in Fig. 1, CERBERUS is composed of a shared encoder body and multiple ranking heads. The encoder body derives a shared representation of input sequences, which is fed to the ranking heads. We show that if each ranking head is trained to mimic a unique teacher distribution, it is possible to achieve the desirable diversity of an ensemble model while being significantly more efficient.
We train a CERBERUS model using three different teachers: RoBERTa (Liu et al., 2019), ELECTRA (Clark et al., 2019), and ALBERT (Lan et al., 2019). We conduct experiments on three AS2 datasets: ASNQ (Garg et al., 2020), WikiQA (Yang et al., 2015), and an internal corpus (IAS2). Our results show that CERBERUS consistently improves over all models trained with single teachers, rivaling the performance of much larger models, including multiple variants of ensemble models; further, CERBERUS matches current state-of-the-art AS2 models (TANDA by Garg et al. (2020)), while saving 64% and 60% in model size and latency, respectively.
In summary, our contribution is four-fold: (i) We propose CERBERUS, an efficient architecture specifically designed to distill an ensemble of heterogeneous transformer models into a single transformer model for AS2 tasks while preserving ensemble diversity.
(ii) We conduct large-scale experiments with multiple transformer model families and show that CERBERUS improves performance over equally sized distilled models, rivaling much larger ensemble and state-of-the-art AS2 models.
(iii) We discuss various training methods for CERBERUS and identify three key factors that improve AS2 performance: (a) multiple ranking heads in CERBERUS, (b) multiple teachers, and (c) heterogeneity in teacher models.
(iv) We present a comprehensive analysis of CERBERUS, both in terms of ranking behavior and efficiency, highlighting the effect of several design decisions on its performance.
Related Work

Answer Sentence Selection (AS2)
Several approaches for AS2 have been proposed in recent years. Severyn and Moschitti (2015) used CNNs to learn and score question and answer representations, while others proposed alignment networks (Shen et al., 2017; Tran et al., 2018; Tay et al., 2018). Compare-and-aggregate architectures have also been extensively studied (Wang and Jiang, 2016; Bian et al., 2017; Yoon et al., 2019; Matsubara et al., 2020).

Previous studies on transformer distillation have also leveraged intermediate representations (Sun et al., 2019, 2020; Jiao et al., 2020; Mukherjee and Awadallah, 2020; Liang et al., 2020). These approaches typically lead to more accurate performance, but severely limit which pairings of teachers and students can be used (e.g., same transformer family/tokenization, identical hidden dimensions).

Ensemble Distillation
Yang et al. (2020) discussed two-stage multi-teacher knowledge distillation for QA tasks. Similarly, Jiao et al. (2020) used BERT models as teachers for their proposed model, TinyBERT, in a two-stage learning strategy. Unlike their two-stage approach, our study focuses on distilling the knowledge of multiple teachers while preserving the individual teacher distributions. Furthermore, we explore several pretrained transformer models for knowledge distillation instead of focusing on a specific architecture. More recently, Allen-Zhu and Li (2020) formally proved that an ensemble of models of the same family can be distilled into a single model while retaining the same performance as the ensemble; however, their experiments focus exclusively on ResNet models for image classification tasks. Kwon et al. (2020) tried to dynamically select, for each training sample, one among a set of teachers. These studies focus distillation on models that strictly share the same architecture and training strategy, which we show does not achieve the same accuracy as our CERBERUS model.

Multi-head Transformers
To the best of our knowledge, no previous work discusses multi-head transformer models for ranking problems; however, some related work exists for classification tasks. TwinBERT (Lu et al., 2020)

Methodology
We build up to introducing CERBERUS by first formalizing the AS2 task (Section 3.1), and then summarizing typical transformer distillation and ensembling techniques (Section 3.2). Finally, the details of the CERBERUS approach are explained in Section 3.3.

Training Transformer Models for Answer Sentence Selection (AS2)
The AS2 task consists of selecting the correct answer from a set of candidate sentences for a given question. Like many other ranking problems, it can be formulated as a max element selection task: given a query q ∈ Q and a set of candidates A = {a_1, . . ., a_n}, select the a_j that is an optimal element for q. We can model the task as a selector function π : Q × P(A) → A, defined as π(q, A) = a_j, where P(A) is the powerset of A, j = argmax_i p(a_i|q), and p(a_i|q) is the probability that a_i is the required element for q. In this work, we evaluate CERBERUS, as well as all our baselines, as estimators of p(a_i|q) for the AS2 task. In the remainder of this work, we formally refer to an estimator using an uppercase calligraphic letter and a set of model parameters Θ, e.g., M_Θ. We fine-tune three models to be used as teachers T_Θ: RoBERTa LARGE, ELECTRA LARGE, and ALBERT XXLARGE. The first two share the same architecture, consisting of 24 layers and a hidden dimension of 1,024, while ALBERT XXLARGE is wider (4,096 hidden units) but shallower (12 layers). All three models are optimized using cross-entropy loss in a point-wise setting, i.e., they are trained to maximize the log likelihood of the binary relevance label for each answer separately.
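The selector formulation above can be sketched in a few lines. The toy lexical-overlap `score` function below is purely illustrative; it stands in for the transformer estimators of p(a_i|q) used in the paper.

```python
# Minimal sketch of the point-wise AS2 selector pi(q, A).
# `score` is a stand-in for any estimator of p(a_i | q); here it is a toy
# lexical-overlap scorer, not one of the transformer models from the paper.

def score(question: str, answer: str) -> float:
    """Toy estimator of p(a|q): fraction of question tokens found in the answer."""
    q_tokens = set(question.lower().split())
    a_tokens = set(answer.lower().split())
    return len(q_tokens & a_tokens) / max(len(q_tokens), 1)

def select_answer(question: str, candidates: list) -> str:
    """pi(q, A) = a_j with j = argmax_i p(a_i | q): return the top-scoring candidate."""
    return max(candidates, key=lambda a: score(question, a))

q = "when was the moon landing"
A = ["The Apollo 11 moon landing was in 1969.", "The moon orbits the Earth."]
print(select_answer(q, A))
```

In a real system, `score` would be a fine-tuned cross-encoder applied independently to each question-candidate pair, exactly as in the point-wise training setup described above.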
While approaches that optimize the ranking over multiple samples (such as pair-wise or listwise methods) could also be used (Bian et al., 2017), they would not change the overall findings of our study; further, point-wise methods have been shown to achieve competitive performance for transformer models (MacAvaney et al., 2019).
When training models for the IAS2 and WikiQA datasets, we follow the TANDA technique introduced by Garg et al. (2020): models are first fine-tuned on ASNQ to transfer to the QA domain, and then adapted to the target task.
Besides the three teacher models, we also train their equivalent BASE version, namely RoBERTa BASE , ELECTRA BASE , and ALBERT BASE .These baselines serve as a useful comparison for measuring the effectiveness of distillation techniques.

Distilled Models and Ensembles
Knowledge distillation (KD), as defined by Hinton et al. (2015), is a training technique in which a larger, more powerful teacher model T_Θ is used to train a smaller, more efficient model, often dubbed the student model S_Θ. S_Θ is typically trained to minimize the difference between its output distribution and the teacher's. If labeled data is available, it is often used in conjunction with the teacher output, as this often leads to improved performance (Ba and Caruana, 2014). In these cases, we train S_Θ using a soft loss with respect to its teacher and a hard loss with respect to the human-annotated labels.
To distill the three LARGE models introduced in Section 3.1, we use the loss formulation from Hinton et al. (2015), as it performs comparably to other, more recent distillation techniques (Tian et al., 2019). Given an input sequence x and the target label y, it is defined as follows:

L_KD(x, y) = α L_H(y, S_Θ(x)) + (1 − α) τ² L_S(T_Θ(x), S_Θ(x)),    (1)

where α and τ indicate a balancing factor and temperature for distillation, respectively. We independently tune the hyperparameters α ∈ {0.0, 0.1, 0.5, 0.9} and τ ∈ {1, 3, 5} for each dataset on their respective dev sets. As previously mentioned, we use cross entropy as the hard loss L_H for all our experiments. L_S is a soft loss function based on the Kullback-Leibler divergence KL(p(x), q(x)), where p(x) and q(x) are the softened (temperature-scaled) probability distributions of the teacher T_Θ and the student S_Θ for a given input x, defined as follows:

KL(p(x), q(x)) = Σ_{c ∈ C} p_c(x) log( p_c(x) / q_c(x) ),

where C indicates the set of class labels.
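As a sketch, the Hinton-style loss above can be written in PyTorch as follows; the function name `kd_loss` and the toy batch are our own illustrative choices, and the τ² factor compensates for the 1/τ² gradient scaling of the softened soft loss.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, alpha=0.5, tau=3.0):
    """Hinton-style distillation: alpha * hard CE + (1 - alpha) * tau^2 * soft KL.
    The tau^2 factor keeps the soft-loss gradients on the same scale as the hard loss."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),   # softened student log-probs
        F.softmax(teacher_logits / tau, dim=-1),       # softened teacher probs
        reduction="batchmean",
    )
    return alpha * hard + (1.0 - alpha) * tau**2 * soft

# Toy batch: 4 examples with binary relevance labels, as in point-wise AS2.
s = torch.randn(4, 2)                 # student logits
t = torch.randn(4, 2)                 # teacher logits
y = torch.tensor([0, 1, 1, 0])        # gold labels
print(kd_loss(s, t, y).item())
```

When the student matches the teacher exactly and α = 0, the loss collapses to (approximately) zero, since KL(p, p) = 0.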
Using the technique described above, we distill the three LARGE models into their corresponding BASE counterparts, e.g., ALBERT BASE from ALBERT XXLARGE, and so on. Furthermore, we create an ensemble of BASE models by linearly combining their outputs; hyperparameters for ensembles were tuned with Optuna (Akiba et al., 2019).
Finally, we build another ensemble of three ELECTRA BASE models distilled from the three LARGE models mentioned above. As we will show in Section 4, ELECTRA BASE outperforms all other BASE models; therefore, we are interested in measuring whether it could be used for inter-transformer-family model distillation. Once again, Optuna was used to tune the ensemble.
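A minimal sketch of the linear-combination ensembling described above; the per-model probabilities and the weights below are hypothetical (the paper tunes the weights with Optuna on the dev set).

```python
import torch

def ensemble_scores(prob_list, weights):
    """Linearly combine each model's probability of correctness per candidate."""
    stacked = torch.stack(prob_list)           # (num_models, num_candidates)
    w = torch.tensor(weights).unsqueeze(1)     # (num_models, 1), broadcast over candidates
    return (w * stacked).sum(dim=0)            # (num_candidates,)

# Hypothetical probabilities from three distilled BASE students for 3 candidates.
p_roberta = torch.tensor([0.2, 0.7, 0.1])
p_electra = torch.tensor([0.3, 0.6, 0.4])
p_albert  = torch.tensor([0.1, 0.8, 0.2])

combined = ensemble_scores([p_roberta, p_electra, p_albert], [0.3, 0.4, 0.3])
print(int(combined.argmax()))  # index of the candidate the ensemble selects
```

Note that while this improves accuracy, it requires running all member models at inference time, which motivates distilling the ensemble into a single CERBERUS model.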
We note that the ensemble of the three LARGE models is not used as a teacher. In our preliminary experiments, we found that the ensemble is not a good teacher, as it was too confident in its predictions, a trend studied by Panagiotatos et al. (2019). Most softmaxed category probabilities produced by the ensemble are close to either 0 or 1 and thus behave like hard targets rather than soft targets, which did not improve over the KD baselines (rows 7-9 in Table 2).

CERBERUS: Multiple-Heads Student
As mentioned in the previous section, students trained using different teachers can be trivially ensembled using a linear combination of their outputs. However, this results in a drastic increase in model size, as well as a synchronization latency overhead, both of which are undesirable in many applications. In this section, we introduce CERBERUS, a transformer architecture designed to emulate the properties of an ensemble of distilled models while being more efficient. As illustrated in Fig. 2, our CERBERUS model consists of two components: (i) an input encoder comprised of stacked transformer layers, and (ii) a set of k ranking heads, each designed to be trained with respect to a specific teacher. Each ranking head is comprised of one or more transformer layers; it receives as input the output of the shared encoder, and produces a classification output. To obtain its final prediction, CERBERUS averages the outputs of its ranking heads.
Formally, let M_Θ be a pretrained transformer of n layers. To obtain a CERBERUS model, we first split the model into two groups: the first b blocks are used for the shared encoder body B_b, while the next h = (n − b) blocks are replicated and assigned as initial states for each head H_h^i, i = {1, . . ., k}. To compute the output of the i-th head, we first encode an input x using B_b, and then use the result as input to H_h^i. To train CERBERUS, we use a linear combination of k loss functions, each of which uses the output of a different ranking head:

L_CERBERUS = Σ_{i=1}^{k} λ_i L_i,    (2)

where λ_i and L_i are the weight and loss function for the i-th head of the CERBERUS model. Specifically, we apply the loss function of Equation 1 to each head, i.e., L_i = L_KD for the i-th head-teacher pair. We note that, while the encoder body and all ranking heads are trained jointly, each head is optimized only by its own loss. Conversely, when backpropagating L_CERBERUS, the parameters of the encoder body are affected by the outputs of all k ranking heads. This ensures that each head learns faithfully from its teacher while the parameters of the encoder body remain suitable for the entire model.
For inference, a single score for CERBERUS is obtained by averaging the outputs of all ranking heads:

score(x) = (1/k) Σ_{i=1}^{k} H_h^i(B_b(x)).

In our experiments, we use k = 3 heads, each trained with one of the LARGE models described in Section 3.1. We discuss a variety of combinations of values for b and h; the performance of each configuration is analyzed in Section 5.4. For training, we set λ_i = 1 for all i = {1, . . ., k} and reuse the search space of the hyperparameters α and τ for knowledge distillation (see Section 3.2).
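The B_b kH_h split and the averaged inference can be sketched as follows. Generic `nn.TransformerEncoderLayer` blocks stand in for pretrained ELECTRA layers, and all sizes (hidden dimension, depth) are illustrative toy values, not the paper's configuration.

```python
import copy
import torch
import torch.nn as nn

class Cerberus(nn.Module):
    """Sketch of a B_b kH_h student: b shared body blocks, then k heads of
    h = n_layers - b blocks each, with one classifier per head."""

    def __init__(self, d_model=64, n_layers=4, b=3, k=3, n_classes=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        blocks = [copy.deepcopy(layer) for _ in range(n_layers)]
        self.body = nn.Sequential(*blocks[:b])
        # Each head starts from a replica of the remaining blocks.
        self.heads = nn.ModuleList(
            nn.Sequential(*[copy.deepcopy(blk) for blk in blocks[b:]]) for _ in range(k)
        )
        self.classifiers = nn.ModuleList(nn.Linear(d_model, n_classes) for _ in range(k))

    def forward(self, x):
        shared = self.body(x)  # (batch, seq, d_model), computed once for all heads
        # One logit vector per head, read from the [CLS]-like first position.
        return [clf(head(shared)[:, 0]) for head, clf in zip(self.heads, self.classifiers)]

    def score(self, x):
        """Inference: average the k heads' probabilities of the positive class."""
        probs = [logits.softmax(dim=-1)[:, 1] for logits in self.forward(x)]
        return torch.stack(probs).mean(dim=0)

model = Cerberus()
x = torch.randn(2, 8, 64)      # (batch, seq_len, hidden) of pre-embedded inputs
print(model.score(x).shape)    # one score per example in the batch
```

During training, each element of `forward(x)` would receive its own KD loss against a distinct teacher, and the weighted sum of those losses would be backpropagated, so the shared body sees gradients from all k heads.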

Datasets
While many studies on transformer-based models (Devlin et al., 2019; Liu et al., 2019; Clark et al., 2019; Lan et al., 2019) are assessed on GLUE tasks (10 classification and 1 regression tasks), our interest is in ranking problems for question answering such as AS2. To fairly assess the AS2 performance of our proposed method against conventional distillation techniques, we report experimental results on a set of three diverse English AS2 datasets: WikiQA (Yang et al., 2015), a small academic dataset that has been widely used; ASNQ (Garg et al., 2020), a much larger corpus (3 orders of magnitude larger than WikiQA) that allows us to assess models' performance in data-unbalanced settings; and IAS2, an internal dataset we constructed for AS2. Compared to the other two corpora, IAS2 contains noisier data and is much closer to a real-world AS2 setting. Table 1 reports the statistics of the datasets; more details are provided in the Appendix.

Evaluation Metrics
We assess AS2 performance on ASNQ, WikiQA and IAS2 using three metrics: mean average precision (MAP), mean reciprocal rank (MRR), and precision at top-1 candidate (P@1).The first two metrics are commonly used to measure overall performance of ranking systems, while P@1 is a stricter metric that captures effectiveness of high-precision applications such as AS2.
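For reference, the three metrics can be sketched on per-query relevance lists already sorted by model score; the toy data below is hypothetical.

```python
def average_precision(relevance):
    """AP for one query; `relevance` holds 0/1 labels in ranked order."""
    hits, ap = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            ap += hits / rank   # precision at each relevant position
    return ap / max(hits, 1)

def reciprocal_rank(relevance):
    """1 / rank of the first relevant candidate, 0 if none is relevant."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def precision_at_1(relevance):
    """1 if the top-ranked candidate is relevant, else 0."""
    return float(relevance[0]) if relevance else 0.0

# Two toy queries, candidates sorted by descending model score.
queries = [[0, 1, 1], [1, 0, 0]]
n = len(queries)
map_score = sum(map(average_precision, queries)) / n
mrr = sum(map(reciprocal_rank, queries)) / n
p_at_1 = sum(map(precision_at_1, queries)) / n
print(map_score, mrr, p_at_1)
```

P@1 is the strictest of the three: it only rewards a model when its single top candidate is correct, which matches how an AS2 system surfaces one answer to the user.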
Our models are implemented with PyTorch 1.6 (Paszke et al., 2019) using Hugging Face Transformers 3.0.2 (Wolf et al., 2020); all models are trained on a machine with 4 NVIDIA Tesla V100 GPUs, each with 16GB of memory. Latency benchmarks are executed on a single GPU to eliminate variability due to inter-accelerator communication.

Results
Here we present our main experimental findings. In Section 5.1, we compare CERBERUS to state-of-the-art models and other distillation techniques using three datasets (IAS2, ASNQ, WikiQA). In Sections 5.2-5.4, we motivate our design and hyperparameter choices for CERBERUS by empirically validating them. Finally, in Section 5.5, we discuss the inference latency of CERBERUS compared to other transformer models.

Answer Sentence Selection Performance
The performance of CERBERUS on the IAS2, ASNQ, and WikiQA datasets is reported in Table 2. Specifically, we compare our approach (row 14) to four groups of baselines: larger transformer-based models (rows 1-3), including the state-of-the-art AS2 models by Garg et al. (2020) (rows 2 and 5); equivalently sized models, either directly fine-tuned on the target datasets (rows 4-6), or distilled using their corresponding LARGE model as teacher (rows 7-9); and ensembles of BASE models (rows 10-12). We also adapted the ensembling technique of Hydra (Tran et al., 2020), originally designed for image recognition, to our AS2 setting and used it as a baseline (row 13). All comparisons are done with respect to a B_11 3H_1 CERBERUS model initialized from an ELECTRA BASE model; the performance of other model configurations is discussed in Section 5.4. Due to the volume of experiments, we train a model with a single random seed for each set of hyperparameters and report the AS2 performance of the best hyperparameter set according to each dev set.

Vs. TANDA (BASE) & Single-Model Distillation
We find that BASE models trained with TANDA (rows 4-6), the state-of-the-art training method for AS2 tasks, are further improved (rows 7-9) by introducing knowledge distillation into the second fine-tuning stage. Our CERBERUS achieves a significant improvement over all single BASE models on all the considered datasets (Wilcoxon signed-rank test, p < 0.01). We empirically show in Section 5.2 that this significant improvement is due to both the architecture of CERBERUS and the use of heterogeneous teacher models, rather than the small number of extra parameters.

Vs. Ensembles & Hydra
For all the datasets we considered, our CERBERUS achieves similar or better performance than much larger ensemble models, including an ensemble of ALBERT BASE, RoBERTa BASE, and ELECTRA BASE trained with and without distillation (rows 10 and 11), as well as the ensemble of three ELECTRA BASE models trained using ALBERT XXLARGE, RoBERTa LARGE, and ELECTRA LARGE as teachers (row 12). We also note that CERBERUS outperforms our adaptation of Hydra (Tran et al., 2020) (row 13), which emphasizes the importance of using heterogeneous teacher models for AS2.

Are Multiple Ranking Heads and Heterogeneous Teachers Necessary?
Using the heterogeneous teacher models shown in Table 2, we discuss how AS2 performance varies when using different combinations of teachers for knowledge distillation. The first method, KD_Sum, simply sums the loss values from multiple teachers to train a single transformer model, similarly to the task-specific distillation stage with multiple teachers in Yang et al. (2020). In the second method, KD_RR, we switch teacher models for each training batch in a round-robin style; i.e., the student transformer model is trained with the first teacher model on the first batch, with the second teacher model on the second batch, and so forth.
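The two multi-teacher strategies can be sketched as follows; the function names and toy logits are our own, and only the soft (distillation) component of the loss is shown for brevity.

```python
import torch
import torch.nn.functional as F

def soft_loss(student_logits, teacher_logits, tau=3.0):
    """Temperature-softened KL between teacher and student distributions."""
    return F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau**2

def kd_sum_step(student_logits, teacher_logits_list):
    """KD_Sum: sum the soft losses from all teachers on the same mini-batch."""
    return sum(soft_loss(student_logits, t) for t in teacher_logits_list)

def kd_rr_step(student_logits, teacher_logits_list, batch_idx):
    """KD_RR: round-robin, a single teacher per mini-batch."""
    teacher = teacher_logits_list[batch_idx % len(teacher_logits_list)]
    return soft_loss(student_logits, teacher)

s = torch.randn(4, 2)                             # student logits for one batch
teachers = [torch.randn(4, 2) for _ in range(3)]  # logits from three teachers
print(kd_sum_step(s, teachers).item(), kd_rr_step(s, teachers, batch_idx=0).item())
```

KD_Sum exposes the student to all teachers on every batch, while KD_RR sees one teacher at a time; CERBERUS differs from both by dedicating a separate head to each teacher.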
Table 3 compares the performance of the multiple-teacher knowledge distillation strategies described above to that of our proposed method; we also evaluate the effect of using one teacher per head, rather than a single teacher (ELECTRA LARGE), on CERBERUS. For ELECTRA BASE, we found that the KD_Sum method slightly outperforms KD_RR; this result highlights the importance of leveraging multiple teachers for knowledge distillation within the same mini-batch. For CERBERUS, we found that using multiple heterogeneous teachers (specifically, one per ranking head) is crucial to achieving the best performance; without it, CERBERUS B_11 3H_1 achieves the same performance as ELECTRA BASE despite having more parameters. Besides these two trends, the results of rows 13 and 14 in Table 2 emphasize the importance of heterogeneity in the set of teacher models. As a result, CERBERUS B_11 3H_1 performs best and achieves performance comparable to some of the teacher (LARGE) models, while saving between 45% and 63% of model parameters. From the aforementioned three trends, we can confirm that the improved AS2 performance was achieved thanks to the multiple ranking heads in CERBERUS, the use of multiple teachers, and the heterogeneity of teacher model families; on the other hand, the slightly increased parameter count compared to ELECTRA BASE did not contribute to the performance uplift.

Do Heads Resemble Their Teachers?
To better understand the relationship between CERBERUS's ranking heads and the teachers used to train them, we analyze the top candidates chosen by each teacher and student model. Figure 3 shows how often each CERBERUS head agrees with its respective teacher model. To calculate agreement, we normalize the number of correct candidates that heads and teachers agree on by the total number of correct answers for each head.
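Our reading of this normalization can be sketched as follows; the per-query candidate indices below are hypothetical, and the helper name is our own.

```python
def head_teacher_agreement(head_top1, teacher_top1, gold):
    """Fraction of queries where head and teacher pick the same candidate,
    computed only over queries the head answers correctly (our reading of Fig. 3)."""
    correct = [i for i, (h, g) in enumerate(zip(head_top1, gold)) if h == g]
    if not correct:
        return 0.0
    agree = sum(1 for i in correct if head_top1[i] == teacher_top1[i])
    return agree / len(correct)

# Hypothetical top-1 candidate indices per query.
head    = [2, 0, 1, 3]   # head's top-ranked candidate for 4 queries
teacher = [2, 1, 1, 0]   # teacher's top-ranked candidate
gold    = [2, 0, 1, 0]   # index of the correct candidate
print(head_teacher_agreement(head, teacher, gold))
```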
Intuitively, we might expect that ranking heads would agree the most with their respective teachers; however, in practice, we observe that the highest agreement for all heads is with ELECTRA LARGE. One should consider, though, that the agreement measurement is confounded by the fact that all heads are more likely to agree with the model that is correct most often (ELECTRA LARGE). Furthermore, in all our experiments, CERBERUS is initialized from a pretrained ELECTRA BASE, which also increases the likelihood of agreement with ELECTRA LARGE. Nevertheless, we note that both the head distilled from ALBERT XXLARGE and the head distilled from RoBERTa LARGE achieve high agreement with their respective teachers, suggesting that CERBERUS ranking heads do indeed resemble their teachers.
In our experiments, we also observed that CERBERUS is able to mimic the behavior of an ensemble comprised of the three large models; for example, on the WikiQA dataset, CERBERUS always predicts the correct label when all three models are correct (197/243 queries), it follows majority voting in 17 cases, and in one case it overrides the majority vote when one of the teachers is very confident. In the remaining cases, either only a minority or none of the teachers are correct, or the confidence of the majority is low.

How Many Blocks Should Heads Have?
In Table 2, we examined the performance of a CERBERUS model with configuration B_11 3H_1, that is, a body composed of 11 blocks, and 3 ranking heads with one transformer block each. To understand how specific hyperparameter settings for CERBERUS influence model performance, we examine different CERBERUS configurations in this section. Due to space constraints, we only report results on IAS2; we observed similar trends on ASNQ and WikiQA. To keep latency comparable to that of other BASE models, we keep the total depth of CERBERUS constant, and vary the number of blocks in the ranking heads and shared encoder body. Table 4 shows the results for alternative CERBERUS configurations. Overall, we notice that performance is not significantly affected by the specific configuration of CERBERUS, which yields consistent results regardless of the number of transformer layers used in each head (1 to 6, B_11 3H_1 to B_6 3H_6). All CERBERUS models are trained with a combination of hard and soft losses, which makes it more likely that different configurations converge on a set of stable but similar solutions. Despite the similar performance, we note that B_6 3H_6 is comprised of significantly more parameters than our leanest configuration, B_11 3H_1 (199M vs 124M). Given the lack of improvement from the additional parametrization, all experiments in this work were conducted using 11 shared body blocks and 3 heads, each of which consists of 1 block (B_11 3H_1).

Benchmarking Inference Latency
Besides AS2 performance, we examine the inference latency of CERBERUS and the models evaluated in Section 5.1, using an NVIDIA Tesla V100 GPU. The results are summarized in Table 5. For a fair comparison between the models, we used the same batch size (128) for all benchmarks, and excluded tokenization and CPU/GPU communication overhead when recording wall-clock time. Overall, we confirm that CERBERUS achieves latency comparable to that of other BASE models; all four are within one standard deviation of each other.
All the LARGE models, including the state-of-the-art AS2 model (RoBERTa LARGE by Garg et al. (2020)), exhibit significantly higher latency (on average, 3.4× slower than CERBERUS); in particular, ALBERT XXLARGE, which is comprised of 12 very wide transformer blocks, shows the worst latency among single models. Further, the latency of the two ensemble models is comparable to that of some of the LARGE models, supporting our argument that they are not suitable for high-performance applications.

Conclusions and Future Work
In this work, we introduce a technique for obtaining a single efficient AS2 model from an ensemble of heterogeneous transformer models. This efficient approach, which we call CERBERUS, consists of a sequence of transformer blocks followed by multiple ranking heads; each head is trained with a unique teacher, ensuring proper distillation of the ensemble. Results show that the proposed model outperforms traditional single-teacher techniques, rivaling state-of-the-art AS2 models while saving 64% and 60% in model size and latency, respectively. CERBERUS enables LARGE-like AS2 accuracy while maintaining BASE-like efficiency. Further analysis demonstrates that the reported improvements in AS2 performance are due to three key factors: (i) multiple ranking heads, (ii) multiple teachers, and (iii) heterogeneity in teacher models.
Future work will focus on two key aspects: how CERBERUS performs on non-ranking tasks, and whether it can achieve similar improvements in ranking tasks outside QA. For the former, we remark that, while the core idea of CERBERUS can be extended to tasks such as those in the GLUE benchmark (Wang et al., 2018), further investigation is necessary to establish the best set of trade-offs for different objectives and metrics. A similar concern exists for extending CERBERUS to other ranking tasks, such as ad-hoc retrieval.

Limitations
In this study, we discussed experimental results and empirically showed the effectiveness of our proposed approach on English datasets only. While this is a major limitation of the study, our approach is not specific to English; it could thus be extended in the future using models in other languages, although improvements might not translate to less resource-rich languages.
As described in Section 4.2, our experiments are compute-intensive and were conducted on 4 NVIDIA V100 GPUs. Thus, researchers with less compute might not be able to replicate CERBERUS.
Next, all models we present in this work are trained to optimize answer relevance to a given question.Therefore, they might be unfair towards protected categories (race, gender, sex, nationality, etc.) or present answers from a biased point of view.Our work does not address this challenge.
Finally, we evaluated our approach only in the context of answer sentence ranking; thus, the reader might wonder whether such an approach would work for other tasks. We note that, although a study on the general applicability of our approach is interesting and needed, it would require more space than a conference submission allows to be accurately described and evaluated. Therefore, we leave further investigation of CERBERUS on other domains and tasks as future work.

C IAS2
This is an in-house dataset, called Internal Answer Sentence Selection (IAS2), which we built as part of our efforts to understand and benchmark web-based question answering systems. To obtain questions, we first collected a non-representative sample of queries from the traffic logs of our commercial virtual assistant system. We then used a retrieval system containing hundreds of millions of web pages to obtain up to 100 web pages for each question. From the set of retrieved documents, we extracted all candidate sentences and ranked them using AS2 models trained with TANDA (Garg et al., 2020); at least the top 25 candidates for each question are annotated by humans. Overall, IAS2 contains 6,939 questions and 283,855 candidate answers. We reserve 3,000 questions for evaluation, 808 for development, and use the rest for training. Compared to ASNQ and WikiQA, whose candidate answers are mostly from Wikipedia pages, IAS2 contains answers from a diverse set of pages, which allows us to better estimate robustness with respect to content obtained from the web.

D Common Training Configurations
Besides the method-specific hyperparameters described in Sections 3.2 and 3.3, we describe the training strategies and hyperparameters commonly used to train AS2 models in this study. Unless otherwise specified, we used the Adam optimizer (Kingma and Ba, 2015) with a linear learning rate scheduler with warm-up to train AS2 models. The number of training iterations was 20,000, and we assess an AS2 model every 250 iterations using the dev set for validation. If the dev MAP has not improved within the last 50 validations, we terminate the training session. As described in Section 5.1, we independently tuned hyperparameters based on the dev set for each dataset, including the initial learning rate {10^-6, 10^-5} and batch size {8, 16, 24, 32, 64}. Note that we train AS2 models on the ASNQ dataset for 200,000 iterations due to the size of the dataset.
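The validation-based stopping rule above can be sketched as follows; the class name and toy MAP history are our own (the real schedule validates every 250 iterations with a patience of 50 validations).

```python
class EarlyStopper:
    """Stop training when dev MAP has not improved in the last `patience`
    validations, mirroring the early-stopping rule described in the text."""

    def __init__(self, patience=50):
        self.patience = patience
        self.best = float("-inf")
        self.stale = 0  # validations since the last improvement

    def step(self, dev_map: float) -> bool:
        """Record one validation result; return True if training should stop."""
        if dev_map > self.best:
            self.best = dev_map
            self.stale = 0
        else:
            self.stale += 1
        return self.stale >= self.patience

stopper = EarlyStopper(patience=3)  # small patience for illustration
history = [0.60, 0.62, 0.61, 0.62, 0.615]   # hypothetical dev MAP per validation
stopped = [stopper.step(m) for m in history]
print(stopped)
```

Note that ties with the best score count as non-improvements here; whether ties reset the counter is an implementation choice not specified in the text.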
For model configurations, we used the default configurations available in Hugging Face Transformers 3.0.2 (Wolf et al., 2020). For instance, the number of attention heads is 12 and 64 for ALBERT BASE and ALBERT XXLARGE, 12 and 16 for RoBERTa BASE and RoBERTa LARGE, and 12 and 16 for ELECTRA BASE and ELECTRA LARGE, respectively. In this paper, we designed CERBERUS leveraging the default ELECTRA BASE architecture, thus its number of attention heads is 12.

Figure 1: CERBERUS model for answer sentence selection. The model consists of a shared encoder body and multiple ranking heads. CERBERUS independently scores up to hundreds of candidate answers a_i for a question q; the one with the highest likelihood is selected as the answer.

Figure 2: Detailed overview of the CERBERUS model, which consists of a shared encoder body of b transformer layers, followed by k ranking heads of h layers each; we use the notation B_b kH_h to identify a CERBERUS configuration. All heads are jointly trained, but each head learns from a unique teacher model; at inference time, predictions from the heads are combined by a pooler layer.

Figure 3: Agreement between heads and their teacher models in CERBERUS. It is obtained by dividing the number of correct candidates each head and teacher agree on by the total number of correct answers for each head.

Table 3: Comparison of single- and multiple-teacher distillation for ELECTRA BASE and CERBERUS B_11 3H_1 models on the IAS2 test set. Overall, we found that combining the CERBERUS architecture with multiple teachers is essential to achieve the best performance.
B ASNQ

Garg et al. (2020) introduced Answer Sentence Natural Questions (ASNQ), a large-scale answer sentence selection dataset. It was derived from Google Natural Questions (NQ) (Kwiatkowski et al., 2019), and contains over 57k questions and 23M answer candidates. Its large scale (at least two orders of magnitude larger than any other AS2 dataset) and class imbalance (approximately one correct answer every 400 candidates) make it particularly suitable for evaluating how well our models generalize. Samples in Google NQ consist of tuples ⟨question, answer_long, answer_short, label⟩, where answer_long contains multiple sentences, answer_short is a fragment of a sentence, and label indicates whether answer_long is correct. To construct ASNQ, Garg et al. (2020) labeled any sentence from answer_long that contains answer_short as positive; all other sentences are labeled as negative. The original release of ASNQ only contains train and development splits; we use the dev and test splits introduced by Soldaini and Moschitti (2020).