Enhancing Dual-Encoders with Question and Answer Cross-Embeddings for Answer Retrieval

Dual-Encoders is a promising mechanism for answer retrieval in question answering (QA) systems. Currently, most conventional Dual-Encoders learn the semantic representations of questions and answers merely through the matching score. Researchers have proposed introducing QA interaction features into the scoring function, but at the cost of low efficiency in the inference stage. To keep the encoding of questions and answers independent during the inference stage, the variational auto-encoder has further been introduced to reconstruct answers (questions) from question (answer) embeddings as an auxiliary training task that enhances QA interaction in representation learning. However, the needs of text generation and answer retrieval are different, which makes such joint training difficult. In this work, we propose a framework to enhance the Dual-Encoders model with question answer cross-embeddings and a novel Geometry Alignment Mechanism (GAM) to align the geometry of embeddings from Dual-Encoders with that from Cross-Encoders. Extensive experimental results show that our framework significantly improves the Dual-Encoders model and outperforms the state-of-the-art method on multiple answer retrieval datasets.


Introduction
Answer retrieval (Surdeanu et al., 2008) is an important mechanism in question answering (QA) systems to obtain answer candidates for a new question. Currently, the most widely used framework for the answer retrieval task is Dual-Encoders (Seo et al., 2019; Chang et al., 2020; Cer et al., 2018), also known as the "Siamese Network" (Triantafillou et al., 2017; Das et al., 2016). The Dual-Encoders model consists of two encoders that compute the embeddings of questions and answers independently, and a predictor that estimates relevance by a similarity score between the two embeddings.
Recently, due to the application of advanced encoding techniques, e.g., Transformer (Vaswani et al., 2017) and BERT, Dual-Encoders have achieved a huge boost in overall performance (Karpukhin et al., 2020; Maillard et al., 2021). However, there remains room for improvement, since the embeddings of questions and answers are encoded separately, while the cross information between questions and answers is important for answer retrieval.
Many efforts have been devoted to developing more powerful scoring by considering the interactions between questions and answers. For example, Xie and Ma (2019) introduced additional word-level interaction features between questions and answers for matching-degree estimation. Similarly, Humeau et al. (2020) implemented an attention mechanism to extract more information when computing the matching score. Though such approaches improve the scoring mechanism, the overall efficiency derived from separate, off-line embeddings of questions and answers is sacrificed to some extent. Therefore, it is worth discussing how to achieve a better trade-off while maintaining independent encoding in the inference stage. To this end, Dual-VAEs (Shen et al., 2018) was proposed, using question-to-question and answer-to-answer reconstruction as a joint training task along with the retrieval task to improve representation learning while maintaining independent encoding in the inference stage. However, the embeddings produced by Dual-Encoders or Dual-VAEs can still only preserve isolated information about questions or answers, while the cross information between questions and answers is only learned through the similarity score computed from the two embeddings. Such embeddings, preserving isolated semantics, can lead to confusing results, particularly when an answer has multiple matched questions and vice versa, which is referred to as the one-to-many problem.
To address this challenge, Cross-VAEs was further proposed, reconstructing answers from question embeddings and questions from answer embeddings. In this way, the embeddings of questions or answers preserve the cross information from the matched answers or questions, improving performance in one-to-many cases. Nevertheless, both Dual-VAEs and Cross-VAEs rely on a generation sub-task to enhance the embeddings for the retrieval task, while the need of text generation (the word-level joint distribution of sentences) and that of answer retrieval (the sentence-level matching distribution of QA pairs) are different and suspected to conflict in joint training (Deudon, 2018). This raises an interesting question: is it feasible to exploit the cross information in the retrieval task while keeping sentence encoding independent in the inference stage?
In this research, we propose Cross-Encoders (details in Section 3.3) as additional guidance during Dual-Encoders training besides the similarity score. The Cross-Encoders can form comprehensive representations through cross-attention to reflect the complex relations (e.g., one-to-many) between matched questions and answers. We also develop the Geometry Alignment Mechanism (details in Section 3.4) as the guiding mechanism that effectively bridges the gap between Cross-Encoders and Dual-Encoders by forcing the Dual-Encoders to mimic the Cross-Encoders on the geometry (i.e., the semantic feature structure) of the embedding space.
The contributions of this paper are threefold: 1) Focusing on the lack of interactions in the Dual-Encoders architecture, we introduce an ENhancing Dual-encoders with CROSS-Embeddings (ENDX) framework to address this limitation, where a Cross-Encoders model is proposed to guide the training of the Dual-Encoders model; 2) To achieve such enhancement in ENDX, we propose a novel Geometry Alignment Mechanism (GAM) to align the geometry of embeddings from Dual-Encoders with that from Cross-Encoders, which models the interactions between words within the question and answer. This helps the Dual-Encoders encode the necessary cross information despite having no access to the matched sentence; 3) To validate our framework, we conduct extensive experiments and show that the proposed framework significantly improves the Dual-Encoders model and outperforms the state-of-the-art model on multiple QA datasets.

Related Work
Traditional answer retrieval uses a two-stage pipeline: keyword matching (e.g., BM25 (Robertson and Zaragoza, 2009)) to efficiently retrieve multiple relevant passages, followed by neural re-ranking to select correct answers from the retrieved results. But this approach may fall short, as the connection between answers and questions in context is not modelled directly, and the large document where the answer is located may not be highly relevant to the question (Ahmad et al., 2019).
To address the problems of the two-stage pipeline, there is growing interest in training end-to-end retrieval systems that can efficiently surface relevant results without an intermediate document retrieval phase (Karpukhin et al., 2020; Chang et al., 2020; Ahmad et al., 2019; Seo et al., 2019; Henderson et al., 2019). In recent works (Karpukhin et al., 2020; Chang et al., 2020; Maillard et al., 2021), dense representations learned by the Dual-Encoders framework outperformed BM25 in large-scale retrieval tasks. Dual-Encoders can encode questions and answers independently and thus enable offline processing to support efficient online response, but there exists a bottleneck that impedes QA alignment due to the lack of interaction between questions and answers in their independent encoding.
Another popular approach to sentence-level representation learning is the Variational AutoEncoder (VAE). By encoding sentences into latent variables and reconstructing the same sentences from the corresponding latent variables, VAE compresses the joint distribution of the words in a sentence into the latent variable. Shen et al. (2018) adopted VAE in Dual-Encoders and optimized the variational lower bound and matching loss jointly. A crossed reconstruction of questions and answers was later proposed to improve their interaction and allow for one-to-many projection. We do not include text reconstruction in our training objective, because the sentence representation needed for reconstruction differs from that needed for answer retrieval.
Our proposed framework consists of a Dual-Encoders and a Cross-Encoders model. The conventional Dual-Encoders provides the system with practicality in large-scale retrieval (Karpukhin et al., 2020; Chang et al., 2020; Maillard et al., 2021), while the Cross-Encoders models the interaction between questions and answers to guide the training of the Dual-Encoders.

Problem Definition
The answer retrieval task in this work is formalized as follows: given a question set S_Q and an answer set S_A, each sample is represented as (q, a, y), where q ∈ S_Q is a question, a ∈ S_A is a sentence-level answer, and y denotes whether the answer a matches the question q. The target is to find the best-matched answer for the question q from a list of candidate answers C(q) ⊂ S_A.

Figure 1: The overview of the proposed framework that enhances Dual-Encoders with cross-embeddings. Dual-Encoders (blue) and Cross-Encoders (red) are both used for training; only Dual-Encoders is used for inference.

Dual-Encoders
Our baseline model is Dual-Encoders, and we refer to the sentence embedding encoded by Dual-Encoders as the dual-embedding. As shown in Fig. 1, the question (answer) dual-embedding R^dual_q (R^dual_a) is produced from the question (answer) text by an encoder and an aggregator in Dual-Encoders, where the encoder, marked as Encoder_dual in Fig. 1, can be BERT; we employ multiple-hop self-attention (Lin et al., 2017) as the aggregator in this work. The scoring function f is defined as the inner product between the dual-embeddings of question and answer: f(q, a) = R^dual_q · R^dual_a. Intuitively, an excellent Dual-Encoders model should give high scores to matched QA pairs and low scores to mismatched QA pairs. We use the in-batch negatives training strategy, which is effective for learning a Dual-Encoders model (Karpukhin et al., 2020). Assuming that a mini-batch has B matched question-answer pairs, the retrieval loss of a mini-batch is:

L_dual = -(1/B) Σ_{i=1}^{B} log [ exp(f(q_i, a_i)) / Σ_{j=1}^{B} exp(f(q_i, a_j)) ]    (1)

where B is the batch size; i and j are the indexes of QA pairs in a given batch.
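The in-batch negatives loss above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the function name and shapes are our own choices.

```python
import numpy as np

def in_batch_retrieval_loss(R_q, R_a):
    """Softmax cross-entropy over in-batch negatives.

    R_q, R_a: (B, d) arrays of question / answer dual-embeddings.
    For each question q_i, the matched answer a_i is the positive and the
    other B-1 answers in the batch serve as negatives.
    """
    scores = R_q @ R_a.T  # (B, B) inner-product scores f(q_i, a_j)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))  # -log p(a_i | q_i), averaged over the batch

rng = np.random.default_rng(0)
R_q = rng.normal(size=(4, 8))
loss_random = in_batch_retrieval_loss(R_q, rng.normal(size=(4, 8)))
loss_aligned = in_batch_retrieval_loss(R_q, 10 * R_q)  # near-perfectly matched pairs
assert loss_aligned < loss_random
```

The loss drops as matched pairs score higher than in-batch negatives, which is exactly the behavior Eq. 1 rewards.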

Cross-Encoders
The cross-embeddings, which involve rich question-answer interaction, are obtained from the Cross-Encoders. As shown in Fig. 1, the Cross-Encoders takes both the question and answer sentences as input. To capture precise question-answer interaction, the matched answer (question) is used to guide the encoding of the question (answer).
Let H_q ∈ R^{N×d_r} and H_a ∈ R^{M×d_r} denote the contextualized representations of the words in the question and answer sentences from Encoder_cross respectively, where N and M are the numbers of words in the question and answer sentences and d_r is the dimension of the contextualized representation. A multi-head scaled dot-product attention (Vaswani et al., 2017), marked as Cross Attention in Fig. 1, is used to refine the question (answer) contextualized representation by the matched answer (question). Taking the refinement of the question for instance, the i-th head is calculated as Eq. 2, and all heads are concatenated as Eq. 3 to obtain the answer-attended question representation H̃_q; then position-wise feed-forward networks (FFN) and layer normalization (LayerNorm) are used to further refine H̃_q into the enhanced question contextualized representation H^cross_q as Eq. 4:

head_i = softmax( (H_q W^i_q)(H_a W^i_k)^T / sqrt(d_r / l_h) ) (H_a W^i_v)    (2)

H̃_q = [head_1; ...; head_{l_h}] W^o    (3)

H^cross_q = LayerNorm(H̃_q + FFN(H̃_q))    (4)

where H^cross_q ∈ R^{N×d_r}; l_h is the number of heads; W^i_q, W^i_k, W^i_v and W^o are learnable weights. Similarly, we can obtain the enhanced answer contextualized representation H^cross_a ∈ R^{M×d_r}. Multi-head attention can model word-level relationships across the question and answer, reflecting the similarity between every pair of word contextualized representations across the two sentences, so as to capture the question-answer interaction and form a comprehensive embedding of the source sentence.
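The cross-attention step above (queries from the question, keys and values from the matched answer) can be sketched as follows. This is a simplified NumPy illustration under our own naming; the FFN/LayerNorm refinement and learned-weight initialization of the actual model are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(H_q, H_a, W_q, W_k, W_v, W_o, n_heads):
    """Answer-attended question representation (multi-head scaled dot-product).

    H_q: (N, d) question token representations; H_a: (M, d) answer tokens.
    Queries come from the question, keys/values from the matched answer,
    so every question token can attend to every answer token.
    """
    d = H_q.shape[1]
    d_h = d // n_heads
    Q = (H_q @ W_q).reshape(-1, n_heads, d_h).transpose(1, 0, 2)  # (h, N, d_h)
    K = (H_a @ W_k).reshape(-1, n_heads, d_h).transpose(1, 0, 2)  # (h, M, d_h)
    V = (H_a @ W_v).reshape(-1, n_heads, d_h).transpose(1, 0, 2)  # (h, M, d_h)
    attn = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_h), axis=-1)  # (h, N, M)
    heads = attn @ V                                   # (h, N, d_h)
    concat = heads.transpose(1, 0, 2).reshape(-1, d)   # concatenate the heads
    return concat @ W_o                                # (N, d)

rng = np.random.default_rng(0)
H_q, H_a = rng.normal(size=(5, 8)), rng.normal(size=(7, 8))
W = [rng.normal(size=(8, 8)) for _ in range(4)]
out = cross_attention(H_q, H_a, W[0], W[1], W[2], W[3], n_heads=2)
assert out.shape == (5, 8)  # one refined vector per question token
```

Each refined question token is a mixture of answer-token values, which is how the cross-embeddings acquire the QA interaction that plain dual-embeddings lack.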
The sequences H^cross_q and H^cross_a are then aggregated into the fixed-length cross-embeddings R^cross_q and R^cross_a, which can precisely model the relations between questions and answers. The Cross-Encoders can be trained with a loss function defined on a mini-batch as Eq. 5:

L_cross = -(1/B) Σ_{i=1}^{B} log [ exp(R^cross_{q_i} · R^cross_{a_i}) / Σ_{j=1}^{B} exp(R^cross_{q_i} · R^cross_{a_j}) ]    (5)

where B is the batch size; i and j are the indexes of the QA pairs in a given batch.

Geometry Alignment Mechanism
The dual-embeddings mechanism can save much response time through off-line processing, while the cross-embeddings introduce early interaction and produce retrieved answer sets with better relevance. To bridge the gap between the dual-embeddings and cross-embeddings, the most direct approach is regression; however, this element-wise alignment in high-dimensional space is too rigid for answer retrieval.
Inspired by the geometry-preserving dimensionality reduction for pair-wise interaction modeling proposed in SNE (Hinton and Roweis, 2002), we relax the element-wise alignment to a pair-wise alignment in the form of geometry, which has also been proved crucial in representation learning (Passalis and Tefas, 2018). Therefore, in this research we propose the Geometry Alignment Mechanism (GAM) to align the geometry of dual-embeddings with that of cross-embeddings, which captures the question-answer interaction. Specifically, the geometry of embeddings tells who the neighbors of a question or an answer are in the embedding space. In other words, it tells which question-answer pairs, question-question pairs or answer-answer pairs are likely to be close in the feature space.
Since Dual-Encoders are not able to exploit the information from matched questions or answers, it might be difficult to accurately recreate the whole geometry of cross-embeddings. Therefore, we use the conditional probability converted from pairwise dissimilarities to represent the geometry of data samples in feature space (Hinton and Roweis, 2002; Van der Maaten, 2008). The conditional probability expresses the asymmetric probability of each datapoint e_i being close to another datapoint e_j in feature space as Eq. 6:

p(e_j|e_i) = exp(-d(e_i, e_j)) / Σ_{k≠i} exp(-d(e_i, e_k))    (6)

where d(e_i, e_j) measures the dissimilarity between e_i and e_j.
Consequently, the probability of question q_i being close to answer a_j in feature space can be described by the conditional probability p(a_j|q_i).
To estimate such probabilities, we can use kernel density estimation (KDE) (Scott, 1992), which replaces the negative dissimilarity function -d(e_i, e_j) with a symmetric kernel function K(e_i, e_j; σ^2) that models the similarity between e_i and e_j, where σ^2 is the width. The conditional probability p(a_j|q_i) of cross-embeddings, p_cross(a_j|q_i), and that of dual-embeddings, p_dual(a_j|q_i), can be estimated using a batch of samples as Eqs. 7 and 8 respectively:

p_cross(a_j|q_i) = K(R^cross_{q_i}, R^cross_{a_j}) / Σ_{k=1}^{B} K(R^cross_{q_i}, R^cross_{a_k})    (7)

p_dual(a_j|q_i) = K(R^dual_{q_i}, R^dual_{a_j}) / Σ_{k=1}^{B} K(R^dual_{q_i}, R^dual_{a_k})    (8)

where B is the batch size; i, j and k are the indexes of the QA pairs in a given batch. The conditional probabilities p(q_j|q_i) and p(a_j|a_i) can be estimated similarly. Since the conditional probability is asymmetric, p(q_j|a_i) is also needed. One of the most natural kernel choices for kernel density estimation is the Gaussian kernel defined as Eq. 9, though it suffers from the need of a well-tuned width (Turlach, 1993):

K(e_i, e_j; σ^2) = exp(-||e_i - e_j||^2 / (2σ^2))    (9)

To alleviate the problem of domain-dependent tuning and adapt the kernel to our scoring function, we use an inner product-based similarity metric as defined in Eq. 10:

K(e_i, e_j) = exp(e_i · e_j)    (10)

So that the dual-embeddings of questions q_i and q_j can precisely model the similarity between the cross-embeddings of questions q_i and q_j, the conditional probabilities p_dual(q_j|q_i) and p_cross(q_j|q_i) should be as close as possible. Therefore, GAM aims to learn a dual-embedding representation that minimizes the divergence between p_dual(q_j|q_i) and p_cross(q_j|q_i), p_dual(a_j|a_i) and p_cross(a_j|a_i), p_dual(a_j|q_i) and p_cross(a_j|q_i), as well as p_dual(q_j|a_i) and p_cross(q_j|a_i). To achieve this, the widely used Kullback-Leibler Divergence (KLD) is employed in this research. The loss function L_{q|q} defined on a mini-batch is adopted to minimize the divergence between p_dual(q_j|q_i) and p_cross(q_j|q_i), calculated as Eq. 11:

L_{q|q} = (1/B) Σ_{i=1}^{B} Σ_{j=1}^{B} p_cross(q_j|q_i) log [ p_cross(q_j|q_i) / p_dual(q_j|q_i) ]    (11)

where B is the batch size; i and j are the indexes of the QA pairs in a given batch.
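The geometry alignment above can be sketched numerically: softmax-normalized inner products give the in-batch neighbor distributions, and a KL term pulls the dual-embedding geometry toward the cross-embedding geometry. A minimal NumPy sketch, with our own function names and with the KL direction (target cross-distribution first) as an assumption:

```python
import numpy as np

def neighbor_probs(E_src, E_tgt):
    """In-batch conditional probabilities p(e_j | e_i) from an inner-product
    kernel: a row-wise softmax over the batch (cf. the KDE estimate)."""
    s = E_src @ E_tgt.T
    s = s - s.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(s)
    return p / p.sum(axis=1, keepdims=True)

def gam_kl(E_cross_src, E_cross_tgt, E_dual_src, E_dual_tgt):
    """KL(p_cross || p_dual) for one direction of GAM, e.g. a|q when the
    sources are question embeddings and the targets are answer embeddings."""
    p_cross = neighbor_probs(E_cross_src, E_cross_tgt)
    p_dual = neighbor_probs(E_dual_src, E_dual_tgt)
    return np.mean(np.sum(p_cross * np.log(p_cross / p_dual), axis=1))

rng = np.random.default_rng(0)
Cq, Ca = rng.normal(size=(4, 6)), rng.normal(size=(4, 6))
# identical geometries give zero divergence; different ones give a positive loss
assert abs(gam_kl(Cq, Ca, Cq, Ca)) < 1e-12
assert gam_kl(Cq, Ca, rng.normal(size=(4, 6)), rng.normal(size=(4, 6))) > 0
```

Because only the neighbor distributions are matched, the dual-embeddings are free to differ element-wise from the cross-embeddings, which is exactly the relaxation of rigid regression that motivates GAM.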
The loss functions L_{a|a}, L_{a|q} and L_{q|a} can be calculated in the same way. The overall loss function of GAM is then defined as Eq. 12, where the hyper-parameters α_{a|q}, α_{q|q}, α_{q|a} and α_{a|a} are the weights on the different loss components:

L_ga = α_{a|q} L_{a|q} + α_{q|q} L_{q|q} + α_{q|a} L_{q|a} + α_{a|a} L_{a|a}    (12)

Model Training and Inference
During the training stage, we jointly train the Dual-Encoders and Cross-Encoders, and align the geometry of the Dual-Encoders with that of the Cross-Encoders. The overall loss function to train the full model is defined as Eq. 13, where α_dual, α_cross and α_ga are hyper-parameters that control the loss weights.
L = α_dual L_dual + α_cross L_cross + α_ga L_ga    (13)

Since we only use the enhanced Dual-Encoders to encode questions in the inference stage, while the embeddings of answers are processed off-line, no extra computation is needed at inference time.
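The off-line serving pattern described above amounts to a precomputed answer matrix plus an inner-product search per incoming question. A minimal sketch (function names are ours; a production system would use an ANN index instead of a dense matrix):

```python
import numpy as np

def build_answer_index(answer_embeddings):
    """Answers are encoded once, off-line; only the matrix is kept for serving."""
    return np.asarray(answer_embeddings, dtype=float)

def retrieve(index, q_embedding, top_k=5):
    """Inner-product search: each question and each answer is encoded exactly
    once, giving O(n + m) encoding cost overall rather than O(n * m)."""
    scores = index @ q_embedding          # f(q, a) for every stored answer
    top = np.argsort(-scores)[:top_k]     # highest-scoring answers first
    return top, scores[top]

index = build_answer_index([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
top, top_scores = retrieve(index, np.array([0.0, 1.0]), top_k=2)
assert list(top) == [1, 2]  # the best-aligned answer is ranked first
```

This is what makes the Dual-Encoders half of the framework practical at scale: the Cross-Encoders and GAM exist only at training time.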

Datasets

Ahmad et al. (2019) introduced the Retrieval Question-Answering (ReQA) task, which focuses on sentence-level answer retrieval, and established a pipeline to transform a reading comprehension dataset into a ReQA dataset. We conduct our experiments on ReQA SQuAD and ReQA NQ, established from SQuAD v1.1 (Rajpurkar et al., 2016) and NQ respectively by Ahmad et al. (2019). We also use the same pipeline to process the HotpotQA and NewsQA (Trischler et al., 2017) datasets for more experiments. ReQA HotpotQA and ReQA NewsQA denote the processed versions of the HotpotQA and NewsQA datasets respectively. Since the original test sets of the above datasets are not publicly available, the original validation sets are used as test sets. The statistics of the ReQA datasets are shown in Table 1.

Evaluation Metrics
We adopt two popular metrics for evaluation, i.e., mean reciprocal rank (MRR) and recall at N (R@N), which are widely used for measuring retrieval-based QA tasks (Ahmad et al., 2019).
MRR is the average of the reciprocal ranks of the retrieval results, as illustrated in Eq. 14, where Q is the set of questions and rank_i is the rank of the first correct answer for the i-th question:

MRR = (1/|Q|) Σ_{i=1}^{|Q|} 1 / rank_i    (14)
R@N is the recall score over the top-N predicted subset, as illustrated in Eq. 15, where A_i is the ranked answer list for the i-th question, A_i[1:N] is its top-N subset, and A*_i is the corresponding correct answer set:

R@N = (1/|Q|) Σ_{i=1}^{|Q|} |A_i[1:N] ∩ A*_i| / |A*_i|    (15)
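Both metrics are short to implement; the following sketch (our own function names) mirrors the two definitions above:

```python
def mrr(ranked_lists, gold_sets):
    """Mean reciprocal rank: 1 / rank of the first correct answer per question."""
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_sets):
        for rank, ans in enumerate(ranked, start=1):
            if ans in gold:
                total += 1.0 / rank
                break  # only the first correct answer counts
    return total / len(ranked_lists)

def recall_at_n(ranked_lists, gold_sets, n):
    """R@N: fraction of gold answers found in the top-N, averaged over questions."""
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_sets):
        total += len(set(ranked[:n]) & gold) / len(gold)
    return total / len(ranked_lists)

# two questions, gold answer id 1 for both: ranks 2 and 1 -> MRR = (1/2 + 1) / 2
assert mrr([[2, 1], [1, 3]], [{1}, {1}]) == 0.75
assert recall_at_n([[2, 1], [1, 3]], [{1}, {1}], 1) == 0.5
```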

Compared Methods
BM25 A classical ranking method using a TF-IDF-like scoring function for information retrieval (Robertson and Zaragoza, 2009).
InferSent A universal sentence encoder trained on the supervised natural language inference task, which does not require fine-tuning for a specific retrieval task (Conneau et al., 2017).
USE-QA A multi-task pre-trained model based on the Transformer, which learns universal sentence representation through a multi-feature ranking task, a translation ranking task and a natural language inference task (Yang et al., 2020).

Dual-Encoders
The vanilla Dual-Encoders, trained from scratch, can be implemented with different encoders. For instance, we use Dual-BERTs to denote the Dual-Encoders using BERT as the encoder.
Dual-VAEs A model trained jointly with the question-to-question and answer-to-answer reconstruction tasks using VAE (Shen et al., 2018).
Cross-VAEs A model to solve one-to-many problem in answer retrieval, aligning the feature spaces of questions and answers by the question-to-answer and answer-to-question reconstruction .

ENDX-Encoders (Ours)
The Dual-Encoders is enhanced by our ENDX framework. For instance, ENDX-BERTs is used to denote the Dual-BERTs enhanced by ENDX.

Implementation Details
We split the training set of each dataset into a new training set and a validation set at a ratio of 9:1. The hyper-parameters are chosen according to the model performance (R@1) on the validation set. Specifically, Dual-BERTs and ENDX-BERTs are initialized from the BERT base model, and the encoders of the other models have 2 layers and use 768-dim BERT token embeddings as input. The cross-attention modules of all ENDX-Encoders have 12 heads. We use the AdamW optimizer (Loshchilov and Hutter, 2017) to train the BERT-based models for 30 epochs with a linearly decayed learning rate initialized at 2e-5, and train the other models for 100 epochs with a constant learning rate of 1e-5. We set the loss weights α_dual, α_cross and α_ga to 0.25, 0.25 and 0.5 respectively. The loss weights α_{a|q} and α_{q|a} increase linearly from 0 to 0.5, while α_{q|q} and α_{a|a} increase linearly from 0 to 1e4, both over the first 5 epochs. The batch size of the BERT-based models is set to 12, and that of the other models is set to 100. Finally, the parameters that perform best on the validation set are used on the test set.

Results and Analysis
Main Results The results on ReQA SQuAD are shown in Table 2. BM25 shows competitive performance, since keyword overlap is common in ReQA SQuAD. As a pre-trained universal sentence encoder without fine-tuning, InferSent does not perform well, as its pre-training datasets are relatively small. USE-QA achieves stronger performance because of its more powerful encoder and larger-scale pre-training dataset. Compared to Dual-VAEs, Cross-VAEs improves MRR, R@1 and R@5 by 1.32%, 1.07% and 2.28% respectively, while our ENDX-BERTs outperforms the current best model Cross-VAEs. Table 3 shows the performance comparison on the ReQA NQ, ReQA HotpotQA and ReQA NewsQA datasets. Since the results in Table 2 have already shown that Dual-BERTs and ENDX-BERTs significantly outperform BM25, InferSent, USE-QA, Dual-VAEs and Cross-VAEs, we only compare Dual-Encoders and ENDX-Encoders on these datasets. The results in Table 2 and Table 3 show that ENDX-Encoders consistently improve over the corresponding Dual-Encoders.

Performance on sub-datasets We conduct more experiments on sub-datasets of ReQA SQuAD to validate the effectiveness of our framework in coping with the one-to-many problem. The comparison results between Dual-BERTs and ENDX-BERTs on sub-datasets, in which answers have different minimum numbers of matched questions, are shown in Fig. 2. It is observed that ENDX-BERTs consistently outperforms Dual-BERTs. The results on the most difficult sub-dataset, in which answers have at least 8 different questions, are shown in Table 4. Compared to Dual-BERTs, USE-QA and Cross-VAEs, our proposed model shows the best performance under such a pronounced one-to-many circumstance.
Analysis on the effects of GAM We also sample multiple questions with the same answer and encode the questions by Cross-Encoders, Dual-Encoders enhanced with the proposed GAM, and basic Dual-Encoders, respectively. The question-question similarity matrices are visualized in Fig. 3. In cross-embeddings, questions can attend to the matched answer, which results in more accurate question representations and a better capture of the correlations between questions (see Fig. 3(a)). During ENDX training, we use GAM to align the geometry of dual-embeddings with that of cross-embeddings. As shown in Fig. 3(b), dual-embeddings enhanced by GAM are able to capture more correlations in question-question similarities compared to the baseline dual-embeddings (Fig. 3(c)).
Ablation study on the loss function of GAM We perform an ablation study on the proposed ENDX-BERTs on ReQA NQ by removing different components of the GAM loss function. As shown in Table 5, all metric scores drop significantly without optimizing L_{a|q} or L_{q|a}, which indicates that p(a_j|q_i) and p(q_j|a_i) describe the most important parts of the geometry. We conjecture the reason is that the answer retrieval task focuses more on the relative question-to-answer distances in feature space, while L_{q|q} and L_{a|a} are also helpful.

Comparison with BERT QA We also compare the proposed ENDX-BERTs against the interaction-based model BERT QA, which encodes the concatenated sequence for every candidate QA pair. Due to the extremely large computational cost of BERT QA, we only sample 500 QA pairs from 27 passages in ReQA SQuAD as the test set. The experimental results are shown in Table 6, where ENDX-BERTs improves MRR, R@1 and R@5 over Dual-BERTs by +4.61%, +4.59% and +5.44% respectively, and only falls behind BERT QA by -3.38%, -4.93% and -0.81%. However, the inference runtime complexity is significantly reduced from O(n × m) to O(n + m) compared to BERT QA, where n and m are the numbers of questions and answers respectively. Therefore, the proposed ENDX-BERTs better balances accuracy and efficiency for answer retrieval.

Figure 4 shows the dual-embedding projections (t-SNE, Van der Maaten, 2008) of 6 different questions and their shared answer. It can be seen that the dual-embeddings of our ENDX-BERTs are more compact than those of Dual-BERTs, which shows that our method can better align questions and answers and produce more general representations to alleviate the one-to-many problem.

Conclusion
In this work, we propose a framework that enhances Dual-Encoders with cross-embeddings for answer retrieval. A novel Geometry Alignment Mechanism is introduced to align the geometry of dual-embeddings with that of cross-embeddings. Extensive experimental results show that our method significantly improves the Dual-Encoders model and outperforms the state-of-the-art method on multiple answer retrieval datasets.