Joint Models for Answer Verification in Question Answering Systems

This paper studies joint models for selecting correct answer sentences among the top k provided by answer sentence selection (AS2) modules, which are core components of retrieval-based Question Answering (QA) systems. Our work shows that a critical step in effectively exploiting an answer set is modeling the interrelated information between pairs of answers. For this purpose, we build a three-way multi-classifier, which decides if an answer supports, refutes, or is neutral with respect to another one. More specifically, our neural architecture integrates a state-of-the-art AS2 module with the multi-classifier, and a joint layer connecting all components. We tested our models on WikiQA, TREC-QA, and a real-world dataset. The results show that our models obtain the new state of the art in AS2.


Introduction
Automated Question Answering (QA) research has received renewed attention thanks to the diffusion of Virtual Assistants. Among the different types of methods to implement QA systems, we focus on Answer Sentence Selection (AS2) research, which originated from the TREC-QA track (Voorhees and Tice, 1999), as it proposes efficient models that are more suitable for a production setting, e.g., they are more efficient than those developed in machine reading (MR) work. Garg et al. (2020) proposed the TANDA approach based on pre-trained Transformer models, obtaining an impressive improvement over the state of the art for AS2, measured on the two most used datasets, WikiQA (Yang et al., 2015) and TREC-QA (Wang et al., 2007). However, TANDA was applied only to pointwise rerankers (PR), e.g., simple binary classifiers. Bonadiman and Moschitti (2020) tried to improve this model by jointly modeling all answer candidates with listwise methods, e.g., (Bian et al., 2017). Unfortunately, merging the embeddings from all candidates with standard approaches, e.g., CNN or LSTM, did not improve over TANDA.
A more structured approach to building joint models over sentences can instead be observed in Fact Verification Systems, e.g., the methods developed in the FEVER challenge (Thorne et al., 2018a). Such systems take a claim, e.g., "Joe Walsh was inducted in 2001", as input (see Tab. 1), and verify if it is valid, using related sentences called evidences (typically retrieved by a search engine). For example, $Ev_1$, "As a member of the Eagles, Walsh was inducted into the Rock and Roll Hall of Fame in 1998, and into the Vocal Group Hall of Fame in 2001", and $Ev_3$, "Walsh was awarded with the Vocal Group Hall of Fame in 2001", support the veracity of the claim. In contrast, $Ev_2$ is neutral as it describes who Joe Walsh is but does not contribute to establishing the induction. We conjecture that supporting evidence for answer correctness in the AS2 task can be modeled with a similar rationale.
In this paper, we design joint models for AS2 based on the assumption that, given q and a target answer candidate t, the other answer candidates, $(c_1, \ldots, c_k)$, can provide positive, negative, or neutral support to decide the correctness of t. Our first approach exploits Fact Checking research: we adapted a state-of-the-art FEVER system, KGAT (Liu et al., 2020), for AS2. We defined a claim as a pair constituted of the question and one target answer, while considering all the other answers as evidences. We re-trained and rebuilt all its embeddings for the AS2 task.
Our second method, Answer Support-based Reranker (ASR), is completely new. It is based on the representation of the pair (q, t), generated by state-of-the-art AS2 models, concatenated with the representation of all the pairs $(t, c_i)$. The latter summarizes the contribution of each $c_i$ to t using a max-pooling operation. Since the candidates are automatically retrieved, $c_i$ can be unrelated to (q, t) and may thus introduce just noise. To mitigate this problem, we use an Answer Support Classifier (ASC) to learn the relatedness between t and $c_i$ by classifying their embedding, which we obtain by applying a transformer network to their concatenated text. ASC tunes the $(t, c_i)$ embedding parameters according to the evidence that $c_i$ provides to t. ASR significantly improves the state of the art, and is also simpler than our approach based on KGAT.
Our third method is an extension of ASR. It should be noted that, although ASR exploits the information from the k candidates, it still produces a score for a target t without knowing the scores produced for the other target answers. Thus, we jointly model the representation obtained for each target in a multi-ASR (MASR) architecture, which can then carry out a complete global reasoning over all target answers.
We experimented with our models over three datasets, WikiQA, TREC-QA and WQA, where the latter is an internal dataset built on anonymized customer questions. The results show that:
• ASR improves the best current model for AS2, i.e., TANDA, by ∼3%, corresponding to an error reduction of 10% in Accuracy, on both WikiQA and TREC-QA.
• We also obtain a relative improvement of ∼3% over TANDA on WQA, confirming that ASR is a general solution to design accurate QA systems.
• Most interestingly, MASR improves ASR by an additional 2%, confirming the benefit of joint modeling.
Finally, it is interesting to mention that MASR improvement is also due to the use of FEVER data for pre-fine-tuning ASC, suggesting that the fact verification inference and the answer support inference are similar.

Problem definition and related work
We consider retrieval-based QA systems, which are mainly constituted by (i) a search engine, retrieving documents related to the questions; and (ii) an AS2 model, which reranks passages/sentences extracted from the documents. The top sentence is typically used as the final answer for the users.

Answer Sentence Selection (AS2)
The task of reranking answer-sentence candidates provided by a retrieval engine can be modeled with a classifier scoring the candidates. Let q be an element of the question set, Q, and $A = \{c_1, \ldots, c_n\}$ be a set of candidates for q. A reranker can be defined as $R: Q \times \Pi(A) \to \Pi(A)$, where $\Pi(A)$ is the set of all permutations of A. Previous work targeting ranking problems in the text domain has classified reranking functions into three buckets: pointwise, pairwise, and listwise methods.
Pointwise reranking: This approach learns $p(q, c_i)$, which is the probability of $c_i$ correctly answering q, using a standard binary classification setting. The final rank is simply obtained by sorting the $c_i$ based on $p(q, c_i)$. Previous work estimates $p(q, c_i)$ with neural models (Severyn and Moschitti, 2015), also using attention mechanisms, e.g., Compare-Aggregate (Yoon et al., 2019), inter-weighted alignment networks (Shen et al., 2017), and pre-trained Transformer models, which are the state of the art. Garg et al. (2020) proposed TANDA, which is currently the most accurate model on WikiQA and TREC-QA.
Pairwise reranking: The method considers binary classifiers of the form $\chi(q, c_i, c_j)$ for determining the partial rank between $c_i$ and $c_j$; the scoring function $p(q, c_i)$ is then obtained by summing up all the contributions with respect to the target candidate $t = c_i$, e.g., $p(q, c_i) = \sum_j \chi(q, c_i, c_j)$. There has been a large body of work preceding Transformer models, e.g., (Laskar et al., 2020; Tayyar Madabushi et al., 2018; Rao et al., 2016). However, these methods are largely outperformed by the pointwise TANDA model.
Listwise reranking: This approach, e.g., (Bian et al., 2017; Cao et al., 2007; Ai et al., 2018), aims at learning $p(q, \pi)$, $\pi \in \Pi(A)$, using the information on the entire set of candidates. The loss function for training such networks is constituted by the contribution of all elements of its ranked items. The closest work to our research is by Bonadiman and Moschitti (2020), who designed several joint models. These improved early neural networks based on CNN and LSTM for AS2, but failed to improve the state of the art using pre-trained Transformer models.
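The pairwise scoring rule $p(q, c_i) = \sum_j \chi(q, c_i, c_j)$ can be illustrated with a minimal sketch. The comparator `toy_chi` is a hypothetical word-overlap stand-in for a trained pairwise classifier, and skipping the term j = i is an assumption made here for simplicity.

```python
from typing import Callable, List


def pairwise_scores(q: str, cands: List[str],
                    chi: Callable[[str, str, str], float]) -> List[float]:
    """Score each candidate c_i as the sum over j != i of chi(q, c_i, c_j)."""
    return [
        sum(chi(q, ci, cj) for j, cj in enumerate(cands) if j != i)
        for i, ci in enumerate(cands)
    ]


def rerank(cands: List[str], scores: List[float]) -> List[str]:
    """Sort candidates by decreasing score (ties keep the original order)."""
    order = sorted(range(len(cands)), key=lambda i: -scores[i])
    return [cands[i] for i in order]


def word_overlap(q: str, s: str) -> int:
    """Number of distinct lower-cased words shared by q and s."""
    return len(set(q.lower().split()) & set(s.lower().split()))


def toy_chi(q: str, ci: str, cj: str) -> float:
    """Hypothetical comparator: 1.0 if c_i overlaps with q more than c_j does."""
    return 1.0 if word_overlap(q, ci) > word_overlap(q, cj) else 0.0


if __name__ == "__main__":
    q = "what color is the sky"
    cands = ["the sky is blue", "cats sleep", "what color is the sky today"]
    print(rerank(cands, pairwise_scores(q, cands, toy_chi)))
```

A pointwise reranker would instead score each $(q, c_i)$ independently and sort by that score, without any comparison between candidates.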

Joint Models in Question Answering
MR is a popular QA task that identifies an answer string in a paragraph or a text of limited size for a question. Its application to the retrieval scenario has also been studied (Hu et al., 2019; Kratzwald and Feuerriegel, 2018). However, the large volume of retrieved content makes its use not yet practical. Moreover, the joint modeling aspect of MR regards sentences from the same paragraphs. Jin et al. (2020) use the relation between candidates in a multi-task learning approach for AS2. However, they do not exploit transformer models, thus their results are below the state of the art. In contrast with the work above, our modeling is driven by an answer support strategy, where the pieces of information are taken from different documents. This sets our model apart and allows us to design innovative joint models, which have not yet been designed in any MR system.

Fact Verification for Question Answering
Fact verification has become a social need given the massive amount of information generated daily. The problem is, therefore, becoming increasingly important in the NLP context (Mihaylova et al., 2018).
In QA, answer verification is directly relevant due to its nature of content delivery (Mihaylova et al., 2019). The problem has been explored in the MR setting (Wang et al., 2018). Zhang et al. (2020a) also proposed fact checking for product questions using additional associated evidence sentences. The latter are retrieved based on similarity scores computed with both TF-IDF and sentence embeddings from pre-trained BERT models. While the process is technically sound, the retrieval of evidence is an expensive process, which is prohibitively costly to scale in production. We instead address this problem by leveraging the top answer candidates.

Baseline Models for AS2
In this section, we describe our baseline models, which are constituted by pointwise, pairwise, and listwise strategies.

Pointwise Models
One simple and effective method to build an answer selector is to use a pre-trained Transformer model, adding a simple classification layer to it, and fine-tuning the model on the AS2 task. Specifically, $q = \mathrm{Tok}^q_1, \ldots, \mathrm{Tok}^q_N$ and $c = \mathrm{Tok}^c_1, \ldots, \mathrm{Tok}^c_M$ are encoded in the input of the Transformer by delimiting them using three tags: [CLS], [SEP] and [EOS], inserted at the beginning, as separator, and at the end, respectively. This input is encoded as three embeddings based on tokens, segments and their positions, which are fed as input to several layers (up to 24). Each of them contains sublayers for multi-head attention, normalization and feed forward processing. The result of this transformation is an embedding, E, representing (q, c), which models the dependencies between words and segments of the two sentences.
For the downstream task, E is fed (after applying a non-linearity function) to a fully connected layer having weights W and B. The output layer can be used to implement the task function. For example, a softmax can be used to model the probability of the question/candidate pair classification, as: $p(q, c) = \mathrm{softmax}(W \times \tanh(E(q, c)) + B)$.
We can train this model with the cross-entropy loss $\mathcal{L} = -\sum_{l \in \{0,1\}} y_l \times \log(\hat{y}_l)$ on pairs of texts, where $y_l$ is the label indicating whether the answer is correct ($l = 1$) or incorrect ($l = 0$), $\hat{y}_1 = p(q, c)$, and $\hat{y}_0 = 1 - p(q, c)$. Training the Transformer from scratch requires a large amount of labeled data, but it can be pre-trained using masked language modeling and next sentence prediction tasks, for which labels can be automatically generated. Several methods for pre-training Transformer-based language models have been proposed, e.g., BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), XLNet (Yang et al., 2019), ALBERT (Lan et al., 2020).
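To make the classification layer and its loss concrete, the following is a minimal sketch in plain Python. It assumes that E is already available as a list of floats (in practice it would come from the Transformer), and that index 1 denotes the correct label and index 0 the incorrect one; the function names are illustrative, not taken from any released code.

```python
import math
from typing import List


def softmax(z: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def classify(E: List[float], W: List[List[float]], B: List[float]) -> List[float]:
    """Compute softmax(W x tanh(E) + B); W has one row per label."""
    h = [math.tanh(e) for e in E]
    logits = [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(W, B)]
    return softmax(logits)


def cross_entropy(y: List[float], y_hat: List[float]) -> float:
    """L = -sum_l y_l * log(y_hat_l), skipping terms with y_l = 0."""
    return -sum(t * math.log(p) for t, p in zip(y, y_hat) if t > 0)


if __name__ == "__main__":
    E = [0.5, -0.2]                      # toy embedding of (q, c)
    W = [[0.1, 0.3], [0.4, -0.2]]        # row 0: incorrect, row 1: correct
    B = [0.0, 0.0]
    y_hat = classify(E, W, B)
    print(y_hat, cross_entropy([0.0, 1.0], y_hat))
```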

Our joint model baselines
To better show the potential of our approach and the complexity of the task, we designed three joint model baselines: (i) a multi-classifier approach (a listwise method), (ii) a pairwise joint model operating over k + 1 candidates, and (iii) our adaptation of the KGAT model (a pairwise method).

Joint Model Multi-classifier
The first baseline is also a Transformer-based architecture: we concatenate the question with the top k + 1 answer candidates, i.e., (q [SEP] $c_1$ [SEP] $c_2$ ... [SEP] $c_{k+1}$), and provide this input to the same Transformer model used for pointwise reranking. We use the final hidden vector E corresponding to the first input token [CLS] generated by the Transformer, and a classification layer with weights $W \in \mathbb{R}^{(k+1) \times |E|}$, and train the model using a standard cross-entropy classification loss: $\mathcal{L} = -y \times \log(\mathrm{softmax}(E W^T))$, where y is a one-hot vector representing labels for the k + 1 candidates, i.e., |y| = k + 1. We use a transformer model fine-tuned with the TANDA-RoBERTa-base or large models, i.e., RoBERTa models fine-tuned on ASNQ (Garg et al., 2020). The scores for the candidate answers are calculated as $(p(c_1), \ldots, p(c_{k+1})) = \mathrm{softmax}(E W^T)$. Then, we rerank the $c_i$ according to their probability.
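The input construction and scoring of this baseline can be sketched as follows. This is only an illustration under simplifying assumptions: the [CLS] vector E is passed in as a plain list rather than computed by a Transformer, and the helper names are hypothetical.

```python
import math
from typing import List, Sequence


def build_joint_input(q: str, cands: Sequence[str], sep: str = "[SEP]") -> str:
    """Concatenate the question and the k+1 candidates with separator tags."""
    return f" {sep} ".join([q, *cands])


def listwise_scores(E: List[float], W: List[List[float]]) -> List[float]:
    """softmax(E W^T), where W has k+1 rows, one per candidate position."""
    logits = [sum(w * e for w, e in zip(row, E)) for row in W]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [x / total for x in exps]


def rerank(cands: Sequence[str], scores: Sequence[float]) -> List[str]:
    """Order candidates by decreasing probability."""
    order = sorted(range(len(cands)), key=lambda i: -scores[i])
    return [cands[i] for i in order]


if __name__ == "__main__":
    cands = ["a", "b", "c"]
    print(build_joint_input("q", cands))
    scores = listwise_scores([1.0, 0.0], [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0]])
    print(rerank(cands, scores))
```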

Joint Model Pairwise
Our second baseline is similar to the first. We concatenate the question with each $c_i$ to constitute the $(q, c_i)$ pairs, which are input to the Transformer, and we use the first input token [CLS] as the representation of each $(q, c_i)$ pair. Then, we concatenate the embedding of the pair containing the target candidate, (q, t), with the [CLS] embeddings of all the other candidates. (q, t) is always in the first position. We train the model using a standard classification loss. At classification time, we select one target candidate at a time, and set it in the first position, followed by all the others. We classify all k + 1 candidates and use their scores for reranking them. It should be noted that to qualify as a pairwise approach, Joint Model Pairwise should use a ranking loss. However, we always use the standard cross-entropy loss as it is more efficient and the difference in performance is negligible.
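The "target first, then all the others" layout can be sketched as below. The embeddings are toy lists standing in for the [CLS] vectors of the $(q, c_i)$ pairs, and keeping the non-target candidates in their original order is an assumption of this sketch.

```python
from typing import List


def joint_pairwise_input(cls_embs: List[List[float]], target_idx: int) -> List[float]:
    """Concatenate [CLS] embeddings with the target pair (q, t) in first position."""
    ordered = [cls_embs[target_idx]] + [
        emb for i, emb in enumerate(cls_embs) if i != target_idx
    ]
    return [x for emb in ordered for x in emb]


if __name__ == "__main__":
    embs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # k + 1 = 3 candidates
    # Each candidate takes a turn as the target, as done at classification time.
    for t in range(len(embs)):
        print(t, joint_pairwise_input(embs, t))
```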

Joint Model with KGAT
Liu et al. (2020) presented an interesting model, Kernel Graph Attention Network (KGAT), for fact verification: given a claimed fact f, and a set of evidences $Ev = \{ev_1, ev_2, \ldots, ev_m\}$, their model carries out joint reasoning over Ev, e.g., aggregating information to estimate the probability of f being true or false, $p(y|f, Ev)$, where $y \in \{\text{true}, \text{false}\}$.
The approach is based on a fully connected graph, G, whose nodes are the $n_i = (f, ev_i)$ pairs, and $p(y|f, Ev) = \sum_{i=1}^{m} p(y|f, ev_i, Ev)\, p(ev_i|f, Ev)$, where $p(y|f, ev_i, Ev) = p(y|n_i, G)$ is the label probability in each node i conditioned on the whole graph, and $p(ev_i|f, Ev) = p(n_i|G)$ is the probability of selecting the most informative evidence. KGAT uses an edge kernel to perform a hierarchical attention mechanism, which propagates information between nodes and aggregates evidence.
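The aggregation step, i.e., the sum over nodes of the per-node label probability weighted by the node selection probability, can be illustrated with the sketch below. The numbers are made up, and the two inputs would in practice be produced by KGAT's attention layers rather than given directly.

```python
from typing import Dict, List


def aggregate(node_label_probs: List[Dict[str, float]],
              node_select_probs: List[float]) -> Dict[str, float]:
    """p(y|f, Ev) = sum_i p(y|n_i, G) * p(n_i|G)."""
    labels = node_label_probs[0].keys()
    return {
        y: sum(p[y] * s for p, s in zip(node_label_probs, node_select_probs))
        for y in labels
    }


if __name__ == "__main__":
    # Hypothetical per-node outputs for two evidence nodes.
    label_probs = [{"true": 0.9, "false": 0.1}, {"true": 0.2, "false": 0.8}]
    select_probs = [0.75, 0.25]          # must sum to 1
    print(aggregate(label_probs, select_probs))
```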
We built a KGAT model for AS2 as follows: we replace (i) $ev_i$ with the set of candidate answers $c_i$, and (ii) the claim f with the question and a target answer pair, (q, t). KGAT constructs the evidence graph G by using each claim-evidence pair as a node, which, in our case, is $((q, t), c_i)$, and connects all node pairs with edges, making it a fully-connected evidence graph. This way, sentence and token attention operate over the triplets, $(q, t, c_i)$, establishing semantic links, which can help to support or undermine the correctness of t.
The original KGAT aggregates all the pieces of information we built, based on their relevance, to determine the probability of t. As we use AS2 data, the probability will be about the correctness of t. In more detail, we initialize the node representation using the contextual embeddings obtained with two TANDA-RoBERTa-base models: the first produces the embedding of (q, t), while the second outputs the embedding of $(q, c_i)$. Then, we apply a max-pooling operation on these two to get the final node representation. The rest of the architecture is identical to the original KGAT. Finally, at test time, we select one $c_i$ at a time as the target t, and compute its probability, which is used to rank $c_i$.
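The node construction and initialization just described can be sketched as follows. The embeddings are toy vectors, and interpreting the max-pooling over the two embeddings as an element-wise maximum is an assumption of this sketch.

```python
from typing import List, Tuple

Node = Tuple[Tuple[str, str], str]


def build_nodes(q: str, cands: List[str], target_idx: int) -> List[Node]:
    """One node ((q, t), c_i) for every candidate other than the target t."""
    t = cands[target_idx]
    return [((q, t), c) for i, c in enumerate(cands) if i != target_idx]


def init_node(emb_qt: List[float], emb_qc: List[float]) -> List[float]:
    """Element-wise max over the (q, t) and (q, c_i) embeddings."""
    return [max(a, b) for a, b in zip(emb_qt, emb_qc)]


if __name__ == "__main__":
    cands = ["a", "b", "c"]
    # At test time each candidate is selected in turn as the target.
    for idx in range(len(cands)):
        print(build_nodes("q", cands, idx))
    print(init_node([0.1, 0.9], [0.5, 0.2]))
```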

Joint Answer Support Models for AS2
We propose the Answer Support Reranker (ASR), which uses an answer pair classifier to provide evidence to a target answer t. Given a question q, and a subset of its top k + 1 ranked answer candidates, A (reranked by an AS2 model), we build a function, $\sigma: Q \times C \times C^k \to \mathbb{R}$, such that $\sigma(q, t, A \setminus \{t\})$ provides the probability of t being correct, where C is the set of sentence candidates. We also design a multi-classifier, MASR, which combines k + 1 ASR models, one for each different target answer.

Answer Support-based Reranker (ASR)
We developed the ASR architecture described in Figure 1c. This consists of three main components:
1. A Pointwise Reranker (PR), which provides the embedding of the input (q, t), described in Figure 1a. This is essentially the state-of-the-art AS2 model based on the TANDA approach applied to the RoBERTa pre-trained transformer.
2. The Answer Support Classifier (ASC), which we use to reduce the noise that may be introduced by irrelevant $c_i$. It classifies each $(t, c_i)$ into one of the following four classes: 0: t and $c_i$ are both correct; 1: t is correct while $c_i$ is not; 2: vice versa; and 3: both are incorrect. This multi-classifier, described in Figure 1b, is built on top of a RoBERTa Transformer, which produces a PairWise Representation (PWR). ASC is trained end-to-end with the rest of the network in a multi-task learning fashion, using its specific cross-entropy loss, computed with the labels above.
3. The ASR (see Figure 1c), which uses the joint representation of (q, t) with $(t, c_i)$, $i = 1, \ldots, k$, where t and $c_i$ are the top candidates reranked by PR. The k representations are summarized by applying a max-pooling operation, which aggregates all the supporting or non-supporting properties of the candidates with respect to the target answer. The concatenation of the PR embedding with the max-pooling embedding is given as input to the final classification layer, which scores t with respect to q, also using the information from the other candidates. For training and testing, we select one t at a time from the k + 1 candidates of q, and compute its score. This way, we can rerank all the k + 1 candidates with their scores.
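The four ASC classes listed in the second component can be derived directly from the gold correctness of each answer. A minimal sketch, with hypothetical helper names, is:

```python
from typing import List, Tuple


def asc_label(t_correct: bool, c_correct: bool) -> int:
    """Four-way ASC label for an answer pair (t, c_i)."""
    if t_correct and c_correct:
        return 0          # both correct
    if t_correct:
        return 1          # t correct, c_i incorrect
    if c_correct:
        return 2          # t incorrect, c_i correct
    return 3              # both incorrect


def asc_pairs(gold: List[bool], target_idx: int) -> List[Tuple[int, int, int]]:
    """(target index, candidate index, ASC label) for all c_i other than t."""
    return [
        (target_idx, i, asc_label(gold[target_idx], g))
        for i, g in enumerate(gold)
        if i != target_idx
    ]


if __name__ == "__main__":
    gold = [True, False, True]           # toy correctness labels for k + 1 = 3
    print(asc_pairs(gold, 0))
    print(asc_pairs(gold, 1))
```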
Implementation details: ASR is a PR that also exploits the relation between t and $A \setminus \{t\}$. We use RoBERTa to generate the [CLS] embedding of (q, t), $E_t \in \mathbb{R}^d$. We denote with $\hat{E}_j$ the [CLS] output by another RoBERTa Transformer applied to answer pairs, i.e., $(t, c_j)$. Then, we concatenate $E_t$ to the max-pooling tensor from $\hat{E}_1, \ldots, \hat{E}_k$:
$V = [E_t : \mathrm{Maxpool}([\hat{E}_1, \ldots, \hat{E}_k])]$ (1)
where $V \in \mathbb{R}^{2d}$ is the final representation of the target answer t. Then, we use a standard feedforward network to implement a binary classification layer: $p(y_i|q, t, C_k) = \mathrm{softmax}(V W^T + B)$, where $W \in \mathbb{R}^{2 \times 2d}$ and B are parameters to transform the representation of the target answer t from dimension 2d to dimension 2, which represents correct or incorrect labels.
ASC labels: There can be different interpretations when attempting to define labels for answer pairs. An alternative to the definition illustrated above is to use the following FEVER-compatible encoding: 0: t is correct, while $c_i$ can be any value, as an incorrect $c_i$ may also provide important context (corresponding to the FEVER Supported label); 1: t is incorrect and $c_i$ is correct, since $c_i$ can provide evidence that t is not similar to a correct answer (corresponding to the FEVER Refuted label); and 2: both are incorrect, in which case nothing can be said (corresponding to the neutral FEVER label, Not Enough Info).
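The ASR representation and its binary classification layer can be sketched in plain Python as follows. The vectors are toy values standing in for the RoBERTa [CLS] outputs, and the dimension is d = 2 only for readability.

```python
import math
from typing import List


def max_pool(vectors: List[List[float]]) -> List[float]:
    """Element-wise maximum over the k pair embeddings."""
    return [max(col) for col in zip(*vectors)]


def asr_representation(E_t: List[float], E_hat: List[List[float]]) -> List[float]:
    """V = [E_t : Maxpool(E_hat_1, ..., E_hat_k)], a vector of size 2d."""
    return list(E_t) + max_pool(E_hat)


def asr_probs(V: List[float], W: List[List[float]], B: List[float]) -> List[float]:
    """softmax(V W^T + B) with W of shape 2 x 2d."""
    logits = [sum(w * v for w, v in zip(row, V)) + b for row, b in zip(W, B)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    E_t = [0.1, 0.2]                          # embedding of (q, t)
    E_hat = [[0.5, -1.0], [0.3, 0.4]]         # embeddings of (t, c_1), (t, c_2)
    V = asr_representation(E_t, E_hat)
    W = [[0.1, 0.0, -0.2, 0.3], [0.0, 0.2, 0.4, -0.1]]
    print(V, asr_probs(V, W, [0.0, 0.0]))
```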

Multi-Answer Support Reranker (MASR)
ASR still selects answers with a pointwise approach. This means that we can improve it by building a listwise model that selects the best answer for each question by utilizing the information from all target answers. In particular, the architecture of MASR, shown in Figure 1d, is made up of two parts: (i) a list of k + 1 ASR blocks, in which each ASR block provides the representation of a target answer t; and (ii) a final multi-classifier and a softmax function, which score each t from the concatenation of the k + 1 embeddings and select the one with the highest score. For training and testing, we select the t from the k + 1 candidates of q based on the softmax output.
Implementation details: The goal of MASR is to measure the relation between k + 1 target answers, $t_0, \ldots, t_k$. The representation of each target answer is the embedding $V \in \mathbb{R}^{2d}$ from Equation 1 in ASR. Then, we concatenate the hidden vectors of the k + 1 target answers to form a matrix $V_{(q,k+1)} \in \mathbb{R}^{(k+1) \times 2d}$. We use this matrix and a classification layer with weights $W \in \mathbb{R}^{2d}$, and compute a standard multi-class classification loss:
$\mathcal{L}_{MASR} = -y \times \log(\mathrm{softmax}(V_{(q,k+1)} W^T))$
where y is a one-hot vector, and |y| = k + 1.
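The MASR scoring step can be sketched as a softmax over one logit per target answer, followed by a multi-class cross-entropy on the index of the gold answer. This is a simplified illustration: the rows of the matrix are toy vectors instead of ASR outputs, and the function names are hypothetical.

```python
import math
from typing import List


def masr_scores(V_matrix: List[List[float]], W: List[float]) -> List[float]:
    """softmax over the k+1 logits obtained as V_(q,k+1) W^T, with W in R^{2d}."""
    logits = [sum(v * w for v, w in zip(row, W)) for row in V_matrix]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masr_loss(scores: List[float], gold_idx: int) -> float:
    """Cross-entropy with a one-hot y: minus the log-score of the gold target."""
    return -math.log(scores[gold_idx])


if __name__ == "__main__":
    V_matrix = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # k + 1 = 3 targets
    W = [2.0, 1.0]
    scores = masr_scores(V_matrix, W)
    best = max(range(len(scores)), key=lambda i: scores[i])
    print(scores, best, masr_loss(scores, gold_idx=0))
```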

Experiments
In these experiments, we compare our models: KGAT, ASR and MASR with pointwise models, which are the state of the art for AS2. We also compare them with our joint model baselines (pairwise and listwise). Finally, we provide an error analysis.

Datasets
We used the two most popular AS2 datasets, and one real-world application dataset we built to test the generality of our approach.
WikiQA is a QA dataset (Yang et al., 2015) containing a sample of questions and answer-sentence candidates from Bing query logs over Wikipedia. The answers are manually labeled. We follow the most used setting: training with all the questions that have at least one correct answer, and validating and testing with all the questions having at least one correct and one incorrect answer.
TREC-QA is another popular QA benchmark by Wang et al. (2007). We use the same splits of the original data, following the common setting of previous work, e.g., (Garg et al., 2020).
WQA, the Web-based Question Answering dataset, was built by Alexa AI as part of the effort to improve understanding and benchmarking in QA systems. The creation process includes the following steps: (i) given a set of questions we collected from the web, a search engine is used to retrieve up to 1,000 web pages from an index containing hundreds of millions of pages. (ii) From the set of retrieved documents, all candidate sentences are extracted and ranked using AS2 models from (Garg et al., 2020). Finally, (iii) top candidates for each question are manually assessed as correct or incorrect by human judges. This allowed us to obtain a richer variety of answers from multiple sources with a higher average number of answers. Table 2 reports the corpus statistics of WikiQA, TREC-QA, and WQA.
FEVER is a large-scale public corpus, proposed by Thorne et al. (2018a) for the fact verification task, consisting of 185,455 annotated claims from 5,416,537 documents from the Wikipedia dump in June 2017. All claims are labelled as Supported, Refuted or Not Enough Info by annotators. Table 3 shows the statistics of the dataset, which remains the same as in (Thorne et al., 2018b).

Training and testing details
Metrics: The performance of QA systems is typically measured with Accuracy in providing correct answers, i.e., the percentage of correct responses. This is also referred to as Precision-at-1 (P@1) in the context of reranking, while standard Precision and Recall are not essential in our case as we assume the system does not abstain from providing answers. We also use Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR) evaluated on the test set, using the entire set of candidates for each question.
Table 4: Results on WikiQA, TREC-QA and WQA, using RoBERTa base Transformer. † is used to indicate that the difference in P@1 between ASR and the other marked systems is statistically significant at 95%.
Models: We use the pre-trained RoBERTa-Base (12 layer) and RoBERTa-Large-MNLI (24 layer) models, which were released as checkpoints for use in downstream tasks (footnote 4).
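For reference, the three ranking metrics named in this section (P@1, MAP and MRR) can be computed as in the sketch below. It assumes that, for each question, the candidates are already sorted by the model score and represented by binary relevance labels; questions without any correct answer contribute 0 to AP and RR, which is a convention chosen here for simplicity.

```python
from typing import Dict, List


def precision_at_1(labels: List[int]) -> float:
    """1.0 if the top-ranked candidate is correct, else 0.0."""
    return float(labels[0])


def average_precision(labels: List[int]) -> float:
    """Mean of precision@rank over the ranks of the correct answers."""
    hits, total = 0, 0.0
    for rank, label in enumerate(labels, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def reciprocal_rank(labels: List[int]) -> float:
    """1 / rank of the first correct answer (0.0 if there is none)."""
    for rank, label in enumerate(labels, start=1):
        if label:
            return 1.0 / rank
    return 0.0


def evaluate(ranked_labels: List[List[int]]) -> Dict[str, float]:
    """Average each metric over all questions."""
    n = len(ranked_labels)
    return {
        "P@1": sum(precision_at_1(l) for l in ranked_labels) / n,
        "MAP": sum(average_precision(l) for l in ranked_labels) / n,
        "MRR": sum(reciprocal_rank(l) for l in ranked_labels) / n,
    }


if __name__ == "__main__":
    print(evaluate([[0, 1, 1], [1, 0, 0]]))
```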

Reranker training
We adopt the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 2e-5 for the transfer step on the ASNQ dataset (Garg et al., 2020), and a learning rate of 1e-6 for the adapt step on the target dataset. We apply early stopping on the development set of the target corpus for both fine-tuning steps based on the highest MAP score. We set the max number of epochs equal to 3 and 9 for the adapt and transfer steps, respectively. We set the maximum sequence length for RoBERTa to 128 tokens.
KGAT and ASR training: Again, we use the Adam optimizer with a learning rate of 2e-6 for training the ASR model on the target dataset. We utilize one Tesla V100 GPU with 32GB memory and a train batch size of eight. We set the maximum sequence length for RoBERTa Base/Large to 130 tokens and the number of training epochs to 20. The other training configurations are the same as those of the original KGAT model from (Liu et al., 2020). We use two transformer models for ASR: a RoBERTa Base/Large for PR, and one for ASC. We set the maximum sequence length for RoBERTa to 128 tokens and the number of epochs to 20.
Footnote 4: https://github.com/pytorch/fairseq
MASR training: We use the same configuration as for ASR training, including the optimizer type, learning rate, the number of epochs, GPU type, maximum sequence length, etc. Additionally, we design two different models: MASR-F, which uses an ASC classifier targeting the FEVER labels, and MASR-FP, which initializes ASC with the data from FEVER. This is possible as the labels are compatible.
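The FEVER-compatible label encoding that MASR-F targets (three classes, as defined for the ASC labels earlier) can be sketched as a simple mapping from gold correctness. The function name is hypothetical, and the comments use the support/refute/neutral reading of the classes.

```python
def fever_compatible_label(t_correct: bool, c_correct: bool) -> int:
    """Three-way ASC label for an answer pair (t, c_i)."""
    if t_correct:
        return 0      # support: t is correct, c_i can be anything
    if c_correct:
        return 1      # refute: t is incorrect while c_i is correct
    return 2          # neutral: both are incorrect


if __name__ == "__main__":
    for t_ok in (True, False):
        for c_ok in (True, False):
            print(t_ok, c_ok, fever_compatible_label(t_ok, c_ok))
```

Because this encoding has the same number of classes as FEVER, an ASC head trained on FEVER claims can in principle be reused without changing its output layer, which is what makes the MASR-FP initialization possible.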

Choosing the best k
The selection of the hyper-parameter k, i.e., the number of candidates to consider for supporting a target answer, is rather tricky. Indeed, the standard validation set is typically used for tuning PR. This means that the candidates PR moves to the top k + 1 positions are optimistically accurate. Thus, when also selecting the optimal k on the same validation set, there is a high risk of overfitting the model.
We solved this problem by running a PR version not heavily optimized on the dev. set, i.e., we randomly choose a checkpoint after the standard three epochs of fine-tuning of the RoBERTa transformer. Additionally, we tuned k only using the WQA dev. set, which contains ∼36,000 Q/A pairs. WikiQA and TREC-QA dev. sets are too small to be used (121 and 65 questions, respectively). Fig. 2 plots the improvement of four different models, Joint Model Multi-classifier, Joint Model Pairwise, KGAT, and ASR, when using different k values. Their best results are reached for k = 5, 3, 2, and 3, respectively. We note that the most reliable curve shape (convex) is the one of ASR and Joint Model Pairwise.

Comparative Results
Table 4 reports the P@1, MAP and MRR of the rerankers, and different answer supporting models on WikiQA, TREC-QA and WQA datasets. As WQA is an internal dataset, we only report the improvement over PR in the tables. All models use the RoBERTa-Base pre-trained checkpoint and start from the same set of k candidates reranked by PR (state-of-the-art model). The table shows that:
• PR replicates the MAP and MRR of the state-of-the-art reranker by Garg et al. (2020) on WikiQA.
• Joint Model Multi-classifier performs lower than PR for all measures and all datasets. This is in line with the findings of Bonadiman and Moschitti (2020), who also did not obtain any improvement when jointly using all the candidates together in a single representation.
• Joint Model Pairwise differs from ASR as it concatenates the embeddings of the $(q, c_i)$ pairs, instead of using max-pooling, and does not use any Answer Support Classifier (ASC). Still, it exploits the idea of aggregating the information of all pairs $(q, c_i)$ with respect to a target answer t, which proves to be effective, as the model improves on PR over all measures and datasets.
• Our KGAT version for AS2 also improves PR over all datasets and almost all measures, confirming that the idea of using candidates as support of the target answer is generally valid. However, it is not superior to Joint Model Pairwise.
• ASR achieves the highest performance among all models (except MASR-FP on WQA), on all datasets and all measures. For example, it outperforms PR by almost 3 absolute percent points in P@1 on WikiQA, and by almost 6 points on TREC-QA, from 91.18% to 97.06%, which corresponds to an error reduction of about 67% (the error rate drops from 8.82% to 2.94%).
• MASR-FP, which exploits FEVER for the initialization of ASC, performs better than MASR and MASR-F on WikiQA and TREC-QA. Interestingly, it significantly outperforms ASR by 2% on WQA. This confirms the potential of the model when enough training data is available.
• We perform a randomization test (Yeh, 2000) to verify if the models significantly differ in terms of prediction outcome. We use 100,000 trials for each calculation. The results confirm the statistically significant difference between ASR and all the baselines, with p < 0.05 for WikiQA, and between ASR and all models (i.e., including also KGAT) on WQA.
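A paired approximate randomization test of the kind referred to in the last bullet can be sketched as follows. This is a generic version operating on per-question P@1 outcomes (1 = correct top answer, 0 = wrong); the exact statistic and setup used in the experiments are not specified here, so the details below (two-sided test, add-one smoothing, fixed seed) are assumptions.

```python
import random
from typing import Sequence


def randomization_test(a: Sequence[int], b: Sequence[int],
                       trials: int = 100_000, seed: int = 0) -> float:
    """Estimate a two-sided p-value for the difference in P@1 of systems A and B."""
    assert len(a) == len(b)
    rng = random.Random(seed)
    observed = abs(sum(a) - sum(b))
    at_least_as_extreme = 0
    for _ in range(trials):
        diff = 0
        for x, y in zip(a, b):
            # Under the null hypothesis the two outputs are exchangeable,
            # so they are swapped with probability 0.5.
            if rng.random() < 0.5:
                x, y = y, x
            diff += x - y
        if abs(diff) >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (trials + 1)


if __name__ == "__main__":
    sys_a = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]   # toy per-question outcomes
    sys_b = [1, 0, 1, 0, 0, 1, 1, 0, 0, 1]
    print(randomization_test(sys_a, sys_b, trials=10_000))
```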

Official State of the art
As the state of the art for AS2 is obtained using RoBERTa Large, we trained KGAT and ASR using this pre-trained language model. Table 5 also reports the comparison with PR, which is the official state of the art. Again, our PR replicates the results of Garg et al. (2020), obtaining slightly lower performance on WikiQA but higher on TREC-QA. KGAT performs lower than PR on both datasets. ASR establishes the new state of the art on WikiQA with a MAP of 92.80 vs. 92.00. The P@1 also significantly improves by 2%, i.e., achieving 89.71, which is impressively high. Also, on TREC-QA, ASR outperforms all models, being on par with PR regarding P@1. The latter is 97.06, which corresponds to mistaking the answers of only two questions. We manually checked these and found that they were two annotation errors: ASR achieves perfect accuracy while PR mistakes only one answer. Of course, this just provides evidence that PR based on RoBERTa-Large solves the task of selecting the best answers (i.e., measuring P@1 on this dataset is not meaningful anymore).
[Table caption, beginning missing in the source: ... Sec. 4.1). ACC is the overall accuracy, while F1 refers to category 0.]
We note that ASC in MASR-FP achieves the highest accuracy with respect to the average over all datasets. This happens since we pre-fine-tuned it with the FEVER data.

Model Discussion
We analyzed examples for which ASR is correct and PR is not. Tab. 7 shows that, given q and k = 3 candidates, PR chooses $c_1$, a suitable but wrong answer. This probably happens since the answer best matches the syntactic/semantic pattern of the question, which asks for a type of color; indeed, the answer offers such a type, primary colors. PR does not rely on any background information that can support the set of colors in the answer. In contrast, ASR selects $c_2$ as it can rely on the support of other answers. Its ASC provides an average score for category 0 (both members are correct) of $c_2$, i.e., $\frac{1}{k} \sum_{i \neq 2} \mathrm{ASC}(c_2, c_i) = 0.653$, while for $c_1$ the average score is significantly lower, i.e., 0.522. This provides higher support for $c_2$, which is used by ASR to rerank the output of PR.
Tab. 8 shows an interesting case where all the sentences contain the required information, i.e., February. However, PR and ASR both choose answer $c_0$, which is correct but not natural, as it provides the requested information indirectly. Also, it contains a lot of ancillary information. In contrast, MASR is able to rerank the best answer, $c_1$, in the top position.
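The average support score used in this analysis, i.e., the mean ASC probability of category 0 over the pairs formed by a target and each of the other candidates, can be sketched as below. The probabilities in the example are hypothetical and are not the values behind the 0.653 and 0.522 figures reported above.

```python
from typing import Dict


def average_support(class0_probs: Dict[int, Dict[int, float]], target: int) -> float:
    """Mean of ASC(c_target, c_i) for category 0 over all i != target."""
    others = [p for i, p in class0_probs[target].items() if i != target]
    return sum(others) / len(others)


if __name__ == "__main__":
    # Hypothetical category-0 probabilities: class0_probs[t][i] = ASC(c_t, c_i).
    class0_probs = {
        1: {2: 0.40, 3: 0.50, 4: 0.60},
        2: {1: 0.50, 3: 0.70, 4: 0.90},
    }
    for t in (1, 2):
        print(t, round(average_support(class0_probs, t), 3))
```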

Conclusion
We have proposed new joint models for AS2. ASR encodes the relation between the target answer and all the other candidates, using an additional Transformer model, and an Answer Support Classifier, while MASR jointly models the ASR representations for all target answers. We extensively tested KGAT, ASR, MASR, and other joint model baselines we designed.
The results show that our models can outperform the state of the art. Most interestingly, ASR consistently outperforms all the models (but MASR-FP), on all datasets, through all measures, and for both base and large transformers. For example, ASR
q: What kind of colors are in the rainbow?
c1: Red, yellow, and blue are called the primary colors.
c2: The order of the colors in the rainbow goes: red, orange, yellow, green, blue, indigo and violet.
c3: The colors in all rainbows are present in the same order: red, orange, yellow, green, blue, indigo, and violet.
c4: A rainbow occurs when white light bends and separates into red, orange, yellow, green blue, indigo and violet.

q: What's the month of Valentine's day?
c0: Celebrated on February 14 every year, saint Valentine's day or Valentine's day is the traditional day on which lovers convey their love to each other by sending Valentine's cards, sometimes even anonymously.
c1: February is historically chosen to be the month of love and romance and the month to celebrate Valentine's day.
c2: In order for today to be Valentine's day, it's necessary that today is in the month of February.
c3: Every year, Valentine's day is celebrated on February 14 in many countries around the world.