AutoEQA: Auto-Encoding Questions for Extractive Question Answering

There has been significant progress in the field of extractive question answering (EQA) in recent years. However, most approaches rely on annotations of answer spans in the corresponding passages. In this work, we address the problem of EQA when no annotations are present for the answer span, i.e., when the dataset contains only questions and corresponding passages. Our method is based on auto-encoding of the question: it performs a question answering (QA) task during encoding and a question generation (QG) task during decoding. Our method performs well in a zero-shot setting and can provide an additional loss to boost performance for EQA.


Introduction
Extractive question answering (EQA) is the task of finding an answer span to a question in a context paragraph. Most deep learning models for this task perform well when annotated data is available. Scaling such models to new domains often requires the creation of new datasets (d'Hoffschmidt et al., 2020; Lim et al., 2019; Trischler et al., 2017; Kwiatkowski et al., 2019). However, collecting labels for these corpora is expensive and time-consuming, involving multiple steps such as article curation and question and answer sourcing. Alleviating the annotation effort for any of these steps is of both research and practical interest. In this work, we address the problem of extracting answer spans to a question from unannotated context paragraphs.
Some works have already been proposed to solve EQA in both semi-supervised and unsupervised settings. Unsupervised methods focus on creating a synthetic corpus and then training a supervised model on it (Lewis et al., 2019). Semi-supervised methods focus on different pre-training tasks that improve the initialization of EQA models (Dhingra et al., 2018; Glass et al., 2020; Ram et al., 2021). Our work can be categorized as the latter, with one key difference: it further performs question answering without annotations on answer spans. To validate our approach, we use the pre-trained BERT (Devlin et al., 2019) model on SQuAD (Rajpurkar et al., 2016).

[Figure 1: Schematic diagram of the proposed auto-encoding scheme. To the right is the semi-diagonal mask on the self-attention layers for the decoding step, which enables the uni-directional language model of the question. We assume a latent distribution over possible answer spans, approximated by candidate phrases. See §2 for details.]
Specifically, our method employs a conditional auto-encoding scheme that reconstructs the question given a passage while assuming a latent distribution over answer phrases. The encoder of our model is a question answering (QA) model that jointly encodes the context and the question to estimate the probability distribution over possible answer spans. This distribution is given, along with the passage, as input to the decoder, which is a question generation (QG) model. We use a shared architecture for both the encoder and the decoder. Therefore, our model can be viewed as a self-supervised machine comprehension model that learns from itself. We list our contributions as follows:
• We propose a novel method to perform unsupervised answer span extraction given a corpus of questions and associated paragraphs.
• We obtain an accuracy of 90% on unsupervised answer sentence selection.
• We obtain strong results (34.3 EM, 53.4 F1 on SQuAD dev set) for EQA when there is no annotation on the answer spans (Rajpurkar et al., 2016).

Method
Our model can be characterized as a discrete conditional variational auto-encoder (CVAE), where we seek to maximize the likelihood of the question given the context, $p_\theta(Q \mid c)$, under the assumption that there exists a latent answer-span variable. We can then maximize the log-likelihood of $p_\theta(Q \mid c)$ via the Evidence Lower Bound (ELBO) (Kingma and Welling, 2014):

$$\log p_\theta(Q \mid c) \;\geq\; \mathbb{E}_{a \sim q_\phi(a \mid Q, c)}\big[\log p_\theta(Q \mid a, c)\big] \;-\; \mathrm{KL}\big(q_\phi(a \mid Q, c)\,\|\,p(a \mid c)\big) \qquad (1)$$

where $Q$ is the question, $c$ is the context, $q_\phi$ is the inference network, which estimates the probability of an answer $a$ given the question and context, and $p_\theta$ is the decoder model that estimates the distribution $p_\theta(Q \mid a, c)$. In our case, since the architecture is shared, $\theta$ and $\phi$ represent the same set of parameters. Our auto-encoding scheme consists of three modules: a phrase extractor, an encoder, and a decoder, as shown in Figure 1.
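To make the objective concrete, here is a minimal PyTorch sketch of the discrete ELBO in Eq. (1) over a finite set of candidate answer phrases. The uniform prior $p(a \mid c)$ is our assumption; the paper does not specify one.

```python
import torch

def elbo(log_p_q_given_a: torch.Tensor, q_a: torch.Tensor,
         eps: float = 1e-12) -> torch.Tensor:
    """Discrete ELBO over K candidate answer phrases.

    log_p_q_given_a: [K] log p_theta(Q | a_k, c) for each candidate a_k.
    q_a: [K] posterior q_phi(a_k | Q, c); entries sum to 1.
    """
    recon = (q_a * log_p_q_given_a).sum()            # E_q[log p_theta(Q | a, c)]
    prior = torch.full_like(q_a, 1.0 / q_a.numel())  # assumed uniform prior p(a | c)
    kl = (q_a * ((q_a + eps).log() - prior.log())).sum()
    return recon - kl                                # lower bound to maximize
```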

Phrase Extractor
For EQA, given that there is no supervised signal for answer spans, an exhaustive search over all possible phrases would be sub-optimal, as many phrases are not suitable answers to natural language questions (Trischler et al., 2017; Joshi et al., 2017). We limit the potential answer phrases to named entities and constituents from constituency trees. We allow overlapping answer phrases in the candidate set; this is necessary because sub-phrases of a phrase can be answers to different questions. We further remove phrases that overlap with the question, since such phrases are more useful for generating the question than as candidate answers. With the chosen phrases, the best achievable performance on SQuAD is 70% EM and 88% F1; these results serve as an upper bound on our model's performance. A sketch of such an extraction step is shown below.
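The following sketch shows one way to implement the candidate-phrase extraction just described. It is a simplified stand-in, not the paper's exact pipeline: spaCy named entities and noun chunks approximate "named entities and tags from constituency trees" (a constituency parser such as benepar could be substituted), and question overlap is checked at the content-word level.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def candidate_phrases(paragraph: str, question: str) -> list:
    """Return (start_char, end_char, text) triples of candidate answer phrases."""
    doc = nlp(paragraph)
    q_words = {t.text.lower() for t in nlp(question) if not t.is_stop}
    candidates = []
    # Named entities and noun chunks approximate the constituents used in the paper.
    for span in list(doc.ents) + list(doc.noun_chunks):
        # Drop phrases overlapping with the question: they are better evidence
        # for generating the question than candidate answers.
        if {t.text.lower() for t in span} & q_words:
            continue
        candidates.append((span.start_char, span.end_char, span.text))
    return candidates  # overlapping spans are deliberately kept
```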

Encoder
Our encoder is a pre-trained BERT (Devlin et al., 2019) model, referred to as the inference network, that estimates $q(a \mid Q, c)$, taking a paragraph concatenated with the corresponding question as input. This mirrors how Devlin et al. (2019) encode two different text segments. Each token of the input is accompanied by a segment feature that takes value 0 or 1, indicating which segment of the input (the question or the paragraph) the token belongs to. Without a supervised signal, estimating probabilities on individual phrases might be difficult, so we decompose the probability of a phrase using the probability of its sentence as follows:

$$q(a_{s_i} \mid Q, c) = q(a_{s_i} \mid s_i, Q, c)\; q(s_i \mid Q, c) \qquad (2)$$

where $s_i$ is the $i$-th sentence, $a_{s_i}$ is one of the candidate phrases in it, and $Q$ and $c$ are the question and the context paragraph, respectively. To obtain the terms of the above expression, we define a scoring function that takes two text segments as input and outputs an affinity score. A text segment can be a sentence, a question, or a phrase. Each text segment $t$ is embedded as a vector from the BERT output embeddings:

$$v_t = \frac{1}{|t|} \sum_{w_i \in t} \mathrm{BERT}(w_i)$$

where $\mathrm{BERT}(w_i)$ is the output embedding of the BERT model for token $w_i$, and $v_t$ is the vector representation of segment $t$, obtained as the average of the BERT embeddings of its tokens. The affinity score is obtained as a bilinear product of the vector representations of the two text segments with a learnable matrix $W \in \mathbb{R}^{d \times d}$:

$$\mathrm{score}(t_1, t_2) = v_{t_1}^\top W\, v_{t_2} \qquad (3)$$
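A minimal PyTorch sketch of the scoring function in Eq. 3; the mean pooling and bilinear product follow the definitions above, while the identity initialization of W is our choice.

```python
import torch

class BilinearScorer(torch.nn.Module):
    """Affinity score between two text segments: score(t1, t2) = v1^T W v2."""
    def __init__(self, d: int):
        super().__init__()
        self.W = torch.nn.Parameter(torch.eye(d))  # learnable, init near identity

    def forward(self, emb1: torch.Tensor, emb2: torch.Tensor) -> torch.Tensor:
        # emb1: [len1, d] BERT output embeddings of segment t1; emb2: [len2, d].
        v1 = emb1.mean(dim=0)  # v_t = average of token embeddings
        v2 = emb2.mean(dim=0)
        return v1 @ self.W @ v2  # scalar affinity score
```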
The conditional probability of a sentence given the question and the paragraph, $q(s_i \mid Q, c)$, is obtained as a softmax of the scoring function in Eq. 3 over all the sentences. Likewise, $q(a^{(j)}_{s_i} \mid s_i, Q, c)$ is obtained as a softmax of the scores between the question $Q$ and the answer phrase $a^{(j)}_{s_i}$ over all answer phrases within the $i$-th sentence $s_i$ of $c$:

$$q(s_i \mid Q, c) = \frac{\exp\big(\mathrm{score}(s_i, Q)\big)}{\sum_{k} \exp\big(\mathrm{score}(s_k, Q)\big)}, \qquad q(a^{(j)}_{s_i} \mid s_i, Q, c) = \frac{\exp\big(\mathrm{score}(a^{(j)}_{s_i}, Q)\big)}{\sum_{k} \exp\big(\mathrm{score}(a^{(k)}_{s_i}, Q)\big)}$$

With these two expressions, one can obtain the probability distribution over phrases from Eq. 2. Further, we transfer these (overlapping) phrase-level probabilities into token-level scores to obtain a real-valued segment feature for each token (shown in Figure 2):

$$t_i = \sum_{a \,:\, w_i \in a} q(a \mid Q, c)$$

The purpose of the binary segment features is to differentiate one part of the text from the rest and to signify the connection between them. The pre-trained weights of the BERT model include segment embeddings for the input segment features 0 and 1. However, the output of the encoder model is a vector of real numbers in $[0, 1]$. To accommodate this input while not losing the well-trained weights of BERT, we obtain the segment embedding for each token as an interpolation between the binary segment embeddings of BERT:

$$\mathrm{vec}_{\mathrm{seg}}(t_i) = (1 - t_i)\,\mathrm{vec}_{\mathrm{seg}}(0) + t_i\,\mathrm{vec}_{\mathrm{seg}}(1)$$

where $\mathrm{vec}_{\mathrm{seg}}(t_i)$ is the segment embedding at position $i$ given a segment feature $t_i \in [0, 1]$, and $\mathrm{vec}_{\mathrm{seg}}(0)$ and $\mathrm{vec}_{\mathrm{seg}}(1)$ are the segment embeddings for the input segment features 0 and 1, respectively.
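The interpolation admits a one-line implementation; this sketch assumes BERT's token-type (segment) embedding table is exposed as a two-row matrix.

```python
import torch

def soft_segment_embeddings(seg_table: torch.Tensor,
                            t: torch.Tensor) -> torch.Tensor:
    """seg_table: [2, d] BERT segment embeddings for ids 0 and 1.
    t: [batch, seq_len] real-valued segment features in [0, 1].
    Returns [batch, seq_len, d] interpolated segment embeddings."""
    v0, v1 = seg_table[0], seg_table[1]
    # Linear interpolation; at t in {0, 1} this reduces to vanilla BERT.
    return (1.0 - t).unsqueeze(-1) * v0 + t.unsqueeze(-1) * v1
```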

Decoder
The decoder is a BERT model that shares weights with the encoder. It performs the task of generating the question given the paragraph and the answer span. Here we employ a unified transformer architecture similar to Dong et al. (2019), Varanasi et al. (2020), and Chan and Fan (2019).

[Table 1: Top-1 accuracy on answer sentence selection (SQuAD dev). Recoverable rows: Supervised Selector (Min et al., 2018): 91.2; BR-MPGE-AS Base (Tian et al., 2020): 92 (remaining digits truncated in the source).]

To encode the answer span, we use the segment features of BERT. The first term in Eq. 1 is an expectation over the estimated distribution of the inference network. This requires sampling, which can be simulated by adding Gumbel noise (Maddison et al., 2017; Jang et al., 2017) to the distribution and then taking the softmax with a scaling factor τ, which decides the peakiness of the distribution. During training, however, we allow soft answer selection instead of choosing a single answer: the probabilities of the answer phrases are transferred to per-token scores, and these scores are provided as soft segment ids for the corresponding tokens. Similar to Sun et al. (2018) and Dong et al. (2019), we use a QG model to decode the question given a paragraph and an answer phrase as input. We hypothesize that the tasks of the encoder and the decoder complement each other, as one single transformer model performs both QA and QG simultaneously. We use a BERT-based copy mechanism (Gu et al., 2016) while generating the question, as proposed by Varanasi et al. (2020). The copy mechanism interpolates the probability distribution over the vocabulary with the probability distribution over the paragraph, which is obtained from the self-attention scores across different layers of BERT.
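Below is a hedged sketch of the two mechanisms just described: Gumbel-softmax relaxed sampling (PyTorch also ships torch.nn.functional.gumbel_softmax with equivalent behavior) and the copy-style interpolation of vocabulary and paragraph distributions. The scalar mixing gate is a standard choice we assume here; the paper derives its copy distribution from BERT self-attention scores.

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Relaxed sample over candidate answers; smaller tau -> peakier distribution."""
    u = torch.rand_like(logits)
    gumbel = -torch.log(-torch.log(u + 1e-20) + 1e-20)  # Gumbel(0, 1) noise
    return F.softmax((logits + gumbel) / tau, dim=-1)

def copy_interpolate(p_vocab: torch.Tensor, p_context: torch.Tensor,
                     context_ids: torch.Tensor, gate: float) -> torch.Tensor:
    """p_vocab: [V] generation distribution over the vocabulary.
    p_context: [L] copy distribution over paragraph positions.
    context_ids: [L] vocabulary ids of the paragraph tokens.
    gate: scalar in [0, 1] mixing generation and copying (our assumption)."""
    p_final = gate * p_vocab
    # Scatter copy probabilities back onto their vocabulary ids.
    return p_final.index_add(0, context_ids, (1.0 - gate) * p_context)
```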

Experiments
For EQA experiments, we used the SQuAD v1.1 (Rajpurkar et al., 2016) dataset and conducted both sentence-level and phrase-level answer span selection. We trained on paragraph-question pairs without using the labels for answers (i.e., 87,594 paragraph-question pairs). We maximize the log-likelihood objective, training for 3 epochs on the training set and keeping the model with the best log-likelihood of the question. We observed that a question log-likelihood loss alone already achieves good performance. As expected for auto-regressive decoders, introducing the KL-divergence term of Eq. 1 caused posterior collapse; we used simulated annealing to mitigate this issue. As mentioned above, removing phrases that overlap with the question helped avoid local minima. We used the bert-base-cased and bert-large-cased (Devlin et al., 2019) models in our experiments, with an initial learning rate of 3e-5, the Adam optimizer (Kingma and Ba, 2015), and a linear learning-rate warm-up over the first 10% of training steps.
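For reference, a sketch of this optimization setup using the Hugging Face schedule helper; total_steps is illustrative and depends on batch size.

```python
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model: torch.nn.Module, total_steps: int):
    # Hyperparameters taken from the text: Adam, lr 3e-5, 10% linear warm-up.
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * total_steps),
        num_training_steps=total_steps,
    )
    return optimizer, scheduler
```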

Unsupervised Sentence Level QA
Answer sentence selection is an important task that further benefits EQA in terms of both accuracy and speed. Min et al. (2018) showed that by reducing the context to a sentence, one can not only reduce training and inference time but also, at times, obtain better accuracy. As we factor the probability of a sentence through the probabilities of the candidate answer phrases it contains, our model naturally scores a sentence high if it impacts the likelihood of the question. We used a modified version of SQuAD for answer-sentence span selection, similar to Tian et al. (2020). Table 1 compares our results on the SQuAD dev set to some of the unsupervised and supervised methods on the answer sentence selection task. We provide our own baseline, AutoEQA-GS Base, by auto-encoding a missing (gap) sentence from a SQuAD paragraph instead of the question. We achieve 75% top-1 sentence accuracy, which suggests that the architecture of AutoEQA by design captures the semantic similarity necessary for question answering. TF-IDF (Min et al., 2018) uses word frequencies in the question and the sentence to provide a similarity score. Sentence-BERT (SBERT) (Reimers and Gurevych, 2019) is a state-of-the-art sentence embedding model trained for Semantic Textual Similarity (STS) tasks. It is noteworthy that our model AutoEQA-GS Base surpasses SBERT despite having no supervision for either the paragraph or the answer span. Among supervised sentence selection models, Min et al. (2018) use sentence-aware question embeddings to measure the similarity between sentences and questions, and Tian et al. (2020) use multi-perspective graph encoding to capture sentence relations that further benefit the answer-sentence selection task. While both of these models use supervision with elaborate architectures for answer sentence selection, they only marginally outperform the AutoEQA-QG model in the span-unsupervised setting. This suggests the potential of the AutoEQA-QG loss to enhance sentence-level EQA models.

Unsupervised Extractive Question Answering
For evaluation on answer phrases, we compare our model with other possible answer span selection techniques. The baseline models rely on heuristics or simple features that require no answer annotation for EQA. The first baseline is the sliding-window approach reported by Rajpurkar et al. (2016), which finds answers using word overlap with the question. They also propose a supervised logistic regression model trained on hand-crafted features. Kaushik and Lipton (2018) use supervision to extract the most likely answer span from the context but completely ignore the question. These models mark the baselines. Next, we report models that pre-train on answer span selection methods to improve EQA. Dhingra et al. (2018) create a noisy corpus from Wikipedia articles where questions are sentences with missing phrases, called cloze questions. More recently, Glass et al. (2020) created a similar cloze-question corpus with documents retrieved for each cloze question using information retrieval methods. Both approaches train on answer span selection as required for the task of EQA. From Table 2, one can see that AutoEQA outperforms them by a large margin. The difference between the EM and F1 scores of our models suggests that there is substantial overlap between the model's predictions and the ground truth even when the exact phrase is not predicted. This leaves room for improvement in phrase selection.
While the selection of candidate answer phrases itself can limit AutoEQA, some answer phrases might be inherently difficult to learn. For a better understanding, we look at the performance statistics on different question categories. Figure 3 shows the average F1 scores for different question types. AutoEQA naturally performs better on the question categories when, where, and what, which we attribute to the fact that the answers to these questions tend to be named entities. The model performed poorly on why questions, possibly because of their lengthy answer phrases; it is interesting to note that Lewis et al. (2019) also performed poorly in this category. The category other refers to which and who questions combined with questions lacking a question word. Overall, we observe a correlation between answers being named entities and the model's performance. Nearly 75% of the predicted answers are fewer than 10 words away from the ground truth.

Related Work
Recently, data augmentation has become a popular way to do unsupervised EQA (Lewis et al., 2019; Li et al., 2020; Fabbri et al., 2020), where synthetic questions are generated either by heuristics or by unsupervised question generation methods. Brown et al. (2020) show that very large-scale language models can generate answers without supervision. While these works have their own merits, they address a different problem from the one we intend to solve and hence cannot be compared directly. For example, Lewis et al. (2019) achieve performance similar to ours using millions of artificially created data points for EQA corpora, while we achieve our results using only 87k training samples, suggesting the efficiency of our method when question-paragraph pairs are available.

Conclusion
In this work, we proposed a novel method for unsupervised answer span selection. We showed that, using auto-encoding of the question, one can obtain considerable gains (34.3% EM and 53.4% F1). Methods for unsupervised key phrase extraction could further benefit AutoEQA by providing better-informed and dynamic candidate phrases.