VAULT: VAriable Unified Long Text Representation for Machine Reading Comprehension

Existing models for Machine Reading Comprehension (MRC) require complex architectures to effectively model long texts with paragraph representation and classification, making inference computationally inefficient for production use. In this work, we propose VAULT: a light-weight and parallel-efficient paragraph representation for MRC based on contextualized representations of long document input, trained with a new Gaussian distribution-based objective that pays close attention to partially correct instances close to the ground truth. We validate VAULT with experimental results on two benchmark MRC datasets that require long-context modeling: one based on Wikipedia (Natural Questions, NQ) and the other on TechNotes (TechQA). VAULT achieves performance on NQ comparable to a state-of-the-art (SOTA) complex document modeling approach while being 16 times faster, demonstrating the efficiency of our proposed model. We also show that our model can be effectively adapted to a completely different domain, TechQA, with a large improvement over a model fine-tuned on a previously published large PLM.


Introduction
Machine Reading Comprehension (MRC) has seen great advances in recent years with the rise of pre-trained language models (PLMs) (Devlin et al., 2019; Liu et al., 2019; Lan et al., 2019) and public leaderboards (Rajpurkar et al., 2016, 2018; Joshi et al., 2017; Welbl et al., 2018; Kwiatkowski et al., 2019). While some challenges (Rajpurkar et al., 2016, 2018) focus on reading comprehension with shorter contexts, many others (Welbl et al., 2018; Joshi et al., 2017; Kwiatkowski et al., 2019; Tanaka et al., 2021) focus on longer contexts that cannot fit into a typical 512-subtoken transformer window. Motivated by this, we focus on reading comprehension with long contexts. One recent approach to this task (Zheng et al., 2020) models document hierarchy to represent multi-grained information for answer extraction. Although this approach creates a strong representation of the text, it suffers from a significant drawback: graph-based methods (Veličković et al., 2018) are inefficient on parallel hardware such as GPUs, resulting in slow inference (Zhou et al., 2018; Zheng et al., 2020). Motivated by this, in this paper we propose a reading comprehension model that addresses this issue with a more light-weight, parallel-efficient (i.e., efficient on parallel hardware) paragraph representation based on long contextual representations for providing paragraph answers to questions. Instead of modeling document hierarchy from tokens to document pieces, we first introduce a base model that builds on top of a large "long-context" PLM (we use Longformer; Beltagy et al., 2020) to model longer contexts with lightweight representations of each paragraph.

* Work done during an internship at IBM Research AI.
† Equal contributions.
We note that while our approach could work with any PLM, we expect it to perform better with models that support long contexts and can therefore see more paragraph representations at once (Gong et al., 2020). To give our model a notion of paragraph position relative to the text, we also introduce position-aware paragraph representations (PAPR), which use special markup tokens provided as input for efficient paragraph classification. This approach allows us to encode paragraph-level position in the text and teach the model to impute information about each paragraph into the hidden outputs for these tokens, which we exploit to determine in which paragraph the answer resides. We then predict the answer span from this identified paragraph.
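As an illustration, the PAPR markup can be sketched as a simple preprocessing step. This is a minimal sketch: `add_paragraph_markers` is a hypothetical helper name, and in the real model the `[paragraph=i]` tokens would be registered with the tokenizer as atomic special tokens rather than left as plain text.

```python
def add_paragraph_markers(paragraphs):
    """Prefix each paragraph with a positional markup token, e.g. [paragraph=0].

    The marked-up paragraphs are then concatenated into one long input so a
    long-context encoder sees every paragraph, and its position, at once.
    """
    marked = [f"[paragraph={i}] {p}" for i, p in enumerate(paragraphs)]
    return " ".join(marked)
```

The hidden state at each `[paragraph=i]` token then serves as that paragraph's representation for classification.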
Previous MRC methods (Chen et al., 2017; Devlin et al., 2019) use ground-truth start and end span positions exclusively as training objectives when extracting answer spans from the context, treating all other positions as equally incorrect. However, spans that overlap with the ground truth should be considered partially correct. Motivated by Li et al. (2020), who propose a new optimization criterion based on constructing a prior distribution over synonyms for machine translation, we further improve the above base model by treating the start and end positions of ground-truth answer spans as Gaussian-like distributions, instead of single points, and optimizing our model with a statistical distance.
We call this final model VAULT (VAriable Unified Long Text representation), as it handles a variable number of paragraphs of varying lengths at any position with the same unified model structure for long texts.
To evaluate the performance of VAULT, we select the Natural Questions (NQ, Kwiatkowski et al., 2019) and TechQA (Castelli et al., 2020) datasets. NQ attempts to make MRC more realistic by providing longer Wikipedia documents as contexts and real user search-engine queries as questions, and aims at avoiding observation bias: high lexical overlap between the question and the answer context, which happens frequently when the question is created after the user sees the paragraph (Rajpurkar et al., 2016, 2018; Chakravarti et al., 2020; Karpukhin et al., 2020; Lee et al., 2019; Murdock et al., 2018). The task introduces the extraction of long answers (henceforth LA; typically paragraphs) besides also requiring short answers (henceforth SA) similar to SQuAD (Rajpurkar et al., 2016). In Figure 1 we examine an example from NQ along with the answers of VAULT and Zheng et al. (2020). We see that while VAULT can extract answers from the very bottom of a page, if relevant, the existing system suffers from positional bias: it often predicts answers from the first paragraph of a Wikipedia page (a region which often contains the most relevant information). We evaluate our model for domain adaptation on TechQA, a recently introduced challenging dataset for QA on technical support articles, where answers are typically 3-5 times longer than in standard MRC datasets (Rajpurkar et al., 2016, 2018). Empirically, we first show that VAULT achieves performance on NQ comparable to Zheng et al. (2020)'s document modeling architecture based on graph neural networks while being 16 times faster, demonstrating the efficiency of our proposed model. Secondly, we show the generalization of our model architecture for domain adaptation on TechQA. Our experiments show that our model pre-trained on NQ can be effectively adapted to TechQA, outperforming a standard fine-tuned model trained on a large PLM such as RoBERTa.
To summarize, our contributions include:
1. We introduce a novel, effective, yet simple paragraph representation.
2. We introduce soft labels to leverage information from local contexts near the ground truth during training, which is novel for MRC.
3. Our model matches the performance of a SOTA system on NQ while being 16 times faster, and also adapts effectively to a new domain: TechQA.

Related Work
Machine reading comprehension has been widely modeled as cloze-type span extraction (Chen et al., 2017; Cui et al., 2017; Devlin et al., 2019). In NQ, answers must be identified at two levels: long and short answers. Alberti et al. (2019a) adapt a span extraction model for short answer extraction. Zheng et al. (2020) construct complex networks for paragraph-level representation to enhance long answer classification along with span extraction for short answers. In this work, we propose a more light-weight and parallel-efficient way of constructing paragraph-level representations and performing classification, by using longer contexts and modeling negative instances through Gaussian prior optimization. Using the hierarchical nature of a long document for question answering was previously studied by Choi et al. (2017), who use a hierarchical approach to select candidate sentences and extract answers from those candidates. However, due to the input-length limit of large PLMs, existing methods (Alberti et al., 2019b; Zheng et al., 2020; Chakravarti et al., 2020) slice long documents into document pieces and perform prediction for each piece separately. In our work, we show that by modeling longer input with position-aware paragraph representations coupled with Gaussian prior optimization (which is novel for MRC), we achieve comparable performance with a much simpler architecture than previous models, which coincides with recent PLMs for long inputs on question answering (Ainslie et al., 2020)¹.

Model Architecture
In this section, we introduce VAULT, our proposed model, which uses a simple yet effective paragraph representation based on a longer context. VAULT starts from a base classifier that utilizes position-aware paragraph representations trained on top of a large PLM, Longformer (Beltagy et al., 2020). We then introduce our Gaussian Prior-based training objective, which gives partial credit to positions near the ground truth instead of focusing only on the single ground-truth position. We show an overview of VAULT on the example from Figure 1 in Figure 2.

A Base "Paragraph" Predictor Model
SOTA methods for paragraph prediction (Zheng et al., 2020) represent paragraphs through expensive graph modeling, making them inefficient for large-scale production MRC systems. On the other hand, simply selecting the first paragraph performs poorly (Kwiatkowski et al., 2019). We hypothesize that, by modeling a much longer context, even a simple paragraph representation can be effective for paragraph (i.e., long answer) classification. For this purpose, we employ a large-window PLM, Longformer (Beltagy et al., 2020), which has shown effectiveness in modeling long contexts for QA (Welbl et al., 2018; Joshi et al., 2017). Compared to conventional Transformer-based PLMs, e.g., RoBERTa (Liu et al., 2019), which can only take up to 512 subword tokens, Longformer provides a much larger maximum input length of 4,096.

Position-aware Paragraph Representation (PAPR): Many popular unstructured texts such as Wikipedia pages display certain relevant information in relatively standard places (e.g., birthdays are usually in the first paragraph, while spouse names appear in the "Personal Life" section). To exploit this, we provide the base model with a representation of which part of the text it is reading by marking each paragraph with a special atomic markup token ([paragraph=i]) at its beginning, indicating the position of the paragraph within the text. With this input representation, we then directly perform long answer classification using the output embedding of the special paragraph token. Formally, for every paragraph l_i ∈ P, where P is the set of all paragraphs in a text, with corresponding markup-token representation h^p_i, the model computes the logit of a paragraph answer as a^p_i = W h^p_i + b. We obtain an additional document-piece representation from the standard [CLS] token (Devlin et al., 2019) to model document pieces that do not contain paragraph answers, with its logit a^p_CLS computed analogously. The probability of choosing paragraph l_i given context c is computed as the softmax over the logits of the paragraph candidates (with an answer span) and the no-answer logit:

p(l_i | c) = exp(a^p_i) / (exp(a^p_CLS) + Σ_j exp(a^p_j)).

We pad the paragraph representations to ensure a rectangular tensor in a batch. Our final prediction strategy is similar to Zheng et al. (2020): we first choose the paragraph candidate with the highest logit among all candidates, and then extract span answers within the selected paragraph answer candidate using a standard pointer network.

¹ The code and model weights of ETC had not been released at the time of writing, preventing an accurate comparison.
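The paragraph-scoring step can be sketched in a few lines. This is a minimal sketch using plain Python lists; in the actual model the per-paragraph logits would come from a learned linear layer applied to the Longformer hidden states of the markup tokens, and `paragraph_probabilities` is an illustrative name.

```python
import math

def paragraph_probabilities(paragraph_logits, cls_logit):
    """Softmax over the per-paragraph logits a_i^p together with the [CLS]
    'no answer in this piece' logit.

    Returns (per-paragraph probabilities, no-answer probability)."""
    logits = list(paragraph_logits) + [cls_logit]
    m = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return probs[:-1], probs[-1]
```

At inference, the paragraph with the highest probability is selected as the long answer candidate, and span extraction proceeds within it.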

Gaussian Prior Optimization (GPO)
Conventional span extraction models (Chen et al., 2017; Glass et al., 2020) optimize the probability of the predicted start and end positions of answer spans against ground-truth spans via maximum likelihood estimation (MLE; Wilks, 1938). MLE promotes the probability of the ground-truth positions while suppressing the probability of all other positions. However, we hypothesize that, among those negative instances, positions near the ground truth should be given more credit than more distant positions, since the extracted answers will partially overlap with the ground truth.
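The conventional MLE objective amounts to a negative log-likelihood over the gold start and end positions. This is a minimal sketch: `span_mle_loss` is an illustrative name, and `p_start`/`p_end` stand in for the model's softmax-normalized start and end distributions over token positions.

```python
import math

def span_mle_loss(p_start, p_end, y_start, y_end, eps=1e-12):
    """Standard MLE span objective: negative log-likelihood of the gold
    start position y_start and end position y_end under the predicted
    distributions. All non-gold positions are penalized equally."""
    return -(math.log(p_start[y_start] + eps) + math.log(p_end[y_end] + eps))
```

Note that this loss is identical whether an incorrect prediction is one token or one hundred tokens away from the gold span, which is the behavior GPO is designed to soften.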
To tackle this problem, we follow the intuition of Li et al. (2020), who propose promoting the probability of generating synonyms using a Gaussian-like distribution for machine translation. We construct a distribution that has its highest probability at the ground-truth position and whose probability decays exponentially with distance from that position. Specifically, for a ground-truth start or end position y_s, where s ∈ {start, end}, we use a Gaussian distribution N(y_s, σ), where the mean is the position y_s and the variance σ is a hyperparameter. We take the probability density ϕ(y | y_s, σ) of this Gaussian at each position y as the logit for that position, and then use a softmax with temperature T to rescale the logits into the Gaussian-like ground-truth distribution q(y | y_s):

q(y | y_s) = softmax(ϕ(y | y_s, σ) / T).
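The construction of the soft target distribution can be sketched directly from the definition above. This is a minimal sketch in plain Python (`gaussian_target` is an illustrative name); in practice this would be computed once per training example over the token positions of the input window.

```python
import math

def gaussian_target(num_positions, y_s, sigma=1.0, temperature=1.0):
    """Gaussian-like soft target q(y | y_s): the Gaussian density centred on
    the gold position y_s is used as the logit for each position y, then the
    logits are rescaled with a temperature softmax."""
    def density(y):
        return math.exp(-((y - y_s) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

    logits = [density(y) / temperature for y in range(num_positions)]
    m = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

The resulting distribution peaks at the gold position and decays symmetrically, so nearby (partially overlapping) positions receive partial credit.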
We augment our MLE objective with an additional KL divergence (Kullback and Leibler, 1951) term between the constructed distribution q(y | y_s) and the model prediction p_s(y | c), s ∈ {start, end}, guiding the model to follow the Gaussian-like distribution and assign partial credit:
L_D = KL(q(y | y_s) ‖ p_s(y | c)) = Σ_y q(y | y_s) log q(y | y_s) − Σ_y q(y | y_s) log p_s(y | c).
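The divergence term can be written directly from its definition. This is a minimal sketch in plain Python; in practice it would be computed over batched tensors, with `q` the Gaussian-like soft target and `p` the model's predicted start or end distribution.

```python
import math

def kl_divergence(q, p, eps=1e-12):
    """KL(q || p) = sum_y q(y) log q(y) - sum_y q(y) log p(y).

    A small eps guards against log(0) when a distribution assigns
    (near-)zero probability to a position."""
    return sum(qi * (math.log(qi + eps) - math.log(pi + eps))
               for qi, pi in zip(q, p))
```

Minimizing this term pushes the predicted distribution toward the Gaussian-like target, so probability mass near the gold position is penalized less than mass far from it.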
We refer to this final model as VAULT.

Experiments
Datasets: We experiment with two challenging "natural" MRC datasets: NQ (Kwiatkowski et al., 2019) and TechQA (Castelli et al., 2020). We provide a brief summary of the datasets and direct interested readers to the corresponding papers. NQ consists of crowdsource-annotated full Wikipedia pages that appear in Google search logs, with two tasks: predicting the start and end offsets of the short answer (SA) and the long answer (LA, e.g., a paragraph), if they exist. TechQA is built from real user questions in the customer support domain, where each question is accompanied by 50 documents (at most one of which has an answer) and answers are significantly longer (roughly 3-5x) than in standard MRC datasets like SQuAD. We report official F1 scores for each dataset.

Results on NQ:
We train VAULT on NQ, predicting the paragraph and span answers as NQ's LA and SA respectively, and compare against ROBERTA_DM: a RoBERTa (Liu et al., 2019) variant of the SOTA document model (DM) (Zheng et al., 2020), using the base variants for a more systematic comparison. While a Longformer DM baseline might seem a natural addition, it is infeasible under our production resource constraints. We further show the impact of VAULT by providing ablation experiments in which its components (GPO and PAPR) are removed. The base LM (Longformer in our experiments), without GPO and PAPR, is implemented in the style of Alberti et al. (2019b) and Chakravarti et al. (2020), where we first predict the SA and then select the enclosing LA. We aim to show that our proposed method provides results comparable to ROBERTA_DM while decoding considerably faster, and improved performance over using the language model alone. To do this, we consider development set SA and LA F1 (the F1 metrics with respect to the span and paragraph answers, respectively) as well as decoding time t_decode (on a V100) as metrics. Table 1 shows the results on the NQ dev set. VAULT and ROBERTA_DM provide comparable F1 performance (precision and recall are shown in the Appendix). However, VAULT decodes over 16 times faster than ROBERTA_DM. We additionally see in the ablation experiments that our enhancements increase both F1 metrics by multiple points, at the expense of some decoding time. In particular, the F1 performance of Longformer alone is not competitive with VAULT. We conclude that VAULT provides the best balance of F1 and decoding time, as it is effectively tied on F1 with ROBERTA_DM and only around 20 minutes slower to decode than the quickest model.

Results on TechQA: Results are reported in Table 2.
We see that our VAULT model provides an improvement of 0.7 F1 and 8.5 HA F1 (where HA denotes Has Answer), showing the effectiveness of our approach. In particular, this approach of imputing a paragraph structure for classification provides a large boost to performance when a non-null answer exists (HA F1).

Conclusions
In this work we introduce and examine a powerful yet simple model for reading comprehension on long texts, which we call VAULT, based on the hypothesis that with a large sequence length, long answers can be classified effectively without computationally heavy graph-based models. We validate our approach by showing it yields F1 scores competitive with heavier methods at a fraction of the decoding cost on two benchmark datasets, from very different domains, that require reading long texts.

We compare the correct answers produced by VAULT with the incorrect answers produced by the ablated model from the last row of Table 3 (NQ) and the RoBERTa baseline from the first row of Table 2 (TechQA).
In the first example the gold SA is null but there is a gold LA: no short span answers the question, and the correct answer is an entire paragraph. This does not confuse VAULT, which identifies the correct answer directly. However, the ablated model, which attempts to predict the SA first, struggles here, predicting the incorrect LA, since there is no gold SA.
In the second example, we see that in this Technote both the correct and incorrect answers are single-sentence paragraphs surrounded by paragraph breaks. VAULT identifies the correct paragraph using our imputed structure and selects the correct answer, whereas the RoBERTa baseline selects a nearby but incorrect answer.
Example A1 (NQ)
Question: why did government sponsored surveys and land acts encourage migration to the west
Wikipedia Page: Homestead Acts
Text: ... An extension of the Homestead Principle in law, the Homestead Acts were an expression of the "Free Soil" policy of Northerners who wanted individual farmers to own and operate their own farms, as opposed to Southern slave-owners who wanted to buy up large tracts of land and use slave labor, thereby shutting out free white men.
The first of the acts, the Homestead Act of 1862, opened up millions of acres. Any adult who had never taken up arms against the U.S. government could apply. Women and immigrants who had applied for citizenship were eligible. The 1866 Act explicitly included black Americans and encouraged them to participate, but rampant discrimination slowed black gains. Historian Michael Lanza argues that while the 1866 law pack was not as beneficial as it might have been, it was part of the reason that by 1900 one fourth of all Southern black farmers owned their own farms.

(A1) The correct answer is a paragraph LA; only VAULT identifies the correct LA directly even though the gold SA is null. (A2) VAULT identifies the correct "paragraph" answer.