Robust Question Answering Through Sub-part Alignment

Current textual question answering (QA) models achieve strong performance on in-domain test sets, but often do so by fitting surface-level patterns, so they fail to generalize to out-of-distribution settings. To make a more robust and understandable QA system, we model question answering as an alignment problem. We decompose both the question and context into smaller units based on off-the-shelf semantic representations (here, semantic roles), and align the question to a subgraph of the context in order to find the answer. We formulate our model as a structured SVM, with alignment scores computed via BERT, and we can train end-to-end despite using beam search for approximate inference. Our use of explicit alignments allows us to explore a set of constraints with which we can prohibit certain types of bad model behavior arising in cross-domain settings. Furthermore, by investigating differences in scores across different potential answers, we can seek to understand what particular aspects of the input lead the model to choose the answer without relying on post-hoc explanation techniques. We train our model on SQuAD v1.1 and test it on several adversarial and out-of-domain datasets. The results show that our model is more robust than the standard BERT QA model, and constraints derived from alignment scores allow us to effectively trade off coverage and accuracy.


Introduction
Current text-based question answering models learned end-to-end often rely on spurious patterns between the question and context rather than learning the desired behavior. They might ignore the question entirely (Kaushik and Lipton, 2018), focus primarily on the answer type (Mudrakarta et al., 2018), or otherwise ignore the "intended" mode of reasoning for the task (Chen and Durrett, 2019; Niven and Kao, 2019). Thus, these models are not robust to adversarial attacks (Jia and Liang, 2017; Iyyer et al., 2018; Wallace et al., 2019): their reasoning processes are brittle, so they can be fooled by surface-level distractors that look laughable to humans (Figure 1). Methods like adversarial training (Miyato et al., 2016; Wang and Bansal, 2018; Lee et al., 2019; Yang et al., 2019), data augmentation (Welbl et al., 2020), and posterior regularization (Pereyra et al., 2016; Zhou et al., 2019) have been proposed to improve robustness. However, these methods are often optimized around a certain type of error, and it remains unclear how to dynamically adapt these models to new adversarial settings that may arise.

Figure 1: A typical adversarial example on SQuAD, where the model picks the artificial distractor answer. By breaking the question and context into smaller units, we can expose the error (the wrong entity match) and use explicit constraints to fix it.
In this paper, we explore a model for text-based question answering through sub-part alignment. The core idea behind our method is that if every sub-part of the question is well supported by the answer context, then the answer produced should be trustworthy; if not, we suspect that the model is making an incorrect prediction. For instance, Figure 1 shows an adversarial example for SQuAD (Jia and Liang, 2017) where a standard BERT QA model predicts the wrong answer August 18, 1991, and we do not know why. However, if we decompose the question into smaller units, we can see that it is because Super Bowl 50 aligns to Champ Bowl and misleads the model. By exposing this error directly, we make it easier to subsequently patch, as we discuss later. Specifically, we use Semantic Role Labeling (SRL) to decompose the sentences of the question and context into predicates and corresponding arguments. Then we view the question answering procedure as a constrained graph alignment problem where the nodes are the predicates and arguments, and the edges are formed by relations between them (e.g., predicate-argument relations and coreference relations). The question should align to a local subgraph in the context, though our process is more flexible than graph alignments used in prior work (Sachan and Xing, 2016; Khashabi et al., 2018). Once we complete the alignment, the node aligned to the wh-span should contain the answer, so we use a standard QA model to extract the answer from this span. Note that while we use SRL in this work, our model could work with any graph-structured semantic representation, including AMR (Sachan and Xing, 2016).
Each pair of aligned nodes is scored using BERT (Devlin et al., 2019); these alignment scores are then plugged into a beam search procedure to find the optimal graph alignment subject to constraints. This structured alignment model can be trained as a structured support vector machine (SSVM) to minimize alignment error with heuristically-derived oracle alignments subject to graph constraints. The alignment scores are computed in a black-box way, so the model does not necessarily produce token-level explanations (Jain and Wallace, 2019); however, the score of an answer is directly a sum of the score of each aligned piece, making this structured prediction phase of the model "faithful by construction." Critically, this allows us to understand what parts of the alignment are responsible for a prediction, and if needed, constrain the behavior of the alignment to correct for certain types of errors.
We view this interpretability and extensibility with constraints as one of the principal advantages of our model. As such, we train our model on the normal SQuAD dataset (Rajpurkar et al., 2016) and focus on performance on out-of-domain and adversarial data. Specifically, we use SQuAD Adversarial (Jia and Liang, 2017) and Universal Triggers on SQuAD (Wallace et al., 2019) to probe the model's behavior in these settings when it has only been exposed to "clean" training examples. Our framework allows us to incorporate natural constraints on alignment scores to improve zero-shot performance in adversarial settings. Finally, our model's alignments serve as "explanations" for its predictions, allowing us to ask why certain predictions were made over others and examine scores for hypothetical other answers the model could give.

Question Answering as Graph Alignment
Our approach critically relies on the ability to decompose questions and answers into a graph over text spans. Our model can in principle work for a range of syntactic and semantic structures, including dependency parsing, SRL (Palmer et al., 2005), and AMR (Banarescu et al., 2013). We use SRL in this work and augment it with coreference links, due to the high performance and flexibility of current SRL parsers (Shi and Lin, 2019; Peters et al., 2018).
Graph Construction An example of the graph we construct is shown in Figure 2. Both the question and passage are represented as graphs where the nodes consist of predicates and arguments.
Edges are undirected and connect each predicate and its corresponding arguments. Since SRL only captures the predicate-argument relations within one sentence, we add context information to the graph with coreference edges: if two arguments are in the same coreference cluster, we add an edge between them. Finally, in certain cases involving verbal or clausal arguments, there may exist nested structures where an argument to one predicate contains a separate predicate-argument structure. In this case, we remove the larger argument and add an edge directly between the two predicates. This is shown by the edge from was to determine (labeled as nested structure) in Figure 2. Breaking down such large arguments helps avoid ambiguity during alignment.
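As a concrete illustration of this construction, the sketch below builds such a graph from precomputed SRL frames and coreference clusters. The frame and cluster formats, the helper names, and the use of networkx are our own simplifications for illustration, not the paper's actual implementation.

```python
# Minimal sketch (our own simplification): build the question/context graph from
# precomputed SRL frames and coreference clusters. Node spans are (start, end)
# token offsets; the exact data structures in the paper may differ.
import networkx as nx

def build_graph(srl_frames, coref_clusters):
    """srl_frames: list of dicts like {"predicate": span, "arguments": [span, ...]}
    coref_clusters: list of lists of argument spans that corefer."""
    g = nx.Graph()
    arg_owner = {}  # argument span -> owning predicate span

    for frame in srl_frames:
        pred = frame["predicate"]
        g.add_node(pred, kind="predicate")
        for arg in frame["arguments"]:
            g.add_node(arg, kind="argument")
            g.add_edge(pred, arg, rel="srl")
            arg_owner.setdefault(arg, pred)

    # Nested structures: if an argument of one predicate fully contains another
    # predicate, drop the large argument and connect the two predicates directly.
    predicates = [f["predicate"] for f in srl_frames]
    for arg, owner in list(arg_owner.items()):
        for pred in predicates:
            if pred != owner and arg[0] <= pred[0] and pred[1] <= arg[1]:
                g.remove_node(arg)
                g.add_edge(owner, pred, rel="nested")
                break

    # Coreference edges between arguments in the same cluster.
    for cluster in coref_clusters:
        spans = [s for s in cluster if g.has_node(s)]
        for a, b in zip(spans, spans[1:]):
            g.add_edge(a, b, rel="coref")
    return g
```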
The alignment structure between the question and context has been shown to be useful for question answering in previous work (Khashabi et al., 2018; Sachan and Xing, 2016; Sachan et al., 2015). Our framework differs from theirs in that it incorporates a much stronger alignment model (BERT), allowing us to relax the alignment constraints while still achieving high performance.
Alignment Constraints Once we have the constructed graph, we can align each node in the question to its counterpart in the graph. In this work, we control the alignment behavior by placing explicit constraints on this process. We place a locality constraint on the alignment: we constrain adjacent pairs of question nodes to align no more than k nodes apart in the context. k = 1 means we are aligning the question to a connected sub-graph in the context; k = inf means we can align to a node anywhere in the context graph. In the following sections, we will discuss more constraints.
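As a sketch of how the locality constraint can be checked, the function below measures graph distance in the context graph between a candidate context node and the nodes already aligned to the current question node's neighbors. The function and argument names are illustrative, reusing the networkx graph from the sketch above.

```python
import networkx as nx

def satisfies_locality(context_graph, neighbor_alignments, candidate, k=3):
    """neighbor_alignments: context nodes already aligned to the current
    question node's neighbors. Each must lie within k hops of the candidate."""
    for aligned in neighbor_alignments:
        try:
            dist = nx.shortest_path_length(context_graph, aligned, candidate)
        except nx.NetworkXNoPath:
            return False  # disconnected nodes can never satisfy the constraint
        if dist > k:
            return False
    return True
```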

Model
We now describe the graph alignment process. Let T represent the text of the passage and question concatenated together. Assume a decomposed question graph Q with vertices $q_1, q_2, \ldots, q_m$ represented by vectors $\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_m$, and a decomposed context C with vertices $c_1, \ldots, c_n$ represented by vectors $\mathbf{c}_1, \ldots, \mathbf{c}_n$. Let $a = (a_1, \ldots, a_m)$ be an alignment of question nodes to passage nodes, where $a_i \in \{1, \ldots, n\}$ indicates the alignment of the $i$th question node. Each question node is aligned to exactly one passage node, and multiple question nodes can align to the same passage node.
We frame question answering as a maximization over possible alignments: we seek $\hat{a} = \arg\max_a f(a, Q, C, T)$ such that the constraints on $a$ are satisfied. In this paper, we simply choose $f$ to be the sum over the scores of all alignment pairs: $f(a, Q, C, T) = \sum_{i=1}^{m} S(q_i, c_{a_i}, T)$, where $S(q, c, T)$ denotes the alignment score between a question node $q$ and a context node $c$. This function relies on BERT (Devlin et al., 2019) to compute embeddings of the question and context nodes and will be described more precisely in what follows. We will train this model as a structured support vector machine (SSVM), described in Section 3.2.
Scoring Our alignment scoring function S is shown in Figure 3. Given a document and a question as raw text, we first concatenate the question with the document and then encode them using the pre-trained BERT encoder (Devlin et al., 2019). We then extract the representation for each node in the question and context using a span extractor, which in our case is simply mean pooling over the token representations. For example, the representation of a node $c$ in the document is given by $\mathbf{c} = \mathrm{mean}(\mathrm{BERT}(T)[c_{\mathrm{start}} : c_{\mathrm{end}}])$, where $c_{\mathrm{start}}$ and $c_{\mathrm{end}}$ denote the start and end positions of the span of $c$ in the text T. The node representations in the question are computed in the same way.

In this work, we choose $S(q, c, T) = \mathbf{q} \cdot \mathbf{c}$, the dot product between the corresponding node representations $\mathbf{q}$ and $\mathbf{c}$ introduced above.
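As a rough sketch of this scoring step, the snippet below encodes the concatenated question and passage with a pre-trained BERT encoder (via the HuggingFace transformers library), mean-pools the token vectors inside each node span, and takes dot products between question and context node vectors. Treating node spans as token offsets into the concatenated input is our simplification; the actual span bookkeeping in the paper may differ.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def node_scores(question, passage, q_spans, c_spans):
    """q_spans / c_spans: (start, end) token offsets of each node in the
    concatenated "[CLS] question [SEP] passage [SEP]" input (end exclusive)."""
    enc = tokenizer(question, passage, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state[0]      # (seq_len, hidden_dim)

    def span_rep(span):
        start, end = span
        return hidden[start:end].mean(dim=0)              # mean pooling over the span

    q_vecs = torch.stack([span_rep(s) for s in q_spans])  # (m, hidden_dim)
    c_vecs = torch.stack([span_rep(s) for s in c_spans])  # (n, hidden_dim)
    return q_vecs @ c_vecs.T                              # S[i, j] = q_i . c_j
```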
Answer Extraction Our model so far produces an alignment between question nodes and passage nodes. We assume that one question node contains a wh-word. In principle, the passage node aligned to this wh-node should correspond to the answer, but in practice it may not match exactly. For example, in Figure 2, the wh-alignment is on February 7, 2016, but we only need the actual date February 7, 2016 as the answer. We resolve this by using a standard text-based QA model, namely the standard BERT QA model, to extract the actual answer from the aligned span. To train this BERT model, we treat all arguments in the context that contain the answer as the "context" of BERT QA.
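As a sketch of this extraction step, one can hand the text of the wh-aligned argument to an off-the-shelf extractive QA model. The snippet below uses the HuggingFace question-answering pipeline with its default model as a stand-in for the fine-tuned BERT QA module described here.

```python
from transformers import pipeline

# The default extractive QA pipeline stands in for the BERT QA module trained
# on SQuAD with the aligned arguments as its context; the model choice here is
# illustrative only.
qa_extractor = pipeline("question-answering")

def extract_answer(question, wh_aligned_span_text):
    # Run extractive QA over only the argument aligned to the wh-node.
    result = qa_extractor(question=question, context=wh_aligned_span_text)
    return result["answer"]
```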

Training
We train our model as an instance of a structured support vector machine (SSVM). Ignoring the regularization term, this objective can be viewed as a sum over the training data of a structured hinge loss:

$$\sum_{i} \max\Big(0,\; \max_{a}\big[f(a, Q_i, C_i, T_i) + \mathrm{Ham}(a, a^*_i)\big] - f(a^*_i, Q_i, C_i, T_i)\Big)$$

where $a$ denotes a predicted alignment, $a^*_i$ is the oracle alignment, and Ham is the Hamming loss between the two. To get the predicted alignment during training, we need to run loss-augmented inference, as we will discuss in the next section. When computing the alignment for node $j$, if $a_j \neq a^*_j$, we add 1 to the alignment score to account for the loss term in the above equation. Intuitively, this objective requires the score of the gold prediction to be larger than that of any other hypothesis $a$ by a margin of $\mathrm{Ham}(a, a^*)$.
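A minimal sketch of this per-example loss, assuming a loss-augmented inference routine (our beam search with +1 added to the score of mismatched nodes) is available as a callable; the helper names are illustrative.

```python
import torch

def structured_hinge_loss(score_fn, loss_augmented_argmax, gold_alignment):
    """Sketch of the per-example SSVM loss (helper names are illustrative).
    score_fn(alignment) -> scalar tensor f(a, Q, C, T);
    loss_augmented_argmax(score_fn, gold_alignment) -> alignment (approximately)
    maximizing f(a) + Ham(a, a*), found with beam search in our setting."""
    predicted = loss_augmented_argmax(score_fn, gold_alignment)
    # Hamming loss: number of question nodes aligned differently from the oracle.
    hamming = sum(p != g for p, g in zip(predicted, gold_alignment))
    violation = score_fn(predicted) + hamming - score_fn(gold_alignment)
    return torch.clamp(violation, min=0.0)  # zero loss once the margin is satisfied
```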
When training our system, we first do several iterations of local training where we treat each alignment decision as an independent prediction, imposing no constraints, and optimize log loss over this set of independent decisions. The local training helps the global training converge more quickly.

Inference
Since our alignment constraints do not strongly restrict the space of possible alignments (e.g., by enforcing a one-to-one alignment with a connected subgraph), searching over all valid alignments is intractable. We therefore use beam search to find the approximate highest-scoring alignment as follows: (1) We initialize the beam with the node pairs associated with the top b highest alignment scores, where b is the beam size. (2) For each hypothesis in the beam, we compute a set of reachable nodes based on the currently aligned pairs under the locality constraint. (3) We extend the current hypothesis by adding each of these possible alignments and accumulating its score. We continue beam search until all the nodes in the question are aligned and then return the highest-scoring hypothesis.
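The procedure can be sketched as below, with `score` holding pairwise alignment scores and `reachable` returning the context nodes allowed by the locality constraint (both assumed helpers). One simplification relative to the description above: this sketch aligns question nodes in a fixed order rather than seeding the beam with the globally top-b node pairs.

```python
def beam_search_alignment(question_nodes, score, reachable, beam_size=20):
    """score[(q, c)]: pairwise alignment score; reachable(partial) -> candidate
    context nodes for the next question node given the locality constraint."""
    first_q = question_nodes[0]
    beam = sorted(
        ((score[(first_q, c)], {first_q: c}) for c in reachable({})),
        key=lambda h: h[0],
        reverse=True,
    )[:beam_size]

    for q in question_nodes[1:]:
        expanded = []
        for total, partial in beam:
            for c in reachable(partial):
                expanded.append((total + score[(q, c)], {**partial, q: c}))
        beam = sorted(expanded, key=lambda h: h[0], reverse=True)[:beam_size]

    return beam[0]  # (score, alignment) of the highest-scoring complete hypothesis
```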
An example of one step of beam hypothesis expansion is shown in Figure 4. In this state, the two played nodes are already aligned. In any valid alignment, the neighbors of the played question node must be aligned within two nodes of the played passage node to respect the locality constraint. We therefore only consider alignments for the game, on Feb 7, 2016 and Super Bowl 50 as new reachable nodes. Then the alignment scores between all reachable nodes and the remaining nodes in the question are computed and used to extend the beam hypotheses. The highest-scoring hypothesis in the next beam ends up aligning the two Super Bowl 50 nodes.
Note that this inference procedure allows us to easily incorporate other constraints as well. For instance, we could require a "hard" match on entity nodes, meaning that two nodes containing entities can only be matched if they share exactly the same entities. Under this constraint, as shown in the figure, Super Bowl 50 can never be aligned to on Feb 6, 2016. We discuss such constraints more in Section 5.

Oracle Construction
Oracle construction amounts to running inference with heuristically computed alignment scores $S_{\mathrm{oracle}}$, where $S_{\mathrm{oracle}}(q, c)$ is the Jaccard similarity between a question node q and a context node c. Instead of initializing the beam with the b best alignment pairs, we first align the wh-argument in the question with the nodes containing the answer in the context and then initialize the beam with legal alignment pairs.
If the Jaccard similarity between a question node and every context node is zero, we treat it as an unaligned node. During training, our approach can gracefully handle unaligned nodes by treating them as latent variables in the structured SVM.
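The oracle similarity can be sketched as token-level Jaccard overlap between two node spans; this is a simplified illustration of the heuristic, and the tokenization and normalization details are not specified here.

```python
def jaccard(question_node_tokens, context_node_tokens):
    """Token-level Jaccard similarity used as the heuristic oracle score."""
    q, c = set(question_node_tokens), set(context_node_tokens)
    if not q or not c:
        return 0.0
    return len(q & c) / len(q | c)

def oracle_scores(question_nodes, context_nodes):
    # Question nodes whose similarity to every context node is zero are treated
    # as unaligned (latent) during training.
    return {
        (qi, ci): jaccard(q_toks, c_toks)
        for qi, q_toks in enumerate(question_nodes)
        for ci, c_toks in enumerate(context_nodes)
    }
```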

Adversarial Robustness
Our focus in this work is primarily the robustness, interpretability, and controllability of our model. We first focus on adapting to challenging adversarial settings in order to "stress test" our approach.
For all experiments, we train our model only on the unmodified SQuAD-1.1 dataset (Rajpurkar et al., 2016) and examine how well it can generalize to adversarial and out-of-domain settings with minimal modification, using no fine-tuning on new data and no data augmentation that would capture the adversarial transformations.

Baselines
We compare primarily against a standard BERT QA system (Devlin et al., 2019). We also investigate a local version of our model, where we only try to align the wh-node, without any global training (local training + local inference). Note that this can work fairly well because the BERT embeddings see the whole question and passage. We can also use our locally-trained alignment model but with our global inference scheme (local training + global inference).

Adversarial Datasets
Added sentences Jia and Liang (2017) propose to append an adversarial distracting sentence to the normal SQuAD development set to test the robustness of a QA model. In this paper, we use the two main test sets they introduced: addSent and addOneSent. Both sets augment the normal test set with adversarial samples, written by Turkers, that are designed to look similar to question sentences. In this work, we mainly focus on the adversarial examples.
Universal Triggers Wallace et al. (2019) use a gradient-based method to find a short trigger sequence. When this short sequence is inserted into the original text, it triggers the target prediction independent of the rest of the passage content or the exact nature of the question. For QA, they generate different triggers for different types of questions, including "who", "when", "where" and "why".

Implementation Details
We set the beam size b = 20 for the constrained alignment. We use BERT-base-uncased for all of our experiments, and fine-tune the model using Adam (Kingma and Ba, 2014) with the learning rate set to 2e-5. We use the SpanBERT coreference system (Joshi et al., 2020) and the BERT SRL system (Shi and Lin, 2019). When doing inference, we set the locality constraint k = 3. We discard questions that do not have a valid SRL parse or do not contain a wh word.
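These settings can be collected into a single configuration for reference; the dictionary layout below is purely illustrative, with values taken from this section.

```python
config = {
    "encoder": "bert-base-uncased",
    "optimizer": "Adam",
    "learning_rate": 2e-5,
    "beam_size": 20,   # beam size b for constrained alignment
    "locality_k": 3,   # adjacent question nodes align no more than k nodes apart
    "coref_model": "SpanBERT coreference (Joshi et al., 2020)",
    "srl_model": "BERT SRL (Shi and Lin, 2019)",
}
```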

Results on Adversarial SQuAD
The results on the normal SQuAD development set and Adversarial SQuAD are shown in Table 1. We make the following observations:

Our model is not as good as BERT QA on normal SQuAD but outperforms it in adversarial settings. Compared with the standard BERT QA model, our model is indeed fitting a different data distribution (learning a constrained structure), which makes the task harder. This training scheme does cause some performance drop on normal SQuAD, but it improves F1 on addSent and addOneSent by 3.0 and 5.2 points, respectively. This smaller drop in performance indicates that learning the alignment helps improve the robustness of our model.

Table 2: Performance of our systems compared to the literature on both addSent and addOneSent. Here, overall denotes the performance on the full adversarial set and adv denotes the performance on the adversarial samples alone. ∆ represents the gap between performance on normal SQuAD and overall performance on the adversarial set.
Global training and inference substantially improve performance in adversarial settings, despite having no effect in-domain. Normal SQuAD is a relatively easy dataset, and the answer to most questions can be found by simple lexical matching between the question and context.
From the result of "local training + local inference", we see that more than 80% of answers can be located by matching the wh-argument BERT embedding with the passage. However, as there are very strong distractors in SQuAD-Adversarial, wh-argument matching is unreliable. In such situations, the constraints imposed by other argument alignments in the question are quite useful to correct the wrong wh-alignment through global inference.
We see that global inference is consistently better than local inference on both addSent and addOneSent.
Global training in addition to global inference is also important for our model to attain high performance. We find that the locally trained model tends to make overly confident predictions about each separate alignment. Since our global inference objective maximizes the sum of all alignment scores, the alignment tends to be dominated by those scores. During global training, the model might correct for this by learning to "ignore" certain alignments as long as it can get the final answer using the overall structure.
Once the answer is located, extracting the exact answer span is relatively easy. Comparing the results of "ans in wh" and "answer F1", we can see that the actual F1 score is quite similar to the percentage of answers found in the wh-alignment. This indicates that if the actual answer is contained in the wh-alignment, the answer extraction module can do an almost perfect job.

Results on Universal Triggers
The results on different triggers are shown in Table 3. We see that every trigger results in a bigger performance drop for BERT QA than for our model. Our model is much more stable, especially on who and when question types, where the performance only drops by around 2%. Several factors may contribute to this stability: (1) The triggers are ungrammatical and their arguments often contain seemingly random words, which are likely to get lower alignment scores.
(2) Because our model is structured and trained in a different fashion than BERT, adversarial attacks designed for span-based question answering models may not fool our model as effectively.

Comparison to Existing Systems
We compare our best model (not using constraints from Section 5) with some other models in the literature in Table 2. We note that the overall performance of our model on the two adversarial sets is lower compared with BERT, while the performance on the adversarial samples alone is higher. That is because we make the task harder: we trade some in-distribution performance to improve the model's robustness, controllability, and explainability. Also, we see that our model achieves the smallest gap on addSent and a comparable gap on addOneSent, which demonstrates that the constrained alignment we propose is a strong and effective way to enhance the robustness of the model compared to previous methods like adversarial training (Yang et al., 2019) and explicit knowledge integration (Wang and Jiang, 2018).

Generalization via Alignment Constraints
One advantage of our explicit alignments is that we can understand mechanically what the model is doing. This also allows us to add constraints to our model to prohibit certain behaviors, thus allowing us to flexibly adapt our model to this adversarial setting.
Our constraints take the form of either hard constraints on alignments or hard constraints on scores. These constraints may cause all answers to be rejected on a certain example. We therefore evaluate our model's accuracy at various precision points.

Constraint on Entities

By examining addSent and addOneSent, we find that the model is typically fooled when a node containing entities in the question aligns to an "adversarial" entity node. An intuitive constraint we can place on the alignment is to require a hard entity match: for each argument in the question, if it contains entities, it can only align to nodes in the context sharing exactly the same entities. We call this type of constraint the "hard entity constraint".
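The check can be sketched as an exact match over entity mentions. Entity extraction itself (e.g., from an off-the-shelf NER model) is assumed, and reading "sharing exactly the same entities" as requiring every question-node entity to appear in the context node is our interpretation, not a detail stated in the paper.

```python
def entity_match_allowed(question_node_entities, context_node_entities):
    """Hard entity constraint (one plausible reading): if the question node
    mentions any entities, every one of them must also appear in the context
    node; nodes without entities are unconstrained."""
    q_ents = set(question_node_entities)
    if not q_ents:
        return True
    return q_ents.issubset(set(context_node_entities))
```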

Constraint on Alignment Scores
The hard entity constraint is quite inflexible and does not generalize to different questions (e.g., questions that do not contain an entity). However, the alignment scores we get at inference time are a good indicator of how well a specific node pair is aligned. For a correct alignment, every pair should get a reasonable alignment score. However, if an alignment goes wrong, there should exist some bad alignment pairs which have relatively lower scores compared to the good ones. We can reject those samples by finding bad alignment pairs, improving the precision of our model as well as giving an explanation of why our model makes an unreliable prediction.
In this paper, we propose a simple heuristic to identify bad alignment pairs: we first find the max score $S_{\max}$ over all possible alignment pairs for a sample; then, over the alignment pairs $(q_i, c_j)$ in the predicted alignment $a$, we calculate the worst alignment gap (WAG) $g = \min_{(q,c) \in a} (S_{\max} - S(q, c))$. If $g$ is beyond some threshold, it may indicate that the alignment is not reliable.

Comparison to BERT Desai and Durrett (2020) show that pre-trained transformers like BERT are well calibrated on a range of tasks. Since we are rejecting unreliable predictions to improve the precision of the model, to make a fair comparison, we reject the same number of examples using the posterior probability of the BERT QA predictions.
To be specific, we rank the predictions of all examples by the sum of start and end posterior probabilities and compute the F1 score on the top k predictions.
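The filtering rule can be sketched as follows, computing the WAG over the predicted pairs and abstaining when it exceeds a threshold; the threshold value is illustrative, and sweeping it traces out the coverage/F1 curve in Figure 5.

```python
def worst_alignment_gap(pair_scores, predicted_alignment):
    """pair_scores[(q, c)]: score of every candidate alignment pair;
    predicted_alignment: list of (q, c) pairs in the model's output."""
    s_max = max(pair_scores.values())
    # Gap between the best score anywhere and each predicted pair's score,
    # aggregated as described in the text above.
    gaps = [s_max - pair_scores[pair] for pair in predicted_alignment]
    return min(gaps)

def answer_or_abstain(predicted_answer, pair_scores, predicted_alignment, threshold=5.0):
    # Abstain when the gap exceeds the threshold, i.e. some alignment pair
    # scores far below the best pair seen for this example.
    if worst_alignment_gap(pair_scores, predicted_alignment) > threshold:
        return None
    return predicted_answer
```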

Results on Constrained Alignment
On Adversarial SQuAD, the confidence scores of a normal BERT QA model do not align with its performance. From Figure 5, we find that the more confident BERT is (i.e., in low-coverage settings), the worse its performance. One possible explanation of this phenomenon is that BERT overfits to the pattern of lexical overlap, and is actually most confident when closely matching adversarial examples show up.

Hard entity constraints improve precision but are not flexible. Figure 5 also shows that by adding the hard entity constraint, we achieve a 67.5 F1 score, an 11.3-point improvement over the unconstrained model, at the cost of covering only 60% of samples. Under the hard entity constraint, the model cannot align to nodes in the adversarial sentence, but its performance is still lower than what it achieves on normal SQuAD. We examine some of the error cases and find that for a certain number of samples, there is no path from the node satisfying the constraint to the node containing the answer (e.g., they hold a more complex discourse relation, while we only consider coreference as a cross-sentence relation). In such cases, we will never be able to find the answer through the hard entity match.
A smaller worst alignment gap indicates better performance. As opposed to BERT, our alignment score is well calibrated on these adversarial examples. This substantiates our claim that the learned alignment scores are good indicators of how trustworthy alignment pairs are. Also, we see that at the same coverage as the entity constraint, performance under the alignment score constraint is even better. This demonstrates that the alignment score constraint is flexible and easy to apply, yet effective.

Case study on Alignment Scores
In the above experiments, we showed that the alignment scores are helpful for controlling the behavior of our model. In this section, we give several examples of alignments and demonstrate how those scores can act as an explanation of the model's behavior. These examples are shown in Figure 6.
Here are some characteristics of these alignments: The model's behavior is highly affected by some overconfident alignments. As shown by the dashed arrows, all adversarial alignments contain at least one unreliable alignment with a relatively low alignment score, while the model is overconfident towards the other alignments with high lexical overlap, as shown by the bold arrows. With those scores, it is easy for us to interpret our model's behavior. For instance, in example (a), the predicate alignment leaves Luther's 95 Theses no choice but to align to Jeff Dean, which is totally unrelated. Note that it is because we have these alignments over the sub-parts of a question that we can inspect the model's behavior in a way that the normal BERT QA model cannot.

Cross-Domain Performance

We also test performance on several cross-domain datasets, namely Natural Questions (Kwiatkowski et al., 2019), NewsQA (Trischler et al., 2017), BioASQ, and TextbookQA (Kembhavi et al., 2017), drawn from the MRQA shared task (Fisch et al., 2019); the results are shown in Table 4. Of particular note is that although our model does worse than BERT on SQuAD (Table 1), its performance is more similar to BERT's in other domains, even without the addition of any constraints. Also, we consistently see improvements from global training and global inference, except on the Natural Questions dataset.
We also note that a main cause of the performance drop is the actual answer extraction. On BioASQ, for example, we find an argument containing the answer nearly 64% of the time, but the answer extraction module fails because the types of answers are significantly different from those in SQuAD. We believe this module could be further adapted to such domains.

Related Work
Adversarial Attacks in NLP. Adversarial attacks on a wide range of NLP tasks have been increasingly studied in recent years. These may take the form of challenge sets like adversarial SQuAD (Jia and Liang, 2017) or attacks like the universal adversarial triggers (Wallace et al., 2019).
A separate line of work focuses on enumerating a space of sentence perturbations and searching over this space adversarially: for example, Ribeiro et al. (2018) propose deriving transformation rules, Ebrahimi et al. (2018) use character-level flips, and Iyyer et al. (2018) use controlled paraphrase generation. The more highly structured nature of our approach makes it naturally more robust to such attacks.
Neural module networks. Neural module networks are a class of models that decompose a task into several sub-tasks handled by sub-modules, which makes the model more robust and interpretable (Andreas et al., 2016; Hu et al., 2017; Cirik et al., 2018; Hudson and Manning, 2018; Jiang and Bansal, 2019).
While our work models QA as a collection of alignment decisions, this differs from module networks in that their sub-modules are often learned end-to-end while our alignment module is trained in a structured prediction framework, which makes our alignment module more flexible and controllable.
Unanswerable questions Our approach rejects some questions as unanswerable. This is similar to the idea of unanswerable questions in SQuAD 2.0 (Rajpurkar et al., 2018), which have been studied in other systems (Hu et al., 2019). However, techniques to reject these questions, which are not adversarial in nature, differ substantially from ours, and the setting we consider is more challenging as we do not assume access to such questions at training time.
Graph alignment Khashabi et al. (2018) propose to answer questions through a similar graph alignment using a wide range of semantic abstractions of the text, with ILP-based inference to find the optimal graph alignment. Our model differs in two ways: (1) Our alignment model is trained end-to-end while their system mainly uses off-the-shelf, general-purpose natural language modules. (2) Our alignment is formed as node pair alignment rather than finding the optimal sub-graph, and is significantly more flexible. Sachan et al. (2015); Sachan and Xing (2016) propose to use a latent alignment structure most similar to ours; however, our model is quite different from theirs and our alignment is also more flexible.
Decomposing Questions Past work has decomposed complex questions to answer them more effectively (Talmor and Berant, 2018; Min et al., 2019; Perez et al., 2020). Wolfson et al. (2020) further introduce a Question Decomposition Meaning Representation (QDMR) to explicitly model this process. However, the questions they answer, such as those from HotpotQA (Yang et al., 2018), are fundamentally designed to be multi-part and so are easily decomposed, whereas the questions we consider are not. Our work focuses on robustness, controllability, and explainability, and our model could in principle be extended to leverage these question decomposition forms as well.

Conclusion
In this work, we presented a model for question answering through sub-part alignment. By structuring our model around an explicit alignment scoring process, we show that our approach can generalize better to other domains. Having alignments also makes it possible to filter out bad model predictions (by treating the scores as confidence values) and to interpret the model's behavior (by examining the alignments and scores directly).

Figure 2: The constructed graph based on an example from the SQuAD dev set. Here Super Bowl 50 and the game are connected by a coreference edge. The edge from was to determine is formed through a predicate nested inside an argument. The oracle alignment (Section 3.4) is shown with dotted lines.

Figure 3: Alignment scoring. Here the alignment score is computed by the dot product between span representations of question and context nodes. The final alignment score (not shown) is the sum of these edge scores.

Figure 4: An example of how we align a node with constraints. The blue node played is already aligned. The orange nodes denote all the valid context nodes that can be aligned to for the next step of inference given the locality constraint. Here we only demonstrate the alignment candidates for Super Bowl 50; all other unaligned question nodes have the same alignment candidates.

Figure 5: The F1-coverage curve of our model compared with BERT QA. If our model can choose to answer only the k% of examples it is most confident about (the coverage), what F1 does it achieve? For our model, confidence is represented by our "worst alignment gap" metric; smaller WAG indicates higher confidence. For BERT, confidence is represented by the posterior probability.

Figure 6: Examples of alignments produced by our model on addOneSent. The numbers are the actual alignment scores of the model's output. The dashed arrows denote unreliable alignments; the bold arrows denote the alignments that contribute most to the model's prediction.

Table 1: Performance of our proposed model on SQuAD and the two adversarial settings from Jia and Liang (2017). Our Sub-part Alignment model uses both global training and inference as discussed above. "ans in wh" denotes the percentage of answers found in the span aligned to the wh-span, and F1 denotes the standard QA performance measure. For addSent and addOneSent, we only consider the adversarial examples in these datasets.

Table 3: The performance of our model on SQuAD-Adversarial-Triggers. Compared with BERT, our model sees smaller performance drops on all triggers.

Table 4: The performance of our proposed model on several out-of-domain datasets from the MRQA shared task (Fisch et al., 2019). Compared to SQuAD in-domain, where our model is 6 F1 lower than BERT, global training and inference help our model achieve nearly similar aggregate performance across different domains.