AttenWalker: Unsupervised Long-Document Question Answering via Attention-based Graph Walking

Annotating long-document question answering (long-document QA) pairs is time-consuming and expensive. To alleviate the problem, it might be possible to generate long-document QA pairs via unsupervised question answering (UQA) methods. However, existing UQA tasks are based on short documents, and can hardly incorporate long-range information. To tackle the problem, we propose a new task, named unsupervised long-document question answering (ULQA), aiming to generate high-quality long-document QA instances in an unsupervised manner. Besides, we propose AttenWalker, a novel unsupervised method to aggregate and generate answers with long-range dependency so as to construct long-document QA pairs. Specifically, AttenWalker is composed of three modules, i.e., span collector, span linker and answer aggregator. Firstly, the span collector takes advantage of constituent parsing and reconstruction loss to select informative candidate spans for constructing answers. Secondly, by going through the attention graph of a pre-trained long-document model, potentially interrelated text spans (that might be far apart) could be linked together via an attention-walking algorithm. Thirdly, in the answer aggregator, linked spans are aggregated into the final answer via the mask-filling ability of a pre-trained model. Extensive experiments show that AttenWalker outperforms previous methods on Qasper and NarrativeQA. In addition, AttenWalker also shows strong performance in the few-shot learning setting.


Introduction
Textual question answering (QA) is the task of answering questions given textual documents as the context. Previous works can be divided into short-document QA methods (Seo et al., 2017) and long-document QA methods (Nie et al., 2022b). Short-document methods approach, and even outperform, humans due to the availability of large-scale short-document QA datasets (Rajpurkar et al., 2016). Despite that, long-document methods still lag behind humans by a large margin, since annotating long-document QA datasets (Dasigi et al., 2021) is time-consuming and costly. Intuitively, the high cost of annotating long-document QA pairs could be alleviated in an unsupervised manner. However, there are only short-document unsupervised question answering (UQA) works (Lewis et al., 2019; Pan et al., 2021), which aim to construct a large number of short-document QA pairs in an unsupervised manner and train a QA model with these QA pairs. Lewis et al. (2019) first propose the UQA task and use unsupervised neural translation to construct QA pairs within a short passage. Pan et al. (2021) raise the unsupervised short-document multi-hop question answering (UMQA) task and design a question generation method to build multi-hop questions within two short passages. To break the document length limitation and incorporate long-range information, we propose a more challenging task, i.e., the unsupervised long-document question answering (ULQA) task, to generate high-quality long-document QA pairs and train a competitive QA model without any human-labeled long-document QA pairs.
The core challenge of this task lies in modeling long-range dependency without supervision. To address this issue, we study an attention-driven method to incorporate meaningful long-range information into the constructed QA pairs. Figure 1 illustrates a motivating example of the attention flow in a long document. It can be observed that, by walking through the attention edges of a pre-trained model, related spans can be linked and long-range dependency in the document can be captured. Therefore, long-range information can also be incorporated into QA pairs through these walkable attention patterns among text spans. Thus, we propose AttenWalker, a novel unsupervised framework to generate long-range-dependent answers in long-document QA pairs. Specifically, AttenWalker comprises three modules: a span collector, a span linker and an answer aggregator. Firstly, the span collector takes advantage of constituent parsing and the reconstruction ability of a pre-trained model to select informative candidate spans. Secondly, related spans that might be far apart are connected through local or global attention edges of a long-document pre-trained model. Thirdly, the collected spans are aggregated through the reconstruction ability of a pre-trained model.
Extensive experiments on Qasper (Dasigi et al., 2021) and NarrativeQA (Kociský et al., 2018) show that the proposed AttenWalker can effectively model long-range dependency in long-document QA. Besides, AttenWalker also shows strong performance in the few-shot learning setting.
Our contributions are as follows: • To the best of our knowledge, we are the first to explore unsupervised long-document QA.

Related Work

Unsupervised Question Answering Li et al. (2020) use cited documents to generate questions so that the overlap problem between the generated question and the raw context can be alleviated. Nie et al. (2022a) propose to mine answers beyond named entities in the synthetic QA dataset, improving the model's ability to deal with diverse answers. Pan et al. (2021) propose the first unsupervised multi-hop QA framework via multi-hop question generation. However, most of these methods focus on the short-document scenario, while the long-document setting is still unexplored.
Long-document Question Answering Long-document question answering (long-document QA) aims to answer questions based on the understanding of a long sequence of text. Previous methods can be divided into end-to-end methods and select-then-read methods. End-to-end methods (Dasigi et al., 2021) apply sparse attention models to directly answer the question given a long document. Despite the progress on long-document QA, these methods heavily rely on supervised QA data and can hardly apply to the low-resource setting.

AttenWalker
In this section, we first formalize the task of long-document QA. After that, the proposed AttenWalker is described in detail.

Problem Formulation
The setup of long-document QA is as follows.
Given a question q and a long document c, where c often contains more than 10K tokens, the QA model p θ (a|c, q) needs to produce a free-form answer a by understanding the long document c and aggregating question-related snippets from it.
In this paper, we consider an unsupervised setting, where only the long document c is available. Our aim is to generate synthetic QA pairs (q′, a′) with long-range information and train a competitive long-document QA model on the (c, q′, a′) triples.

Overview of the Method
The proposed AttenWalker focuses on incorporating long-range information via a well-designed answer generator. Specifically, AttenWalker comprises three modules: a Span Collector, a Span Linker, and an Answer Aggregator. As shown in Figure 2, the Span Collector first selects informative candidate spans from the long document. Secondly, the Span Linker walks through local and global attention edges to link semantically related spans (which could be far apart in the long text) for aggregating answers. Thirdly, the Answer Aggregator combines all the linked spans via the reconstruction ability of a BART model to generate the answer.

Span Collector
To determine the candidate spans for generating the answers, we propose a Span Collector. Specifically, as shown in Figure 3, it first seeks candidate spans via constituent parsing and then reconstructs the masked text via a pre-trained T5 model (Raffel et al., 2020). Each candidate span is scored by its reconstruction loss:

$L = -\frac{1}{T} \sum_{i=1}^{T} \log p(y_i)$,     (1)

where L is the reconstruction loss of the specific span, T is the number of tokens in the ground-truth span, and p(y_i) is the probability T5 predicts for the i-th token y_i of the ground-truth span. As shown in Figure 3, "sentence encoder" has the largest reconstruction loss. Thus, we select it as one of the candidate spans. Meanwhile, its parent span (i.e., "as sentence encoder") and its child spans ("sentence" and "encoder") will not be selected, for redundancy concerns.
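The selection criterion above can be sketched as follows. This is a minimal illustration that assumes the per-token probabilities have already been obtained from the T5 mask-filling model; the probability values below are made up for illustration only.

```python
import math

def reconstruction_loss(token_probs):
    """Average negative log-likelihood of the ground-truth span tokens
    under the mask-filling model (Eqn. 1): L = -(1/T) * sum_i log p(y_i)."""
    T = len(token_probs)
    return -sum(math.log(p) for p in token_probs) / T

def select_informative_span(span_probs):
    """Pick the candidate span whose masked reconstruction is hardest,
    i.e. the one with the largest reconstruction loss. `span_probs` maps a
    span string to hypothetical per-token model probabilities for that span."""
    return max(span_probs, key=lambda s: reconstruction_loss(span_probs[s]))

# Toy probabilities: "sentence encoder" is hardest to reconstruct,
# so it is selected as a candidate span, echoing the Figure 3 example.
probs = {
    "as sentence encoder": [0.6, 0.5, 0.7],
    "sentence encoder": [0.2, 0.3],
    "sentence": [0.9],
}
assert select_informative_span(probs) == "sentence encoder"
```

A higher loss means the span is harder to predict from its context, i.e. it carries information the context does not already make obvious.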

Span Linker
The proposed Span Linker incorporates long-range information in AttenWalker: it effectively captures long-range dependency through attention-based graph walking. The Span Linker is composed of two sub-modules: a Span Graph Constructor and an Attention-based Graph Walker.

Span Graph Constructor
To explore possible relations among spans, the token-level attention scores of the LED pre-trained model (Beltagy et al., 2020) can be used. As shown in Figure 1, based on the spans acquired in Section 3.3, we build a span graph G via the attention scores between each pair of tokens, as illustrated in Figure 4. For span i and span j, where i, j ∈ G, if there is any attention edge from one of the tokens in span i to one of the tokens in span j, there is an edge from span i to span j. Motivated by the max-pooling technique (Dumoulin and Visin, 2016), to capture the most salient relation in each pair of spans, the edge weight e_ij from span i to span j is calculated as the maximum attention weight between any pair of tokens of the two spans:

$e_{ij} = \max_{m \in G_i,\; n \in G_j,\; (m,n) \in G_t} w_{m,n}$,     (2)

where G_i and G_j are the token sets of span i and span j, (m, n) is an edge in the token-level attention graph G_t, and w_{m,n} is the attention weight of the edge (m, n).
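A minimal sketch of the max-pooling edge weight in Eqn. 2, assuming the token-level attention graph is given as a sparse mapping from token-index pairs to weights; the indices and weights below are illustrative, not values read from the model.

```python
def span_edge_weight(span_i_tokens, span_j_tokens, attention):
    """Edge weight e_ij between two spans: the maximum token-level attention
    weight w_mn over all pairs (m, n) with m in span i and n in span j.
    `attention` maps (source_token, target_token) index pairs to weights;
    token pairs without an attention edge are simply absent."""
    weights = [attention[(m, n)]
               for m in span_i_tokens for n in span_j_tokens
               if (m, n) in attention]
    return max(weights) if weights else 0.0  # no edge at all -> weight 0

# Toy token-level attention graph: span i covers tokens {0, 1},
# span j covers tokens {5, 6}; the strongest cross-span edge wins.
attn = {(0, 5): 0.1, (1, 5): 0.4762, (1, 6): 0.02}
assert span_edge_weight([0, 1], [5, 6], attn) == 0.4762
```

Returning 0.0 when no token pair is connected is a convention chosen here so that unconnected spans simply get no edge in the span graph.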
In the LED encoder, there are local and global attention weights among the tokens in a long document. Both types of weights can serve as the token-level edge weights w_{m,n} in Eqn. 2, and we propose to consider both for span graph construction. If there is a local attention weight l_{m,n} from token m to token n, we directly assign this value to w_{m,n}. Otherwise, the global attention is considered: we insert a "</s>" token at the beginning of each paragraph and set global attention for each of them (Appendix B). This means that each "</s>" can attend to every token in the long sequence and vice versa, so each "</s>" serves as the representative of the paragraph that follows it. Therefore, "</s>" tokens can be regarded as bridges between two spans in different paragraphs, which could be far apart and inaccessible to each other through the local attention mechanism alone. To build the "bridge" from paragraph p_i to paragraph p_j, we first select the K tokens t_{p_i} with the maximum attention scores to the representation s_{p_i} of the paragraph's "</s>". Next, for s_{p_i}, the L highest attention scores to other "</s>" tokens are selected. For each of the L "</s>" tokens s_{p_j} in paragraph p_j, we take its maximum M attention weights to the corresponding M tokens t_{p_j} in paragraph p_j. For each t_{p_i}, its attention to the target token t_{p_j} can then be approximated as:

$g_{t_{p_i}, t_{p_j}} = \left( w_{t_{p_i}, s_{p_i}} \cdot w_{s_{p_i}, s_{p_j}} \cdot w_{s_{p_j}, t_{p_j}} \right)^{1/3}$,     (3)

where g_{t_{p_i}, t_{p_j}} is the approximated global attention score from token t_{p_i} to token t_{p_j}, and w_{t_{p_i}, s_{p_i}}, w_{s_{p_i}, s_{p_j}}, w_{s_{p_j}, t_{p_j}} are attention scores directly acquired from the global attention in the LED model. That is, we use the geometric mean of the attention edge weights along the path from t_{p_i} to t_{p_j} as the approximate attention weight of the edge (t_{p_i}, t_{p_j}). Thus, if there is no direct (local) attention from t_{p_i} to t_{p_j} but there is a global path, we can use g_{t_{p_i}, t_{p_j}} as the "lost" w_{t_{p_i}, t_{p_j}}.
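The geometric-mean bridging step can be sketched as below; the hop weights are hypothetical values standing in for the LED global-attention scores along the path t_pi -> </s>_pi -> </s>_pj -> t_pj.

```python
def bridged_attention(w_t_s_i, w_s_i_s_j, w_s_j_t_j):
    """Approximate the missing token-to-token weight w(t_pi, t_pj) by the
    geometric mean of the three global-attention hops
    t_pi -> </s>_pi -> </s>_pj -> t_pj (Eqn. 3)."""
    return (w_t_s_i * w_s_i_s_j * w_s_j_t_j) ** (1.0 / 3.0)

# Illustrative hop weights; the bridged weight lies between the weakest
# and strongest hop, so one weak hop dampens (but does not zero out) the edge.
g = bridged_attention(0.4, 0.2, 0.1)
assert 0.1 < g < 0.4
```

Compared with taking a product or a minimum, the geometric mean keeps the bridged weight on the same scale as the individual attention weights, so the same threshold τ can be applied to local and bridged edges alike.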
Attention-based Graph Walker Span linking is done via attention-based graph walking on the constructed span graph. Essentially, the proposed graph walker collects interrelated spans by traversing the span graph, using an algorithm based on depth-first search (Even, 2011). As shown in the lower half of Figure 5, starting from the first span "The main contributions", the graph walker keeps searching for accessible spans and links to the span "a single-layer forward recurrent neural network". Then, starting from this linked span, "Long Short-Term Memory" is also linked because of the high weight (0.48) between it and "a single-layer forward recurrent neural network". To decide whether an edge is of "high weight", we set a pre-defined threshold τ on the edge weight. In other words, the original span graph G is pruned into a new graph G′ via:

$G' = \{ e \in G \mid w_e \geq \tau \}$,     (4)

where w_e is the weight of edge e. Finally, the spans on the walking path are clustered together and used in the following section.
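The pruning-and-walking step can be sketched as a thresholded depth-first traversal. The span names and weights below echo the running example from Figure 5, but the graph itself is hand-constructed for illustration.

```python
def walk_spans(edges, threshold):
    """Prune edges below `threshold` (Eqn. 4), then cluster spans that are
    reachable from each other via depth-first search over the pruned graph.
    `edges` maps (span, span) pairs to edge weights; the graph is treated
    as undirected for walking."""
    adj = {}
    for (u, v), w in edges.items():
        if w >= threshold:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    clusters, seen = [], set()
    for start in adj:
        if start in seen:
            continue
        stack, cluster = [start], set()
        while stack:  # iterative depth-first search
            node = stack.pop()
            if node in cluster:
                continue
            cluster.add(node)
            stack.extend(adj.get(node, set()) - cluster)
        seen |= cluster
        clusters.append(cluster)
    return clusters

edges = {
    ("The main contributions", "a single-layer forward recurrent neural network"): 0.53,
    ("a single-layer forward recurrent neural network", "Long Short-Term Memory"): 0.48,
    ("The main contributions", "[7]"): 0.10,
}
clusters = walk_spans(edges, threshold=0.45)
assert {"The main contributions",
        "a single-layer forward recurrent neural network",
        "Long Short-Term Memory"} in clusters
```

With τ = 0.45 (the value used in Appendix C.2), the low-weight edge to "[7]" is pruned away, so the trivial span never joins the cluster.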

Answer Aggregator
The proposed Answer Aggregator produces the final answer by aggregating the linked spans from Section 3.4. To achieve this, we take advantage of the reconstruction ability of a BART model (Lewis et al., 2020). For instance, the linked spans in the lower half of Figure 5 can be formatted into the BART input: "The main contributions <mask> a single-layer forward recurrent neural network <mask> sentence information". The output is then an integral text serving as the answer: "The main contributions were to develop a single-layer forward recurrent neural network for sentence information".
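The input formatting for the aggregator can be sketched as below; the actual mask filling, which would call a pre-trained BART model, is omitted here.

```python
def build_aggregator_input(linked_spans, mask_token="<mask>"):
    """Join the linked spans with mask tokens; a BART-style model then fills
    the masks to stitch the spans into one fluent answer. (The filling step
    itself would be a call to a pre-trained BART model and is not shown.)"""
    return f" {mask_token} ".join(linked_spans)

spans = ["The main contributions",
         "a single-layer forward recurrent neural network",
         "sentence information"]
inp = build_aggregator_input(spans)
assert inp == ("The main contributions <mask> "
               "a single-layer forward recurrent neural network <mask> "
               "sentence information")
```

This reproduces the example input in this section; the model's job is only to supply the connective words between the masked positions.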

Question Generation
Question generation (QG) is applied once we obtain the answer and all the sentences that the linked spans come from. We use the QG Operator of Unsupervised Multi-hop QA (Pan et al., 2021) as the QG module in our work, feeding the concatenation of the answer from Section 3.5 and all the aforementioned sentences into the QG module to generate a question.

Two-Pass Scheme for Long-Range Reasoning
In the pre-trained LED model, the query, key, and value matrices of the global attention are simply copied from the corresponding matrices of the local attention.
To further improve the ability of global attention in long-range reasoning, we design a two-pass scheme to construct long-document QA pairs, as shown in Figure 6. In the first pass, only local attention is used in the proposed Span Graph Constructor, and an LED model is fine-tuned on the generated QA pairs with global and local attention as described in Appendix B. This step aims to improve the ability of the query, key, and value matrices, especially for global attention. In the second pass, based on the fine-tuned LED model, both local and global attention are considered to construct the span graph for attention walking. Hence, further knowledge from global attention is incorporated into the finally constructed QA pairs.

Experiment
In this section, we first discuss the main results of AttenWalker on Qasper and NarrativeQA, and then further analyze the proposed method.
As shown in Table 1, in the "Supervised" block, an LED model trained on the synthetic dataset of AttenWalker makes further improvements when it is continuously fine-tuned on the supervised data, especially on Qasper, showing that the proposed method can effectively alleviate the data scarcity problem in Qasper. In the "Unsupervised" block, the proposed AttenWalker outperforms all baselines by a large margin in the fully unsupervised setting, showing the competitive performance of AttenWalker.

Ablation Study
We conduct an extensive ablation study on the different components of AttenWalker. As shown in Table 2, the effectiveness of each component can be seen from four observations.
Effects of the span collector. As shown in Table 2, the performance drop of "w/ Random Span Collector" illustrates that randomly selecting candidate spans introduces much noise and harms the quality of the generated QA pairs.
Effects of the span linker. From the performance drops in the "w/ Un-pre-trained LED" and "w/ Embedding Linker" settings in Table 2, it can be seen that the attention information stored in the LED parameters is rather useful for constructing high-quality long-document QA pairs. Besides, the competitive result of "w/ Embedding Linker" suggests that embedding information can also benefit QA pair construction. In addition, the performance of "w/o Global" illustrates that global attention is an essential factor in improving the quality of the generated long-document QA pairs.
Effects of the answer aggregator. According to "w/ Answer Connector" in Table 2, the performance drops when the linked spans are simply concatenated. This shows that connecting spans with proper transition words is crucial for generating a high-quality answer.
Effects of the two-pass scheme. The two-pass scheme helps improve the performance of the model, as shown in the "w/ Single Pass" and "w/ Single Pass + Global" settings in Table 2. This suggests that both local and global attention benefit from the parameters of a fine-tuned LED model.

Effects on Long-Range Modeling
AttenWalker aims to incorporate long-range information in the QA pair construction. To further examine this, we conduct an experiment with varied document lengths. As shown in Figure 7, "w/o Global" uses only local attention, while "w/ Embedding" uses neither global nor local attention. When the document length is small (1-2,000 tokens), the performances of the different methods are comparable. However, as the document length increases, the gap among the methods becomes larger, showing that AttenWalker can model long-range dependency effectively. Furthermore, it is observed that MQA-QG performs worse than "w/ Embedding" when the document length is large. This can be explained in two ways. Firstly, MQA-QG can hardly capture long-range information. Secondly, MQA-QG is effectively a reduced version of "w/ Embedding" that can only link two spans via literal matching (Section 5.6).

Effects of Attention Weights
We design three different span graph construction strategies to further investigate their influence on the proposed method. As shown in Table 3, the "Max-Pooling" strategy outperforms the other two strategies by large margins. This can be explained by the fact that "Max-Pooling" captures the most salient (and probably most important) relation between two spans, which is useful in QA pair construction.
Table 3: Overall F1 of several methods with different strategies to build the span graph, on the Qasper dev set. "Max-Pooling*" is used in AttenWalker, where the maximum attention score between tokens of two spans is selected as the edge weight. Similarly, "Min-Pooling" uses the minimum attention score, while "Mean-Pooling" uses the average of the attention scores.

Few-Shot Learning
We conduct few-shot learning experiments to explore the effectiveness of AttenWalker in different low-resource settings. As shown in Figure 8, as the labeled training size increases, the performance of the model trained on the synthetic QA pairs from AttenWalker is consistently better than those of MQA-QG and a plain LED model on Qasper. This is because the Qasper dataset is quite small, which makes the synthetic dataset rather beneficial. Besides, on NarrativeQA, AttenWalker achieves the best performance with training sizes from 10 to 10,000, and then becomes comparable with MQA-QG, which can be explained by a large amount of labeled data narrowing the gap between them.

Case Study
In this section, we first analyze an example with the proposed two-pass scheme to explore the benefits of the attention changes; we then compare two QA examples from AttenWalker and MQA-QG. As shown in Figure 5, with a pre-trained LED model, the span "The main contributions" is connected with "a single-layer forward recurrent neural network" and "[7]". After fine-tuning the model with generated QA instances, however, the more reasonable path "The main contributions" -> "a single-layer forward recurrent neural network" -> "Long Short-Term Memory" is strengthened and the link to the trivial span "[7]" is weakened. This can be explained by fine-tuning reducing the noise in the LED attention edges, which further improves span linking and the quality of the generated QA instances.
In addition, as shown in Table 4, we compare two QA pairs generated by AttenWalker and the best-performing baseline, MQA-QG. There are three key observations. Firstly, AttenWalker can synthesize multiple spans into an answer, whereas MQA-QG can only link repeated text. Secondly, MQA-QG fails at long-range modeling, since repeated spans are often only a short distance apart. Thirdly, the answer generated by AttenWalker is much more informative than MQA-QG's. In the long-document setting, answering a question may require synthesizing many pieces of information from different parts of the document; this informativeness makes AttenWalker better suited to the setting.

Conclusion
We study a new task, named unsupervised long-document question answering, and propose AttenWalker, an unsupervised method to incorporate long-range information in QA pairs via graph walking. Extensive experiments show the strong performance of the proposed method. We believe that this work can be an important step toward long-document reasoning in low-resource settings.

Limitations
Despite the strong performance of the proposed AttenWalker, there is still large room for improving efficiency. For example, the time cost of our method is still high: since we need to search all Transformer layers and heads to find potentially related spans, the dataset construction can be quite time-consuming. Therefore, an algorithm could be designed in the future to pre-select proper layers and heads for attention-based graph walking, which would save much time in dataset construction.

Table 4: Examples of the generated QA instances from AttenWalker and MQA-QG given the same long document. Blue texts are selected spans for answer generation.
AttenWalker. Related Context: ...... QG research traditionally considers ...(1,909 tokens)... most commonly considered factor by current NQG systems is the target answer ...(1,919 tokens)... the answer also deserves more attention from the model... Generated Answer: QG research shows the target answer deserves more attention. Generated Question: What is the most commonly considered factor by current NQG systems?
MQA-QG. Related Context: ... They both follow the traditional decomposition of QG into content selection and question construction ...(8 tokens)... For content selection, [58] learn a sentence selection task to identify question-worthy sentences ... Generated Answer: content selection. Generated Question: What is the task of identifying question-worthy parts in traditional the question that is the purpose of Question Generation synonymous with?

...document reasoning. Therefore, we only focus on the extractive and abstractive QA instances in this work. NarrativeQA (license: Apache-2.0) is a QA dataset built upon books and movie scripts with long text sequences. Given summaries of the books/scripts, annotators generate corresponding QA pairs with free-form answers. Table 6 shows the statistics of these two datasets. We use version 0.3 of the Qasper dataset for our experiments, where empty documents are removed. For NarrativeQA, we use the dataset provided in Huggingface, which is already well formed, so no extra cleaning step is needed.

C.2 Unsupervised Long-Document QA Dataset Construction
The dataset construction process is shown in Figure 6. Specifically, we first extract sentence constituents from a long document using the Berkeley Neural Parser (Kitaev et al., 2019). Then, a t5-small model is used in the reconstruction-based span selection. In the span linker, we use led-base-16384 to acquire the token-level attention graph for span linking, with the threshold τ set to 0.45. In the answer aggregator, we use the bart-large model to convert spans into an integral answer. Then, the QG operator (Pan et al., 2021) is used to generate questions. In the first pass, the generated dataset is used to train an led-base-16384 model. In the second pass, the trained LED model is first used to provide the token-level attention graph as mentioned above.
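For reference, the configuration stated in this appendix can be collected in one place. The model names are the Hugging Face identifiers given above; the dictionary itself is only an illustrative summary, not the authors' code.

```python
# Configuration values stated in this appendix, gathered for quick reference.
ATTENWALKER_CONFIG = {
    "constituency_parser": "Berkeley Neural Parser",  # constituent extraction
    "span_collector_model": "t5-small",          # reconstruction-based span selection
    "span_linker_model": "led-base-16384",       # token-level attention graph
    "edge_weight_threshold": 0.45,               # tau for pruning the span graph
    "answer_aggregator_model": "bart-large",     # mask filling over linked spans
    "global_attention_K_L_M": (3, 3, 3),         # bridge hyperparameters K, L, M
}
assert ATTENWALKER_CONFIG["edge_weight_threshold"] == 0.45
```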
Besides, the global attention scores are also used to complete the attention graph (described in the paragraph "Span Graph Constructor"). The global-attention-related hyperparameters K, L, and M are all set to 3. The warmup proportion is 30% and the epoch number is 5. We cap the maximum input length at 13,000 tokens and set the attention window size to 640 so that the LED model in this configuration can be trained on four 11GB GPUs in 3 hours. Despite this relatively limited setting, we find that the performance of the LED model is comparable to the default configuration.

D Statistics of the Generated Datasets
In this section, we summarize the long-document QA datasets generated by AttenWalker. To save time in QA pair generation, for each document we randomly sample at most 32 linked span sets. The final generated results are shown in Table 7.

E Details of Implementing the Baselines
Since current UQA methods cannot directly apply to the ULQA setting, we make further modifications and describe our implementation in detail.
UNMT (Lewis et al., 2019) To generate QA pairs with UNMT, each paragraph in the long document is used as a short context for QA generation.
When training the LED model, the question generated by UNMT and the full long document are concatenated into a single sequence to train the model.
RefQA (Li et al., 2020) Similar to UNMT, each paragraph in the long document is separately used to generate QA pairs.
DiverseQA (Nie et al., 2022a) Similar to UNMT and RefQA, each paragraph is selected as a short document. Then, answers of diverse types are extracted from the document. Finally, each question is generated based on the answer and the short document.
MQA-QG (Pan et al., 2021) For MQA-QG, two paragraphs are randomly sampled from a long document and input into MQA-QG to generate multi-hop QA pairs. Finally, the generated question is concatenated with the long document as the input to train the LED model.
Figure 1: The long-range relation discovery process for a long document in the Qasper dataset. The document is first fed into an LED pre-trained model (Beltagy et al., 2020) (the upper half). Then, the acquired token-level attention graph (not shown here) is converted into a span-level graph (the lower half) via the method described in Section 3.4. Spans (which might be far apart) are then linked if their edge weight is high. For example, the span "The main contributions" walks through 1,065 tokens and links with "a single-layer forward recurrent neural network", which is then linked with "Long Short-Term Memory" due to their high-weight edges (0.53 and 0.48). Other spans are not connected with them due to their low edge weights to these spans.

Figure 4: An attention heatmap from each token in the span "a single-layer forward recurrent neural network" to each token in the span "Long Short-Term Memory". Attention score values lower than 0.0001 are not displayed. The highest value, 0.4762, is selected as the edge weight between these two spans.

Figure 5: An example document (in the Qasper training set) of edge weight changes from the first pass to the second pass.

Figure 6: Overview of the proposed two-pass scheme. s_a and s_a′ are the sentences of the linked spans for a and a′.

Figure 7: The mean and variance of Overall F1 (with 5 random seeds) for AttenWalker, two ablated versions ("w/o Global" and "w/ Embedding"), and MQA-QG. The dev set of Qasper is divided based on document length.

Figure 8: Few-shot learning of three methods on different sizes of labeled training data, evaluated on the dev set.

Table 1: The performance on the test set of Qasper and NarrativeQA. In the second row, "Extractive", "Abstractive" and "Overall" refer to Extractive F1, Abstractive F1 and Overall F1 on Qasper. In the "Supervised" block, the row "LED" denotes the performance of an LED model fine-tuned on the supervised dataset; "+MQA-QG" means that an LED model is first trained on the synthetic QA pairs from MQA-QG and then continuously trained on supervised data, and "+AttenWalker" is analogous. In the "Unsupervised" block, each unsupervised method generates long-document QA pairs and an LED model is fine-tuned on them without any supervised QA instances.

Table 2: Ablation study of AttenWalker, evaluated on the dev sets of Qasper and NarrativeQA. "w/ Random Span Collector" denotes that candidate spans are randomly selected. "w/ Un-pre-trained LED" uses an LED model with randomly initialized parameters in the Span Linker. "w/ Embedding Linker" calculates attention scores only by the inner-product values between each pair of input embeddings. "w/o Global" does not consider the global attention in AttenWalker. "w/ Answer Connector" directly concatenates linked spans to form the answer. "w/ Single Pass" only uses pass one of the proposed two-pass scheme, while "w/ Single Pass + Global" further adds global attention to it.

Table 5: F1 scores on the dev set of Qasper. In the first row, "Extractive", "Abstractive", "Yes/No" and "Unanswerable" are the four types of answers; "Overall" is the F1 score over all answers. "LED+Q+Full Text" denotes training an LED model with the question and the long document as input. "LED+Q" denotes a setting where the question is provided but the long document is not.

Table 7: The statistics of QA pairs in the synthetic dataset constructed by AttenWalker.