A Question Answering Framework for Decontextualizing User-facing Snippets from Scientific Documents

Many real-world applications (e.g., note taking, search) require extracting a sentence or paragraph from a document and showing that snippet to a human outside of the source document. Yet, users may find snippets difficult to understand as they lack context from the original document. In this work, we use language models to rewrite snippets from scientific documents to be read on their own. First, we define the requirements and challenges for this user-facing decontextualization task, such as clarifying where edits occur and handling references to other documents. Second, we propose a framework that decomposes the task into three stages: question generation, question answering, and rewriting. Using this framework, we collect gold decontextualizations from experienced scientific article readers. We then conduct a range of experiments across state-of-the-art commercial and open-source language models to identify how to best provide missing-but-relevant information to models for our task. Finally, we develop QaDecontext, a simple prompting strategy inspired by our framework that improves over end-to-end prompting. We conclude with analysis that finds, while rewriting is easy, question generation and answering remain challenging for today's models.


Introduction
Tools to support research activities often rely on extracting text snippets from long, technical documents and showing them to users. For example, snippets can help readers efficiently understand documents (August et al., 2023; Fok et al., 2023b) or scaffold exploration of document collections (e.g., conducting literature reviews) (Kang et al., 2022; Palani et al., 2023). As more applications use language models, developers use extracted snippets to protect against generated inaccuracies; snippets can help users verify model-generated outputs (Bohnet et al., 2022) and provide a means for user error recovery.
However, extracted snippets are not meant to be read outside their original document: they may include terms that were defined earlier, contain anaphora whose antecedents lie in previous paragraphs, and generally lack context that is needed for comprehension. At best, these issues make extracted snippets difficult to read; at worst, they render the snippets misleading outside their original context (Lin et al., 2003; Cohan et al., 2015; Cohan and Goharian, 2017; Zhang et al., 2023).
In this work, we consider the potential for making extracted snippets more readily understood in user-facing settings through decontextualization (Choi et al., 2021): the task of rewriting snippets to incorporate information from their originating contexts, thereby making them "stand alone".
We focus our attention on scenarios in which users read snippets from technical documents (e.g., scientific articles). For example, consider a citation graph explorer that allows users to preview citation contexts to explain the relationship between papers (Luu et al., 2021), or an AI research assistant that surfaces extracted attribution snippets alongside generated answers. Figure 1 illustrates these two motivating applications. How do language models fare when performing snippet decontextualization over complex scientific text? Our contributions are: First, we introduce requirements that extend prior decontextualization work (Choi et al., 2021) to handle user-facing scenarios (e.g., delineation of model-generated edits). We characterize additional challenges posed by decontextualizing scientific documents (e.g., longer text, citations and references) and describe methods to address them (§2).
Second, we propose a framework for snippet decontextualization that decomposes the task into three stages: question generation, question answering, and rewriting (§3). This decomposition is motivated by a formative study in which our framework made decontextualization less challenging and produced higher-quality annotations. We use this framework to collect gold decontextualization data from experienced readers of scientific articles (§4).
Finally, with this data, we operationalize our framework by implementing QADECONTEXT, a strategy for snippet decontextualization (§5). Our best experimental configuration demonstrates a 41.7% relative improvement over end-to-end model prompting (§5.2). We find that state-of-the-art language models perform poorly on our task, indicating significant opportunity for further NLP research. We perform extensive analysis to identify task bottlenecks to guide future investigation (§6).

Decontextualization for User-facing Snippets from Scientific Documents
In this section, we define decontextualization and motivate additional task requirements for user-facing scenarios. Then, we describe additional task challenges that arise when operating on scientific documents.

Requirements for User-facing Snippets
Task Definition. As introduced in Choi et al. (2021), decontextualization is defined as: Given a snippet-context pair (s, c), an edited snippet s′ is a valid decontextualization of s if s′ is interpretable without any additional context, and s′ preserves the truth-conditional meaning of s in c.
where the context c is a representation of the source document, such as the full text of a scientific article.
Multi-sentence Passages. While Choi et al. (2021) restrict the scope of their work to single-sentence snippets, they recommend future work on longer snippets. Indeed, real-world applications should be equipped to handle multi-sentence snippets, which are ubiquitous in the datasets used to develop such systems. For example, 41% of evidence snippets in Dasigi et al.'s (2021) dataset and 17% of citation contexts in Lauscher et al.'s (2022) dataset are longer than a single sentence. To constrain the scope of valid decontextualizations, we preserve (1) the number of sentences in the snippet and (2) each constituent sentence's core informational content and discourse role within the larger snippet before and after editing.
Transparency of Edits. Prior work did not require that decontextualization edits be transparent. We argue that the clear delineation of machine-edited versus original text is a requirement in user-facing scenarios such as ours. Users must be able to determine the provenance (Han et al., 2022) and authenticity (Gehrmann et al., 2019; Verma et al., 2023) of statements they read, especially in the context of scientific research, and prior work has shown that humans have difficulty identifying machine-generated text (Clark et al., 2021). In this work, we require the final decontextualized snippet s′ to make transparent to users what text came from the original snippet s and what text was added, removed, or modified. We ask tools for decontextualization to follow well-established writing guidelines on modifying quotations. Such guidelines include using square brackets ([]) to denote resolved coreferences or newly incorporated information.
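Because every edit is bracket-delimited, the convention also makes edits machine-checkable. A minimal sketch (a hypothetical helper, not part of the paper's tooling) that recovers the spans a system marked as added or modified:

```python
import re

def added_spans(decontextualized: str) -> list[str]:
    # Collect every bracket-delimited span, i.e. the text that a
    # decontextualization system marked as added or modified.
    return re.findall(r"\[([^\[\]]*)\]", decontextualized)

spans = added_spans("[Bansal et al., 2017] test [their] system on the corpus")
# spans == ["Bansal et al., 2017", "their"]
```

A downstream interface could use such a function to highlight machine-added text, or to flag outputs whose brackets are unbalanced.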

Challenges in Scientific Documents
We characterize challenges for decontextualization that arise when working with scientific papers.
Long, Complex Documents.We present quantitative and qualitative evidence of task difficulty compared to prior work on Wikipedia snippets.
First, Choi et al. (2021) found that 80-90% of Wikipedia sentences can be decontextualized using only the paragraph containing the snippet and the section and article titles. However, we find in our data collection (§4) that only 20% of snippets from scientific articles can be decontextualized with this information alone (and still only 50% when also including the abstract; see Table 5).
Second, we conduct a formative study with five computer science researchers, asking them to manually decontextualize snippets taken from Wikipedia and scientific papers. Participants took between 30-160 seconds (µ=88) for Wikipedia sentences from Choi et al. (2021) and between 220-390 seconds (µ=299) for scientific snippets from our work. In qualitative feedback, all participants noted the ease of decontextualizing Wikipedia snippets. For scientific paper snippets, all participants verbally expressed difficulty with the task despite familiarity with the subject material; 3/5 participants began taking notes to keep track of relevant information; 4/5 participants felt they had to read the paper title, abstract, and introduction before approaching the snippet; and 4/5 participants encountered cases of chaining, in which the paper context relevant to an unfamiliar entity contained other unfamiliar entities that required further resolving. None of these challenges arose for Wikipedia snippets.
Within and Cross-Document References. Technical documents contain references to within-document artifacts (e.g., figures, tables, sections) and to other documents (e.g., web pages, cited works). Within-document references are typically to tables, figures, or entire sections, which are difficult to properly incorporate into a rewritten snippet without changing it substantially. With cross-document references, there is no single best way to handle these when performing decontextualization; in fact, the ideal decontextualization likely depends more on the specific user-facing application's design than on intrinsic qualities of the snippet. For example, consider interacting with an AI research assistant that provides extracted snippets: What corpus did Bansal et al. use? "We test our system on the CALLHOME Spanish-English speech translation corpus [42] (§3)." One possible decontextualization incorporates the title of the cited paper "[42]": "[Bansal et al., 2017] test [their] system on the CALLHOME Spanish-English speech translation corpus [42] ["Improved speech-to-text translation with the Fisher and Callhome Spanish-English speech translation corpus" at IWSLT 2013] (§3)." But in the case of a citation graph explorer, a typical interface likely already surfaces the titles of both citing and cited papers (recall Figure 1), in which case the addition of a title isn't useful. Possibly preferred is an alternative decontextualization that describes the dataset: "[Bansal et al., 2017] test [their] system on the CALLHOME Spanish-English speech translation corpus [42] [, a noisy multi-speaker corpus of telephone calls in a variety of Spanish dialects] (§3)."

Addressing Challenges
To address the increased task difficulty that comes with long, complex scientific documents, we introduce a framework in §3 and describe how it helps humans tackle this task manually. We also opt to remove all references to in-document tables and figures from snippets, leaving their handling to future work.
Finally, to handle cross-document references, we assume in the AI research assistant setting that a user has access to basic information about the current document of interest but no knowledge about any referenced documents that may appear in the snippet text. Similarly, we assume in the citation context preview setting that a user has access to basic information about the current (citing, cited) document pair but no knowledge about any other referenced documents that may appear in the snippet text.

QA for Decontextualization
Decontextualization requires resolving what additional information a person would like incorporated and how such information should be incorporated when rewriting (Choi et al., 2021). If we view "what" as addressed in our guidelines (§2), then we address "how" through this proposal:

Our Proposed Framework
We decompose decontextualization into three steps:
1. Question generation. Ask clarifying questions about the snippet.
2. Question answering. For each question, find an answer (and supporting evidence) within the source document.
3. Rewriting. Rewrite the snippet by incorporating information from these QA pairs.
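The data flow of the three stages can be sketched as a small pipeline. This is an illustrative skeleton only: the three callables stand in for LLM-backed modules, and the toy stand-ins below are hypothetical, not the paper's actual prompts or models.

```python
def decontextualize(snippet, document, generate_questions, answer_question, rewrite):
    """Three-stage decontextualization pipeline over LLM-backed modules."""
    questions = generate_questions(snippet)                            # stage 1
    qa_pairs = [(q, answer_question(q, document)) for q in questions]  # stage 2
    return rewrite(snippet, qa_pairs)                                  # stage 3

# Toy stand-ins to illustrate the data flow (not real models):
result = decontextualize(
    "It outperforms BERT.",
    {"It": "Our model"},
    lambda s: ["What does 'It' refer to?"],
    lambda q, doc: doc["It"],
    lambda s, qa: s.replace("It", f"[{qa[0][1]}]"),
)
# result == "[Our model] outperforms BERT."
```

Each stage consumes only the previous stage's output, which is what makes the per-module oracle analyses in §6 possible.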
We present arguments in favor of this framework: QA and Discourse. Questions and answers are a natural articulation of the requisite context that extracted snippets lack. The relationship between questions and discourse relations between document passages can be traced to Questions Under Discussion (QUD) (Onea, 2016; Velleman and Beaver, 2016; De Kuthy et al., 2018; Riester, 2019). Recent work has leveraged this idea to curate datasets for discourse coherence (Ko et al., 2020, 2022). We view decontextualization as a task that aims to recover missing discourse information through the resolution of question-answer pairs that connect portions of the snippet to the source document.
Improved Annotation. In our formative study (§2.2), we also presented participants with two different annotation guidelines. Both defined decontextualization, but one (QA) described the stages of question generation and question answering as prerequisites to rewriting the snippet, while the other (NoQA) showed before-and-after examples of snippets. All participants tried both guidelines; we randomized assignment order to control for learning effects.
While we find that adhering to the framework slows down annotation and does not impact annotation quality in the Wikipedia setting (§A.4), it results in higher-quality annotations in the scientific document setting. 3/5 participants who were assigned QA first said that they preferred to follow the framework even in the NoQA setting. Two of them additionally noted the framework is similar to their existing note-taking practices. The remaining 2/5 participants, who were assigned NoQA first, struggled initially; both left their snippets with unresolved acronyms or coreferences. When asked why they left them as-is, both expressed that they lost track of all the aspects that needed decontextualization. These annotation issues disappeared after these participants transitioned to the QA setting. Overall, all participants agreed the framework was sensible to follow for scientific documents.

Data Collection
Following the results of our formative study, we implemented an annotation protocol to collect decontextualized snippets from scientific documents.

Sources of Snippets
We choose two English-language datasets of scientific documents as our sources of snippets, one for each motivating application setting (Figure 1): Citation Graph Explorer. We obtain citation context snippets used in a citation graph explorer from scientific papers in S2ORC (Lo et al., 2020). We restrict to contexts containing a single citation mention to simplify the annotation task, though we note that prior work has pointed out the prevalence of contexts containing multiple citations (Lauscher et al., 2022). AI Research Assistant. We use QASPER (Dasigi et al., 2021), a dataset for scientific document understanding that includes QA pairs along with document-grounded attributions: extracted passages that support a given answer. We use these supporting passages as user-facing snippets that require decontextualization.

Annotation Process
Following our proposed framework: Writing Questions. Given a snippet, we ask annotators to write questions that clarify or seek additional information needed to fully understand the snippet. Given the complexity of the annotation task, we used Upwork to hire four domain experts with experience reading scientific articles. Annotators were paid $20 USD per hour.
Answering Questions. We hired a separate set of annotators to answer questions from the previous stage using the source document(s). We additionally asked annotators to mark what evidence from the source document(s) supports their answer. We used the Prolific annotation platform as a high-quality source for a larger number of annotators. Annotators were recruited from the US and UK and were paid $17 USD per hour. To ensure data quality, we manually filtered a total of 719 initial answers down to 487, eliminating ones that answered the question incorrectly or whose question could not be answered using the information in the paper(s) (taking ∼20 hours).
Rewriting Snippets. Given the original snippet and all QA pairs, we ask another set of annotators from Prolific to rewrite the snippet, incorporating all information in the QA pairs.

Dataset Statistics
In total, we obtained 289 snippets (avg. 44.2 tokens), 487 questions (avg. 7.8 tokens), and 487 answers (avg. 20.7 tokens). On average, snippets from the Citation Graph Explorer set have 1.9 questions per snippet, while the AI Research Assistant snippets have 1.3 questions per snippet. Questions were approximately evenly split between seeking definitions of terms, resolving coreferences, and generally seeking more context to feel informed. See §A.2 for a breakdown of question types asked by annotators.

Experimenting with LLMs for Decontextualization
We study the extent to which current LLMs can perform scientific decontextualization, and how our QA framework might inform the design of methods.

Is end-to-end LLM prompting sufficient?
Naively, one can approach this task by prompting a commercially available language model with the instructions for the task, the snippet, and the entire contents of the source paper. We experiment with text-davinci-003 and gpt-4-0314. For gpt-4, most papers fit entirely in the context window (for a small number of papers, we truncate them to fit). For davinci, we represent the paper with the title, abstract, the paragraph containing the snippet, and the section header of the section containing the snippet (if available). This choice was inspired by Choi et al.'s (2021) use of analogous information for decontextualizing Wikipedia text, and we empirically validated this configuration in our setting as well (see §A.3). We provide our prompts for both models in §A.6.4 and §A.6.5.
For automated evaluation, we follow Choi et al. (2021) and use SARI (Xu et al., 2016). Originally developed for text simplification, SARI is suitably repurposed for decontextualization as it computes the F1 score between unigram edits to the snippet performed by the gold reference versus edits performed by the model. As we are interested in whether systems add the right clarifying information during decontextualization, we report SARI-add as our performance metric. We additionally report BERTScore (Zhang et al., 2020), which captures semantic similarity between the gold reference and model prediction, though it is only used as a diagnostic tool and does not inform our evaluative decisions: due to the nature of the task, as long as model generations are reasonable, BERTScore will be high because of the significant overlap between the source snippet, prediction, and gold reference.
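To make the metric concrete, the core of the add component can be sketched as follows. This is a simplified, single-reference, unigram-only illustration; the full SARI metric also scores keep and delete operations, averages over n-gram orders, and supports multiple references.

```python
def sari_add_f1(source: str, prediction: str, reference: str) -> float:
    # F1 over the unigrams a rewrite *added* relative to the source snippet.
    src, pred, ref = (set(t.lower().split()) for t in (source, prediction, reference))
    pred_added, ref_added = pred - src, ref - src   # model edits vs. gold edits
    if not pred_added and not ref_added:
        return 1.0  # neither the model nor the gold rewrite added anything
    correct = pred_added & ref_added
    if not correct:
        return 0.0
    p = len(correct) / len(pred_added)  # precision of added unigrams
    r = len(correct) / len(ref_added)   # recall of added unigrams
    return 2 * p * r / (p + r)

score = sari_add_f1(
    "the model works",
    "the proposed model works well",
    "the proposed model works well",
)
# score == 1.0: the prediction added exactly the unigrams the gold rewrite added
```

A model that copies the snippet unchanged adds nothing and scores 0 whenever the gold rewrite added context, which is why SARI-add isolates the clarifying information a system contributes.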
We report these results in Table 1. Overall, we find that naively prompting LLMs end-to-end performs poorly on this task.

Can our QA framework inform an improved prompting strategy?
To improve upon end-to-end prompting, we implement QADECONTEXT, a strategy for snippet decontextualization inspired by our framework. This approach is easy to adopt, making use of widely available LLMs as well as off-the-shelf passage retrieval models. See Figure 2 for a schematic. All prompts for each component are in §A.6.
Question Generation. We prompt an LLM (davinci) to generate questions with a one-shot prompt with instructions. We found that more in-context examples allowed for better control of the number of questions, but decreased their quality.
Question Answering. Given a question, we approach answering in two ways. In retrieve-then-answer, we first retrieve the top k relevant paragraphs from the union of the source document and any document cited in the snippet, and then use an LLM to obtain a concise answer from these k paragraphs. Specifically, we use k = 3 and Contriever (Izacard et al., 2021) for the retrieval step, and davinci or gpt-4 as the LLM.
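The retrieval step can be sketched generically: rank candidate paragraphs by relevance to the clarifying question and keep the top k. In the sketch below, the default token-overlap scorer is only a toy placeholder for a dense retriever such as Contriever, which would instead score passages by embedding similarity.

```python
import re

def top_k_passages(question, passages, k=3, score=None):
    # `score` stands in for a dense retriever (e.g., Contriever embeddings);
    # the default word-overlap scorer below is only a toy placeholder.
    if score is None:
        def score(q, p):
            q_toks = set(re.findall(r"\w+", q.lower()))
            p_toks = set(re.findall(r"\w+", p.lower()))
            return len(q_toks & p_toks) / max(len(q_toks), 1)
    return sorted(passages, key=lambda p: score(question, p), reverse=True)[:k]

hits = top_k_passages(
    "What corpus did the authors use?",
    ["We use the CALLHOME corpus.",
     "Results are in Table 2.",
     "The corpus has 20 hours."],
    k=2,
)
```

The k retrieved passages are then placed in the answering prompt alongside the question, so retrieval errors propagate directly into the answers (analyzed in §6).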
Alternatively, in the full document setting, we directly prompt an LLM that supports longer context windows ( gpt-4) to answer the question given the entire source document as input.This avoids the introduction of potential errors from performing within-document passage retrieval.
Rewriting. Finally, we prompt an LLM (davinci) with the snippet, generated questions, generated answers, and any relevant context (e.g., retrieved evidence snippets if using retrieve-then-answer and/or text from the source document) obtained from the previous modules. This module is similar to end-to-end prompting of LLMs from §5.1, but prompts are slightly modified to accommodate output from previous steps.
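Assembling the rewriting prompt from the earlier modules' outputs might look like the sketch below. The template wording is hypothetical; the paper's actual prompts are in its appendix (§A.6).

```python
def build_rewrite_prompt(snippet, qa_pairs, context=None):
    # Hypothetical rewriting-prompt template; the paper's real prompts differ.
    lines = [
        "Rewrite the snippet so that it can be understood on its own.",
        "Enclose every added or modified span in square brackets [].",
    ]
    if context:
        lines += ["", "Context:", context]
    lines += ["", "Snippet:", snippet, "", "Clarifying questions and answers:"]
    for q, a in qa_pairs:
        lines += [f"Q: {q}", f"A: {a}"]
    return "\n".join(lines)

prompt = build_rewrite_prompt(
    "It outperforms BERT.",
    [("What does 'It' refer to?", "The authors' model.")],
)
```

Because the QA pairs arrive as plain text, the same rewriter works whether the answers come from retrieve-then-answer, full-document prompting, or (in the oracle experiments) gold annotations.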
Results. We also report these results in Table 1. We find our QADECONTEXT strategy achieves a 41.7% relative improvement over the gpt-4 end-to-end baseline, but given the low SARI-add scores, there remains much room for improvement.

Human Evaluation
We conduct a small-scale human evaluation (n = 60 samples) comparing decontextualized snippets from our best end-to-end (davinci) and QADECONTEXT approaches. Snippets were evaluated on whether they clarified the points that the reader needed help understanding. System outputs for a given snippet were presented in randomized order and ranked from best to worst. The evaluation was performed by two coauthors who were familiar with the task, but not with how the systems were implemented. The coauthors annotated 30 of the same snippets and achieved a binary agreement of 70%. This is quite high given the challenging and subjective nature of the task; Choi et al. (2021) report agreement of 80% for snippets from Wikipedia.
Our QADECONTEXT strategy produces convincing decontextualized snippets in 38% of cases versus 33% for the end-to-end approach. We note that decontextualization remains somewhat subjective (Choi et al., 2021), with only 42% of the gold decontextualizations judged acceptable. We conduct a two-sample binomial test and find that the difference between the two results is not statistically significant (p = 0.57). See Table 4.
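The significance test can be reproduced with a standard two-proportion z-test (a normal approximation to the two-sample binomial test). The exact counts behind the reported percentages are not given, so the 23/60 and 20/60 below are assumed for illustration only; they approximate 38% vs. 33%.

```python
from statistics import NormalDist

def two_proportion_test(x1, n1, x2, n2):
    # Two-sided test for equality of two proportions (normal approximation).
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = (pooled * (1 - pooled) * (1 / n1 + 1 / n2)) ** 0.5
    z = (p1 - p2) / se
    return z, 2 * (1 - NormalDist().cdf(abs(z)))

# Assumed counts: 23/60 and 20/60 approximate the reported 38% vs 33% rates.
z, p_value = two_proportion_test(23, 60, 20, 60)
```

With these assumed counts, the p-value lands near the reported 0.57, illustrating why a 5-point gap at n = 60 is far from significant.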

Is rewriting the performance bottleneck?
To study whether the rewriting module is the bottleneck, we run oracle experiments to provide an upper bound on the performance of our strategy. We perform these experiments assuming that the LLM-based rewriting module receives gold (human-annotated) Questions, Answers, and answer Evidence paragraphs. We also investigate various combinations of this gold data with the source Document itself (i.e., title, abstract, paragraph containing the snippet, and section header). To ensure our best configuration applies generally across models, we study all combinations using two commercial (claude-v1, text-davinci-003) and two open-source (tülu-30b, llama2-chat-70b) models. Our prompts are in §A.6.
We report results in Table 2. First, we observe that, on average, the performance ranking of different input configurations to the rewriter is consistent across models: (1) Including the gold evidence (E) is better than including larger document context (D), (2) including the gold answer (A) results in the largest improvement in all settings, and (3) performance is often best when the rewriter receives only the questions (Q) and answers (A).
Second, we find that overall performance of the best oracle configuration of QADECONTEXT ( davinci) achieves 261% higher performance over the best QADECONTEXT result in Table 1.As we did not change the rewriter for these oracle experiments, we conclude significant errors are being introduced in the question generation and answering modules, rather than in the rewriter.

Is question generation or question answering the performance bottleneck?
We continue this investigation using similar oracle experiments to assess performance bottlenecks in the question generation and question answering modules. To scope these evaluations, we only consider input configurations to the rewriting module based on the top two oracle results for davinci from Table 2: QA and DQAE. We report these new results in Table 3.
Question Generation. First, how much better is QADECONTEXT if we replace generated questions with gold ones? From Table 3, we see a relative lift ranging from 48.2% to 72.7% by switching to gold questions (see rows 5 vs 8, 6 vs 9, 7 vs 10). Question generation is a major source of error.
Question Answering. How much better is retrieve-then-answer in QADECONTEXT if we use gold evidence instead of relying on retrieval? Ablating just the retrieval step by replacing retrieved evidence with gold evidence yields the largest relative improvement: 66.8% to 92.3% (see rows 1 vs 3, 2 vs 4). Question answering is a major source of error.
Overall. While the relative performance improvement from using gold data is large in both the question generation and question answering modules, the absolute values of the scores are quite different. On average, using gold questions provides a 0.080 absolute increase in SARI-add (rows 5 vs 8, 6 vs 9, 7 vs 10), while using gold answers provides a 0.212 absolute increase (rows 1 vs 3, 2 vs 4). We identify question answering as the main performance bottleneck in QADECONTEXT.

Does QADECONTEXT generalize beyond scientific documents?
We compare our approach to the one used by Choi et al. (2021) by applying our QADECONTEXT strategy to their Wikipedia data. In these experiments, we find that QADECONTEXT performs slightly worse than end-to-end LLM prompting (∼1 percentage point absolute difference in SARI-add). These results match our intuitions about the QA approach from our formative study (§3.1 and §A.4), in which study participants found that following the QA framework for Wikipedia was cumbersome, unhelpful, or a hindrance to performing decontextualization. Recent work (2023) shows an extract-then-decontextualize approach can help summarization.
Despite its utility, decontextualization remains a challenging task. Eisenstein et al. (2022) observed failures similar to those we found in §5.1 when dealing with longer input contexts. Beyond models, decontextualization is challenging even for humans. Choi et al. (2021) note issues related to subjectivity, resulting in low annotator agreement. Human-computer interaction literature on the struggles humans have with note-taking (judging what information to include or omit when highlighting) describes difficulties similar to those we observed in our formative study and data annotation (Chang et al., 2016).

Bridging QA and other NLP Tasks
In this work, we establish a bridge between decontextualization and QA. A similar bridge between QA and discourse analysis has been well studied in prior NLP literature. In addition to the relevant works discussed in §3.1, we also draw attention to works that incorporate QA to annotate discourse relations, including Ko et al. (2020, 2022) and Pyatkin et al. (2020). In particular, Pyatkin et al. (2020) show that complex relations between clauses can be recognized by non-experts using a QA formulation of the task, which is reminiscent of the lowered cognitive load observed during our formative study (§3.1). Beyond discourse analysis, prior work has used QA as an approach to downstream NLP tasks, including elaborative simplification (Wu et al., 2023), identifying points of confusion in summaries (Chang et al., 2023b), evaluating summary faithfulness (Durmus et al., 2020), and paraphrase detection (Brook Weiss et al., 2021).

Instruction following
davinci might fail to follow requirements specified in the instructions. For example, our prompt explicitly required avoiding questions about figures, which were not part of the source document.

Realistic questions
davinci might generate questions that a human would not need to ask because the information is already provided in the snippet. For example, for the snippet "In addition, our system is independent of any external resources, such as MT systems or dictionaries, as opposed to the work by Kranias and Samiotou (2004).", davinci generated "What kind of external resources were used by Kranias and Samiotou (2004)?" even though the information is already in the snippet (see highlighted text).

User background

davinci generates questions whose appropriateness depends on user background knowledge. For example, "What is ROUGE score?" is not a good question for a user with expertise in summarization.

Retrieval errors
davinci or gpt-4 fails to abstain and hallucinates an answer despite irrelevant retrieved passages.

Answer errors
The question is answerable from the retrieved context, but davinci or gpt-4 either unnecessarily abstains or hallucinates a wrong answer. For example, given the question "What does 'each instance' refer to?" and the retrieved passage "The main difference was that (Komiya and Okumura, 2011) determined the optimal DA method for each triple of the target word type of WSD, source data, and target data, but this paper determined the method for each instance.", the model outputs "Each instance refers to each word token of the target data." The correct answer is highlighted.

Rewriting

Format errors

The rewriter might fail to enclose snippet edits in brackets. During human evaluation (§5.3), annotators found that 24% of generations had these errors (compared to 5% of gold annotations).

Missing info
Overall, annotators found that 45% of snippets decontextualized via QADECONTEXT were still missing relevant information or raised additional questions (compared to 34% for the gold snippets).

QA for User Information Needs
As in user-facing decontextualization, prior work has used questions to represent follow-up (Meng et al., 2023), curiosity-driven (Ko et al., 2020), or confusion-driven (Chang et al., 2023b) information needs. QA is a well-established interaction paradigm, allowing users to forage for information within documents through the use of natural language (Wang et al., 2022; ter Hoeve et al., 2020; Jahanbakhsh et al., 2022; Fok et al., 2023a).

Prompting and Chaining LLMs
Motivated by recent advancements in instruction tuning of LLMs (Ouyang et al., 2022), several works have proposed techniques that compose LLMs to perform complex tasks (Mialon et al., 2023). These approaches often rely on a pipeline of LLMs to complete a task (Huang et al., 2022; Sun et al., 2023; Khot et al., 2023), or give a model access to modules with different capabilities (Lu et al., 2023; Paranjape et al., 2023; Schick et al., 2023). While the former is typically seen as an extension of chain-of-thought prompting (Wei et al., 2022), the latter enables flexible "soft interfaces" between models. Our QADECONTEXT strategy relies on the latter and falls naturally out of the human workflows observed in our formative study.

Conclusion
In this work, we present a framework and a strategy to perform decontextualization for snippets from scientific documents. We introduce task requirements that extend prior work to handle user-facing scenarios and the challenging nature of scientific text. Motivated by a formative study into how humans perform this task, we propose a QA-based framework for decontextualization that decomposes the task into question generation, answering, and rewriting. We then collect gold decontextualizations and use them to identify how to best provide missing context so that state-of-the-art language models can perform the task. Finally, we implement QADECONTEXT, a simple prompting strategy for decontextualization, though ultimately we find that there is room for improvement on this task, and we point to question generation and answering in these settings as important future directions.

Limitations
Automated evaluation metrics may not correlate with human judgment. In this work, we make extensive use of SARI (Xu et al., 2016) to estimate the effectiveness of our decontextualization pipeline. While Choi et al. (2021) have successfully applied this metric to evaluate decontextualization systems, text simplification metrics present known biases, for example preferring systems that perform fewer modifications (Choshen and Abend, 2018). While this work includes a human evaluation on a subset of our datasets, the majority of experiments rely on the aforementioned metrics.
Collecting and evaluating decontextualizations of scientific snippets is expensive. The cost of collecting scientific decontextualizations limited the baselines we could consider. For example, Choi et al. (2021) approach the decontextualization task by fine-tuning a sequence-to-sequence model.
While training such a model on our task would be an interesting baseline to compare to, it is not feasible because collecting enough supervised samples is too costly.In our formative study, we found that it took experienced scientists five times longer to decontextualize snippets from scientific papers compared to ones from Wikipedia.Instead, we are left to compare our method to Choi et al.'s (2021) by running our pipeline in their Wikipedia setting.
The high cost of collecting data in this domain also limited our human evaluation, due to the time and expertise required to annotate model generations. For example, a power analysis using α = 0.05 and power = 0.8, and assuming a true effect size of 5 percentage points absolute difference, estimates a required sample size of n = 1211 judgements per condition for our evaluation in §5.3. Evaluating model generations is difficult for many tasks that require reading large amounts of text or domain-specific expertise. Our work motivates more investment in these areas.
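To illustrate how a sample size of this magnitude arises, the sketch below uses the standard two-proportion normal approximation; the exact n = 1211 in our analysis depends on the assumed baseline agreement rate and test statistic, so the baseline of 0.5 used here is a hypothetical worst case (maximal variance), not the paper's configuration.

```python
from math import ceil
from statistics import NormalDist

def n_per_condition(p_base, delta, alpha=0.05, power=0.8):
    """Judgements needed per condition to detect an absolute difference
    `delta` over a baseline rate `p_base` (two-sided, two-proportion
    normal approximation)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, ~1.96
    z_b = NormalDist().inv_cdf(power)          # power term, ~0.84
    p1, p2 = p_base, p_base + delta
    pooled_var = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_a + z_b) ** 2 * pooled_var / delta ** 2)

# Worst-case baseline of 0.5 with a 5-point absolute difference:
print(n_per_condition(0.5, 0.05))
```

Even under simple assumptions, detecting a 5-point difference requires well over a thousand judgements per condition, which is why expert annotation at this scale is prohibitive.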
Closed-source commercial LLMs are more effective than open models. While we experimented with open models for writing decontextualized snippets (tülu-30b, llama2-chat-70b), results indicate a large gap in performance between them and their closed-source counterparts, such as claude and davinci. Since these closed systems are not available everywhere and are expensive, their use makes it difficult for other researchers to compare with our work and use our approach.
Prompting does not guarantee stable output, limiting the downstream applicability of the decontextualization approach. As highlighted in Table 9, none of the approaches described in this work reliably produce outputs that precisely follow the guidelines described in §2. Thus, current systems are likely not suitable for critical applications, and care should be taken when deploying them in user-facing settings.
Decontextualization is only studied for English and for specific scientific fields. In this work, we limit the study of decontextualization to natural language processing papers written in English. The reason is two-fold: first, most scientific manuscripts are written in English; second, current instruction-tuned LLMs, particularly open ones, are predominantly monolingual English models.

Ethical Considerations & Broader Impact
Reformulation of snippets may inadvertently introduce factual errors or alter claims. Scientific documents are a means to disseminate precise and verifiable research findings and observations. Because LLMs are prone to hallucination and may inadvertently modify the semantics of a claim, their use in scientific applications should be carefully scrutinized. Our decontextualization approach is fundamentally motivated by the need to make snippets portable and understandable away from their source; however, this very property makes verification of their content more challenging. While this work does not discuss safeguards to mitigate this risk, these factors must be considered if this research contribution were implemented in user-facing applications.
Availability of decontextualization tools may discourage users from seeking original sources. Because decontextualization systems are not generally available to the public, users today may be more likely to seek out the original source of a snippet. Progress in decontextualization systems might change that, as snippets may come to offer a credible replacement for the full document. We recognize that, while this functionality might improve scientific workflows, it could also encourage poor scholarly practices. Even more broadly, general-domain decontextualization systems might lead to users not visiting sources, thus depriving content creators of revenue.

A.6.3 Rewriting
The following "text snippet" will be quoted in an article using the Chicago Manual of Style. The following questions were answered using information from the paper. Rewrite the "text snippet" into quote format by adding the answers in-between square brackets. Write as if you were an expert scientist in the field of natural language processing.

Text snippet: "{{sentence}}"
Instructions: Using the given information, please rewrite the text snippet by adding additional information into square brackets. For example: the snippet "Our approach performs well" becomes "[REF0's] approach [bidirectional language modeling] performs well". For example: the snippet "Our task is MT" becomes "[REF0's] task is MT [machine translation]."

After adding clarifying information:
* Replace first-person pronouns with a placeholder. Replace "we" with "[REF0]" and "our" with "[REF0's]".
* Remove discourse markers (like "in conclusion", "in this section", "for instance", etc.)
* Put any clarifying text between brackets. For example "Our approach performs well" becomes "[REF0's] approach [bidirectional language modeling] performs well".
* Define any specific terminology or acronyms that other scientists will not be familiar with. For example "Our task is MT" becomes "[REF0's] task is MT [machine translation]."
* If needed, add additional short clarifications that are necessary for an expert reader to understand the broader context of the quote. Only add up to a single sentence and put the sentence in between square brackets.
* Citations are marked as BIBREF or (Author Name, Year). Keep these the same. Do not add any additional citations.
* Remove any references to Figures ("FIGREF") and Tables ("TABREF")
* Fix the grammar

Reminders:
* Follow the Chicago Manual of Style for quotes by putting all added text between square brackets.
* The rewritten snippet is a quote, so the word order should closely match the original snippet's.
Please rewrite this snippet according to the instructions and the given information. Text snippet: "{{sentence}}" Rewrite:

A.7 Sample QA Pairs
Title: "DOLORES: Deep Contextualized Knowledge Graph Embeddings"
User query: "Is fine-tuning required to incorporate these embeddings into existing models?"

Original Snippet
The only requirement is that the model accepts as input an embedding layer (for entities and relations). If a model fulfills this requirement (which a large number of neural models on knowledge graphs do), we can just use Dolores embeddings as a drop-in replacement. We just initialize the corresponding embedding layer with Dolores embeddings.

QA-Pairs
Question: "What is an embedding layer?"
Answer: "An embedding layer is a layer in a neural network model that accepts as input representations of entities and relations in the form of embeddings."
Question: "What are Dolores embeddings?"
Answer: "Dolores embeddings are deep representations of entities and relations in knowledge graphs, learned using Bi-Directional LSTMs from entity-relation chains."
Question: "How do we initialize the corresponding embedding layer?"
Answer: "We initialize the corresponding embedding layer with Dolores embeddings."

Original Snippet
In contrast to our work, (Elson et al., 2010) are solely focus on length and number of dialogues between persons to measure relatedness, whereas our approach looks at general co-occurrence or similarity as measured by LT tools which use word embeddings.

QA-Pairs
Question: "What are LT tools?"
Answer: "LT tools are language technology tools that use word embeddings for measuring similarity and co-occurrence in text."
Question: "How do LT tools measure co-occurrence or similarity?"
Answer: "LT tools measure co-occurrence or similarity using word embeddings."
Question: "What are word embeddings?"
Answer: "Word embeddings are numerical representations of words in a multidimensional space, which capture semantic and syntactic information about the words and their relationships with one another."

Table 8: Two examples of the outputs of the different stages of our best decontextualization pipeline. The questions, answers, and decontextualized snippet are all model generated. The first example is from the QASPER dataset (Dasigi et al., 2021); the snippet is an evidence passage containing the answer to the user's question. The second is a text span extracted from Wohlgenannt et al. (2016) citing Elson et al. (2010). Note that the questions are not all natural and are sometimes redundant, but the information they query is only included once in the decontextualized snippet.

Figure 1 :
Figure 1: Illustration of two user-facing scenarios requiring snippet decontextualization. (Top) A citation graph explorer surfacing citation context snippets to explain relationships between papers. (Bottom) An AI research assistant providing snippets as attributions. Highlighted spans are added during decontextualization.
Decontextualized Snippet
"[REF0's] model accepts input representations for entities and relations in the form of dense continuous vector embeddings [i.e., an embedding layer]. Dolores embeddings, which are deep contextualized knowledge graph embeddings learned using a deep neural sequential model, can be used as a drop-in replacement for the embedding layer in existing knowledge graph prediction models. To initialize the corresponding embedding layer, [REF0] simply uses Dolores embeddings."

Citing paper: "Extracting Social Networks from Literary Text with Word Embedding Tools" (Wohlgenannt et al., 2016)
Cited paper: "Extracting Social Networks from Literary Fiction" (Elson et al., 2010)

Decontextualized Snippet
"… [REF0's] approach looks at general co-occurrence or similarity as measured by LT [language technology] tools, which use word embeddings [language modeling techniques that transform the vocabulary of an input corpus into a continuous and low-dimensional vector representation, capturing semantic and contextual information of words]."

Figure 2: The three modules used for QADECONTEXT. Question generation ❶ formulates clarification questions given a snippet and (optionally) the source document. Question answering ❷ returns an answer and (optionally) supporting evidence for a given question, snippet, and (optionally) the source document. Rewriting ❸ receives the snippet and (one or more elements in) the context produced by previous modules to perform decontextualization. For examples of the outputs of these steps, see Table 8.
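The modular decomposition described above can be sketched as a small pipeline. This is an illustrative sketch, not the paper's implementation: `generate_questions`, `answer_question`, and `decontextualize` are hypothetical stand-ins for prompted LLM calls, and the prompts are placeholders, not the templates from the appendix.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Any text-in/text-out model call; a stand-in for a prompted LLM.
LLM = Callable[[str], str]

@dataclass
class QAPair:
    question: str
    answer: str
    evidence: Optional[str] = None  # optional supporting passage

def generate_questions(llm: LLM, snippet: str,
                       document: Optional[str] = None) -> List[str]:
    """Module 1: formulate clarification questions for the snippet."""
    prompt = ("List questions a reader of this snippet alone could not answer:\n"
              + snippet)
    if document:
        prompt += "\n\nSource document:\n" + document
    return [q for q in llm(prompt).splitlines() if q.strip()]

def answer_question(llm: LLM, question: str, snippet: str,
                    document: str) -> QAPair:
    """Module 2: answer one question over the source document.
    `snippet` is part of the module interface even though this sketch
    does not condition on it."""
    answer = llm("Answer using the document.\nQuestion: " + question
                 + "\nDocument: " + document)
    return QAPair(question, answer)

def decontextualize(llm: LLM, snippet: str, document: str) -> str:
    """Module 3: rewrite the snippet using the collected QA pairs as context."""
    qa = [answer_question(llm, q, snippet, document)
          for q in generate_questions(llm, snippet, document)]
    context = "\n".join(f"Q: {p.question}\nA: {p.answer}" for p in qa)
    return llm("Rewrite the snippet so it stands alone, marking edits "
               "in square brackets.\nSnippet: " + snippet
               + "\nContext:\n" + context)
```

Because each module only exchanges text, any stage can be swapped independently, e.g. replacing the full-document QA call with a retrieve-then-answer variant.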

Table 2 :
Oracle performance of QADECONTEXT when using gold (Q)uestions, (A)nswers, answer (E)vidence obtained from annotators, or the source (D)ocument. Across models, extra input hurts performance (e.g., QA outperforms DQA and DQAE). Results from the best two input configs are bold. Entries for tülu are missing as inputs don't fit in its context window.
We conduct ablation experiments to better understand the performance and errors of each module in QADECONTEXT. We refer the reader to Table 4 for qualitative error examples.

Table 3
… motivate future work pursuing methods that can adapt to different document types, such as Wikipedia or scientific documents, and user scenarios, such as snippets being user-facing versus intermediate artifacts in larger NLP systems. These situations require personalizing decontextualizations to diverse information needs.

Table 4 :
Most common error types at different stages of QADECONTEXT. Question generation and question answering errors identified through qualitative coding of n = 30 oracle outputs from §6.2. Rewriting errors identified during human evaluation (§5.3).

Table 7 :
Ablating modules in our decontextualization pipeline that affect the input to the final Rewriter module. We ablate (1) the source of questions, (2) use of the full document vs. retrieving passages as evidence, (3) choice of the QA model for obtaining the answer, and (4) the amount of context provided to the rewriter module. The last three rows are fully predictive, while the others use gold data. The Rewriter module is identical to that from the last row of Table 2.