NewsEdits: A News Article Revision Dataset and a Document-Level Reasoning Challenge

News article revision histories provide clues to narrative and factual evolution in news articles. To facilitate analysis of this evolution, we present the first publicly available dataset of news revision histories, NewsEdits. Our dataset is large-scale and multilingual; it contains 1.2 million articles with 4.6 million versions from over 22 English- and French-language newspaper sources based in three countries, spanning 15 years of coverage (2006-2021). We define article-level edit actions: Addition, Deletion, Edit and Refactor, and develop a high-accuracy extraction algorithm to identify these actions. To underscore the factual nature of many edit actions, we conduct analyses showing that added and deleted sentences are more likely to contain updating events, main content and quotes than unchanged sentences. Finally, to explore whether edit actions are predictable, we introduce three novel tasks aimed at predicting actions performed during version updates. We show that these tasks are possible for expert humans but are challenging for large NLP models. We hope this can spur research in narrative framing and help provide predictive tools for journalists chasing breaking news.


Introduction
Revision histories gathered from various natural language domains like Wikipedia (Grundkiewicz and Junczys-Dowmunt, 2014), Wikihow (Faruqui et al., 2018) and student learner essays (Zhang and Litman, 2015) have primarily been studied to explore stylistic changes, such as grammatical error correction (Shah et al., 2020) and argumentation design (Afrin et al., 2020). However, deeper questions about content updates and narrative evolution are underexplored: Which facts are uncertain and likely to be changed? Which events are likely to update? What voices and perspectives are needed to complete a narrative?

We release the dataset and all code used in modeling and evaluation: https://github.com/isi-nlp/NewsEdits.git

Figure 1: We identify sentence-level operations (Edit, Addition, Deletion and Refactor) between two versions of a news article (merges, shown here, and splits are special cases of Edits). We propose tasks aimed at predicting these operations on article versions. We characterize aspects of additions, deletions and edits. We hope NewsEdits can contribute to research on narrative and factual development patterns.
Existing edits corpora do not address these questions due to the nature of previously studied domains: as shown in Yang et al. (2017), the distribution of edits in other domains, like Wikipedia, tends to focus on syntax or style. In this work, we introduce a novel revision-history domain, news articles, which, we show, covers the updating of events. Many edits in news either (1) incorporate new information, (2) update events or (3) broaden perspectives (Section 3).
Our dataset, NewsEdits, contains 1.2 million articles with 4.6 million versions. We develop a document-level view for studying revisions and define four edit actions to characterize changes between versions: sentence Addition, Deletion, Edit and Refactor (i.e. the sentence is moved within a document). We introduce algorithms for identifying these actions. We count over 40 million Edits, Additions, Deletions or Refactors in NewsEdits.
We argue that news is an important, practical medium in which to study questions about narrative, factual and stylistic development. This is because, we hypothesize, there are consistent patterns in the way articles update in the breaking news cycle (Usher, 2018). To test this hypothesis, we show that updates are predictable. We design three tasks: (1) "predict whether an article will be updated," (2) "predict how much of an article will be updated," and (3) "predict sentence-level edit actions." We show that current large language model (LLM)-based predictors provide a strong baseline above random guessing in most tasks, though expert human journalists perform significantly better. Our insights are twofold: (a) article updates are predictable and follow common patterns which humans are able to discern, and (b) significant modeling progress is needed to address the questions outlined above. See Section 4.6 for more details. Finally, we show that the NewsEdits dataset can bring value to a number of specific, ongoing research directions: event-temporal relation extraction (Ning et al., 2018; Han et al., 2019a), article link prediction (Shahaf and Guestrin, 2010), fact-guided updates (Shah et al., 2020), misinformation detection (Appelman and Hettinga, 2015), headline generation (Shen et al., 2017) and author attribution (Savoy, 2013), as well as numerous directions in computational journalism (Cohen et al., 2011; Spangher et al., 2020) and communications fields (Spangher et al., 2021b).
Our contributions are the following: 1. We introduce NewsEdits, the first public academic corpus of news revision histories.
2. We develop a document-level view of structural edits and introduce a highly scalable sentence-matching algorithm to label sentences in our dataset as Addition, Deletion, Edit, Refactor. We use these labels to conduct analyses characterizing these operations.
3. We introduce three novel prediction tasks to assess reasoning about whether and how an article will change. We show that current large language models perform poorly compared with expert human judgement.

The NewsEdits Dataset
NewsEdits is a dataset of 1.2 million articles and 4.6 million versions. In Section 2.1, we discuss the sources from which we gathered our dataset. In Section 2.2, we discuss the categories of edit-actions designed to characterize changes between versions, and in Section 2.3, we discuss the algorithm we built to identify these edit-actions.

Data Collection
We collect a dataset of news article versions. An article is defined by a unique URL, while a version is one publication (of many) to that same URL. We combine data from two online sources that monitor news article updates: NewsSniffer and Twitter accounts powered by DiffEngine. These sources were chosen because, together, they tracked most major U.S., British and Canadian news outlets (Kirchhoff, 2010). Our corpus consists of article versions from 22 media outlets over a 15-year timescale (2006-2021), including The New York Times, Washington Post and Associated Press. Although the median number of updates per article is 2, as shown in Figure 2, this varies depending on the outlet. More dataset details are given in Appendix E.

Edit-Action Operations
Since we are interested in how an entire news article updates between versions, we focus on sentence edits (document-level actions), not word edits (sentence-level actions). Identifying that sentences are added and deleted (vs. updated) can help us study the degree of change an edit introduces in the article (Daxenberger and Gurevych, 2012, 2013; Fong and Biuk-Aghai, 2010).
Thus, we define the following sentence-level edit-actions, shown in Figure 1: Addition, Deletion, Edit and Refactor. Additions should contain novel information and Deletions should remove information from the article. Edits should be substantially similar to the original sentence, differing only through syntactic changes, rephrasing, or minimally changed or updated information. Special cases of the Edit operation result in sentences that are merged or split without substantial changes. See Section 2.3 for more details.
Refactors are sentences that are intentionally moved within an article.

Table 1: F1 scores on validation data for matching algorithms. The left-hand group shows embedding-based methods (TinyBERT (TB) and RoBERTa (RB)) with Maximum or Hungarian matching. The middle group shows ngram methods. The right-hand group shows BLEU for different ngram weightings (1,2 and 1,2,3 are uniform weightings over unigrams, bigrams and trigrams).
Refactors are important because, based on the inverse pyramid structure of news articles (Pöttker, 2003), in which the most crucial information, or purpose of the story, is presented first, sentences that are higher in an article are more important (Scanlan, 2003). Thus, Refactors give us insight into the changing importance of sentences in a narrative.

Edit-Action Extraction
To extract these edit-actions, we need to construct a bipartite graph linking sentences between two versions of an article (an example graph is shown in Figure 1). If an edge exists between a sentence in one version and a sentence in the other, the sentence is an Edit (or Unchanged). If no edge exists, the sentence is an Addition (if the sentence exists in the newer version only) or a Deletion (if it exists in the older version only). We identify Refactors based on an algorithm we develop: in short, we identify a minimal set of edges in the graph which cause all observed edge-crossings (in Figure 1, for example, Sentences 5 and 6 in version_t are shifted upwards in version_{t+1}, movement that is not caused by any other operation, so we label it a Refactor). For details on this algorithm, see Appendix F.

In order to construct this bipartite graph, we need a scalable, effective sentence-similarity algorithm. There is a wide body of research on assessing sentence similarity (Quan et al., 2019; Abujar et al., 2019; Yao et al., 2018; Chen et al., 2018). However, many of these algorithms measure symmetric sentence similarity. As shown in Figure 1, two sentences from the old version can be merged in the new version (e.g. "ipsum. Lorem" → "ipsum; and Lorem"; conversely, one sentence can also be split). The symmetric similarity between these three sentences would be low, leading us to label the old sentences as Deletions and the new one as an Addition, even if they were minimally edited (for concrete examples, see Table 14). This violates our tag definitions (Section 2.2). So, we need to measure one-way similarity between sentences, allowing us to label merged and split sentences as Edits. Our algorithm is an asymmetrical version of the maximum alignment metric described by Kajiwara and Komachi (2016):

sim_asym(x, y) = (1/|x|) Σ_i max_j φ(x_i, y_j),

where φ(x_i, y_j) := the similarity between word x_i in sentence x and word y_j in sentence y. We test several word-similarity functions φ. The first uses simple lexical overlap, where φ(x_i, y_j) = 1 if lemma(x_i) = lemma(y_j) and 0 otherwise (we extend this to non-overlapping ngram matches). The second uses word embeddings, where φ(x_i, y_j) = Emb(x_i) ⋅ Emb(y_j), and Emb(x_i) is the embedding derived from a pretrained language model (Jiao et al., 2020; Liu et al., 2019).
Each φ function assesses word-similarity; the next two methods use φ to assess sentence similarity. Maximum alignment counts the number of word-matches between two sentences, allowing many-to-many word-matches between sentences. Hungarian matching (Kuhn, 1955) is similar, except it only allows one-to-one matches. We compare these with BLEU variations (Papineni et al., 2002), which have been used previously to assess sentence similarity (Faruqui et al., 2018).
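To make the matching procedure concrete, below is a minimal sketch of the one-way (asymmetric) maximum-alignment similarity with the lexical-overlap φ; the tokenizer stand-in, helper names and example sentences are illustrative, not the released implementation.

```python
# Minimal sketch of the asymmetric maximum-alignment similarity
# (Section 2.3) with a lexical-overlap phi. A real implementation
# would lemmatize tokens and could swap phi for an embedding dot
# product; the tokenizer and names here are illustrative only.
def tokenize(sentence: str) -> list[str]:
    # Stand-in for lemmatization: lowercase and strip punctuation.
    return [w.strip(".,;:\"'").lower()
            for w in sentence.split() if w.strip(".,;:\"'")]

def phi(x_i: str, y_j: str) -> float:
    # Lexical overlap: 1 if the (lemmatized) words match, else 0.
    return 1.0 if x_i == y_j else 0.0

def sim_asym(x: str, y: str) -> float:
    # One-way similarity of x with respect to y: each word in x is
    # aligned to its best match in y, so a short old sentence that is
    # fully contained in a longer merged new sentence still scores ~1.
    xs, ys = tokenize(x), tokenize(y)
    if not xs or not ys:
        return 0.0
    return sum(max(phi(x_i, y_j) for y_j in ys) for x_i in xs) / len(xs)

old = "The storm hit Texas."
merged = "The storm hit Texas, and officials ordered evacuations."
print(sim_asym(old, merged))  # high: the old sentence survives inside the merge
print(sim_asym(merged, old))  # lower: the merged sentence adds new content
```

A symmetric metric would score both directions equally low, which is why the one-way form lets merged and split sentences still be labeled as Edits.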

Edit-Action Extraction Quality
Although our sentence-similarity algorithm is unsupervised, we need to collect ground-truth data in order to set hyperparameters (i.e. the similarity threshold above which sentences are considered a match) and to evaluate different algorithms. To do this, we manually identify sentence matches in 280 documents. We asked two expert annotators to identify matches if sentences are nearly the same, if they contain the same information but are stylistically different, or if they have substantial overlap in meaning and narrative function. See Appendix G for more details on the annotation task. We use 50% of these human-annotated labels to set hyperparameters, and 50% to evaluate match predictions, shown in Table 1. Maximum Alignment with TinyBERT-medium embeddings (Jiao et al., 2020) (Max-TB-medium) performs best.
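As a rough sketch of this procedure, threshold selection on the annotated split might look like the following; the arrays, grid and names are illustrative placeholders rather than the actual tuning code.

```python
# Sketch of threshold selection (Section 2.4): one half of the annotated
# sentence pairs picks the similarity cutoff that maximizes F1; the
# held-out half reports the final score. Illustrative values only.
import numpy as np
from sklearn.metrics import f1_score

def tune_threshold(sims_dev, gold_dev, grid=np.linspace(0.0, 1.0, 101)):
    f1s = [f1_score(gold_dev, sims_dev >= t) for t in grid]
    return grid[int(np.argmax(f1s))]

# sims_* are sim_asym scores, gold_* are human match labels (0/1).
sims_dev, gold_dev = np.array([0.9, 0.2, 0.7, 0.1]), np.array([1, 0, 1, 0])
sims_test, gold_test = np.array([0.8, 0.3]), np.array([1, 0])
T = tune_threshold(sims_dev, gold_dev)
print(T, f1_score(gold_test, sims_test >= T))
```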

Exploratory Analysis
We extract all edit-actions in our dataset using the methods described in the previous section. Statistics on the total number of operations are shown in Table 2. In this section, we analyze Additions, Deletions and Edits to explore when, how and why these edit-actions are made and the clues this provides as to why articles are updated. We leave a descriptive analysis of Refactors to future work.

Insight #1: Timing and location of additions, deletions and edits reflect patterns of breaking news and inverse pyramid article structure. How do editing operations evolve from earlier to later versions, and where do they occur in the news article?
In Figure 3a, we show that edit-actions in an article's early versions primarily add or update information: new articles tend to have roughly 20% of their sentences edited, 10% added and few deleted. This fits a pattern of breaking news lifecycles: an event occurs, reporters publish a short draft quickly, and then they update as new information is learned (Hansen et al., 1994; Lewis and Cushion, 2009). We further observe, as demonstrated in Figure 6 in the appendix, that updates occur rapidly: outlets known for breaking news have a median article-update time of < 2 hours. An article's later lifecycle, we see, is determined by churn: ≈ 5% of sentences are added and 5% are deleted every version. As seen in Figure 3b, additions and edits are more likely to occur at the beginning of an article, while deletions are more likely at the end, indicating that newer information is prioritized in an inverse pyramid fashion.
Insight #2: Additions and deletions are more likely to contain fact-patterns associated with breaking news (quotes, events, or main ideas) than unchanged sentences. In the previous section, we showed that the timing and position of edit-actions reflect breaking news scenarios. To provide further clues about the semantics of edit-actions, we sample Additions, Deletions and unchanged sentences and study the kinds of information contained in these sentences. We study three different fact-patterns associated with breaking news: events, quotes and main ideas (Ekström et al., 2021; Usher, 2018). To measure the prevalence of these fact-patterns, we sample 200,000 documents (7 million sentences) from our corpus and run an event-extraction pipeline (Ma et al., 2021), a quote-detection pipeline (Spangher et al., 2020), and a news discourse model (Spangher et al., 2021a). As shown in Table 3, added and deleted sentences are significantly more likely than unchanged sentences to contain each of these fact-patterns.

Insight #3: Edited sentences often contain updating events. The analyses in the previous sections have established that edit-actions are positioned in the article in ways that resemble breaking news epistemologies, and contain information described by them (Ekström et al., 2021). A remaining question is whether the edit-actions change fact-patterns themselves, rather than simply changing the style or other attributes of sentences.
One way to measure this is to explore whether edit-actions update the events in a story (Han et al., 2019b). We focus on pairs of edited sentences. We randomly sample Edits from documents in our corpus (n = 432,329 pairs) and extract events using Ma et al. (2021)'s model. We find that edited sentence pairs are more likely to contain events (43.5%) than unchanged sentences (31.4%). Further, we find that 37.1% of edited sentences with events contain different events across versions. We give a sample of pairs in Table 4. This shows that many within-sentence operations update events.
Taken together, we have shown in this analysis that factual updates drive many of the edit operations that we have constructed to describe NewsEdits revision histories. Next, we will measure how predictable these update patterns are.

Predictive Analysis on NewsEdits
As shown in Section 3, many edit-actions show breaking news patterns, which Usher (2018) observed follow common update patterns. Now, we explore how predictable these operations are, to assess whether future work on the fundamental research questions around narrative design raised in Section 1 is feasible.
In this section, we outline three tasks that involve predicting the future states of articles based on the current state. These tasks, we hypothesize, pose several modeling challenges: (1) identify indicators of uncertainty used in news writing, e.g. "Police to release details of the investigation." (Ekström et al., 2021), (2) identify informational incompleteness, like source representation (Spangher et al., 2020), and (3) identify prototypical event patterns (Wu et al., 2022). These are all strategies that expert human evaluators used when performing our tasks (Section 4.6). The tasks range from easier to harder, based on the sparsity of the data available for each task and the dimensionality of the prediction. We show that they are predictable but present a challenge for current language modeling approaches: expert humans perform these tasks much more accurately than LLM-based baselines.
In addition to serving a model-probing and data-explanatory purpose, these tasks are also practical: journalists told us in interviews that being able to perform these predictive tasks could help newsrooms allocate reporting resources in a breaking news scenario.

Task Description and Training Data Construction
We now describe our tasks. For all three tasks, we focus on breaking news by filtering NewsEdits down to short articles (# sents ∈ [5, 15]) with low version numbers (< 20) from select outlets.

Task 1: Will this document update? Given the text of an article at version v, predict if ∃ v+1. This probes whether the model can learn a high-level notion of change, irrespective of the fact that different edit-actions have different consequences for the information presented in a news article. For Task 1, y = 1 if a newer version of an article was published and 0 otherwise. We sample 100,000 short article versions from NewsEdits, balancing across length, version number, and y.

Task 2: How much will it update? Given the text of an article at version v, predict how many Additions, Deletions, Edits and Refactors will occur in the next version. This moves beyond Task 1 and requires the model to learn more about how each edit-action category changes an article.
For Task 2, y = counts of the sentence-level labels (Num. Additions, Num. Deletions, Num. Refactors, Num. Edits) described in the previous sections, aggregated per document. Each count is binned into [0,1), [1,3), [3,∞) and predicted separately as a multiclass classification problem. We sample 150,000 short article versions, balancing for sources, length and version number.

Figure 4: Architecture diagram for the model used for our tasks. Word embeddings are averaged using self-attention to form sentence vectors. A minimal Transformer layer is used to contextualize these vectors (+Contextual Layer). In Tasks 1 and 2, self-attention is used to generate a document-embedding vector.
Task 3: How will it update? For each sentence in version v, predict whether: (1) the sentence itself will change (i.e. it will be a Deletion or Edit) (2) a Refactor will occur (i.e. it will be moved either up or down in the document) or (3) an Addition will occur (i.e. either above or below the sentence). This task, which we hypothesize is the hardest task, requires the model to reason specifically about the informational components of each sentence and understand nuance about structure and form in a news article (i.e. like the inverse pyramid structure (Pöttker, 2003)).
For Task 3, y = individual sentence-level labels. Labels are derived for the subtasks mentioned above: (1) Sentence Operations is a categorical label comprising [Deletion, Edit, Unchanged], expressed as a one-hot vector. (2) Refactor is a categorical label comprising [Up, Down, Unchanged], also expressed as a one-hot vector. (3) Addition Above and Addition Below are each binary labels expressing whether ≥ 1 sentence was added above or below the target sentence. Because some sentences had Additions both above and below, we chose to model this subtask as two separate classification tasks. We sample 100,000 short article versions, balancing for sources, length and version number.
For each task, the input X is a document represented as a sequence of sentences. For each evaluation set, we sample 4k documents balancing for class labels (some labels are highly imbalanced and cannot be balanced).
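To make the label construction concrete, here is a minimal sketch under the binning above; the dictionary layout, helper names and example values are illustrative rather than the released preprocessing code.

```python
# Sketch of label construction (Section 4.1). Task 2 bins per-document
# edit-action counts into three classes ([0,1), [1,3), [3,inf)); Task 3
# labels each sentence with a categorical operation plus two binary
# addition labels. Names and the example counts are illustrative.
def bin_count(n: int) -> int:
    if n < 1:
        return 0      # none
    if n < 3:
        return 1      # a few
    return 2          # many

# Task 2: one multiclass label per operation type, per document.
doc_counts = {"additions": 4, "deletions": 0, "edits": 2, "refactors": 1}
task2_labels = {op: bin_count(n) for op, n in doc_counts.items()}
print(task2_labels)  # {'additions': 2, 'deletions': 0, 'edits': 1, 'refactors': 1}

# Task 3: per-sentence labels for one example sentence.
task3_labels = {
    "sentence_op": "edit",        # one of deletion / edit / unchanged
    "refactor": "unchanged",      # one of up / down / unchanged
    "addition_above": 1,          # >= 1 sentence added above
    "addition_below": 0,          # no sentence added below
}
```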

Modeling
We benchmark our tasks using a RoBERTa-based architecture shown in Figure 4. Spangher et al. (2021a) showed that a RoBERTa-based architecture (Liu et al., 2019) with a contextualization layer outperformed other LLM-based architectures like Reimers and Gurevych (2019) for document-level understanding tasks (further insight given in Section 4.6).
In our model, each sentence from document d is fed into a pretrained RoBERTa-Base model to obtain contextualized word embeddings. The word embeddings are then averaged using self-attention, creating sentence vectors. For Task 3, these vectors are used directly for sentence-level predictions. For Tasks 1 and 2, these vectors are condensed further, using self-attention, into a single document vector which is then used for document-level predictions. The sentence vectors are optionally contextualized to incorporate knowledge of surrounding sentences, using a small Transformer layer (+Contextualized in Tables 5, 6, 7).
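As a rough illustration of this architecture, here is a condensed PyTorch sketch; class names, layer sizes and pooling details are assumptions for illustration, not the released configuration.

```python
# Condensed sketch of the benchmark architecture in Figure 4, assuming
# the Hugging Face transformers and PyTorch packages. Class names,
# layer sizes and pooling details are illustrative only.
import torch
import torch.nn as nn
from transformers import RobertaModel, RobertaTokenizerFast

class SelfAttentionPool(nn.Module):
    """Learned weighted average (self-attention) over a sequence of vectors."""
    def __init__(self, dim):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, hidden):                        # (seq_len, dim)
        weights = torch.softmax(self.scorer(hidden), dim=0)
        return (weights * hidden).sum(dim=0)          # (dim,)

class NewsEditsBaseline(nn.Module):
    def __init__(self, num_labels, contextualize=True):
        super().__init__()
        self.tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
        self.encoder = RobertaModel.from_pretrained("roberta-base")
        dim = self.encoder.config.hidden_size
        self.word_pool = SelfAttentionPool(dim)       # words -> sentence vector
        self.context = (nn.TransformerEncoderLayer(d_model=dim, nhead=8)
                        if contextualize else None)   # +Contextual Layer
        self.sent_pool = SelfAttentionPool(dim)       # sentences -> doc vector
        self.head = nn.Linear(dim, num_labels)

    def forward(self, sentences, document_level=True):
        sent_vecs = []
        for sent in sentences:
            enc = self.tokenizer(sent, return_tensors="pt", truncation=True)
            hidden = self.encoder(**enc).last_hidden_state[0]   # (len, dim)
            sent_vecs.append(self.word_pool(hidden))
        sent_vecs = torch.stack(sent_vecs)                       # (n_sents, dim)
        if self.context is not None:
            sent_vecs = self.context(sent_vecs.unsqueeze(1)).squeeze(1)
        if document_level:                  # Tasks 1 and 2: one doc-level label
            return self.head(self.sent_pool(sent_vecs))
        return self.head(sent_vecs)         # Task 3: one label per sentence
```

A separate prediction head per subtask, sharing the rest of the network, would give the +Multitask variant described below.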
We experiment with the following variations. For Task 2, we train with less data (n = 30,000 version pairs) and more data (n = 150,000 version pairs), balanced as described in Section 4.1, to test whether a larger dataset helps the models generalize better. We also experiment, for all tasks, with freezing the bottom 6 layers of the RoBERTa architecture (+Partially Frozen) to probe whether pretrained knowledge is helpful for these tasks. Additionally, we experiment with giving the version number of the older version as an additional input feature alongside the text of the document (+Version).
Finally, for Tasks 2 and 3, we attempt to jointly model all subtasks, using separate prediction heads for each subtask but sharing all other layers. We use uniform loss weighting between the tasks. Spangher et al. (2021a) showed that various document-level understanding tasks can benefit from being modeled jointly. For our tasks, we hypothesize that decisions around one operation might affect another: i.e. if a writer deletes many sentences in one draft, they might also add sentences, so we test whether joint modeling has a positive effect.
We do not consider any feature engineering beyond the raw article text.

Table 6: Task 3 Benchmarks: baseline model performance for sentence-level tasks. Addition tasks are: "Was a sentence added below the target sentence?" and "Was a sentence added above the target sentence?" Sentence Operations columns are the three operations that can occur on the target sentence: "Deletion", "Editing", "Unchanged". Refactor is binned into whether the target sentence is "Moved Up", "Moved Down" or "Unchanged".

Human Performance
To evaluate how well human editors agree on edits, we design two human evaluation tasks and recruit 5 journalists with ≥ 1 year of editing experience at major U.S. and international media outlets.
Evaluation Task 1: We show users the text of an article and ask them whether or not there will be an update. Collectively, they annotate 100 articles. After completing each round, they are shown the true labels. This evaluates Task 1.
Evaluation Task 2: We show users the sentences of an article, and they are able to move sentences, mark them as deleted or edited, and add sentence blocks above or below sentences. They are not asked to write any text, only to mark the high-level actions of "I would add a sentence," etc. Collectively, they annotate 350 news articles. After each annotation, they see what edits actually happened. The raw output evaluates Task 3, and we aggregate their actions for each article to evaluate Task 2.
They are instructed to use their expert intuition and they are interviewed afterwards on the strategies used to make these predictions (see Appendix G for task guidelines and interviews).

Table 8: Documents belonging to some topics are easier to predict than others. By label (last column): medium-range growth is easier to predict.

Results
As shown in Tables 5, 6, and 7, model performance indicates that our tasks do range from easier (Task 1) to harder (Task 3). While our models show improvements above the Random and Most Popular baselines in almost all subtasks, a notable exception is Task 3's Addition subtasks, where the models do not clearly beat Random. We note that this was also the most difficult subtask for human evaluators. We observe that +Partially Frozen increases performance on Task 2, boosting performance in all subtasks by ≈ 10 points. In contrast, it does not increase performance on Task 3, perhaps indicating that the subtasks in Task 3 are difficult for the current LLM paradigm. Although adding version embeddings (+Version) boosts performance for Task 1, it does not seem to measurably increase performance for the other tasks. Finally, performing Tasks 2 and 3 as multitask learning problems decreases performance for all subtasks.
In contrast, human evaluators beat model performance across tasks, most consistently in Task 2, with on average performance 20 F1-score points above Baseline models. On Task 3, human performance also is high relative to model performance. We observe that, despite Additions in Task 3 being the hardest task, as judged by human and model performance, humans showed a ≈ 40 point increase above model performance. Humans are also better at correctly identifying minority classes, with a wider performance gap seen for Macro F1 scores (i.e. see Sentence Operations, where the majority of sentences are unchanged).

Error Analysis
We perform an error analysis on Task 2 and find that there are several categories of edits that are easier to predict than others. We run Latent Dirichlet Allocation on 40,000 articles, shown in Table 8 (topic words are shown in Appendix C). We assign documents to their highest topic and find that articles covering certain news topics (like War) update in a much more predictable pattern than others (like Business), with a spread of over 26 F1-score points. Further, we find that certain edit-patterns are easier to differentiate, like articles that grow by 1-5 sentences (Table 8). This shows us ways to select subsets of our dataset that are more standard in their update patterns.
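For illustration, a per-topic error breakdown of this kind could be computed with a sketch like the following, using standard scikit-learn components; the function and variable names, vocabulary size and topic count are illustrative, not the analysis code we ran.

```python
# Sketch of the topic-based error analysis (Section 4.5): fit LDA on
# article texts, assign each document to its highest-weight topic, and
# compare macro F1 per topic. Illustrative names and parameters.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics import f1_score

def per_topic_f1(texts, y_true, y_pred, n_topics=10):
    counts = CountVectorizer(max_features=20_000, stop_words="english")
    X = counts.fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topics = lda.fit_transform(X).argmax(axis=1)   # hard topic assignment
    scores = {}
    for t in range(n_topics):
        mask = doc_topics == t
        if mask.sum() == 0:
            continue
        scores[t] = f1_score(np.asarray(y_true)[mask],
                             np.asarray(y_pred)[mask], average="macro")
    return scores
```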
The class imbalance of this dataset (Table 2) results in the Most Popular baseline scoring highly. To mitigate this, we evaluate on balanced datasets. Class-imbalance-aware training approaches (Li et al., 2020; Spangher et al., 2021a) might be of further help.

Evaluator Interviews
To better understand the process involved with successful human annotation, we conducted evaluator interviews. We noticed that evaluators first identified whether the main news event was still occurring, or whether it was in the past. For the former, they tried to predict when the event would update. For the latter, they considered discourse components to determine if an article was narratively complete and analyzed the specificity of the quotes. They determined where to add information in the story based on structural analysis, and stressed the importance of the inverse pyramid for informational uncertainty: information later in an article had more uncertainty; if confirmed, it would be moved up in later versions. Finally, they considered the emotional salience of events; if a sentence described an event causing harm, it would be moved up.

Clearly, these tasks demand strong world knowledge and common sense, as well as high-level discourse, structural and narrative awareness. Combining these different forms of reasoning, our results show, is challenging for current language models, which, for many subtasks, perform worse than guessing. +Multitask actually decreases performance for both Task 2 and Task 3, indicating that these models learn features that do not generalize across subtasks. This contrasts with what our evaluators said: their decisions to delete sentences often used the same reasoning as, and were dependent on, their decisions to add.
However, we see potential for improvement on these tasks. Current LLMs have been shown to identify common arcs in story-telling (Boyd et al., 2020), identify event sequences (Han et al., 2019b) and reason about discourse structures (Spangher et al., 2021a; Li et al., 2021). Further, models have made progress on the ROCStories challenge, which presents four sentences and tasks the model with predicting the fifth. These are all aspects of reasoning that our evaluators told us they relied on. Narrative arcs in journalism are often standard and structured (Neiger and Tenenboim-Weinblatt, 2016), so we see potential for improvement.

Related Work
A significant contribution of this work, we feel, is the introduction of a large corpus of news edits into revision-history research and the framing of questions around sentence-level edit-actions. Despite the centrality of news writing in NLP (Marcus et al., 1993; Carlson et al., 2003; Pustejovsky et al., 2003; Walker et al., 2006), we know of no academic corpus of news revision histories. Two prior works analyze news edits to predict article quality, but their datasets are not publicly available. 21 Further, we are not aware of any work using WikiNews revision histories. We did not include WikiNews because its collaborative community edits differ from professional news edits.
Since at least 2006, internet activists have tracked changes made to major digital news articles (Herrmann, 2006). NewsDiffs.org, NewsSniffer and DiffEngine are platforms which researchers have used to study instances of gender and racial bias in article drafts (Brisbane, 2012; Burke, ...).

21 Datasets could not be released due to copyright infringement, according to the authors in response to our inquiry.

Conclusion
In this work, we have introduced the first large-scale dataset of news edits, extracted edit-actions, and shown that many are fact-based. We showed that edit-actions are predictable by experts but challenging for current LM-backed classifiers. Going forward, we will develop a schema describing the types of edits. We are inspired by the Wikipedia Intentions schema developed by Yang et al. (2017), and are working in collaboration with journalists to further clarify the differences. This development will help to clarify the nature of these edits as well as focus further directions of inquiry.

Acknowledgements
We are grateful to Amanda Stent, Sz-Rung Shiang, Gabriel Kahn, Casey Williams, Meg Robbins, I-Hung Hsu, Mozhdeh Gheini, Jiao Sun and our anonymous reviewers for invaluable feedback. Spangher is grateful for Bloomberg for supporting this research with a PhD fellowship. May is supported by DARPA Contract FA8750-19-2-0500.

Dataset
We received permission from the original owners of the datasets, NewsSniffer and DiffEngine.
Both sources are shared under strong sharing licenses.
NewsSniffer is released under an AGPL-3.0 License (https://opensource.org/licenses/AGPL-3.0), which is a strong "CopyLeft" license. DiffEngine is released under an Attribution-NoDerivatives 4.0 International license (https://creativecommons.org/licenses/by-nd/4.0/). Our use is within the bounds of intended use given in writing by the original dataset creators, and is within the scope of their licensing.

Privacy
We believe that there are no adverse privacy implications in this dataset. The dataset comprises news articles that were already published in the public domain with the expectation of widespread distribution. We did not engage in any concerted effort to assess whether information within the dataset was libelous, slanderous or otherwise unprotected speech. We instructed annotators to be aware that this was a possibility and to report to us if they saw anything, but we did not receive any reports. We discuss this more below.

Limitations and Risks
The primary theoretical limitation of our work is that we did not include a robust non-Western language source; indeed, our only two languages are English and French. We tried to obtain sources in non-Western newspapers and reached out to a number of activists that use the DiffEngine platform to collect news outside of the Western world, including activists from Russia and Brazil. Unfortunately, we were not able to get a response.
Thus, this work should be viewed with that important caveat. We cannot assume a priori that all cultures necessarily follow this approach to breaking news; indeed, all of the theoretical works that we cite in justifying our directions also focus on English-language newspapers. We provide documentation in the Appendix about the language, source, timeline and size of each media outlet that we use in this dataset.
One possible risk is that some of the information contained in earlier versions of news articles was updated or removed for the express purpose that it was potentially unprotected speech: libel, slander, etc. We discussed this with the original authors of NewsSniffer and DiffEngine. During their years of operation, neither author has received any requests to take versions down. Furthermore, instances of First Amendment lawsuits where the plaintiff was successful in challenging content are rare in the U.S. We are not as familiar with the guidelines of protected speech in other countries.
Another risk we see is the misuse of this work on edits for the purpose of disparaging and denigrating media outlets. Many of these news tracker websites have been used for noble purposes (e.g. holding newspapers accountable for when they make stylistic edits or try to update without giving notice). But we live in a political environment that is often hostile to the core democracy-preserving role of the media. We focus on fact-based updates and hope that this resource is not used to unnecessarily find fault with media outlets.

Computational Resources
The experiments in our paper require computational resources. All our models run on a single 30GB NVIDIA V100 GPU, along with storage and CPU capabilities provided by AWS. While our experiments do not need to leverage model or data parallelism, we still recognize that not all researchers have access to this resource level.
We use Huggingface RoBERTa-base models for our predictive tasks, and release the code of all the custom architectures that we construct at https://github.com/isi-nlp/NewsEdits.git. Our models do not exceed 300 million parameters.

Annotators
We recruited annotators from professional journalism networks like the NICAR listserve. All the annotators consented to annotate as part of the experiment, and were paid $1 per task, above the highest minimum wage in the U.S. Of our five annotators, three are based in large U.S. cities, one lives in a small U.S. city and one lives in a large Brazilian city. Four annotators identify as white and one identifies as Latinx. Four annotators identify as male and one identifies as female. This data collection process is covered under a university IRB. We do not publish personal details about the annotators, and their interviews were given with consent and full awareness that they would be published in full.

A Dataset: Broader Scope
We expect that NewsEdits will be useful for a range of existing tasks for revision corpora, such as edit language modeling (Yin et al., 2018) and grammatical error correction (Grundkiewicz and Junczys-Dowmunt, 2014). We also think NewsEdits can impact other areas of NLP research and computational journalism, including: 1. Resource Allocation in Newsrooms. Newsrooms are often tasked with covering multiple breaking news stories that are unfolding simultaneously (Usher, 2018). When multiple stories are being published to cover breaking news, or multiple news events are breaking at the same time, newsrooms are often forced to make decisions on which journalists to assign to continue reporting stories. This becomes especially pronounced in an era of budget cuts and local-journalism shortages (Nielsen, 2015). We interviewed 3 journalists with over 20 years of experience at major breaking news outlets. They agreed that a predictive system that performed the tasks explored in Section 4 would be very helpful for allowing editors to track which stories are most likely to change the most, allowing them to keep resources on these stories. 2. Event-temporal relation extraction (Ning et al., 2018) and fact-guided updates (Shah et al., 2020). As shown in Tables 3 and 4, added and edited sentences are both more likely to contain events and event updates. We see potential for using these sentences to train revise-and-edit models (Hashimoto et al., 2018).

B Exploratory Analysis Details
Insight #2 in Section 3 was based on several experiments, which we describe in more detail here.

Events: We sample 200,000 documents (7 million sentences) from our corpus, balancing for newspaper source, article length (from 5 to 100 sentences) and number of additions/deletions (from 0% of the article to 50%), and use Eventplus (Ma et al., 2021) to extract all events. We find added/deleted sentences have significantly more events than unchanged sentences.

Quotes: Using a quote extraction pipeline (Spangher et al., 2020), we extract explicit and implicit quotes from the sample of documents used above. The pipeline identifies patterns associated with quotes (e.g. double quotation marks) to distantly supervise training an algorithm to extract a wide variety of implicit and explicit quotes with high accuracy (.8 F1-score). We find added/deleted sentences contain significantly more quotes than unchanged sentences.

News Discourse: We train a model to identify three coarse-grained discourse categories in news text: Main (i.e. main story), Cause (i.e. immediate context), and Distant (i.e. history, analysis, etc.). We use a news discourse schema (Van Dijk, 1983) and a labeled dataset which contains 800 news articles labeled at the sentence level (Choubey et al., 2020). We train a model on this dataset to score news articles in our dataset, achieving a macro F1-score of .67 on validation data using the architecture described in Spangher et al. (2021a). Then, we filter to Addition, Deletion, etc. sentences. We show that added and deleted sentences are significantly more likely than unchanged sentences to be Main or Cause sentences, while unchanged sentences are significantly more likely to be Distant.

C Error Analysis: Continued
As discussed in Section 4.5, we perform Latent Dirichlet Allocation (Blei et al., 2003) to soft-cluster documents. In Table 9, we show the top k = 10 words for each topic i (i.e. the k words with the largest values of β_i).

D.1 Modeling Decisions
For Task 1, we sample documents in our training dataset, balancing across versions and y, and exclude articles with more than 6,000 characters. However, because of the imbalanced nature of the dataset, we could not fully balance. As is seen in Table 2, +Version, the version number of the old version, had a large effect on the performance of the model, boosting performance by over 10 points. We believe that this is permissible because the version number of the old article is available at prediction time. Interestingly, the effect is actually the opposite of what we would expect: as can be seen in Figure 5, the more versions an article has, the more likely it is to have another version. This is perhaps because articles with many versions are breaking news articles, and they behave differently from articles with fewer versions. To more properly test a model's ability to judge breaking news specifically, we can create a validation set in which all versions of a set of articles are included; the model is thus forced to identify, at early versions, whether an article is a breaking news story or not.
For Task 2, we first experiment with different regression modeling heads before reframing the task as a classification task. We test Linear Regression and Poisson Regression heads, seeking to learn the raw counts. However, we found that we were not able to improve above random in any subtask, so we reframed the problem as a binned classification problem.

D.2 Hyperparameters and Training
For all tasks, we used pretrained RoBERTa Base from Wolf et al. (2020). We used reasonable defaults for learning rate, dropout and other hyperparameters explored in Spangher et al. (2021a), which we describe now. For all tasks, we used AdamW as an optimizer, with values β 1 = .9, β 2 = .99, ε = 1e−8. We used batch-size = 1 but experimented with different gradient accumulations (i.e. effective batch size) ∈ [10,20,100]. We did not find much impact to varying this parameter. We used a learning rate of 1e-6 as in Spangher et al. (2021a). Early in experimentation, we trained for 10 epochs, but did not observe any improvement past the 3rd epoch, so we limited training to 5 epochs. We used a dropout probability of .1, 0 warmup steps and 0 weight decay. The embedding dimensionality for the pretrained RoBERTa Base we used is 768, and for all other layers, we used a hidden-dimension of 512.
For deriving sentence embeddings, we tested several different methods. We tested both using the <sep> token from RoBERTa and averaging the word embeddings of each word-piece, as in Spangher et al. (2021a), but found that a third method, using self-attention over the word embeddings (i.e. a learned, weighted average), performed the best. We concatenated a sentence-level positional embedding vector, as in Spangher et al. (2021a), with a max cutoff of 40 positional embeddings (i.e. every sentence with an index greater than 40 was assigned the same vector).
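As a small illustration of the positional cutoff, a sketch under assumed dimensions (the names and the 128-dimensional embedding size are illustrative):

```python
# Sketch of the sentence-position embedding (Appendix D.2): sentence
# indices are clipped at 40, so every later sentence shares one vector.
# The embedding size and names are illustrative assumptions.
import torch
import torch.nn as nn

MAX_POSITIONS = 40
position_embeddings = nn.Embedding(MAX_POSITIONS + 1, 128)

def position_vector(sentence_index: int) -> torch.Tensor:
    clipped = min(sentence_index, MAX_POSITIONS)
    return position_embeddings(torch.tensor(clipped))

# Concatenated onto each sentence vector before the contextual layer.
sentence_vec = torch.zeros(768)
augmented = torch.cat([sentence_vec, position_vector(55)])   # (768 + 128,)
```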

E Dataset Details
Here, we give additional details on the dataset, starting with relevant analyses and ending with technical details that should guide the user on how to access our dataset.

Table 9: The top k = 10 words for each LDA topic (words include people, police, president, trump, minister, court, attack, plane, hospital, school, military, shooting, london and flight, among others).

E.1.1 Amount of time between Versions
The amount of time between republications of an article varies widely across news outlets, and plays a large role in determining what kinds of stories are being republished. As can be seen in Figure 6, we group sources into 4 categories: (1) Figure 6a, those that update articles over weeks (tabloids and magazines), (2) Figure 6b, those that update articles on a daily basis, on median, (3) Figure 6c, those that update 2-3 times a day, and (4) Figure 6d, those that update hourly, or breaking news outlets.
We are especially interested in rapid updates because the limits this timescale imposes on how much information journalists can gather mean these updates are more likely to contain single units of information, updates and quotes. Thus, in our experiments, we focus on The New York Times, Independent, Associated Press, Washington Post, and BBC. We also include Guardian and Reuters because they typically compete directly with the previously mentioned outlets in terms of content and style, even if they do not publish as frequently.

E.1.2 Discourse Across Time
We are interested in the dynamics of articles over time. Although this analysis is still ongoing, we seek to understand how, as an article grows through time, the types of information included in it change. We show in Figures 7a and 7b that in later versions and longer articles, sentences are dominated by Distant discourse. Interestingly, later versions are also more likely to have Main and Cause discourse added. Based on our annotator interviews, we surmise that this is because, for breaking news, a journalist is frequently trying to assess the causes behind the story. In early drafts, we also see Main sentences being removed. This is because, as the story updates in early versions, the Main event is the most likely to change.

Table 10: Top words in unchanged vs. added/deleted sentences. Unchanged: said, trump, people, president, concerns, government, year. Add/Del: says, senate, law, death, wednesday, monday, tuesday.

E.1.3 Top Words
Top Words: We characterize added and deleted sentences by their word usage in Table 10. Words indicating present-tense, recent updates are more likely: day-names like "Monday" or "Tuesday" and the present-tense verb "says" (compared with the past-tense "said" in unchanged sentences).

E.1.4 Collection of Corrections, Authorship
To identify instances of Corrections in added sentences, we used the following lexicon: "was corrected", "revised", "clarification", "earlier error", "version", "article". Here are some examples of corrections:
• CORRECTION: An earlier version of this story ascribed to Nato spokesman Brig Gen Carsten Jacobsen comments suggesting that after Saturday's shooting, people would have to be "looking over their shoulders" in Afghan ministries.
• CORRECTION 19 November 2012: An earlier version of this story incorrectly referred to "gargoyles", not "spires".
• Correction 7 March 2012: An earlier version of this story mistakenly said Rushbrook's car had been travelling at 140mph at the time of the crash.
To identify instances of Contributor Lines, we use the following lexicon: "reporting by", "additional reporting", "contributed reporting", "editing by".
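As an illustration, matching these lexicons against added sentences could look like the following sketch; the lexicons are taken verbatim from above, while the matching logic and names are illustrative assumptions.

```python
# Sketch of lexicon-based detection of correction notices and
# contributor lines in added sentences (Appendix E.1.4). The lexicons
# come from the text above; the matching helper is illustrative.
CORRECTION_LEXICON = ["was corrected", "revised", "clarification",
                      "earlier error", "version", "article"]
CONTRIBUTOR_LEXICON = ["reporting by", "additional reporting",
                       "contributed reporting", "editing by"]

def matches_lexicon(sentence: str, lexicon: list[str]) -> bool:
    text = sentence.lower()
    return any(phrase in text for phrase in lexicon)

sent = "CORRECTION: An earlier version of this story mistakenly said..."
print(matches_lexicon(sent, CORRECTION_LEXICON))   # True
print(matches_lexicon(sent, CONTRIBUTOR_LEXICON))  # False
```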

E.2 Dataset Tables and Fields
Our dataset is released as a set of 5 SQLite tables. Three of them are primary data tables, and two are summary-statistic tables. Our primary data tables are: articles, sentence_diffs and word_diffs; the first two are shown in Tables 12a and 12b (word_diffs shares a similar structure with sentence_diffs). We compile two summary-statistic tables to cache statistics from sentence_diffs and word_diffs; they calculate metrics such as NUM_SENTENCES_ADDED and NUM_SENTENCES_REMOVED per article. These summary-statistic tables make it convenient to, say, filter sentence_diffs in order to train a model on all articles that have one sentence added, or on all articles that have no sentences removed. The sentence_diffs table's schema is shown in Table 12 and some column-abbreviated sample rows are shown in Table 14. As can be seen, the diffs are calculated and organized at the sentence level. Each row shows a comparison of sentences between two adjacent versions of the same article (so, for instance, article A, with versions 1 and 2, where each version has sentences i, ii and iii, would have 3 rows, assuming the sentences were similar). Every row in sentence_diffs contains index columns: SOURCE, A_ID, VERSION_OLD, and VERSION_NEW. These columns can be used to uniquely map each row in sentence_diffs to two rows in articles (e.g. sentence_diffs.VERSION_NEW corresponds to a VERSION_ID in the articles table).

Table 11: A summary of the total number of articles and versions for the different media outlets which comprise our dataset. Also shown is the original collection that they were derived from (DE for DiffEngine, NS for NewsSniffer), and the date ranges during which articles from each outlet were collected.

E.3 TAG columns in sentence _ diffs
The columns TAG_OLD and TAG_NEW in sentence_diffs have a specific meaning: they describe how to transform one version into its adjacent version. In other words, TAG_OLD conveys where to find SENT_OLD in VERSION_NEW and whether to change it, whereas TAG_NEW does the same for SENT_NEW in VERSION_OLD. More concretely, consider the examples in Tables 14b, 14a and 14c. As can be seen, each tag has three components. Component 1 can be either M, A, or R: M means that the sentence in the current version was Matched with a sentence in the adjacent version, A means that a sentence was Added to the new version, and R means the sentence was Removed from the old version (an Added row is not present in the old version and a Removed row is not present in the new version; the two have essentially the same meaning and we could have condensed the notation, but we felt this was more intuitive). Component 2 is only present for Matched sentences and refers to the index or indices of the matching sentence(s) in the adjacent version (in TAG_OLD, the index refers to the SENTENCE_ID of SENT_NEW). Component 3 is also only present if the sentence is Matched: it can be either C or U, indicating whether the matched sentence was Changed or Unchanged.
Although not shown or described in detail, Matched (M) sentences have corresponding entry-matches in the word_diffs table, which has a similar schema and tagging aim.
A user might use these tags in the following ways:
1. To compare only atomic edits, as in Faruqui et al. (2018), a user could filter sentence_diffs to sentences where M..C is in TAG_OLD (or, equivalently, TAG_NEW). Then, they would join TAG_OLD's Component 2 with SENTENCE_ID. Finally, they would select SENT_OLD and SENT_NEW.
2. To view only refactorings, i.e. when a sentence is moved from one location in the article to another, a user could filter sentence_diffs to sentences containing M..U and follow a similar join process as in use case 1.
3. To model which sentences might be added, i.e. p(sentence_i ∈ article_{t+1} | sentence_i ∉ article_t), a user would select all sentences in SENT_OLD, and all sentences in SENT_NEW where A is in TAG_NEW.
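For illustration, these use cases might look like the following with Python's sqlite3; the database file name, the exact tag-string serialization (assumed space-separated, as in the Table 14 demos) and the simplified joins are assumptions rather than documented behavior.

```python
# Illustrative sqlite3 queries for the use cases above. Table and column
# names follow Appendix E.2/E.3; the file name, tag serialization and
# LIKE patterns are assumptions, and the Component-2 join is omitted.
import sqlite3

conn = sqlite3.connect("newsedits.sqlite")   # hypothetical local path

# 1. Atomic edits: sentences Matched to a Changed counterpart.
atomic_edits = conn.execute("""
    SELECT sent_old, sent_new FROM sentence_diffs
    WHERE tag_old LIKE 'M %' AND tag_old LIKE '% C'
""").fetchall()

# 2. Matched-but-Unchanged sentences, e.g. to study refactorings.
unchanged_matches = conn.execute("""
    SELECT sent_old, sent_new FROM sentence_diffs
    WHERE tag_old LIKE 'M %' AND tag_old LIKE '% U'
""").fetchall()

# 3. Additions: sentences present only in the new version.
additions = conn.execute("""
    SELECT sent_new FROM sentence_diffs
    WHERE tag_new LIKE 'A%'
""").fetchall()
```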

E.4 Comparison With Other Edits Corpora
Here, we give a tabular comparison with other edits corpora, showing the relative scale of our dataset.

F Algorithm Details
In this section, we give further examples to justify our asymmetrical sentence-matching algorithm. The examples shown in Tables 14b, 14a and 14c illustrate our requirements. The first example, shown in Table 14b, occurs when a sentence is edited syntactically (synonyms are used, or phrasing is condensed, but substantially new information is not added) and its meaning does not change. So, we need our sentence-matching algorithm to use a sentence-similarity measure that considers semantic changes and does not consider surface-level changes. The second example, shown in Table 14a, occurs when a sentence is split (or, inversely, two sentences are merged). Thus, we need our sentence-matching algorithm to consider many-to-one matchings for sentences. The third example, shown in Table 14c, occurs when sentence order is rearranged, arbitrarily, throughout a piece. Finally, we need our sentence-matching algorithm to perform all pairwise comparisons of sentences.

Algorithm 1: Asymmetrical sentence-matching algorithm. The inputs v_old and v_new are lists of sentences, and the output is an index mapper. If a sentence maps to 0 (i.e. d < T), there is no match. Sim_asym is described in the text.

F.1 Refactors
To identify which sentences were intentionally moved, rather than moved as a consequence of other document-level changes, we develop an iterative algorithm based on the idea that a refactor is an intentional sentence movement that creates an edge-crossing. Algorithm 2 gives our algorithm. In English, our algorithm represents sentence matches between two article versions as a bipartite graph. We use a Binary Tree to recursively find all edge crossings in that graph. This idea is based on the solution to an SPOJ challenge problem: https://www.spoj.com/problems/MSE06H/. We extend this problem to return the set of all edge crossings, not just the crossing number.
Then, we filter edge crossings to a candidate set, applying the following conditions in order and stopping when there is only one edge crossing left: (1) edges that have the most crossings, (2) edges that extend the most distance, or (3) edges that move upwards. In most cases, we only apply the first and then the second condition. In very rare cases, we apply all three. In rarer cases still, we apply all three and still have multiple candidate edges; in those cases, we simply choose the first edge in the candidate set. We continue removing edges until we have no more crossings.
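To make the crossing-removal idea concrete, here is a quadratic-time sketch that applies only condition (1); the binary-tree speedup and the distance and direction tie-breaks of the released algorithm are omitted, and all names are illustrative.

```python
# Simplified sketch of Refactor identification (Appendix F.1): matches
# are edges (old_index, new_index); find pairwise edge crossings, then
# greedily remove the edge with the most crossings until none remain.
# The released algorithm uses a binary-tree-based crossing search and
# extra tie-breaking conditions; this only illustrates the core logic.
from itertools import combinations

def crossings(edges):
    # Two edges cross when their endpoints are ordered oppositely.
    return [(a, b) for a, b in combinations(edges, 2)
            if (a[0] - b[0]) * (a[1] - b[1]) < 0]

def find_refactors(edges):
    edges, refactors = list(edges), []
    while True:
        crossing_pairs = crossings(edges)
        if not crossing_pairs:
            return refactors
        counts = {}
        for a, b in crossing_pairs:
            counts[a] = counts.get(a, 0) + 1
            counts[b] = counts.get(b, 0) + 1
        worst = max(counts, key=counts.get)   # condition (1): most crossings
        refactors.append(worst)
        edges.remove(worst)

# Old sentence at index 3 is displaced downward past two other sentences;
# its edge crosses the most and is returned as the Refactor.
matches = [(0, 0), (1, 1), (2, 2), (3, 5), (4, 3), (5, 4)]
print(find_refactors(matches))   # [(3, 5)]
```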

G.1 Task: Sentence Matching
We give our annotators the following instructions: The goal of this exercise is to help us identify sentences in an article-rewrite that contain substantially new information. To do this, you will identify which sentences match between two versions of an article.
Two sentences match if: 1. They are nearly the same, word-forword. 2. They convey the same information but are stylistically different. 3. They have slightly different information but have substantial overlap in meaning and narrative function.
Examples of Option 3 include (please see the "Examples" section for real examples): 1. Updating events.
• (Ex) The man was presumed missing. → The man was found in his home. • (Ex) The death count was at 23. → 50 were found dead. • (Ex) The senators are still negotiating the details. → The senators have reached a deal. 2. An improved analysis.
• (Ex) The president is likely seeking improved relations. → The president is likely hoping that hard-liners will give way to moderates, improving relations. • (Ex) The storm, a Category IV, is expected to hit Texas. → The storm, downgraded to Category III, is projected to stay mainly in the Gulf. • (Ex) Analysts widely think the shock will be temporary. → The shock, caused by widespread shipping delays, might last into December, but will ultimately subside. 3. A quote that is very similar or serves the same purpose.
• (Ex) "We knew we had to get it done." said Senator Murphy. → "At the end of the day, no one could leave until we had a deal" said Senator Harris. • (Ex) "It was gripping." said the by- "She has not seen him for 12 years, and the first time she saw him was through a monitor," said Lloyd.
She has not seen him for 12 years, and the first time she saw him was through a monitor," said Lloyd.
"The mother, this was the first time seeing her son since he got to the States." M 1 U 3 "She wept, and wept, and wept." A (c) Demo 3: Two features shown: (1) Refactoring, or order-swapping, makes sentences appear as though they have been deleted and then added. Swapped sentences are matched through their tags.
(2) The last sentence is a newly added sentence and is not matched with any other sentence. i ] lists that contain it. c = removeEdge(t) Algorithm 2: Identifying Refactors. We define refactors as the minimal set of edge crossings in a bipartite graph which, when removed, remove all edge crossings.
Annotators completed the task by drawing lines between sentences in different versions of an article. An example is shown in Figure 8. We use highlighting to show non-overlapping word sequences between text blocks, computed using simple lexical overlap. If the user mouses over a text block, they can see which words do not match between all text blocks on the other side. Although this might bias annotators towards our lexical matching algorithms, we do not see them beating TB-medium. This highlighting was very helpful for reducing the cognitive overload of the task.

G.2 Task: Edit Actions
In this task, workers were instructed to perform edit operations to an article version in anticipation of what the next version would look like. We recruited 5 workers: journalists who collectively had over a decade of experience working for outlets like The New York Times, Huffington Post, Vice, a local outlet in Maine, and freelancing.
We gave our workers the following instructions.
You will be adding, deleting and moving sentences around in a news article to anticipate what a future version looks like.
• Add a sentence either below or above the current sentence by pressing the Add ↑ or Add ↓ buttons. Adding a sentence means that you feel there is substantially new information, a novel viewpoint or quote, or necessary background information that needs to be present.

G.3 Annotator Analysis
We seek here to characterize the performance of different expert annotators. We see in Table 15 that three workers did over 30 tasks each. We characterize per-task accuracy by counting the number of edit operations per document and checking whether the annotator produced the same number as the true number of edits (each expressed as a binned count, i.e. low: [0,1) operations, medium: [1,3) operations, high: [3,∞) operations). We show in Table 16 that there is a wide variety of performance, with some workers getting over 75% of the operations correct and others getting ≈ 30% correct.

Figure 9: Example of the Editing Task. The gray boxes on the left serve as a reference for how the original article was written. The sandbox on the right is where annotators actually perform the task. The first sentence has been Edited, two sentences have been Added, the third has been Deleted and the fourth has been Refactored downwards.
Interestingly, we see that a learning process occurs: in Figure 10, workers get better over time as they complete more tasks. This indicates that the training procedure of letting them see the edits that actually occurred is successful at teaching them the style and patterns that edits tend to take.

G.4 Annotator Interview 1
This annotator was involved in the Editing task. They edited 50 stories.
1. What was your general thought process?

Well, my first general thought was: "how do I do this update?" Then I thought back to the instructions, and really tried to predict how the AP would update. I then had to decide what timespan I'd use: in general, I assumed a 24-hour update window, but sometimes it was different. If the story updates 2 hours after news breaks vs. 2 days, it will look very different. Sometimes, I would read the story, try to figure out what the story was about, ask what was missing, what I'd include in a story if I was reporting it fully. A lot of times what I felt was missing was more causal analysis, more quotes, more perspectives.
As I was going through, I almost always decided to edit the lede, and was almost always correct with that. Most ledes, I thought, could be more efficient; they could incorporate more details from further down in the story. Also, as a story unfolds, the actor responsible for the event becomes clear, and that information will get added to the lede. For example, a building collapses in Manhattan → a faulty beam caused the building collapse. This detail often only becomes apparent afterwards.
What I realized doing this was that there are different genres of breaking news article, and genre matters a lot for how a story gets updated. These are the categories:

(a) Stories where the future is contingent, and you're making predictions in real time. ex) A sailor went missing off the Isle of Man. This story is fundamentally about an unknown: will he be discovered or not? This is one of the harder ones to figure out how to update. How it plays out determines how it will be updated. If the search goes on for a long time, you'll have more details, you'll have quotes from his family, conditions on the water. If he's found, this stuff becomes irrelevant. You'll have information about how he gets found, then you'll have information about how many people get updated. ex) A story was about "Trump is about to make a speech", "Trump expected to speak". I updated it as if the event hadn't happened yet. But the real update actually contained him speaking. Stories where multiple futures can happen, without knowing the timescale of the update, are difficult to predict. I determined whether an event was unfolding by looking for several clues. I looked for certain words: "expected", "scheduled", etc. Usually this signals an event-update. I looked for stories where there's a ton of uncertainty.
Another clue was when the only sources are official statements (ex. "Officials in Yemen say something happened"). The space of possible change increases: you're going to get conflicting reports, eyewitnesses contradicting official statements.
Some articles included direct appeals to readers: "don't use the A4 if you're traveling between London, etc." For crime articles: "if you have any information, please contact agency." This kind of direct appeal is not relevant in the next version.

(b) Stories where the event is totally in the past. For these stories, I looked for vagueness in the original article to determine what would be updated. The more specific it is, for example, with exact death toll numbers and information about specific actors and victims, the less it's going to be updated. For these stories, my tendency was to add at least 1-2 sentences of context towards the end of every story. If you're writing for Reuters, you might not need that.
In general, I wanted to see some background and the people involved.
The quotes you're getting, are they press releases or are they directly from people? If they are more official statements and press releases, then you'll see more updates in the form of specific victim quotes. One general note: most breaking stories were about bad things. Disasters, crashes, missing people, etc. For a bombing, there's a pretty predictable pattern of expansion. The death toll will get added, more eyewitness accounts. It has an expansionary trajectory.

2. How did you determine if a sentence needed to be added?

I decided to add anywhere I saw vagueness. I added a lot towards the beginning; right after the nut graf is where I added the most sentences. If I saw a sentence taken from a press release, I added after that, assuming that the journalist would get a more fleshed-out quote from someone.
Often I added [sentences] at the end to add context. I never added something before the lede.
Maybe a story has two ideas; then I'd add sentences to the second half to flesh out the second idea.
Sometimes I thought about different categories of information (quotes, analysis, etc.) and it was obvious if some of that was missing.

3. How did you determine if a sentence needed to be deleted?

I very rarely thought things needed to be deleted. One of the challenges of the experiment was that it was hard to indicate how to combine sentences. I got around this by hitting "edit" for sentences that needed to be combined. Then I'd delete ones below, assuming that the edited sentence would include a clause from the sentence below it.
Structural sentences and cues got deleted often. Sentences like "More follows", etc. Nothing integral to the substance of the story.
I noticed that almost always, [informational content of sentences that had been deleted] had been reincorporated.

4. How did you determine if a sentence needed to be moved up/down?

I did this by feel, by what seemed important. One example: a building collapse in Morocco. A sentence way towards the end had a report about weak foundations; that needed to be brought up. This indicated that the journalist had become more confident about something. The inverted pyramid is so widely used that, in a breaking news story, it's fairly easy to weigh the importance of different elements. Thus, I rarely felt the need to move items upwards.
Sometimes I saw examples of when what was initially a small quote from an official was expanded in a later version. Then, it was brought up because the quote became more important. But usually, my instinct would not be to move quotes from officials up.

5. Did it help to see what actually happened after you finished the task?
Usually there were 1-2 things that we had done that were basically the same.
A couple of times, [I] was satisfied to see that the updated story made the same decision to switch sentences around.

6. Any general closing thoughts?
The most interesting thing was to see how formally constrained journalists and editors are, and how much these forms and genres shape your thought and your work.
There are assumptions that get baked into the genres about who's credible, what kinds of things carry weight, what sorts of outcomes deserve special attention: a whole epistemic framework.
Even though there's a lot of variation, there's a fair amount of consistency.
I was disappointed that, especially for rapidly expanding stories, the edits were mainly about immediate causes and main events. I saw very few structural, causal analyses added to breaking stories. There was some analysis that got added to one story about bombings in the Middle East, but still, not a whole lot about how the specific conflict originated.