Revisiting Sentence Union Generation as a Testbed for Text Consolidation

Tasks involving text generation based on multiple input texts, such as multi-document summarization, long-form question answering and contemporary dialogue applications, challenge models for their ability to properly consolidate partly-overlapping multi-text information. However, these tasks entangle the consolidation phase with the often subjective and ill-defined content selection requirement, impeding proper assessment of models' consolidation capabilities. In this paper, we suggest revisiting the sentence union generation task as an effective well-defined testbed for assessing text consolidation capabilities, decoupling the consolidation challenge from subjective content selection. To support research on this task, we present refined annotation methodology and tools for crowdsourcing sentence union, create the largest union dataset to date and provide an analysis of its rich coverage of various consolidation aspects. We then propose a comprehensive evaluation protocol for union generation, including both human and automatic evaluation. Finally, as baselines, we evaluate state-of-the-art language models on the task, along with a detailed analysis of their capacity to address multi-text consolidation challenges and their limitations.


Introduction
In order to acquire knowledge on a new subject or find answers to complex questions, it is often necessary to consult multiple sources of written information. While information provided in a single document is usually consistent, textual materials from various sources often use different language expressions, which may vary in terms of level of specificity, to convey similar information. An illustration of this phenomenon can be seen in Figure 1. In this paper, we aim to address the process of combining such multiple partially overlapping textual sources into a single unified and comprehensive format, to which we refer as text consolidation.
[S1] The fire has destroyed a large section of the store and fire crews and investigators are still on the scene.
[S2] A FIRE has badly damaged the Waitrose supermarket in Wellington's High Street.
[Union] The fire has destroyed a large section of the Waitrose supermarket in Wellington's High Street and fire crews and investigators are still on the scene.
Figure 1: An example of a sentence pair and its union sentence. Information that must be included in the union is highlighted differently for each sentence (green and purple for sentences 1 and 2, respectively), unless the information is paraphrastic (equivalent) between the two sentences, which is then highlighted by the same color (blue). Non-highlighted information indicates that there is corresponding information in the other sentence that is more specific.
¹ Our data and code are available at: https://github.com/eranhirs/sentence_union_generation
Text consolidation plays a crucial role in almost any text-based information access application, such as Multi-Document Summarization (MDS) (Fabbri et al., 2019; Giorgi et al., 2022), long-form question answering (Fan et al., 2019; Nakano et al., 2022), and contemporary dialogue applications (Thoppilan et al., 2022; OpenAI, 2023). It is important to point out here that content selection and consolidation manifest two distinct sub-tasks in such applications, where the former involves identifying the sought information in the source texts, based on considerations such as salience and user needs. Consolidation, on the other hand, involves merging the selected information into a coherent output text. Accordingly, we suggest that each sub-task deserves separate investigation, while focusing in this paper on the consolidation task, manifested as information union. This approach enables targeted investigation of information union capabilities of models, while enabling modular architectures, where an effective information consolidation model can be paired with different content selection models and strategies, whether fully-automatic or interactively involving a user in the loop.
To achieve a more controlled research environment, a sentence fusion task was introduced, which fuses a set of sentences into a single sentence (Barzilay et al., 1999; Thadani and McKeown, 2013; Agarwal and Chatterjee, 2022). However, being similar to summarization, the general sentence fusion task is ill-defined, because it allows for subjective salience-based content selection decisions (Daume III and Marcu, 2004; Krahmer et al., 2008). In contrast, the sentence union generation task is strictly defined as generating a sentence that contains exactly all information from the source sentences (see Fig. 1). While identifying the union task to be more attractive due to its more objective and semantically challenging nature, we found that datasets for this topic are relatively scarce (McKeown et al., 2010; Geva et al., 2019; Lebanoff et al., 2020), none of them sufficiently addressing the text consolidation setting.
Consequently, we revisit the sentence union generation task and propose that it can be used as an effective generic testbed for text consolidation. Compared to the sentence intersection task, the union task is more challenging, as it requires merging both joint and disjoint information in the output and hence provides a more complete testbed for text consolidation. Our input format is rich and challenging enough, as shown in our analyses, to support research on information merging models. Further, this setting may already be of practical use for downstream text generation tasks, for example when combined with sentence compression or decontextualization models.
Our contributions are outlined as follows: (1) we suggest focusing on sentence union generation as a resource for studying cross-text consolidation capabilities, and point out that properly identifying informational relations between pairs of sentences is necessary for proper consolidation; (2) we provide the largest union fusion dataset to date, while proposing a controlled annotation protocol and interface for careful creation of a sentence union corpus; (3) we suggest evaluation protocols to assess the quality of a generated sentence union, accompanied by automatic metrics that can be used for comparing multiple systems; (4) we provide empirical results on the abilities of prominent neural generative models to address the union task, assessing their capabilities and limitations.

Background
In Multi-Document Summarization (MDS) (Narayan et al., 2018; Fabbri et al., 2019) multiple texts are summarized into a single, shorter text. In a more controlled variant of MDS, the task requires the fusion of partly-overlapping sentences (Barzilay et al., 1999; Thadani and McKeown, 2013; Agarwal and Chatterjee, 2022). Generally, the sentence fusion task included a saliency detection (or importance) component which requires identifying which pieces of information to preserve in the fused output. As a result, sentence fusion is generally ill-defined, as different possible content selections may be valid, making the task subjective to varying necessities of a user (Daume III and Marcu, 2004; Krahmer et al., 2008). Its output could be seen as covering a "loose" intersection of the content of two sentences. McKeown et al. (2010), on the other hand, to ensure more consistent fusion settings, make a distinction between two strict variants of the task: sentence intersection and sentence union generation. Given two (or a set of) source sentences, their intersection is a sentence that contains only information that is common to both source sentences, while their union is a sentence that contains all information from the source sentences. As we will see in §3, these tasks can indeed be formulated in strict entailment terms. McKeown et al. (2010) crowdsourced a dataset of 300 examples for sentence intersection and sentence union, but subsequent works mostly focused on the intersection fusion part of the dataset (Thadani and McKeown, 2011; Fuad et al., 2019). Further, their dataset size is relatively small and primarily intended for evaluation purposes, making it inadequate for partitioning into a training dataset for fine-tuning large language models.
While McKeown et al. (2010) used similar sentences, whose contents partly overlap, as input, later works researched the union of disparate sentences (Geva et al., 2019; Lebanoff et al., 2021) where contents are disjoint. This does not address the challenge of consolidating partly overlapping texts. In this work, we chose sentence union as a more complete testbed for multi-text consolidation. We see our work as a continuation of the work by McKeown et al. (2010), and complementary to works that introduced fusion datasets for disparate sentences.
Our work further relates to a line of research that focuses on objective generation of text. Castro Ferreira et al. (2020) introduced a data-to-text generation task, wherein knowledge graph triplets describing facts are transformed into natural language text. While there are many possible realizations of the knowledge graph into natural language, the task is semantically objective, with respect to the informational content expected in the output, and is hence similar to the sentence union task. Recently, Slobodkin et al. (2022) introduced a new controlled text reduction task: given an input document with highlighted spans, the task is to generate a summary in which only the information covered in the highlighted spans is included, which could be compared to a highlight union task. Compared to our work, the spans that they used all appear in a single document, which makes it more similar to datasets which fuse disparate sentences.

Task Formulation
The input for our sentence union task consists of two related sentences whose content partly overlaps. The output union is then defined as a single sentence that follows two conditions: (a) it contains exactly the information from the two input sentences, and (b) it does not include any redundancies in its content. Condition (a) implies that there cannot be any information missing from the union that is mentioned in the source sentences, while at the same time the union cannot contain information that is not mentioned in the source sentences (i.e., hallucinations). Condition (b) implies that the union must avoid repetition of any units of information stemming from the source sentences, even if they are conveyed in different lexical terms. Notably, the semantic content of the output union (condition (a)) can be defined objectively in strict textual entailment terms. Formally, given an input of two related sentences $s_1$ and $s_2$, and their union $u$, $u$ should satisfy $u \models s_1$, $u \models s_2$ and $s_1 + s_2 \models u$, where $\models$ denotes textual entailment and $+$ denotes concatenation of the two sentences. This definition, however, does not cover condition (b) of avoiding redundancies.
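The three entailment conditions for condition (a) can be sketched as a simple check. This is only an illustrative sketch: the `entails` callable is a hypothetical interface standing in for any textual entailment judge (human or model), and `toy_entails` is a crude word-containment stand-in used purely so the example runs. It is not a real entailment model and not part of our method.

```python
from typing import Callable

# Hypothetical interface: True when `premise` textually entails `hypothesis`.
Entails = Callable[[str, str], bool]


def satisfies_content_condition(s1: str, s2: str, union: str, entails: Entails) -> bool:
    """Check condition (a): u |= s1, u |= s2, and s1 + s2 |= u."""
    # Coverage: the union must entail each source sentence.
    covers_both = entails(union, s1) and entails(union, s2)
    # Faithfulness: the concatenated sources must entail the union.
    is_faithful = entails(f"{s1} {s2}", union)
    return covers_both and is_faithful


# Toy stand-in (assumption): content-word containment, NOT real entailment.
_STOP_WORDS = {"a", "an", "the", "and"}


def _content_words(text: str) -> set:
    return {w.strip(".,").lower() for w in text.split()} - _STOP_WORDS


def toy_entails(premise: str, hypothesis: str) -> bool:
    return _content_words(hypothesis) <= _content_words(premise)


if __name__ == "__main__":
    s1 = "A fire damaged the store."
    s2 = "A fire damaged the roof."
    print(satisfies_content_condition(s1, s2, "A fire damaged the store and the roof.", toy_entails))
```

Note that, as stated above, such a check says nothing about condition (b): a union that repeats information can still pass all three entailment tests.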
Identifying relevant informational links is crucial for producing a union, as demonstrated by the example in Fig. 2. We observe three types of relations between information units in the source sentences that affect the content of the resulting union: (1) equivalent content, (2) uni-directional entailing content, and (3) disjoint content. Equivalent content, such as lexical equivalence or paraphrases, needs to be identified and included exactly once in the union to avoid redundancy. Uni-directional entailing content pertains to aligned text spans where one span can be implied from the other. In this case, only the entailing text unit should be included: including both spans would be redundant, while including only the less specific mention would result in missing information. Disjoint content must be included in the union as it provides distinct information not mentioned in the other sentence. For example, in Fig. 2, sentence 1 mentions the reason for firing Weightman while sentence 2 mentions that Harvey resigned, each providing distinct information. In addition, according to our annotation scheme, we assume that the date of the publication is known, which means that when a phrase such as "the previous Thursday" is mentioned, we can infer the specific date. Thus, the text spans "On March 1st" and "the previous Thursday" are equivalent, while "Francis Harvey" in sentence 1 is more specific than the text span "Harvey" in sentence 2. By considering these three types of relations, a proper union can be produced.
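To make the three relation types concrete, the following sketch encodes the selection rule for each of them. All names here (`Relation`, `AlignedSpans`, `spans_for_union`) are hypothetical and introduced only for illustration, and the example spans are loosely drawn from the discussion of Fig. 2. The sketch covers only which spans to keep, not how to blend them into a fluent sentence.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Relation(Enum):
    EQUIVALENT = "equivalent"              # same content, possibly paraphrased
    S1_MORE_SPECIFIC = "s1_entails_s2"     # span in sentence 1 entails span in sentence 2
    S2_MORE_SPECIFIC = "s2_entails_s1"     # span in sentence 2 entails span in sentence 1
    DISJOINT = "disjoint"                  # content appears in one sentence only


@dataclass
class AlignedSpans:
    span1: Optional[str]   # span from sentence 1 (None if absent)
    span2: Optional[str]   # span from sentence 2 (None if absent)
    relation: Relation


def spans_for_union(alignments: List[AlignedSpans]) -> List[str]:
    """Select the spans that a union should express, one decision per alignment."""
    kept: List[str] = []
    for a in alignments:
        if a.relation is Relation.EQUIVALENT:
            kept.append(a.span1)           # include exactly once (choice of side is arbitrary)
        elif a.relation is Relation.S1_MORE_SPECIFIC:
            kept.append(a.span1)           # keep only the entailing (more specific) span
        elif a.relation is Relation.S2_MORE_SPECIFIC:
            kept.append(a.span2)
        else:                              # disjoint content is always included
            kept.extend(s for s in (a.span1, a.span2) if s is not None)
    return kept


if __name__ == "__main__":
    example = [
        AlignedSpans("On March 1st", "the previous Thursday", Relation.EQUIVALENT),
        AlignedSpans("Francis Harvey", "Harvey", Relation.S1_MORE_SPECIFIC),
        AlignedSpans(None, "has resigned", Relation.DISJOINT),
    ]
    print(spans_for_union(example))
```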
As noted earlier, we see the union generation task as a more comprehensive setup for information consolidation than the intersection generation task². This is because the union output should combine all the content from both source sentences, while the output of the intersection task does not include information mentioned in only one of the sentences. As a result, the union is more informative than the intersection, which makes it more representative for downstream multi-text tasks requiring information consolidation, aiming to create an efficient, non-repetitive output text.

Data sources
Annotating a text consolidation sentence union dataset requires a collection of related sentences, as input, as seen in Fig. 1. Specifically, we require naturally occurring sentences with some semantic overlap, where different types of informational relations are present. Note that we do not consider sentences with no content overlap as relevant for our dataset.
To that end, we use the dataset created by Weiss et al. (2021), which includes pairs of relevant sentences with high semantic overlap. Their dataset was curated by identifying information overlap between sentences, based on the repurposing of existing human annotations. This approach is preferable to using models that identify semantic overlap, such as Thadani and McKeown (2013), since it introduces less bias to the dataset. The original datasets from which they sourced the sentences include: (1) the Event Coreference Bank (ECB+, an extension over ECB) (Cybulska and Vossen, 2014), which provides annotations for coreferring event and entity mentions, (2) MultiNews (MN) (Fabbri et al., 2019), which contains clusters of news articles along with human-written summaries, and (3) The Document Understanding Conference (DUC) and the Text Analysis Conference (TAC)³, both providing MDS evaluation datasets.
[Union] Army Secretary Francis Harvey, who dismissed Walter Reed commander Major General George Weightman the previous Thursday because the army had lost trust and confidence in him, has resigned himself.
Figure 2: An example of a pair of sentences, the informational relations between their text spans, and their union. In order to generate the union, it is first necessary to identify these relations (possibly implicitly), and then include all new or more specific information (denoted by colors) without redundancy.

Annotating sentence union
The process of writing a sentence union involves carefully tracking information units and blending them together to form the output, as outlined in §3. We introduce an elaborate crowdsourcing approach and interface (see Figure 3) for annotating union datasets at a large scale, which splits the annotation process into multiple steps.
Starting with the two source sentences, the first step is to choose one sentence as the base sentence, that will be used as the basis for generating the sentence union (Fig. 3, [1]). Our early experiments have shown that it is easier to merge the information from one sentence by adding it to the other sentence than to write a merged sentence from scratch. We instruct the workers to choose the more detailed sentence as the base sentence, since this sentence would usually require fewer edits when merging into it information from the other sentence. In the other sentence, termed the integrated sentence, the worker has to highlight which spans they would like to integrate into the base sentence (Fig. 3, [2]). Finally, in the writing step, the worker blends the highlighted spans into the base sentence, thus creating the sentence union (Fig. 3, [3]).
³ https://duc.nist.gov/, https://tac.nist.gov/
To optimize the diversity of inputs within our dataset while considering our annotation budget, each example was assigned to a single annotator. To ensure the quality in annotators' decisions, our process follows the controlled crowdsourcing approach (Roit et al., 2020). See App. C for more details and screenshots of the entire annotation process.

Skipping examples
In certain cases, it may not be possible to generate a coherent sentence union from a pair of sentences, and annotators were given the option to skip such examples. A comprehensive analysis of these skipped cases is presented in Appendix A. Mainly, our findings indicate that the dataset from which we derived our data (Weiss et al., 2021), which was primarily designed for proposition alignment, contains many sentence pairs that are not sufficiently related to each other and hence are not suitable for producing a meaningful union.

Subtle annotation cases
In addition to the aforementioned instructions, we took into consideration a few prominent special cases concerning the source sentences that would affect the resulting sentence union. Such cases include the need for world knowledge, temporal issues, subjectivity and attribution. For examples and guidelines provided to the workers for such cases, refer to App. B.

Cleaning annotations
In order to ensure a high quality dataset, we introduced a post-processing step in which we either removed or manually edited examples matching specific filtering criteria. Filtering included finding non-overlapping input sentences based on their output union (i.e., the output was a simple concatenation of the two source sentences), as well as automatically identifying and manually reviewing subtle annotation cases described in App. B. For more details, see App. D.

Dataset Analysis and Assessment
In the following subsections, we report various analyses of the quality and other properties of our dataset. Dataset split statistics appear in Table 1. Our approach yielded a test dataset comprising 477 instances, a sample size which is reasonable in light of the confidence intervals outlined in §8. Moreover, our analysis of learning curves (see Appendix G) suggests that the size of our training dataset is sufficient, and further expansion may not yield significant benefits.

Sentence union quality
To estimate the reliability of our dataset, we have conducted a human assessment on a sample of 100 examples of sentence unions generated by our annotators. Our goal is to check whether the sentences in the dataset objectively fulfill the union requirements defined in Sec. 3. For this purpose we designed two evaluation criteria for content (coverage, faithfulness), and one criterion for finding redundancies (redundancy). In addition, we evaluate the fluency of the generated sentence, as commonly done for generation tasks.
• Coverage: Does the sentence union contain all information expressed in the source sentences?
• Faithfulness: Does the sentence union describe only information expressed in the source sentences?
• Redundancy: Does the sentence union redundantly repeat some information?
• Fluency: Does the sentence union progress fluently, form a coherent whole, and remain easy to understand?
The content criteria resemble closely those used for data-to-text generation tasks (Castro Ferreira et al., 2020) which also require exact content matching between their input and output. We add another criterion for evaluating redundancies, as our input does include redundancies which need to be avoided in the output.
As a simple way to measure the content criteria, we count the number of content words⁴ involved in pieces of information that are missing from the sentence union, or are unfaithful to the source sentences. For example, if the sentence union in Fig. 2 did not mention the name "Nick Jones", which was mentioned in sentence 2, we count this as 2 misses. A more complicated example would be if the sentence union attributes "Nick Jones" to the wrong entity, such as "FBI Deputy Director Nick Jones". In such a case, we consider the entire span (5 words) as missing, as well as unfaithful. Note that faithfulness can be seen as symmetrical to coverage, where we simply count content words in the sentence union that are not supported in the source sentences. Similarly, for the redundancy score, we count the number of content words involved in pieces of information that are redundant in the union. For example, in the phrase "Thursday overnight at 2:09am", the phrase "overnight" is considered redundant, and we will count 1 redundant word. We did not notice any fluency issues in the sentence unions created by the workers, as may be naturally expected given the high quality of our selected workers.
We start by counting the number of content words in all of the sentence unions in our sample, which adds up to 2372 content words, termed $w_{total}$. Then, to create a coverage score, the count of missing content words is termed $w_{missing}$, and the coverage score is calculated as $\frac{w_{total}}{w_{total} + w_{missing}}$. To create the faithfulness and redundancy scores, we calculate $1 - \frac{w_{unfaithful}}{w_{total}}$ and $1 - \frac{w_{redundant}}{w_{total}}$, respectively, where $w_{unfaithful}$ is the number of unfaithful words and $w_{redundant}$ is the number of redundant words. Results for these metrics are available in Table 2. Overall, coverage issues were encountered in 8 examples out of 100, faithfulness and redundancy issues in one example each.
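The three word-count scores described above can be written down directly. The sketch below follows our reading of the definitions (coverage as total over total plus missing; faithfulness and redundancy as one minus the flawed fraction). The function names are hypothetical, and the error counts in the usage lines are made-up placeholders rather than values from Table 2; only the total of 2372 content words comes from the text.

```python
def coverage_score(w_total: int, w_missing: int) -> float:
    """Share of the required content words that the unions actually contain."""
    return w_total / (w_total + w_missing)


def faithfulness_score(w_total: int, w_unfaithful: int) -> float:
    """One minus the fraction of union content words unsupported by the sources."""
    return 1 - w_unfaithful / w_total


def redundancy_score(w_total: int, w_redundant: int) -> float:
    """One minus the fraction of union content words that repeat information."""
    return 1 - w_redundant / w_total


if __name__ == "__main__":
    W_TOTAL = 2372  # content words in the 100 sampled unions (from the text)
    # The counts below are hypothetical placeholders for illustration only.
    print(f"coverage     = {coverage_score(W_TOTAL, w_missing=20):.4f}")
    print(f"faithfulness = {faithfulness_score(W_TOTAL, w_unfaithful=2):.4f}")
    print(f"redundancy   = {redundancy_score(W_TOTAL, w_redundant=1):.4f}")
```

Note the asymmetry in the denominators: missing words are, by definition, not part of the union, so they are added to the total for coverage, whereas unfaithful and redundant words are already counted within $w_{total}$.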
Quality comparison to the prior dataset. We compare our dataset to the McKeown et al. (2010) dataset of 300 sentence union examples. In their annotation process, 5 workers annotated each pair of sentences, and then a single sentence union out of the 5 was automatically chosen as a representative. We evaluated a sample of 20 such representative sentence unions and used the same quality metrics that were used in our dataset quality analysis, reported in Table 2. We conclude that our controlled process, which separates the identification of informational relations from the writing phase, results in higher quality sentence unions, making significantly fewer coverage and redundancy mistakes, which are often due to lack of attention to details. For the faithfulness criterion, both approaches achieved similar high scores, which is expected since humans are not prone to hallucinate when editing a sentence. Overall, our annotation process achieves slightly better results, while employing only one worker instead of five.

Dataset compression rate
Our motivation for the union task is to develop models that can consolidate information from naturally occurring texts with varying degrees of overlapping information. Hence, in order to assess the diversity of our dataset with respect to the degree of such information overlap, we suggest computing and analyzing the Compression Rate (CR) in our instances, which measures in our setting the amount of redundancies (unlike the data-to-text setting) between the two source sentences⁵. By design, a CR of 100% would imply that a single source sentence contains all of the information in both source sentences, which means that the other sentence is completely redundant. A CR of 0% would imply that there are no redundancies between the source sentences.
Denoting our two input sentences short and long, per their lengths, as well as the union sentence, and following the rationale above, the compression rate is calculated as the amount of information that is eliminated from the shorter sentence. As can be seen in Fig. 4, our dataset supplies a variety of examples in terms of CR for every split. We report an average CR score of 60.82 ±0.67 for our dataset and an average CR score of 65.62 ±1.35 for McKeown et al. (2010). These results imply that our dataset on average contains somewhat less overlap between the source sentences, and overall includes a large variety of redundancy levels.
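The exact CR formula is not spelled out in this section, so the following is only one plausible reading that matches the two boundary cases described above: the portion of the shorter sentence that does not add length to the union beyond the longer sentence. Measuring length in whitespace-separated tokens is likewise an assumption made for this sketch, and the function name is hypothetical.

```python
def compression_rate(sent_a: str, sent_b: str, union: str) -> float:
    """Assumed reading: CR = 100 * (1 - (|union| - |long|) / |short|).

    Lengths are whitespace token counts (an assumption of this sketch).
    CR = 100 when the union is no longer than the longer source sentence,
    CR = 0 when the union is as long as both sources concatenated.
    """
    len_a, len_b, len_u = (len(t.split()) for t in (sent_a, sent_b, union))
    short, long_ = min(len_a, len_b), max(len_a, len_b)
    if short == 0:
        raise ValueError("source sentences must be non-empty")
    return 100.0 * (1 - (len_u - long_) / short)


if __name__ == "__main__":
    s1 = ("The fire has destroyed a large section of the store and fire crews "
          "and investigators are still on the scene.")
    s2 = "A FIRE has badly damaged the Waitrose supermarket in Wellington's High Street."
    u = ("The fire has destroyed a large section of the Waitrose supermarket in "
         "Wellington's High Street and fire crews and investigators are still on the scene.")
    print(f"CR = {compression_rate(s1, s2, u):.1f}")
```

Under this reading, values can fall outside the 0 to 100 range when a union is shorter than the longer source or longer than the concatenation, which is one reason to treat the sketch as illustrative only.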

Informational relations analysis
Complementary to the analysis in §5.2, naturally occurring texts can include a wide variety of cross-text informational relations, as described in §3. For this reason, we analyzed the frequency of the more challenging relations necessary to generate proper sentence union. Our analysis includes a sample of 30 sentence pairs from our dataset. On average, a sample of 10 examples is expected to include 17 "paraphrastic uni-directional entailment" relations (a uni-directional entailment which differs lexically), such as "supermarket" entailing "store", or "gave interviews on NBC's today" entailing "appearance on NBC's today". As described in §3, such examples challenge a consolidation model to include only the entailing expression in the output. In addition, such a sample is expected to include 21 paraphrastic equivalence relations. These challenge the model to include only one of the equivalent expressions in the output, to avoid repetition. Overall, these statistics attest to the abundant semantic challenges posed by our dataset.

Baseline Models
We present baseline models, aiming to test neural pretrained language models for their ability to implicitly recognize relevant informational relations between input sentences and properly create their union.
Fine-tuned models. As our first type of baseline we fine-tune a large pre-trained sequence-to-sequence model using our data. To that end, we picked two strong models: T5-large (Raffel et al., 2019), which is commonly applied to end-to-end text generation tasks (Chen et al., 2020), and PRIMERA (Xiao et al., 2022), which was pretrained in a cross-document fashion (Caciularu et al., 2021) and achieves state-of-the-art results over multi-document summarization datasets. This makes this model appealing for our sentence fusion task, where the two sentences originate in different documents. See App. F for information about training details.
In-context learning. Another current baseline approach is in-context learning, in which the instructions and examples for the task are provided as input (the prompt) at inference time to very large pre-trained language models. We used GPT3 (Brown et al., 2020), specifically text-davinci-003. The instructions we initially used were similar to those given to the annotators. We then optimized the prompt by running it on the training dataset and manually identifying mistakes. The identified mistakes were added to the prompt as examples. In addition, we added to the instructions "important" notes on what the model should pay attention to. See App. E for the complete final prompt and configuration used.

Model Evaluation Protocols
We evaluate our baseline systems both through human evaluation (§7.1) and with automatic metrics (§7.2) suitable for the task, which can generally be used in the development cycles of union generation systems.

Human evaluation
The human evaluation is conducted over the predicted unions for the test set for each of the baseline models. Instead of judging the generated sentence union for each baseline system separately, the evaluation is done in a comparative fashion, following previous works where the evaluator sees together the outputs of all baseline systems (Callison-Burch et al., 2007; Novikova et al., 2018). Similar to the analysis of the dataset quality in §5, we are interested in evaluating the coverage, faithfulness, redundancy and fluency of the predicted union, this time in a manner that fits crowdsourced human evaluation. Content and redundancy are scored on a scale from 1 to 4 (higher is better), described in Table 3. This scale is inspired by the Semantic Textual Similarity human evaluation approach (Agirre et al., 2013), which also tests for information overlap. For the fluency score, we use a common Likert scale from 1 to 5 (Fabbri et al., 2021). See App. H for details and screenshots.
As there exist trade-offs between the two content measures and the redundancy measure, we add an additional measure which evaluates consolidation as a whole. For example, by arbitrarily adding more information to the union we can increase the coverage, but also risk increasing redundancies and unfaithfulness. The consolidation measure simply averages the three aforementioned measures, thus testing for overall text consolidation quality.
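As a small illustration of why an aggregate is useful, the sketch below averages the three 1 to 4 ratings for a single prediction. The function name and the range check are assumptions added for the example; the ratings in the usage lines are made up and do not come from our results.

```python
def consolidation_score(coverage: float, faithfulness: float, redundancy: float) -> float:
    """Average of the three 1-4 human ratings (higher is better)."""
    ratings = {"coverage": coverage, "faithfulness": faithfulness, "redundancy": redundancy}
    for name, value in ratings.items():
        if not 1 <= value <= 4:
            raise ValueError(f"{name} rating must be in [1, 4], got {value}")
    return sum(ratings.values()) / len(ratings)


if __name__ == "__main__":
    # Hypothetical ratings: a "padded" union gains coverage but is penalized
    # on the other two measures, so its overall score is lower.
    careful = consolidation_score(coverage=3, faithfulness=4, redundancy=4)
    padded = consolidation_score(coverage=4, faithfulness=2, redundancy=2)
    print(careful, padded)
```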

Automatic evaluation
In line with previous works in text generation, we report the ROUGE metric between the reference union and the predicted union. However, like for most generation tasks, ROUGE will unfairly penalize correct but paraphrastic sentence unions (as described in §3). To partly address this issue, we add another automated metric which tests for bi-directional textual entailment (aka NLI), comparing the reference union sentence to the predicted union sentence, requiring entailment in both directions. Specifically, we use the DeBERTa-xxlarge-v2 model (He et al., 2020), fine-tuned on the MNLI task (Williams et al., 2017) and a threshold of 0.5.
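The bi-directional entailment test can be sketched independently of any particular NLI model. In the sketch below, `entail_prob` is a hypothetical callable returning the entailment probability of a hypothesis given a premise (in practice it would wrap an NLI classifier), and `toy_prob` is a crude word-containment stand-in used only to make the example runnable. Whether the comparison against the 0.5 threshold is strict or inclusive is an assumption of this sketch.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interface: probability that `premise` entails `hypothesis`.
EntailProb = Callable[[str, str], float]


def bidirectional_entailment(reference: str, prediction: str,
                             entail_prob: EntailProb, threshold: float = 0.5) -> bool:
    """True only if entailment holds in both directions."""
    forward = entail_prob(reference, prediction) >= threshold
    backward = entail_prob(prediction, reference) >= threshold
    return forward and backward


def nli_score(pairs: Sequence[Tuple[str, str]], entail_prob: EntailProb,
              threshold: float = 0.5) -> float:
    """Fraction of (reference, prediction) pairs passing the two-way test."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    hits = sum(bidirectional_entailment(r, p, entail_prob, threshold) for r, p in pairs)
    return hits / len(pairs)


def toy_prob(premise: str, hypothesis: str) -> float:
    """Toy stand-in (assumption): 1.0 if all hypothesis words occur in the premise."""
    return 1.0 if set(hypothesis.lower().split()) <= set(premise.lower().split()) else 0.0


if __name__ == "__main__":
    pairs = [("the fire damaged the store", "the store the fire damaged"),
             ("the fire damaged the large store", "the fire damaged the store")]
    print(nli_score(pairs, toy_prob))
```

Requiring both directions is what distinguishes this test from a one-way faithfulness check: a prediction that drops information can still be entailed by the reference, but will not entail it back.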
While both metrics test for content matching, they would not penalize a model that bluntly concatenates the two input sentences. Therefore, we also report ∆CR (§5.2), calculated as the average difference between the CRs of the predicted vs. the reference union sentences (the latter is subtracted from the former), on each instance. A positive value thus indicates that the model compression rate is higher than that of the reference union, while a negative value indicates the opposite (model compresses less than the reference).
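Given per-instance compression rates that have already been computed for the predicted and reference unions, ∆CR reduces to a mean of differences. The sketch below takes those CR values as inputs and makes no assumption about how they were obtained; the function name and the numbers in the usage lines are hypothetical.

```python
from typing import Sequence


def delta_cr(predicted_crs: Sequence[float], reference_crs: Sequence[float]) -> float:
    """Mean over instances of (predicted CR - reference CR)."""
    if len(predicted_crs) != len(reference_crs) or not predicted_crs:
        raise ValueError("need two non-empty CR lists of equal length")
    diffs = [p - r for p, r in zip(predicted_crs, reference_crs)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    # Hypothetical CR values: a system that merely concatenates its inputs has
    # a low CR on every instance, so its delta-CR is strongly negative.
    reference = [60.0, 70.0, 50.0]
    concatenating_system = [0.0, 0.0, 0.0]
    print(delta_cr(concatenating_system, reference))
```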

Human evaluation of the models
Results are presented in Table 4, and example generations with their respective scores are provided in App. I. The trade-off mentioned in §7.1 between increasing coverage while still remaining faithful and without redundancies is evident in the results of T5-large and GPT3. PRIMERA comes out as a slightly better model, as it achieves the highest consolidation score, with yet a lot of room for improvement.
To get a better sense of the absolute performance of the union sentences generated by the baseline models, we compare them to two naive models which output: (1) the concatenation of the source sentences (no avoidance of redundancy), and (2) the longer sentence (no attempt to consolidate and cover information from the other sentence). Based on evaluation of 50 examples completed by the authors, we report an average redundancy score of 1.6 ±.1 for the concatenation and an average coverage score of 2.3 ±.1 for the longer sentence. As reported below, all our baseline models outperform these naive models by a large margin.
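The two naive models are simple enough to state in a few lines. In this sketch the function names are hypothetical, and measuring "longer" by whitespace token count (rather than, say, characters) is an assumption.

```python
def concatenation_baseline(s1: str, s2: str) -> str:
    """Naive model (1): output both sentences, with no redundancy removal."""
    return f"{s1} {s2}"


def longer_sentence_baseline(s1: str, s2: str) -> str:
    """Naive model (2): output the longer sentence, ignoring the other one.

    Length is measured in whitespace tokens (assumption); ties return s1.
    """
    return max((s1, s2), key=lambda s: len(s.split()))


if __name__ == "__main__":
    s1 = ("The fire has destroyed a large section of the store and fire crews "
          "and investigators are still on the scene.")
    s2 = "A FIRE has badly damaged the Waitrose supermarket in Wellington's High Street."
    print(concatenation_baseline(s1, s2))
    print(longer_sentence_baseline(s1, s2))
```

On the Figure 1 pair, the first baseline repeats the fire event twice, while the second never mentions the Waitrose supermarket, which illustrates the redundancy and coverage failures the two baselines are meant to expose.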
Further, we draw a plot (Fig. 5) of the minimal system score amongst the three component measures that the consolidation measure combines. We note that even for the best model, PRIMERA, only 29.7% of the predictions are fully correct with respect to content and redundancy, another 40.6% of the examples include minor errors, and 26% of the examples contain substantial errors in at least one of the measures, indicating the limitations of current models.

Automatic evaluation of the models
While automatic metrics are clearly less reliable than human metrics, they can be useful for development cycles. The automatic metric results are also reported in Table 4, showing that the ROUGE1 score is highest for PRIMERA, while the NLI score is highest for GPT3. The ∆CR scores roughly correlate with the combination of coverage and redundancy detected in the human evaluation, where both lower coverage (undesired) and lower redundancy (desired) increase compression rate.
To identify the potential utility of our automatic metrics, we follow the standard practice (Fabbri et al., 2021) and calculate a Kendall τ coefficient (McLeod, 2005) between the human and automatic evaluation results. Our results show that ROUGE-1 is the metric most highly correlated with the consolidation measure (τ = 0.38, p < 0.05). Overall, these automatic metrics can be used in tandem to provide certain feedback during model development cycles.
Table 4: Human (left) and automatic (right) evaluation results of system generated unions over the complete test set. All scores are averages, along with their standard error (standard error for manual evaluation results was always smaller than 0.01, and is therefore omitted from the table).
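A rank correlation between human and automatic scores can be computed along the following lines. This is a minimal pure-Python sketch of the tie-corrected Kendall τ-b; the text does not state which τ variant was used, so that choice is an assumption, and the significance test (the reported p-value) is not reproduced here.

```python
from itertools import combinations
from math import sqrt
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-b between two equally long score lists (handles ties)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both rankings: excluded from both denominator terms
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied_in_x = concordant + discordant + tied_y_only
    untied_in_y = concordant + discordant + tied_x_only
    denominator = sqrt(untied_in_x * untied_in_y)
    if denominator == 0:
        raise ValueError("tau-b is undefined when one ranking is constant")
    return (concordant - discordant) / denominator
```

Here `x` would hold the human consolidation scores and `y` the values of one automatic metric for the same predictions. The tie correction matters because ordinal human ratings contain many ties.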

Error analysis
To shed light on the various errors made by the baseline models, we examined 20 erroneous examples identified in the human evaluation, with each example consisting of three predictions, one from each of the baseline systems. Our findings indicate that the most frequent causes of model errors are related to the complexity of informational relationships present in the source sentences, with uni-directional entailment being the most common. Moreover, the models seem to face difficulties in accurately combining related information, which often results in incorrect merging of information with the wrong entity or predicate. Further details on the analysis can be found in Appendix J.

Conclusions
In this paper, we advocate for using the sentence union task as a testbed for multi-text consolidation.
We release a realistic dataset, together with a set of analyses that show that the dataset is of high quality, and challenging for multi-document consolidation efforts. We evaluate the performance of state-of-the-art pretrained large language models on text consolidation, where our findings suggest key challenges for future research. Future research may expand upon our dataset to include consolidation beyond 2 input sentences, and may examine the use of explicit text consolidation structures for improving multi-text consolidation in large language models.

Limitations
We enumerate some limitations to our work. While we did create the largest union dataset to date, it is still of moderate size. As shown by our learning curves (App. G), the amount of training data we created seemed sufficient to saturate the learning of the models with which we experimented, but it might still be found insufficient for training other models.
Our annotation protocol might have influenced the compression rates of the unions, as we instructed workers to annotate sentence unions by first choosing a base sentence and then highlighting the other sentence. Additionally, while the highlighting facilitates the annotation process, it cannot directly be used for analyses of the dataset since it is uni-directional.
The dataset includes only input with exactly two sentences and it might be desirable for future works to also be able to train systems that take more than two sentences as input. Our dataset is also domain specific, in that all the sentences are taken from news sources. This might result in challenging cross-domain generalization.
This dataset is limited to the English language. While the suggested annotation protocol seemingly fits other languages, the step in which words are highlighted might prove problematic for morphologically rich languages, in which a single word includes many pieces of information. A segmentation of the text before annotation might be required.

Ethics Statement
Crowdsourcing. To crowdsource the dataset, we used the Amazon Mechanical Turk (MTurk) platform. To participate in the first stage of recruitment, workers were required to possess the following MTurk qualifications:
• NumberHITsApproved greater than 10000
• PercentAssignmentsApproved greater than 98%
• WorkerLocale in US, CA, AU, GB, NZ
Workers were paid $0.3 for each sentence union annotation assignment, as well as a $1.25 bonus for every 100 assignments, and $0.4 for each evaluation assignment, as well as a $1 bonus for every 50 assignments. Overall, based on an average estimate of 1.8 minutes for the first assignment and 2.4 minutes for the second assignment, the wage is expected to start from $10 per hour and to increase as workers become more familiar with the task and start receiving bonuses. Workers were informed that the ratings they provide would be used to evaluate artificial intelligence models which were trained on the data they annotated.
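The hourly wage estimate can be checked with a short calculation. The sketch below assumes that the "first assignment" is the union annotation and the "second assignment" is the evaluation, and that bonuses are spread evenly over the assignments that earn them; the function name is hypothetical.

```python
def hourly_wage(pay_per_assignment: float,
                minutes_per_assignment: float,
                bonus: float = 0.0,
                assignments_per_bonus: int = 1) -> float:
    """Expected hourly wage in dollars for a steady stream of assignments."""
    assignments_per_hour = 60.0 / minutes_per_assignment
    # Assumption: the bonus is amortized evenly over the assignments earning it.
    bonus_per_assignment = bonus / assignments_per_bonus
    return (pay_per_assignment + bonus_per_assignment) * assignments_per_hour


# Base pay only: roughly $10 per hour for both assignment types.
annotation_base = hourly_wage(0.3, 1.8)
evaluation_base = hourly_wage(0.4, 2.4)

# With bonuses: roughly $10.42 (annotation) and $10.50 (evaluation) per hour.
annotation_with_bonus = hourly_wage(0.3, 1.8, bonus=1.25, assignments_per_bonus=100)
evaluation_with_bonus = hourly_wage(0.4, 2.4, bonus=1.0, assignments_per_bonus=50)
```

The base figures match the stated starting wage of $10 per hour; the wage rises further if workers become faster than the average completion times.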
Dataset. The texts that workers write that are included in our dataset are limited to the information expressed in the source sentences. The source sentences originate from the datasets mentioned in §4.1, which include only texts available in public news sources and were previously made available by Weiss et al. (2021). Our dataset does not contain information that would make it possible to reconstruct the original documents, or any human annotations, such as the summary or coreference resolution annotation, from the original datasets.

A Skip Guidelines
In Section 4.2, it was noted that there are cases where generating a union from a pair of sentences is not suitable, and workers were given the option to skip the annotation for such examples. This section outlines the specific scenarios in which workers were directed to skip examples. Eventually, our annotators skipped 458 sentence pairs from the original dataset that we used as input, as shown in Table 1. An analysis of a sample of 30 such cases is presented in Table 5, categorized based on the criteria below. In conclusion, we found that the dataset we used as the source of our sentence pair instances, which was originally developed by Weiss et al. (2021) for aligning predicate-argument structures (represented as question-answer pairs), includes a significant number of instances where information consolidation in the form of sentence union is mostly irrelevant.

Category | Count
No information consolidation | 19
Unnatural union | 7
Mistake | 3
Missing context | 1

Table 5: An analysis of 30 cases that were skipped by workers during the annotation process. Among these, some were categorized as mistakes, meaning that they should not have been skipped.
No information consolidation. One case in which workers were directed to skip examples during annotation is when there is no partially overlapping information to consolidate from two related sentences, hence their union would simply be a concatenation of the two. This case is referred to as "No information consolidation". An example of this scenario is when sentence 1 mentions that "Acupuncture is the ancient Chinese medical therapy technique of inserting thin, sharpened needles into specific nerve junction points of the body," and sentence 2 mentions a study that found "53.8 percent of the subjects who had needles inserted in four acupuncture "zones" in the ear five times a week tested free of cocaine at the end of the eight-week study period." In this case, there is no need to consolidate the information from the two sentences as they provide distinct pieces of information. Sentence 1 explains what acupuncture is, while sentence 2 discusses a study about it.
Unnatural union. An "Unnatural union" scenario arises when unifying two input sentences would form an awkward or unnatural sentence. For instance, if the first sentence is written in the past tense and the second one in the future tense, unifying them could lead to an unnatural sentence union. As an example, consider the following sentences: "Fannie Mae's board met Sunday night to discuss Raines' future" and "The directors of Fannie Mae, the big mortgage finance company, will meet Sunday to consider the fate of two senior executives who signed off on financial statements that violated accounting rules, people close to the company said Friday." Here, the first sentence uses the past tense while the second sentence uses the future tense. It would be more natural to use the past tense in the sentence union since the event occurred in the past. However, incorporating the information that someone said something on Friday before the event could result in an awkward sentence union.
Missing context. This case happens when two sentences need to be interpreted in the broader text context, which is missing in our annotation scenario, for example when there is a dangling reference to an entity that is not specified in the given sentence. This is often not problematic, unless understanding the identity of the entity is necessary to create the union. For instance, one sentence quotes a person, while the other sentence does not mention the speaker. An example of this scenario is the following: "Sadly, because Magic Leap seldom hires and does not actively recruit female candidates, the company loses competitive advantage to products like Microsoft's Hololens." and "When Tannen Campbell was hired by Magic Leap in 2015, the Florida company had no women in leadership roles and its only idea to make its product female-friendly was to release a pink version, according to Forbes." Merging these two sentences is not straightforward due to the lack of context.
Disagreements. Sometimes, there are two statements that contradict or disagree with one another. For example, sentence 1 is "Video of Brooklyn Mother of 13 Zurana Horton shot and killed in a gang shooting was revealed Thursday." and sentence 2 is "A shocking video released for the first time Thursday captures the moment a Brooklyn mother of 12 was killed in a gang shootout as she picked her daughter up from school." Sentence 1 describes the woman as a mother of 13, while sentence 2 describes her as a mother of 12.

B Subtle annotation Cases
In Section 4.2 we noted that certain special cases arose when generating a union from a pair of sentences, and were included in the instructions for annotators. This section outlines the specific instructions provided to workers, with an analysis of 50 cases (Table 6), categorized based on various criteria as described below.
Attribution. One potential issue is when the source sentences make attributions to a specific source, such as a news agency. An example of this can be seen in sentence 1 "Video of Brooklyn Mother Zurana Horton being shot and killed was revealed Thursday, according to the N.Y. Daily News." and sentence 2 "A shocking video released for the first time Thursday captures the moment a Brooklyn mother was killed as she picked her daughter up from school.", where the new information in sentence 2 is attributed to the video content, rather than to the N.Y. Daily News. Another example is when a sentence contains quotes, as changing a quote to contain more information would create an unfaithful sentence union. In such cases, the workers were allowed, whenever it seemed reasonable, to attribute combined pieces of information originating from the two sentences to a reported source, even if only parts of the combined information were explicitly attributed to this source in one of the sentences.
Relative dates. Some sentences may mention a specific time relative to when the sentence was written, such as "yesterday" or "Monday", which implies that the sentence was written in the same week as the event. Workers were instructed to assume that the date of publication is known, so there is no difference between the mention of "yesterday" and "Monday", but, for example, that "yesterday" is more specific than "earlier this month".
World knowledge. In some cases, sentences may mention the same piece of information in different levels of specificity, which requires world knowledge to identify. Workers were instructed to assume common world knowledge when creating the sentence union. An example is given for Paris, which is both a city in Texas and the capital of France.
Before and after an event. For sentences referring to events, some may differ in their time of publication compared to the event itself. Workers were instructed to use the past tense, as the sentence union is written after the event. For example, sentence 1 mentions an event that has already happened, "After leaving Alderson at 12:30 a.m. on March 3, 2005, Martha Steward declared the 5-month experience as "life altering and life affirming."", while sentence 2 was written before the event, "US lifestyle guru Martha Stewart is expected to leave jail on Friday after a five-month sentence for a stock scandal that reinvigorated her career rather than dooming it.". In this case, the sentence union should be written in the past tense, as it refers to an event that has already occurred.

C Annotation Process
Screenshots of the entire annotation process are depicted in Figure 6. Guidelines for creating sentence unions include writing one coherent sentence and ordering the information in a stand-alone manner (as if the sentence had been written from scratch), meaning that the writing process should not be distracted by the original split and ordering of information in the two input sentences. To the extent possible, the sentence union should preserve the original wording of the information, but phrasing may be minimally adjusted to create a coherent sentence union. Each piece of information should appear only once in the sentence union. When there is a redundancy across the two sentences, the more specific phrasing should be chosen.
The interface helps the workers to avoid making common mistakes. For example, in order to reduce redundancies of information in the union, if a highlighted word already exists in the base sentence, both word mentions will be marked to draw the worker's attention. Another example is warning the worker when the sentence union contains non-highlighted words from the base sentence. Also, when integrating highlighted words into the sentence union, the worker will see yellow highlights turn into green highlights. If the worker tries to submit the annotation with yellow highlights, the system will raise an alert.
To ensure the quality of annotators' judgements, our process follows the controlled crowdsourcing approach (Roit et al., 2020), which includes a recruitment phase, two training phases accompanied by extensive guidelines, and ongoing monitoring during the annotation of the production task. Workers were allowed to participate in primary tasks only if they had completed the entire process. Only workers who performed well on the recruitment phase were accepted to the next training phases. The training phases were created manually, including subtle annotation cases. After each annotation, workers were shown gold target highlights and sentence unions for comparison with their own output.
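Two of the interface checks described above can be approximated with simple word matching. The sketch below is only an illustration of the idea, not the actual interface logic: the function names are hypothetical, matching is case-insensitive on word tokens, and repeated occurrences of the same word are not distinguished.

```python
import re
from typing import List


def _words(text: str) -> List[str]:
    """Lowercased word tokens (a simplifying assumption about tokenization)."""
    return re.findall(r"\w+", text.lower())


def repeated_highlights(base_sentence: str, highlighted_words: List[str]) -> List[str]:
    """Highlighted words that already occur in the base sentence.

    These would be marked in both places to draw the worker's attention.
    """
    base = set(_words(base_sentence))
    return [w for w in highlighted_words if w.lower() in base]


def unintegrated_highlights(highlighted_words: List[str], union: str) -> List[str]:
    """Highlighted words still missing from the union (still 'yellow').

    A non-empty result would trigger the alert on submission.
    """
    union_words = set(_words(union))
    return [w for w in highlighted_words if w.lower() not in union_words]
```

For instance, with the base sentence "The fire destroyed a large section of the store" and the highlights "fire", "crews" and "investigators", the first check would flag "fire", and a union that omits "investigators" would be blocked by the second.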

D Cleaning Annotations
Disjoint sentences. Following the skip guidelines (see App. A), we automatically identified examples whose sentences are mutually exclusive and whose sentence union is a concatenation of the source sentences. We find these instances by comparing content words only, since connecting the two sentences sometimes involves non-semantic lexical changes (e.g., adding a semicolon or a comma). Since there is no consolidation of information in such examples, we consider them unfit for a union, as mentioned in §4.1, and they were not included in the dataset. We leave the automatic categorization of sentences into whether or not they are suitable for sentence unions to future work.
Quotes. Following the attribution discussion in App. B, we manually reviewed examples where the union contained a quote that was not in any of the source sentences, as well as any example that had a sentence which used a first-person perspective (e.g., "I", "we", "mine", "ours", ...).
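A content-word comparison of this kind might look as follows. This is a sketch under stated assumptions rather than the filter actually used: the stopword list is a small illustrative one, "mutually exclusive" is read as "sharing no content words", and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import List

# Illustrative function-word list (an assumption; any stopword list could be used).
STOPWORDS = frozenset({
    "a", "an", "the", "and", "or", "but", "is", "are", "was", "were",
    "of", "in", "on", "at", "to", "for", "with", "by", "that", "which",
})


def content_words(text: str) -> List[str]:
    """Lowercased tokens with punctuation and function words removed."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOPWORDS]


def is_plain_concatenation(sentence1: str, sentence2: str, union: str) -> bool:
    """True if the sources share no content words and the union adds or drops none."""
    words1, words2 = content_words(sentence1), content_words(sentence2)
    if set(words1) & set(words2):
        return False  # some overlap exists, so there is something to consolidate
    return Counter(content_words(union)) == Counter(words1 + words2)
```

Ignoring punctuation and function words keeps the check robust to connectives such as an added "and" or semicolon between the two sentences.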
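Candidates for this manual review could be flagged automatically, for example as sketched below. The helper names and the pronoun list are illustrative assumptions; the sketch only handles straight double quotes and will over-flag some cases (such as "US" lowercased to "us"), which is acceptable when the output is reviewed by hand.

```python
import re
from typing import Iterable, List

QUOTE_PATTERN = re.compile(r'"([^"]+)"')
# Illustrative first-person word list (an assumption, extending the examples above).
FIRST_PERSON = frozenset({"i", "we", "me", "us", "my", "our", "mine", "ours"})


def new_quotes(source_sentences: Iterable[str], union: str) -> List[str]:
    """Quoted spans in the union that appear verbatim in none of the sources."""
    sources = list(source_sentences)
    return [q for q in QUOTE_PATTERN.findall(union)
            if not any(q in s for s in sources)]


def uses_first_person(sentence: str) -> bool:
    """True if the sentence contains a first-person pronoun."""
    return any(w in FIRST_PERSON for w in re.findall(r"[a-z]+", sentence.lower()))
```

An example would be sent to review if `new_quotes` returns a non-empty list or if `uses_first_person` holds for either source sentence or the union.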

E In-Context Learning
For the in-context learning approach, we used a temperature value of 0.4 and the following prompt: In this task, you will be presented with two sentences that overlap in information, and you are tasked to merge the information of the two into a single unifying sentence without redundancies. Important: Do not omit information. Important: Do not repeat information.
Here is an example of a correct union and a wrong union: Sentence 1: The February assassination of former Lebanon Prime Minister Hariri put Syria under renewed pressure from the international community to abide by U.N. Security Council Resolution 1559 and withdraw its troops from Lebanon. Sentence 2: Foreign ministers from all The union is wrong, because it does not mention that foreign ministers gathered for a meeting on Wednesday.
Please generate a correct union to the following sentences: Sentence 1: <sentence 1 goes here> Sentence 2: <sentence 2 goes here> Correct union:
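Assembling this prompt for a new sentence pair can be sketched as below. The function name, the line breaks between prompt parts, and passing the demonstration as a caller-supplied string are assumptions made for illustration; no particular model API is implied.

```python
TEMPERATURE = 0.4  # sampling temperature reported above

INSTRUCTIONS = (
    "In this task, you will be presented with two sentences that overlap in "
    "information, and you are tasked to merge the information of the two into "
    "a single unifying sentence without redundancies. "
    "Important: Do not omit information. Important: Do not repeat information."
)


def build_union_prompt(sentence1: str, sentence2: str, demonstration: str) -> str:
    """Fill the prompt template for one sentence pair.

    `demonstration` holds the worked example (a correct and a wrong union
    with its explanation); it is supplied by the caller.
    """
    parts = [
        INSTRUCTIONS,
        "Here is an example of a correct union and a wrong union:",
        demonstration.strip(),
        "Please generate a correct union to the following sentences:",
        f"Sentence 1: {sentence1}",
        f"Sentence 2: {sentence2}",
        "Correct union:",
    ]
    # Assumption: parts are separated by newlines; the original layout is not shown.
    return "\n".join(parts)
```

The prompt ends with "Correct union:" so that the model's continuation is the union itself.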

F Training Details
We fine-tuned the T5-large and PRIMERA models for 20 epochs on a Tesla V100-SXM2-32GB GPU. We used a random search strategy for hyperparameter tuning. The learning rate was tuned within the range [1e-8, 5e-5], while the batch size was chosen from {8, 16, 32}. We also explored weight decay in the range [0, 0.5] and warmup steps in the range [0, 300]. The best model was selected based on the ROUGE-1 metric. The best T5 model was obtained with a learning rate of 4.3e-6, no weight decay, no warmup steps, and a batch size of 32, after 18 epochs. For the best-performing PRIMERA model, we used a learning rate of 3.5e-6, weight decay of 0.5, 80 warmup steps, and a batch size of 16, and selected the best checkpoint after 9 epochs. The training time for the T5-large and PRIMERA models was approximately 1 hour and 10 minutes each.
Input structure. When concatenating the two source sentences to insert as input for the model, we add special separator tokens to make the model aware of the sentence boundaries. For T5-large, we separated the source sentences in the input using a newly created special token, while for PRIMERA, we used the <doc-sep> token, which was used in the pre-training phase to separate input source documents.
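A random search over the ranges listed above could be organized as in the following sketch. The sampling distributions are assumptions (log-uniform for the learning rate, uniform elsewhere), since the text gives only the ranges, and `evaluate` is a hypothetical callback that would fine-tune a model with the given configuration and return its development-set ROUGE-1 score.

```python
import math
import random
from typing import Callable, Dict, Optional, Tuple

Config = Dict[str, float]


def sample_config(rng: random.Random) -> Config:
    """Draw one hyperparameter configuration from the search ranges."""
    # Assumption: learning rate is sampled log-uniformly within [1e-8, 5e-5].
    log_lr = rng.uniform(math.log(1e-8), math.log(5e-5))
    return {
        "learning_rate": math.exp(log_lr),
        "batch_size": rng.choice([8, 16, 32]),
        "weight_decay": rng.uniform(0.0, 0.5),
        "warmup_steps": rng.randint(0, 300),  # inclusive on both ends
    }


def random_search(evaluate: Callable[[Config], float],
                  num_trials: int,
                  seed: int = 0) -> Tuple[Optional[Config], float]:
    """Return the sampled configuration with the highest score."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(num_trials):
        config = sample_config(rng)
        score = evaluate(config)  # e.g. dev-set ROUGE-1 after fine-tuning
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

Sampling the learning rate on a log scale is a common choice when the range spans several orders of magnitude, as it does here.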
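The input construction can be illustrated with a small helper. Only the <doc-sep> token is taken from the text; the T5 separator string `<sent-sep>` is a hypothetical placeholder for the newly created token, and the spaces around the separator are likewise an assumption.

```python
PRIMERA_SEPARATOR = "<doc-sep>"  # token used by PRIMERA during pre-training
T5_SEPARATOR = "<sent-sep>"      # hypothetical name for the newly created T5 token


def build_model_input(sentence1: str, sentence2: str, separator: str) -> str:
    """Concatenate the two source sentences around a separator token."""
    # Assumption: the separator is surrounded by single spaces.
    return f"{sentence1.strip()} {separator} {sentence2.strip()}"


primera_input = build_model_input("First sentence.", "Second sentence.", PRIMERA_SEPARATOR)
t5_input = build_model_input("First sentence.", "Second sentence.", T5_SEPARATOR)
```

A newly created token also has to be registered with the model's tokenizer and given an embedding before fine-tuning; that step depends on the training framework and is not shown here.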

G Learning Curves
To assess the adequacy of our dataset size, we evaluated the baseline models on different subsets of our training data ([25%, 50%, 75%, 100%]) and various model sizes (T5-base and T5-large). Based on our findings (Figure 7), it appears that increasing the model size from T5-base to T5-large results in performance improvement. However, the marginal benefit of increasing training dataset size may be limited, and further gains may not be significant.

H Evaluation Process
As explained in Section 7, the evaluation process involves a comparative approach, whereby the unions generated by all systems are evaluated simultaneously, as shown in Figure 8. The evaluation is conducted separately for four criteria. To assess the content differences between the reference union and the system union, including coverage and faithfulness, a single sentence is designated as the base sentence, and the worker is asked to evaluate the other sentence based on the amount of missing content. The reference union serves as the base sentence for evaluating coverage, while the system union is used as the base sentence for evaluating faithfulness, since any information present in the system union but absent in the reference union is deemed unfaithful. In evaluating redundancy and fluency, the evaluator is only presented with the system union without the reference union.
To assess the coverage and faithfulness criteria, the workers are required to compare the generated union with the reference union, aided by red strikethroughs on words that are not included in the generated union and green highlights on words that are not included in the reference union, as illustrated in Figures 8a and 8b. For redundancy and fluency criteria, the reference union is not needed, as demonstrated in Figures 8c and 8d.
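The word-level markup shown to evaluators can be approximated with a set comparison, as in the sketch below. This is an illustration only: the function name is hypothetical, and case-insensitive matching on word tokens is an assumption about how the interface aligns the two unions.

```python
import re
from typing import List, Tuple


def _words(text: str) -> List[str]:
    """Lowercased word tokens (a simplifying assumption about tokenization)."""
    return re.findall(r"\w+", text.lower())


def diff_for_display(reference_union: str, system_union: str) -> Tuple[List[str], List[str]]:
    """Return (missing, added) word lists for the evaluation interface.

    missing: reference words absent from the system union (red strikethrough).
    added:   system words absent from the reference union (green highlight).
    """
    reference_words = _words(reference_union)
    system_words = _words(system_union)
    reference_set, system_set = set(reference_words), set(system_words)
    missing = [w for w in reference_words if w not in system_set]
    added = [w for w in system_words if w not in reference_set]
    return missing, added
```

The `missing` list supports the coverage judgement and the `added` list supports the faithfulness judgement; the markup is only an aid, since paraphrases can differ in wording without differing in content.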

I Example Sentence Unions
See Table 7 for examples of sentence unions, including the unions predicted by each system.

J Error Analysis
In order to perform an error analysis, we analyzed 20 examples that were rated less than perfect for all metrics based on the human evaluation (see §8.1). The findings are presented in Table 8, with one representative example from each subcategory included in Table 9. Our key observation is that models make various coverage errors as they fail to identify the uni-directional entailment correctly in the dataset. Furthermore, models make multiple coverage and faithfulness errors by incorrectly combining information and attaching it to the wrong entity or predicate.

Sentence 1: French museum officials traveled to New York last month and confirmed the find is indeed the missing Picasso work, which the Centre Georges Pompidou realized was missing from its storerooms in 2001 following a loan request; it was then valued at more than $2.5 million.
Sentence 2: The canvas had been smuggled out of a storeroom of the Centre Georges Pompidou, the Paris museum and arts center, and its whereabouts had not been known.
Gold union: French museum officials traveled to New last month and confirmed the find is indeed the missing Picasso canvas smuggled out of a storeroom of the Centre Georges Pompidou, the Paris museum and arts center, which realized it was missing in 2001 following a loan request; it was then valued at more than $2.5 million.
T5-large: French museum officials traveled to New York last month and confirmed the find is indeed the missing Picasso work, which the Centre Georges Pompidou realized was missing from its storerooms in 2001 following a loan request; it was then valued at more than $2.5 million, and its whereabouts had not been
Table 7: Examples of predicted union sentences from each baseline system and their corresponding human evaluation.

Subcategory | Coverage | Faithfulness | Repetition | Explanation
Uni-directional entailment | 17 | 2 | 5 | This includes cases where either the entailing part is missing and the entailed part is present in the sentence or both the entailing and entailed parts are present in the sentence.
Wrong attachment | 13 | 13 | 1 | This includes cases where an argument is attributed to the wrong predicate or entity.
Lexically similar but different information | 8 | 0 | 0 | This includes cases where information is omitted, and the omitted information had a phrase that was lexically similar to a phrase in the other sentence.
Ignores prefix | 4 | 0 | 0 | This includes cases where the prefix to the sentence in the source is omitted from the union.
Related new information | 2 | 0 | 0 | This includes cases where the source sentences contain related but different information, and one of them is not included in the union.
 | | | | This includes cases where paraphrased information from the source is repeated in the union.
External hallucination | 0 | 3 | 0 | This includes cases where there is information in the union that does not originate from the source sentences.

Table 8: Error analysis based on a sample of 20 erroneous examples, each example analyzed for the 3 system outputs. For each metric, we report the frequency of a subcategory that we suspect is the cause for the error. One representative example from each subcategory is included in Table 9.

Figure 3: A screenshot of the sentence union text generation annotation interface. The screenshot shows the last step, where the worker has already chosen sentence 1 as the base sentence [1], highlighted the new or more specific information in sentence 2 [2] and written the final sentence union ("Merged sentence") [3].

Figure 4: Compression Rate (CR) vs. the frequency of each CR bin, for the train/dev/test dataset splits.

Figure 5: A histogram of minimal system scores, testing for coverage, faithfulness or redundancy mistakes.

Figure 6: The interface used for the annotation process.

Figure 7: An evaluation of T5 models on different subsets of our training data [25%, 50%, 75%, 100%], as well as different model sizes (T5-base and T5-large). The number of parameters is indicated for each model.

Figure 8: The interface used for the evaluation of a predicted sentence union's quality.

Table 2: Evaluation of union quality.

Table 3: The ordinal scales used for the content (coverage & faithfulness) and redundancy measures.

Table 6: Distribution of subtle annotation cases in a sample of 50 instances (some instances belong to more than one category).