Pre-training Language Models for Comparative Reasoning

Comparative reasoning is the process of comparing objects, concepts, or entities to draw conclusions, and it constitutes a fundamental cognitive ability. In this paper, we propose a novel framework to pre-train language models to enhance their comparative reasoning abilities over texts. While there have been approaches for NLP tasks that require comparative reasoning, they suffer from costly manual data labeling and limited generalizability to different tasks. Our approach introduces a novel method of collecting scalable data for text-based entity comparison, which leverages both structured and unstructured data. Moreover, we present a framework for pre-training language models via three novel objectives on comparative reasoning. Evaluation on downstream tasks including comparative question answering, question generation, and summarization shows that our pre-training framework significantly improves the comparative reasoning abilities of language models, especially under low-resource conditions. This work also releases the first integrated benchmark for comparative reasoning.


Introduction
Comparative reasoning constitutes a fundamental cognitive ability that plays a crucial role in human decision-making processes. It involves comparing and contrasting various objects, concepts, or entities to draw conclusions or make informed decisions. For example, consumers often compare products based on features such as price, quality, and user reviews before making a purchase decision. Similarly, policymakers weigh the advantages and disadvantages of different policy proposals to address pressing issues. In the context of textual documents, comparative reasoning is crucial for tasks such as identifying differences between research papers, contrasting news articles from different sources, or synthesizing the arguments of opposing viewpoints in a debate.
In recent years, there have been studies developing natural language processing (NLP) models capable of mining (Jindal and Liu, 2006; Li et al., 2011), understanding (Bondarenko et al., 2022; Bista et al., 2019), and generating (Iso et al., 2022; Beloucif et al., 2022) comparative content over texts. Yet, a substantial barrier persists: these models often require labor-intensive manual data labeling, rendering them costly and infeasible for large-scale applications. Moreover, these models are designed for a particular task, which limits their generalizability to different or emerging tasks. Meanwhile, pre-trained language models (PLMs) such as BART (Lewis et al., 2020) and T5 (Raffel et al., 2020) exhibit good generalizability on several NLP tasks. However, existing pre-training methods (e.g., masked language modeling and span infilling) do not grant language models strong comparative reasoning abilities, especially in few-shot and zero-shot settings (see results in Table 4).
In response to these challenges, in this paper, we present a novel pre-training framework to enhance the comparative reasoning abilities of PLMs, specifically by capturing comparable information within paired documents more effectively. Our approach centers on a scalable, labor-free data collection method that can gather a wealth of facts for entity comparison by combining structured (e.g., Wikidata) and unstructured data (e.g., news sources and Wikipedia). We represent these facts as quintuples, which consist of a pair of entities and the corresponding values of their shared property. To enable pre-training in a text-to-text manner, we convert the quintuples into textual components such as question-answer pairs and brief summaries of the two entities. We further design three pre-training tasks, namely generating synthetic comparative answers, questions, and summaries, given the documents of two entities as contexts. Subsequently, we unify the pre-training tasks via multi-task learning. To the best of our knowledge, we are the first to pre-train language models for comparative reasoning.
We assess the effectiveness of our approach by benchmarking comparative reasoning on a suite of tasks, including comparative question answering, comparative question generation, and comparative summarization. Our experimental results demonstrate a notable improvement in the performance of conventional PLMs, including BART and T5, under limited-resource scenarios.
Our contributions are three-fold:
• We propose a scalable method for synthesizing data for entity comparison, leveraging both structured and unstructured data sources.
• We present a novel framework for pre-training PLMs to enhance their comparative reasoning abilities on multiple related tasks.
• We provide the first benchmark for entity comparison over texts, serving as a foundation for future research in this domain.
2 Related Work

Comparative Reasoning Tasks
Early studies focused on mining explicit comparative information from massive corpora, such as identifying comparative sentences (Jindal and Liu, 2006), mining comparable entities (Li et al., 2011), and classifying components of comparison (Beloucif et al., 2022). Recent work has focused more on natural language generation tasks, such as generating arguments to answer comparative questions (Chekalina et al., 2021), generating comparable questions for news articles (Beloucif et al., 2022), and summarizing comparative opinions (Lerman and McDonald, 2009; Iso et al., 2022). However, the existing techniques were designed for specific tasks, and they are limited by the scarcity of supervised data, which poses a challenge due to the labor-intensive nature of data collection.

Language Models Pre-training
The combination of structured and unstructured data in language model pre-training has garnered considerable attention in recent research. Early work proposed to fuse KG information and text by encoding the graph structure and taking entity embeddings as part of the input, such as ERNIE (Zhang et al., 2019), K-Adapter (Wang et al., 2020), KEPLER (Wang et al., 2021), K-BERT (Liu et al., 2020), and JAKET (Yu et al., 2022). Another branch of work proposed to integrate entity information (Xiong et al., 2020; Qin et al., 2021; Zhang et al., 2022) or relation information (Qin et al., 2021; Hu et al., 2021) without modifying the language model's structure. Notably, RGPT-QA (Hu et al., 2021) and follow-up work (Hu et al., 2022) brought structured knowledge into generative LMs by integrating graph-based knowledge-augmented modules. However, these methods require a knowledge graph as part of their inputs and process it with graph neural network-based modules.

3 Pre-training Framework
Our framework explicitly teaches language models about comparative reasoning at the pre-training stage. Specifically, they are given a pair of documents, each describing an entity, and are trained to generate a piece of text pertaining to the comparison between these two entities. Regarding the types of output texts, we design three sequence-to-sequence pre-training tasks that require the model to simultaneously attend to both documents and extract information for pairwise comparison. The pre-training tasks include comparative answer generation (§3.3.1), question-answer generation (§3.3.2), and summarization (§3.3.3).
To enable large-scale pre-training, we utilize structured data (e.g., Wikidata) and unstructured corpora (e.g., Gigawords, CC-News, and Wikipedia) to obtain quintuples, a novel structural unit for entity comparison. Then we convert the quintuples into textual components for the sequence-to-sequence pre-training.

Notations
Given a (head) entity e, we denote a Wikidata statement as (e, p, v), where p is a property or relation, and v is a value or (tail) entity. Given entities e_1 and e_2, we obtain their Wikidata statements. From the two sets of statements, we first extract quintuples for entity comparison. A quintuple is represented as (e_1, e_2, p, v_1, v_2), where p is a common property of e_1 and e_2, and v_1 and v_2 are the corresponding values in the Wikidata statements. Such quintuples enable comparison on shared properties, reflecting the similarity or difference between the corresponding property values or tail entities.
Then, we convert quintuples into textual forms, such as question-answer pairs, denoted by (Q, A), and summaries, denoted by S. Additionally, we extract text descriptions D_1 and D_2 from Wikipedia for e_1 and e_2, respectively.
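The notation above can be summarized in a small data structure. The following sketch is illustrative only: the class and field names are our own, not identifiers from the paper's released code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Quintuple:
    """A comparison fact (e_1, e_2, p, v_1, v_2): two entities, a shared
    property p, and each entity's value for that property."""
    e1: str
    e2: str
    p: str
    v1: str
    v2: str

    def is_same(self) -> bool:
        # Whether the two entities agree on the shared property.
        return self.v1 == self.v2

# Example: comparing two people on a shared property.
q = Quintuple("Diablo Cody", "Diane Paulus", "occupation",
              "screenwriter", "director")
print(q.is_same())  # False: their values for the property differ
```

Whether `is_same()` holds determines which QA templates apply during textualization (see Table 2).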

Data sources
Our approach uses large-scale, freely accessible data, as needed in most pre-training frameworks. Unlike existing frameworks that mainly use text corpora, we design novel pre-training tasks using both unstructured and structured data sources, because the structured data help define the tasks for comparative reasoning.
Wikidata is a collaborative knowledge base that stores data in a structured format. It contains a set of statements that describe entities, where each statement includes a property and a value, as denoted in §3.1. Values can be object entities with unique identifiers, or literal values including dates, numbers, or strings. Each entity and property is associated with a set of aliases.
Wikipedia is an encyclopedia that contains a vast collection of articles covering a wide range of entities. A Wikidata entity can be linked to a Wikipedia article through a property named "sitelink".
Text corpora, encompassing news sources (e.g., Gigawords, CC-News) and Wikipedia, offer an abundance of information for determining the comparability of entities and the properties for the comparisons. For example, a sentence from a New York Times article such as "The show, with a book by the screenwriter Diablo Cody ('Juno') and staging by director Diane Paulus ('Waitress'), takes on the good work ..." indicates that Diablo Cody and Diane Paulus could be compared on the property of work (values: screenwriter vs. director).

Quintuple collection
In this section, we introduce the process of collecting quintuples by combining Wikidata and text corpora. The underlying hypothesis guiding our efforts is that when a pair of statements concerning the same property of related entities co-occur in a textual context, there is a high probability that these statements are indeed comparable.
To extract this comparability information, we first sample a document from the text corpora. Then, we link Wikidata statements to the sentences in the corpora by identifying the mentions of entity e, property p, and value v using string matching based on their aliases provided in Wikidata. Specifically, a statement (e, p, v) is linked to a sentence if the aliases of e, p, and v all appear in the sentence. To increase the linking accuracy, we properly tokenize the sentences, convert all text to lowercase, and remove stop words. Next, we pair (e_1, p_1, v_1) and (e_2, p_2, v_2) if they satisfy the following criteria:
1. e_1 and e_2 belong to the same category, e.g., they both have the value human for the property instance of. This ensures the entities are analogous to each other.
2. p_1 and p_2 are equal. This follows the principle that comparisons are made on a common property of two entities.
3. The sentences linked to (e_1, p_1, v_1) and (e_2, p_2, v_2) co-occur within the same context (e.g., a short paragraph of a news article). Being mentioned together indicates a high probability of implicit comparison.
We denote such a statement pair as a quintuple (e_1, e_2, p, v_1, v_2). Such quintuples store information for entity comparison.
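The linking and pairing steps above can be sketched as follows. This is a simplified illustration under our own assumptions (whitespace tokenization, no stop-word removal, hypothetical helper names), not the paper's actual pipeline.

```python
def normalize(text):
    """Lowercase and tokenize; a real pipeline would also drop stop words."""
    return set(text.lower().split())

def links_to(statement, sentence, aliases):
    """A statement (e, p, v) links to a sentence if some alias of each of
    e, p, and v appears among the sentence's tokens."""
    tokens = normalize(sentence)
    return all(
        any(normalize(a) <= tokens for a in aliases.get(item, [item]))
        for item in statement
    )

def make_quintuple(s1, s2, same_category, same_context):
    """Pair two linked statements into a quintuple if they meet the three
    criteria: same entity category, equal property, co-occurring context."""
    (e1, p1, v1), (e2, p2, v2) = s1, s2
    if same_category and p1 == p2 and same_context:
        return (e1, e2, p1, v1, v2)
    return None

aliases = {"Diablo Cody": ["diablo cody"], "work": ["work", "book"],
           "screenwriter": ["screenwriter"]}
sent = "The show, with a book by the screenwriter Diablo Cody"
print(links_to(("Diablo Cody", "work", "screenwriter"), sent, aliases))  # True
```

Note that alias sets let "book" stand in for the property "work", mirroring how Wikidata aliases widen the string-matching net.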

Quintuple textualization
In order to pre-train the language model in a text-to-text manner, it is crucial to represent the comparative information inherent in the quintuples in a textual form. We aim to explicitly train the model to capture comparable information from a pair of documents and make comparisons. To this end, we propose to input a pair of documents, each containing a text description of a single entity, and train the model to generate texts involving comparison.
As part of this process, we extract documents for each pair of entities e_1 and e_2. First, we find the Wikipedia articles of e_1 and e_2 via the links provided in Wikidata. Next, we split the articles into 10-sentence segments. To ensure the information within the quintuple can be inferred from the documents, we link (e_1, p, v_1) and (e_2, p, v_2) to sentences in the articles of e_1 and e_2, respectively. We link the statements based on two assumptions: (1) within an article pertaining to entity e, sentences are highly probable to discuss e as their subject; (2) if a sentence in a Wikipedia article of e mentions both e and v from a Wikidata statement (e, p, v), then it is highly likely that the sentence describes the fact of (e, p, v). Thus, we link the statements to sentences whenever (e, v) or (p, v) can be matched. To assess the linking quality, we randomly sampled 100 statement-sentence links and manually evaluated the accuracy. The linking accuracy exceeds 95%, indicating that the Wikidata statements are effectively linked to the sentences. Finally, we choose text segments from the articles of e_1 that can be linked to (e_1, p, v_1) as document D_1. Similarly, we obtain D_2 for e_2.
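The segment-extraction step might look like the following sketch. The plain substring matching here is a simplification of the alias-based linking described above, and the function names are illustrative.

```python
def segments(sentences, size=10):
    """Split an article (a list of sentences) into fixed-size segments."""
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]

def select_document(sentences, entity, value, size=10):
    """Keep the first segment containing a sentence that mentions both the
    entity and the value -- the linking heuristic described above,
    simplified to case-insensitive substring matching."""
    for seg in segments(sentences, size):
        if any(entity.lower() in s.lower() and value.lower() in s.lower()
               for s in seg):
            return " ".join(seg)
    return None

article = ["Airsoft is a team game.",
           "Players use airsoft guns to tag opponents."]
print(select_document(article, "airsoft", "guns"))  # prints the joined segment
```

Returning `None` when no segment links to the statement corresponds to discarding quintuples whose facts cannot be inferred from the documents.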
To enable the model to make comparisons in a text-to-text manner, the comparison knowledge encapsulated within the quintuples is converted into natural language forms, namely, question-answer pairs and summaries. The question-answer pairs (Q, A) are synthesized using predefined templates shown in Table 2.
To generate synthetic comparative summaries S, we utilize a data-to-text model (Ribeiro et al., 2021) that has been fine-tuned on the DART (Nan et al., 2021) dataset. This allows us to transform quintuples into concise declarative sentences.
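The template-based QA synthesis can be sketched as below. The template wording follows the spirit of Table 2 but is paraphrased for illustration; it is not the paper's exact template set.

```python
def textualize(e1, e2, p, v1, v2):
    """Convert a quintuple into synthetic comparative QA pairs, branching
    on whether the two entities share the same value (cf. Table 2)."""
    qa = [(f"Do {e1} and {e2} have the same value of {p}?",
           "Yes" if v1 == v2 else "No"),
          (f"What are the {p} of {e1} and {e2}?", f"{v1}, {v2}")]
    if v1 != v2:
        # Templates that only make sense when the values differ.
        qa.append((f"Is {e1}'s {p} {v1} or {v2}?", v1))
    else:
        # Templates that only make sense when the values agree.
        qa.append((f"Which entity has the same value as {e1} "
                   f"in terms of {p}?", e2))
    return qa

pairs = textualize("Diablo Cody", "Diane Paulus", "work",
                   "screenwriter", "director")
for q, a in pairs:
    print(q, "->", a)
```

Each generated (Q, A) pair, together with the two linked documents, becomes one pre-training example.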

Pre-training Tasks and Objectives
We propose three comparative pre-training tasks: comparative answer generation, QA pair generation, and summary generation. They are all text generation tasks, which fit the architectures of popular language models such as BART and T5 very well. We unify the pre-training tasks with task-specific prompts, as shown in Table 1.
Table 2: Synthetic QA templates. All indicates that the following templates are applicable to all quintuples. The templates under When v_1 ≠ v_2 or When v_1 = v_2 will be applied to quintuples whose v_1 and v_2 are different or the same, respectively.

Comparative answer generation
We concatenate the synthesized comparative question and the two documents as input. The model is trained to generate the corresponding answer. This task not only activates the attention mechanism between the question and the relevant contexts in each document; more importantly, it encourages the interaction between both documents, which is essential for making the comparison. We define the loss function for a training sample as

L_QA = - Σ_{(Q, A) ∈ T} log P(A | Q, D_1, D_2),

in which T is the set of QA pairs derived from the templates, and P(·) is the predicted probability.
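The loss described above is a standard sequence negative log-likelihood summed over the synthetic QA pairs. A toy numeric sketch (illustrative per-token probabilities rather than a real model's outputs):

```python
import math

def answer_nll(token_probs):
    """Negative log-likelihood of one answer: the sum of
    -log P(token | question, documents, previous tokens) over the gold
    answer tokens; token_probs are the model's probabilities for them."""
    return -sum(math.log(p) for p in token_probs)

def qa_loss(samples):
    """L_QA for one document pair: the answer NLL summed over every
    synthetic (Q, A) derived from the templates."""
    return sum(answer_nll(probs) for probs in samples)

# Toy example: two QA pairs with per-token gold probabilities.
loss = qa_loss([[0.9, 0.8], [0.5]])
print(round(loss, 4))  # 1.0217
```

A perfectly confident model (probability 1.0 on every gold token) drives the loss to zero, as expected for an NLL objective.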

Comparative QA pairs generation
Given two documents, the model is required to generate comparative questions and answers. By learning to generate comparative questions, the model learns to attend to both documents and reason about common and valuable properties of the two entities:

L_QAG = - Σ_{(Q, A) ∈ T} log P(Q, A | D_1, D_2).

Comparative summary generation
Given two documents, the model is tasked with generating short comparative summaries that represent the comparable statements:

L_SUM = - Σ_{s ∈ S} log P(s | D_1, D_2),

where S is the set of summaries from quintuple textualization.
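The three tasks above share a sequence-to-sequence format and are distinguished by task-specific prompts. A minimal sketch of how prompted inputs and the joint objective could be assembled; the prompt strings are illustrative stand-ins, not the exact wording of Table 1.

```python
PROMPTS = {  # task-specific prefixes in the spirit of Table 1 (illustrative)
    "answer": "Answer the comparative question:",
    "qa_pair": "Generate a comparative question and answer:",
    "summary": "Summarize the comparison:",
}

def build_example(task, d1, d2, question=None):
    """Prepend the task prompt and concatenate the two documents; answer
    generation additionally sees the synthesized question."""
    if question is not None:
        return f"{PROMPTS[task]} {question} Document 1: {d1} Document 2: {d2}"
    return f"{PROMPTS[task]} Document 1: {d1} Document 2: {d2}"

def overall_loss(l_qa, l_qag, l_sum, l_ti):
    """Shared objective summing the task losses (equal weights assumed
    here; the paper does not specify per-task weights)."""
    return l_qa + l_qag + l_sum + l_ti

src = build_example("summary", "Airsoft is ...", "Paintball is ...")
print(src)
print(overall_loss(0.5, 0.25, 0.75, 1.0))  # 2.5
```

Mixing examples from all tasks into one training stream is what lets a single seq2seq model be jointly optimized across them.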

Pre-training objectives
Inspired by multi-task prompted training (Sanh et al., 2022), we unify the aforementioned pre-training tasks with natural language prompts. The detailed formats of the source and target sequences are shown in Table 1. The model is jointly optimized for all tasks using a shared loss function, which encourages the model to learn generalizable representations that are beneficial across tasks. For BART, we train the proposed pre-training tasks together with the text infilling (TI) task, where the model is required to reconstruct text corrupted with randomly masked spans, as described in Lewis et al. (2020). We denote the loss function for text infilling as L_TI. Hence, the overall objective L is formulated as follows:

L = L_QA + L_QAG + L_SUM + L_TI.

4 Experiments

Datasets and Evaluation Metrics
To evaluate our proposed method, we consider downstream tasks involving comparative reasoning, including comparative question answering (QA), comparative question generation (QG), and comparative summarization. In this section, we introduce the downstream datasets and evaluation metrics. The statistics of the datasets are shown in Table 3.

Comparative question answering
Comparative question answering (QA) requires comparing two or more entities on their shared properties. Since our focus is on comparison over documents instead of knowledge retrieval, we do not include distractor passages but directly use the gold evidence passages as the context for question answering. For evaluation, we calculate the exact match (EM) score between the predicted answer and the ground-truth answer, after necessary normalization (Chen et al., 2017). Besides, unigram F1 scores are also calculated as a complementary metric to measure the similarity between the predicted answer and the ground truth.
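The EM and unigram F1 metrics follow standard open-domain QA evaluation. A minimal sketch; the normalization here mirrors the common SQuAD-style recipe (lowercasing, stripping punctuation and articles) and may differ in detail from the paper's exact script.

```python
import re
import string
from collections import Counter

def normalize_answer(s):
    """SQuAD-style normalization: lowercase, strip punctuation,
    articles, and extra whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(pred) == normalize_answer(gold))

def unigram_f1(pred, gold):
    """Token-overlap F1 between predicted and gold answers."""
    p, g = normalize_answer(pred).split(), normalize_answer(gold).split()
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1.0
print(round(unigram_f1("tower in Paris", "eiffel tower"), 2))  # 0.4
```

F1 credits partial overlap ("tower") that EM would score as zero, which is why the two metrics are reported together.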
HotpotQA_CMP and 2WikiQA_CMP. HotpotQA (Yang et al., 2018) and 2WikiMultihopQA (2Wiki) (Ho et al., 2020) are factual question answering datasets collected from English Wikipedia. These datasets require multi-hop reasoning over different entities before reaching the correct answer. To focus on comparative questions, we obtain the subset of comparison questions based on their question type annotations and denote them as HotpotQA_CMP and 2WikiQA_CMP, respectively.

Comparative question generation
Comparative question generation (QG) aims at generating questions that draw comparisons between the shared properties of two entities, given their textual descriptions. The task poses the challenge of identifying and inquiring about properties that humans would deem interesting and worthy of comparison. For comparative QG, we adopt well-established evaluation metrics, including BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004).
For comparative QG, we perform answer-unaware QG and use datasets converted from comparative QA, denoted by HotpotQG_CMP and 2WikiQG_CMP, respectively.

Comparative summarization
Comparative summarization aims at generating summaries that highlight the similarities or differences between two entities given their descriptions. Following the convention in text summarization (Zhang et al., 2020), we evaluate the generated summaries with ROUGE scores.
CocoTrip. For the CocoTrip dataset (Iso et al., 2022), we employ the common opinion summarization setting. The task involves summarizing the shared opinions from two sets of reviews about two hotels. We concatenate the reviews as input.
Diffen. To address the absence of available datasets for the comparative summarization of two entities, we have taken the initiative to curate a unique dataset. The dataset is collected from Diffen.com, a website recognized for offering high-quality, human-authored comparisons between different people or objects to help people make informed decisions. Comparison articles on Diffen.com typically include a brief introduction summarizing the similarities and differences. We manually collect these introductory paragraphs as comparative summaries. To gather input sources, we obtain Wikipedia articles as entity descriptions. The task aims at generating a comparative summary based on the given text descriptions of two entities.
The input sequence consists of concatenated entity descriptions, with each description truncated to its first 512 tokens due to the input length restriction.
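The truncation-and-concatenation step can be sketched as follows. Whitespace tokenization and the `</s>` separator are illustrative assumptions; in practice the limit applies to the model tokenizer's subword tokens and its own separator convention.

```python
def build_input(desc1, desc2, max_tokens=512):
    """Concatenate two entity descriptions, truncating each to its first
    max_tokens tokens (whitespace tokens here for illustration)."""
    def truncate(text):
        return " ".join(text.split()[:max_tokens])
    # "</s>" stands in for whatever separator the model expects.
    return truncate(desc1) + " </s> " + truncate(desc2)

print(build_input("a b c", "d e f", max_tokens=2))  # "a b </s> d e"
```

Truncating each description independently (rather than the concatenation as a whole) ensures both entities remain represented in the input.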

Experimental Setup
As a pilot study on pre-training for comparative reasoning, we adopt the pre-trained BART (Lewis et al., 2020) and T5 (Raffel et al., 2020) as baselines. Models that are further trained on our comparative objectives are denoted as BART+CMP and T5+CMP, respectively.
For each downstream dataset, we implement three distinct settings (full-data fine-tuning, few-shot learning, and zero-shot learning) to experiment with both the baselines and our proposed method. In the few-shot learning scenario, we randomly select 100 instances from the original training set. However, given the limited number of training instances available in CocoTrip and Diffen (specifically, only 20 instances), we merge the full-data and few-shot settings for these particular datasets. For CocoTrip, where the test set is available, we select the best model based on the validation set and report the results on the test set. For the other datasets, we report the results on the validation sets.

Effects of comparative pre-training
In a comprehensive evaluation across six datasets on three tasks, we compare the performance of the proposed method (in the form of "+CMP") against BART and T5. Main results are listed in Table 4.
In the full-data setting, both of our proposed models, BART+CMP and T5+CMP, demonstrate performance on par with their baselines across all tasks. Specifically, on HotpotQA, BART+CMP achieves an EM of 69, while BART achieves an EM of 69.27; similarly for T5, the scores are 72.69 and 73.16, respectively. Similar patterns are observed for other metrics and datasets, emphasizing the competitive performance of our approach in data-abundant scenarios.
However, in low-resource scenarios, represented by the few-shot and zero-shot settings, the superiority of our method over the baselines becomes clearly evident. For few-shot learning, our models outperform the baselines on most datasets. Among the three tasks, our models show the most significant improvement on the comparative QA task, demonstrating the effectiveness of our synthetic QA pre-training. In the zero-shot setting, BART+CMP and T5+CMP consistently surpass their baselines by a large margin. For instance, on HotpotQG, BART+CMP improves over BART by a substantial margin in BLEU (6.86 vs. 1.70). Likewise, T5+CMP surpasses T5, with BLEU and ROUGE-L scores of 7.24 and 28.99 against 1.21 and 18.7, respectively. These results illustrate that our proposed pre-training method greatly enhances LMs' performance in low-resource scenarios, while retaining competitive performance in scenarios with abundant training data.

Multi-task vs. single-task pre-training
To further explore the benefits of multi-task pre-training, we compare the performance of our models pre-trained on a single task (i.e., QA, QAG, or SUM) with the unified models pre-trained on all proposed tasks. Results are shown in Table 5. When the model is pre-trained on a single task, we observe a significant improvement in performance on the downstream task that closely resembles the pre-training task. However, the model does not exhibit similar improvements on other tasks that are unrelated or less similar in nature. This finding suggests that pre-training on a single task enhances the model's ability to transfer knowledge specifically to tasks with similar characteristics. For the unified model, we observe substantial improvements in performance across all downstream tasks. The improvement brought by the multi-task pre-trained model on each task is comparable to the gains achieved through the corresponding task-specific pre-training. This outcome suggests that multi-task pre-training enables the model to learn more generalized representations and effectively leverage the shared knowledge across different tasks.

Case Study
To intuitively show the comparative reasoning ability of our pre-trained model, we present an example of comparative summarization in Table 6. Given documents describing airsoft and paintball, models are expected to generate a summary comparing the commonalities and differences of these two games. Without exhaustive fine-tuning, the summary generated by BART fails to describe the correct relationship between these two entities. On the contrary, after being pre-trained on our comparative reasoning objectives, our model generates high-quality comparative summaries based on the provided documents under the few-shot setting. The generated summary states that both games are popular shooting sports while also comparing the differences in their equipment.

Conclusion
In this paper, we presented a novel framework for pre-training language models for comparative reasoning. It obtained quintuples for entity comparison by combining structured and unstructured data, converted the quintuples into textual components, and employed them in three novel sequence-to-sequence pre-training tasks. We demonstrated the effects of the pre-training tasks on six downstream datasets, especially in limited-resource scenarios. To facilitate the assessment of models' capability of entity comparison over texts, we release the benchmark for future research.

[Content of Table 6:]
D_1: Airsoft is a team game in which participants eliminate opposing players by tagging them out of play with spherical plastic projectiles shot with mock air weapons called airsoft guns. ... (446 words left)
D_2: Paintball is a competitive team shooting sport in which players eliminate opponents from play by hitting them with spherical dye-filled gelatin capsules called paintballs that break upon impact. ... (472 words left)
Gold: Airsoft is a popular combat simulation game where participants are eliminated when hit by pellets launched from guns that resemble real firearms. In paintball, participants try to hit each other with paintballs launched from a special paintball marker/gun. While airsoft is cheaper and provides a more realistic warfare experience, paintball is more popular, more organized, and has larger events.
BART (R-L: 18.66, R-2: 4.39): Airsoft is a team shooting sport in which participants eliminate opponents by hitting them with airsoft guns. Airsoft guns are shaped like basketballs or baseball bats and are equipped with a series of round-shaped projectiles called paintballs.
BART+CMP (R-L: 19.17, R-2: 8.62): Airsoft and Paintball are two of the most popular shooting sports of all time. Airsoft is a shooting sport that involves hitting opponents with airsoft guns, while Paintball is a more aggressive game that uses a softer, more aggressive, ball-shaped paintball.

Limitations
In our pre-training framework, we generate synthetic data for comparative answer generation pre-training with templates, which can make some synthetic questions disfluent. Such noise in the pre-training data might limit the downstream performance. Similarly, the language of the synthetic summaries generated by a trained data-to-text model is rigid and lacks diversity and flexibility. Future work can adopt more advanced approaches to convert quintuples into more fluent and diverse texts for pre-training.

Figure 1 :
Figure 1: The framework of pre-training language models (LMs) for comparative reasoning abilities. In Step 1, we collect quintuples for entity comparison by combining a structured knowledge base (i.e., Wikidata) and unstructured corpora (i.e., Gigawords, CC-News, Wikipedia). Details are in §3.2.2. In Step 2, to obtain text-based pre-training data, we convert the quintuples into synthetic QA pairs and summaries with a set of QA templates and a fine-tuned data-to-text model, respectively. We gather Wikipedia documents as text descriptions of entities. Details are in §3.2.3. In Step 3, we design novel sequence-to-sequence pre-training tasks for the LMs. Details are described in §3.3.
All:
Q: Do e_1 and e_2 have the same/different value of p? A: Yes/No
Q: Do e_1 and e_2 both have the value of v_1 in terms of p? A: Yes/No
Q: What are the p of e_1 and e_2? A: v_1, v_2
When v_1 ≠ v_2:
Q: Which one of the following entities' p is v_1? e_1 or e_2? A: e_1
Q: Is e_1's p v_1 or v_2? A: v_1
When v_1 = v_2:
Q: Which entity has the same value as e_1 in terms of p? A: e_2
Q: e_1 and e_2 are known for what (value) of p? A: v_1/v_2

Table 3 :
Statistics of our downstream datasets.

Table 4 :
Main results. Our pre-trained models, denoted by +CMP, bring significant performance gains to BART and T5 in the zero-shot (e.g., relatively +82% and +220% F1 on HotpotQA_CMP) and few-shot (e.g., relatively +29% and +52% F1 on 2WikiQA_CMP) settings across all tasks. In the full-data setting, which assumes a large number of labeled examples are available, our approach makes smaller improvements on the two models.

Table 5 :
Few-shot and zero-shot results of models with multi-task pre-training (denoted by +CMP) vs. single-task pre-training (denoted by +CMP_QA, +CMP_QAG, and +CMP_SUM). On each task, the (multi-task) unified model shows a performance gain over BART that is comparable to that of the task-specific pre-trained models. Meanwhile, it also improves on the other tasks, showing the effectiveness of our unified multi-task pre-training.

Table 6 :
A test example from the Diffen dataset. BART and BART+CMP refer to the model predictions after few-shot fine-tuning.