Modeling Information Change in Science Communication with Semantically Matched Paraphrases

Whether the media faithfully communicate scientific information has long been a core issue to the science community. Automatically identifying paraphrased scientific findings could enable large-scale tracking and analysis of information changes in the science communication process, but this requires systems to understand the similarity between scientific information across multiple domains. To this end, we present the SCIENTIFIC PARAPHRASE AND INFORMATION CHANGE DATASET (SPICED), the first paraphrase dataset of scientific findings annotated for degree of information change. SPICED contains 6,000 scientific finding pairs extracted from news stories, social media discussions, and full texts of original papers. We demonstrate that SPICED poses a challenging task and that models trained on SPICED improve downstream performance on evidence retrieval for fact checking of real-world scientific claims. Finally, we show that models trained on SPICED can reveal large-scale trends in the degrees to which people and organizations faithfully communicate new scientific findings. Data, code, and pre-trained models are available at http://www.copenlu.com/publication/2022_emnlp_wright/.


Introduction
Science communication disseminates scholarly information to audiences outside the research community, such as the public and policymakers (National Academies of Sciences, Engineering, and Medicine, 2017). This process usually involves translating highly technical language to nontechnical, less-formal language that is engaging and easily understandable for lay people (Salita, 2015). The public relies on the media to learn about new scientific findings, and media portrayals of science affect people's trust in science while at the same time influencing their future actions (Gustafson and Rice, 2019; Fischhoff, 2012; Kuru et al., 2021). However, not all scientific communication accurately conveys the original information, as shown in Figure 1. Identifying cases where scientific information has changed is a critical but challenging task due to the complex translating and paraphrasing done by effective communicators. Our work introduces a new task of measuring scientific information change and, by developing new data and models, aims to address the gap in studying faithful scientific communication.
Though efforts exist to track and flag when popular media misrepresent science, the sheer volume of new studies, reporting, and online engagement makes purely manual efforts both intractable and unattractive. Existing NLP studies that help automate the study of science communication have examined exaggeration (Wright and Augenstein, 2021b), certainty (Pei and Jurgens, 2021), and fact checking (Boissonnet et al., 2022; Wright et al., 2022), among others. However, these studies skip over the key first step needed to compare scientific texts for information change: automatically identifying content from both sources that describes the same scientific finding. In other words, to answer relevant questions about and analyze changes in scientific information at scale, one must first be able to point to which original information is being communicated in a new way.
To enable automated analysis of science communication, this work offers the following contributions (marked by C). First, we present the SCIENTIFIC PARAPHRASE AND INFORMATION CHANGE DATASET (SPICED), a manually annotated dataset of paired scientific findings from news articles, tweets, and scientific papers (C1, §3). SPICED has the following merits: (1) existing datasets focus purely on semantic similarity, while SPICED focuses on differences in the information communicated in scientific findings; (2) scientific text datasets tend to focus solely on titles or paper abstracts, while SPICED includes sentences extracted from the full text of papers and news articles; (3) SPICED is largely multi-domain, covering the 4 broad scientific fields that get the most media attention (namely: medicine, biology, computer science, and psychology) and includes data from the whole science communication pipeline, from research articles to science news and social media discussions.
In addition to extensively benchmarking the performance of current models on SPICED (C2, §4), we demonstrate that the dataset enables multiple downstream applications. In particular, we demonstrate how models trained on SPICED improve zero-shot performance on the task of sentence-level evidence retrieval for verifying real-world claims about scientific topics (C3, §5), and perform an applied analysis on unlabelled tweets and news articles where we show (1) media tend to exaggerate findings in the limitations sections of papers; (2) press releases and SciTech tend to have less informational change than general news outlets; and (3) organizations' Twitter accounts tend to discuss science more faithfully than verified users on Twitter and users with more followers (C4, §6).

Related Work
The analysis of scientific communication directly relates to fact checking, scientific language analysis, and semantic textual similarity. We briefly highlight our connections to these.

Fact Checking
Automatic fact checking is concerned with verifying whether or not a given claim is true, and has been studied extensively in multiple domains (Thorne et al., 2018; Augenstein et al., 2019) including science (Wadden et al., 2020; Boissonnet et al., 2022; Wright et al., 2022). Fact checking focuses on a specific type of information change, namely veracity. Additionally, the task generally assumes access to pre-existing knowledge resources, such as Wikipedia or PubMed, from which evidence can be retrieved that either supports or refutes a given claim. Our task is concerned with a more general type of information change beyond categorical falsehood and is a required task to complete prior to performing any kind of fact check.

Scientific Language Analysis
Automating tasks beneficial for understanding changes in scientific information between the published literature and media is a growing area of research (Wright and Augenstein, 2021b; Pei and Jurgens, 2021; Boissonnet et al., 2022; Dai et al., 2020; August et al., 2020b; Tan and Lee, 2014; Vadapalli et al., 2018; August et al., 2020a; Ginev and Miller, 2020). The three tasks most related to our work are understanding writing strategies for science communication (August et al., 2020b), detecting changes in certainty (Pei and Jurgens, 2021), and detecting changes in causal claim strength, i.e., exaggeration (Wright and Augenstein, 2021b). However, studying these requires access to paired scientific findings. To be able to do so at scale will require the ability to pair such findings automatically.

Semantic Similarity
The topic of semantic similarity is well-studied in NLP. Several datasets exist with explicit similarity labels, many of which come from SemEval STS shared tasks (e.g., Cer et al., 2017) and paraphrasing datasets (Ganitkevitch et al., 2013). It is possible to build unlabelled datasets of semantic similarity automatically, which is the main method that has been used for scientific texts (Cohan et al., 2020; Lo et al., 2020). However, such datasets fail to capture more subtle aspects of similarity, particularly when the focus is solely on the scientific findings conveyed by a sentence (see Appendix A). And as we will show, approaches based on these datasets are insufficient for the task we are concerned with in this work, motivating the need for a new resource.

SPICED
We introduce SPICED, a new large-scale dataset of scientific findings paired with how they are communicated in news and social media. Communicating scientific findings is known to have a broad impact on public attitudes (Weigold, 2001) and to influence behavior, e.g., the way vaccines are framed in the media has an effect on vaccine uptake (Kuru et al., 2021). Building upon prior work in NLP (Wright and Augenstein, 2021a; Pei and Jurgens, 2021; Sumner et al., 2014; Bratton et al., 2019), we define a scientific finding as a statement that describes a particular research output of a scientific study, which could be a result, conclusion, product, etc. This general definition holds across fields; for example, many findings from medicine and psychology report on effects on some dependent variable via manipulation of an independent variable, while in computer science many findings are related to new systems, algorithms, or methods. Below, we describe how the pairs of scientific findings were selected and annotated.

Data Collection
An initial dataset of unlabelled pairs of scientific communications was collected through Altmetric (https://www.altmetric.com/), a platform tracking mentions of scientific articles online. This initial pool contains 17,668 scientific papers, 41,388 paired news articles, and 733,755 tweets; note that a single paper may be communicated about multiple times. The scientific findings were extracted in different ways for each source. Similar to Prabhakaran et al. (2016), we fine-tune a RoBERTa (Liu et al., 2019) model to classify sentences into methods, background, objective, results, and conclusions using 200K paper abstracts from PubMed that had been self-labeled with these categories (Canese and Weis, 2013). This sentence classifier attained a 0.92 F1 score on a held-out 10% sample (details in Appendix I), and the classifier was then applied to each sentence of the news stories and paper full texts. Given the domain difference between scientific abstracts and news, we additionally manually annotated a sample of 100 extracted conclusions; we find that the precision of the classifier is 0.88, suggesting that it is able to accurately identify scientific findings in news as well. We extract each paper sentence classified as "result" or "conclusion" and create pairs with each finding sentence from news articles written about that paper. This yields 45.7M potential pairs of ⟨news, paper⟩ findings. For tweets, we take full tweets as is, yielding 35.6M potential pairs of ⟨tweet, paper⟩ findings.

Data sampling
Pairing every finding from a news story with every finding from its matched paper results in an untenable amount of data to annotate. Additionally, it has been shown that proper data selection can reduce the need to annotate every possible sample (MacKay, 1992; Holub et al., 2008; Houlsby et al., 2011). Therefore, to obtain a sample of paired findings covering a range of similarities, we first filter our pool of unlabelled matched findings based on their semantics using Sentence-BERT (SBERT; Reimers and Gurevych, 2019), a Siamese BERT network for semantic text similarity trained on over 1B sentence pairs (see Appendix G for further details). We use this model to score pairs of findings from news articles and papers based on their embeddings' cosine similarity and conduct a pilot study to determine which data to annotate.
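As a minimal sketch of the scoring step, the cosine similarity between two embedding vectors can be computed as below. The three-dimensional vectors are toy stand-ins for SBERT sentence embeddings (an assumption made so the example needs no model download), and returning 0.0 for an all-zero vector is a convention chosen here, not something specified in the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_u * norm_v)


# Toy vectors standing in for the embeddings of a paper finding
# and a news finding (hypothetical values).
paper_vec = [1.0, 0.0, 1.0]
news_vec = [1.0, 1.0, 0.0]
print(cosine_similarity(paper_vec, news_vec))
```

In practice the vectors would come from the sentence encoder, and every candidate ⟨news, paper⟩ pair would receive one such score.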
For the pilot, we sample 400 pairs evenly across the 0.05-increment buckets in the range [0, 1] of similarity scores (20 per bucket). Each sample is annotated by two of the authors of this study with a binary label of "matching" vs "not matching", yielding a Krippendorff's alpha of 0.73. From this sample, we observed that there were no matches below 0.3 and only 2 ambiguous matches below 0.4. At the same time, the vast majority of samples from the entire dataset have a similarity score of less than 0.4. Additionally, above 0.9 we saw that each pair was essentially equivalent. Given the distribution of matched findings across the similarity scale, in order to balance the number of annotations we can acquire, the yield of positive samples, and the sample difficulty, we sampled data as follows based on their cosine similarity:
• Below 0.4 = automatically unmatched.
• Above 0.9 with a Jaccard index above 0.5 = automatically matched.
• Sample an equal number of pairs from each 0.05 increment bin between 0.4 and 0.9 for human expert annotation.
We sample 600 ⟨news, paper⟩ finding pairs from each of the four fields which receive the most media attention (medicine, biology, computer science, and psychology) using this method. This yields 2,400 pairs to be annotated. For extensive details on the pilot annotation and visualizations, see Appendix B.

Table 1: Annotated information matching score (IMS) and the similarity score estimated by SBERT (Reimers and Gurevych, 2019) for selected finding pairs from SPICED. These examples demonstrate that simple similarity scores may not reflect whether the two sentences are covering the same scientific finding.
Pair 1 (IMS: 1)
Paper finding: However, the consistency of the erythritol results in both the central adiposity and usual glycemia comparisons lends strength to the findings, and the cluster of metabolites has biological plausibility.
News finding: Young adults who exhibited central adiposity gain over the course of 35 weeks had plasma erythritol levels 15-times higher at baseline than those with stable adiposity over the same period.
Pair 2 (similarity score: 0.38; IMS: 4.4)
Paper finding: Our results showed that most of the official adult-onset men began their antisocial activities during early childhood.
News finding: Beckley, who is in the department of psychology and neuroscience at Duke, said the adult-onset group had a history of anti-social behavior back to childhood, but reported committing relatively fewer crimes.
We follow a similar procedure to sample pairs from papers and Twitter for annotation.However, rather than use the SBERT similarity scores, we instead first obtain annotations for news pairs using the scheme to be described later in §3.3 in order to train an initial model on our task (CiteBERT, Wright and Augenstein 2021a).We then use the trained model to obtain scores in the range [0,1] for each pair and sample an equal number of pairs from bins in 0.05 increments, for a total of 1,200 pairs (300 from each field of interest).
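The threshold-and-bin rule used for the ⟨news, paper⟩ pairs can be sketched as follows. This is an illustration under stated assumptions: the Jaccard index is computed over lowercased whitespace tokens (the paper does not specify the tokenization here), and pairs scoring above 0.9 that fail the Jaccard test are marked "unassigned" because the text does not say how they were handled. The function and label names are hypothetical.

```python
import math


def jaccard_index(sentence_a, sentence_b):
    """Token-set overlap; whitespace tokenization is an assumption."""
    tokens_a = set(sentence_a.lower().split())
    tokens_b = set(sentence_b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return len(tokens_a & tokens_b) / len(union)


def route_pair(similarity, paper_finding, news_finding):
    """Decide how a candidate pair is handled, given its cosine similarity."""
    if similarity < 0.4:
        return "auto_unmatched"
    if similarity > 0.9:
        if jaccard_index(paper_finding, news_finding) > 0.5:
            return "auto_matched"
        return "unassigned"  # not specified in the text; placeholder label
    # Ten bins of width 0.05 covering [0.4, 0.9]; rounding guards against
    # floating-point error at bin edges, and 0.9 falls into the last bin.
    bin_index = min(math.floor(round((similarity - 0.4) / 0.05, 9)), 9)
    return f"annotate_bin_{bin_index}"


print(route_pair(0.62, "finding from the paper", "finding from the news"))
```

Sampling an equal number of pairs per `annotate_bin_*` label then yields a pool that covers the full similarity range sent to annotators.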

Finding Matching Annotation
We perform our final annotation based on the sampling scheme above using the Prolific platform (https://www.prolific.co/), as it allows prescreening annotators by educational background. We require each annotator to have at least a bachelor's degree in a relevant field to work on the task. Annotators are asked to label "whether the two sentences are discussing the same scientific finding" for 50 finding pairs on a 5-point Likert scale where each value indicates that "The information in the findings is..." (1): Completely different, (2): Mostly different, (3): Somewhat similar, (4): Mostly the same, or (5): Completely the same. See Appendix C for details of how this rating scale was decided. We call this the INFORMATION MATCHING SCORE (IMS) of a pair of findings. Annotation was performed using POTATO (Pei et al., 2022). Full annotation instructions and details are listed in Appendix D. Notably, annotators were instructed to mark how similar the information in the findings was, as opposed to how similar the sentences are. Further, they were instructed to ignore extraneous information like "The scientists show..." and "our experiments demonstrate...".

Post processing
To improve the reliability of the annotations, we use MACE (Hovy et al., 2013) to estimate the competence score of each annotator and remove the labels from the annotators with the lowest competence scores. We further manually examine pairs with the most diverse labels (standard deviation of ratings > 1.2) and manually replace the outliers with our expert annotations. The overall Krippendorff's α is 0.52, 0.57, 0.53, and 0.52 for CS, Medicine, Biology, and Psychology respectively, indicating that the final labels are reliable. While many annotators considered the task challenging, our quality control strategies allow us to collect reliable annotations. For all the annotated pairs, we average the ratings as the final similarity score. In addition to the 3,600 manually annotated pairs, we include an extra 2,400 automatically annotated pairs as determined in §3.2 (unmatched pairs get an IMS of 1, matched pairs get an IMS of 5), for a total of 6,000 pairs. Given that there can be multiple finding pairs from a single ⟨news, paper⟩ pair, to avoid overlaps between training and test sets, we split the dataset 80%/10%/10% based on the paper DOI and balance across subjects. Further dataset details are in Appendix E.

Selected Examples
To highlight the difficulty of SPICED, we show a pair of samples from our final dataset in Table 1. The IMS is compared to the cosine similarity between embeddings produced by SBERT. For the first case, SBERT presumably picks up on similarities in the discussed topics, such as erythritol and its relationship to adiposity, but the paper finding is concerned with the consistency of results and its biological implications while the news finding explicitly mentions a relationship between erythritol and adiposity. The second case expresses the opposite effect; the news finding contains a lot of extraneous information for context, but one of the core findings it expresses is the same as the paper finding, giving it a high rating in SPICED.

Table 2: The average normalized edit distance between matching pairs for various datasets shows that SPICED includes more pairs that are lexically dissimilar. For SPICED and STSB, pairs are considered matching if their similarity score is greater than 3. For SNLI, pairs are considered matching if the label is "entailment".
STSB: 0.401
SNLI: 0.631
SPICED (all): 0.726
SPICED (News): 0.712
SPICED (Tweets): 0.749
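Two of the post-processing steps, averaging ratings into an IMS with a disagreement flag and splitting by paper DOI, can be sketched as below. Several details are assumptions: the population standard deviation is used (the paper does not say which estimator was applied), the 80/10/10 proportions are applied to DOIs rather than to pairs, and the balancing across subjects mentioned in the text is omitted. The dictionary key `doi` and both function names are hypothetical.

```python
import random
import statistics


def aggregate_ratings(ratings, disagreement_threshold=1.2):
    """Return (mean rating, needs_review) for one finding pair."""
    mean_rating = statistics.mean(ratings)
    spread = statistics.pstdev(ratings)  # assumption: population std. dev.
    return mean_rating, spread > disagreement_threshold


def split_by_doi(pairs, seed=0):
    """Assign every pair of a given paper (DOI) to the same split."""
    dois = sorted({pair["doi"] for pair in pairs})
    random.Random(seed).shuffle(dois)
    n_train = int(0.8 * len(dois))
    n_dev = int(0.1 * len(dois))
    split_of = {}
    for position, doi in enumerate(dois):
        if position < n_train:
            split_of[doi] = "train"
        elif position < n_train + n_dev:
            split_of[doi] = "dev"
        else:
            split_of[doi] = "test"
    splits = {"train": [], "dev": [], "test": []}
    for pair in pairs:
        splits[split_of[pair["doi"]]].append(pair)
    return splits


# Hypothetical data: 10 papers with 2 finding pairs each.
pairs = [{"doi": f"10.0000/paper{i}", "pair_id": k} for i in range(10) for k in range(2)]
splits = split_by_doi(pairs)
print({name: len(items) for name, items in splits.items()})
```

Grouping by DOI is what prevents findings from the same paper from appearing in both the training and test sets.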
Comparison with existing datasets
To further characterize the difficulty of SPICED compared to existing datasets, we show the average normalized edit distance between matching pairs in SPICED, STSB (Cer et al., 2017), and SNLI (Bowman et al., 2015) (see Appendix F for the calculation). STSB is a semantic text similarity dataset consisting of pairs of sentences scored with their semantic similarity, sourced from multiple SemEval shared tasks. SNLI is a natural language inference corpus consisting of pairs of sentences labeled for whether they entail each other, contradict each other, or are neutral. We calculated the mean normalized edit distance across all pairs of matching sentences in each dataset's training data. For SPICED and STSB, pairs are considered matching if their IMS or similarity score is greater than 3, respectively. For SNLI, pairs are considered matching if the label is "entailment".
We find that there is a much greater lexical difference between the matching pairs in SPICED (0.726) than existing general domain paired text datasets (0.401 for STSB and 0.631 for SNLI). This gap between STSB and SPICED also emphasizes the difference between traditional semantic textual similarity tasks and the information change task we describe here. Within SPICED, Twitter pairs had a higher distance (0.749) than news pairs (0.712), suggesting stronger domain differences. For qualitative examples showing the difference between SPICED and STSB, see Appendix A.
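For readers who want a concrete picture of the statistic, the sketch below computes one common form of normalized edit distance: character-level Levenshtein distance divided by the length of the longer string. The exact calculation used for the reported numbers is defined in Appendix F, which is not reproduced here, so the character-level granularity and the normalizer are assumptions.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(a, b):
    """Edit distance scaled to [0, 1] by the longer string's length (assumed normalizer)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return levenshtein(a, b) / longest


def mean_normalized_edit_distance(matching_pairs):
    """Average over an iterable of (sentence_a, sentence_b) matching pairs."""
    distances = [normalized_edit_distance(a, b) for a, b in matching_pairs]
    return sum(distances) / len(distances)
```

A value near 1 means that matching sentences share very little surface form, which is the sense in which SPICED pairs are more lexically dissimilar than STSB or SNLI pairs.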

Relationship of SPICED to Fact Checking
The task introduced by SPICED captures information change more broadly than veracity as in automatic fact checking, as the task is concerned with the degree to which two sentences describe the same scientific information; indeed, two similar sentences may describe the same information equally poorly. Our task is similar to the sentence selection stage in the fact checking pipeline, and we later demonstrate in §5 that models trained on SPICED data are useful for this task for science. However, our task and annotation are agnostic to whether a pair of sentences entail one another. This is especially useful if one wants to compare how a particular finding is presented across different media. Fact-checking datasets are also explicitly constructed to contain claims which are about a single piece of information; SPICED is not restricted in this way, focusing on a more general type of information change beyond categorical falsehood. Finally, we note two more unique features of SPICED: (1) SPICED contains naturally occurring sentences, while fact checking datasets like FEVER and SciFact often contain manually written claims; (2) the combination of domains in SPICED is unique: sentences are paired between (news, science) and (tweets, science), and these pairings do not currently exist elsewhere.

Scientific Information Change Models
We now use SPICED to evaluate models for estimating the IMS of finding pairs in two settings: zero-shot transfer and supervised fine-tuning.

Experimental setup
We use the following four models to estimate zero-shot transfer performance.
• Paraphrase: RoBERTa (Liu et al., 2019) pre-trained for paraphrase detection on an adversarial paraphrasing task (Nighojkar and Licato, 2021). We convert the output probability of a pair being a paraphrase to the range [1, 5] for comparison with our labels.
• Natural Language Inference (NLI): RoBERTa pre-trained on a wide range of NLI datasets (Nie et al., 2020). The final score is the model's measured probability of entailment mapped to the range [1, 5].
• MiniLM: SBERT with MiniLM as the base network (Wang et al., 2020a); we obtain sentence embeddings for pairs of findings and measure the cosine similarity between these two embeddings, clip the lowest score to 0, and convert this score to the range [1, 5]. Note that this model was trained on over 1B sentence pairs, including from scientific text, using a contrastive learning approach where the embeddings of sentences known to be similar are trained to be closer than the embeddings of negatively sampled sentences. SBERT models represent a very strong baseline on this task, and have been used in the context of other matching tasks for fact checking, including detecting previously fact-checked claims (Shaar et al., 2020).
• MPNet: The same setting and training data as MiniLM but with MPNet as the base network (Song et al., 2020).
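The score conversions used for the zero-shot models can be illustrated with a simple linear map. The paper states that probabilities and clipped cosine similarities are converted to the range [1, 5] but does not give the formula, so the linear form `1 + 4 * x` below is an assumption, and the function names are hypothetical.

```python
def probability_to_ims(probability):
    """Map a probability in [0, 1] linearly onto the IMS range [1, 5] (assumed form)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return 1.0 + 4.0 * probability


def cosine_to_ims(cosine):
    """Clip negative cosine similarities to 0, then map linearly onto [1, 5]."""
    clipped = max(0.0, cosine)
    return 1.0 + 4.0 * clipped


print(probability_to_ims(0.25))  # paraphrase or entailment probability
print(cosine_to_ims(-0.3))       # negative similarities map to the minimum score
```

Putting all zero-shot outputs on the same 1 to 5 scale is what allows MSE against the annotated IMS to be compared across models.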
We fine-tune the following models on SPICED to estimate IMS as a comparison with zero-shot transfer.
• MiniLM-FT: The same MiniLM model from the zero-shot transfer setup but further fine-tuned on SPICED. The training objective is to minimize the distance between the IMS and the cosine similarity of the output embeddings of the pair of findings.
• MPNet-FT: The same setup as MiniLM-FT but using MPNet as the base network.
• RoBERTa: The RoBERTa (Liu et al., 2019) base model. We perform a regression task where the model is trained to minimize the mean-squared error between the prediction and IMS.
• SciBERT: A transformer model trained using masked language modeling on a large corpus of scientific text (Beltagy et al., 2019). The fine-tuning setup is the same as for the RoBERTa model.
• CiteBERT: A SciBERT model further fine-tuned on the task of citation detection, which was shown to have improved performance on downstream tasks using scientific text (Wright and Augenstein, 2021a). The training setup is the same as for the RoBERTa model.
Please see Appendix G for further details on the models and pretraining methods. For the fine-tuned models, we train on the entire training set of SPICED, including both news findings and tweets. For the test set we only use manually annotated pairs. Performance is measured in terms of mean-squared error (MSE) and Pearson correlation (r) (definitions of all metrics in Appendix F). All results are reported as the average and standard deviation for each model across 5 random seeds.
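The two evaluation metrics named above can be computed with the standard textbook definitions, sketched here in plain Python. The authoritative definitions are in Appendix F, which is not reproduced in this section, so treat this as an illustration rather than the evaluation script.

```python
import math


def mean_squared_error(predictions, gold):
    """Average squared difference between predicted and annotated IMS."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - g) ** 2 for p, g in zip(predictions, gold)) / len(gold)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical predictions and annotated IMS values for three pairs.
predicted = [1.5, 3.0, 4.5]
annotated = [1.0, 3.0, 5.0]
print(mean_squared_error(predicted, annotated), pearson_r(predicted, annotated))
```

MSE penalizes miscalibrated scores, while Pearson r only rewards getting the ordering and relative spacing right, which is why the two metrics can rank models differently.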

Results
Paraphrase detection and natural language inference models perform very poorly for zero-shot transfer on this task (Figure 2, grey bars), with NLI having slightly better transfer, supporting our hypothesis that transferring from existing tasks to this domain is challenging. Fine-tuned models with Masked Language Model (MLM) pretraining can learn the task reasonably well (Figure 2, red bars), but surprisingly RoBERTa performs just as well as SciBERT and CiteBERT, which were specifically pretrained on scientific texts. We posit that this could be due to the fact that RoBERTa was pretrained on a wider range of texts that are reflective of the domains in SPICED, including news texts, while SciBERT and CiteBERT were trained solely on scientific papers.
SBERT models trained on large amounts of pretraining sentences perform well in the zero-shot transfer setup, with the MiniLM-based model outperforming MPNet. The best setup was using SBERT fine-tuned on SPICED (Figure 2, blue bars), which yields up to 3.9 points gained overall in Pearson correlation and a reduction of 0.3 in terms of MSE (MPNet to MPNet-FT). We also note that there is a large gap between performance on this data and general semantic similarity datasets such as STSB, which see correlation scores in the 90s. As such, there is potentially much room to grow in terms of raw performance on this dataset.
Models performed worse for pairs with tweets versus those from news (Appendix Table 7). This performance difference is in line with our expectations, as there is a large domain shift between tweets and scientific texts and our base models were not exposed to tweets during pre-training. All models, including the zero-shot transfer SBERT models, perform much worse on that split of the data. Additionally, we only see minor gains in performance in terms of MSE for MiniLM when fine-tuned on tweets. We see larger gains for MPNet. Interestingly, the best performance (Pearson r) for tweets is achieved by RoBERTa, though the overall MSE is still best for MPNet-FT. We show extended benchmarking in Appendix J and the top-5 errors for RoBERTa and MPNet-FT in Appendix K.

Application: Zero-Shot Evidence Retrieval for Scientific Fact Checking
Accurately measuring the similarity of scientific findings written in different domains enables a wide range of downstream analyses and tasks. As a first task, we consider evidence retrieval for scientific fact checking of real-world scientific claims. In general, automatic fact checking consists of retrieving relevant evidence for a given claim and predicting if that evidence supports or refutes the claim. We test the ability of models trained on SPICED to perform the evidence retrieval task in a zero-shot setting. In this, we use the models as is, with no further fine-tuning on any evidence retrieval data. We evaluate on two datasets, CoVERT and COVID-Fact.

Setup
We compare different models' ability to rank the evidence sentences such that the ground truth evidence for a given claim is ranked highest. We use four models in a zero-shot setting for comparison (MiniLM, MiniLM-FT, MPNet, and MPNet-FT; '-FT' indicates fine-tuning on SPICED), and show results with the unsupervised BM25 (Robertson et al., 1994), a widely used bag-of-words retrieval model. We report retrieval results in terms of mean average precision (MAP) and mean reciprocal rank (MRR), and average the results for models fine-tuned on SPICED across 5 random seeds.
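The two retrieval metrics can be sketched with their standard definitions. Each claim contributes a ranked list of candidate evidence sentences (ordered by model score) and a set of ground-truth evidence identifiers; the identifiers and the `evaluate_retrieval` helper below are hypothetical names for illustration.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0


def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@k taken at each rank k where a relevant item appears."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


def evaluate_retrieval(queries):
    """queries: list of (ranked_ids, relevant_ids) tuples, one per claim."""
    n = len(queries)
    mean_ap = sum(average_precision(r, rel) for r, rel in queries) / n
    mean_rr = sum(reciprocal_rank(r, rel) for r, rel in queries) / n
    return {"MAP": mean_ap, "MRR": mean_rr}


# Two hypothetical claims with ranked candidate evidence sentences.
queries = [
    (["s3", "s1", "s2"], {"s1"}),
    (["a", "b", "c", "d"], {"a", "c"}),
]
print(evaluate_retrieval(queries))
```

MRR only looks at the first correct evidence sentence, whereas MAP also rewards ranking every ground-truth sentence highly when a claim has more than one.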

Results
We find that fine-tuning on SPICED provides consistent gains in retrieval performance on both datasets for both SBERT models (Table 3). This performance increase is encouraging, as there are two notable differences between SPICED and the two datasets in our experiment. The first is that the tasks are different: SPICED provides a general scientific information similarity task which proves to be useful for evidence sentence ranking. The second is that the domains are different: SPICED contains ⟨news, paper⟩ and ⟨tweet, paper⟩ pairs, while CoVERT and COVID-Fact have claims from Twitter and Reddit, respectively, paired with evidence in news. Our results show that training on SPICED improves the IR performance of the SBERT models.

Application: Modeling Information Change in Science Communication
Whether the media faithfully communicate scientific information has long been a core question to the science community (National Academies of Sciences, Engineering, and Medicine, 2017). Our dataset and models allow us to conduct a large-scale analysis to study information change in science communication. Here, we focus on three research questions:
• RQ1: Do findings reported by different types of outlets express different degrees of information change from their respective papers?
• RQ2: Do different types of social media users systematically vary in information change when discussing scientific findings?
• RQ3: Which parts of a paper are more likely to be miscommunicated by the media?
RQ1-2 focus on the holistic information change captured in IMS, while RQ3 focuses on what types of information might be changing.

RQ1: Comparing Media Outlets
Different types of media target different audiences and tend to report the same issue differently (Richardson, 1990; Mencher and Shilton, 1997). While good science journalism requires outlets to prioritize quality, in practice journalists may adopt different writing strategies for different types of audiences (Roland, 2009). Thus, we investigate if findings reported by different types of outlets express different levels of information change, focusing on three types of outlets: General News (e.g., NYTimes), Press Releases (e.g., Science Daily), and Science & Technology (e.g., Popular Mechanics). We use our best-performing MPNet-FT model to estimate the IMS of over 1B pairs and keep those with IMS > 3, which finally leads to 1.1M paired findings from 26,784 news stories and 12,147 papers. We then build a linear mixed effect regression model (Gałecki and Burzykowski, 2013) to predict IMS for matching pairs from news stories and research articles. We include a fixed effect for the type of news outlet, using General News as the reference category. To account for reporting differences across fields and variations specific to highly-publicized papers, we also include a fixed effect for the scientific subject and a random effect for each paper with 30+ pairs (all other papers are pooled in a single random effect).
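One way to write down the regression described above is given below. The notation is introduced here for illustration and does not come from the paper: $i$ indexes a matched finding pair, $o(i)$ its outlet type, $f(i)$ its scientific field, and $g(i)$ its random-effect group, which is the paper itself when that paper has 30 or more pairs and a single shared group otherwise. The Gaussian assumptions are the usual ones for linear mixed effect models and are assumed rather than stated in the text.

```latex
\[
\mathrm{IMS}_{i}
  = \beta_0
  + \beta^{\mathrm{outlet}}_{o(i)}
  + \beta^{\mathrm{field}}_{f(i)}
  + u_{g(i)}
  + \varepsilon_{i},
\qquad
u_{g} \sim \mathcal{N}(0, \sigma_u^2),
\quad
\varepsilon_{i} \sim \mathcal{N}(0, \sigma^2),
\]
\[
\beta^{\mathrm{outlet}}_{\text{General News}} = 0
\quad \text{(reference category)}.
\]
```

Under this parameterization, a positive coefficient for Press Releases or Science & Technology indicates a higher expected IMS than General News, that is, findings that more closely match the paper.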
Results
Compared with General News, Science & Technology news outlets and Press Releases report findings that more closely match those from the original paper (Figure 3 shows the regression coefficients). This difference is likely due to some form of audience design where the journalist is writing for a more science-savvy readership in the latter two, whereas General News journalists must more heavily paraphrase the results for lay people.

RQ2: Comparing Social Media Accounts
Social media play an important role in disseminating scientific findings (Zakhlebin and Horvát, 2020), so which factors affect the presentation of scientific information on social media becomes an important question. Here, we focus on the types of Twitter users who tweet about scientific findings. Based on 182K matched tweets and paper findings, we again build a linear mixed effect regression model to predict IMS. We include fixed effects of (1) whether the account is run by an organization, as inferred using M3 (Wang et al., 2019), (2) whether the account is verified, (3) the number of followers and following, both log-transformed, and (4) the account age in years. We use the same field fixed effects and paper random effects as in RQ1.

Results
The type of user strongly influences how faithful the tweets are to the original findings (Figure 4). Accounts from organizations tend to be more faithful to the original paper findings, which could be due to intentional actions of image management to build trust (Saffer et al., 2013). Surprisingly, verified accounts were far more likely to change information away from its original meaning; accounts with more followers showed the same trend. Given their prominent roles in Twitter communication (Bakshy et al., 2011; [...] et al., 2014), multiple mechanisms may explain this gap, such as adding more commentary or trying to translate original scientific findings to lay language to make the findings easier to understand.
Appendix L shows the details of regression results.

RQ3: What Information Changes
Most studies on scientific misinformation focus on paper titles and abstracts (e.g., Sumner et al., 2014), which cannot fully reflect the information presented in the full papers. Analyzing the information change of findings paired from all sections of papers could help to better understand the mechanisms behind scientific misinformation and develop strategies to reduce it. We use the same 1.1M finding pair dataset as RQ1 and analyze what information might have changed using two models trained for changes in scientific communication: identifying exaggerations (Wright and Augenstein, 2021b) and certainty (Pei and Jurgens, 2021). See Appendix H for more details on the exaggeration detection task.

Results
Journalists tend to downplay the certainty and strength of findings from abstracts (Figure 5), mirroring the results of Pei and Jurgens (2021). However, this pattern does not persist for findings in other parts of papers, especially the limitations. Existing studies suggest that journalists might fail to report the limitations of scientific findings (Fischhoff, 2012), and our results here suggest that findings presented in limitations are more likely to be exaggerated and overstated. However, it is also possible that scientists may adopt different discourse strategies for different parts of a paper (Clark, 2013). Nonetheless, our result underscores the necessity of analyzing the full text of a paper when studying science communication.

Conclusion
Faithful communication of scientific results is critical for disseminating new information and establishing public trust in science. Given the challenge of, and occasional failures in, communicating science, new resources and models are needed to evaluate how science is reported. Here, we introduce SPICED, a new science communication paraphrase dataset labeled with information similarity. Extensive experiments demonstrate that models can predict the degree to which two reports of a scientific finding have the same information, but that this is a challenging task even for current SOTA pre-trained language models. In downstream applications, we show that SPICED improves model performance for evidence retrieval for scientific fact checking; and, using the trained model to perform a large-scale analysis of information change in science communication, we show systematic behaviors in how different people and news outlets faithfully convey scientific results. Data, code, and pre-trained models are available at http://www.copenlu.com/publication/2022_emnlp_wright/.

Acknowledgements
This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 801199 and a Rackham Graduate Student Research Grant at the University of Michigan.

Limitations
We note three limitations of our study. First, our data and analysis of social media are limited to only one platform, Twitter, and include only tweets directly linked to the original paper, as indicated through Altmetric. While Twitter is among the largest social media platforms and is the most common in the Altmetric data, our data potentially omit other kinds of scientific communication about papers that do not directly link to a paper, or tweets that link to a paper that cannot be easily resolved to a DOI (e.g., linking to a PDF hosted on a personal website). Other types of tweets may be omitted from our dataset, such as those written in a thread or in a tweetorial about a paper (Gero et al., 2021), which may include additional tweets that describe a paper's findings. While our models would likely still be able to effectively analyze such tweets, these additional forms of scientific communication could add new variety. We leave identifying and collecting such tweets to future work.
Second, our study focuses on only four large scientific fields. While these fields do cover a broad selection of papers, we were unable to annotate additional fields due to annotation budget and limitations from the Prolific platform. On Prolific, not all potential domains had sufficient numbers of qualified annotators (we required at least a Bachelor's degree in the domain), and the number of unique surveys to run scaled linearly with the number of domains, creating a significant human overhead. However, we will open source our annotation interface and pipeline, and we encourage further efforts to build a larger dataset across more scientific domains.
Finally, while our models achieve moderately high performance at inferring the information matching (Figure 2), performance is not perfect, which potentially limits their usefulness in downstream models and tasks. While we show the data is still useful in training for related tasks (§5) and a trained model can be used to identify systematic behavior by types of users and outlets (§6), more accurate models would likely be needed to identify trends in finer-grained settings, such as looking at the behavior of a specific outlet. For this reason, we have kept our analyses at a higher level (e.g., outlet categories).

Ethics and Impacts
Miscommunication of scientific information can have negative impacts on many aspects of our society. Our study contributes to a large research program on the science of science communication (National Academies of Sciences, Engineering, and Medicine, 2017). Our dataset and model could be used to keep track of information change in science communication, enable large-scale analysis to understand the current science communication ecosystem, and ultimately help to facilitate better and more effective science communication.
Crowdsourcing ethics. Annotating paired findings requires deep attention and may lead to annotator burnout. We carefully designed our annotation pipeline to provide a good annotation experience for the annotators. We designed a user-friendly Web-based annotation interface that allows annotators to annotate using keyboard shortcuts. All annotators are encouraged to leave comments and answer several questions about their annotation experience. More than 95% of the annotators are satisfied with their annotation experience, and many suggest that our study helps them to better understand the science communication process and that our annotation interface makes their task easier.

Figure 6: Distribution of the cosine similarity between findings extracted from news articles about particular scientific papers. Cosine similarity is measured between the embeddings produced for both findings using SBERT (Reimers and Gurevych, 2019).

A Information Change vs. Semantic Similarity
We wish to highlight key differences between information change and semantic similarity, particularly with an eye to what makes the task introduced in SPICED difficult compared to semantic similarity scoring. To illustrate this, we present a sample of pairs in STSB that have the highest similarity score of '5' (Table 4) alongside samples in SPICED that have an IMS of 5 (Table 5).
In STSB, for a pair to be perfectly similar from a semantics perspective, the two sentences must be exactly equivalent in meaning as a whole. This is not the case with our task. For the information change task, pairs are highly similar even if some aspects of the semantics of the sentence are changed. For example, in the first sample of Table 5, there is a semantic difference between the two sentences: the second in the pair discusses "being intrigued" by the finding, while the finding itself is shared between the pair. This also makes the task extremely difficult: a model must learn to compare only the salient scientific facts between the pair of sentences, as opposed to the entire meaning of each sentence.

B Pilot Annotation Details
For the pilot, we use 20 pairs from each of 20 cosine similarity score bins in increments of 0.05 starting from 0. In other words, we have 20 bins with score ranges 0.0-0.05, 0.05-0.1, ..., 0.9-0.95, 0.95-1.0. This results in 400 samples to annotate. The score distribution from 7,392,690 pairs from 3,525 source papers which we use for sampling is given in Figure 6. Each sample is annotated by two of the authors of the study with a binary label of "matching" vs. "not matching", yielding a Krippendorff's alpha of 0.73.

Figure 7: Number of samples per bin rated as matching vs. not matching (samples limited to those where both annotators agreed on the label). Most matching samples come from higher similarity bins, while more difficult samples come from the middle bins.
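The pilot binning described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the input format of (pair id, cosine similarity) tuples, and the fixed random seed are all assumptions made for the example.

```python
import random
from collections import defaultdict


def bin_index(score, width=0.05, n_bins=20):
    """Map a cosine similarity in [0, 1] to one of n_bins equal-width bins."""
    # The small epsilon keeps scores that sit on a bin edge (e.g. 0.15)
    # from falling into the lower bin because of floating-point error.
    idx = int(score / width + 1e-9)
    return max(0, min(idx, n_bins - 1))


def sample_per_bin(scored_pairs, per_bin=20, seed=0):
    """Draw up to per_bin pair ids from every similarity bin.

    scored_pairs: iterable of (pair_id, cosine_similarity) tuples
    (a hypothetical input format chosen for this sketch).
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for pair_id, score in scored_pairs:
        bins[bin_index(score)].append(pair_id)
    return {
        b: rng.sample(ids, min(per_bin, len(ids)))
        for b, ids in sorted(bins.items())
    }
```

With 20 bins and 20 pairs per bin, a sufficiently large pool yields the 400 pilot samples mentioned in the text.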
The number of positive samples per bin from the pilot study is given in Figure 7. We see here that bins with a cosine similarity below 0.65 tend to have very few positive samples, and only above 0.8 do we start to see many positive samples in the bins. Almost all samples above 0.9 are matching, and the only unmatched pairs appear to be instances of SBERT failing, since the matched pairs are almost exactly copied text. Additionally, this histogram indicates that the base rate of positive matching findings is low, as the overall distribution of samples in the high cosine similarity region, where most of the matches exist, is small. At the same time, we note that some of the matches we find in the lower cosine similarity regions constitute quite interesting samples; for example, the following pair has a cosine similarity of 0.41.
Paper finding: For cases comparing a drone and a vehicle carrying a single package over similar distances, for example, a customer picking up a package from a retail store, the drone is clearly a lower-impact solution. (Stolaroff et al., 2018)
News finding: But if you forgot that essential ingredient for tonight's dinner, our findings suggest it's much better to have the grocery store send it to you by drone rather than to take your car to the store and back.6
Both sentences are talking about the same finding, that drone delivery is more efficient over short distances than using a car, but in entirely different ways. From this, it is clear that simply using semantic text similarity is insufficient for solving this task, and we should include some of these lower similarity samples in our annotation. We, therefore, propose the following sampling scheme in order to balance the number of annotations we can acquire, the yield of positive samples, and the sample difficulty:
• Label all samples with a cosine similarity below 0.4 as unmatched.
• Label all samples above 0.9 with a Jaccard index above 0.5 as matching.
• Sample an equal number of pairs from each 0.05 increment bin between 0.4 and 0.9 for human expert annotation.

6 https://www.enbridge.com/energy-matters/news-andviews/delivering-packages-with-drones-might-be-good-forthe-environment

Table 4 (STSB pairs with a similarity score of 5)
Sentence 1 | Sentence 2
The polar bear is sliding on the snow. | A polar bear is sliding across the snow.
A plane is taking off | An air plane is taking off
A dog rides a skateboard | A dog is riding a skateboard
A man is playing the drums | A man plays the drum

Table 5 (SPICED pairs with an IMS of 5)
Sentence 1 | Sentence 2
Higher-income professionals had less tolerance for smartphone use in business meetings. | We are intrigued by the result that professionals with higher incomes are less accepting of mobile phone use in meetings.
If we allow people to retract recently posted comments, then we may be able to minimize regret from posting in the heat of the moment. | Allowing users to retract recently posted comments may help minimize regret.
Papers with shorter titles get more citations #science #metascience #sciencemetrics | Our analysis suggests that papers with shorter titles do receive greater numbers of citations.
Low levels of self-esteem and poor emotional processing skills were significantly correlated with gang involvement, as were low levels of parental monitoring, poor parental communication and housing instability. | Major findings also indicated that low levels of parental monitoring, poor parental communication and housing instability were significantly associated with gang involvement.
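The three-way sampling scheme proposed in this appendix can be sketched as a single routing function. This is an illustrative sketch only: the function name and return format are hypothetical, and the "unassigned" outcome for pairs above 0.9 cosine similarity whose Jaccard index is not above 0.5 is our assumption, since the text does not say how such pairs are handled.

```python
def route_pair(cosine, jaccard, low=0.4, high=0.9, jaccard_min=0.5, width=0.05):
    """Decide how a candidate pair is handled under the sampling scheme.

    Returns a (decision, bin) tuple where decision is one of
    "unmatched", "matching", "annotate", or "unassigned", and bin is the
    index of the 0.05-wide bin between low and high (None otherwise).
    """
    if cosine < low:
        # Very dissimilar pairs are labeled automatically.
        return ("unmatched", None)
    if cosine > high:
        if jaccard > jaccard_min:
            # Near-copies are labeled automatically.
            return ("matching", None)
        # Assumption: the source does not specify this case.
        return ("unassigned", None)
    # Remaining pairs go into equal-width bins for human annotation.
    n_bins = round((high - low) / width)
    idx = min(int((cosine - low) / width + 1e-9), n_bins - 1)
    return ("annotate", idx)
```

An equal number of pairs would then be drawn from each "annotate" bin, so that the drone example with cosine similarity 0.41 lands in the lowest annotated bin rather than being discarded.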

C Experimented annotation
We experimented with two annotation schemas: a binary schema where the annotators are asked to label "whether the two sentences are discussing the same scientific finding" with Yes or No, and a Likert schema where the annotators are asked to label if "The information in the findings is...":
• 1: Completely different
• 2: Mostly different
• 3: Somewhat similar
• 4: Mostly the same
• 5: Completely the same
We ran several pilots using the two annotation schemas, and the Likert schema led to higher inter-annotator agreement (0.45 Krippendorff's alpha) than the binary schema (0.21 Krippendorff's alpha). Therefore, we adopt the 5-point Likert schema for the annotation.

D Full Annotation Instructions
Annotation was performed using Prolific workers who labeled using POTATO (Pei et al., 2022). The annotation interface setup is available at https://github.com/davidjurgens/potato/tree/master/example-projects/match_finding which includes all the following instructions as well.
Task description: The task is to label to what degree two sentences have the same information. The information in the sentences is scientific findings. Here, a scientific finding is a statement that describes a research output of a scientific study, such as a result, conclusion, product, etc. You should rate how similar the findings are; you can ignore extra information like "The researchers showed...", "In vivo experiments demonstrated..." etc. For example, in the sentence "After controlling for weight and age, researchers found that overconsumption of sugar is linked with an increase in diabetes," the information in the finding is "overconsumption of sugar is linked with an increase in diabetes". Some sentences may have no findings or multiple findings, so use your best judgment about what are the core findings being said.
You will rate this on a 5-point scale, where each level means the following:
1. The information in the findings is completely different
• Sentences in this category have findings which say completely different information
• The sentences may be on totally different topics
  - Overconsumption of sugar causes diabetes
  - Regular exercise improves heart health
• There may be some overlap in key words used between the two sentences, but the actual information is completely different
  - Chocolate contains a lot of sugar, and therefore can have an effect on weight.
  - Overconsumption of sugar leads to diabetes.
2. The information in the findings is mostly different
• The findings may talk about the same topic, but the actual information is mostly different; for example, these sentences convey mostly different information even though they talk about the same topic:
  - Overconsumption of sugar causes diabetes
  - Sugar is good for your health
• There could be a link between the two findings, but the information conveyed is still different
  - Overconsumption of sugar increases blood glucose levels
  - High blood glucose over time increases the risk of developing diabetes
3. The information in the findings is somewhat similar
• The findings are discussing relevant research outputs but there are some differences in the information conveyed. Here the difference is that (i) talks about the relationship between overconsumption of sugar and diabetes and (ii) describes how genetics plays a role in overconsumption of sugar
  - Overconsumption of sugar causes diabetes
  - Overconsumption of sugar might be genetically determined
4. The information in the findings is mostly the same
• In this case there may be some changes in e.g. the level of generality. Additionally, one sentence may go into more detail than the other and add additional context, but the information is largely the same
• Here the two findings have the same information but at different levels of generality:
  - A link between sugar and diabetes was found
  - Overconsumption of sugar is associated with the onset of diabetes
• Here both sentences have the same core finding, but one sentence goes into more detail
  - The researchers found that overconsumption of sugar leads to diabetes
• Note that there can be changes in e.g. the level of certainty or the strength of the information.
  - Overconsumption of sugar leads to diabetes.
  - It is likely that there is a link between overconsumption of sugar and the onset of diabetes.

E Final dataset details
Figure 9 shows the IMS distribution in SPICED.
Figure 10 shows the IMS distribution for annotated pairs in SPICED. Figure 11 shows the IMS distribution for each split. We measure various aspects of lexical richness between the different domains of the data in Table 6.
Average Normalized Edit Distance. We calculate the normalized edit distance as follows:
$$
\frac{1}{|D|} \sum_{(s_1, s_2) \in D} \frac{d(s_1, s_2)}{\max(|s_1|, |s_2|)}
$$
where |D| is the size of the dataset, (s_1, s_2) is a sentence pair, and d is the edit distance.
Jaccard Index. The Jaccard index is calculated based on the overlap of the members of two sets (e.g. the words in two sentences X and Y):
$$
J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}
$$
The cosine similarity between two vectors a and b is calculated as:
$$
\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
$$
which is their dot product divided by the product of their lengths.
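The three measures defined above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the edit distance is taken to be character-level Levenshtein distance, and the Jaccard index is computed over lowercased whitespace-separated tokens; the source does not specify either choice.

```python
import math


def edit_distance(s1, s2):
    """Levenshtein distance between two strings (character level, assumed)."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def avg_normalized_edit_distance(pairs):
    """Mean of d(s1, s2) / max(|s1|, |s2|) over all sentence pairs."""
    total = 0.0
    for s1, s2 in pairs:
        longest = max(len(s1), len(s2), 1)  # guard against two empty strings
        total += edit_distance(s1, s2) / longest
    return total / len(pairs)


def jaccard_index(x, y):
    """|X ∩ Y| / |X ∪ Y| over the word sets of two sentences."""
    set_x, set_y = set(x.lower().split()), set(y.lower().split())
    union = set_x | set_y
    return len(set_x & set_y) / len(union) if union else 0.0


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(u * v for u, v in zip(a, b))
    norm_a = math.sqrt(sum(u * u for u in a))
    norm_b = math.sqrt(sum(v * v for v in b))
    return dot / (norm_a * norm_b)
```

In practice the cosine similarity would be applied to sentence embeddings such as those from SBERT; the list-of-floats interface here is only for illustration.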

Mean Squared Error
The mean squared error between two lists of numbers y and \hat{y} of length n is calculated as:
$$
\mathrm{MSE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$
Mean Average Precision. The mean average precision in ranking takes the average Precision@k (P@k) for every relevant sample in a ranked list. First, P@k is calculated as follows:
$$
P@k = \frac{1}{k} \sum_{i=1}^{k} \mathrm{rel}(i)
$$
where rel(i) is 1 if the item at rank i is relevant and 0 otherwise.

[...] -v2 - 22,713,216 parameters

MPNet. This is the same setup as in MiniLM but using MPNet as the base network (Song et al., 2020). MPNet is trained using a permuted language modeling (PLM) objective with position information as input to achieve the best of both worlds between MLM and PLM. The base network is used in the SBERT setup, where it is further fine-tuned on the same dataset and same task as with MiniLM.
Huggingface model name (sentence transformers): all-mpnet-base-v2 - 109,486,464 parameters

Paraphrase Detection. This is a paraphrase detection model based on RoBERTa used in Nighojkar and Licato (2021). The model is trained on the adversarial paraphrase dataset introduced in that paper.
Huggingface model name (sentence transformers): ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli - 124,647,170 parameters

SciBERT. SciBERT is the original BERT model trained using MLM on a large set of scientific papers from Semantic Scholar (Lo et al., 2020).
Huggingface model name (sentence transformers): allenai/scibert_scivocab_uncased - 109,920,514 parameters

CiteBERT. CiteBERT is SciBERT further fine-tuned on the CiteWorth dataset for the task of citation detection, which predicts if a given sentence requires a citation or not (Wright and Augenstein, 2021a).

H Exaggeration Detection
The problem of scientific exaggeration detection was studied by Wright and Augenstein (2021b). The basic task is: given a pair of scientific findings (e.g. a reference finding from a paper and its counterpart in a news article), determine if one finding is exaggerating the other finding. More formally, the task focuses on differences in the causal claim strength of the two findings, where the claim strength can take on one of four values:
• 0: No statement of relationship
• 1: Correlational statement (e.g. "X is associated with Y")
• 2: Conditional causal statement (e.g. "X might cause Y under circumstance Z")
• 3: Causal statement (e.g. "X causes Y")
Wright and Augenstein (2021b) curate data and build models for performing the exaggeration detection task in two different settings: as predicting the individual claim strengths and comparing them, and as an inference task where a model is fed both findings and asked to predict if the reference finding is being exaggerated, downplayed, or faithfully represented by its counterpart. We use the best-performing model from their paper, which is a multi-task few-shot learning model based on pattern exploiting training (PET) called MT-PET. In particular, we use the model for strength classification, which has seen 4,500 individual findings labeled for claim strength and 200 pairs labeled for exaggeration.
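The first setting described above, predicting the two claim strengths and comparing them, reduces to an ordinal comparison once the strengths are known. The sketch below illustrates that comparison step only; the enum and function names are hypothetical, the integer coding of the two causal levels as 2 and 3 is our assumption, and the strength classifier itself (MT-PET in the paper) is not reproduced here.

```python
from enum import IntEnum


class ClaimStrength(IntEnum):
    """Ordinal causal claim strength of a finding (values 2 and 3 assumed)."""
    NO_RELATIONSHIP = 0     # no statement of relationship
    CORRELATIONAL = 1       # "X is associated with Y"
    CONDITIONAL_CAUSAL = 2  # "X might cause Y under circumstance Z"
    CAUSAL = 3              # "X causes Y"


def exaggeration_label(reference, counterpart):
    """Compare the counterpart's claim strength against the reference's.

    reference: strength of the finding in the paper
    counterpart: strength of the matched finding (e.g. in a news article)
    """
    if counterpart > reference:
        return "exaggerated"
    if counterpart < reference:
        return "downplayed"
    return "faithful"
```

For example, a paper finding classified as correlational whose news counterpart is classified as causal would be labeled "exaggerated" under this comparison.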

I Scientific Text Parser
We fine-tuned a RoBERTa model over 200K self-labeled abstracts from PubMed. The model is trained to predict five labels: BACKGROUND, CONCLUSIONS, METHODS, OBJECTIVE, and RESULTS. We split the data 8:1:1 and fine-tuned the RoBERTa model for 1 epoch. The model attains 0.92 F1 on the test set.

J Extended Benchmarking
Tables with extended benchmarking results can be found in Table 7 to Table 11.

K Error Examples
Examples of errors which our best models made on ⟨tweet, paper⟩ pairs can be found in Table 12 and Table 13.

L Regression details
Table 15 shows the regression table for RQ1. Table 16 shows the regression table for RQ2.

Paper Finding | News Finding | Prediction
Increase in the body size of dicynodonts across the Late Triassic may have been driven by selection pressure to reach a size refuge from large predators (24). | Researchers believe selection pressures - potentially to protect themselves from larger predators - may have been the driver behind their giant size, but more research will be needed to understand Lisowicia and its place in the evolutionary tree. | 3.0008
The best option among the three is the EPS container with the lowest impacts across the 12 categories. | The study found that the styrofoam container was the best option among the disposable containers across all the impacts considered, including the carbon footprint. | 3.1120
As media coverage started to increase, water demand decreased and the models with media correctly captured the downward trend, but the models without media forecasted increasing demand. | Strikingly, the models also found that for every 100-article increase over a two-month period, there was an 11 percent to 18 percent decrease in demand for water. | 3.1537
For example, of the 63 negative precipitation years during 1896-2014, 15 of the 32 warm-dry years (47%) produced 1-SD drought, compared with only 5 of the 31 cool-dry years (16%) | Their analysis revealed that the years that were both warm and dry were about twice as likely to produce a severe drought as years that were cool and dry. | 3.2569
Our study shows that low-dose BPA and BPS exposure has physiological effects. | Although the levels were low, the scientists soon saw that both BPA and BPS caused changes in the brain development of the zebra fish embryos. | 3.3331
Use of multiple prescription medications with these potential effects was associated with greater likelihood of concurrent depression. | About 15 percent of participants who simultaneously used three or more of these drugs were depressed. | 3.3692
We also found that renewal submission rate was the factor most predictive of sustained funding for either gender, and that gender differences in survival disappear when genders were matched on renewal submission rate and first year of funding. | On average, women submitted eligible grants for renewal 42% of the time and won funding 36% of the time, compared with 45% and 39%, respectively, for men. | [...]
[...] | After 12 months, the re-employment rate of smokers was 24 percent lower than that of nonsmokers. | 3.5151
This suggests behaviour consistent with moral licensing: participants who refrained from cheating at higher stakes seem to have subsequently licensed themselves to donate less to charity, thereby "balancing" their moral behaviour over time. | However those who cheated the least when tempted with high stakes were more likely to license themselves not to behave so charitably in another task. | 3.5481
Lack of Panx1 increases adipocyte hypertrophy and reduces adipocyte numbers in subcutaneous fat in vivo. | With both a normal diet, and a high-fat diet, a lack of Panx1 increases cell size. | 3.5618
Table 14: Borderline IMS Model prediction samples.We note that 3 appears to be a good threshold for matching, as pairs with an IMS over 3 tend to discuss the same scientific findings.

Figure 1: We are interested in measuring the information similarity of statements about scientific findings between different sources, including scientific papers, news, and tweets, shown here with real examples.The finding in this figure comes from Fang et al. (2016) and the news quote is from this Reuters story.
Figure 2: (a) Mean Squared Error (MSE, ↓ better) and (b) Pearson correlation (r, ↑ better) on the test set of SPICED. Grey = zero-shot transfer models, red = MLM models fine-tuned on SPICED, blue = SBERT models fine-tuned on SPICED. Results are averaged across 5 random seeds. Best results are given in bold.

Figure 3: Scientific findings covered by Press Release and SciTech outlets generally show fewer informational changes than findings presented in General Outlets.

Figure 4: Organizational Twitter accounts keep more original information from the paper finding while verified users and those with more followers change more information when tweeting about a scientific finding.

Figure 8: The annotation page of our crowdsourcing task.

Figure 9: Distribution of the final matching score in SPICED, which includes some pairs of scientific findings that are automatically labeled based on their extreme textual similarity (high or low), in addition to the annotated pairs.

Figure 10: Distribution of the final matching score for annotated pairs in SPICED.

Table 4: Samples of sentence pairs in STSB which have a similarity score of 5.

Table 5: Samples of sentence pairs in SPICED which have an IMS of 5.

Table 6: Various measures of lexical richness and diversity between findings in papers and other sources. RTTR is the root token-type ratio; MTLD is the measure of textual lexical diversity (McCarthy and Jarvis, 2010); HDD is the hypergeometric distribution diversity (McCarthy and Jarvis, 2010).