Measuring Sentence-Level and Aspect-Level (Un)certainty in Science Communications

Certainty and uncertainty are fundamental to science communication. Hedges have been widely used as proxies for uncertainty. However, certainty is a complex construct: authors express not only the degree but also the type and aspects of uncertainty in order to give the reader a certain impression of what is known. Here, we introduce a new study of certainty that models both the level and the aspects of certainty in scientific findings. Using a new dataset of 2167 annotated scientific findings, we demonstrate that hedges alone provide only a partial explanation of certainty. We show that both the overall certainty and individual aspects can be predicted with pre-trained language models, providing a more complete picture of the author's intended communication. Downstream analyses on 431K scientific findings from news and scientific abstracts demonstrate that modeling sentence-level and aspect-level certainty is meaningful for areas like science communication. Both the model and datasets used in this paper are released at https://blablablab.si.umich.edu/projects/certainty/.


Introduction
Expressing certainty about what is known is a necessary characteristic of scientific work as science involves producing knowledge about what was previously unknown (Friedman et al., 1999; Smithson, 2012). Given the natural aversion to uncertainty, existing studies have found that presenting uncertainty in science communications influences people's perception of scientific findings and trust in science (Gustafson and Rice, 2019; Fischhoff, 2012; Van Der Bles et al., 2020). Therefore, understanding how journalists and scientists communicate certainty and uncertainty is critical for understanding the current ecosystem of science journalism and further provides better guidance for uncertainty communication (National Academies of Sciences, Engineering, and Medicine, 2017).
"Arctic sea ice is definitely declining [Probability] at a rate of 13.3 [Number] percent per decade [Extent] and we believe that [Framing] the society may need to take action [Suggestion] to control global warming."
"Arctic sea ice is declining [Probability] at a rate of 13.3 [Number] percent per decade [Extent] and we believe that [Framing] the society may need to take action [Suggestion] to control global warming."
"Arctic sea ice could be declining [Probability] at a rate of 13.3 [Number] percent per decade [Extent] and we believe that [Framing] the society may need to take action [Suggestion] to control global warming."
"Arctic sea ice might be declining [Probability] at a rate of 13.3 [Number] percent per decade [Extent] and we believe that [Framing] the society may need to take action [Suggestion] to control global warming."
[Figure 1: Certainty is a multi-dimensional construct. The certainty of a scientific finding can be perceived holistically at the sentence level from its description (the example sentences above range from certain to uncertain). However, scientific findings may involve multiple aspects that may each be described as certain or uncertain (aspect-level certainty).]
Multiple studies in Linguistics, NLP, and the Science of Science literature have examined how certainty is expressed. These studies have modeled certainty in multiple ways, including epistemic modality (Vold, 2006), semantic uncertainty (Szarvas et al., 2012), verbal uncertainty (Hart and Childers, 2004), factuality (Saurí and Pustejovsky, 2009), and hedging (Hyland, 1996). In practice, most uses of uncertainty rely on hedging as a coarse characterization of the overall uncertainty (Farkas et al., 2010). However, as suggested by Rubin et al. (2006), certainty itself is a complex construct and has to be modeled from multiple dimensions. The complex and subjective nature of certainty makes annotation challenging, often resulting in moderate-to-low annotator agreement (Henriksson and Velupillai, 2010; Rubin, 2007), and motivating a better and more practical way to model and annotate certainty in text.
We propose to study certainty at the commonly used sentence level and, in parallel, introduce a new dimension of aspect-level certainty, providing a fine-grained description of how certainty is communicated in text. This approach is analogous to work in sentiment analysis that models both holistic valence (Meena and Prabhakar, 2007) and aspect-level valence (Schouten and Frasincar, 2015), and their interactions. Based on existing categorizations of certainty, we compile six aspects of scientific findings: NUMBER, EXTENT, PROBABILITY, FRAMING, CONDITION, and SUGGESTION. Following carefully designed annotation guidelines and after extensive annotator training, we introduce a new annotated corpus of 2200 scientific findings, sampled equally from news and scientific abstracts and annotated for both sentence-level and aspect-level certainty, with reliable inter-annotator agreement. Analysis of this dataset suggests that the number of hedges can only partially explain the variance in overall sentence-level certainty (Pearson's r=0.55). Therefore, to better model certainty in scientific findings, we fine-tune SciBERT (Beltagy et al., 2019) on the two tasks, achieving a Pearson's r of 0.63 for sentence-level certainty and an average binary F1 of 0.66 for aspect-level certainty.
Our paper offers the following three contributions. First, we provide the first dataset of scientific findings annotated with both sentence-level and aspect-level certainty and fine-tune neural language models to predict certainty in scientific findings. Second, using our best-performing model, we infer the sentence-level and aspect-level certainty for 431K scientific findings in news and abstracts and show that the sentence-level certainty of findings in abstracts is associated with journal impact factor and team size. Regression analysis reveals that low-impact journals and large teams often present scientific findings with higher sentence-level certainty. Third, using 6586 findings from abstracts paired with their descriptions in news, we find that news reports a finding with lower certainty than its corresponding description in the abstract. Fine-grained regression analysis over aspect-level certainty further reveals that journalists describe some key aspects with less certainty. Although some studies suggest that news reports tend to describe uncertain findings as more certain (e.g., Weiss and Singer, 1988; Fahnestock, 2009), our study indicates that journalists may actually play down the certainty when reporting scientific findings. Other work has proposed examining which aspects of language contribute to the perception of uncertainty. For example, Rubin et al. (2006) proposed a four-dimension model of certainty, including perspective, focus, timeline, and level, where each dimension contains several sub-categories. Following this line of work, we synthesize multiple approaches for measuring sentence-level and aspect-level certainty and propose a representative categorization.

Modeling Certainty
Sentence-level Certainty

Prior research has assumed that the level of certainty for a finding is presented, perceived, and further analyzed within one or several sentences (Holmes, 1982; Henriksson and Velupillai, 2010; Rubin, 2007). This aggregate perception represents a unified perception of the various information expressed in a given piece of text and is the primary judgment of certainty along a continuum from uncertain to certain (Rubin et al., 2006). The perception of this overall level of certainty is known to influence people's subsequent actions in many contexts (Corley and Wedeking, 2014; Wood and Eagly, 2009). Therefore, in modeling certainty, we include a sentence-level estimate of the certainty of a scientific finding's description. Here, we also synthesize prior approaches and propose six representative aspects of scientific findings that could involve certainty or uncertainty, with the goal of creating a comprehensive scheme that captures most of what is seen in knowledge-intensive corpora.

Aspect-level Certainty
NUMBER refers to certainty towards specific quantities. For example, "approximately 250 individuals participated in this study" is uncertain towards NUMBER. Numerical information is vitally important in science communication, as it has been found to be the best way to promote scientific understanding in situations like climate change (Budescu et al., 2009) and health (Peters et al., 2014a). Accordingly, the imprecision of numbers or the inaccuracy of calculations is usually considered a form of uncertainty (French, 1995). How to effectively communicate numerical information in scientific findings has been identified as one of the major challenges of science communication (Peters et al., 2014b). Identifying certainty regarding NUMBER in scientific findings could help us understand how journalists and scientists communicate certainty about numbers and inspire better ways to communicate this information.
EXTENT refers to certainty about the proportion/ratio of properties that make up an object/event or the extent of a change. For example, "This bridge is mainly composed of agate" and "We observe a moderate increase of suicide in Winter" involve uncertainty towards EXTENT. EXTENT can be described with numbers in certain situations. For example, "The average sea level across the world increased by approximately 30%" expresses extent via a number. However, unlike NUMBER, which focuses on specific quantities, EXTENT focuses on the components of an object or substance, or the extent of a change/effect. Previously, EXTENT was not explicitly proposed as a source of uncertainty, although some studies have brought up similar ideas. For example, French (1995) considers "Uncertainty about how much [the] impacts matter" as a form of uncertainty, and Phillips et al. (2009) propose uncertainty about the strength or validity of evidence about risks, which may not be described with specific quantities. Existing studies suggest that journalists may misreport the extent to which scientific findings are supported by evidence (Dixon and Clarke, 2013), motivating its inclusion here.
PROBABILITY refers to certainty about the probability that something will occur, has occurred, or is associated with another factor. For example, "This medicine could possibly cure cancer" and "A is possibly associated with B" involve uncertainty about PROBABILITY. PROBABILITY has been widely recognized as one major source of uncertainty (Howard, 1988; Mosleh and Bier, 1996; Politi et al., 2007), and how to communicate probabilities effectively has long been an important question in science communication (Budescu et al., 2012; Sinayev et al., 2015).
CONDITION refers to the situation where something depends on a specific condition, and the condition involves certainty or uncertainty (Szarvas et al., 2012). Scientific findings are often qualified by specific conditions under which the result is valid, which may themselves be certain or uncertain (Friedman et al., 1999). For example, "Cancer could be cured if the medicine can be made shelf-stable" is uncertain regarding CONDITION.
FRAMING refers to certainty about how scientists or journalists themselves frame or interpret the scientific finding. For example, "We suspect A has effects on B" involves uncertainty from the authors, while "We conclude that A has effects on B" frames the finding with conviction. This aspect is related to expressions of epistemic uncertainty (the speaker having or lacking knowledge; Szarvas et al., 2012; Fox and Ülkümen, 2011) and psychological uncertainty in the Psychology literature (Windschitl and Wells, 1996). In the news, journalists actively add their own interpretations of the original information (Lin et al., 2006), and different framing may further affect people's perception of the overall certainty of the presented information (Soni et al., 2014). Therefore, identifying certainty about FRAMING could help us better understand how journalists' framing affects people's perceptions of scientific findings.

SUGGESTION refers to certainty or uncertainty about the implications or future actions for the public or the science community. Scientific findings do not only describe facts, but can also communicate practical implications for people's daily lives (Batteux et al., 2021). For example, "Patients probably need more medicine to cure this disease" involves uncertainty regarding future actions. Uncertainty about SUGGESTION was previously identified as dynamic uncertainty in Szarvas et al. (2012).
A single scientific finding may include multiple aspects, each with its own certainty. For example, "The vaccine is effective in 76% of cases" is uncertain regarding PROBABILITY but certain about the specific NUMBER. Similarly, "The scientists need to do more research to understand the effect of A on B" indicates uncertainty about PROBABILITY but certainty about SUGGESTION.

Data
To study certainty, we construct a dataset of scientific findings reported in news and research articles. News data comes from Altmetrics, which tracks mentions of scientific articles in news outlets. We restrict our analysis to U.S.-based outlets where we could retrieve the full text of the article and where the DOI for the scientific article was recorded in the Microsoft Academic Graph (MAG), which provides metadata on the article (e.g., authors, abstract, and publication venue). Supplemental material §A contains additional details on preprocessing steps. A total of 128,942 news/article pairs were collected, spanning 273 different news outlets and 57,807 different scientific articles.
For scientific articles, we extract the findings from the abstract reported in the MAG using the abstract parser developed by Prabhakaran et al. (2016), which labels sentences as background, method, introduction, result, and conclusion. We use sentences labeled as result or conclusion in our analysis. For news, we adopt a heuristic approach and identify all sentences containing a discovery-related keyword (e.g., find, conclude). We retain the subordinate clause after the verb as the finding. Examples of findings produced by each method, as well as additional details, are reported in Supplemental Material §C. This process yields 608,694 unique scientific findings from abstracts and 106,612 unique scientific findings from news reports. Among the 128,942 news-paper pairs, 52,406 have identified findings from both the news article and the paper abstract.
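As a rough illustration of the news-side heuristic described above, the following sketch matches a discovery-related keyword and keeps the clause after the verb. The keyword list and the clause-extraction regex here are simplified assumptions; the paper's actual lexicon and extraction rules are given in its Supplemental Material §C.

```python
import re

# Illustrative subset of discovery-related keywords (assumption);
# the paper's full lexicon is larger.
DISCOVERY_VERBS = ["find", "finds", "found", "conclude", "concludes",
                   "concluded", "show", "shows", "showed"]

def extract_finding(sentence):
    """Return the clause after the first discovery verb, or None.

    A leading 'that' after the verb is skipped if present.
    """
    pattern = r"\b(" + "|".join(DISCOVERY_VERBS) + r")\b\s+(?:that\s+)?(.+)"
    m = re.search(pattern, sentence, flags=re.IGNORECASE)
    return m.group(2).strip() if m else None

extract_finding(
    "The researchers found that Arctic sea ice is declining at 13.3 percent per decade."
)
# -> "Arctic sea ice is declining at 13.3 percent per decade."
```

A production version would also need sentence segmentation and a syntactic parse to recover the complement clause reliably; this regex sketch only shows the shape of the heuristic.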

Annotating Certainty
We annotate scientific findings for both their sentence-level certainty and the presence and certainty of each of the six aspects. Given the subjectivity of perception, the same expression of certainty may evoke different perceptions (Druzdzel, 1989). Prior work on annotating uncertainty has generally reported low to moderate inter-annotator agreement (Henriksson and Velupillai, 2010; Rubin, 2007); for example, Rubin (2007) reports Cohen's κ=0.41 when annotating news certainty on a five-level Likert scale. To mitigate these challenges, we carefully designed the annotation procedures described in this section.

Annotation Setup Annotators were recruited from a US university and received an initial one-hour training session. All annotators are fluent in English and have extensive experience reading scientific news and research articles. Annotators who attained high IAA with our gold standard were retained and then went through four additional rounds of pilot annotation and discussion (two rounds for sentence-level and two for aspect-level) to build consensus. All annotators were paid $15/hr for training and annotation.
Annotation was performed in three phases. In the first phase, the initial data was sampled to be more balanced across levels of certainty. Markers of certainty are not equally distributed throughout scientific communication (Rubin et al., 2006); for example, 87% of the data labeled in Henriksson and Velupillai (2010) were found to be very certain. Therefore, for the initial data, we sampled 1000 findings equally from news and paper abstracts such that 50% of the findings contain no hedges, 35% contain one hedge, and 15% contain two or more. The hedge words are collected from Hyland (2005).
The annotators were first asked to rate how certain they perceived the finding on a six-point Likert scale as the sentence-level certainty. Aspect-based ratings were performed in a separate round so as not to bias annotators towards basing their sentence-level judgments on the aspects. For each aspect, annotators were asked to assess whether that aspect was present and, if so, whether the language for the aspect was certain or uncertain. For instances that are clearly not scientific findings, the annotators were instructed to apply the label BAD-TEXT. Each finding was rated by at least two annotators for sentence-level and, due to increased variance observed during training, three annotators for aspect-level. In this phase, 2349 sentence-level and 3209 aspect-level annotations were collected for 1000 findings. Annotators had high agreement on both the sentence-level and aspect-level tasks. For sentence-level, annotators attained a Krippendorff's α=0.67, which is substantially higher than the IAA for the closest comparable task (Rubin, 2007, Cohen's κ=0.41). The final sentence-level rating is computed as the average across all annotators' scores. For aspect-level certainty, the average Krippendorff's α is 0.57 for the six aspects, indicating moderate to high agreement. Supplemental Material §D contains more details about agreement scores. We take the majority label as the final label for aspect-level certainty.
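The two aggregation steps described above (averaging sentence-level scores, majority-voting aspect labels) can be sketched as follows; the tie-breaking behavior for aspect labels is an assumption, since the paper does not specify it.

```python
from collections import Counter
from statistics import mean

def aggregate_sentence_level(ratings):
    """Sentence-level certainty: average of the annotators' Likert scores."""
    return mean(ratings)

def aggregate_aspect_level(labels):
    """Aspect-level certainty: majority label across annotators.

    Ties are broken by first occurrence (an assumption).
    """
    return Counter(labels).most_common(1)[0][0]

aggregate_sentence_level([4, 5, 5])                          # -> 4.666...
aggregate_aspect_level(["Certain", "Certain", "Uncertain"])  # -> "Certain"
```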
In the second phase, given the label imbalance in both sentence-level and aspect-level certainty, we sampled additional data for annotation based on model predictions, using different sampling strategies for sentence-level and aspect-level certainty. For sentence-level certainty, we fine-tuned a roberta-base classification model and predicted sentence-level certainty for all the extracted findings. We then sampled 400 findings with low model confidence, equally from news and abstracts, for second-phase annotation. For aspect-level certainty, we fine-tuned a SciBERT classification model and sampled 600 findings to up-sample less frequent aspects, including SUGGESTION, EXTENT, CONDITION, and FRAMING.
Given that the first- and second-phase data do not reflect the distribution of certainty in a natural sample, we further randomly sampled 200 findings from all the extracted findings as the third phase. Because the annotators were capable of producing reliable annotations after the first phase, annotators independently annotated the 1200 findings in the second and third phases.

Results
The final annotated dataset contains 6958 labels for 2200 findings. After removing findings labeled as BAD-TEXT, we obtained 1551 findings labeled with sentence-level certainty and 1760 findings labeled with aspect-level certainty, among which 1144 findings are labeled with both. Supplemental §E presents examples and the distribution of the data. Previously, Szarvas et al. (2012) considered CONDITION one type of uncertainty; however, in the annotated data, less than 10% of CONDITION instances are labeled as uncertain. This difference indicates that previously proposed types of uncertainty may not be perceived as uncertain in knowledge-intensive corpora like scientific findings, and demonstrates the value of aspect-level certainty.

[Figure 3: Relative sentence-level certainty when each aspect is certain/uncertain. The overall certainty of scientific findings is mostly affected by PROBABILITY and SUGGESTION, and less affected by other aspects like NUMBER and EXTENT.]
To what degree do hedges capture certainty? Comparing the sentence-level certainty with the number of hedges (Figure 2, top) shows only a moderate correlation between hedging and certainty (r=0.55), despite their widespread use as a proxy. For example, "Further research is necessary to understand whether this is a causal relationship" contains zero hedges but explicitly expresses strong uncertainty towards the causal relationship, suggesting that many descriptions of certainty are not well captured by simple hedge-based lexicons. Further, authors vary in how frequently they employ hedges when describing the different aspects of certainty (Figure 2, bottom). This variance suggests that hedges are not equally effective as proxies for uncertainty across aspects.
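A minimal sketch of the hedge-based analysis above: count lexicon hedges per finding and correlate the counts with sentence-level certainty ratings. The hedge set here is an illustrative subset, not the full Hyland (2005) lexicon used in the paper.

```python
from math import sqrt

# Illustrative subset of hedge words (assumption); the paper uses the
# lexicon from Hyland (2005).
HEDGES = {"may", "might", "could", "possibly", "probably",
          "approximately", "roughly", "suggest", "suggests"}

def count_hedges(finding):
    """Count hedge tokens in a finding (simple whitespace tokenization)."""
    return sum(tok.strip(".,;:").lower() in HEDGES for tok in finding.split())

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Applying `pearson_r` to hedge counts versus the annotated sentence-level certainty scores is what yields the paper's reported r=0.55.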
What is the association between aspect-level and sentence-level certainty? Figure 3 shows the relative sentence-level certainty for each aspect.

Predicting Certainty
In this section, we build models to predict sentence- and aspect-level certainty in scientific findings to support downstream analyses of certainty. We test two linear baseline models and two deep-learning models based on neural language models. As linear baselines, we include a model using bag-of-words (BoW) features and another based on the frequency of each hedging word. For the neural models, we use SciBERT (Beltagy et al., 2019) and RoBERTa (Liu et al., 2019) as the base models and fine-tune them on our annotated dataset. For both sentence-level and aspect-level certainty, the data labeled in phases 1 and 2 are split 8:1:1 into training, validation, and test sets. To better reflect expected performance under generalization, the test set combines the random sample annotated in phase 3 with the 10% test partition from phases 1 and 2. For all models, we also report performance on the random test set alone to demonstrate performance on natural samples. Supplemental Section §B describes additional training details.
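The 8:1:1 split described above can be sketched as follows; the seed and shuffle are assumptions, and, as noted, the paper's final test set additionally pools in the phase 3 random sample.

```python
import random

def split_811(examples, seed=0):
    """Split annotated examples 8:1:1 into train/validation/test.

    Seed and shuffling are assumptions; the paper's final test set also
    pools in the Phase 3 random sample.
    """
    rng = random.Random(seed)
    items = list(examples)
    rng.shuffle(items)
    n = len(items)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```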

Sentence-level Certainty
We formulate sentence-level certainty prediction as a regression task for all the models, and Table 1 shows the model performance. We find that a linear weighting of the hedges is unable to predict the overall sentence-level certainty when tested on the random sample, largely due to the relatively low ratio of findings containing hedges. In comparison, linear regression with bag-of-words features better captures overall certainty (Pearson's r=0.55), suggesting that cues beyond hedges also affect the overall certainty in the natural sample. Compared with the two baselines, the two neural models based on pre-trained language models achieve better performance. Both neural models were run five times with different random seeds, showing that the performance improvements over the baselines are statistically significant (p<0.05, paired t-test). The SciBERT model performs slightly better than the RoBERTa-base model, suggesting that domain-specific pre-training is helpful, though the difference is not statistically significant. We use the best-performing SciBERT model (r=0.70) as the regressor for sentence-level certainty in the following analyses.
Aspect-level Certainty

For each aspect, we predict whether it is Not-Present, Certain, or Uncertain in a scientific finding. For the two neural models, we use a shared pre-trained language model but independent classification heads for each aspect. Figure 4 shows the binary F1 scores for predicting aspect-level certainty. The SciBERT model consistently outperforms the other baselines across the six aspects, indicating that aspect-level certainty prediction requires more domain-specific and contextual information. However, given that uncertainties about CONDITION and SUGGESTION are relatively rare in the annotated dataset, the SciBERT model does not capture uncertainty about CONDITION well and shows high variance when predicting uncertainty about SUGGESTION. In the following analyses, we use the best-performing SciBERT model (mean F1=0.71) as the classifier for aspect-level certainty.
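Since aspect-level evaluation reports per-class (binary) F1 over the three labels, a small self-contained scorer can be sketched as follows; this is a generic binary-F1 computation, not the paper's exact evaluation script.

```python
def binary_f1(gold, pred, positive):
    """Binary F1 for one class (e.g., 'Uncertain') against all others."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gold = ["Certain", "Uncertain", "Certain", "Not-Present"]
pred = ["Certain", "Certain", "Certain", "Not-Present"]
binary_f1(gold, pred, "Certain")  # ≈ 0.8
```

Averaging `binary_f1` over the certain/uncertain classes of the six aspects gives the kind of mean F1 figure reported above.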

Certainty in Science Communications
Certainty is a core aspect of science communication (Friedman et al., 1999), and presenting certainty in different forms (i.e., aspects) may further affect people's perceptions and future actions on a series of issues including climate change (Fortner et al., 2000) and the Covid-19 vaccine (Batteux et al., 2021). Our models and dataset enable us to study how journalists and scientists present certainty in science communications. Here, we focus on the following five research questions. RQ1: Are findings in science news more certain than those in paper abstracts? RQ2: Do journalists and scientists differ in their use of aspect-level certainty? RQ3: Does aspect-level certainty in abstract findings affect the sentence-level certainty in news findings? RQ4: Does journal impact factor affect the certainty of scientific findings and how they are covered in news reports? RQ5: Does team size affect the certainty of scientific findings and how they are covered in news reports?

RQ1-RQ3 focus on changes to the description of certainty in the science communication process. While studies have found that news reports tend to describe uncertain findings as more certain (Weiss and Singer, 1988; Fahnestock, 2009), some studies suggest that news articles may also add more uncertainty to scientific findings in some cases (Friedman et al., 1999). Our model and dataset allow us to study (1) if certainty is changed, (2) if so, which aspects are changed, and (3) what drives the change. RQ4-RQ5 examine external factors that may affect how journalists and scientists present certainty. We focus on (4) the prestige/quality of the journal, asking whether lower- or higher-quality journals differ in how certain their findings are, and (5) the size of the research team.

Data and method For RQ1 and RQ2, to control for the effects of the content of the finding, we propose a method to match the same scientific finding in news and paper abstracts.
For each extracted finding in a paper abstract, we identify the paraphrased findings in the corresponding news article reporting on that paper. We first remove all punctuation and stop words and then stem all the words in each sentence. Next, we calculate the word overlap and Jaccard similarity between each pair of findings in the news and abstract. We manually evaluated the matched findings and set word overlap >= 3 and Jaccard similarity > 0.3 as the thresholds. Based on the findings from 52,406 news-paper pairs, we identify 6,586 unique finding pairs from news and abstracts. We manually annotated 70 matched finding pairs, of which 63 (90%) refer to the same scientific finding, indicating the high precision of our matching process. Supplemental Material §F shows a random sample of the matched finding pairs. We construct separate regressions predicting the sentence-level (RQ1) and each aspect-level (RQ2) certainty in findings from the source of the finding (i.e., news or abstract). We further control for field, author and affiliation ranking, journal impact, finding length, and Flesch reading ease score. For RQ3, we construct a regression predicting the overall sentence-level certainty in news findings from the aspect-level certainty in the corresponding finding in the paper abstract. In addition to all the IVs above, we further control for the news outlet and the sentence-level certainty of the finding in the abstract.
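The matching procedure above can be sketched as follows. The stopword list is an illustrative subset and stemming is omitted for brevity; the thresholds mirror the ones reported (word overlap >= 3, Jaccard > 0.3).

```python
import re

# Illustrative stopword subset (assumption); the full pipeline also stems words.
STOPWORDS = {"the", "a", "an", "of", "in", "is", "are", "and", "to",
             "that", "at", "by", "we"}

def content_tokens(sentence):
    """Lowercase, strip punctuation, and drop stopwords."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return {w for w in words if w not in STOPWORDS}

def is_match(news_finding, abstract_finding,
             min_overlap=3, min_jaccard=0.3):
    """Apply the paper's thresholds: word overlap >= 3 and Jaccard > 0.3."""
    a, b = content_tokens(news_finding), content_tokens(abstract_finding)
    union = a | b
    overlap = len(a & b)
    jaccard = overlap / len(union) if union else 0.0
    return overlap >= min_overlap and jaccard > min_jaccard
```

As the limitations section notes, token-overlap matching misses paraphrases that use different words, which is why the matched pairs were also validated manually.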
For RQ4 and RQ5, we construct a regression predicting the sentence-level certainty in 265,758 findings presented in 55,178 paper abstracts using journal impact factors and the number of authors. Recognizing its limitations (Kurmis, 2003), we use journal impact factor as a proxy for the quality of science, following prior use (Saha et al., 2003). We include controls for the field of research, author and affiliation ranking extracted from the Microsoft Academic Graph (Wang et al., 2019), finding length, and Flesch reading ease score to remove potential confounds. To test the connection between certainty in news findings and external factors, we construct another regression to predict the level of uncertainty in 72,013 findings presented in 27,000 news articles. Besides all the IVs regarding abstracts and authors, we also control for the outlet to remove potential confounds.
RQ1: Are findings in news more certain than those in paper abstracts? The regression analysis (details in the Supplemental Material) shows the opposite: findings in news are less certain than findings in abstracts, even when controlling for the content and many contextual factors.

[Figure 5: Controlling for multiple factors in RQ2 (e.g., topic, news outlet), the marginal effects show the relative probability of finding each aspect described in the abstract (left) versus news (right), revealing that some aspects like numeric certainty are much more likely to be described in one source.]

RQ2: Do journalists and scientists differ in their use of aspect-level certainty? Yes. As shown in Figure 5, findings in abstracts are associated with more certainty about FRAMING and NUMBER, while findings in news are associated with uncertainty about PROBABILITY, EXTENT, and NUMBER, indicating that journalists tend to play down the certainty of some aspects, especially numeric information. Existing studies suggest that laypeople with lower numeracy tend to focus more on narrative than on numeric information (Dieckmann et al., 2009); one potential explanation for this difference is that journalists intentionally simplify numerical information with hedges like "roughly" instead of detailed numbers to better engage lay audiences. Further, journalists are more likely to discard expressions of the scholars' own uncertainty (FRAMING) when presenting results, potentially aiming to make the work seem more objective. Our results suggest a potential mechanism for the lower sentence-level certainty in news compared with abstracts.

RQ3: Does aspect-level certainty in abstract findings affect the sentence-level certainty in news findings? As shown in Figure 6, uncertainty about NUMBER and FRAMING in abstracts is associated with decreased certainty in news findings, indicating that these uncertainty expressions are readily perceived by journalists. However, we also find that certainty about SUGGESTION in abstracts is associated with decreased certainty in news, suggesting that journalists may play down the certainty when presenting findings involving suggestions or future actions, even when the original description is certain.
While existing studies suggest that journalists may exaggerate the potential benefits of science (Wilson et al., 2010), our result indicates that journalists can be very careful when reporting findings involving suggestions or future actions.

RQ4: Are findings in high-impact journals more certain than findings in low-impact journals? No. As shown in Figure 7, findings in lower-impact journals are written with the highest level of certainty, while findings appearing in relatively higher-impact journals are described with comparatively less certainty. One potential explanation for this phenomenon is that high-quality papers published in journals with stricter reviewing processes (journals with higher impact factors generally have longer reviews than low-impact journals; Publons, 2018, p. 36) present certainty more precisely, which leads to lower overall certainty compared with findings in low-impact journals. By comparison, the certainty of findings written by journalists is not significantly associated with journal impact factors, suggesting that the prestige of a journal does not affect how journalists present scientific findings.

RQ5: Are findings from small teams more certain than findings from large teams? We find a linear relationship between the number of authors and the overall level of certainty in scientific findings (Figure 7), even with controls for fields and authors: larger teams present findings with higher certainty. Multiple mechanisms may explain this behavior; for example, larger teams may themselves be more capable of producing more certain results.

Across these results, our study suggests that journalists report scientific findings with less certainty than scientists do (RQ1). This result contradicts existing findings that journalists overstate science (Weiss and Singer, 1988; Fahnestock, 2009).
Our fine-grained analysis of aspect-level certainty provides further detail on this change: journalists may play down the certainty of several core aspects of scientific findings, like SUGGESTION, even when the abstract is certain (RQ2, RQ3). Moreover, we find that the certainty of scientific findings in research articles varies with journal impact factor and team size, while no such pattern persists in science news (RQ4, RQ5), suggesting that journalists may not alter scientific certainty according to these factors.

Discussion
In this paper, we propose a new taxonomy, dataset, and models for certainty in science communications. Using these models, we analyzed a large dataset of scientific findings and answered a series of important research questions about science communication. However, we also note the following limitations of our study.
(1) Due to copyright and open-access restrictions, we use only the abstract rather than the full text of research articles. Although authors normally present the core findings in the abstract, findings in abstracts may still be presented differently from findings in the main text. (2) We use report verbs to extract findings from science news, which may miss findings that are presented without them. How to identify scientific findings in science news remains an open question, and we call for future studies in this direction. (3) In our analysis, we use word-based heuristics (word overlap and Jaccard similarity) to match findings in news and abstracts, while the same scientific finding can be paraphrased with entirely different words. In future studies, we will develop better methods to identify paraphrases of scientific findings.
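The word-based matching heuristic in point (3) can be sketched as follows. This is a minimal illustration, not the released matching code: the whitespace tokenization and the 0.5 similarity threshold here are assumptions for the example.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def match_findings(news_findings, abstract_findings, threshold=0.5):
    """Pair each news finding with its most similar abstract finding,
    keeping only pairs above an (assumed) similarity threshold."""
    pairs = []
    for n in news_findings:
        best = max(abstract_findings, key=lambda a: jaccard(n, a), default=None)
        if best is not None and jaccard(n, best) >= threshold:
            pairs.append((n, best))
    return pairs
```

Because the heuristic operates on surface word sets, a heavily paraphrased abstract finding scores low and the pair is missed, which is exactly the limitation noted above.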

Conclusion
Our study represents a new step towards modeling certainty in text and demonstrates that sentence-level and aspect-level certainty are natural and feasible ways to model and annotate certainty. The proposed computational framework for certainty in scientific findings could support and inspire new studies on certainty in general language, as well as new approaches to studying science communication.

A Data Preprocessing
Altmetric mention data
Altmetric (https://www.altmetric.com/) tracks a variety of sources for mentions of research papers, including coverage from over 2,000 news outlets around the world. To control for differences in the frequency of scientific reporting and potential confounds from variations in journalistic practices across countries, the list of news outlets was curated to 423 U.S.-based news media outlets, each having at least 1,000 mentions in the Altmetric database. Location data for each outlet is provided by Altmetric. This initial dataset consists of 2.4M mentions of 521K papers by 1.7M news articles published before 2019-10-06. Each mention in the Altmetric data has associated metadata that allows us to retrieve both the original citing news story and the DOI of the paper itself.

News processing
During data processing, we noticed that very long news articles are usually policy documents; we therefore removed articles longer than 1,392 words (the top 5%). To ensure that each article is specifically written about a single research paper's findings, we keep only news articles linked to exactly one research paper. This yields 128,942 news-paper pairs spanning 273 different news outlets and 57,807 different scientific articles. For all news stories, we also remove references and paragraphs containing quotes, as they might bias our analysis of certainty (e.g., a scientist describing their own work as uncertain in a quote).
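The two filtering steps above can be sketched as follows; the dict layout and field names (`text`, `dois`) are hypothetical stand-ins for the actual Altmetric metadata schema.

```python
MAX_WORDS = 1392  # articles longer than this (the top 5%) are dropped

def filter_news(articles):
    """Keep articles that are short enough and cite exactly one paper.

    `articles` is assumed to be a list of dicts with hypothetical keys
    'text' (the article body) and 'dois' (papers the article links to).
    """
    kept = []
    for art in articles:
        if len(art["text"].split()) > MAX_WORDS:
            continue  # very long articles are usually policy documents
        if len(art["dois"]) != 1:
            continue  # keep only articles written about a single paper
        kept.append(art)
    return kept
```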

B Model Details
We use scikit-learn version 0.23.1 to build the linear models (Pedregosa et al., 2011). Specifically, we use the ridge regressor and classifier with default settings. scikit-learn's built-in CountVectorizer is used to extract unigram, bigram, and trigram features from each input sentence, and the size of the bag-of-words feature vector is set to 40,000.
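A minimal sketch of this baseline is shown below; the toy sentences, the certainty scores, and the pipeline wiring are illustrative assumptions, not the released training code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Bag-of-words features over unigrams, bigrams, and trigrams,
# capped at 40,000 dimensions, feeding a ridge regressor.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 3), max_features=40000),
    Ridge(),
)

# Toy training data: findings paired with (assumed) certainty scores.
sentences = [
    "the treatment definitely improves outcomes",
    "the treatment may improve outcomes",
]
scores = [6.0, 3.0]
model.fit(sentences, scores)
pred = model.predict(["the drug may improve recovery"])
```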
For both the SciBERT and RoBERTa models, we use the Hugging Face transformers library (https://huggingface.co/) (Wolf et al., 2020) and set the batch size to 128 and the learning rate to 0.0001, with max_len=60. Adam (Kingma and Ba, 2015) is used for optimization. All other hyperparameters and the model sizes are the same as for the default roberta-base and SciBERT models. We train both models for 50 epochs and choose the checkpoint with the lowest loss on the validation set. All the code, datasets, and parameters of our best-performing model are released, so all experiments can easily be reproduced.

C Additional Details on Extracting Scientific Findings
We use the following lexicon to extract scientific findings from news: found that, find that, finds that, reveal that, reveals that, revealed that, suggest that, suggested that, suggests that, discover that, discovers that, discovered that, show that, shows that, showed that, conclude that, concludes that, concluded that, indicate that, indicates that, indicated that, claim that, claims that, claimed that, argue that, argues that, argued that.

We manually annotated 50 extracted findings, and only 1 of them did not fully count as a scientific finding, indicating the high precision of our approach. Table 4 presents examples of extracted findings from news and abstracts.

Figure 11 presents the distribution of aspect-level certainty scores in the annotated dataset. Figure 9 presents the distribution of sentence-level certainty across data splits. Table 2 and Table 3 present annotated findings for sentence-level and aspect-level certainty, respectively. Two such examples:

Based on these observations, we propose that the apparent receding contact angle should be used for characterizing superliquid-repellent surfaces rather than the apparent advancing contact angle and hysteresis.

The nondemented subjects with Alzheimer pathology may have had "preclinical" AD, or numerous cortical plaques may occur in some elderly subjects who would never develop clinical dementia.

G Detailed regression results
Table 6 presents the regression results for RQ1. Table 7 presents the regression results for RQ2. Table 8 presents the regression results for RQ3. Tables 9 and 10 present the regression results for RQ4 and RQ5.
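A sketch of the report-verb extraction heuristic, using a subset of the lexicon above. Treating everything after the report cue as the finding is a simplifying assumption for this example; the full lexicon and exact extraction rules are in the released code.

```python
import re

# A few report-verb cues from the full lexicon above.
REPORT_CUES = [
    "found that", "find that", "finds that",
    "showed that", "shows that", "show that",
    "concluded that", "concludes that", "conclude that",
    "revealed that", "reveals that", "reveal that",
]
CUE_RE = re.compile(r"\b(" + "|".join(REPORT_CUES) + r")\s+", re.IGNORECASE)

def extract_finding(sentence: str):
    """Return (report_cue, finding) if the sentence contains a report cue,
    taking the text after the cue as the reported finding."""
    m = CUE_RE.search(sentence)
    if m is None:
        return None
    return m.group(1).lower(), sentence[m.end():].strip()
```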

Examples of findings with aspect-level certainty annotations (finding, uncertain aspects, certain aspects):

Finding: Arctic sea ice is declining at a rate of nearly 13 percent per decade.
  Uncertain: NUMBER, EXTENT. Certain: PROBABILITY.

Finding: Some functions show decreases with potentially irreversible global impacts.
  Uncertain: EXTENT, PROBABILITY.

Finding: There were 365 cases of maternal sepsis, giving an incidence of severe maternal sepsis of 4.7 women per 100,000.
  Certain: NUMBER, EXTENT.

Finding: The practice may actually drive away qualified applicants who feel that their privacy has been compromised.
  Uncertain: PROBABILITY.

Finding: More research is needed on how cryptic postcopulatory and post-zygotic processes contribute to determining paternity and bridging the behavioural and genetic mating systems of viviparous species.
  Uncertain: PROBABILITY. Certain: SUGGESTION.

Examples of extracted findings (Table 4), shown as the original sentence, the extracted finding, the report verb (if any), and the source:

Sentence: In addition, liver procurement data such as WIT showed that organs with less than 30 mins WIT led to significantly reduced yield, but no impact was found on viability.
  Finding: Organs with less than 30 mins wit led to significantly reduced yield, but no impact was found on viability. (showed, news)

Sentence: I conclude that we live in one of infinitely many universes - one for each value of the gravitational constant.
  Finding: We live in one of infinitely many universes - one for each value of the gravitational constant. (conclude, news)

Sentence: They found that the findings were specific to ADHD, with no associations observed between the other two disorders.
  Finding: The findings were specific to adhd, with no associations observed between the other two disorders. (found, news)

Sentence: The Irish investigators of that meta-analysis found that methotrexate was associated with a small albeit statistically significant 10% increase in the risk of all adverse respiratory events and an 11% increase in the risk of respiratory infection.
  Finding: Methotrexate was associated with a small albeit statistically significant 10% increase in the risk of all adverse respiratory events and an 11% increase in the risk of respiratory infection. (found, news)

Sentence: The study, in the current issue of Research in Nursing & Health, revealed that while physical environment had no direct influence on job satisfaction, it did have a significant indirect influence because the environment affected whether nurses could complete tasks without interruptions, communicate easily with other nurses and physicians, and/or do their jobs efficiently.
  Finding: While physical environment had no direct influence on job satisfaction, it did have a significant indirect influence because the environment affected whether nurses could complete tasks without interruptions, communicate easily with other nurses and physicians, and/or do their jobs efficiently. (revealed, news)

Sentence and finding: Mixed-species neighbourhoods did not significantly affect tree ring growth in normal years. (abstract)

Sentence and finding: Statistical tests (ordinary least squares, quantile, robust regressions, Akaike information criterion model tests) document independence from phylogeny, and a previously unrecognized strong and significant correlation of σ13C enrichment with body mass for all mammalian herbivores. (abstract)

Sentence and finding: There were no differences in socioeconomic status, cognitive reserve, general cognitive status, or lipid and TSH profiles between the groups. (abstract)

Sentence and finding: Much remains unknown and multiple research disciplines are needed to address this: forest scientists and other biologists have a major role to play. (abstract)

Sentence and finding: The co-administration of the energy drink with alcohol did not alter the alcohol-induced impairment on these objective measures. (abstract)
abstract For children with low self-esteem, high praise may be more harmful than helpful.
Inflated praise decreases challenge seeking in children with low self-esteem and has the opposite effect on children with high self-esteem.
0.36 5 Breast-feeding might be no more beneficial than bottle-feeding for 10 of 11 long-term health and well-being outcomes in children age 4 to 14.
Children aged 4 to 14 who were breast-as opposed to bottle-fed did significantly better on 10 of the 11 outcomes studied.

7
While our first impressions of educators might affect our ratings of them, ultimately the quality of their instruction matters the most in student evaluations.
Quality of instruction is the strongest determinant of student factual and conceptual learning, but that both instructional quality and first impressions affect evaluations of the instructor.

7
Even in the absence of symptoms, trauma may have an enduring effect on brain function.
Trauma has a measurable, enduring effect upon the functional dynamics of the brain, even in individuals who experience trauma but do not develop ptsd.

6
Being bullied may increase the risk for parasomnias.
Being bullied increases the risk for having parasomnias.

4
Dried fruits may lower the gi of white bread through displacement of high gi carbohydrate.
When displacing half the available carbohydrate in white bread, all dried fruit lowered the GI; however, only dried apricots (GI_ = _57___5) showed a significant displacement effect (P _ = _0.025).

8
Although the biological mechanisms of these associations need to be explored in future research, these new data may shed new light on the long-observed epidemiological associations between personality, physical health, and human longevity.
The present data shed new light on the long-observed epidemiological associations between personality, physical health, and human longevity.

12
Late colonies more frequently rejected both young and old non-nestmates, suggesting that risk of acceptance may be too high at this stage.
Young non-nestmates were more frequently accepted in early than in late colonies.

6
Only graphic warning labels reduced the percentage of sugary drinks purchased, and that the public may support the use of graphic labels if they are informed that only graphic labels are effective.
Graphic warning labels reduced the share of sugary drinks purchased in a cafeteria from 21.4% at baseline to 18.2% effect driven by substitution of water for sugary drinks.
0.36 8 The kenyan runners are able to maintain their cerebral oxygenation within a stable range, which may contribute to their success in longdistance races.
Kenyan runners from the kalenjin tribe are able to maintain their cerebral oxygenation within a stable range during a self-paced maximal 5-km time trial, but not during an incremental maximal test.

9
The world might be closer to exceeding the budget for the long-term target of the paris climate agreement than previously thought.
The world is closer to exceeding the budget for the long-term target of the paris climate agreement than previously thought.

11
Src may be associated with longer overall survival.
Higher src activity is associated with longer overall survival.

5
Pg-free mel may not reduce short-term complications or improve outcomes after asct for mm.
In summary, we demonstrate that switching to PG-free MEL did not significantly reduce shortterm complications of ASCT or improve outcomes in MM.

9
Higher adenoma detection rates may be associated with up to 50 percent to 60 percent lower lifetime colorectal cancer incidence and death without higher overall costs, despite a higher number of colonoscopies and potential complications, according to a study in the june 16 issue of jama.
In this microsimulation modeling study, higher adenoma detection rates in screening colonoscopy were associated with lower lifetime risks of colorectal cancer and colorectal cancer mortality without being associated with higher overall costs.

14
Long-term ppi use may increase the risk of hip fracture.
The increased risk of hip fracture was evident only in short-term proton pump inhibitor use, but no association was found for long-term or cumulative use. 0.38 6