Semi-Supervised Exaggeration Detection of Health Science Press Releases

Public trust in science depends on honest and factual communication of scientific papers. However, recent studies have demonstrated a tendency of news media to misrepresent scientific papers by exaggerating their findings. Given this, we present a formalization of and study into the problem of exaggeration detection in science communication. While there is an abundance of scientific papers and popular media articles written about them, the articles very rarely include a direct link to the original paper, making data collection challenging and necessitating few-shot learning. We address this by curating a set of labeled press release/abstract pairs from existing expert annotated studies on exaggeration in press releases of scientific papers, suitable for benchmarking the performance of machine learning models on the task. Using limited data from this and previous studies on exaggeration detection in science, we introduce MT-PET, a multi-task version of Pattern Exploiting Training (PET), which leverages knowledge from complementary cloze-style QA tasks to improve few-shot learning. We demonstrate that MT-PET outperforms PET and supervised learning both when data is limited and when there is an abundance of data for the main task.


Introduction
Factual and honest science communication is important for maintaining public trust in science (Nelkin, 1987; Moore, 2006), and the "dominant link between academia and the media" is press releases about scientific articles (Sumner et al., 2014). However, multiple studies have demonstrated that press releases have a significant tendency to sensationalize their associated scientific articles (Sumner et al., 2014; Bratton et al., 2019; Woloshin et al., 2009; Woloshin and Schwartz, 2002). In this paper, we explore how natural language processing can help identify exaggerations of scientific papers in press releases.1

Figure 1: Scientific exaggeration detection is the problem of identifying when a news article reporting on a scientific finding has exaggerated the claims made in the original paper. In this work, we are concerned with predicting exaggeration of the main finding of a scientific abstract as reported by a press release.

1 The code and data are available online at https://github.com/copenlu/scientific-exaggeration-detection
While Sumner et al. (2014) and Bratton et al. (2019) performed manual analyses to understand the prevalence of exaggeration in press releases of scientific papers from a variety of sources, recent work has attempted to expand this using methods from NLP (Yu et al., 2019, 2020; Li et al., 2017). These works focus on the problem of automatically detecting the difference in the strength of causal claims made in scientific articles and press releases. They accomplish this by first building datasets of main claims taken from PubMed abstracts and (unrelated) press releases from EurekAlert,2 labeled for their strength. With this, they train machine learning models to predict claim strength, and analyze unlabeled data using these models. This marks an important first step toward the goal of automatically identifying exaggerated scientific claims in science reporting.
However, existing work has only partially attempted to address this task using NLP. In particular, there exists no standard benchmark data for the exaggeration detection task with paired press releases and abstracts, i.e. where the data consist of tuples of the form (press release, abstract) and the press release is written about the paired scientific paper. Collecting paired data labeled for exaggeration is critical for understanding how well any solution performs on the task, but is challenging and expensive, as it requires domain expertise (Sumner et al., 2014). The focus of this work is then to curate a standard set of benchmark data for the task of scientific exaggeration detection, provide a more realistic task formulation of the problem, and develop methods effective for solving it using limited labeled data. To this end, we present MT-PET, a multi-task implementation of Pattern Exploiting Training (PET, Schick and Schütze (2020a,b)) for detecting exaggeration in health science press releases. We test our method by curating a benchmark test set from the expert annotated data of Sumner et al. (2014) and Bratton et al. (2019), which we release to help advance research on scientific exaggeration detection.
Contributions In sum, we introduce:
• A new, more realistic task formulation for scientific exaggeration detection.
• A curated set of benchmark data for testing methods for scientific exaggeration detection, consisting of 563 press release/abstract pairs.
• MT-PET, a multi-task extension of PET which beats strong baselines on scientific exaggeration detection.

Problem Formulation
We first provide a formal definition of the problem of scientific exaggeration detection, which guides the approach described in §3. We start with a set of document pairs {(t, s) ∈ D}, where s is a source document (e.g. a scientific paper abstract) and t is a document written about the source document s (e.g. a press release for the paper). The goal is to predict a label l ∈ {0, 1, 2} for a given document pair (t, s), where 0 implies the target document undersells the source document, 1 implies the target document accurately reflects the source document, and 2 implies the target document exaggerates the source document.

Two realizations of this formulation are investigated in this work. The first (defined as T1) is an inference task consisting of labeled document pairs used to learn to predict l directly. In other words, we are given training data of the form (t, s, l) and can directly train a model to predict l from both t and s. The second (defined as T2) is a classification task consisting of a training set of documents d ∈ D from both the source and the target domain, where a classifier is trained to predict the claim strength l of sentences from these documents. In other words, we do not require paired documents (t, s) at train time. At test time, these classifiers are applied to document pairs (t, s) and the predicted claim strengths (l_s, l_t) are compared to get the final label l. Previous work has used this formulation to estimate the prevalence of correlation-to-causation exaggeration in press releases (Yu et al., 2020), but has not evaluated this on paired labeled instances.

Figure 2: MT-PET design. We define pairs of complementary pattern-verbalizer pairs for a main task and auxiliary task. These PVPs are then used to train PET on data from both tasks.
Following previous work (Yu et al., 2020), we simplify the problem by focusing on detecting when the main finding of a paper is exaggerated. The first step is then to identify the sentence describing the main finding in s, and the sentence in t describing that finding. In our semi-supervised approach, we do this as an intermediate step to acquire unlabeled data, but for all labeled training and test data, we assume the sentences are already identified and evaluate on the sentence-level exaggeration detection task.
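As a concrete illustration of the T2 comparison step, the mapping from a pair of predicted claim strengths to the final exaggeration label can be sketched as follows (a minimal sketch; the function name is ours, and strengths follow the 0-3 scale of Li et al. (2017), where higher values denote stronger, more causal claims):

```python
def exaggeration_label(strength_source: int, strength_target: int) -> str:
    """Derive the exaggeration label by comparing claim strengths.

    strength_source: predicted strength of the abstract's main finding
    strength_target: predicted strength of the press release's claim
    Only the direction of the difference matters, not the magnitude.
    """
    if strength_target > strength_source:
        return "exaggerates"  # label 2
    if strength_target < strength_source:
        return "downplays"    # label 0
    return "same"             # label 1
```

Note that this is why T2 can be more forgiving than T1: any press release strength above the abstract's strength yields the same final label.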

Approach
One of the primary challenges for scientific exaggeration detection is a lack of labeled training data. Given this, we develop a semi-supervised approach for few-shot exaggeration detection based on pattern exploiting training (PET, Schick and Schütze (2020a,b)). Our method, multi-task PET (MT-PET, see Figure 2), improves on PET by using multiple complementary cloze-style QA tasks derived from different source tasks during training. We first describe PET, followed by MT-PET.

Pattern Exploiting Training (PET)
PET (Schick and Schütze, 2020a) uses the masked language modeling objective of pretrained language models to transform a task into one or more cloze-style question answering tasks. The two primary components of PET are patterns and verbalizers. Patterns are cloze-style sentences which mask a single token e.g. in sentiment classification with the sentence "We liked the dinner" a possible pattern is: "We liked the dinner. It was [MASK]." Verbalizers are single tokens which capture the meaning of the task's labels in natural language, and which the model should predict to fill in the masked slots in the provided patterns (e.g. in the sentiment analysis example, the verbalizer could be Good).
Given a set of pattern-verbalizer pairs (PVPs), an ensemble of models is trained on a small labeled seed dataset to predict the appropriate verbalizations of the labels in the masked slots. These models are then applied to unlabeled data, and their raw logits are combined as a weighted average to provide soft labels for the unlabeled data. A final classifier is then trained on the soft-labeled data using a distillation loss based on KL-divergence.
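The soft-labeling step can be sketched as follows (a minimal sketch in plain Python; the function name and list-based representation are ours, standing in for the tensor operations of an actual implementation):

```python
import math

def soft_labels(ensemble_scores, weights):
    """Combine per-model raw label scores into one soft label distribution.

    ensemble_scores: one list of raw per-label scores per trained model
    weights: one weight per model (PET weights models by their accuracy
             on the training set before training)
    Returns the softmax of the weighted average of the raw scores.
    """
    n_labels = len(ensemble_scores[0])
    total_w = sum(weights)
    avg = [sum(w * s[l] for w, s in zip(weights, ensemble_scores)) / total_w
           for l in range(n_labels)]
    # Numerically stable softmax over the averaged scores
    m = max(avg)
    exps = [math.exp(a - m) for a in avg]
    z = sum(exps)
    return [e / z for e in exps]
```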

Notation
We adopt the notation of the original PET paper (Schick and Schütze, 2020a) to describe MT-PET. We have a masked language model M with a vocabulary V and mask token [MASK] ∈ V. A pattern is defined as a function P(x) which transforms a sequence of input sentences x = (s_0, ..., s_{k-1}), s_i ∈ V*, into a phrase or sentence which contains exactly one mask token. A verbalizer v maps a label in the task's label space L to a set of tokens in the vocabulary V which M is trained to predict.
For a given sample z ∈ V* containing exactly one mask token and w ∈ V corresponding to a word in the language model's vocabulary, M(w|z) is defined as the unnormalized score that the language model gives to word w at the masked position in z. The score for a particular label l ∈ L, as given in Schick and Schütze (2020a), is then

s_P(l|x) = \sum_{w \in v(l)} M(w \mid P(x))

i.e. for a given sample, PET assigns a score s to each label based on all of the verbalizations of that label. When applied to unlabeled data, this produces soft labels from which a final classifier can be trained via distillation using KL-divergence.
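The label scoring above can be sketched as follows (a minimal sketch; the dict-based representation and names are ours, standing in for the logits a masked language model returns at the [MASK] position):

```python
def label_score(mask_logits, verbalizers, label):
    """Raw PET score for one label: the sum of the language model's
    masked-position scores over all verbalizer tokens for that label.

    mask_logits: dict mapping vocabulary token -> raw score at [MASK]
    verbalizers: dict mapping label -> list of verbalizer tokens
    """
    return sum(mask_logits[token] for token in verbalizers[label])
```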

MT-PET
In the original PET implementation, PVPs are defined for a single target task. MT-PET extends this by allowing for auxiliary PVPs from related tasks, adding complementary cloze-style QA tasks during training. The motivation for the multi-task approach is two-fold: 1) complementary cloze-style tasks can potentially help the model to learn different aspects of the main task; in our case, the similar tasks of exaggeration detection and claim strength prediction; 2) data on related tasks can be utilized during training, which is important in situations where data for the main task is limited.

Concretely, we start with a main task T_m with a small labeled dataset (x_m, y_m) ∈ D_m, where y_m ∈ L_m is a label for the instance, as well as an auxiliary task T_a with labeled data (x_a, y_a) ∈ D_a, y_a ∈ L_a. Each pattern P^i_m(x) for the main task has a corresponding complementary pattern P^i_a(x) for the auxiliary task. Additionally, the labels in L_a have their own verbalizer v_a. Thus, with k patterns, the full set of PVP tuples is given as

P = \{(P^i_m, v_m, P^i_a, v_a) : 0 \leq i < k\}

Finally, a large set of unlabeled data U for the main task only is available. MT-PET then trains the ensemble of k masked language models using the pairs defined for the main and auxiliary task. In other words, for each individual model both the main PVP (P^i_m, v_m) and auxiliary PVP (P^i_a, v_a) are used during training. For a given model M_i in the ensemble, on each batch we randomly select one task T_c, c ∈ {m, a}, on which to train. The PVP for that task is then selected as (P^i_c, v_c). Inputs (x_c, y_c) from that dataset are passed through the model, producing raw scores for each label in the task's label space.
The loss is calculated as the cross-entropy between the task label y_c and the softmax of the score s normalized over the scores for all label verbalizations (Schick and Schütze, 2020a), weighted by a term α_c:

L = \alpha_c \cdot \frac{1}{N} \sum_{n=1}^{N} H\left(y^{(n)}_c, \mathrm{softmax}(s(x^{(n)}_c))\right)

where N is the batch size, n is a sample in the batch, H is the cross-entropy, and α_c is a hyperparameter weight given to task c.
MT-PET then proceeds in the same fashion as standard PET. Different models are trained for each PVP tuple in P, and each model produces raw scores s_{P^i_m} for all samples in the unlabeled data. The final score for a sample is then a weighted combination of the scores of the individual models:

s(l|x) = \frac{1}{Z} \sum_{i} w_i \cdot s_{P^i_m}(l|x), \quad Z = \sum_{i} w_i

where the weights w_i are calculated as the accuracy of model M_i on the train set D_m before training. The final classification model is then trained using KL-divergence between the predictions of the model and the scores s as target logits.
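The distillation objective for the final classifier can be sketched as follows (a minimal sketch in plain Python; the helper names are ours, and the temperature of 2.0 matches the distillation hyperparameters reported in the appendix):

```python
import math

def softmax(scores, temperature=1.0):
    """Temperature-scaled, numerically stable softmax over raw scores."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def distillation_loss(teacher_scores, student_scores, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions,
    used to train the final classifier on the soft-labeled data."""
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```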

MT-PET for Scientific Exaggeration
We use MT-PET to learn from data labeled for both of our formulations of the problem (T1 and T2). The first step is to define PVPs for exaggeration detection (T1) and claim strength prediction (T2).
To do this, we develop an initial set of PVPs and use PETAL (Schick et al., 2020) to automatically find verbalizers which adequately represent the labels for each task. We then update the patterns manually and re-run PETAL, iterating as such until we find a satisfactory combination of verbalizers and patterns which adequately reflect the task. Additionally, we ensure that the patterns between T1 and T2 are roughly equivalent. This yields 2 patterns for each task, provided in Table 1, and verbalizers given in Table 2. The verbalizers found by PETAL capture multiple aspects of the task labels, selecting words such as "mistaken," "wrong," and "artificial" for exaggeration, "preliminary" and "conditional" for downplaying, and multiple levels of strength for strength detection such as "estimated" (correlational), "cautious" (conditional causal), and "proven" (direct causal).
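To illustrate what such pattern-verbalizer pairs look like in code (the concrete patterns and verbalizers are those in Table 1 and Table 2; the patterns, names, and verbalizer lists below are hypothetical stand-ins for illustration, not the ones from the tables):

```python
def exaggeration_pattern(abstract_sentence: str, press_sentence: str) -> str:
    """A hypothetical cloze pattern for T1 (exaggeration detection):
    the model fills the single [MASK] slot with a verbalizer token."""
    return (f'"{abstract_sentence}" "{press_sentence}" '
            f"The second claim is [MASK].")

def strength_pattern(sentence: str) -> str:
    """A hypothetical cloze pattern for T2 (claim strength prediction)."""
    return f'"{sentence}" This finding is [MASK].'

# Hypothetical verbalizers in the spirit of those found by PETAL
VERBALIZERS_T1 = {
    "downplays": ["preliminary", "conditional"],
    "same": ["accurate"],
    "exaggerates": ["mistaken", "wrong", "artificial"],
}
```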
For unlabeled data, we start with unlabeled pairs of full-text press releases and abstracts. As we are concerned with detecting exaggeration in the primary conclusions, we first train a classifier based on single-task PET for conclusion detection using a set of seed data. The patterns and verbalizers we use for conclusion detection are given in Table 3 and Table 4. After training the conclusion detection model, we apply it to the press releases and abstracts, choosing the sentence from each with the maximum score s_P(1|x).

Data Collection
One of the main contributions of this work is a curated benchmark dataset for scientific exaggeration detection. Labeled datasets exist for the related task of claim strength detection in scientific abstracts and press releases (Yu et al., 2019, 2020), but these data are from press releases and abstracts which are unrelated (i.e. the given press releases are not written about the given abstracts), making them unsuitable for benchmarking exaggeration detection. Given this, we curate a dataset of paired sentences from abstracts and associated press releases, labeled by experts for exaggeration based on their claim strength. We then collect a large set of unlabeled press release/abstract pairs useful for semi-supervised learning.

Gold Data
The gold test data used in this work are from Sumner et al. (2014) and Bratton et al. (2019), who annotate scientific papers, their abstracts, and associated press releases along several dimensions to characterize how press releases exaggerate papers.

Table 2: Verbalizers are obtained using PETAL (Schick et al., 2020), starting with the top 10 verbalizers per label and then manually filtering out words which do not make sense with the given labels.
The original data consists of 823 pairs of abstracts and press releases. The 462 pairs from Sumner et al. (2014) have been used in previous work to test claim strength prediction (Li et al., 2017), but the data, which contain press release and abstract conclusion sentences that are mostly paraphrases of the originals, are used as is.
We focus on the annotations provided for claim strength. The annotations consist of six labels, which we map to the four labels defined in Li et al. (2017). The labels and their meaning are given in Table 5. This gives a claim strength label l_ρ for the press release and l_γ for the abstract. The final exaggeration label l is then defined as follows: the press release downplays the abstract (l = 0) if l_ρ < l_γ, accurately reflects it (l = 1) if l_ρ = l_γ, and exaggerates it (l = 2) if l_ρ > l_γ.

As the original abstracts in the study are not provided, we automatically collect them using the Semantic Scholar API.3 We perform a manual inspection of abstracts to ensure the correct ones are collected, discarding missing and incorrect abstracts. Gold conclusion sentences are obtained by sentence tokenizing abstracts using SciSpaCy (Neumann et al., 2019) and finding the best matching sentence to the provided paraphrase in the data using ROUGE score (Lin, 2004). We then manually fix sentences which do not correspond to a single sentence from the abstract. Gold press release sentences are gathered in the same way from the provided press releases. This results in a dataset of 663 press release/abstract pairs labeled for claim strength and exaggeration. The label distribution is given in Table 6. We randomly sample 100 of these instances as training data for few-shot learning (T1), leaving 563 instances for testing. Additionally, we create a small training set of 1,138 sentences labeled for whether or not they are the main conclusion sentence of the press release or abstract. This data is used in the first step of MT-PET to identify conclusion sentences in the unlabeled pairs.

3 https://api.semanticscholar.org/
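The ROUGE-based sentence alignment can be sketched as follows (a minimal sketch; the paper uses ROUGE (Lin, 2004), for which the simple unigram-F1 below is a stand-in, and the function names are ours):

```python
def unigram_f1(candidate: str, reference: str) -> float:
    """A simple ROUGE-1-style F1 between two sentences, a stand-in for
    a full ROUGE implementation."""
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    if not cand or not ref:
        return 0.0
    overlap = len(cand & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def best_matching_sentence(sentences, paraphrase):
    """Return the tokenized abstract sentence that best matches the
    annotated paraphrase of the main finding."""
    return max(sentences, key=lambda s: unigram_f1(s, paraphrase))
```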
For T2, we use the data from Yu et al. (2019, 2020). Yu et al. (2019) create a dataset of 3,061 conclusion sentences labeled for claim strength from structured PubMed abstracts of health observational studies with conclusion sections of 3 sentences or less. Yu et al. (2020) then annotate statements from press releases from EurekAlert. The selected data are from the title and first two sentences of the press releases, as Sumner et al. (2014) note that most press releases contain their main conclusion statements in these sentences, following an inverted pyramid structure common in journalism (Pöttker, 2003). Both studies use the same claim strength labeling scheme (see Table 5). The final data contains 2,076 labeled conclusion statements. From these two datasets, we select a random stratified sample of 4,500 instances for training in our full-data experiments, and subsample 200 for few-shot learning (100 from abstracts and 100 from press releases).

Unlabeled Data
We collect unlabeled data from ScienceDaily,4 a science reporting website which aggregates and re-releases press releases from a variety of sources. To do this, we crawl press releases from ScienceDaily via the Internet Archive Wayback Machine5 between January 1st 2016 and January 1st 2020 using Scrapy.6 We discard press releases without paper DOIs and then pair each press release with a paper abstract by querying for each DOI using the Semantic Scholar API. This results in an unlabeled set of 7,741 press release/abstract pairs. Additionally, we use only the title, lead sentence, and first three sentences of each press release.

Experiments
Our experiments are focused on the following primary research questions:
• RQ1: Does MT-PET improve over PET for scientific exaggeration detection?
• RQ2: Which formulation of the problem leads to the best performance?
• RQ3: Does few-shot learning performance approach the performance of models trained with many instances?
• RQ4: What are the challenges of scientific exaggeration prediction?
We experiment with the following model variants:
• Supervised: A fully supervised setting where only labeled data is used.
• PET: Standard single-task PET.
• MT-PET: We run MT-PET with data from one task formulation as the main task and the other formulation as the auxiliary task.
We perform two evaluations in this setup: one with T1 as the main task and one with T2. For T1, we use the 100 expert annotated instances with paired press release and abstract sentences labeled for exaggeration (200 sentences total). For T2, we use 100 sentences from the press release data from Yu et al. (2020) and 100 sentences from the abstract data from Yu et al. (2019), labeled for claim strength. We use RoBERTa-base (Liu et al., 2019) from the HuggingFace Transformers library (Wolf et al., 2020) as the main model, and set α_m = 1 and α_a = min(2, |D_m|/|D_a|). All methods are evaluated using macro-F1 score, and results are reported as the average performance over 5 random seeds.

Performance Evaluation
We first examine the performance with T1 as the base task (see Table 7). In a purely supervised setting, the model struggles to learn and mostly predicts the majority class. Basic PET yields a substantial improvement of 10 F1 points, with MT-PET further improving upon this by another 8 F1 points. Accordingly, we conclude that training with auxiliary task data provides much benefit for scientific exaggeration detection in the T1 formulation.
We next examine performance with T2 (strength classification) as the main task in both few-shot and full-data settings (see Table 8). In terms of base performance, the model predicts exaggeration better than under the T1 formulation in a purely supervised setting. For PET and MT-PET, we see a similar trend: with 200 instances for T2, PET improves by 7 F1 points over supervised learning, and MT-PET improves on this by a further 0.9 F1 points. Additionally, MT-PET improves performance on the individual tasks of predicting the claim strength of conclusions in press releases and scientific abstracts with 200 examples. While less dramatic, we still see gains in performance using PET and MT-PET when 4,500 instances from T2 are used, despite the fact that there are still only 100 instances from T1. We also test whether the improvement in performance is simply due to training on more in-domain data ("PET + in-domain MLM" in Table 8). We observe gains for exaggeration detection using masked language modeling on data from T1, but MT-PET still performs better at classifying the strength of claims in press releases and abstracts when 200 training instances from T2 are used.

RQ1
Our results indicate that MT-PET does in fact improve over PET for both training setups. With T1 as the main task and T2 as the auxiliary task, we see that performance is substantially improved, demonstrating that learning claim strength prediction helps produce soft-labeled training data for exaggeration detection. Additionally, we find that the reverse holds with T2 as main task and T1 as auxiliary task. As performance can also be improved via masked language modeling on data from T1, this indicates that some of the performance improvement could be due to including data closer to the test domain. However, our error analysis in subsection 5.2 shows that these methods improve model performance on different types of data.
RQ2 We find that T2 is better suited for scientific exaggeration detection in this setting, however, with a couple of caveats. First, the final exaggeration label is based on expert annotations for claim strength, so clearly claim strength prediction will be useful in this setup. Additionally, the task may be more forgiving here, as only the direction needs to be correct and not necessarily the final strength label (i.e. predicting '0' for the abstract and any of '1,' '2,' or '3' for the press release label will result in an exaggeration label of 'exaggerates').
RQ3 We next examine the learning dynamics of our few-shot models with different amounts of training data (see Figure 3), comparing them to MT-PET to understand how well it performs compared to settings with more data. MT-PET with only 200 samples is highly competitive with purely supervised learning on 4,500 samples (57.44 vs. 58.66). Additionally, MT-PET performs at or above supervised performance up to 1000 input samples, and at or above PET up to 500 samples, again using only 200 samples from T2 and 100 from T1.

Error Analysis
RQ4 Finally, we try to understand the difficulty of scientific exaggeration detection by observing where models succeed and fail (see Figure 4). The most difficult category of examples to predict involves direct causal claims, particularly exaggeration and downplaying when one document makes a direct causal claim and the other an indirect causal claim ('CON->CAU', 'CAU->CON'). It is also challenging to predict when both the press release and abstract conclusions are directly causal.
The models have the easiest time predicting when both statements involve correlational claims, and exaggerations involving correlational claims from abstracts. We also observe that MT-PET helps the most for the most difficult category: causal claims (see Figure 5 in Appendix A). The model is particularly better at differentiating when a causal claim in an abstract is downplayed by a press release. It is also better at identifying correlational claims than PET, where many claims involve association statements such as 'linked to,' 'could predict,' 'more likely,' and 'suggestive of.' The model trained with MLM on data from T1 also benefits causal statement prediction, but mostly for when both statements are causal, whereas MT-PET sees more improvement for pairs where one causal statement is exaggerated or downplayed by another (see Figure 6 in Appendix A). This suggests that training with the patterns from T1 helps the model to differentiate direct causal claims from weaker claims, while MLM training mostly helps the model to understand better how direct causal claims are written. We hypothesize that combining the two methods would lead to mutual gains.
Related Work
Most work on scientific exaggeration detection has focused on flagging when the primary finding of a scientific paper has been exaggerated by a press release or news article (Sumner et al., 2014; Bratton et al., 2019; Yu et al., 2019, 2020; Li et al., 2017). Sumner et al. (2014) and Bratton et al. (2019) manually label pairs of press releases and scientific papers on a wide variety of metrics, finding that one third of press releases contain exaggerated claims, and 40% contain exaggerated advice. Li et al. (2017) is the first study into automatically predicting claim strength, using the data from Sumner et al. (2014) as a small labeled dataset. Yu et al. (2019) and Yu et al. (2020) extend this by building larger datasets for claim strength prediction, performing an analysis of a large set of unlabeled data to estimate the prevalence of claim exaggeration in press releases. Our work improves upon this by providing a more realistic task formulation of the problem, consisting of paired press releases and abstracts, as well as curating both labeled and unlabeled data to evaluate methods in this setting.

Learning from Task Descriptions
Using natural language to perform zero- and few-shot learning has been demonstrated on a number of tasks, including question answering (Radford et al.), text classification (Puri and Catanzaro, 2019), relation extraction (Bouraoui et al., 2020), and stance detection (Hardalov et al., 2021b,c). Methods of learning from task descriptions have gained popularity since the creation of GPT-3 (Brown et al., 2020). Raffel et al. (2020) attempt this with smaller language models by converting tasks into natural language and predicting tokens in the vocabulary. Schick and Schütze (2020a) propose PET, a method for few-shot learning which converts tasks into cloze-style QA problems that a pretrained language model can solve in order to provide soft labels for unlabeled data. We build on PET, showing that complementary cloze-style QA tasks can be trained on simultaneously to improve few-shot performance on scientific exaggeration detection.

Conclusion
In this work, we present a formalization of and investigation into the problem of scientific exaggeration detection. As data for this task is limited, we develop a gold test set for the problem and propose MT-PET, a semi-supervised approach based on PET, to solve it with limited training data. We find that MT-PET helps in the more difficult cases of identifying and differentiating direct causal claims from weaker claims, and that the most performant approach involves classifying and comparing the individual claim strength of statements from the source and target documents. The code and data for our experiments can be found online.7 Future work should focus on building more resources, e.g. datasets for exploring scientific exaggeration detection.

Broader Impact Statement
Being able to automatically detect whether a press release exaggerates the findings of a scientific article could help journalists write press releases which are more faithful to the scientific articles they are describing. We further believe it could benefit the research community working on fact checking and related tasks, as developing methods to detect subtle differences in a statement's veracity is currently understudied.
On the other hand, as our paper shows, this is currently still a very challenging task, and thus, the resulting models should only be applied in practice with caution. Moreover, it should be noted that the predictive performance results reported in this paper are for press releases written by science journalists; one could expect worse results for press releases which more strongly simplify scientific articles.

A Error Analysis Plots
Extra plots from our error analysis are given in Figure 5 and Figure 6.

B.1 Computing Infrastructure
All experiments were run on a shared cluster. Requested jobs consisted of 16GB of RAM and 4 Intel Xeon Silver 4110 CPUs. We used a single NVIDIA Titan X GPU with 24GB of RAM.

B.2 Average Runtimes
The average runtime performance of each model is given in

B.3 Number of Parameters per Model
We use RoBERTa, specifically the base model, for all experiments, which consists of 124,647,170 parameters.

B.4 Validation Performance
As we are testing a few shot setting, we do not use a validation set and only report the final test results.

B.5 Evaluation Metrics
The primary evaluation metric used was macro-F1 score. We used the sklearn implementation of precision_recall_fscore_support to compute F1 score.
B.6 Hyperparameters

Hyperparameters for Distillation We used the following hyperparameters for distillation (training the final classifier after PET) for both T1 and T2 as the main task: Epochs: 3; Batch Size: 4; Learning Rate: 0.00001; Warmup Steps: 200; Weight Decay: 0.01; Temperature: 2.0. We also weigh the cross-entropy loss based on the label distribution.

B.7 Data
We build our benchmark test dataset from the studies of Sumner et al. (2014) and Bratton et al. (2019).
A link to the test data will be provided upon acceptance of the paper (and included in the supplemental material). Claim strength data from Yu et al. (2019) for abstracts can be found at https://github.com/junwang4/correlation-to-causation-exaggeration/blob/master/data/annotated_pubmed.csv. Claim strength data for press releases from Yu et al. (2020) can be found at https://github.com/junwang4/correlation-to-causation-exaggeration/blob/master/data/annotated_eureka.csv