Entity-Based Evaluation of Political Bias in Automatic Summarization

Growing literature has shown that NLP systems may encode social biases; however, the political bias of summarization models remains relatively unknown. In this work, we use an entity replacement method to investigate the portrayal of politicians in automatically generated summaries of news articles. We develop an entity-based computational framework to assess the sensitivities of several extractive and abstractive summarizers to the politicians Donald Trump and Joe Biden. We find consistent differences in these summaries upon entity replacement, such as reduced emphasis of Trump's presence in the context of the same article and a more individualistic representation of Trump with respect to the collective US government (i.e., administration). These summary dissimilarities are most prominent when the entity is heavily featured in the source article. Our characterization provides a foundation for future studies of bias in summarization and for normative discussions on the ideal qualities of automatic summaries.


Introduction
Automatic summarization aims to identify the most important information in input documents and produce a concise summary, saving readers time and effort (Nenkova and McKeown, 2011). However, the pursuit of efficiency may come with undesired biases, especially in a political context. Bias of a political nature can contribute to issues such as polarization and misinformation. It is also subjective and difficult to detect on the basis of a single article.
In this work, we conduct the first empirical study to examine the political bias of automatic summarizers based on large-scale pre-trained language models. We investigate the portrayal of politicians in automatically generated summaries through an entity replacement experiment design; the advantage of this setup is the ability to control for the content of the source document. Figure 1 shows an example: given a news article about Joe Biden, we replace all occurrences of Biden's name with that of Donald Trump and examine the differences between the summary of the original article and that of the replaced article.
What should an ideal summarization model output for these two versions of an article, which differ only by one entity name? This normative question is crucial to consider for responsible deployment and usage of summarization models. A first step towards addressing this query is to uncover whether state-of-the-art automatic summarizers actually encode any bias at all. In this paper, we focus on this empirical demonstration and leave the deliberation of desirable differences for future work (see §5).
We demonstrate substantial differences across four summarization models. The consistent biases in the generated summaries suggest robustly different model representations of the two politicians. Furthermore, entity-centric news articles, those that heavily feature the original entity, lead to more dissimilar summaries upon replacement. We acknowledge that "bias" is an overloaded term; in this paper, we use it to refer to significant differences with respect to entity representation. We do not endorse the interpretation that the summaries are "anti-Trump/anti-conservative".
In summary, our main contributions are: • We conduct the first study of political bias in automatic summarization.
• We propose an entity replacement approach to compare model representations of key political figures.
• We find consistent differences, such as under-emphasis of Trump's presence in the same article context and a more individualistic representation of Trump in relation to the administration. These variations reveal that summarization models are not neutral with respect to political entities.

Related Work
There are two primary types of automatic summaries: extractive and abstractive (El-Kassas et al., 2021). Neural summarization models have demonstrated superior performance on benchmark datasets (e.g., Zhang et al. (2020)). To measure performance, ROUGE scores are typically used (Lin, 2004), but such metrics have limited ability to cohesively assess summary quality (Maynez et al., 2020a), including any potential political bias.
There is strong resemblance between selecting the most important information in the summarization process and the phenomenon of media framing, where media outlets selectively report aspects of an issue or quotes of politicians (Entman, 1993; Chong and Druckman, 2007). The selection process by media outlets has received substantial attention in computational research (Niculae et al., 2015; Tan et al., 2018; Guo et al., 2022; Vallejo et al., 2023). Our identified problem differs in that the content selection is done by models rather than media outlets and that the kind of bias may not align with the conservative vs. liberal biases as defined for media framing.
A growing line of work investigates the biases of NLP systems to understand their potential harms if/when deployed (Sheng et al., 2021). Many studies have covered representation bias and social bias (e.g., gender and race) (Steen and Markert, 2023; Sheng et al., 2019; Liang et al., 2021; Yeo and Chen, 2020; Garimella et al., 2021; Dhamala et al., 2021). In comparison, the harm of political bias is more subtle; it can be difficult to observe in an individual article, but can still have substantial impact (e.g., affecting elections in US swing states). While Feng et al. (2023) tracks political bias in language models for hate speech and misinformation detection, our work is the first study to reveal political bias in automatic summarization models.
Replacement paradigms similar to our own have been used to uncover biases in other NLP applications, such as language modeling (Feder et al., 2021;Vig et al., 2020;Patel and Pavlick, 2021) and neural machine translation (Wang et al., 2022).

Methodology
In this section, we introduce our data, experimental design, and the models we evaluate. Data and code are available at https://github.com/ChicagoHAI/entity-based-political-bias.
Data: For source documents, we use cleanly parsed articles from the NOW corpus from 1/1/2020 to 12/31/2021 that have at most 5,000 words. Among these news articles that mention any US politician, about 62.9% include Donald Trump and/or Joe Biden. Thus, we focus our investigation of bias on the entities Trump and Biden: well-known political figures that play important roles in US news. We separate the articles into two experimental groups: 157,648 articles that mention only Trump and a corresponding group of articles that mention only Biden (see Appendix A).

Replace and compare: For each article mentioning entity e1, the first step is to replace all mentions of e1 with e2 and generate summaries of both the original and the replaced version. The second step is to assess the differences between these two sets of corresponding summaries S_i^{e1} and S_i^{e2}. We define feature functions to extract properties of these summaries, and then conduct paired t-tests to measure significance levels. From the patterns arising in the results of these analyses, we develop an understanding of the models' encoded biases for representations of the entities.
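The replace-and-compare procedure can be sketched as follows. This is an illustrative stand-in, not the paper's exact implementation: the feature function shown (normalized entity mention frequency) is one of the features described in §4, and the t-statistic is computed by hand rather than via a statistics library.

```python
import math
import statistics

def mention_frequency(summary: str, entity: str) -> float:
    # Feature function: entity mentions normalized by summary length in tokens
    # (whitespace tokenization is a simplification; the paper uses spaCy).
    tokens = summary.split()
    return sum(entity in tok for tok in tokens) / max(len(tokens), 1)

def paired_t_statistic(xs, ys):
    # t-statistic of a paired t-test over per-article feature differences:
    # t = mean(d) / (stdev(d) / sqrt(n)), where d_i = x_i - y_i.
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    sd = statistics.stdev(diffs)
    return statistics.mean(diffs) / (sd / math.sqrt(n))

# Toy usage: per-article feature values before and after entity replacement.
original = [0.10, 0.12, 0.08, 0.11]
replaced = [0.14, 0.15, 0.10, 0.16]
t = paired_t_statistic(original, replaced)  # negative: replaced > original
```

In practice one would convert the t-statistic to a p-value with a t-distribution CDF (e.g., scipy.stats.ttest_rel); the sketch stops at the statistic to stay dependency-free.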

Models:
The summarization models used in this work include an extractive model and three abstractive models. Specifically, we use the following:
• PRESUMM (Liu and Lapata, 2019): a BERT-based model with hierarchical encoders, designed and fine-tuned for extractive text summarization.
• BART (Lewis et al., 2020), PEGASUS (Zhang et al., 2020), and PROPHETNET (Qi et al., 2020): pre-trained sequence-to-sequence models that we use for abstractive summarization.

Framework and Results
We present our results for similarity between summaries and two feature functions: 1) entity name mentions in summaries and 2) temporal differences in status indicators. We focus on these two features in this short paper to zero in on the basic notion of representation in sensitivity to political entities. All results involve comparing frequencies via paired t-tests; we present results using the notation described in Table 1. The numerical mean differences and p-values can be obtained from our publicly released code (see §3). As described in Appendix B, there are small but significant differences in summary lengths before and after replacement. Therefore, all frequencies are normalized by summary length before the t-tests.

Overall Similarity of S_i^{e1} and S_i^{e2}

We first compare the similarity scores between the summary pairs. Figure 2 shows that PRESUMM summaries have the highest similarity ratios, as expected, since the summaries are directly extracted from the articles and the dissimilarity only stems from differences in selection. PROPHETNET summaries have the lowest similarities; this most powerful and latest summarization model before GPT-3 (see Limitations) is the most sensitive to differences in entity names, suggesting that it may encode the most differences between the two politicians. Appendix B contains additional comparisons.

Table 2: Trump is less likely to be mentioned in abstractive summaries than the other presidents, suggesting that these models exhibit inclusion bias. The consistent ↑ arrows for Trump → e2 show that replacing Trump's name with e2 in the source article results in abstractive summaries with significantly higher frequencies of e2; this observation appears consistent for Biden → Trump.
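The paper's exact similarity metric is defined in a footnote not reproduced here; as a hedged illustration of the comparison, a simple token-overlap (Jaccard) score between paired summaries captures the idea:

```python
def summary_similarity(s1: str, s2: str) -> float:
    # Jaccard overlap of lowercased token sets; an illustrative stand-in
    # for the paper's similarity metric, not the metric itself.
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    if not t1 and not t2:
        return 1.0
    return len(t1 & t2) / len(t1 | t2)
```

Under any such metric, extractive summaries of the original and replaced articles can differ only through sentence selection, whereas abstractive summaries can diverge word by word, which is consistent with PRESUMM scoring highest here.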
When investigating what factors are correlated with dissimilar summaries after entity replacement, we find that entity-centric articles lead to more dissimilar summaries. From the Trump → Biden and Biden → Trump scores, we split the news articles into ones leading to highly similar summaries (HIGHSIM) vs. those with highly dissimilar summaries (LOWSIM) for each model. After creating each model's HIGHSIM and LOWSIM article subsets, we identified distinguishing key words with the Fightin' Words algorithm with the uninformative Dirichlet prior (Monroe et al., 2009). We observe that the greater presence of e1's name consistently leads to dissimilar summaries for the e1 → e2 directions. This indicates that entity-centric articles lead to dissimilar summaries upon replacement; i.e., the model's representational differences emerge as Trump or Biden becomes the focus of the article. See Appendix B for detailed results from this exploration.
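The Fightin' Words comparison can be sketched as below: log-odds ratios of word use in the two article subsets, smoothed with an uninformative Dirichlet prior and z-scored by their approximate variance (Monroe et al., 2009). The smoothing value alpha is an illustrative choice, not the paper's setting.

```python
import math
from collections import Counter

def fightin_words(corpus1, corpus2, alpha=0.01):
    # Log-odds ratio with an uninformative Dirichlet prior (Monroe et al., 2009).
    # corpus1/corpus2 are lists of tokens; positive z-scores indicate words
    # more characteristic of corpus1, negative of corpus2.
    c1, c2 = Counter(corpus1), Counter(corpus2)
    vocab = set(c1) | set(c2)
    n1, n2 = sum(c1.values()), sum(c2.values())
    a0 = alpha * len(vocab)  # total pseudo-count mass
    scores = {}
    for w in vocab:
        y1, y2 = c1[w], c2[w]
        delta = (math.log((y1 + alpha) / (n1 + a0 - y1 - alpha))
                 - math.log((y2 + alpha) / (n2 + a0 - y2 - alpha)))
        var = 1.0 / (y1 + alpha) + 1.0 / (y2 + alpha)
        scores[w] = delta / math.sqrt(var)
    return scores
```

Words with large positive z-scores distinguish the LOWSIM subset (e.g., the original entity's name), while words near zero are used comparably in both subsets.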

Entity Representation in Summaries
We then determine whether the summary pairs mention their entities to the same extent. This property reflects inclusion bias (Steen and Markert, 2023): the idea that entities should be mentioned in a summary with equal frequency if they are equally salient in the source document. We define this feature function as the frequency with which the entity is named in the summary, i.e., #e1 in the summary, and likewise for e2 after replacement.

Trump is least likely to be mentioned in abstractive summaries. The abstractive summarizers tend not to mention Trump's name (see Table 2). There are also significant differences in usage of Biden's name compared to the other presidents. This result suggests that such summaries are sensitive to the entities in the article. In particular, they de-emphasize Trump's presence compared to other presidents in the context of the same article, which may skew reader perception of the event. See Appendix C for example summary pairs.

Indicators of Status Over Time
An important event during the time period of our data is the inauguration of Biden at the start of 2021, which changes his status from former Vice President to President. To identify changes across time due to this event, we separate the articles by year of publication (2020 vs. 2021), and then assess the frequency of the title "Vice President" across generated summaries S_i^{e1} and S_i^{e2}. Another relevant change is the use of the word "administration" (e.g., "the Biden administration"), which can be seen as an indicator of whether the presidency is viewed as an individualistic or a collective effort. Sunstein and Lessig (1994) also draw distinctions between the president and the administration, and their perceived authority. The frequency of this term is another feature function.
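As a sketch, the status-indicator feature can be computed per summary and aggregated by publication year. The helper names and the whitespace tokenization are ours, not the paper's; the paper normalizes by spaCy token counts.

```python
from collections import defaultdict

def phrase_rate(summary: str, phrase: str) -> float:
    # Occurrences of a status phrase (e.g. "Vice President"),
    # normalized by summary length in tokens.
    return summary.count(phrase) / max(len(summary.split()), 1)

def rates_by_year(summaries, phrase):
    # summaries: iterable of (year, summary_text) pairs.
    # Groups per-summary rates by publication year (the 2020 vs. 2021 split)
    # and returns the mean rate per year.
    grouped = defaultdict(list)
    for year, text in summaries:
        grouped[year].append(phrase_rate(text, phrase))
    return {year: sum(vals) / len(vals) for year, vals in grouped.items()}
```

The same per-year rates for the original and replaced summary sets would then feed the paired t-tests described in §3.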
Most models associate Biden with being the Vice President. From the frequency of "Vice President" usage in 2020 vs. 2021 (Table 3), Biden is more strongly associated with the title of "Vice President". This observation holds even in 2021 summaries, when he assumes the title "President". Our observation is aligned with recent work on hallucination and factuality (Ji et al., 2022; Maynez et al., 2020b), highlighting their potential impacts in political news summarization.
Abstractive models tend not to associate Trump with the "administration". The abstractive models consistently dissociate Trump from the use of "administration" (Table 4).

Conclusion
In this work, we conduct the first empirical study on political bias in summarization models. Through experiments that replace the names of Donald Trump and Joe Biden with other presidential entities in US news articles, we identify several areas of variation in summaries when the source article differs by a single person's name. We show that summarization models learn different representations for different entities and provide consistently different summaries depending on the entities involved. Our framework can be used for further analysis of biases in summarization models.
Politicians are often in similar situations, such as signing a bill or meeting a world leader. Models with distinct representations of different entities will generate different summaries for such reports. These variations challenge the assumption that summaries are inherently neutral, i.e., that one can produce or seek a single gold-standard automatic summary. Such differences and their downstream effects (e.g., on voter preferences) warrant further exploration.
Ultimately, we must consider: what constitutes an ideal automatic summary? This is an important normative query, which we purposefully do not answer in this paper. Having distinct representations of politicians may indicate a nuanced model, but that does not automatically justify all differences. Characterizing the systematic biases encoded in automatic summarizers is a first step towards responsible use of such models. Our work aims to motivate discussion of this essential yet overlooked research area.

Limitations
We acknowledge the limitations of our methodology, which provide opportunities for further research. This proposed framework was only evaluated on US politicians, specifically two US presidents. There are limits to studying only these two entities and not other less popular/controversial politicians. We attempt to account for this by adding the additional e2 candidates Obama and Bush. Additionally, because Trump and Biden are prominent political figures that feature heavily in US political news (see §3), the consistent biases we found across multiple summarization models constitute important initial findings. Summaries of such high-profile politicians are more likely to be read and to have possible harms. Furthermore, different from the emphasis on generalization in modeling work, scholars in other fields such as political science may dedicate an entire book to the study of a single politician (e.g., Wilson (2016)).
Political bias may vary across different countries and regimes. We also do not claim that the observed differences are exclusive to the two politicians that we study here. They likely exist for any notable entities in the models' training data.
Another limitation lies in our work's exclusive focus on English news articles. While the frequency metrics can generalize to other nations and languages, some analyses done here may be specific to English terms and the dynamics of US politics. We also do not consider co-reference resolution of politician titles in entity replacement. Entity-swapping between politicians of different genders may require additional data processing due to the need for pronoun resolution.
Our work uses the default generation setup in the four models of interest. We recognize that there exist a multitude of hyperparameter settings and decoding algorithms that can lead to summaries with different properties. Future studies can also examine the impact of decoding algorithms on biases in automatic summarization.
Finally, while GPT-3+ has been shown to perform well on some summarization tasks (Goyal et al., 2022; Yang et al., 2023; Zhang et al., 2023), we do not include it since our study began prior to the API's accessibility. Furthermore, the models we evaluated are publicly available, which allows for greater control over the generation of summaries. As political bias is a sensitive social issue with subjective connotations, we believe that it is important to conduct a fair and thorough comparison in a controllable setting. Our proposed framework can be readily applied to GPT-3+ and other models in future evaluations.

Ethics Statement
Having demonstrated systematic biases related to entity replacement, one could knowingly manipulate a summary of a news article about Trump by replacing his name with Biden's, and then restoring his original name in the adversarial summary. This action is related to spinned summaries (Bagdasaryan and Shmatikov, 2021), and there may be concerns about contributing to fake news and disinformation (Buchanan et al., 2021).
On the other hand, this research contributes to revealing potential biases and harms of NLP systems. A primary benefit of our findings is improving public understanding of NLP models, which may lead to cautious deployment of such models. We believe that this is especially important for summarization, which is often portrayed as a straightforward task. Ultimately, we believe that the benefits of being able to characterize political biases in automatic summarization will aid in developing defenses to such adversarial use cases. Raising awareness of such biases can also enable an informed discussion on what desirable summaries should include and how summarization models should be used.
Overall, our work adds to the emerging area of understanding potential issues of NLP systems, and it makes an important contribution towards recognizing the political biases of summarization models, which is critical for their responsible use.

A Methodology Details
Data statistics: The "Trump only" articles average 744.8 words in length and the "Biden only" articles average 808.0 words. Figure 3 shows the distribution of articles that include these entity mentions. "Trump only" articles decrease in number after the inauguration in January 2021, while the number of "Biden only" articles increases.
As stated in §3, among 2020-2021 news articles that mention any US politician, about 62.9% include Donald Trump and/or Joe Biden. We filter the articles mentioning any US politician as those that include names from CivilServiceUSA. Then, among those articles, we search for the fraction that also mention "Trump" and "Biden".

Figure 4: There are small but significant differences in length between the original and replaced summaries.
Additional replace and compare details: To do the e1 → e2 replacement, we create mappings between the different variations of the entity names. In the case of Trump and Biden, we use the name mapping in Table 5 to do the Trump → Biden and Biden → Trump replacements. Similar mappings are utilized for Obama and Bush.
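A minimal sketch of the mapping-based replacement follows. The mapping entries shown are illustrative and do not reproduce the full contents of Table 5; as described there, when a name form has multiple target variants, the first entry is the default.

```python
# Illustrative mapping in the spirit of Table 5 (not its full contents).
TRUMP_TO_BIDEN = {
    "Donald J. Trump": ["Joseph R. Biden"],
    "Donald Trump": ["Joe Biden", "Joseph Biden"],  # first entry is the default
    "Trump": ["Biden"],
}

def apply_mapping(text: str, mapping: dict) -> str:
    # Replace longer name forms first so that "Donald Trump" is handled
    # before the bare "Trump" rule can split it.
    for old in sorted(mapping, key=len, reverse=True):
        text = text.replace(old, mapping[old][0])
    return text
```

The reverse (Biden → Trump) direction would use an analogous mapping with the roles of the two names swapped.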

B Additional Summary Similarity Comparisons
Comparing lengths and similarity scores of S_i^{e1} and S_i^{e2}: To obtain lengths of summaries, we count the number of tokens as parsed by spaCy's tokenizer. There are small but significant differences between the lengths of S_i^{e1} and S_i^{e2}, as shown in Figure 4. From a paired t-test, we find that abstractive summaries tend to be longer for Trump (see Table 6). Note that PRESUMM summaries are longer due to default model settings. Again, we control for these discrepancies by normalizing all feature function frequencies by summary length.
The distributions of similarity scores for Trump → Biden and Biden → Trump summaries are shown in Figure 5. PROPHETNET summaries have the largest variance in similarity scores, suggesting that it is highly sensitive to the change in entities.
Factors driving summary dissimilarity: Table 7 shows the differences between key words in HIGHSIM and LOWSIM articles for all models. Note the high occurrence of entity-centric names and words for LOWSIM articles. "Impeachment" articles lead to low-similarity summaries for Trump → Biden; "infrastructure" articles likewise lead to dissimilar summaries for Biden → Trump. Meanwhile, the key words "coronavirus"/"COVID-19" and "police" correspond to HIGHSIM. The reason is likely similar: Trump and Biden are more likely to be the focal point of articles on impeachment and infrastructure, respectively, while COVID-19 and police may encompass many perspectives, diluting the influence of these politicians.

Fightin' words of S_i^{e1} and S_i^{e2} and hallucinations: We use the Fightin' Words algorithm (Monroe et al., 2009) on the original vs. replaced summaries. Table 8 shows the results for Trump → Biden and Biden → Trump replacements.
There is strong evidence of hallucination among the abstractive models, particularly of journalist and reporter names (e.g., "Ruben Navarrette"). We verified that these names appeared in the summary but not in the source article or its list of authors. For example, this PEGASUS summary was generated for a Trump-only Columbus Dispatch article (that includes no mention of Navarrette): The Supreme Court shot down the Trump administration's efforts to undo the Obama-era Deferred Action for Childhood Arrivals program. Ruben Navarrette: DACA is a temporary solution to a broader problem. He says Congress must get its act together and take an obvious step in the national interest. Navarrette: An overwhelming majority of Americans believe the government should leave the Dreamers alone and craft a path to citizenship.

C Notable Examples
The following examples are for the replacements Trump → Biden and Biden → Trump only.

Entity name frequency: Table 9 shows examples of summaries that differ in frequency of entity name mentions; in particular, Trump's name is de-emphasized compared to Biden's.

"Vice President": Table 10 shows examples of summaries that differ in usage of "Vice President", in particular associating the title with Joe Biden.

"Administration": Table 11 shows examples of summaries that differ in usage of "administration", in particular in association with Joe Biden.

Table 8: Fightin' words for S_i^{e1} vs. S_i^{e2}, with z-scores in parentheses (Monroe et al., 2009). Purple highlight corresponds to the original summaries S_i^{e1}, while orange highlight corresponds to the replaced summaries S_i^{e2}. Example entries: president (-11.014), kohn (-9.294), the (-8.259), said (-6.787), white (-5.547), sally (-5.445), namibia (-5.275), house (-4.849), wal (-3.741).

BART, Trump → Biden source. Original summary: Tony Schwartz helped write "Trump: The Art of the Deal". He says he's never seen the president more frightening, out of control and disconnected with reality than he is now. Schwartz: "We are in a relentless gaslighting in which he does lie and deceive multiple times a day". Replaced summary: Tony Schwartz helped write "Biden: The Art of the Deal". He says Biden is "in full gaslighting mode & willing to say anything to survive". He attributes Biden's absence of conscience and empathy to sociopathy. Schwartz says social media companies should respond to Biden's peddling of manipulated content.

PEGASUS, Biden → Trump source. Original summary: The White House is signaling to U.S. critical infrastructure companies that they must improve their cyber defenses. U.S. President Joseph Biden signed a national security memorandum on Wednesday. Biden warned that if the United States ended up in a "real shooting war" with a "major power" it could be the result of a significant cyber attack on the United States. Replaced summary: The White House is signaling to U.S. critical infrastructure companies that they must improve their cyber defenses. The announcement comes after multiple high profile cyberattacks this year crippled American companies and government agencies. Almost 90% of critical infrastructure is owned and operated by the private sector.

PROPHETNET, Trump → Biden source. Original summary: president donald trump has called for vote counting to stop in the 2018 midterm elections. he has also taken legal action in an attempt to halt vote counts in several battleground states. the president's comments have sparked "count the vote" protest marches across the country. Replaced summary: biden campaign has taken legal action in an attempt to halt vote counts in several battleground states. biden supporters have gathered in both states, calling to "stop the count". biden's comments have sparked protest marches across the country.

PROPHETNET, Biden → Trump source. Original summary: alexander hillel treisman is accused of plotting to kill vice president joe biden. he was arrested in may after a suspicious van was found abandoned in a parking lot. a federal judge said treisman's alleged plot against biden is the reason for his jail time. he is also charged with child pornography charges. Replaced summary: alexander hillel treisman was arrested in late may. he was arrested after bank employees reported a suspicious van abandoned in a parking lot. treisman is accused of plotting to kill donald trump. he is being held on child pornography charges.

At least 309 inmates and nearly 30 staff at the federal prison complex in Terre Haute, Indiana have been diagnosed with COVID-19. Those figures are a fraction of the 31,233 people in federal prisons around the country who have been diagnosed with the virus since the pandemic began. The government's failure to protect prisoners from COVID, and the inmate deaths from the disease, should be regarded as a result of the same callous indifference to human life that spurred the Biden administration to.

PEGASUS, Biden → Trump source. Original summary: Biden indicated he plans to lay out his proposed aid package in detail this week - and that it will include $2,000 stimulus checks. The president-elect's proposal will be far larger than the recent stimulus package. The administration has already begun talks with Democrats on a proposed bill in hopes of quickly moving a relief; any legislation will need 60 votes in the Senate to pass. Replaced summary: In late December, lawmakers passed a $900 billion coronavirus relief bill that expanded unemployment benefits and provided $600 checks for most Americans. With Democrats winning control of the U.S. Senate, it's almost certain more coronavirus stimulus aid will be a top priority. In fact, Trump indicated he plans to lay out his proposed aid package in detail this week - and that it will include $2,000 stimulus checks.

BART, Trump → Biden source. Original summary: Kevin Burns: The Capitol Hill tragedy was like an aircraft accident. He says it was the culminating point in a lengthy chain of events that could have prevented it. Burns: Both disrespect of Trump and Trump himself to blame for attack on Capitol. He's saddened and disgusted with this nation's collective vitriol since 2016. Replaced summary: Kevin Burns: Jan. 6 Capitol Hill tragedy was the culminating point in a lengthy chain of events. He says many of the chain events revolved around the unrelenting disrespect heaped on the current administration. Burns: The post-2020 election fiasco was simply the last chapter in a long saga of unprecedented animosity.

BART, Biden → Trump source. Original summary: This month marks the fifth anniversary of the Paris Climate Agreement. Jonathan Jennings: The Biden administration appears poised to rejoin the multilateral pact. Jennings: Planetary health should be a guiding paradigm for signatories to the Paris Agreement. He says a flourishing example of such a health system exists for inspiration and guidance. Replaced summary: This month marks the fifth anniversary of the Paris Climate Agreement. Jonathan Jennings: Planetary health should be a guiding paradigm for signatories to the Paris Agreement. Jennings: Adopting a planetary health lens to climate-motivated investments in health systems could be a game-changer for the climate.

Figure 3: Number of articles by month. The quantity of "Trump only" articles declines after the inauguration in January 2021, while "Biden only" articles increase.

Figure 5: Distribution of similarity scores across models. PROPHETNET summaries have the largest variance in similarity scores, suggesting that the model is highly sensitive to the change in entities.

PROPHETNET, Trump → Biden source. Original summary: automakers around the world are shutting assembly lines because of a global shortage of semiconductors. automakers around the world are shutting assembly lines because of a global shortage of semiconductors. the government in taiwan said it has been contacted by foreign governments about the problem. Replaced summary: automakers around the world are shutting assembly lines because of a global shortage of semiconductors. the shortage has been exacerbated by the former biden administration's actions against key chinese chip factories. taiwan's ministry of economic affairs has asked local tech firms to provide "full assistance".

PROPHETNET, Biden → Trump source. Original summary: biden received his second covid-19 vaccine on monday. he received the injection in his home state of delaware. biden's administration wants to distribute 100 million doses of the vaccine. Replaced summary: president-elect donald trump got his second covid-19 vaccine on monday. trump received the first dose of the vaccine on dec. 21. the u.s. has distributed 8 million vaccine doses since dec. 14.

now has to undo the damage his predecessor has done to our national psyche. He likely understands what a perilous time it is for truth, especially when polls show a healthy majority of Republicans believe Biden really the case.

Figure 1: Illustration of our experimental design. Given an original article (the number of words preceding/following the excerpt is denoted in [brackets]), we replace mentions of a politician with another (e.g., Biden → Trump). Original: ...[83]... Joe Biden has been inaugurated as the 46th president of the United States. The national nightmare that began on Jan. 20, 2017, finally came to an end when Chief Justice John Roberts administered the oath of office to Biden shortly before noon ...[595]... Replaced: ...[83]... Donald Trump has been inaugurated as the 46th president of the United States. The national nightmare that began on Jan. 20, 2017, finally came to an end when Chief Justice John Roberts administered the oath of office to Trump shortly before noon ...[595]... This example shows that the model generates different summaries upon entity replacement in the source article.

Table 3: Biden is more likely to be associated with "Vice President" in nearly all cases. The consistent ↓ arrows for Biden → e2 show that replacing Biden's name with e2 in the source article results in summaries with significantly lower inclusion frequencies of "Vice President"; the phenomenon is consistent for Trump → Biden.

Table 4: Trump is less likely to appear with the word "administration" in all abstractive summaries, suggesting that those models hold a less unified view of him with respect to the US government than the other presidents. The consistent ↑ arrows for Trump → e2 show that replacing Trump's name with e2 in the source article results in summaries with significantly more mentions of "administration"; this observation is consistent for Biden → Trump.

This result suggests that Trump's actions may not be representative of those of the broader US government; in other words, these models encode a more individualistic representation of Trump compared to other presidents. Examples of these results are in Appendix C.

Table 5: Mapping between names for the entity replacement. If a name form has multiple mappings (e.g., "Donald Trump" → {"Joe Biden", "Joseph Biden"}), the first name in the list is used as the default value.

Table 6: Summary length t-test results. Abstractive summaries tend to be longer for Trump.

Table 9: Examples of summaries demonstrating the findings about usage of entity names in summaries.

Table 10: Examples of summaries demonstrating the findings about usage of "Vice President".

Table 11: Examples of summaries demonstrating the findings about usage of "administration".