Characterizing News Portrayal of Civil Unrest in Hong Kong, 1998–2020

We apply statistical techniques from natural language processing to a collection of Western and Hong Kong–based English-language newspaper articles spanning the years 1998–2020, studying how their portrayals of civil unrest differ and evolve. We observe that both content and attitudes differ between Western and Hong Kong–based sources. ANOVA on keyword frequencies reveals that Hong Kong–based papers discuss protests and democracy less often. Topic modeling detects salient aspects of protests and shows that Hong Kong–based papers made fewer references to police violence during the Anti–Extradition Law Amendment Bill Movement. Diachronic shifts in word embedding neighborhoods reveal a change in the characterization of salient keywords once the Movement emerged. Together, these findings raise questions about anodyne reporting in Hong Kong–based media and illustrate the importance of sample selection for protest event analysis.


Introduction
In an era where movements against entrenched power structures are both widespread and well documented, we can conduct computational analyses of language to guide, support, and challenge hypotheses about unrest and its discussion in mainstream written media sources. We direct these tools to analyze portrayals of protest and unrest in Hong Kong over a period of 22 years.
Public protests in Hong Kong date back to British colonial rule and have evolved from the bloody riots of the 1960s to the protests of 2019–2020, when up to two million people took to the streets over an extradition bill. They feared it would make Hong Kong's inhabitants subject to China's legal system in violation of the Basic Law (https://www.basiclaw.gov.hk/en/basiclaw), which guarantees that Hong Kong's capitalist system, judicial independence, and existing civil and political liberties will remain unchanged until 2047. Hong Kong protests captured the world's attention with defiant crowds commemorating the 1989 Tiananmen Square incidents and the July 1, 1997 transfer of sovereignty from the UK to China, and with students blockading roads in the Admiralty district while doing their homework during the pro-democracy Umbrella Movement in 2014 (Weiss and Aspinall, 2012). Over time, the instability created by the protests has become a threat to the credibility of Hong Kong as a financial hub and to the possibility of applying the principle of one country, two systems beyond Hong Kong and Macau (Overholt, 2021).
We apply a host of techniques from natural language processing to mark inconsistencies in event characterizations, analyzing news articles related to episodes of civil unrest between 1998 and 2020 in both Western- and Hong Kong-based English-language newspapers. In the volatile context of Hong Kong politics, newspapers' tendency to report dramatic rather than ordinary events may encourage reporting bias that either emphasizes or undermines the legitimacy of the protests or the legitimacy of the regime against which the protests are directed (Snyder and Kelly, 1977; Earl et al., 2004; Schrodt et al.).
Our contributions are manifold. Foremost, our work is novel amongst work on protests and natural language due to the expanse of our time horizon. Second, we characterize crucial differences in Western- and Hong Kong-based portrayals of protest: statistically significant differences in protest-related lexical choice (§5.1), reinforced by differences in treatment of democracy and police violence (§5.2), though with no major differences in sentiment (§5.4). Third, we find several key points where coverage differs (§5.2), including a major shift in the notion of "confrontation".

Related Work
Content analysis (Berelson, 1952), in general, is a set of non-invasive techniques for studying communication artifacts such as documents, photographs, and recordings. Computational methods have supercharged content analysis by complementing subject matter expertise with the potential for massive scale. Lucy et al. (2020) consider the content of United States history textbooks in Texas, using word embedding similarity, topic models, and dependency parsing to generate clues toward differing portrayals of race and gender. Field et al. (2018) relate the content of Russian state-run news articles to the nation's economic performance, finding an agenda of distraction through the framework of Granger causality (Granger, 1988). Other attempts at content analysis and stylometry consider authorship (Mosteller and Wallace, 1984; Bergsma et al., 2012), native language identification (Koppel et al., 2005; Bergsma et al., 2012), and deceptive communication in reviews (Ott et al., 2013).
With the advent of fast-paced 'social' media, recent work (De Silva and Riloff, 2014;Alsaedi et al., 2017;Sech et al., 2020) has aimed to characterize unrest through Tweets, short communiques on the platform Twitter.
Within the specific focus of protests, the closest work to ours in longitudinal scope is Papanikolaou and Papageorgiou (2020), whose 541 thousand news articles (albeit not all about protest) reflect Greece from 1996 to 2014; other similarly broad-scale work is rare. Wueest et al. (2013) apply topic models and named entity recognition to protest event analysis. The CLEF 2019 ProtestNews shared task asked participants to perform event extraction, even in news articles about a country outside of the training set. The organizers report consistent drops in performance after this shift. Inverting this, our work calls into question different views on protest in the same location.

Data
We collected a corpus of news articles from six Western-based English-language newspapers: The New York Times, The Wall Street Journal, The Washington Post, The Financial Times, The Guardian, and The Times; and two Hong Kong-based English-language newspapers: The China Daily and The South China Morning Post, covering multiple incidents of protest that took place between January 1998 and June 2020. The newspapers were purposefully selected because they are English-language newspapers; the selection ensures newspaper diversity within Western- and Hong Kong-based newspapers to allow for insights into differences across cultures.
The articles were collected through keyword-based searches in ProQuest Newspapers for the Western English-language newspapers, and NewsBank Access World News Research Collection for the English-language Hong Kong newspapers. The keywords used in the search were "Hong Kong" + "protests", "Hong Kong" + "rallies", "Hong Kong" + "marches", and "Hong Kong" + "riots". We used the East Coast editions for The New York Times and The Wall Street Journal; the UK editions for the Financial Times, The Guardian, and The Times; and the overseas edition for China Daily (which is run and printed in Hong Kong). To be eligible for collection, articles had to be at least 300 words long.
We manually screened the collected articles to eliminate irrelevant items such as duplicates within each publication, readers' letters, and articles that included any of the chosen search keywords but whose content was not about the protest incidents.
Following the manual screening, we retained 4676 articles, with a mean length of 782 tokens.

Method
We aim to contrast the treatment of civil unrest in Hong Kong, both across news sources and over time. Here we outline four techniques to suit this purpose: analysis of word choice with ANOVA, analysis of word clusters with latent Dirichlet allocation, analysis of word usage with embedded neighborhood shifts, and analysis above the word level with sentiment analysis.

Comparing lexical frequency
Word frequency exposes obvious discrepancies in word choice and word usage. A lack of event-related keywords in contemporaneous articles from different newspapers may signal the omission of events in some of them.
Each source will have some degree of variation in keyword counts. An author's voice accounts for some mismatch in frequency, but not all. It is therefore challenging to determine whether the distribution of keyword counts is due to pure chance or something more meaningful. Analysis of variance (ANOVA) is a sampling-theory-based method for comparing the means of a quantitative response variable when the explanatory variable is categorical (Agresti, 2017). A statistically significant p-value supports the conclusion that the group means differ. According to Agresti (2017), ANOVA is analogous to regression with a continuous response variable and a categorical explanatory variable.
We apply ANOVA to our corpus to determine important differences in frequencies. We first select 19 keywords of interest related to Hong Kong protests. Then, for one keyword at a time, we 1) split the corpus in two by some categorical attribute, 2) obtain the keyword's frequency in each article of both sub-corpora, and 3) apply ANOVA to establish whether our categorical variable is associated with a variation in frequency. In this work, we use the location of the article's publisher as the categorical variable.
This statistical analysis cannot, however, reveal the motive for a difference in lexical choice. It merely raises the question to subject matter experts.
It then befalls those experts to determine whether the difference arises due to intentional omission, niceties of a newspaper's style guide, or some other feature.
ANOVA uses the F-test to check equality of the word frequencies in each group. We set a significance level of α = 0.05 and employ the Bonferroni correction (Dunn, 1961).
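As a minimal sketch of this test (the per-article counts below are invented for illustration; in practice one would use a statistics library such as scipy.stats.f_oneway to obtain p-values), a one-way ANOVA with two groups reduces to a ratio of between-group to within-group variance:

```python
def one_way_anova_f(group_a, group_b):
    """One-way ANOVA F-statistic for two groups of per-article
    keyword frequencies (two groups -> numerator df = 1)."""
    all_obs = group_a + group_b
    grand_mean = sum(all_obs) / len(all_obs)
    means = [sum(g) / len(g) for g in (group_a, group_b)]
    # Between-group sum of squares, df = k - 1 = 1
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip((group_a, group_b), means))
    # Within-group sum of squares, df = N - k
    ss_within = sum((x - m) ** 2
                    for g, m in zip((group_a, group_b), means)
                    for x in g)
    return (ss_between / 1) / (ss_within / (len(all_obs) - 2))

# Hypothetical per-article counts of "protest" in two publisher groups
western = [1, 2, 3]
hong_kong = [3, 4, 5]
f_stat = one_way_anova_f(western, hong_kong)  # -> 6.0
```

With 19 keywords tested, the Bonferroni correction compares each p-value against α/19 ≈ 0.0026 rather than 0.05.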
We also attempted to identify discrepancies between the words used by different subsets of articles using a weighted log-odds ratio (Monroe et al., 2017) with an informative Dirichlet prior (following Jurafsky et al., 2014; Field et al., 2018; Lucy et al., 2020), with mixed results. We omit this from later discussion.

Topic modeling
Topic modeling characterizes documents by the topics they contain, automatically identifying the topics from corpora. We use latent Dirichlet allocation (LDA; Blei et al., 2003) for our topic models. It is a probabilistic generative model that maintains distributions over the words within each topic and the topics within each article, representing each article in the traditional vector space model (Salton et al., 1975). With LDA, we capture and convey the prevalence of various topics, so that we can contrast these across news sources and over time.
We perform topic modeling with MALLET (McCallum, 2002). To preprocess the articles, we lemmatize all tokens with WordNet's morphy feature (Miller, 1995). We also extract common bigrams. The resulting unigrams and bigrams were converted to term-document matrices and provided as inputs to MALLET. We created models with the number of topics ranging from k = 10 to 60, and evaluated the coherence of the resultant topics according to Mimno et al. (2011). We found that using 13 topics produced the highest coherence score. We then assigned each of these topics a label (see Table 2).
Our topic model represents each article as a mixture of topics. More prevalent topics have higher mixture weight, and the weights sum to 1 for each article. (In LDA, these can be interpreted as samples from a k-dimensional Dirichlet distribution.) We can estimate a topic's prevalence in a news source or year by averaging the topic's weight across the articles from that source or year.
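The prevalence estimate itself is a simple group average over the document-topic weights. A sketch, with invented topic weights and source labels (in practice the weights would come from the fitted MALLET model):

```python
def topic_prevalence(doc_topics, groups):
    """Average each topic's mixture weight over the articles in each group.

    doc_topics: per-article topic-weight vectors (each sums to 1)
    groups:     parallel list of group labels (e.g., source or year)
    """
    sums, counts = {}, {}
    for weights, g in zip(doc_topics, groups):
        if g not in sums:
            sums[g], counts[g] = [0.0] * len(weights), 0
        sums[g] = [s + w for s, w in zip(sums[g], weights)]
        counts[g] += 1
    return {g: [s / counts[g] for s in sums[g]] for g in sums}

# Two hypothetical articles per source, three topics each
doc_topics = [[0.6, 0.3, 0.1], [0.4, 0.5, 0.1],   # Western-based
              [0.1, 0.2, 0.7], [0.3, 0.2, 0.5]]   # Hong Kong-based
prevalence = topic_prevalence(doc_topics, ["west", "west", "hk", "hk"])
# prevalence["west"] -> [0.5, 0.4, 0.1]; prevalence["hk"] -> [0.2, 0.2, 0.6]
```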

Comparing lexical usage
Complementary to the previous methods, which consider which words are used, we investigate how words are used, and how that usage differs both across the Western/Hong Kong split and over time.
Diachronic shifts in word usage are often identified with changes in words' neighborhoods in an embedding space (Hamilton et al., 2016; Gonen et al., 2020). For instance, Hamilton et al. (2016) used these to find a shift in the word "broadcast" from agricultural to television contexts between the 1850s and 1900s. A word embedding model seeks to assign similar vectors (measured by dot product) to words in similar contexts, and different vectors to words in different contexts. If the usage of a word changes, then this should be reflected in changes to the word's context and consequent changes in the word's embedding.
We re-implement and extend the difference-in-usage model of Gonen et al. (2020), which measures how the contexts of words differ.
1. Partition the corpus C into C_a and C_¬a based on the attribute of interest a.
2. Fit separate word embedding models, M_a and M_¬a, one for each partition.
3. Select a keyword w of interest.
4. Obtain the sets of nearest neighbors NN_a(w) and NN_¬a(w) of w according to M_a and M_¬a, respectively.
5. Score the usage change of w as the size of the intersection, |NN_a(w) ∩ NN_¬a(w)|.
After this process, if w is used differently based on the presence or absence of the attribute, we expect its score to be quite small. Words whose usage does not depend on the attribute will have similar neighborhoods in each split.
To extend the work of Gonen et al. (2020), we contextualize the similarity score of a given word against a reference set. Considering all words that occur at least 100 times, in which percentile does w's similarity score fall? We find this to be more meaningful than the raw similarity score.
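The scoring and percentile procedure can be sketched as follows; the toy vectors stand in for the two fitted word embedding models (in practice, e.g., Word2Vec models trained on each sub-corpus):

```python
import math

def nearest_neighbors(model, word, k):
    """Top-k neighbors of `word` by cosine similarity, where `model`
    maps words to vectors (a stand-in for an embedding model)."""
    def cos(u, v):
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return sum(a * b for a, b in zip(u, v)) / norm
    scored = sorted(((cos(model[word], vec), w)
                     for w, vec in model.items() if w != word), reverse=True)
    return {w for _, w in scored[:k]}

def usage_change_score(model_a, model_b, word, k=1000):
    """Neighborhood overlap across the two partitions: a small
    intersection signals a large change in usage."""
    return len(nearest_neighbors(model_a, word, k) &
               nearest_neighbors(model_b, word, k))

def percentile(scores, word):
    """Percentile of `word`'s score among a reference set of scores."""
    return 100.0 * sum(s <= scores[word] for s in scores.values()) / len(scores)

# Toy models: "protest" keeps its neighborhood in A but shifts in B
model_a = {"protest": (1.0, 0.0), "rally": (0.9, 0.1),
           "march": (0.8, 0.2), "bank": (0.0, 1.0)}
model_b = dict(model_a, protest=(0.0, 1.0))
assert usage_change_score(model_a, model_a, "protest", k=2) == 2  # no shift
assert usage_change_score(model_a, model_b, "protest", k=2) == 1  # shift
```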
We focus on three splits and apply the same methods of analysis to each. For the first split, we divide the corpus by the location of the source. For the second, we consider whether 2019–2020 marks a turning point in media coverage of the protests. For the third, we investigate whether June and July, high points of the 2019–2020 protests, mark any shifts in media coverage. For all splits, we calculate the scores of words that appear at least 100 times in both sub-corpora, then use those scores to calculate the percentile of a given keyword's score. This makes relative scores easier to compare.

Sentiment analysis
Sentiment analysis measures the attitude of an author from the tone and connotations of their document. While it may be performed with handcrafted sentiment (valency) lexica (Mohammad, 2018), we select a technique that is robust to the specific words that are chosen: a BERT-based model that classifies a given sentence as positive or negative, chosen for its near-state-of-the-art sentiment classification performance.
We treat sentiment as a binary attribute (+, −) and use a probabilistic classifier trained on the Stanford Sentiment Treebank (SST-2; Socher et al., 2013). The model uses DistilBERT (Sanh et al., 2019) for feature extraction from text; DistilBERT has previously been used for sentiment analysis of product reviews (Büyüköz et al., 2020). We split each article into sentences, then classify each sentence. An article's sentiment is taken as the average sentiment over all of its sentences.
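The aggregation step is independent of the underlying classifier; here is a sketch with a stand-in sentence classifier (a toy lexicon, purely illustrative; the setup described above would instead call a DistilBERT model fine-tuned on SST-2, e.g., via the transformers sentiment-analysis pipeline):

```python
def article_sentiment(sentences, classify_positive):
    """Fraction of an article's sentences labeled positive by
    `classify_positive` (sentence -> bool)."""
    labels = [classify_positive(s) for s in sentences]
    return sum(labels) / len(labels)

# Toy lexicon classifier as a placeholder for the sentence-level model
POSITIVE = {"peaceful", "hope", "support"}
NEGATIVE = {"violence", "clash", "teargas"}

def toy_classifier(sentence):
    words = set(sentence.lower().split())
    return len(words & POSITIVE) >= len(words & NEGATIVE)

article = ["Protesters marched in a peaceful rally .",
           "Police responded with teargas amid the clash ."]
score = article_sentiment(article, toy_classifier)  # -> 0.5
```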
While this sentiment score obscures the reason for the author's attitude (Were they opposed to the protests, or opposed to the police response?), it still provides coarse-grained evidence of stylometric differences between news sources.

Results and Discussion
In this section, we analyze and give historical context for the results of the four techniques from §4.

Comparing lexical frequency
The ANOVA results in Table 1 show that 15 of our 19 selected keywords have statistically significant differences in frequency. The five keywords with the highest F-statistics, in descending order, are "democracy", "protest", "protests", "freedom", and "occupation". We find consistent suppression of discussion of protests in Hong Kong-based sources. The high F-statistics of "protest" and "protests" imply a disparity in the coverage of protests. Figure 2 shows that the median frequency of "protest" is lower in Hong Kong-based media sources than in Western-based sources.
In conjunction with the following subsection's findings on the prevalence of the democracy movements topic, the high F-statistics of "democracy" and "freedom" suggest that discourse about democracy is much more common in Western-based sources than in Hong Kong-based sources.

Topic modeling
Table 2 shows the most prominent words for the 13 topics we identified in §4.2. Figure 5 shows the evolution of topics over time, revealing that at several key points in Hong Kong's history, Western-based and Hong Kong-based sources wrote about different topics. This is not entirely unexpected, for reasons including a media organization's desire to appeal to its own readership and thereby retain loyal readers. The local nature of Hong Kong-based coverage may also play a role, as suggested by the pervasiveness of the Marches/Rallies topic and the Bill topic in Hong Kong-based media compared to their presence in Western-based media. Hong Kong-based newspapers may have reported any marches or rallies that took place between 1998 and 2020, whereas Western-based newspapers may have focused only on landmark ones, such as those organized around the anniversaries of the July 1 Handover or the June 4 Tiananmen Square incidents. As for the Bill topic, Hong Kong-based media coverage peaks in 2010, when the Legislature debated a number of legal initiatives, whereas Western-based coverage of the same topic remained relatively stable and much lower over time.
The topics reflect known events in Hong Kong's history; spikes in the students/schools topic track the Scholarism movement and its resurgence during the 2014 Umbrella Revolution. Several spikes emerge around discussions of the election process for Hong Kong's chief executive. However, at key points, Western-based newspapers reported police violence to a far greater extent than Hong Kong-based media.

Comparing lexical usage
The methods from §4.3 reveal semantic divergence in certain keywords between Western-based and Hong Kong-based news sources. We also find that June–July 2019 is a turning point, after which the meaning of several keywords shifts for at least the remainder of 2019.

Western-based vs. Hong Kong-based sources
We divide the data by the location of each article's publisher. Corpus C_West is composed of all 711 articles published by Western-based sources. Corpus C_HK is composed of all 3464 articles published by Hong Kong-based sources.
We then trained Word2Vec models on both corpora. Despite the relatively small size of C_West, a visual inspection of the resulting Word2Vec model shows sound performance. We then scored each keyword in Table 1 and compared each model's nearest neighbors.
We observe noticeable semantic differentiation between the two models for several keywords. For example, "resistance" has an unexpectedly low score. In comparison to the scores of all words that appear more than 100 times in both corpora, the score of "resistance" is only in the 17th percentile.
A visual inspection of the term's nearest neighbors in the Western-based model suggests an association with the feelings of protesters (e.g., "frustration", "anxiety"). In contrast, the nearest neighbors of "resistance" in the Hong Kong-based model relate to adversarial behavior. This is evidence of dichotomous framing of anti-government demonstrators.
Authors commonly employ the words "tension" and "severe" to describe protest events and confrontations. Both had low similarity scores, with the former in the 1st percentile and the latter in the 7th percentile. This is evidence of high semantic divergence between Western and Hong Kong-based news sources in their use of polarizing framing.
Curiously, "protest" scored in the 91st percentile. We attribute this high score partly to the low prevalence of the word in Hong Kong-based sources, which may itself betray self-censorship. We interpret the finding to mean that, where "protest" does occur, its context is not dissimilar across our two corpora.
Before vs. after July 2019
Here, we sought to quantify the degree to which the introduction of the Fugitive Offenders amendment bill acted as a pivotal moment in the style of newspapers' portrayal of the Hong Kong protesters.
We again obtain the scores of words with a frequency higher than 100 in both corpora to contextualize our keywords' scores. We find that "resistance" again has a low score, and therefore high semantic shift. We inspected its nearest neighbors in each model and saw that the term became associated with dissent in the months after July 2019.
We note a similar trend for "confront" (9th percentile) and "confrontation" (11th percentile). After July 1, confrontations became associated with "provocative", "battles", and "mayhem". These changes may be suggestive of how English-language Hong Kong-based newspapers intended to shape the international understanding of what was happening in Hong Kong, favoring the inclusion of strong and negative terms to portray the 2019 street protests.

Sentiment analysis
We find a consistent pessimism across news sources: all display positive sentiment in only 30% to 40% of their content. While no clear-cut relationship can be established between whether an article is from a Western source and its sentiment, Hong Kong-based sources are more negative. There is, however, internal variation. The China Daily with

Conclusion
We show that techniques from natural language processing can guide, answer, and suggest questions in social science. While past work focuses on single movements or eras, we characterize the portrayal of civil unrest in Hong Kong over a period of 22 years. Using a curated and manually filtered corpus of 4512 articles from Western-based and Hong Kong-based newspapers, we identified clear differences in framing both across time and between Western-based and Hong Kong-based newspapers.
Our approaches shed light on the ways in which Western and Hong Kong-based portrayals have evolved over time. For instance, while both discussed the Scholarism movement's rise to prominence in 2012 in roughly equal proportions, the discussion of police violence was much more prominent in Western sources than in Hong Kong-based sources. Similarly, Western-based sources are far more likely to discuss protests than Hong Kong-based sources. This has implications for the extraction of protest-related events from corpora with politically opposed sources such as ours. Further, July 1, 2019 marked a turning point across Western and Hong Kong-based sources in the characteristics of usage for confrontation-related vocabulary.
The efficacy of event extraction models presupposes that the event in question is discussed in the considered collection of documents. In characterizing significant differences in portrayal across news sources, we urge that a critical eye be applied to the data selection process. We are working to quantify the degree to which event extraction systems are stymied by content and framing differences.
Finally, we have binned our articles at the granularity of years for much of our analysis. This blends news coverage leading up to unrest and portrayals of it afterward. Is it possible that language in news media causes (or at least, Granger-causes) protest sizes? Future work will more precisely measure differences in news content and framing around flashpoints of civil unrest.