Analysing State-Backed Propaganda Websites: A New Dataset and Linguistic Study

This paper analyses two hitherto unstudied sites sharing state-backed disinformation, Reliable Recent News (rrn.world) and WarOnFakes (waronfakes.com), which publish content in Arabic, Chinese, English, French, German, and Spanish. We describe our content acquisition methodology and perform cross-site unsupervised topic clustering on the resulting multilingual dataset. We also perform linguistic and temporal analysis of the web page translations and topics over time, and investigate articles with false publication dates. We make this new dataset of 14,053 articles publicly available, annotated with each language version and additional metadata such as links and images. The main contribution of this paper for the NLP community lies in the novel dataset, which enables studies of disinformation networks and the training of NLP tools for disinformation detection.


Introduction
Coordinated, state-backed disinformation operations have become an increasing problem in recent years, particularly surrounding the war in Ukraine (Morkūnas, 2022). In September 2022, a sophisticated network of doppelganger websites (impersonating genuine news sites from across Europe) was discovered by EU DisinfoLab (Alaphilippe et al., 2022) and later expanded on in a report from Meta (Nimmo and Torrey, 2022). Among these was also a small number of conventional false news sites.
The focus of this study is on two related disinformation sites in particular: Reliable Recent News (RRN, https://rrn.world, formerly called Reliable Russia News at rrussianews.com) and War On Fakes (WoF, https://waronfakes.com). Both sites are multilingual, publishing in Arabic, Chinese, English, French, German, and Spanish, and RRN additionally in Italian (WoF also has a separate Russian-language site). They have been promoted by Russian government sources, including being shared by Russian embassies (Maitland, 2022; Roache, 2022), and publicised by the official Twitter account of the Ministry of Foreign Affairs of Russia. We focus on these two "news" sources due to their links to the Doppelganger network, their potential to deceive unsuspecting citizens (compared to better-known propaganda sources such as Russia Today), and their prior exposure as disinformation spreaders (see Appendix A). Backovic and Walter (2023) investigated the ownership of WarOnFakes and stated it was operated by Russian journalist Timofey Vasiliev, a known affiliate of Russian propaganda groups, due to the presence of his name, email, and phone number on the website. However, they do not state precisely how they found this information, and do not attempt to establish a link between Vasiliev and RRN or the Doppelganger operation.
Hanley et al. (2022) included selected articles from WarOnFakes and nine other disinformation websites in an analysis of narratives spread on Reddit. In contrast, our dataset includes all WarOnFakes posts and extracts the full article content.
Propaganda is defined as content that intentionally influences opinion to advance its creators' goals (Bolsover and Howard, 2017). Numerous propaganda datasets have previously been created, with both document-level (Rashkin et al., 2017; Barrón-Cedeño et al., 2019) and span-level (Da San Martino et al., 2019b) technique annotations, using articles collected from multiple disinformation sites. At article level, classifiers using combinations of multiple linguistic representations based on style and readability outperform content representations (Barrón-Cedeño et al., 2019), whereas content-based transformer models such as BERT have seen use at span level (Da San Martino et al., 2019a). Detectors are often evaluated on single datasets, prompting concerns about generalisation (Martino et al., 2020).
We are not aware of any prior work including RRN, nor of any work which has released a complete dataset of a disinformation operation, including a detailed linguistic analysis.
Thus the contributions of this paper are: i) a new publicly available dataset of content from two state-backed disinformation websites; ii) a linguistic, topic, and temporal analysis of their articles; and iii) our open-source toolkit for processing site data and extraction of translations.

Data Collection
In March 2023 we used the WordPress REST API to obtain all posts from WoF and RRN. Each post was parsed to extract its text, removing non-article content (such as figure captions). The webpage of each post was then analysed to extract the different translations from the language picker. Our extraction tool supports the specific markup used by these two sites, but can be easily extended to support others. An example of an extracted article is shown in Appendix B.
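As an illustration, post listings like these can typically be retrieved by paging through the standard WordPress v2 endpoint. This is a minimal sketch assuming the default API layout; the function and parameter names are ours, not those of the paper's toolkit:

```python
import requests

def fetch_all_posts(base_url, per_page=100):
    """Page through a site's WordPress REST API post listing.

    Assumes the standard /wp-json/wp/v2/posts endpoint; the sites'
    actual configuration may differ.
    """
    posts, page = [], 1
    while True:
        resp = requests.get(
            f"{base_url}/wp-json/wp/v2/posts",
            params={"per_page": per_page, "page": page},
            timeout=30,
        )
        if resp.status_code == 400:
            # WordPress returns 400 when paging past the last page
            break
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        posts.extend(batch)
        page += 1
    return posts
```

Each returned post object includes the content, publication and modification timestamps, and the auto-incrementing ID used in the backdating analysis below.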
Publication and modification times, which are provided in GMT by the API, were also converted to Moscow local time for analysis, since it is believed that at least one of the sites is based in Russia (Backovic and Walter, 2023).
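The conversion can be done with the standard library alone; a minimal sketch, assuming timestamps arrive in WordPress's usual ISO-8601 form:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def to_moscow(gmt_string):
    """Convert an API timestamp (assumed ISO-8601, GMT) to Moscow time.

    WordPress typically returns timestamps like "2022-03-04T10:30:00";
    the exact format used by these sites' API is an assumption.
    """
    dt = datetime.fromisoformat(gmt_string).replace(tzinfo=timezone.utc)
    return dt.astimezone(ZoneInfo("Europe/Moscow"))
```

Moscow has been at a fixed UTC+3 offset (no daylight saving) since 2014, so e.g. `to_moscow("2022-03-04T10:30:00")` yields 13:30 local time.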

Topic Analysis
The articles were clustered using BERTopic (Grootendorst, 2022). We assume that whilst each article may discuss many topics, each sentence of an article is likely to discuss a single topic. Articles were split using spaCy's dependency-parse-based sentencizer, and sentences with fewer than 5 tokens were removed. The remaining sentences were embedded with the Sentence Transformers MPNet model (Reimers and Gurevych, 2019; Song et al., 2020). The dimensionality of each embedding was reduced using UMAP (McInnes et al., 2020) from 768 to 5, whilst keeping the structure of the higher-dimensional space. This is necessary to avoid the 'curse of dimensionality'.
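A minimal sketch of the splitting and length-filtering step. Note that we substitute a naive regex splitter for spaCy's dependency-parse-based sentencizer used in the paper, to keep the example self-contained:

```python
import re

def split_sentences(text):
    """Naive sentence splitter, a lightweight stand-in for spaCy's
    dependency-parse-based sentencizer used in the actual pipeline."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def filter_short(sentences, min_tokens=5):
    """Drop sentences with fewer than `min_tokens` whitespace tokens,
    mirroring the paper's 5-token threshold."""
    return [s for s in sentences if len(s.split()) >= min_tokens]
```

In the real pipeline the surviving sentences would then be passed to the MPNet encoder and UMAP before clustering.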
The 5d embeddings were clustered with HDBSCAN (Campello et al., 2013), which notably allows for embeddings not to be included in any cluster, preventing the overly broad clusters that result from forcing nearby but unrelated sentences in. It is expected that this produces a large number of outliers, since it is natural that many of the sentences in the articles will have meanings unrelated to any other. A minimum cluster size of 25 is set to prevent too many small clusters from being generated.
Keyword representations are generated by creating a bag-of-words vector of the unigrams and bigrams of each topic (excluding English stopwords), which is L1-normalised to account for cluster size. An adapted class-based TF-IDF is used to calculate the most significant words in each cluster. This representation is then fine-tuned by selecting keywords with a high Maximal Marginal Relevance, in order to maximise their diversity. The diversity parameter was set to 0.5. The top 3 most significant keywords are used to name the cluster.
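The class-based TF-IDF weighting can be sketched as follows. This is our approximation of the BERTopic formulation (Grootendorst, 2022), not the library's exact internals:

```python
import numpy as np

def class_tfidf(counts):
    """Minimal sketch of BERTopic-style class-based TF-IDF.

    `counts` is a (n_classes, n_terms) matrix of raw term counts per
    cluster. Each row is L1-normalised (to account for cluster size),
    then weighted by log(1 + A / f_t), where f_t is the total frequency
    of term t across classes and A is the mean token count per class.
    """
    counts = np.asarray(counts, dtype=float)
    tf = counts / counts.sum(axis=1, keepdims=True)   # L1 per class
    f_t = counts.sum(axis=0)                           # term freq across classes
    A = counts.sum() / counts.shape[0]                 # avg tokens per class
    idf = np.log(1.0 + A / np.maximum(f_t, 1e-12))
    return tf * idf
```

Terms concentrated in a single cluster receive high weights in that cluster's row; the MMR fine-tuning step then re-ranks the top terms for diversity.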
Each article is then labelled with the unique set of clusters assigned to its sentences.
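Aggregating sentence-level assignments to article labels amounts to collecting the non-outlier topics per article; a sketch with illustrative names (not the paper's actual code):

```python
def label_articles(sentence_topics, sentence_article_ids):
    """Aggregate per-sentence topic assignments to article level.

    Topic -1 denotes HDBSCAN outliers and is excluded, so an article's
    label is the set of topics assigned to its non-outlier sentences.
    """
    labels = {}
    for topic, article_id in zip(sentence_topics, sentence_article_ids):
        if topic == -1:
            continue
        labels.setdefault(article_id, set()).add(topic)
    return labels
```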

Article Backdating
In WordPress, article publication dates can be set to any given date; however, this does not affect the auto-incrementing IDs, which are generated in the order of article creation. Thus backdated articles can be detected by their IDs being higher than those of their following articles, when ordered by supposed publication date.
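This check can be implemented with a single reverse scan over the date-ordered posts; a sketch with an illustrative data layout:

```python
def find_backdated(posts):
    """Return IDs of posts that appear backdated.

    `posts` is a list of (post_id, published) tuples (the structure is
    illustrative). Ordered by claimed publication date, a post is
    backdated if its auto-increment ID exceeds that of some later-dated
    post, i.e. it was created after a post it claims to precede.
    """
    by_date = sorted(posts, key=lambda p: p[1])
    backdated = []
    min_later_id = float("inf")
    # Scan newest-to-oldest, tracking the smallest ID among later posts.
    for post_id, published in reversed(by_date):
        if post_id > min_later_id:
            backdated.append(post_id)
        min_later_id = min(min_later_id, post_id)
    return backdated
```

For example, a post with ID 5 dated earlier than a post with ID 3 is flagged, since ID 5 must have been created after ID 3.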

n-gram Frequency
Frequent 2-4-grams were extracted using NLTK, after tokenisation, lowercasing, and stopword and punctuation removal. N-gram frequency was calculated monthly, and the most frequent 10 n-grams per month were selected, excluding the phrase "armed forces", and excluding n-grams which are part of another, longer n-gram of equal frequency (e.g. removing "ukrainian armed" in favour of "ukrainian armed forces"). We include ties for 10th place.

Table 1 shows the number of articles per site and language, and mean token and sentence counts.
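The monthly n-gram counting step can be sketched as follows. For self-containedness we use a regex tokeniser and a tiny inline stopword set as stand-ins for NLTK's tokeniser and stopword list used in the paper:

```python
import re
from collections import Counter

# Stand-in stopword list; the paper uses NLTK's English stopwords.
STOPWORDS = {"the", "of", "to", "a", "in", "and", "is"}

def top_ngrams(text, n_range=(2, 4), k=10):
    """Count 2-4-grams after lowercasing and stopword/punctuation
    removal, returning the k most frequent."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in STOPWORDS]
    counts = Counter()
    for n in range(n_range[0], n_range[1] + 1):
        counts.update(tuple(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    return counts.most_common(k)
```

The subsumption filter (dropping an n-gram whose longer superstring has equal frequency) would then be applied to this ranked list.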

Article Frequency
Figure 1 shows the proportion of each language over time for each site. The first WoF article was published on 4th March 2022, and the first RRN article on the 11th. WarOnFakes has an unusual pattern of publication in its first few days, publishing sixty articles on the first day and an average of 34 articles/day over the first 7 days, whereas RRN published only 7 articles on day one and an average of 21 articles/day over the first week.
Generally, posts are published on weekdays, with only 9.5% of posts having publication dates and 7.0% having modification dates on a Saturday or Sunday. The week beginning 2nd January 2023, much of which consists of public holidays in Russia, has the lowest activity in the sites' history, with only 60 articles published on RRN and 19 on WoF. For comparison, the mean in other weeks is 200 (RRN) and 66 (WoF).
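The weekend proportion can be computed directly from the publication timestamps; a minimal sketch:

```python
from datetime import datetime

def weekend_share(dates):
    """Fraction of datetimes falling on Saturday or Sunday
    (Python's weekday() returns 5 for Saturday, 6 for Sunday)."""
    weekend = sum(1 for d in dates if d.weekday() >= 5)
    return weekend / len(dates)
```

The same grouping by ISO week number yields the per-week article counts reported above.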
Twenty-five identical articles were published on both sites, predominantly in March 2022, and in all but one case the article was a WoF-style debunk. They were not published simultaneously on the two sites, nor is it consistent which site published first.

Language Coverage
Only a small minority of posts (∼9.1%) are not available in English, and the majority of these do not have any translations at all, suggesting they are likely 'orphaned' translations. The mean number of available languages for a post is 4.1 ± 1.5 (1 std). All site-language pairs continued to be published until the end of the collection period, except Arabic and Spanish on WoF and Chinese on RRN, which stopped in July and October 2022 respectively. Spanish posts resumed in December 2022.

Topics
Amongst the 45,991 sentences in the English articles, 24,800 were considered outliers and 21,191 were assigned one of 144 topics. These topics ranged from broad, recurrent themes (e.g. #0, the donation of arms and aid to Ukraine) to more specific, time-limited ones (e.g. #139, the burning of the Quran by far-right activist Rasmus Paludan).
The mean number of topics assigned per article is 4.33 ± 2.66 (1 std). In the first week of the war in Ukraine (beginning 28th Feb 2022), the vast majority of articles are categorised as #2 (russian military, ukrainian telegram, telegram channels, according [to] ukrainian). These articles are all from WarOnFakes (since RRN did not start publishing until the following week) and claim that various evidence from the war in Ukraine is fake.
Of the 144 topics we identified, 126 were assigned to articles from both RRN and WoF, and only 18 were assigned to posts from just one of the two sites. This demonstrates the significant topical overlap between the sites. Further details and figures are provided in Appendix C.

LIWC Analysis
We use LIWC2015 (Pennebaker et al., 2015) to compare the linguistic properties of English RRN and WoF posts against the metrics for genuine New York Times (NYT) articles provided by Pennebaker et al. (see Table 2 and Appendix C.1). Emotional tone, which is on a scale of 0-100 (negative to positive), shows that RRN and WoF are written more negatively than real news, with WoF being even more negative than RRN. This is confirmed by the values for Affective Processes, which show that both sites use more emotion-laden words than the NYT. The sub-metrics show this is skewed towards negativity, particularly anger (where both sites have over double the proportion of anger-indicating words than the NYT).
All three sources focus most commonly on the present (e.g. words like "today", "is", "now"); however, RRN and WoF do so at a higher rate than the NYT. RRN and WoF also use more future-focus terms (e.g. "may", "will", "soon") than the NYT, and past-focus terms (e.g. "ago", "did", "talked") less frequently. This suggests that the content of RRN and WoF is more speculative than reputable journalism, and more focused on covering current events than past ones.
Table 3 shows the top 5 LIWC categories with the strongest correlation for each of the two sites. The strong correlation of colons and interrogatives for WoF is unsurprising, given its repeated use of the phrase "What's really going on:". RRN's correlation with conjunctions suggests it tends to use more complex sentences. The remaining attributes are below the 0.3 threshold of strong correlation. However, RRN is weakly correlated with personal pronouns, due to its tendency to cover individual politicians (see Table 6 in Appendix C), while WoF is weakly correlated with impersonal pronouns (i.e. one, you, they) as it tends to discuss groups, such as the Russian and Ukrainian armed forces (see Table 7 in Appendix C).

Article Backdating
Both sites tend to backdate non-English posts (by as much as 136 days in two cases; see Appendix C, Table 5), in order to make translations appear to have been published at a similar time to the original. The two most backdated articles are Spanish and Chinese translations of an English article, which were actually published 136 days later.
Our hypothesis for the backdating is that, due to limited resources, articles were only translated into a given language when this became necessary for a particular disinformation campaign. To convey timeliness, the translations were then backdated to the date of the original.

n-gram Analysis
Tables 6 and 7 in Appendix C show the top occurring n-grams per month for the respective websites. The most frequent "really going" n-gram on WoF is part of the phrase "What's really going on", which appears in all of its fact-check-style articles. The n-gram also appears frequently in the first month of RRN data, due to the articles copied from WoF. On WoF, the most frequent n-grams typically relate directly to the war in Ukraine itself ("russian troops", "ukrainian armed forces"), whereas on RRN they relate to the consequences of the conflict for the rest of the world ("united states", "russian gas"). Consequently, the most frequent n-grams on WoF are relatively constant across the different months, whereas RRN's n-grams change from one month to the next as they tend to be connected to current affairs. For example, the bigram "antirussian sanctions", referring to the damage allegedly caused to Western economies, enters the top 10 in June 2022 and remains the second most used bigram from July to September. Other terms demonstrate that RRN also covers some genuine news, e.g. "elizabeth ii" in September 2022 and "world cup" in November and December 2022.
Though to a much lesser degree, WoF also responds to specific, highly controversial events from the conflict. For example, in August 2022, in response to Ukraine and Russia blaming each other for the shelling of the Zaporizhzhia nuclear power station (https://reut.rs/46KWvTS), the n-grams "nuclear power" and "nuclear power plant" both appear with high frequency in WoF articles that promote the Russian perspective on these events.

Presence of Cyrillic Characters
A total of 178 articles were found to contain characters in the Cyrillic codepoint range (Table 4); these were manually examined to determine the reason.

Accidental Cyrillic: Incorrect usage of Cyrillic characters instead of the intended character in the Latin alphabet. For example, 11 times the "c" in Robert Habeck, a German politician, is actually the identical-looking lowercase Cyrillic Es (https://en.wikipedia.org/wiki/Es_(Cyrillic)).

Forgotten Cyrillic: Issues with translation where a Russian sentence was left in the article, with or without the target-language translation.

Intentional: Expected usage of Cyrillic characters, e.g. the name of a Russian organisation.

Unclear: We were unable to determine why the characters were used.
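Surfacing candidate articles for this manual examination reduces to a codepoint-range check; a minimal sketch:

```python
import re

# The basic Cyrillic Unicode block, U+0400 through U+04FF.
CYRILLIC = re.compile(r"[\u0400-\u04FF]")

def contains_cyrillic(text):
    """True if the text contains any character in the basic Cyrillic
    block, e.g. the lookalike Cyrillic Es in "Habeсk"."""
    return bool(CYRILLIC.search(text))
```

A visually identical Latin/Cyrillic pair like "Habeck" vs "Habeсk" is distinguished only by this codepoint test, which is why the lookalike errors required manual inspection to classify.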
Given that both RRN and WoF had forgotten Russian text in all languages, we hypothesise that all articles were originally written in Russian. Two Arabic articles on RRN contain the phrases "the translation is too long" and "save translation" in Russian, likely copied from a machine translation tool's UI, although we were not able to determine the specific tool used. Although this was only found in one language on one of the sites, it suggests the articles are more likely machine translated than human translated.

Future Work
There is much additional work which could be performed on this dataset. Although we identify the subjects of articles via topic clustering and n-grams, we do not attempt to identify the stance taken towards them. More complex topic analysis, such as identifying commonly co-occurring topics, would also be possible. Given the mixture of true and false posts on the sites, this dataset may be a useful resource for automated fact-checking, although this would require human annotation, and ground truth may be difficult to establish in the complex information environment of the war in Ukraine.

Conclusion
This paper presented an analysis of the Russian disinformation sites Reliable Recent News and WarOnFakes, including an analysis of the articles' topics, publication times, and linguistic properties. We show that the sites cover a diverse range of topics, and that their linguistic properties differ from those of reputable media. We analysed the presence of Cyrillic characters due to site operator errors, and the sites' practice of backdating articles, showing that a significant proportion of translations are falsely dated. This new multilingual dataset will facilitate further research in disinformation analysis and promote repeatability.

Limitations
Although our work provides a complete collection of WoF and RRN, since these two websites seem to be highly related, it is unsurprising that they tend to publish similar types of content. Therefore this dataset cannot be considered fully representative of all kinds of Russian disinformation. Nevertheless, it is complementary to overtly Russian state media, such as Sputnik and Russia Today. Unfortunately, due to the ban on accessing their content from the EU, we could not supplement the dataset from those sources or compare against them.
Our topic analysis model has not been formally validated, for example by comparing topics to those assigned by human or expert annotators. Some small-scale manual validation was performed in order to find good hyperparameters; however, this consisted of inspecting a small random sample of some of the categories. A particular area warranting validation in future work is examining the texts not assigned categories. These are only a very small number, and we aggregate sentence classifications at article level, which means that an article can be assigned the correct topics even if some of its sentences are not.
In our LIWC analysis, we compare to the New York Times data provided by Pennebaker et al. (2015). Although this is the closest source out of the provided LIWC baselines, the New York Times represents a more formal style of journalism than many online media. In future work we plan to compare these two disinformation sites against official state-affiliated news sources such as Russia Today.
Finally, we did not analyse the separate Russian-language edition of WarOnFakes. As it is a separate site in Russian only, there is no reliable way to connect its articles to their similar English-language versions (if such versions are published). Analysing the Russian WoF website is planned for future work, as it requires adaptation of the analysis to be bilingual, which is out of scope for this paper.

Ethics
The data collection was carried out in accordance with our institutional ethics policy.
Collection was via the WordPress API, followed by automated processing and a limited amount of manual analysis by the authors. No external volunteers or crowd-workers were recruited. Due to the disinformation nature of these two websites, the data may contain content which is disturbing or distressing. Therefore we limited the possibility of harm during analysis by: i) minimising the number of individual articles studied by the authors as much as possible; ii) where necessary, viewing only the text of articles, to avoid the possibility of viewing distressing media; iii) ensuring familiarity with supporting resources for researchers working with potentially disturbing content.
As the websites in question are not legitimate news websites, they do not have terms of use to allow or prohibit the acquisition of their content. We consider the collection and distribution of their articles to be in the public interest, due to the prominence of their disinformation and the harm that results from it. It is not feasible to contact them to obtain permission, as they have previously been unresponsive to enquiries. The dataset does not include images, as in many cases they appear to have been taken from stock agencies. This is a commonly used tactic by disinformation websites.
We have checked that the dataset does not contain personally identifiable information in the user data files, as all users have either generic (e.g."Admin") or random (e.g."UiXnZyvH") names.No user comments were available to collect.
It is possible that the process of creating a disinformation dataset increases the spread and prominence of the disinformation. We would argue that is not the case with this dataset as we: i) are only focusing on content from disinformation websites, the low credibility of which has already been widely publicised (see Appendix A); ii) are not increasing the longevity of disinformation narratives by preserving them after they have been taken down, since the two independent websites that publish them are still publicly accessible via all common search engines. Some articles make reference to individuals, albeit only public figures to our knowledge, and many contain narratives which are hateful towards individuals and groups. We encourage researchers who use this dataset to do so responsibly, and in particular to avoid highlighting specific individuals and to ensure that the disinformation narratives are presented alongside authoritative evidence of their untrue nature. We would like to specifically discourage the use of this dataset for training generative models that are capable of creating new disinformation. The dataset is released under a license which prohibits commercial activity.

Acknowledgements

This work was supported by grant 10039039 (approved under the Horizon Europe Programme as VIGILANT, EU grant agreement number 101073921). Freddy Heppell is supported by a University of Sheffield Faculty of Engineering PGR Prize Scholarship.

A Evidence of Disinformation
For WarOnFakes, there is a substantial number of articles and fact-checks establishing it as a disinformation source. PolitiFact undertook a review of over 380 of their fact-checks and found a significant number of falsehoods (https://www.politifact.com/article/2022/aug/08/how-war-fakes-uses-fact-checking-spread-pro-russia/). In an article by AFP via France24 (https://www.france24.com/en/live-news/20230216-fake-fact-checks-seek-to-obscure-russian-role-in-war), Roman Osadchuk, from the Atlantic Council's Digital Forensic Research Lab (DFRLab), is quoted as saying "Since Russia's invasion, the 'War On Fakes' initiative has become a powerhouse of spreading false debunks" and "It is an effective tool of state propaganda and disinformation". The Institute of Network Cultures describes it as "Kremlin-Sponsored Participatory Propaganda" (https://networkcultures.org/tactical-media-room/2022/07/22/weaponized-osint-the-new-kremlin-sponsored-participatory-propaganda/), and highlights connections between the Russian state and the website, including promotion from organisations under the Russian Ministry of Foreign Affairs, and on the Russian Ministry of Defence's Telegram channel. BBC Monitoring, the specialist media source analysis division of BBC News, states "Some of its fact-checks are genuine but most content is Russian talking points on the invasion which do not stand up to scrutiny".
The site has also been covered by EUvsDisinfo, DFRLab, the European Digital Media Observatory, and Media Bias/Fact Check.
RRN has received comparatively less attention from fact-checkers; however, it was described as disinformation by NewsGuard (Maitland, 2022), which additionally claims that it reuses content from WarOnFakes, and EU DisinfoLab have noted a connection in the hosting infrastructure of the two sites (Alaphilippe et al., 2022). It is therefore probable that the apparent state-backing of WarOnFakes also applies to RRN.

B Data Example
Figure 3 shows an example of an article published on WoF in English, French, Spanish, Chinese and Arabic. Full texts are omitted for languages other than English. This story was judged to be fake by fact-checkers. Usage of guillemets (« ») as quote marks is reproduced as returned by the WordPress API, but this appears to be normalised when the page is rendered.

C Detailed Dataset Statistics
Figure 4 shows a weekly chart of the 10 most common topics on the site. In general, there is no clear variation between these topics, with the exception of the initial popularity of topic #2, due to the majority of posts that week being from WarOnFakes. The significant dip in January 2023 is due to the Russian public holidays discussed in section 3.2.
Table 5 shows the proportion of backdated articles per language, and the mean and maximum backdating period for each. For English, a small number of posts are backdated by a short period of time. It is likely this is caused by posts that have been forward-dated (i.e. set to be published in the future) by one or two days, resulting in subsequent posts appearing to be backdated until the publication date catches up. However, for other languages, backdates are for a much longer period.

C.1 Complete LIWC2015 Data
The complete listing of LIWC2015 metrics is included in Table 8, in the hope it can be used for comparison in future work.

Figure 2: Number of unique topics assigned per article.

Table 3: Top 5 correlated LIWC values. Bold values above strength threshold.

Table 4: Frequency of Cyrillic usage reasons.

Table 5: Backdating per language for both sites.