Fine-grained Classification of Political Bias in German News: A Data Set and Initial Experiments

We present a data set consisting of German news articles labeled for political bias on a five-point scale in a semi-supervised way. While earlier work on hyperpartisan news detection uses binary classification (i.e., hyperpartisan or not) and English data, we argue for a more fine-grained classification, covering the full political spectrum (i.e., far-left, left, centre, right, far-right) and for extending research to German data. Understanding political bias helps in accurately detecting hate speech and online abuse. We experiment with different classification methods for political bias detection. Their comparatively low performance (a macro-F1 of 43 for our best setup, compared to a macro-F1 of 79 for the binary classification task) underlines the need for more (balanced) data annotated in a fine-grained way.


Introduction
The social web and social media networks have received an ever-increasing amount of attention since their emergence 15-20 years ago. Their popularity among billions of users has had a significant effect on the way people consume information in general, and news in particular (Newman et al., 2016). This development is accompanied by a number of challenges, which resulted in various NLP tasks that deal with information quality (Derczynski and Bontcheva, 2014; Dale, 2017; Saquete et al., 2020). Due to the data-driven nature of these tasks, they are often evaluated under the umbrella of (un)shared tasks, on topics such as rumour detection or verification (Derczynski et al., 2017; Gorrell et al., 2019), offensive language and hate speech detection (Zampieri et al., 2019; Basile et al., 2019; Struß et al., 2019; Waseem et al., 2017; Fišer et al., 2018; Roberts et al., 2019; Akiwowo et al., 2020) or fake news and fact-checking (Hanselowski et al., 2018; Thorne et al., 2019; Mihaylova et al., 2019).
Several shared tasks concentrate on stance (Mohammad et al., 2016) and hyper-partisan news detection (Kiesel et al., 2019), which predict either the stance of the author towards the topic of a news piece, or whether or not they exhibit allegiance to a particular party or cause. We argue that transparency and de-centralisation (i. e., moving away from a single, objective "truth" and a single institution, organisation or algorithm that decides on this) are essential in the analysis and dissemination of online information (Rehm, 2018). The prediction of political bias was recently examined by the 2019 Hyperpartisan News Detection task (Kiesel et al., 2019) with 42 teams submitting valid runs, resulting in over 30 publications. This task's test/evaluation data comprised English news articles and used labels obtained by Vincent and Mestre (2018), but their five-point scale was binarised so the challenge was to label articles as being either hyperpartisan or not hyperpartisan.
We follow Wich et al. (2020) in claiming that, in order to better understand online abuse and hate speech, biases in data sets and trained classifiers should be made transparent, as what can be considered hateful or abusive depends on many factors (relating to both sender and recipient), including race (Vidgen et al., 2020; Davidson et al., 2019), gender (Brooke, 2019; Clarke and Grieve, 2017), and political orientation (Vidgen and Derczynski, 2021; Jiang et al., 2020). This paper contributes to the detection of online abuse by attempting to uncover political bias in content.
We describe the creation of a new data set of German news articles labeled for political bias. For annotation, we adopt the semi-supervised strategy of Kiesel et al. (2019) who label (English) articles according to their publisher. In addition to opening up this line of research to a new language, we use a more fine-grained set of labels. We argue that, in addition to knowing whether content is hyperpartisan, the direction of bias (i. e., left-wing or right-wing) is important for end user transparency and overall credibility assessment. As our labels are not just about hyperpartisanism as a binary feature, we refer to this task as political bias classification. We apply and evaluate various classification models to the data set. We also provide suggestions for improving performance on this challenging task. The rest of this paper is structured as follows. Section 2 discusses related work on bias and hyperpartisanism. Section 3 describes the data set and provides basic statistics. Section 4 explains the methods we apply to the 2019 Hyperpartisan News Detection task data (for evaluation and benchmarking purposes) and to our own data set. Sections 5 and 6 evaluate and discuss the results. Section 7 sums up our main findings.

Data sets
For benchmarking purposes, we run our system on the data from Kiesel et al. (2019). They introduce a small number of articles (1,273) manually labeled by content, and a large number of articles (754,000) labeled by publisher via distant supervision, using labels from BuzzFeed news 2 and Media Bias Fact Check 3 . Due to the lack of article-level labels for German media, we adopt the strategy of labeling articles by publisher.
Several studies use the data from allsides.com 4 , which provides annotations on political ideology for individual articles in English. Using this data, Baly et al. (2020) introduce adversarial domain adaptation and triplet loss pre-training that prevents over-fitting to the style of a specific news medium, Kulkarni et al. (2018) demonstrate the importance of the article's title and link structure for bias prediction and Li and Goldwasser (2019) explore how social content can be used to improve bias prediction by leveraging Graph Convolutional Networks to encode a social network graph. Zhou et al. (2021) analysed several unreliable news data sets and showed that heterogeneity of the news sources is crucial for the prevention of source-related bias. We adopt their strategy of splitting the sources into two disjoint sets used for building train and test data sets respectively. Gangula et al. (2019) work on detecting bias in news articles in the Indian language Telugu. They annotate 1,329 articles concentrating on headlines, which they find to be indicative of political bias. In contrast to Kiesel et al. (2019), but similar to our approach, Gangula et al. (2019) treat bias detection as a multi-class classification problem. They use the five main political parties present in the Telugu-speaking region as their classification labels, but do not position these parties on the political spectrum.
Taking into account the political orientation of the author, SemEval 2016 Task 6 (Mohammad et al., 2016) worked on stance detection, where sub-task A comprised a set of tweets, the target entity or issue (e. g., "Hillary Clinton", or "Climate Change") and a label (one of favour, against, neither). The tweet-target-stance triples were split into training and test data. Sub-task B had a similar setup, but covered a target not included in the targets of task A, and presented the tweet-target-stance triples as test data only (i. e., without any training data for this target). While (political) stance of the author is at the core of this challenge, it differs from the problems we tackle in two important ways: 1) The task dealt with tweets, whereas we process news articles, which are considerably longer (on average 650 words per text for both corpora combined, see Section 3, compared to the 140-character limit 5 enforced by Twitter) and are written by professional authors and edited before being posted. And 2) unlike the shared task setup, we have no target entity or issue and aim to predict the political stance, bias or orientation (in the context of this paper, we consider these three words synonymous and use the phrase political bias throughout the rest of this paper) from the text, irrespective of a particular topic, entity or issue.
One of the key challenges acknowledged in the literature is cross-target or cross-topic performance of stance detection systems (Küçük and Can, 2020). When trained for a specific target or topic (Sobhani et al., 2017), these systems perform considerably worse on new targets. Vamvas and Sennrich (2020) address this issue by annotating and publishing a multilingual (standard Swiss German, French, Italian) stance detection corpus that covers a considerably higher number of targets (over 150, compared to six in Mohammad et al., 2016). Vamvas and Sennrich (2020) work with comments, which are longer than tweets (on average 26 words), but still shorter than our news articles. Similar to Mohammad et al. (2016) but unlike our approach, the data is annotated for stance toward a particular target.
Earlier work on political stance is represented by Thomas et al. (2006), who work on a corpus of US congressional debates, which is labeled for stance with regard to a particular issue (i. e., a proposed legislation) and which uses binary labels for supporting or opposing the proposed legislation. From this, political bias could potentially be deduced, if information on the party of the person that proposed the legislation is available. However, first, this correlation is not necessarily present, and second, it results in a binary (Republican vs. Democratic) labeling scheme, whereas we use a larger set of labels covering the political spectrum from left-wing to right-wing (see Section 3).
A comprehensive review of media bias in news articles, especially attempting to cover insights from social sciences (representing a more theoretical, rational approach) and computer science (representing a more practical, empiricist approach), is provided by Hamborg et al. (2018). The authors observe a lack of inter-disciplinary work, and although our work is mainly empirical, we agree that using a more diverse range of corpora and languages is one way to move away from "too simplistic (models)" (Hamborg et al., 2018, p. 410) that are currently in use. In this respect, we would like to stress that, unlike Kulkarni et al. (2018); Baly et al. (2020); Li and Goldwasser (2019), who all either work on or contribute data sets (or both) to political bias classification in English, we strongly believe that a sub-discipline dealing with bias detection benefits especially from a wide range of different data sets, ideally from as many different languages and cultural backgrounds as possible. We contribute to this cause by publishing and working with a German data set.

Models
With regard to the system architecture, Bießmann (2016) uses similar techniques to ours (bag-of-words and a Logistic Regression classifier, though we do not use these two in combination), but works on the domain of German parliament speeches, attempting to predict the speaker's affiliation based on their speech. Iyyer et al. (2014) use a bag-of-words and Logistic Regression system as well, but improve over this with a Recursive Neural Network setup, working on the Convote data set (Thomas et al., 2006) and the Ideological Book Corpus 6 . Hamborg et al. (2020) use BERT for sentiment analysis after first finding Named Entities, in order to find descriptions of entities that suggest either a left-wing or a right-wing bias (e. g., using either "freedom fighters" or "terrorists" to denote the same target entity or group). Salminen et al. (2020) work on hate speech classification. We adopt their idea of evaluating several methods (features and models, see Sections 4.1 and 4.2) on the same data and also adopt their strategy of integrating BERT representations with different classification algorithms.

Data Collection and Processing
We obtain our German data through two different crawling processes, described in Sections 3.1 and 3.2, which also explain how we assign labels that reflect the political bias of the crawled, German news articles. Since the 2019 shared task data which we use for benchmarking purposes is downloaded and used as is, we refer to Kiesel et al. (2019) for more information on this data set.

News-Streaming Data
This work on political bias classification is carried out in the context of a project on content curation (Rehm et al., 2020). 7 One of the project partners 8 provided us with access to a news streaming service that delivers a cleaned and augmented stream of content from a wide range of media outlets, containing the text of the web page (without advertisements, HTML elements or other non-informative pieces of text) and various metadata, such as publisher, publication date, recognised named entities and sentiment value. We collected German news articles published between February 2020 and August 2020. Filtering these for publishers for which we have a label (Section 3.4) resulted in 28,954 articles from 35 publishers. The average length of an article is 741 words, compared to 618 words for the 2019 Hyperpartisan News Detection shared task data (for the by-publisher data set).

Crawled Data
To further augment the data set described in Section 3.1, we used the open-source news crawler news-please 9 . Given a root URL, the crawler extracts text from a website, together with metadata such as author name, title and publication date. We used the 40 German news outlets for which we have bias labels (Section 3.4) as root URLs to extract news articles. We applied regular expression patterns to skip sections of websites unlikely to contain indications of political bias 10 . This resulted in over 60,000 articles from 15 different publishers.

Data Cleaning
After collecting the data, we filtered and cleaned the two data sets. First, we removed duplicates in each collection. Because the two crawling methods start from different perspectives - with the first one collecting large volumes and filtering for particular publishers later, and the second one targeting these particular publishers right from the beginning - but overlap temporally, we also checked for duplicates across the two collections. While we found no exact duplicates (probably due to differences in the implementation of the crawlers), we compared articles with identical headlines and manually examined their texts to filter out irrelevant crawling output.
Second, we removed non-news articles (e. g., personal pages of authors, pages related to legal or contact information, or lists of headlines). This step was mostly based on article headlines and URLs. Because the vast majority of data collected was published after 2018, we filtered out all texts published earlier, fearing too severe data sparsity issues with the older articles. Due to the low number of articles, a model may associate particular events that happened before 2018 with a specific label only because this was the only available label for articles covering that specific event.
Finally, we inspected our collection trying to detect and delete pieces of texts that are not part of the articles (such as imprints, advertisements or subscription requests). This process was based on keyword search, after which particular articles or sections of articles were removed manually.
This procedure resulted in 26,235 articles from 34 publishers and 21,127 articles from 15 publishers 11 in our two collections respectively. We combined these collections, resulting in a set of 47,362 articles from 34 different publishers. For our experiments on this data, we created a 90-10 training-test data split. Because initial experiments showed that models quickly over-fit on publisher identity (through section names, stylistic features or other implicit identity-related information left after cleaning), we ensured that none of the publishers in the test set appear in the training data. Due to the low number of publishers for certain classes, this requirement could not be met in combination with 10-fold cross-validation, which is why we refrain from 10-fold cross-validation and use a single, static training and test data split (see Table 1).
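A publisher-disjoint split of this kind can be sketched with scikit-learn's GroupShuffleSplit, treating each publisher as a group so that no publisher's articles end up on both sides. This is an illustrative reconstruction under that assumption, not the authors' exact (single, static, manually verified) split.

```python
from sklearn.model_selection import GroupShuffleSplit

def publisher_disjoint_split(labels, publishers, test_size=0.1, seed=0):
    """Return (train_idx, test_idx) such that no publisher occurs in both.

    GroupShuffleSplit samples whole groups (publishers) into the test set,
    so the ~90-10 ratio holds over groups, not necessarily over articles.
    """
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size,
                                 random_state=seed)
    train_idx, test_idx = next(splitter.split(labels, labels,
                                              groups=publishers))
    return train_idx, test_idx

# Toy example with four hypothetical publishers, two articles each.
publishers = ["pub_a", "pub_a", "pub_b", "pub_b",
              "pub_c", "pub_c", "pub_d", "pub_d"]
labels = ["far-left", "far-left", "centre", "centre",
          "far-right", "far-right", "centre", "centre"]
train_idx, test_idx = publisher_disjoint_split(labels, publishers,
                                               test_size=0.25)
train_pubs = {publishers[i] for i in train_idx}
test_pubs = {publishers[i] for i in test_idx}
assert train_pubs.isdisjoint(test_pubs)
```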

Label Assignment
To assign political bias labels to our news articles, we follow the semi-supervised strategy of Kiesel et al. (2019), who use the identity of the publisher to label (the largest part of) their data set. The values for our labels are based on a survey carried out by Medienkompass.org, in which subjects were asked to rate 40 different German media outlets on a scale of partiality and quality. For partiality, a range from 1 to 7 was used with the following labels: 1 - left-wing extremism (fake news and conspiracy theories), 2 - left-wing mission (questionable journalistic values), 3 - tendentiously left, 4 - minimal partisan tendency, 5 - tendentiously right, 6 - right-wing mission (questionable journalistic values), 7 - right-wing extremism (fake news and conspiracy theories). For quality, a range from 1 to 5 was used: 1 - click bait, 2 - basic information, 3 - meets high standards, 4 - analytical, 5 - complex.
A total of 1,065 respondents positioned these 40 news outlets between (an averaged) 2.1 (indymedia) and 5.9 (Compact) for partiality, and between 1.3 (BILD) and 3.5 (Die Zeit, Deutschlandfunk) for quality. We used the result of this survey, available online 12 , to filter and annotate our news articles for political bias based on their publisher. In this paper we use the bias labels for classification and leave quality classification for further research.
Because 60-way classification for partiality (1 to 7 with decimals coming from averaging respondents' answers) results in very sparsely populated (or even empty) classes for many labels, and even rounding to the nearest natural number (i. e., 7-way classification) leads to some empty classes, we converted the 7-point scale to a 5-point scale, using the following boundaries: 1-2.5 - far-left, 2.5-3.5 - centre-left, 3.5-4.5 - centre, 4.5-5.5 - centre-right, 5.5-7 - far-right. We favoured this even partitioning of the survey scale over balancing class sizes (there are more far-right articles than far-left articles, for example). The distribution of our data over this 5-point scale is shown in Table 1.
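The conversion from averaged 7-point partiality scores to the five bias labels can be expressed as a simple binning function. This is an illustrative sketch; how scores falling exactly on a bin boundary were assigned is not specified in the paper, so the boundary handling below is an assumption.

```python
def partiality_to_label(score: float) -> str:
    """Map an averaged partiality score (1-7) to one of five bias labels.

    Bins follow the boundaries given in the text: 1-2.5 far-left,
    2.5-3.5 centre-left, 3.5-4.5 centre, 4.5-5.5 centre-right,
    5.5-7 far-right. Exact-boundary handling is an assumption.
    """
    if not 1.0 <= score <= 7.0:
        raise ValueError(f"score {score} outside the 1-7 survey scale")
    if score < 2.5:
        return "far-left"
    elif score < 3.5:
        return "centre-left"
    elif score < 4.5:
        return "centre"
    elif score < 5.5:
        return "centre-right"
    else:
        return "far-right"

# The two survey extremes mentioned in the text:
print(partiality_to_label(2.1))  # indymedia -> far-left
print(partiality_to_label(5.9))  # Compact -> far-right
```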

Topic Detection
To get an overview of the topics and domains covered in the data set, we applied a topic detection model, which was trained on a multilingual data set for stance detection (Vamvas and Sennrich, 2020) where, in addition to stance, items are classified as belonging to one of 12 different news topics. We trained a multinomial Naive Bayes model on the BOW representation of all German items (just under 50k in total) in this multilingual data set, achieving an accuracy of 79% and a macro-averaged F1-score of 78. We applied this model to our own data set. The results are shown in Table 2. Note that this is just to provide an impression of the distribution and variance of topics. Vamvas and Sennrich (2020) work on question-answer/comment pairs, and the extent to which a topic detection model trained on such answers or comments transfers to pure news articles is a question we leave for future work.
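A minimal sketch of such a BOW + multinomial Naive Bayes topic detector follows. The example texts and the two topic names are invented for illustration; the actual model was trained on the ~50k German items of the Vamvas and Sennrich (2020) data with 12 topic labels.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training items (text, topic) standing in for the real corpus.
train_texts = ["Impfpflicht für alle Erwachsenen einführen",
               "Steuern auf Kapitalerträge deutlich senken"]
train_topics = ["Healthcare", "Economy"]

# BOW features feeding a multinomial Naive Bayes classifier, mirroring the
# topic detector described above.
topic_clf = make_pipeline(CountVectorizer(), MultinomialNB())
topic_clf.fit(train_texts, train_topics)
print(topic_clf.predict(["Impfpflicht für Kinder"]))
```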
Since the majority of articles was published in 2020, a year massively impacted by the COVID-19 pandemic, we applied simple keyword-based heuristics to estimate the share of articles related to the pandemic. We publish the data set as a list of URLs and corresponding labels. Due to copyright issues, we are unable to make available the full texts.

12 https://medienkompass.org/deutsche-medienlandschaft/

Methodology
In this section, we describe the different (feature) representations of the data that we use to train the classification models, as well as our attempts to alleviate the class imbalance problem (Table 1).

Features
Bag-Of-Words Bag-of-Words (BOW) represents the text sequence as a vector of |V | features, with V being the vocabulary. Each feature value is the frequency, in the input text, of the word associated with that vector position. The vocabulary is based on the training data.
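A toy illustration of the BOW representation with scikit-learn's CountVectorizer (the German sentences are invented examples):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fit the vocabulary on the "training" texts and count word occurrences;
# each column of X corresponds to one vocabulary word.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(["die Partei kritisiert die Regierung",
                              "die Regierung antwortet"])
print(sorted(vectorizer.vocabulary_))
# ['antwortet', 'die', 'kritisiert', 'partei', 'regierung']
print(X.toarray())
# [[0 2 1 1 1]
#  [1 1 0 0 1]]
```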

TF-IDF Term-Frequency times Inverse-Document-Frequency (TF-IDF) differs from BOW in that it takes into account the frequency of terms in the entire corpus (the training data, in our case). In addition to its popularity in all kinds of IR and NLP tasks, TF-IDF has recently been used in hate speech detection tasks (Salminen et al., 2019).
BERT Since its introduction, BERT (Devlin et al., 2019) has been used in many NLP tasks. We use the German BERT base model from the Hugging Face Transformers library 13 . We adopt the fine-tuning strategy from Salminen et al. (2020): first, we fine-tune the BertForSequenceClassification model, consisting of BERT's model and a linear softmax activation layer. After training, we drop the softmax activation layer and use BERT's hidden state as the feature vector, which we then use as input for different classification algorithms.
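The feature-extraction step can be sketched as follows. A randomly initialised miniature BERT stands in for the fine-tuned German model (in practice the fine-tuned weights would be loaded), and taking the [CLS] vector as the feature is a common choice; the paper does not specify the exact pooling, so that detail is an assumption.

```python
import torch
from transformers import BertConfig, BertModel

# Tiny random BERT as a stand-in for the fine-tuned German BERT base model.
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64)
model = BertModel(config)
model.eval()

input_ids = torch.tensor([[2, 17, 45, 3]])  # toy token ids
with torch.no_grad():
    out = model(input_ids)

# Drop the classification head; the [CLS] hidden state becomes the feature
# vector fed to downstream classifiers (Logistic Regression, NB, ...).
cls_features = out.last_hidden_state[:, 0, :]
print(cls_features.shape)  # (1, hidden_size)
```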

Models
Logistic Regression We use logistic regression as our first and relatively straightforward method, motivated by its popularity for text classification. We add L2 regularization to the cross-entropy loss and optimize it using SAGA, an incremental gradient method in the Stochastic Average Gradient family (Defazio et al., 2014).
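In scikit-learn terms, this setup corresponds roughly to the following (the feature matrix and labels are toy values for illustration):

```python
from sklearn.linear_model import LogisticRegression

# L2-regularised logistic regression optimised with the SAGA solver.
X = [[0, 1, 2], [1, 0, 0], [2, 2, 0], [0, 0, 3]]
y = ["left", "right", "right", "left"]
clf = LogisticRegression(penalty="l2", solver="saga", max_iter=1000)
clf.fit(X, y)
print(clf.predict([[0, 1, 3]]))
```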
Naive Bayes Equally popular in text classification, Naive Bayes is based on the conditional independence assumption. We model BOW and TF-IDF features as random variables distributed according to the multinomial distribution with Lidstone smoothing. BERT features are modeled as Gaussian random variables.
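The two Naive Bayes variants can be sketched side by side: multinomial NB with Lidstone smoothing (i.e., an additive alpha below 1) for count-valued BOW/TF-IDF features, and Gaussian NB for the real-valued BERT features. All data below is invented for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Count-valued features (BOW/TF-IDF): multinomial NB; alpha=0.5 gives
# Lidstone rather than Laplace (alpha=1) smoothing.
X_counts = np.array([[3, 0, 1], [0, 2, 0], [4, 1, 0], [0, 3, 1]])
y = [0, 1, 0, 1]
mnb = MultinomialNB(alpha=0.5).fit(X_counts, y)

# Real-valued features (BERT hidden states): each feature modeled as a
# per-class Gaussian.
X_dense = np.array([[0.2, -1.3], [1.5, 0.4], [0.1, -0.9], [1.2, 0.6]])
gnb = GaussianNB().fit(X_dense, [0, 1, 0, 1])

print(mnb.predict([[5, 0, 1]]), gnb.predict([[1.4, 0.5]]))
```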
Random Forest Random Forest is an ensemble algorithm using decision tree models. The random selection of features and instances reduces the model's variance and the co-adaptation of the individual trees. To handle class imbalance, we use the Weighted Random Forest method (Chen and Breiman, 2004). This changes the weights assigned to each class when calculating the impurity score at the split point, penalises mis-classification of the minority classes and reduces the majority bias.
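One way to realise such class weighting is scikit-learn's class_weight="balanced" option, which scales each class inversely to its frequency when computing split impurity (a sketch with toy data, not the authors' exact configuration):

```python
from sklearn.ensemble import RandomForestClassifier

# Weighted Random Forest sketch: minority "far-right" examples receive
# larger weights in the impurity computation at each split.
X = [[0, 1], [1, 1], [1, 0], [0, 0], [2, 2], [2, 1]]
y = ["centre", "centre", "centre", "centre", "far-right", "far-right"]
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                             random_state=0)
clf.fit(X, y)
print(clf.predict([[2, 2]]))
```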
EasyEnsemble EasyEnsemble is another ensemble method targeting the class imbalance problem (Liu et al., 2009). It creates balanced training samples by taking all examples from the minority class and randomly selecting examples from the majority class, after which AdaBoost (Schapire, 1999) is applied to the re-sampled data.

Hyperpartisan News Detection Data
For benchmarking purposes, we first apply our models to the 2019 Hyperpartisan News Detection task. This data set uses binary labels as opposed to our 5-point scale. Since the 2019 shared task used TIRA, the organisers requested submission of functioning code and ran the evaluation on a dedicated machine to which the shared task participants did not have access. The test set used in the shared task was not published and, even after the submission deadline, has not been made publicly available. As a consequence, we use the validation set to produce our scores on the data. This renders a direct comparison impossible. To provide an estimate of our performance, we include Table 3, which lists the top 3 systems participating in the task. As illustrated by the row TF-IDF+Naive Bayes (our best-performing setup on this data set), we achieve a considerably lower accuracy score, but a comparable macro F1-score. The performance of the other setups is shown in Table 3. BERT+Logistic Regression scored just slightly worse than TF-IDF+Naive Bayes, with a precision score that is one point lower.

German Data Set
We apply the models to our own data. The results are shown in Table 5 for accuracy and in Table 6 for macro-averaged F1-score. The per-class performance is shown in Table 7, which, in addition, contains performance when binarising our labels (the last three rows) to compare this to the 2019 shared task data and to provide an idea of the difference in performance when using more fine-grained labels. We assume articles with the labels Far-left and Far-right to be hyperpartisan, and label all other articles as non-hyperpartisan. The accuracy for binary classification (not listed in Table 7) was 86%, compared to 43% (Naive Bayes+BOW in Table 5) for 5-class classification.
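The binarisation rule used for this comparison is trivially expressed in code (the label strings are ours; the paper's exact string values may differ):

```python
def binarise(label: str) -> str:
    """Collapse the five-point scale into the 2019 shared task's labels:
    the two extremes count as hyperpartisan, everything else does not."""
    hyperpartisan = {"far-left", "far-right"}
    return "hyperpartisan" if label in hyperpartisan else "non-hyperpartisan"

print([binarise(l) for l in
       ["far-left", "centre-left", "centre", "centre-right", "far-right"]])
```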
From the results, we can conclude the following. First, class imbalance poses a serious problem, though some setups suffer from this more than others. Logistic Regression, with all different feature sets, performed poorly on the Far-left articles. We assume this is due to the small number of Far-left articles (215 in the test set, 1,146 in the training set) and publishers (one in the test set, two in the training set). Despite the high degree of class imbalance, the EasyEnsemble method, designed specifically to target this problem, does not outperform the others with any of the feature sets. Second, BERT features scored surprisingly low with all classification models. Overall, we can conclude that the two best-performing setups, showing both high accuracy and F1-score, are BOW+Naive Bayes and TF-IDF+Random Forest. Table 7 includes the scores for TF-IDF+Random Forest, our best-performing setup.

Discussion
Table 6: Macro-averaged F1-measure for different features and classification methods

In many NLP tasks, the strategy of using BERT as a language model that is fine-tuned to a specific task has recently been shown to yield significant improvements over previously used methods and models, such as Naive Bayes and Random Forest. To determine why our BERT-based setups did not outperform the others, we investigated the impact of training data volume. We trained the BERT+Logistic Regression setup on only 10% of the original training data of the 2019 setup explained earlier and evaluated it on the same test setup (i. e., the validation set in the 2019 shared task). As illustrated by the last row in Table 4, the accuracy dropped by only 2% and the F1-score remained the same, suggesting that data volume has relatively little impact.
To further analyse our results, we examined the attention scores of the first BERT layer and selected the ten tokens BERT paid most attention to for every article. We then combined adjacent tokens and completed partial words (with their most likely candidate) to determine the key phrases of the text that the model used for classification. We repeated this procedure on all hyperpartisan articles (i. e., Far-left and Far-right) and derived a list of words and phrases that the model paid most attention to. The result is shown in Table 8. The question whether or not attention can be used for explaining a model's prediction is still under discussion (Jain and Wallace, 2019; Wiegreffe and Pinter, 2019). Note that with Table 8, we attempt to gain insight into how words are used to construct BERT embeddings, and not necessarily which words are used for prediction.
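The attention extraction can be sketched as follows. The paper does not specify how scores are aggregated across heads and query positions; averaging over both is one plausible choice and is an assumption here, as is the tiny randomly initialised BERT standing in for the fine-tuned German model.

```python
import torch
from transformers import BertConfig, BertModel

# Tiny random BERT as a stand-in for the fine-tuned model.
config = BertConfig(vocab_size=50, hidden_size=16, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=32)
model = BertModel(config)
model.eval()

input_ids = torch.tensor([[2, 5, 9, 14, 3]])  # toy token ids
with torch.no_grad():
    out = model(input_ids, output_attentions=True)

# First layer's attention: (batch, heads, query_pos, key_pos). Average
# over heads and query positions to score how much attention each token
# *receives*, then keep the top-k tokens (the paper uses k=10).
first_layer = out.attentions[0]
received = first_layer.mean(dim=1).mean(dim=1)[0]
top_tokens = torch.topk(received, k=3).indices
print(top_tokens.tolist())
```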
The lists of words show that the majority of words for the Far-left classification are neither exclusively nor mainly used by left-wing news media in general, e. g., wirkt (works), seither (since) or Geliebte (beloved, lover). An exception is antisemitische (anti-semitic), with anti-semitism in society being a common topic in left-wing media. Other highlighted words are likely to be related to the topic of refugee migration and its causes, such as Hungernden (hungry people) and Sahelzone (Sahel), an area known for its conflicts and current societal challenges. In contrast to the words we identified for the Far-left, we found most of the words we identified for the Far-right to be more descriptive of this side of the political spectrum. Nearly all words listed under Far-right in Table 8 are typically either used sarcastically or in a highly critical manner in right-wing media outlets. For example, Willkommenskultur (welcoming culture) is a German compound describing a welcoming and positive attitude towards immigrants, which is often mocked and criticised by the far right. Another example is Gutmensch (of which Gutmenschen is the plural), a term mainly used by the right as an ironic or contemptuous denigration of individuals or groups that strive to be 'politically correct'. Another word in the right column of Table 8 is Tichys, referring to the blog and print magazine Tichys Einblick. This news magazine calls itself a platform for authors of the liberal and conservative spectrum but is considered by some observers to be a highly controversial right-wing magazine with neo-liberal tendencies. 14 Since we made sure that the training data publishers and test data publishers are disjoint sets, this cannot be a case of publisher identity still being present in the text and the model over-fitting to it.
Upon closer investigation, we found 15 that indeed, many other publishers refer to Tichy's Einblick, and these were predominantly publishers with the Far-right label.
Generally, entries in Table 8 (for both the Far-left and Far-right columns) in italics are those we consider indicative of their particular position on the political spectrum. Some words on the right side are in themselves neutral but often used by rightwing media with a negative connotation, which is why we italicised them, too (e. g., Islam, Diversity).

Conclusion and Future Work
We present a collection of German news articles labeled for political bias in a semi-supervised way, by exploiting the results of a survey on the political affiliation of a list of prominent German news outlets. 16 This data set extends earlier work on political bias classification by including a more fine-grained set of labels, and by allowing for research on political bias in German articles. We propose various classification setups that we evaluate on existing data for benchmarking purposes, and then apply to our own data set. Our results show that political bias classification is very challenging, especially when assuming a non-binary set of labels. When using the more fine-grained label set, we demonstrate that performance drops by 36 points in macro-F1, from 79 in the binary case to 43 in the more fine-grained setup.
Political orientation plays a role in the detection of hate speech and online abuse (along with other dimensions, such as gender and race). By making available more data sets, in different languages, and using as many different publishers as possible (our results validate earlier findings that models quickly over-fit to particular publisher identity features), we contribute to uncovering and making transparent political bias of online content, which in turn contributes to the cause of detecting hate speech and abusive language (Bourgonje et al., 2018).
While labeling articles by publisher has the obvious advantage of producing a larger number of labeled instances more quickly, critical investigation and large-scale labeling of individual articles must be an important direction of future work.