An Integrated Approach for Political Bias Prediction and Explanation Based on Discursive Structure

One crucial aspect of democracy is fair information sharing. While it is hard to prevent biases in news, they should be identified for better transparency. We propose an approach to automatically characterize biases that takes into account structural differences and that is efficient for long texts. This yields new ways to provide explanations for a textual classifier, going beyond mere lexical cues. We show that: (i) the use of discourse-based, structure-aware document representations compares well to local, computationally heavy, or domain-specific models on classification tasks that deal with textual bias; (ii) our approach, based on different levels of granularity, allows for the generation of better explanations of model decisions, both at the lexical and structural level, while addressing the challenge posed by long texts.


Introduction
In an expanding information-based society, where public opinion is influenced by a plurality of sources and discourses, there is growing concern about fair information sharing. Biased speech and slanted presentations of events are inevitable, whether intentional or not, but must be transparent to ensure a more democratic public space. This has motivated substantial work on text classification to identify political orientation, determine what stances are supported by a text, or characterize misleading or fake information (Hamborg et al., 2019). It is also important that such methods can provide justifications for their decisions, both to understand what linguistic expressions are characteristic of certain positions, and to provide some transparency in the analysis itself. Explainability of supervised models is now a large subfield addressing this concern, with methods providing justifications, mostly in the form of relevant tokens in the case of textual tasks, e.g. (Kusner et al., 2015).
In this work, we contribute to both these lines of research by proposing an integrated approach for predicting and explaining political biases, where the structure of the document can inform the proposed bias characterization, as opposed to current approaches relying only on lexical, local cues. Indeed, by focusing on local formulation, existing research (Da San Martino et al., 2020; Field et al., 2018) ignores that political expression also relies on argumentation, i.e. the way information is presented. Example 1 is segmented into Elementary Discourse Units (EDUs), the minimal spans of text to be linked by discourse relations as described e.g. in the Rhetorical Structure Theory (Mann and Thompson, 1988). The discourse structure built upon these segments represents how information is conveyed in a right-leaning text about climate: it can inform on how the information is presented (why the climate is not a problem, what opposing argument the writer wants to highlight), and also help detect the most important spans of text.

Example 1. [There's nothing abnormal about the weather this January,] 1 [it's just part of the Earth's natural climate patterns.] 2 [The mainstream media is just pushing the idea of climate change] 3 [to push their own agenda.] 4

To the best of our knowledge, we are the first to investigate discourse-based information for bias characterization, and we do so through: (i) a segmentation of the texts based on discourse units rather than sentences, (ii) experiments on discourse connectives that can be seen as shallow markers of the structure, (iii) and crucially, a model based on latent structures, as a proxy for discourse structures, that can help the prediction and provide a different sort of input for explainability methods.
Furthermore, while recent progress on text classification has been largely due to the widespread use of pretrained language models fine-tuned on specific tasks, they remain limited in terms of input size (i.e. 512 sub-tokens in general) and cannot easily deal with phenomena that relate elements far apart. Long texts are also problematic for many explanation methods. Our proposed approach addresses this limitation on both sides. The code is available at: https://github.com/neops9/news_political_bias.

Our work makes the following contributions:
- we propose a model to predict the political bias of news articles, with unrestricted input length, using latent structured representations over EDUs;
- we propose improvements to perturbation-based explanation methods, using different levels of granularity (i.e. words, sentences, EDUs, or structures);
- we evaluate experimentally our propositions for both the prediction and the explanation of bias.

Related work
The prediction of political orientation in texts has long been of interest in political science (Scheufele and Tewksbury, 2007), and has generated growing interest in NLP, either for classification at the document level, e.g. detecting extreme standpoints (Kiesel et al., 2019) or more general left/center/right orientation in news (Kulkarni et al., 2018; Baly et al., 2020; Li and Goldwasser, 2021), but also at a finer-grain local level, locating specific framing (Card et al., 2015; Field et al., 2018), or various linguistic devices such as "propaganda techniques", as in the SemEval 2020 task (Da San Martino et al., 2020). For a more general view, see the survey in (Hamborg et al., 2019). Recently, Liu et al. (2022) have developed a language model over RoBERTa (Liu et al., 2019b), fine-tuned on a large corpus of news to address both stance and ideology prediction by incorporating new "ideology-driven" pre-training objectives, with very good results. In contrast, we develop a generic approach that could be applied as is to new classification tasks.
Aside from approaches whose objective is solely the prediction of an orientation, some studies aim at characterizing bias, and rely on lexical statistics or surface cues (Gentzkow et al., 2019; Potthast et al., 2018). In contrast, we want to investigate other factors as well, at a more structural level, mainly document-level organization, a.k.a. discourse structure. Automated discourse analysis is the subject of a rich body of work, but current parsers still have rather low performance and weak generalization. This is why we took inspiration from Liu and Lapata (2018), who use structural dependencies over sentences, induced while encoding the document, to feed downstream supervised models. Their results indicate that the learned representations achieve competitive performance on a range of tasks while arguably being meaningful. This approach is effective for summarization, with the learned structures, while less complex than full rhetorical relations, capturing consistent information (Liu et al., 2019a; Isonuma et al., 2019; Balachandran et al., 2021). Similar results were found for fake news classification (Karimi and Tang, 2019). Our model relies on these approaches, but adds a finer-grain level of analysis relying on Elementary Discourse Units.
The last aspect of our approach is the use of explainability methods to characterize bias. We propose an integrated approach where a classification model is used with methods to explain its decisions, thus providing cues about the way bias is present and detected in texts. Numerous explainability methods have been proposed in recent years, most of which are amenable to being used on text classification tasks. Almost all of them are local, i.e. they provide information about the role of separate parts of the input for a given instance only, e.g. the input tokens most relevant to a model's prediction for textual tasks. These methods can be either black-box methods, operating only on the predictions of the models (Castro et al., 2009; Ribeiro et al., 2016), or can observe the impact of the input on some of their internal parameters (Simonyan et al., 2014; Sundararajan et al., 2017). We extend the use of such methods to take into account structural elements. Although some studies have recently investigated how structural/discourse information is encoded in pretrained language models (Wu et al., 2020; Huber and Carenini, 2022), to the best of our knowledge, we are the first to explore textual explainability methods not relying only on surface form information. This is crucial for long texts, as methods such as LIME (Ribeiro et al., 2016) that rely on sampling word perturbations can become expensive for high token counts.

Integrated bias detection and characterization
Our approach is based on a model that predicts a bias while inducing a structure over documents, and on explanation methods that can take as input simply the tokens, the EDUs, or the sentences, or that can be based on the induced structures; see Figure 1. In this section, we describe our model for predicting bias, on which we rely to produce structure-based explanations.

Base Bias Prediction model
In Liu and Lapata (2018), sentences are composed of sequences of static word embeddings that are fed to a bi-LSTM to obtain hidden representations used to compute the sentence representations, which are then passed through another bi-LSTM to compute the document representation. At both levels, representations are built using the structured attention mechanism, allowing for learning sentence dependencies, constrained to form a non-projective dependency tree. Finally, a 2-layer perceptron predicts the distribution over class labels. Note that LSTMs do not have limitations on the input size. We modify the model to include the improvements proposed by Ferracane et al. (2019). In particular: (i) we remove the document-level bi-LSTM, (ii) for the pooling operation, we aggregate over units using a weighted sum based on root scores, instead of max pooling, (iii) we perform several additional levels of percolation to embed information from the children's children in the tree, and not only direct children. On top of that, we skip the sentence-level structured attention, as it adds an unnecessary level of composition that was found to have a negative empirical impact on the results.
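The root-score-based pooling in (ii) can be sketched as follows. This is a minimal illustration only: the function names are ours, and the actual model operates on bi-LSTM hidden states rather than plain lists.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def root_weighted_pooling(unit_vectors, root_scores):
    """Document representation as a weighted sum of unit (EDU) vectors,
    with weights given by normalized root scores from the structured
    attention, replacing max pooling."""
    weights = softmax(root_scores)
    dim = len(unit_vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, unit_vectors))
            for d in range(dim)]
```

Units whose root score is high (i.e. that the latent tree considers closer to the root) dominate the document vector, instead of a single max-pooled unit.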

Improvements
We make two additional important modifications to the classification model, one generic (replacing the base unit of the latent structure), the other specific to the task considered.
Segmentation The learning of a latent structure is supposed to leverage argumentative processes that can reflect the author's political orientation. We thus changed the base textual units from sentences to more discourse-oriented ones, as given by a discourse segmenter. Discourse segmentation is the first stage of discourse parsing, identifying text spans called Elementary Discourse Units that will be linked by discourse relations. We chose to use an existing segmenter (Kamaladdini Ezzabady et al., 2021) 1 as it showed good performance on the latest segmentation shared task (Zeldes et al., 2021), while being the only one from that campaign not needing features other than tokens.
Adversarial Adaptation The media source of an article can be easily determined using specific lexical cues, such as the media name. Since most articles from a media source share the same political label, a model could exploit these features, which would not generalize to other news sources. It is difficult to remove these cues via preprocessing, as they can be varied and source-specific. Baly et al. (2020) suggest two approaches, adversarial adaptation (AA) (Ganin et al., 2016) and triplet loss pre-training (Schroff et al., 2015), and chose the latter based on preliminary results, while we found AA more promising. AA involves incorporating a media classifier in the model's architecture and maximizing its loss using a gradient reversal layer, resulting in a model that is discriminative for the main task yet independent of the media source.
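The effect of the gradient reversal layer can be sketched numerically as follows (a toy illustration with hand-written gradients; the actual training uses an autograd framework, and all names here are ours):

```python
def grl_backward(grad_from_media_head, lam):
    """Gradient reversal layer: identity in the forward pass, but the
    gradient flowing from the media classifier back into the shared
    encoder is negated and scaled by lambda."""
    return [-lam * g for g in grad_from_media_head]

def encoder_gradient(grad_bias_head, grad_media_head, lam):
    """Total gradient reaching the shared encoder: the bias-classifier
    gradient is kept as-is (its loss is minimized), while the
    media-classifier gradient is reversed (its loss is maximized)."""
    reversed_media = grl_backward(grad_media_head, lam)
    return [gb + gm for gb, gm in zip(grad_bias_head, reversed_media)]
```

With lambda = 0 the media head has no influence on the encoder; increasing lambda pushes the encoder toward representations from which the media source cannot be recovered.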

Lexical and Structural Perturbation-Based Explanations

Among the numerous existing methods for interpreting a model's decision, we chose to focus on so-called black-box approaches, relying only on a model's output predictions, and not on its internal representations, for more generality. However, the most popular black-box approaches, LIME (Ribeiro et al., 2016), Anchor (Ribeiro et al., 2018), and SHAP (Lundberg and Lee, 2017), rely on lexical features when applied to textual tasks, looking for relevant subsets of features or using perturbations that remove/switch words in the input, which makes them computationally expensive for high token counts, or forces approximation via sampling, which still has to be representative enough to be useful. Of these methods we chose to consider only LIME, which is intrinsically based on sampling and has been shown by Atanasova et al. (2020) to have the best or near-best performance on their metrics, and thus presents a good compromise. LIME works by learning a simple model around an instance, which approximates the prediction of the model in the "neighborhood" of the instance. The neighborhood of an instance is sampled by slightly perturbing the input with respect to some features (words, in the case of textual models), yielding a set of perturbed instances. A simple linear model is then fitted on these instances to match the model predictions, with a weight given to each instance according to its distance from the original instance. The parameters of the simple model then yield importance scores for the input features, and the best ones are chosen as an "explanation" of the decision on the original instance.
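The procedure can be sketched as follows. For brevity, this sketch replaces the full weighted linear regression with a weighted mean-difference surrogate (kept vs. removed score per token); all names and the kernel width are illustrative, not those of the LIME package.

```python
import math
import random

def lime_word_importances(tokens, predict, n_samples=500,
                          kernel_width=0.75, seed=0):
    """LIME-style word importances: sample binary masks over the tokens,
    weight each perturbed sample by its proximity to the original text,
    then score each token by the weighted mean prediction when it is
    kept minus when it is removed."""
    rng = random.Random(seed)
    n = len(tokens)
    kept_sum, kept_w = [0.0] * n, [1e-9] * n
    drop_sum, drop_w = [0.0] * n, [1e-9] * n
    for _ in range(n_samples):
        mask = [rng.random() < 0.5 for _ in range(n)]     # perturbation
        perturbed = [t for t, m in zip(tokens, mask) if m]
        similarity = sum(mask) / n                        # fraction kept
        weight = math.exp(-((1.0 - similarity) ** 2) / kernel_width ** 2)
        score = predict(perturbed)  # model score for the class of interest
        for j, m in enumerate(mask):
            if m:
                kept_sum[j] += weight * score
                kept_w[j] += weight
            else:
                drop_sum[j] += weight * score
                drop_w[j] += weight
    return [ks / kw - ds / dw
            for ks, kw, ds, dw in zip(kept_sum, kept_w, drop_sum, drop_w)]
```

With a toy classifier that fires on a single word, that word receives by far the highest importance score.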
Despite its usefulness, LIME has some known limitations regarding the cost of the sampling process (Molnar, 2022, section 9.2.5) or the robustness of the explanations (Alvarez-Melis and Jaakkola, 2018). The main issue is that the quality of the explanations highly depends on the number of generated perturbed samples, which must be representative of the model's behavior to avoid spurious or non-robust explanations. For texts, where features are words, this can mean a high computational cost, especially for long documents, since the number of possible perturbations of a text grows exponentially with its size. We thus propose four strategies to reduce this cost while still producing relevant explanations, by focusing on different levels of granularity.
Token-level explanations The first level still operates at the token level, removing tokens randomly, but focusing on specific words. We consider three subcases: (1) ignoring functional words, which are less likely to be relevant to a classification decision while being very frequent; or (2) sampling only with respect to some specific classes of tokens: (2a) named entities extracted with spaCy,2 and (2b) discourse connectives (Webber et al., 2019), using the extended list of markers3 proposed by Sileo et al. (2019), which could act as shallow indicators of argumentative structures.
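Restricting perturbations to a token class can be sketched as follows, with a toy stoplist standing in for a real functional-word list (the same scheme applies to named entities or connectives by changing the candidate filter; all names are illustrative):

```python
import random

# Illustrative stoplist standing in for a real functional-word list.
FUNCTION_WORDS = {"the", "a", "an", "is", "of", "to", "and"}

def restricted_perturbations(tokens, n_samples=100, seed=0):
    """Sample perturbations that may only drop tokens OUTSIDE the
    stoplist: function words are always kept, shrinking the perturbation
    space from 2^n to 2^|candidates|."""
    rng = random.Random(seed)
    candidates = [i for i, t in enumerate(tokens)
                  if t.lower() not in FUNCTION_WORDS]
    samples = []
    for _ in range(n_samples):
        dropped = {i for i in candidates if rng.random() < 0.5}
        samples.append([t for i, t in enumerate(tokens) if i not in dropped])
    return samples
```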
EDU/Sentence-level The second level moves away from word-based explanations to focus on a higher granularity: either sentences, preprocessed using Stanza (Qi et al., 2020), or EDUs to take into account the general organization of the document.
EDUs are supposed to be the atomic level of structure analysis, and are thus more coherent in terms of size and content than full sentences. The process for generating explanations is then very similar to the word-based one: instead of perturbing a document by removing a random set of words, we remove a random set of EDUs. An EDU-based explanation then consists of a subset of the most impactful EDUs for the model. This also drastically reduces the perturbation space, making it more feasible and reliable to sample.

Two-level explanations Since using a higher level of granularity may provide less detailed explanations, we propose to combine the previous level of analysis, EDU-based, with the classical word-based approach, restricted to the selected EDUs. In practice, we define a hyperparameter k, apply the first stage of explanation, and then generate word-level perturbations only for words present in the k most impactful EDUs of the explanation.
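The two-stage scheme can be sketched as follows, with the two scoring callables standing in for the EDU-level and word-level LIME runs (names and signatures are illustrative):

```python
def two_level_explanation(edus, edu_importance, word_importance, k=10):
    """First stage: rank whole EDUs by an importance score.
    Second stage: compute word-level scores only inside the k most
    impactful EDUs, so word perturbations never touch the rest of
    the document."""
    ranked = sorted(range(len(edus)), key=edu_importance, reverse=True)
    top_k = ranked[:k]
    return {i: word_importance(edus[i]) for i in top_k}
```

The word-level perturbation space thus depends on the size of k EDUs rather than on the length of the whole document.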
Structure-Level Explanations Finally, we propose to generate explanations directly at the level of the structure learned by the model, still using the LIME method. Here, we perturb the entire structure extracted via the latent model for a given example (see Section 3.1). We chose to rely on perturbations that remove a subset of head-dependent relations in the original tree, i.e. pairs of segments. An explanation of the structure is then the subset of the most impactful relations in the tree.
By combining all levels of explanation presented, we can generate an enhanced explanation covering multiple aspects of the data (see Figure 2).

Explanation evaluation metrics
Evaluating the explanations is an important challenge, and common practices mostly depend on costly human judgments. Here we rely on the diagnostic properties proposed by Atanasova et al. (2020) in the context of text classification. We discarded two measures that cannot be computed: the agreement with human rationales, since we do not have access to human annotations for the explanation of political datasets, and the rationale consistency, since it is meant to compare an explanation method across different models. We consider that a document is composed of a set of features, and that our explanation method generates a saliency score for each of them.

Confidence Indication (CI) When generating an explanation, the feature scores for each possible class can be computed. It is then expected that the feature scores for the predicted class will be significantly higher than those of the other classes. If not, this should indicate that the model is not highly confident in its prediction, and the probability of the predicted class should be low. We can then measure a confidence indication score as the predictive power of the explanation for the confidence of the model. Predicted confidence is computed from the distance between the saliency scores of the different classes and then compared to the actual confidence using the Mean Absolute Error (MAE).
Faithfulness Faithfulness indicates whether the features selected in an explanation were actually useful for the model to make a prediction. It is measured by the drop in the model's performance when a percentage of the most salient features in the explanation is masked. Starting from 0%, then 10%, up to 100%, we obtain the performance of the model for different thresholds. From these scores, faithfulness is measured by computing the area under the threshold-performance curve (AUC-TP).
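The AUC-TP computation can be sketched as follows, using the trapezoidal rule over the 11 masking thresholds (the function name and input format are ours):

```python
def faithfulness_auc(accuracy_at_threshold):
    """accuracy_at_threshold: model accuracy after masking the top p% most
    salient features, for p = 0, 10, ..., 100 (11 values). Returns the
    area under the threshold-performance curve via the trapezoidal rule:
    a sharper accuracy drop (more faithful explanation) gives a lower AUC."""
    thresholds = [p / 100 for p in range(0, 101, 10)]
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(
            zip(thresholds, accuracy_at_threshold),
            zip(thresholds[1:], accuracy_at_threshold[1:])):
        auc += (x1 - x0) * (y0 + y1) / 2
    return auc
```

An explanation that has no effect on accuracy (flat curve) yields the maximal area, while a steep early drop yields a small one.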
Dataset Consistency (DC) DC measures whether an explanation is consistent across instances of a dataset. Two instances similar in their features should receive similar explanations. Similarity between instances is obtained by comparing their activation maps, and similarity between explanations is the difference between their saliency scores. The consistency score is then the Spearman correlation ρ between the two similarity scores. The overall dataset consistency is the average obtained over all sampled instance pairs.
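The Spearman correlation underlying DC can be sketched in a few lines (a minimal version without tie handling; names are ours):

```python
def rank(values):
    """Rank positions of each value (0 = smallest); ties are broken
    arbitrarily, which is fine for this illustration."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = float(r)
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho = Pearson correlation of the rank sequences."""
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

In DC, `xs` would hold instance-pair similarities and `ys` the corresponding explanation similarities; rho close to 1 means similar instances get similar explanations.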

Datasets
Allsides Articles are collected from the Allsides website, a platform that offers an analysis of the political leanings of various English-language media at the article level. An article is labeled by the political positioning of its media.

Hyperpartisan (HP)
A binary classification task (Kiesel et al., 2019) of predicting whether a given news article is hyperpartisan or not (i.e. takes an extreme left-wing or right-wing standpoint), task 4 of SemEval-2019. We considered the dataset containing 1,273 manually annotated articles.
C-POLITICS We built on the large-scale news article dataset POLITICS7 (Liu et al., 2022). It comes with an aligned version containing 1,060,512 clusters of articles aligned on the same story from 11 media. We propose a reduced version of this dataset meeting three desirable constraints: class balance, temporal framing, and independence from media sources. We kept only articles published between 2020 and 2021 (annotation stability), excluding the possibility of a media source appearing in several splits (train, validation, test) and requiring at least one article of each label per cluster (homogeneity). We evaluate on the 3-way classification task of predicting the political leaning (left, center, right). We ended up with a dataset containing 37,365 articles in 12,455 clusters. An article is labeled by the political positioning of its media. This will be made available upon acceptance.

Experimental Settings
Baselines For Allsides and Hyperpartisan, we compare to the results obtained by the authors of the datasets, and to the winners of the task (HP). We also compare to three additional transformer-based baselines on the three tasks, for which we fine-tuned a classification model (on a single run): (1) RoBERTa-base (Liu et al., 2019b); (2) Longformer-4096 (Beltagy et al., 2020), a language model designed to handle very long sequences of text, up to 4096 tokens; (3) POLITICS (Liu et al., 2022), a state-of-the-art language model built over RoBERTa-base for political ideology prediction, pretrained on more than 3.6M news articles (see above). RoBERTa and POLITICS are fine-tuned on the whole input using a sliding window of size 512 and an overlap of size 64; we built on Liu et al. (2022)'s implementation8. All baselines and proposed models have similar numbers of parameters (cf. the appendix). For the explanations, we compare to the original version of LIME for text classification, which is based on word perturbations, and to a random explanation over the whole input.
Settings For the classification model, we built on Ferracane et al. (2019)'s implementation,9 itself based on Liu and Lapata (2018)'s. We adapted the code according to the modifications and additions proposed in our approach, as detailed in Section 3.1. Hyperparameters were set using grid search and are the same for all tasks (Table 8 in Appendix B). We used pretrained 300D GloVe vectors (Pennington et al., 2014). For the AA training, since the training set may contain many media sources with a long-tail distribution, we only consider the 10 most frequent sources. Hyperparameters for the fine-tuning of RoBERTa, POLITICS and Longformer are given in Appendix B. Two-level explanations are generated using the 10 most impactful EDUs.
Evaluation We evaluate two versions of the classification model: segmentation into sentences, or into EDUs (on a single run). We report accuracy, as it is the standard measure in previous work on these tasks. We built on the LIME python package10 to implement our methods (Section 4). We generate and evaluate explanations on 100 documents from the test set, for 1,000 and 10,000 perturbed samples, and compute a score for each feature. Explanations are generated for our trained classification model with EDU segmentation (Section 3.1).
The confidence interval for the evaluation of the explanations is only given for the baseline (LIME Words), over 10 generations. Since each of the proposed improvements reduces the perturbation space relative to the baseline, and the size of this space is the main factor driving the variance, we consider that their confidence intervals will be at worst equal to the baseline's; to avoid a disproportionate computational cost, we therefore do not report them for all experiments.

Results
Results obtained for the different classification tasks are given in Table 2. As expected, fine-tuning the pre-trained and specialized model POLITICS obtains the best results on all tasks, followed closely by Longformer with an average of −3.45 points, which shows the interest of keeping the whole document as input.
Regarding our structured approaches, we can note that despite lower scores compared to POLITICS and Longformer, the EDU-based version performs better than RoBERTa on the corpora with the longest texts (i.e. Allsides +1.76 points, C-POLITICS +4.37 points). The segmentation into EDUs significantly improves the results on all tasks compared to the segmentation into sentences (+4.59 points on average), showing the importance of the fine-grain discourse approach. Putting these results in perspective, our approach is more generic than POLITICS, as it does not require heavy and domain-specific pre-training, and much lighter than Longformer (w.r.t. computational cost).
Table 3 presents the evaluation metrics for each of the proposed LIME alternatives. We observe that in general, except for discourse markers and named entities, the two-level explanation performs better, obtaining strong evaluation scores for all the proposed metrics. The use of a higher level of granularity (sentences, EDUs) improves the quality of the explanations compared to the baseline; note that between EDUs and sentences, the finer segmentation into EDUs is the most accurate, showing the effectiveness of discourse-based approaches. The higher CI score for EDUs shows that it is the appropriate level of granularity with respect to the impact of their content on the model decision; it is also the level of segmentation on which the model has been trained. Similarly, reducing the perturbation space by targeting classes of words generates better quality explanations, in particular for named entities, which are particularly informative for the model, as already shown in the literature (Li and Goldwasser, 2021). Regarding the explanation of the structure, although the scores obtained are in the low range, we can state that they represent relevant information for the decision of the model as compared to the baselines. In general, the two-level explanation seems to be the best compromise between explanation quality, computational cost, and level of detail, while the LIME baseline (words) suffers from a high perturbation space.
As we are reducing the sampling space in our approaches, we also compared the number of samples used to generate the explanations for these metrics, between 1,000 and 10,000 samples. We notice that the scores obtained by most of our approaches with 1,000 samples remain better than those of the baseline with 10,000 samples. This shows that it is possible to generate good explanations, often of better quality, with a number of samples 10 times smaller, which is a major improvement in terms of computational cost.

Analysis of explanations
By looking at the explanations generated for the different levels of granularity and properties targeted, we can gain some insights about the model's decisions. An important property that an explanation must fulfill is that it be comprehensible to a human, in order to characterize biases. We propose a qualitative analysis of the explanations and a comparison of the various approaches, both at the lexical and structural level.
Table 4 shows the most recurrent and impactful words in the explanations, as given by the aggregated saliency scores of the 100 generated explanations, for each class of the Allsides task, depending on the method of explanation. Similar results are reported for Hyperpartisan and C-POLITICS in Tables 11 and 12 of Appendix C. Overall, the words that emerge seem consistent with the classes, and it is relatively straightforward to understand the possible biases that characterize them. Regarding the differences between word-based explanation approaches, we observe that two-level explanations yield more relevant information and specific lexical cues (e.g. environmental, transgender, scientists, archbishops), which confirms the interest of a first pass through an adapted level of granularity in order to target the most interesting parts of the text. Explanations based on discourse markers or named entities show overlap with the other methods, indicating consistency between approaches. EDU-based explanations are more comprehensive and self-sufficient, while covering the information contained in word-based explanations. This seems to make them an appropriate compromise between human readability and computational cost. Furthermore, there does not seem to be any particular trend in the relative position of the most impactful EDUs in the text, which confirms the interest of keeping the entire document (Figures 6, 7 and 8 of Appendix C). By comparing the results between the different classes (left, center, right), and without entering into political considerations, we can establish a first diagnosis of the biases that characterize them. From the word-based explanations, we observe a shift in the lexical fields between classes (pacific, aids, percent - transgender, environmental, scientists - fired, surveillance, archbishops), which indicates a bias in the topics covered and in the way information is conveyed. Articles from the right class seem to favor negative-sounding terms, while the pitch used is more neutral for the center and left classes. We can also note the over-representation of public and political figures in the explanations, distinguished between classes by the political leaning and the social category of the people mentioned. In particular, we notice that articles from the right almost exclusively mention personalities from their side, with the specificity of recurrently referring to religious figures (e.g. John Sentamu, Jerry Falwell), while the profiles are more diversified for the left and center classes, which give a lot of attention to right-wing personalities. About discourse markers, three trends can be identified, one for each class. The left class seems to prefer markers of certainty or uncertainty (e.g. absolutely, maybe). The center class focuses on markers indicating time or frequency (e.g. then, already, frequently). Finally, the right class favors markers that indicate contrast or emphasis (e.g. though, however, obviously, naturally).
For the analysis of the structure and its explanation, we compare various statistics following Ferracane et al. (2019). The average height of the trees (6.36), the average proportion of leaf nodes (0.87) and the average normalized arc length (0.35) are equivalent between classes, although the right-wing class has slightly shallower trees. Regarding the explanations, the most impactful relations are mainly located in the first levels of the tree, close to the root, independently of the class. Although explanation by perturbing the tree relations is not the most intuitive at first sight, it allows for a new level of abstraction by providing an understanding of the model's decisions with respect to the induced structure, which, combined with other methods of analysis, can reveal additional biases.

Conclusion
We propose an integrated approach to both predict and analyze political bias in news articles, taking into account discourse elements. We show that structured attention over EDUs yields significant improvements at different levels over existing approaches, or comparable, if lower, results with respect to more data- or computation-hungry models. We propose new variants of perturbation-based explanation methods for dealing with long texts, both at the lexical and structural level, that would not be possible with the other models. We demonstrate the effectiveness of our system by evaluating it on a series of diagnostic properties, and propose a qualitative analysis and comparison of the various approaches for the characterization of political bias.

Limitations
We reused data collected by previous work in the literature. Collecting news articles is susceptible to various sampling biases, related to the sources collected, the topics covered, and the time span of the collection, which influences what appears in the articles. In addition, the labels given to articles are actually the political orientation of their source in the case of the Allsides and POLITICS datasets, which is obviously likely to induce errors. They rely on expertise provided respectively by the Allsides11 and Ad Fontes12 websites. The exact methods are undisclosed, but such labeling necessarily has a subjective aspect, oversimplifies predefined political categories, and can evolve over time. This affects classification reliability when applied to different sources, different times, or different topics. This is on top of any specific elements related to the language (English) and cultural background of the sources (predominantly U.S.-based). This study is not intended to provide an accurate tool for predicting the political orientation of a text, but to provide analyses of the linguistic expression of bias, as seen through a supervised model.

Ethical considerations
Studying the political orientation of various media is already the objective of several institutions (Allsides, Ad Fontes, Media Bias/Fact Check). It depends on many factors, and reliable automatic identification is still out of reach of current models, as can be seen from existing experimental results and some of the limitations underlined above. These models should thus not be used for anything other than research purposes or supporting human analysis. This is one of the reasons why we develop an explainable approach to bias prediction.

A Dataset Statistics
Statistics about the datasets are reported in Tables 5, 6 and 7.

B Settings
RoBERTa and POLITICS are initialized using the hyperparameters given in Table 9.

Figure 1 :
Figure 1: Overview of the approach: a supervised classification model relies on latent structures over textual units, and a module provides perturbation-based explanations, relying on various levels of analysis: words, sentences, EDUs, or latent trees.

Figure 2 :
Figure 2: Fabricated examples of generated explanations (blue), according to which part of the input is perturbed to generate the LIME approximation around an instance. Structure-based explanations need the structure produced by the model. Numbers in the structure refer to EDUs.
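The perturbation-based explanations illustrated in Figure 2 can be sketched with a leave-one-out variant over EDUs: drop one segment at a time and measure how much the classifier's score drops. This is a simplified stand-in for the LIME-style procedure described in the paper, which fits a local linear model over many random masks (assumption: `predict` is any black-box scoring function).

```python
def edu_saliency(edus, predict):
    """Leave-one-out perturbation saliency over EDUs.

    `edus` is the list of discourse segments for one document;
    `predict` maps a text string to the classifier's score for the
    predicted class. Returns one saliency score per EDU: the score
    drop observed when that EDU is removed.
    """
    base = predict(" ".join(edus))
    saliencies = []
    for i in range(len(edus)):
        # Rebuild the document without EDU i and re-score it.
        reduced = " ".join(e for j, e in enumerate(edus) if j != i)
        saliencies.append(base - predict(reduced))
    return saliencies
```

With a toy predictor that fires on the word "agenda", only the EDU containing it receives a nonzero saliency; the real setting would use the trained bias classifier instead.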
The distributions of the number of tokens per dataset (Figures 3, 4 and 5) show that Hyperpartisan has overall shorter news articles than Allsides and C-POLITICS.
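The per-dataset length distributions can be summarized with a few standard statistics once each article has been tokenized. A minimal sketch, assuming `token_counts` is the list of per-article token counts produced by a subword tokenizer such as BERT's (as in Figures 3, 4 and 5):

```python
import statistics

def length_summary(token_counts):
    """Summarize article lengths (in tokens) for one dataset split."""
    return {
        "mean": statistics.mean(token_counts),
        "median": statistics.median(token_counts),
        "max": max(token_counts),
    }
```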

Figure 3 :
Figure 3: Distribution of the number of (BERT) tokens per article for the Allsides dataset.

Figure 4 :
Figure 4: Distribution of the number of (BERT) tokens per article for the C-POLITICS dataset.

Figure 5 :
Figure 5: Distribution of the number of (BERT) tokens per article for the Hyperpartisan dataset.
* indicates results not reproduced, taken from the original papers. Note that POLITICS is based on RoBERTa and was already specifically fine-tuned on political texts before our own fine-tuning.

Table 4 :
Prototype explanations by class (Allsides), ordered from most to least impactful, as given by the highest saliency scores of the explanations.

Table 5 :
Statistics about the Allsides dataset.

Table 6 :
Statistics about the C-POLITICS dataset.

Table 9 reports the hyperparameters for RoBERTa and POLITICS; Table 10 is for Longformer. The classification model we propose (Structured Attention/EDU) contains about 120M parameters, RoBERTa and POLITICS contain about 125M parameters, and Longformer about 148M. Training is done on an Nvidia GeForce GTX 1080 Ti GPU card.

Table 11 :
Prototype explanations by class (Hyperpartisan), ordered from most to least impactful, as given by the highest saliency scores of the explanations.

Table 12 :
Prototype explanations by class (C-POLITICS), ordered from most to least impactful, as given by the highest saliency scores of the explanations.