Understanding Politics via Contextualized Discourse Processing

Politicians often have underlying agendas when reacting to events. The arguments a given entity makes in the context of various events reflect a fairly consistent set of agendas. Despite recent advances in Pretrained Language Models, such text representations are not designed to capture these nuanced patterns. In this paper, we propose a Compositional Reader model, consisting of encoder and composer modules, that captures and leverages this information to generate more effective representations for entities, issues, and events. These representations are contextualized by tweets, press releases, issues, news articles, and participating entities. Our model processes several documents at once and generates composed representations for multiple entities over several issues or events. Via qualitative and quantitative empirical analysis, we show that these representations are meaningful and effective.


Introduction
Over the last decade, political discourse has moved from traditional outlets to social media. This process, starting in the '08 U.S. presidential elections, has peaked in recent years, with former President Trump announcing the firing of top officials as well as policy decisions over Twitter. This presents a new challenge to the NLP community: how can this massive amount of political content be used to create principled representations of politicians, their stances on issues and their legislative preferences?
This is not an easy challenge, as perspective in political texts is often subtle rather than explicit (Fan et al., 2019). The choice to mention or omit certain entities or attributes can reveal the author's agenda. For example, tweeting "mass shootings are due to a huge mental health problem" in reaction to a mass shooting is likely to be indicative of opposing gun control measures, despite the lack of an explicit stance in the text.
Recent advances in Pretrained Language Models (PLMs) in NLP (Devlin et al., 2019; Yang et al., 2019; Liu et al., 2019) have greatly improved word representations via contextualized embeddings and powerful transformer units; however, such representations alone are not enough to capture nuanced biases in political discourse. Two of the key reasons are: (i) they do not directly focus on entity/issue-centric data and (ii) they only represent linguistic context rather than external political context.
Our main insight is that effectively detecting such bias from text requires modeling the broader political context of the document. This can include understanding relevant facts related to the event addressed in the text, the ideological leanings and perspectives expressed by the author in the past, and the sentiment/attitude of the author towards the entities referenced in the text. We suggest that this holistic view can be obtained by combining information from multiple sources, which can be of varying types, such as news articles, social media posts, quotes from press releases and historical beliefs expressed by politicians.
For example, consider the following tweet in the context of a school shooting: "We need to treat our teachers better! We should keep them safe." If the author of the tweet is Kamala Harris (known to be pro-gun control), this tweet is likely to be understood as "ban guns to avoid mass shootings in schools". However, if the same tweet is from Mike Pence, whose stance on guns is "firearms in the hands of law abiding citizens makes our communities safer", the tweet could mean "arming school teachers stops active shooters". This example demonstrates that, depending on the context, the same text can signal completely different real-world actions. Hence, we need to model the broader context of the text in order to understand its true meaning. A visualization projecting the tweet representation into 2D space is given in Fig. 1 and shows how contextualization from our model helps disambiguate this example. First, we show the BERT-base representation of the tweet (Tweet-BERT). We also show the BERT-base representations of the known stances of Pence and Harris on gun control ({Mike Pence, Kamala Harris} Stance-BERT). Finally, we apply our model, contextualizing the ambiguous tweet representation with speaker information ({Mike Pence, Kamala Harris} Tweet-Contextualized). The visualization captures how this representation disambiguates the different interpretations of the same text.

A computational setting for this approach, combining text and context analysis, requires two attributes: (i) an input representation that combines all the different types of information meaningfully and (ii) the ability to create a meaningful unified representation in one shot that captures the complementary strengths of the different inputs.
We address the first challenge by introducing a graph structure that ties together first-person informal discourse (tweets), first-person formal discourse (press releases and perspectives), third-person current discourse (news) and third-person consolidated discourse (Wikipedia). These documents are connected via their authors, the issues/events they discuss and the entities mentioned in them. As a clarifying example, consider the tweet by former President Trump: "The NRA is under siege by Cuomo". This tweet is represented in our graph by connecting the text node to the author node (Trump) and the referenced entity node (NY Gov. Cuomo). This setting is shown in Fig. 2.
We propose a novel neural architecture that unifies all the information in the graph in one shot. Our architecture generates a distributed representation for each item in the graph that is contextualized by the representations of the others. It can dynamically respond to queries, focusing the induced representation on a specific context. In our example, this results in a modified tweet representation that helps characterize Trump's opinion of Cuomo in the context of the guns issue. Our architecture consists of an Encoder, which combines all documents related to a given node to generate an initial node representation, and a Composer, a Graph Attention Network (GAT) that composes the graph structure to generate contextualized node embeddings.
We design two self-supervised learning tasks to train the model and capture structural dependencies over the rich discourse representation: predicting Authorship and Referenced Entity links over the graph structure. Intuitively, the model is required to understand subtle language usage. Authorship prediction requires the model to differentiate (i) the language of one author from another and (ii) the language of the author in the context of one issue vs. another. Referenced Entity prediction requires understanding the language used by a specific author when discussing a particular entity, given the author's past discourse.
We focus on a specific graph element, politicians, and evaluate their resulting discourse representations on several empirical tasks which capture their stances and preferences.

Related Work

Many political analysis tasks (Biessmann, 2016; Johnson and Goldwasser, 2016, 2018; Kornilova et al., 2018a; Chen et al., 2019) would benefit from more focused representations. Recently, several works have attempted to solve such tasks, e.g., analyzing relationships and their evolution (Iyyer et al., 2016; Han et al., 2019), analyzing political discourse in news and social media (Demszky et al., 2019; Roy and Goldwasser, 2020) and political ideology (Diermeier et al., 2012; Preoţiuc-Pietro et al., 2017; Kulkarni et al., 2018). Various political tasks such as roll call vote prediction (Clinton et al., 2003; Kornilova et al., 2018b; Patil et al., 2019; Spell et al., 2020a; Davoodi et al., 2020), entity stance detection (Mohammad et al., 2016; Fang et al., 2019) and hyper-partisan/fake news detection (Li and Goldwasser, 2019; Palić et al., 2019; Baly et al., 2020) require a rich understanding of the context around the entities present in the text. However, the representations used are usually limited in scope to specific tasks and not rich enough to capture information that is useful across several tasks. The Compositional Reader model, which builds upon Devlin et al. (2019) embeddings and consists of a transformer-based Graph Attention Network inspired by Veličković et al. (2017) and Müller et al. (2019), aims to address these limitations via a generic entity-issue-event-document graph used to learn highly effective representations.
Representing legislative preferences is typically done by modeling the ideal point of legislators in a Euclidean space from roll-call records (Poole et al., 1997). Recent approaches incorporate bill text into this representation (Gerrish and Blei, 2011; Nguyen et al., 2015; Kraft et al., 2016; Kornilova et al., 2018c). Most relevant to our work is Spell et al. (2020b), which uses social media information. We significantly extend these approaches by contextualizing the social media content using a novel architecture.

Event Identification
To identify news events, we use news article headlines. We find the mean (µ) and standard deviation (σ) of the number of articles published per day for each issue. If more than µ + σ articles are published on a single day for a given issue, we mark that day as the beginning of an event. We then skip 7 days before looking for a new event.
In our setting, events within an issue are non-overlapping. We divide events for each issue separately, so events for different issues may overlap. These events last 7 to 10 days on average; hence, the non-overlapping assumption within an issue is a reasonable relaxation of reality. To illustrate: coronavirus and civil-rights are separate issues and can therefore have overlapping events. An example event related to coronavirus could be "First case of COVID-19 outside of China"; an event about civil-rights could be "Officer part of George Floyd killing suspended". We inspected the events manually via random sampling. More example events are in the appendix.
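A minimal sketch of this heuristic, assuming `daily_counts` maps each issue to a date-sorted list of (date, article_count) pairs; the data layout and names are illustrative, not from the released code:

```python
import statistics
from datetime import timedelta

def identify_events(daily_counts, skip_days=7):
    """Return {issue: [event start dates]} using the mu + sigma rule."""
    events = {}
    for issue, series in daily_counts.items():
        counts = [n for _, n in series]
        threshold = statistics.mean(counts) + statistics.stdev(counts)
        starts, next_allowed = [], None
        for date, n in series:  # series is sorted by date
            if next_allowed is not None and date < next_allowed:
                continue  # still inside the current event window
            if n > threshold:
                starts.append(date)  # beginning of a new event
                next_allowed = date + timedelta(days=skip_days)
        events[issue] = starts
    return events
```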

Data Pre-processing
We use the Stanford CoreNLP toolkit (Manning et al., 2014), Wikifier (Brank et al., 2017) and the BERT-base-uncased implementation by Wolf et al. (2019) to preprocess data for our experiments. We tokenize the documents, apply coreference resolution and extract referenced entities from each document. The referenced entities are then wikified using the Wikifier tool. The documents are then categorized by issues and events. News articles from allsides.com and perspectives from ontheissues.org are already classified by issue. We use keyword-based querying to extract issue-wise press releases from the ProPublica API, and hashtag-based classification for tweets: a set of gold hashtags was created for each issue and the tweets were classified accordingly³. Finally, sentence-wise BERT-base embeddings of all documents are computed.

Query Mechanism
We implemented a query mechanism to obtain relevant subsets of data from the corpus. Each query is a triplet: a list of entities, a list of issues, and lists of event indices corresponding to each issue. Given a query triplet, we retrieve the news articles related to the queried events of each issue, the Wikipedia article of each entity, the background description of each issue, each entity's perspectives on each issue, and the tweets and press releases by each entity related to the queried events. The referenced entities of each sentence and the sentence-wise BERT embeddings of the documents are also retrieved.
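A schematic sketch of the retrieval step, assuming a `corpus` object with accessor methods; these accessors are hypothetical stand-ins for the actual storage layer, and the point is only the shape of the query triplet and of the retrieved bundle:

```python
def run_query(corpus, entities, issues, event_ids_per_issue):
    """Retrieve the document bundle for an (entities, issues, events) query."""
    out = {"news": [], "wiki": [], "background": [],
           "perspectives": [], "tweets": [], "press_releases": []}
    for issue, event_ids in zip(issues, event_ids_per_issue):
        out["background"].append(corpus.background(issue))
        for ev in event_ids:
            out["news"].extend(corpus.news(issue, ev))  # event-related news
    for entity in entities:
        out["wiki"].append(corpus.wiki(entity))
        for issue, event_ids in zip(issues, event_ids_per_issue):
            out["perspectives"].extend(corpus.perspectives(entity, issue))
            for ev in event_ids:
                out["tweets"].extend(corpus.tweets(entity, issue, ev))
                out["press_releases"].extend(
                    corpus.press_releases(entity, issue, ev))
    # Referenced entities and sentence-wise BERT embeddings are assumed
    # to be stored alongside each document.
    return out
```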

Compositional Reader
In this section, we describe the architecture of the proposed Compositional Reader model in detail. It contains 3 key components: the Graph Generator, the Encoder and the Composer. Given the output of the query mechanism from Sec. 3.3, the Graph Generator creates a directed graph with entities, issues, events and documents as nodes. The Encoder generates initial node embeddings for each node. The Composer, a transformer-based Graph Attention Network (GAT) followed by a pooling layer, generates the final node embeddings and a single summary embedding for the query graph. Each component is described below.
³ Data collection is detailed in the appendix.

Graph Generator
Given the output of the query mechanism for a query, the Graph Generator creates a directed graph with 5 types of nodes: authoring entities, referenced entities, issues, events and documents. Directed edges are used by the Composer to update source node representations using destination nodes. We design the topology with the main goal of obtaining representations of events, issues and referenced entities that reflect the author's opinions about them. We therefore add edges from issues/events to an author's documents but omit the reverse direction, as our main goal is to contextualize issues/events using the author's opinions.
We add bidirectional edges from authors to their Wikipedia articles, tweets, press releases and perspectives; from issues to their background descriptions and their events; and from events to the news articles describing them. We add unidirectional edges from events to tweets and press releases, from issues to author perspectives, and from referenced entities to the documents that mention them. An example graph is shown in Fig. 2; a condensed sketch of these edge rules is given below.
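The sketch assumes a small graph API (`add_edge`, typed nodes); all names are illustrative:

```python
def add_edges(g):
    # Authors <-> their first-person documents and Wikipedia article.
    for doc in g.documents:
        if doc.kind in {"wiki", "tweet", "press_release", "perspective"}:
            g.add_edge(doc.author, doc, bidirectional=True)
        for ent in doc.referenced_entities:
            g.add_edge(ent, doc)                 # one-way: entity -> doc
    for issue in g.issues:
        g.add_edge(issue, issue.background, bidirectional=True)
        for persp in issue.perspectives:
            g.add_edge(issue, persp)             # one-way: issue -> perspective
        for event in issue.events:
            g.add_edge(issue, event, bidirectional=True)
            for article in event.news_articles:
                g.add_edge(event, article, bidirectional=True)
            for doc in event.tweets + event.press_releases:
                g.add_edge(event, doc)           # one-way: event -> doc
    return g
```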

Encoder
The Encoder computes the initial node embeddings. It consists of BERT followed by a Bi-LSTM. For each node, it takes a temporally ordered sequence of documents as input and outputs a single embedding of dimension d_m. Given a node N = {D_1, D_2, ..., D_d} consisting of d documents, for each document D_i the contextualized embeddings of all tokens are computed using BERT. Token embeddings are computed sentence-wise to avoid truncating long documents. The token embeddings of each document are then mean-pooled to obtain the document embedding, the sequence of document embeddings is passed through the Bi-LSTM, and its outputs are pooled to produce the node embedding.
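A minimal PyTorch sketch of the Encoder, assuming the sentence-wise BERT token embeddings are precomputed; sizing the Bi-LSTM so its concatenated states match d_m is our assumption:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, d_m=768):
        super().__init__()
        # Bidirectional LSTM whose concatenated hidden states have size d_m.
        self.bilstm = nn.LSTM(d_m, d_m // 2, bidirectional=True,
                              batch_first=True)

    def forward(self, doc_token_embs):
        # doc_token_embs: list of (num_tokens_i, d_m) BERT embeddings,
        # one tensor per document, ordered temporally.
        doc_embs = torch.stack([t.mean(dim=0) for t in doc_token_embs])
        out, _ = self.bilstm(doc_embs.unsqueeze(0))   # (1, d, d_m)
        return out.mean(dim=1).squeeze(0)             # node embedding, (d_m,)
```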

Composer
The Composer is a transformer-based graph attention network (GAT) followed by a pooling layer. We use the transformer encoding layer proposed by Vaswani et al. (2017), without the position-wise feed-forward sublayer, as the graph attention layer. The position-wise feed-forward sublayer is removed because, in contrast with sequence-to-sequence prediction tasks, nodes in a graph usually have no ordering relationship between them. The adjacency matrix of the graph is used as the attention mask, and self-loops are added for all nodes so that the updated representation of a node also depends on its previous representation. The Composer uses l = 2 graph attention layers in our experiments and generates updated node embeddings U ∈ R^{n×d_m} and a summary embedding S ∈ R^{1×d_m} as outputs.
The output dimension of the node embeddings is 768. One Composer attention layer is described by:

Q_i = H W_i^Q,   K_i = H W_i^K,   V_i = H W_i^V
head_i = softmax(mask(Q_i K_i^T / √d_k, A)) V_i
H' = LayerNorm(H + [head_1; ...; head_{n_h}] W^O)

where n is the number of nodes in the graph, d_m is the dimension of a BERT token embedding, H ∈ R^{n×d_m} is the matrix of current node states (initialized from the Encoder output E ∈ R^{d_m×n} as H = E^T), A ∈ {0, 1}^{n×n} is the adjacency matrix, mask(·, A) sets the attention score of every non-adjacent node pair to −∞, and W_i^Q, W_i^K ∈ R^{d_m×d_k}, W_i^V ∈ R^{d_m×d_v} and W^O ∈ R^{n_h·d_v×d_m} are weight parameters to be learnt. We set n_h = 12 and d_k = d_v = 64.
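A sketch of one Composer attention layer in PyTorch: standard multi-head self-attention with the adjacency matrix as the attention mask, a residual connection, and no position-wise feed-forward sublayer. Details beyond the text (e.g., the bias terms inside `nn.MultiheadAttention`) are assumptions:

```python
import torch
import torch.nn as nn

class GraphAttentionLayer(nn.Module):
    def __init__(self, d_m=768, n_h=12):
        super().__init__()
        # 12 heads over 768 dims gives d_k = d_v = 64 per head.
        self.attn = nn.MultiheadAttention(d_m, n_h, batch_first=True)
        self.norm = nn.LayerNorm(d_m)

    def forward(self, h, adj):
        # h: (n, d_m) node states; adj: (n, n) 0/1 matrix with self-loops,
        # so every node attends at least to itself.
        mask = (adj == 0)  # True where attention is blocked
        q = h.unsqueeze(0)
        out, _ = self.attn(q, q, q, attn_mask=mask)
        return self.norm(h + out.squeeze(0))
```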

Learning Tasks
We design two learning tasks to train the Compositional Reader model: Authorship Prediction and Referenced Entity Prediction. Both tasks are designed to train the model to learn the association between an author's node representation and the language used by that author, and both are variations of link prediction over the graph. The tasks are detailed below.

Authorship Prediction
Authorship Prediction is designed as a binary classification task. In this task, the model is given a graph generated by the Graph Generator (Sec. 4.1), an author node and a document node. The task is to predict whether or not the document was authored by the input author. The intuition behind this learning task is to enable our model to learn to differentiate between: 1) the language of an author's first-person discourse vs. third-person discourse in news articles, 2) the language of one author vs. the language of other authors, and 3) the language of an author in the context of one issue vs. other issues. The model sees documents by the author in the graph and learns to decide whether or not the input document is by the same author and about the same issue.

Data. Training data for the task was created as follows: for a particular author-issue pair, we obtain a data graph similar to Fig. 2 using the query mechanism in Sec. 3.3. To create a positive sample, we sample a document d_i authored by the entity a_i and remove the edges between the nodes a_i and d_i. Negative samples were carefully designed in 3 batches to align with the task objectives above. In the first batch, we sample news article nodes from the same graph. In the second batch, we obtain tweets, press releases and perspectives of the same author but from a different issue. In the third batch, we sample documents related to the same issue but from other authors. We generate 421,284 samples in total, with 252,575 positive samples and 168,709 negative samples. We randomly split the data into a training set of 272,159 samples, a validation set of 73,410 samples and a test set of 75,715 samples.

Architecture. We concatenate the initial and final node embeddings of the author and the document, along with the summary embedding of the graph, to obtain the input to the fine-tuning layers for the Authorship Prediction task. We add one hidden layer of dimension 384 before the classification layer.
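A sketch of this fine-tuning head; the choice of activation is an assumption:

```python
import torch
import torch.nn as nn

class AuthorshipHead(nn.Module):
    """Binary classifier over concatenated graph embeddings (5 x 768)."""
    def __init__(self, d_m=768):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(5 * d_m, 384),
                                 nn.ReLU(),
                                 nn.Linear(384, 2))

    def forward(self, auth_init, auth_final, doc_init, doc_final, summary):
        x = torch.cat([auth_init, auth_final,
                       doc_init, doc_final, summary], dim=-1)
        return self.net(x)  # logits over {not-author, author}
```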

Out-sample Evaluation
We perform out-sample experiments to evaluate generalization to unseen authors. We train the model on the training data of two-thirds of the politicians and test on the test sets of the rest. Results are shown in Tab. 2.

Graph Trimming. We perform graph trimming to make the computation tractable on a single GPU. We randomly drop 80% of the news articles, tweets and press releases that are not related to the event to which d_i belongs. We use graphs with 200-500 nodes and a batch size of 1.

Referenced Entity Prediction
This is also a binary classification task. Given a data graph, a document node with a masked entity, and a referenced entity node in the graph, the task is to predict whether the referenced entity is the same as the masked entity. The intuition behind this learning task is to enable our model to learn the correlation between the language of the author in the document and the masked entity. For example, in the context of Donald Trump's recent impeachment hearing, consider the sentence 'X needs to face the consequences of their actions'. Depending upon the author, X could be either 'Donald Trump' or 'Democrats'. Learning such correlations by looking at other documents from the same author is effective in capturing meaningful author representations.
Data. To create training data, we sample a document from the data graph and mask its most frequent entity with a generic <ENT> token. We remove the link between the masked entity and the document in the data graph. To generate a negative example, we sample another referenced entity from the graph. We generated 252,578 samples for this task, half of them positive. They were split into 180,578 training samples and validation and test sets of 36,400 samples each.
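A sketch of example creation for this task; the graph and document attributes are illustrative stand-ins:

```python
import random
from collections import Counter

def make_example(graph, rng=random):
    doc = rng.choice(graph.documents)
    # Mask the most frequent referenced entity in the document.
    target = Counter(doc.referenced_entities).most_common(1)[0][0]
    doc.text = doc.text.replace(target.surface_form, "<ENT>")
    graph.remove_edge(target, doc)
    if rng.random() < 0.5:
        return graph, doc, target, 1                 # positive example
    distractor = rng.choice([e for e in graph.entities if e is not target])
    return graph, doc, distractor, 0                 # negative example
```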
Architecture. We use a fine-tuning architecture similar to Authorship Prediction on top of the Compositional Reader for this task as well. We keep separate fine-tuning parameters for each task, as they are fundamentally different prediction problems; the Compositional Reader is shared. We apply graph trimming for this task as well, and we also perform out-sample evaluation for this learning task.

Evaluation
We evaluate our model and pre-training tasks systematically using several quantitative tasks and qualitative analyses. Quantitative evaluation includes the Grade Paraphrase task, Grade Prediction on National Rifle Association (NRA) and League of Conservation Voters (LCV) grade data, and the Roll Call Vote Prediction task. Qualitative evaluation includes entity-stance visualization for issues and Opinion Descriptor Generation. We compare our model's performance to BERT representations, the BERT Adaptation baseline and representations from the Encoder module. The baselines and evaluation tasks are detailed below; further evaluation tasks are in the appendix.

Baselines
BERT: We compute the results obtained by using pooled BERT representations of relevant documents for each of the quantitative tasks. Details of the chosen documents and the pooling procedure are described in the relevant task subsections. We chose BERT-base over BERT-large due to the complexity of running the learning tasks with the larger embeddings.

BERT Adaptation: In BERT Adaptation, once we generate the data graph, we pass the mean-pooled sentence-wise BERT embeddings of the node documents through a Bi-LSTM and mean-pool its output to get node embeddings. We use fine-tuning layers on top of the node embeddings thus obtained for both learning tasks. The BERT Adaptation baseline allows us to showcase the importance of our proposed training tasks via comparison with BERT-base representations; it also demonstrates the usefulness of the Composer.

Grade Paraphrase Task
The National Rifle Association (NRA) assigns letter grades (A+, A, ..., F) to politicians based on a candidate questionnaire and their gun-related voting records. We evaluate our representations on their ability to predict these grades. We collected politicians' historical NRA grades from everytown.org.

In the Grade Paraphrase task, we evaluate our representations directly, without training on the NRA data. Grades are divided into two classes: grades of B+ and above form the positive class, and grades from C+ down to F form the negative class. We formulate representative sentences for them:
• POSITIVE: I strongly support the NRA
• NEGATIVE: I vehemently oppose the NRA

For each politician, we obtain the data graph for the issue guns, input it to the Compositional Reader model, and use the node embeddings of the author politician (n_auth), the issue guns (n_guns) and the referenced entity NRA (n_NRA). For some politicians, n_NRA is not available, as they have not referenced the NRA in their discourse; we use only n_auth and n_guns for them. We compute BERT-base embeddings of the representative sentences to obtain pos_NRA and neg_NRA. We mean-pool n_auth, n_guns and n_NRA to obtain n_stance and compute its cosine similarity with pos_NRA and neg_NRA; the politician is assigned the class with higher similarity.

We compare our model's results to BERT-base, BERT Adaptation and Encoder embeddings. For BERT-base, we compute n_stance by mean-pooling the sentence-wise BERT embeddings of the author's tweets, press releases and perspectives on all events related to the issue guns. Results are shown in Tab. 3. The Compositional Reader achieves 63.32% accuracy, Encoder embeddings 56.16%, mean-pooled BERT-base embeddings 41.55%, and node embeddings from the BERT Adaptation model 37.54%. When we evaluate using only 'A'/'F' grades, we obtain 63.93% accuracy for the Compositional Reader, 48.36% for the Encoder, 42.62% for BERT Adaptation and 38.52% for BERT-base.
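A sketch of the zero-shot assignment, assuming n_auth, n_guns, n_NRA, pos_NRA and neg_NRA are precomputed 768-dimensional vectors (n_NRA may be missing):

```python
import torch
import torch.nn.functional as F

def assign_class(n_auth, n_guns, pos_nra, neg_nra, n_nra=None):
    parts = [n_auth, n_guns] + ([n_nra] if n_nra is not None else [])
    n_stance = torch.stack(parts).mean(dim=0)
    sim_pos = F.cosine_similarity(n_stance, pos_nra, dim=0)
    sim_neg = F.cosine_similarity(n_stance, neg_nra, dim=0)
    return "POSITIVE" if sim_pos > sim_neg else "NEGATIVE"
```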

Grade Prediction Task
NRA Grades. This is designed as a 5-class classification task over the grades {A, B, C, D, F}. We train a simple feed-forward network with one hidden layer. The network is given two inputs, n_auth and n_guns. When n_NRA is available for an author, we set n_guns = mean(n_NRA, n_guns). The output is the predicted grade class.
We perform k = 10-fold cross-validation for this task, repeat the entire process for 5 random seeds, and report results with confidence intervals. We perform this evaluation for BERT-base, BERT Adaptation, Encoder and Compositional Reader embeddings. To compute n_auth for BERT-base, we mean-pool the sentence-wise embeddings of all author documents on guns; for n_guns, we use the background description document of the issue guns. Results on the test set are in Tab. 3.

LCV Grades. This is similar to the NRA Grade Prediction task, but is a 4-way classification task. The League of Conservation Voters (LCV) assigns a score between 0 and 100 to each politician based on their environmental voting record. We segregate politicians into 4 classes (0-25, 25-50, 50-75, 75-100). We obtain the input to the prediction model by concatenating n_auth and n_environment, and use the same fine-tuning architecture as the NRA Grade Prediction task.

Results of the Grade Prediction task are shown in Tab. 3. On NRA Grade Prediction, a 5-way classification task, our model achieves an accuracy of 81.62 ± 1.23 on the test set, outperforming BERT representations by 26.79 ± 3.02 absolute points. On the LCV Grade Prediction task, a 4-way classification, our model achieves a 9.61 ± 1.77 point improvement over BERT representations.
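A sketch of the grade classifier described above; the hidden size is an assumption, as the text only specifies a single hidden layer:

```python
import torch.nn as nn

# Input: n_auth concatenated with n_guns (or n_environment for LCV).
grade_clf = nn.Sequential(
    nn.Linear(2 * 768, 384),
    nn.ReLU(),
    nn.Linear(384, 5),  # 5 NRA grade classes; 4 for LCV
)
```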

Roll Call Vote Prediction Task
This task was proposed by Patil et al. (2019); we skip its finer details for brevity. The task is to predict the voting behavior of US politicians on roll call votes: given bill texts and the politicians' voting histories, the aim is to predict their future votes. We inject our politician author embeddings from the Compositional Reader model to improve performance on the task. We input all of a politician's first-person discourse from our data to compute their author embedding with the Compositional Reader, and use these embeddings to initialize the legislator embeddings in their news-augmented GloVe model, which is their best-performing model. We use the data splits provided in their official implementation and use their code to reproduce their results. Results are shown in Tab. 4.

Qualitative Evaluation
Politician Visualization. We perform Principal Component Analysis (PCA) on the issue embeddings (n_issue) of politicians, obtained using the same method as in NRA Grade Prediction. We show one such visualization in Fig. 4: Sen. McConnell, a Republican, expressed right-wing views on both environment and guns; Sen. Sanders, a Democrat, expressed left-wing views on both; Rep. Rooney, a Republican, expressed right-wing views on guns but left-wing views on environment. Fig. 4 demonstrates that this information is captured by our representations. Additional visualizations are included in the appendix.

Issue Visualization. We present a visualization of politicians on the issue guns in Fig. 4. We observe that guns tends to be a polarizing issue, which shows that our representations effectively capture the relative stances of politicians. We also observe that issues that have traditionally had clear conservative vs. liberal boundaries, such as guns and abortion, are more polarized than issues that evolve with time, such as middle-east and economic-policy. The visualization for the issue abortion is in the appendix.

Opinion Descriptor Generation
This task demonstrates a simple way to interpret our contextualized representations as natural language descriptors. It is an unsupervised, qualitative evaluation task in which we generate opinion descriptors for authoring entities on specific issues. We use the final node embedding of the issue node (n_issue). Following Han et al. (2019), we define our candidate space for descriptors as the set of adjectives used by the entity in their tweets, press releases and perspectives related to an issue. Although Han et al. (2019) use verbs as relationship-descriptor candidates, we opine that adjectives describe opinions better. We compute the representative embedding of each descriptor by mean-pooling the contextualized embeddings of that descriptor over all its occurrences in the politician's discourse. This is one of the key differences from prior descriptor-generation works such as Han et al. (2019) and Iyyer et al. (2016): they work in a static word-embedding space, whereas our embeddings are contextualized and reside in a higher-dimensional space. In an unsupervised setting, this makes it more challenging to translate from the distributional space to natural language tokens; hence, we restrict the candidate descriptor space more than Han et al. (2019) and Iyyer et al. (2016). We rank all candidate descriptors by the cosine similarity of their representative embeddings with n_issue.
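A sketch of the ranking step, assuming `cand_embs` maps each candidate adjective to the mean of its contextualized embeddings across the politician's discourse on the issue:

```python
import torch.nn.functional as F

def rank_descriptors(n_issue, cand_embs, top_k=5):
    scored = [(adj, F.cosine_similarity(n_issue, emb, dim=0).item())
              for adj, emb in cand_embs.items()]
    return sorted(scored, key=lambda s: -s[1])[:top_k]
```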
We present some of the results in Tab. 5. In contrast to Iyyer et al. (2016) and Han et al. (2019), our model does not need both entities to be present in the text to generate opinion descriptors; in first-person discourse, they often are not.

Ablation Study
Further, we investigate the importance of the various document types via an ablation study on the NRA Grade Paraphrase task. The results are shown in Tab. 6. They indicate that perspectives are the most useful document type for this task, while tweets are the least useful. As perspectives are summarized ideological leanings of politicians, it is intuitive that they are more effective here. Tweets are informal discourse and tend to be very specific to a current event; hence they are less useful for this task.

Conclusion
We tackle the problem of understanding politics, i.e., creating unified representations of political figures that capture their views and legislative preferences, directly from raw political discourse originating from multiple sources. We propose the Compositional Reader model, which composes multiple documents in one shot to form a unified political entity representation while capturing the real-world context needed to represent the interactions between these documents. We evaluate our model on several qualitative and quantitative tasks, outperforming the BERT-base model on both types. Our qualitative evaluation demonstrates that our representations effectively capture nuanced political information.

Appendices A Event Examples
In this section, we provide examples of events identified by our event identification heuristic. For each automatically extracted event, we observe that the news headlines within the cluster usually describe the same real-world event. The span of each event is at most 10 days; hence, the assumption that events within each issue are non-overlapping is a reasonable relaxation of reality. We have made the event-segregated document data available along with our code for future research. Examples are shown in Tab. 7.

B Reproducibility
We use fixed seeds (4056 for both tasks) for both random example generation and training the neural networks. For the fine-tuning layers of the learning tasks, we initialize the models using Xavier uniform initialization (Glorot and Bengio, 2010) with gain = 1.0. We optimize the parameters using Stochastic Gradient Descent with an initial learning rate of 0.0075 and momentum of 0.4. We used 4 Nvidia GeForce GTX 1080 Ti GPUs with 12 GB memory and Linux servers with 64 GB RAM for our experiments. CPU RAM and GPU memory are the main bottlenecks for training the model. It takes 80 hours to train Authorship Prediction for 5 epochs and 14 hours to train Referenced Entity Prediction for the same number of epochs; generating test results for both tasks together takes 3 hours. We use a batch size of 1 for both training and evaluation. For the NRA Grade Prediction task, we use 5 random seeds {5, 7, 11, 13, 17} and report the mean and standard deviation. The encoder-composer architecture has 8.26M parameters, with the encoder accounting for 3.54M and the composer for 4.72M. Due to the long training time, the only hyper-parameter we experimented with is the graph size: we retained as many nodes as possible without exceeding GPU memory (500 nodes). We divide the 3,640 queries into 151 batches of 24 queries each (3 politicians × 8 issues) and 1 batch of 16 queries (2 politicians × 8 issues). Train, validation and test examples are generated for each query batch. For the Authorship Prediction and Referenced Entity Prediction tasks, the Compositional Reader model is trained on one batch for 5 epochs, the best parameters are chosen according to the validation performance on that batch, and we then proceed to train on the next batch. Politicians are ordered randomly when generating queries.

B.1 Data Collection
We collected data from 5 sources: Wikipedia, Twitter, ontheissues.org, allsides.com and the ProPublica Congress API. We scraped Wikipedia articles related to all the politicians in focus. We collected tweets from Congress Tweets and Baumgartner (2019) and used a set of hand-built gold hashtags to separate them by issue; they are listed at the end of this document. We collected all news articles related to the 8 issues in focus from allsides.com, and collected press releases from the ProPublica API using keyword search, with issue names as keywords. We only maintain pointers to the processed tweet and text data in our data releases. All social media text analyzed is by public political figures, not private citizens.

C.1 Grade Prediction Additional Ablation
For the Grade Prediction task, we perform experiments training the model on fractions of the data and monitor validation and test performance as the training-data percentage changes. We observe that, in general, the gap between the Compositional Reader model and the BERT baseline widens as the training data increases, hinting that our representation captures more task-relevant information. Results are shown in Fig. 13.