NewsMTSC: A Dataset for (Multi-)Target-dependent Sentiment Classification in Political News Articles

Previous research on target-dependent sentiment classification (TSC) has mostly focused on reviews, social media, and other domains where authors tend to express sentiment explicitly. In this paper, we investigate TSC in news articles, a much less researched TSC domain despite the importance of news as an essential information source in individual and societal decision making. We introduce NewsMTSC, a high-quality dataset for TSC on news articles with key differences compared to established TSC datasets, including, for example, different means to express sentiment, longer texts, and a second test-set to measure the influence of multi-target sentences. We also propose a model that uses a BiGRU to interact with multiple embeddings, e.g., from a language model and external knowledge sources. The proposed model improves the performance of the prior state-of-the-art from F1_m=81.7 to 83.1 (real-world sentiment distribution) and from F1_m=81.2 to 82.5 (multi-target sentences).

We investigate TSC in news articles, a much less researched domain despite its critical relevance, especially in times of "fake news," echo chambers, and news ownership centralization (Hamborg et al., 2019). How persons are portrayed in news on political topics is highly relevant, e.g., for individual and societal opinion formation (Bernhardt et al., 2008).
We define our problem statement as follows: we seek to detect polar judgments towards target persons (Steinberger et al., 2017). Following the TSC literature, we include only in-text, specifically in-sentence, means to express sentiment. In news texts, such means are, e.g., word choice and, more generally, framing (Kahneman and Tversky, 1984; Entman, 2007), e.g., "freedom fighters" vs. "terrorists," describing actions performed by the target, and indirect sentiment through quoting another person (Steinberger et al., 2017). Other means may also alter the perception of persons and topics in the news but are not in the scope of the task (Balahur et al., 2010), e.g., because they do not operate on sentence level. Examples are story selection, source selection, an article's placement and size (Hamborg et al., 2019), and epistemological bias (Recasens et al., 2013).
The main contributions of this paper are: (1) We introduce NewsMTSC, a large, manually annotated dataset for TSC in political news articles. We analyze the quality and characteristics of the dataset using an on-site expert annotation. Because of its fundamentally different characteristics compared to previous TSC datasets, e.g., as to sentiment expressions and text lengths, NewsMTSC represents a challenging novel dataset for the TSC task. (2) We propose a neural model that improves TSC performance compared to prior state-of-the-art models. Additionally, our model yields competitive performance on established TSC datasets. (3) We perform an extensive evaluation and ablation study of the proposed model. Among other things, we investigate the recently claimed "degeneration" of TSC to sequence-level classification, finding a performance drop in all models when comparing single- and multi-target sentences.
In a previous short paper, we explored the characteristics of how sentiment is expressed in news articles by creating and analyzing a small-scale TSC dataset (Hamborg et al., 2021). The paper at hand addresses our former exploratory work's critical findings, including essential improvements to the dataset. Key differences and improvements are as follows. We significantly increase the dataset's size and the number of annotators per example and address its class imbalance. Further, we devise annotation instructions specifically created to capture a broad spectrum of sentiment expressions specific to news articles. In contrast, the early dataset misses the more implicit sentiment expressions commonly used by news authors (Hamborg et al., 2021; Steinberger et al., 2017). Also, we comprehensively test various consolidation strategies and conduct an expert annotation to validate the dataset.
We provide the dataset and code to reproduce our experiments at: https://github.com/fhamborg/NewsMTSC

Related Work
Analogously to other NLP tasks, the TSC task has recently seen a significant performance leap due to the rise of language models (Devlin et al., 2019). Pre-BERT approaches yield up to F1_m = 63.3 on the SemEval 2014 Twitter set (Kiritchenko et al., 2014). They employ traditional machine learning combining hand-crafted sentiment dictionaries, such as SentiWordNet (Baccianella et al., 2010), and other linguistic features (Biber and Finegan, 1989). On the same dataset, vanilla BERT (also called BERT-SPC) yields 73.6 (Devlin et al., 2019; Zeng et al., 2019). Specialized downstream architectures improve performance further, e.g., LCF-BERT yields 75.8 (Zeng et al., 2019).
The vast majority of recently proposed TSC approaches employ BERT and focus on devising specialized downstream architectures (Sun et al., 2019a; Zeng et al., 2019). More recently, additional measures have been proposed to improve performance further. Examples are domain adaptation of BERT, i.e., domain-specific language model finetuning prior to the TSC finetuning (Rietzler et al., 2019; Du et al., 2020); use of external knowledge, such as sentiment or emotion dictionaries (Hosseinia et al., 2020), rule-based sentiment systems (Hosseinia et al., 2020), and knowledge graphs (Ghosal et al., 2020); use of all mentions of a target and/or related targets in a document (Chen et al., 2020); and explicit encoding of syntactic information (Phan and Ogunbona, 2020; Yin et al., 2020).
To train and evaluate recent TSC approaches, three datasets are commonly used: Twitter (Nakov et al., 2013, 2016; Rosenthal et al., 2017), Laptop, and Restaurant (Pontiki et al., 2014, 2015). These and other TSC datasets (Pang and Lee, 2005) suffer from at least one of the following shortcomings. First, implicitly or indirectly expressed sentiment is rare in them. In their domains, e.g., social media and reviews, authors typically express their sentiment regarding a target explicitly (Zhang et al., 2018). Second, they largely neglect that a text may contain coreferential mentions of the target or mentions of different concepts (with potentially different polarities), respectively.
Texts in news articles differ from reviews and social media in that news authors typically do not express sentiment toward a target explicitly (exceptions include opinion pieces and columns). Instead, journalists implicitly or indirectly express sentiment (Section 1) because language in news is typically expected to be neutral and journalists to be objective (Balahur et al., 2010;Godbole et al., 2007;Hamborg et al., 2019).
Our problem statement (Section 1) is largely identical to prior news TSC literature (Steinberger et al., 2017;Balahur et al., 2010) with key differences: we do not generally discard the "author-" and "reader-level." Doing so would neglect large parts of sentiment expressions. Thus, it would degrade real-world performance of the resulting dataset and models trained on it. For example, word choice (listed as "author-level" and discarded from their problem statement) is in our view an in-text means that may in fact strongly influence how readers perceive a target, e.g., "freedom fighters" or "terrorists." While we do not exclude their "reader-level," we do seek to exclude polarizing or contentious cases, where no uniform answer can be found in a set of randomly selected readers (Sections 3.3 and 3.4). As a consequence, we generally do not distinguish between the three levels of sentiment ("author," "reader," and "text") in this paper.
Previous news TSC approaches mostly employ sentiment dictionaries, e.g., created manually (Balahur et al., 2010; Steinberger et al., 2017) or extended semi-automatically (Godbole et al., 2007), but yield poor or even "useless" (Steinberger et al., 2017) performances. To our knowledge, there exist two datasets for the evaluation of news TSC methods. One of them (Steinberger et al., 2017), perhaps due to its small size (N = 1274), has not been used or tested in recent TSC literature. Recently, Hamborg et al. (2021) proposed a dataset (N = 3002) used to explore target-dependent sentiment in news articles. The dataset suffers from various shortcomings, particularly its small size, its class imbalance, and its lack of the more ambiguous and implicit types of sentiment expressions described above. Another dataset contains quotes extracted from news articles, since quotes more likely contain explicit sentiment (N = 1592) (Balahur et al., 2010).

NewsMTSC: Dataset Creation
In creating the dataset, we rely on best practices reported in literature on the creation of datasets for NLP (Pustejovsky and Stubbs, 2012), especially for the TSC task (Rosenthal et al., 2017). Compared to previous TSC datasets though, the nature of sentiment in news articles requires key changes, especially in the annotation instructions and consolidation of answers (Steinberger et al., 2017).

Data sources
We use two datasets as sources: POLUSA (Gebhard and Hamborg, 2020) and Bias Flipper 2018 (BF18) (Chen et al., 2018). Both satisfy five criteria that are important to our problem. First, they contain news articles reporting on political topics. Second, they approximately match the online media landscape as perceived by an average US news consumer. Third, they have a high diversity in topics due to the number of articles contained and time frames covered (POLUSA: 0.9M articles published between Jan. 2017 and Aug. 2019, BF18: 6447 articles associated with 2781 events). Fourth, they feature high diversity in writing styles because they contain articles from across the political spectrum, including left- and right-wing outlets. Fifth, we find that they contain only a few minor content errors despite being created through scraping or crawling.

Creation of examples
To create a batch of examples for annotation, we devise a three-task process: first, we extract example candidates from randomly selected articles. Second, we discard non-optimal candidates. Third, for the train set only, we filter candidates to address class imbalance. We repeatedly execute these tasks so that each batch yields 500 examples for annotation, contributed equally by both sources.
First, we randomly select articles from the two sources. Since both are at least very approximately uniformly distributed over time (Gebhard and Hamborg, 2020; Chen et al., 2018), randomly drawing articles will yield sufficiently high diversity in both writing styles and reported topics (Section 3.1). To extract examples that contain meaningful target mentions from an article, we employ coreference resolution (CR). We iterate over all resulting coreference clusters of the given article and create a single example for each mention and its enclosing sentence.
Extraction of mentions of named entities (NEs) is the commonly employed method to create examples in previous TSC datasets (Rosenthal et al., 2017; Nakov et al., 2016, 2013; Steinberger et al., 2017). We do not use it since we find it would miss 30% of the mentions of relevant target candidates, e.g., pronominal or near-identity mentions.
Second, we perform a two-level filtering to improve the quality and "substance" of candidates. On coreference cluster level, we discard a cluster $c$ in a document $d$ if $|M_c| \leq 0.2\,|S_d|$, where $|M_c|$ is the number of mentions of the cluster and $|S_d|$ the number of sentences in the document. Also, we discard non-person clusters, i.e., if $\exists m \in M_c : t(m) \notin \{-, P\}$, where $t(m)$ yields the NE type of $m$, and $-$ and $P$ represent the unknown and person type, respectively. On example level, we discard short and similar examples $e$, i.e., if $|s_e| < 50$ or if $\exists \hat{e} : \mathrm{sim}(s_e, s_{\hat{e}}) > 0.6 \land m_e = m_{\hat{e}} \land t_e = t_{\hat{e}}$, where $s_e$, $m_e$, and $t_e$ are the sentence of $e$, its mention, and the target's cluster, respectively, and $\mathrm{sim}(\cdot, \cdot)$ is the cosine similarity. Lastly, if a cluster has multiple mentions in a sentence, we try to select the most meaningful example. In short, we prefer the cluster's representative mention over nominal mentions, and those over all other instances.
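The two filtering levels described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the dictionary keys, measuring sentence length in characters, and computing cosine similarity over lowercase word counts are all assumptions made for illustration, and the sketch keeps the first of two near-duplicate examples rather than discarding both.

```python
import math
import re
from collections import Counter


def keep_cluster(mention_types, num_sentences, min_ratio=0.2):
    """Cluster-level filter: keep a cluster only if |M_c| > 0.2 |S_d| and
    every mention is typed as person ("P") or unknown ("-")."""
    if len(mention_types) <= min_ratio * num_sentences:
        return False
    return all(t in {"-", "P"} for t in mention_types)


def cosine_similarity(a, b):
    """Cosine similarity over lowercase word counts (assumed vectorization)."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def filter_examples(examples, min_len=50, max_sim=0.6):
    """Example-level filter over dicts with the (hypothetical) keys
    'sentence', 'mention', and 'cluster'."""
    kept = []
    for e in examples:
        if len(e["sentence"]) < min_len:  # short example (length unit assumed)
            continue
        is_duplicate = any(
            k["mention"] == e["mention"]
            and k["cluster"] == e["cluster"]
            and cosine_similarity(k["sentence"], e["sentence"]) > max_sim
            for k in kept
        )
        if not is_duplicate:
            kept.append(e)
    return kept
```

Keeping the first of several similar examples is one possible reading of the similarity criterion; a stricter reading would drop every example that has a similar counterpart.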
Third, for only the train set, we filter candidates to address class imbalance. Specifically, we discard examples e that are likely the majority class (p(neutral|s e ) > 0.95) as determined by a simple binary classifier (Sanh et al., 2019). Whenever annotated and consolidated examples are added to the train set of NewsMTSC, we retrain the classifier on them and all previous examples in the train set.

Annotation
Instructions used in popular TSC datasets plainly ask annotators to rate the sentiment of a text toward a target (Rosenthal et al., 2017;Pontiki et al., 2015). For news texts, we find that doing so yields two issues (Balahur et al., 2010): low inter-annotator reliability (IAR) and low suitability. Low suitability refers to examples where annotators' answers can be consolidated but the resulting majority answer is incorrect as to the task. For example, instructions from prior TSC datasets often yield low suitability for polarizing targets, independently of the sentence they are mentioned in. Figure 2 (Appendix) depicts our final annotation instructions.
In an interactive process with multiple test annotations (six on-site and eight on Amazon Mechanical Turk, MTurk), we test various measures to address the two issues. We find that asking annotators to think from the perspective of the sentence's author strongly helps annotators overcome their personal attitude. Further, we find that we can effectively draw annotators' attention not only to the event and other "facts" described in the sentence (the "what") but also to word choice ("how" it is described) by mentioning both factors as examples and abstracting them as the author's holistic "attitude." We further improve IAR and suitability, e.g., by explicitly instructing annotators to rate sentiment only regarding the target but not other aspects, such as the reported event.
Technically, we closely follow previous literature on TSC datasets (Pontiki et al., 2015;Rosenthal et al., 2017). We conduct the annotation of our examples on MTurk. Each example is shown to five randomly selected crowdworkers. To participate in our annotation, crowdworkers must have the "Master" qualification, i.e., have a record of successfully completed, high quality work on MTurk.
To ensure quality, we implement a set of objective measures and tests (Kim et al., 2012). While we always pay all crowdworkers (USD 0.07 per assignment), we discard all of a crowdworker's answers if at least one of the following conditions is met. A crowdworker (a) was not shown any test question or answered at least one incorrectly, (b) provided answers to invisible fields in the HTML form (0.3% of crowdworkers did so, supposedly bots), or (c) spent an extremely low average time on the assignments (< 4s).
The IAR is sufficiently high ($\kappa_C = 0.74$) when considering only examples in NewsMTSC. The expected mixed quality of crowdsourced work becomes apparent when considering all examples, including those that could not be consolidated and answers of those crowdworkers who did not pass our quality checks ($\kappa_C = 0.50$).
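The three exclusion conditions can be read as a simple predicate over per-crowdworker records. The following is a minimal sketch under assumptions: the `WorkerRecord` fields and the function name are hypothetical, and treating a crowdworker without recorded durations as failing is a choice made here, not something the paper states.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class WorkerRecord:
    """Hypothetical per-crowdworker summary used only in this sketch."""
    test_answers_correct: list = field(default_factory=list)  # one bool per test question shown
    filled_hidden_fields: bool = False                        # answered invisible HTML fields
    durations_s: list = field(default_factory=list)           # seconds spent per assignment


def passes_quality_checks(worker, min_mean_duration_s=4.0):
    """Return False if any of the conditions (a), (b), or (c) is met."""
    # (a) no test question shown, or at least one answered incorrectly
    if not worker.test_answers_correct or not all(worker.test_answers_correct):
        return False
    # (b) answers provided to invisible form fields
    if worker.filled_hidden_fields:
        return False
    # (c) extremely low average time per assignment (missing data treated as failing)
    if not worker.durations_s or mean(worker.durations_s) < min_mean_duration_s:
        return False
    return True
```

All answers of a crowdworker for whom `passes_quality_checks` returns `False` would then be dropped before consolidation.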

Consolidation
We consolidate the answers of each example to a majority answer by employing a restrictive strategy. Specifically, we consolidate the set of five answers $A$ to the single-label 3-class polarity $p \in \{\text{pos.}, \text{neu.}, \text{neg.}\}$ if $\exists C \subseteq A : |C| \geq 4 \land \forall c \in C : s(c) = p$, where $s(c)$ yields the 3-class polarity of an individual 7-class answer $c$, i.e., neutral ⇒ neutral, any positive (from slightly to strongly) ⇒ positive, and respectively for negative. If there is no such consolidation set $C$, $A$ cannot be consolidated and the example is discarded. Consolidating to 3-class polarity allows for direct comparison to established TSC datasets.
While the strategy is restrictive (only 50.6% of all examples are consolidated this way), we find it yields the highest quality. We quantify the dataset's quality by comparing the dataset to an expert annotation (Section 3.6) and by training and testing models on dataset variants with different consolidations. Compared to consolidations employed for previous TSC datasets, quality is improved significantly on our examples, e.g., our strategy yields F1_m = 86.4 when comparing to experts' annotations and models trained on the resulting set yield up to F1_m = 83.1, whereas the two-step majority strategy employed for the Twitter 2016 set (Nakov et al., 2016) yields 50.6 and 53.4, respectively.
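The consolidation rule can be sketched in a few lines. The encoding of the 7-class answers as integers from -3 (strongly negative) to 3 (strongly positive) and the function names are assumptions made for this illustration; only the mapping to three classes and the agreement threshold of four out of five answers come from the text.

```python
from collections import Counter


def to_polarity(answer):
    """Map a 7-class answer (assumed integer scale -3..3) to 3-class polarity."""
    if answer > 0:
        return "positive"
    if answer < 0:
        return "negative"
    return "neutral"


def consolidate(answers, min_agreement=4):
    """Return the consolidated polarity, or None if the example is discarded."""
    if not answers:
        return None
    counts = Counter(to_polarity(a) for a in answers)
    polarity, votes = counts.most_common(1)[0]
    return polarity if votes >= min_agreement else None


print(consolidate([1, 3, 2, 1, 0]))   # four positive answers -> "positive"
print(consolidate([1, 2, 0, 0, -1]))  # no polarity reaches four votes -> None
```

With five answers and a threshold of four, at most one polarity can satisfy the condition, so taking the most frequent polarity is sufficient.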

Splits and multi-target examples
NewsMTSC consists of three sets as depicted in Table 1. For the train set, we employ class balancing [...] we create a second test set named test-rw, which omits the MT filtering and is thus designed to be as close as possible to the real-world distribution of sentiment. We seek to provide a sentiment score for each person in each sentence in train and test-rw, but mentions may be missing, e.g., because of erroneous coreference resolution or because crowdworkers' answers could not be consolidated.

Table 1: Statistics of the three sets of NewsMTSC.
Set      Total  Pos.  Neu.  Neg.  MT-a  MT-d  +Corefs  Pos.  Neu.  Neg.
Train     8739  2395  3028  3316   972   341    11880  3434  3744  4702
Test-mt   1476   246   748   482   721   294     1883   333   910   640
Test-rw   1146   361   587   624    73    30     1572   361   587   624

Quality and characteristics
We conduct an expert annotation of a random subset of 360 examples used during the creation of NewsMTSC with five international graduate students (studying Political or Communication Science at the University of Zurich, Switzerland, 3 female, 2 male, aged between 23 and 29). Key differences compared to the MTurk annotation are: first, extensive training until high IAR is reached (considering all examples: $\kappa_C = 0.72$, only consolidated: $\kappa_C = 0.93$). We conduct five iterations, each consisting of individual annotations by the students, quantitative and qualitative review, adaptation of instructions, and individual and group discussions. Second, comprehensive instructions (4 pages). Third, no time pressure, since the students are paid per hour (crowdworkers per assignment). When comparing the expert annotation with our dataset, we find that NewsMTSC is of high quality (F1_m = 86.4). The quality of unfiltered answers from MTurk is, as expected, much lower (50.1).
What is contained in NewsMTSC? In a random set of 50 consolidated examples from MTurk, we find that the most frequent, non-mutually exclusive means to express a polar statement (62% of the 50) are usage of quotes (in total, direct, and indirect 42%, 28%, and 14%, respectively), target being subject to action (24%), evaluative expression by the author or an opinion holder mentioned outside of the sentence (18%), target performing an action (16%), and loaded language or connotated terms (14%). Direct quotes often contain evaluative expressions or connotated terms, indirect quotes less so. Neutral examples (38% of the 50) contain mostly objective storytelling about neutral events (16%) or variants of "[target] said that ..." (8%).
What is not contained in NewsMTSC? We qualitatively review all examples where individual answers could not be consolidated to identify potential causes why annotators do not agree. The predominant reason is technical, i.e., the restrictiveness of the consolidation (MTurk compared to experts: 26% ≈ 30%). Other examples lack apparent causes (24% vs. 8%). Further potential causes are (not mutually exclusive): ambiguous sentence (16% ≈ 18%), sentence contains positive and negative parts (8% ≈ 6%), and opinion holder is target (6% ≈ 8%), e.g., "[...] Bauman asked supporters to 'push back' against what he called a targeted campaign to spread false rumors about him online."
What are the qualitative differences between the annotations by crowdworkers and experts? We review all 63 cases (18%) where answers from MTurk could be consolidated but differ from experts' answers. The major reason for disagreement is the restrictiveness of the consolidation (53 cases have no consolidation among the experts). In 10 cases, the consolidated answers differ. We find that in a few examples (2-3%) crowdsourced annotations are superficial and fail to interpret the full sentence correctly.
Texts in NewsMTSC are much longer than in prior TSC datasets (mean over all examples): 152 characters compared to 100, 96, and 90 in Twitter, Restaurant, and Laptops, respectively.

Methodology
The goal of TSC is to find a target's polarity $y \in \{\text{pos.}, \text{neu.}, \text{neg.}\}$ in a sentence. Our model consists of four key components (Figure 1): a pretrained language model (LM), a representation of external knowledge sources (EKS), a target mention mask, and a bidirectional GRU (BiGRU). We adapt our model from Hosseinia et al. (2020) and change the design as follows: we employ a target mask (which they did not) and use multiple EKS simultaneously (instead of one). Further, we use a different set of EKS (Section 5) and do not exclude the LM's parameters from finetuning.

Input representation
We construct three model inputs. The first is a text input $T$ constructed as suggested by Devlin et al. (2019) for question-answering (QA) tasks. Specifically, we concatenate the sentence and target mention and tokenize the two segments using the LM's tokenizer and vocabulary, e.g., WordPiece for BERT (Wu et al., 2016). This step results in a text input sequence $T = [\mathrm{CLS}, s_0, s_1, \ldots, s_p, \mathrm{SEP}, t_0, t_1, \ldots, t_q, \mathrm{SEP}] \in \mathbb{N}^n$ consisting of $n$ word pieces, where $n$ is the manually defined maximum sequence length. The second input is a feature representation of the sentence, which we create using one or more EKS, such as dictionaries (Hosseinia et al., 2020). Given an EKS with $d$ dimensions, we construct an EKS representation $E \in \mathbb{R}^{n \times d}$ of the sentence $S$, where each vector $e_i$, $i \in \{0, 1, \ldots, p\}$, is a feature representation of the word piece $i$ in the sentence. To facilitate learning associations between the token-based EKS representation and the WordPiece-based sequence $T$, we create $E$ so that it contains $k$ repeated vectors for each token, where $k$ is the token's number of word pieces. In doing so, we also account for special tokens, such as CLS. If multiple EKS with a total number of dimensions $\hat{d} = \sum d$ are used, their representations of the sentence are stacked, resulting in $E \in \mathbb{R}^{n \times \hat{d}}$. The third input is a target mask $M \in \mathbb{R}^n$, i.e., for each word piece $i$ in the sentence that belongs to the target, $m_i = 1$, else $0$ (Gao et al., 2019).
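To make the alignment of the three inputs concrete, the following minimal sketch builds $T$, $E$, and $M$ for one example. Several parts are assumptions made purely for illustration: `toy_word_pieces` is a hypothetical stand-in for the LM's tokenizer (it just chunks characters), $T$ holds strings instead of vocabulary ids, special and padding positions receive zero EKS vectors, and `[PAD]` is used as the padding symbol.

```python
def toy_word_pieces(token, size=4):
    """Hypothetical stand-in for an LM tokenizer: fixed-size character chunks."""
    return [token[i:i + size] for i in range(0, len(token), size)]


def build_inputs(sentence_tokens, target_span, eks_lookup, d, n):
    """Build the text input T, the EKS representation E, and the target mask M.

    sentence_tokens: word tokens of the sentence
    target_span:     (start, end) token indices of the target mention, end exclusive
    eks_lookup:      dict from lowercase token to a d-dimensional feature list
    """
    zero = [0.0] * d
    start, end = target_span
    T, E, M = ["[CLS]"], [zero], [0]
    for i, token in enumerate(sentence_tokens):
        features = eks_lookup.get(token.lower(), zero)
        in_target = 1 if start <= i < end else 0
        # Repeat the token-level feature vector and mask value once per word piece.
        for piece in toy_word_pieces(token):
            T.append(piece)
            E.append(features)
            M.append(in_target)
    # Second segment: the target mention, enclosed by separator tokens.
    T.append("[SEP]")
    for token in sentence_tokens[start:end]:
        T.extend(toy_word_pieces(token))
    T.append("[SEP]")
    if len(T) > n:
        raise ValueError("sequence exceeds the maximum sequence length n")
    # Pad all three inputs to length n (zeros outside the sentence segment).
    T += ["[PAD]"] * (n - len(T))
    E += [zero] * (n - len(E))
    M += [0] * (n - len(M))
    return T, E, M
```

For a sentence such as "Senator Smith criticized the plan" with the target "Senator Smith", the mask is 1 exactly at the word pieces of the first two tokens, and the feature vector of "criticized" is repeated once for each of its word pieces.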

Embedding layer
We feed $T$ into the LM to yield a contextualized word embedding of shape $\mathbb{R}^{n \times h}$, where $h$ is the number of hidden states in the language model. We feed $E$ into a randomly initialized matrix $W_E \in \mathbb{R}^{\hat{d} \times h}$ to yield an EKS embedding. We repeat $M$ to be of shape $\mathbb{R}^{n \times h}$. By creating all embeddings in the same shape, we facilitate a balanced influence of each input on the model's downstream components. We stack all embeddings to form a matrix $T_{EM} \in \mathbb{R}^{n \times 3h}$.

Interaction layer
We allow the three embeddings to interact using a single-layer BiGRU (Hosseinia et al., 2020), which yields hidden states $H = \mathrm{BiGRU}(T_{EM}) \in \mathbb{R}^{n \times 6h}$. RNNs, such as LSTMs and GRUs, are commonly used to learn a higher-level representation of a word embedding, especially in state-of-the-art TSC prior to BERT-based models but also recently (Liu et al., 2015; Li et al., 2019; Hosseinia et al., 2020). We choose a BiGRU over an LSTM because of the smaller number of parameters in BiGRUs, which may in some cases result in better performance (Chung et al., 2014; Hosseinia et al., 2020; Gruber and Jockisch, 2020).
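As a reading aid, the stated dimensions can be traced through the layers. The sketch below assumes that each GRU direction has a hidden size of $3h$, which is what $H \in \mathbb{R}^{n \times 6h}$ implies for a bidirectional layer, and that the decoder outputs one score per polarity class; the symbols $\overrightarrow{H}$ and $\overleftarrow{H}$ for the two directions are notation introduced here.

```latex
\begin{align*}
T_{EM} &\in \mathbb{R}^{n \times 3h}
  && \text{(LM, EKS, and mask embeddings, each } n \times h\text{)} \\
\overrightarrow{H},\ \overleftarrow{H} &\in \mathbb{R}^{n \times 3h}
  && \text{(forward and backward GRU states, assumed size } 3h\text{)} \\
H = [\overrightarrow{H};\ \overleftarrow{H}] &\in \mathbb{R}^{n \times 6h} \\
P = [\operatorname{mean}(H);\ \max(H);\ h_{n-1}] &\in \mathbb{R}^{18h}
  && \text{(three pooled vectors of size } 6h\text{)} \\
z = FC(P) &\in \mathbb{R}^{3}
  && \text{(one score per polarity class)}
\end{align*}
```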

Pooling and decoding
We employ three common pooling techniques to turn the interacted, sequenced representation $H$ into a single vector (Hosseinia et al., 2020). We calculate element-wise (1) mean and (2) maximum over all hidden states $H$ and retrieve the (3) last hidden state $h_{n-1}$. Then, we stack the three vectors to $P$, feed $P$ into a fully connected layer $FC$ so that $z = FC(P)$, and calculate $y = \sigma(z)$.
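The pooling and decoding step can be illustrated without any deep learning framework. This is a minimal sketch, not the model's code: hidden states are plain lists of floats, the weights `W` and bias `b` of the fully connected layer are passed in explicitly, and $\sigma$ is assumed to be a softmax over the three polarity classes.

```python
import math


def pool(H):
    """Stack element-wise mean, element-wise maximum, and the last hidden state."""
    n = len(H)
    columns = list(zip(*H))
    mean_vec = [sum(col) / n for col in columns]
    max_vec = [max(col) for col in columns]
    return mean_vec + max_vec + list(H[-1])


def softmax(z):
    """Numerically stable softmax (assumed form of sigma)."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def decode(P, W, b):
    """Fully connected layer z = W P + b followed by softmax."""
    z = [sum(w * p for w, p in zip(row, P)) + bias for row, bias in zip(W, b)]
    return softmax(z)


H = [[1.0, 4.0], [3.0, 2.0]]           # n = 2 hidden states of width 2
P = pool(H)                            # [2.0, 3.0, 3.0, 4.0, 3.0, 2.0]
y = decode(P, [[0.0] * 6] * 3, [0.0] * 3)  # untrained weights give a uniform distribution
```

Because $P$ concatenates three vectors of the hidden-state width, the fully connected layer takes an input three times as wide as a single hidden state.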

Experimental data
In addition to NewsMTSC, we use the three established TSC sets: Twitter, Laptop, and Restaurant.

Evaluation metrics
We use metrics established in the TSC literature: macro F1 on all classes (F1_m) and on only the positive and negative classes (F1_pn), accuracy ($a$), and average recall ($r_a$). If not otherwise noted, performances are reported for our primary metric, F1_m.
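For reference, the four metrics can be computed as in the following minimal sketch. It assumes the definitions that are common in the TSC shared-task literature: F1_m averages the per-class F1 over all three classes, F1_pn averages only the F1 of the positive and negative classes, and average recall is the unweighted mean of the per-class recalls. Function and key names are chosen for this illustration.

```python
LABELS = ("positive", "neutral", "negative")


def per_class_f1_and_recall(y_true, y_pred, label):
    """Return (F1, recall) of one class, treating all other classes as negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return f1, recall


def evaluate(y_true, y_pred):
    """Compute F1_m, F1_pn, accuracy, and average recall."""
    scores = {c: per_class_f1_and_recall(y_true, y_pred, c) for c in LABELS}
    return {
        "f1_m": sum(scores[c][0] for c in LABELS) / len(LABELS),
        "f1_pn": (scores["positive"][0] + scores["negative"][0]) / 2,
        "accuracy": sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true),
        "avg_recall": sum(scores[c][1] for c in LABELS) / len(LABELS),
    }
```

Classes without any true or predicted examples contribute a score of 0 in this sketch; other conventions for such empty classes exist.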

Baselines
We compare our model with TSC methods that yield state-of-the-art results on at least one of the established datasets:
SPC-BERT (Devlin et al., 2019): input is identical to our text input. FC and softmax are calculated on the CLS token.
TD-BERT (Gao et al., 2019): masks hidden states depending on whether they belong to the target mention.
LCF-BERT (Zeng et al., 2019): similar to TD but additionally weights hidden states depending on their token-based distance to the target mention. We use the improved implementation (Yang, 2020) and enable the dual-LM option, which yields slightly better performance than using only one LM instance (Zeng et al., 2019).
We also planned to test LCFS-BERT (Phan and Ogunbona, 2020), but due to technical issues we were not able to reproduce the authors' results and thus exclude LCFS from our experiments.

Implementation details
To find the best parameter configuration for each model, we perform an exhaustive grid search. Any number we report is the mean of five experiments that we run per configuration. We randomly split each test set into a dev set (30%) and the actual test set (70%). We test the base version of three LMs: BERT, RoBERTa, and XLNET. For all methods, we test parameters suggested by their respective authors. We test all 15 combinations of the following 4 EKS: (1) SENT (Hu and Liu, 2004): a sentiment dictionary (number of non-mutually exclusive dimensions: 2, domain: customer reviews). [...]
Table 2 reports the performances of the models using different LMs and evaluated on both test sets. We find that the best performance is achieved by our model, GRU-TSC (F1_m = 83.1 on test-rw compared to 81.8 by prior state-of-the-art). For all models, performances are (strongly) improved when using RoBERTa, which is pre-trained on news texts, or XLNET, likely because of its large pre-training corpus. Because of limited space, XLNET is not reported in Table 2, but results are generally similar to RoBERTa except for the TD model, where XLNET degrades performance by 5-9pp. Looking at BERT, we find no significant improvement of GRU-TSC over prior state-of-the-art. Even if we domain-adapt BERT (Rietzler et al., 2019) for 3 epochs on a random sample of 10M English sentences (Gebhard and Hamborg, 2020), BERT's performance (F1_m = 81.8) is lower than RoBERTa's.
We notice a performance drop for all models when comparing test-rw and test-mt. It seems that RoBERTa is better able to resolve in-sentence relations between multiple targets (performance degeneration of only up to −0.6pp) than BERT (−2.9pp). We suggest using RoBERTa for TSC on news, since fine-tuning it is faster than fine-tuning XLNET, and RoBERTa achieves similar or better performance than other LMs.

On previous TSC datasets (Table 3), LCF is the top performing model. When comparing the performances across all four datasets, the importance of the consolidation becomes apparent, e.g., performance is lowest on Twitter, which employs a simplistic consolidation (Section 3.4). The performance differences of individual models when contrasting their use on prior datasets and NewsMTSC highlight the need [...]. LCF performs consistently best on prior datasets but worse than GRU-TSC on NewsMTSC. One reason might be that LCF's weighting approach relies on a static distance parameter, which seems to degrade performance when used on longer texts as in NewsMTSC (Section 3.6). When increasing LCF's window width SRD, we notice a slight improvement of 1pp (SRD=5) but degradation for larger SRD.

Ablation study
We perform an ablation study to test the impact of four key factors: target mask, EKS, coreferential mentions, and fine-tuning the LM's parameters. We test all LMs and, if not noted otherwise, report results for RoBERTa since it generally performs best (Section 5.5). We report results for test-mt (performance influence is similar on either test set, with performances generally being ≈ 3-5pp higher on test-rw). Overall, we find that our changes to the initial design (Hosseinia et al., 2020) contribute to an improvement of approximately 1.9pp. The most influential changes are the selected EKS and in part the use of coreferential mentions. Using the target mask input channel without coreferences and LM fine-tuning yield insignificant improvements of up to 0.3 each. We do not test the VADER-based sentence classification proposed by Hosseinia et al. (2020) since we expect no improvement by using it for various reasons. For example, VADER uses a dictionary created for a domain other than news and classifies the sentence's overall sentiment and thus is target-independent.
Table 4 details the results of exemplary EKS, showing that the best combination (SENT, MPQA, and NRC) yields an improvement of 2.6pp compared to not using an EKS (zeros). The best single EKS (LIWC or SENT) each yield an improvement of 2.4pp. The two EKS "no EKS" and "zeros" represent a model lacking the EKS input channel and an EKS that only yields 0's, respectively.
The use of coreferences has a mixed influence on performance (Table 5). While using coreferences has no or even a negative effect in our model for large LMs (RoBERTa and XLNET), it can be beneficial for smaller LMs (BERT) or batch sizes (8). When using the modes "ignore," "add coref. to mask," and "add coref. as example," we ignore coreferences, add them to the target mask, and create an additional example for each, respectively. Mode "none" represents a model that lacks the target mask input channel.

Error Analysis
To understand the limitations of GRU-TSC, we carry out a manual error analysis by investigating a random sample of 50 incorrectly predicted examples for each of the test sets. For test-rw, we find the following potential causes (not mutually exclusive): edge cases with very weak, indirect, or in part subjective sentiment (22%) or where both the predicted and true sentiment can actually be considered correct (10%); sentiment of given target confused with different target (14%). Further, sentence's sentiment is unclear due to missing context (10%) and consolidated answer in NewsMTSC is wrong (10%). In 16% we find no apparent reason. For test-mt, potential causes occur approximately similarly often except that targets are confused more often (20%).

Future Work
We identify three main areas for future work. The first area is related to the dataset. Instead of consolidating multiple annotators' answers during the dataset creation, we propose testing the integration of the label selection into the model (Raykar et al., 2010). Integrating the label selection into the machine learning part could improve the classification performance. It could also allow us to include more sentences in the dataset, especially the edge cases that our restrictive consolidation currently discards. Second, to improve the model design, we propose designing the model specifically for sentences with multiple targets, for example, by classifying multiple targets in a sentence simultaneously. While we tested various such designs early on, we did not report them in the paper due to their comparably poor performances. Further work in this direction should perhaps also focus on devising specialized loss functions that set multiple targets and their polarity into relation. Lastly, one can improve various technical details of GRU-TSC, e.g., by testing other interaction layers, such as LSTMs, or using layer-specific learning rates in the overall model, which can increase performance (Sun et al., 2019b).

Conclusion
We present NewsMTSC, a dataset for target-dependent sentiment classification (TSC) on news articles consisting of 11.3k manually annotated examples. Compared to prior TSC datasets, it differs in key factors: its texts are on average 50% longer, sentiment is only rarely expressed explicitly, and there is a separate test set for multi-target sentences. As a consequence, state-of-the-art TSC models yield non-optimal performances. We propose GRU-TSC, which uses a bidirectional GRU on top of a language model (LM) and other embeddings, instead of masking or weighting mechanisms as employed by prior state-of-the-art. We find that GRU-TSC achieves superior performances on NewsMTSC and is competitive on prior TSC datasets. RoBERTa yields the best results compared to using BERT, because RoBERTa is pre-trained on news and we find it can better resolve in-sentence relations of multiple targets.

A Appendices
Imagine you are a journalist asked to write a news article about a given topic. Depending on your own attitude towards the topic or the people involved in the news story, you may portray some people more positively and others more negatively. For example, by using rather positive or negative words, e.g., 'freedom fighters' vs. 'terrorists' or 'cross the border' vs. 'invade,' or by describing positive or negative aspects, e.g., that a person did something negative.
In the sentence below, what do you think is the attitude of the sentence's author towards the underlined subject? Consider the attitude only towards the underlined subject, not the event itself or other people. FYI: further assignments may show the same sentence but with a different underlined subject than the subject shown below.

Subject: the president
The comments come after McConnell expressed his frustrations with the president for having "excessive expectations" for his agenda.
The attitude of the sentence's author towards the underlined subject is…