“We will Reduce Taxes” - Identifying Election Pledges with Language Models

In an election campaign, political parties pledge to implement various projects, should they be elected. But do they follow through? To track election pledges from parties' election manifestos, we need to distinguish between pledges and general statements. In this paper, we use election manifestos of Swedish and Indian political parties to learn neural models that distinguish actual pledges from generic political positions. Since pledges might vary by election year and party, we implement a Multi-Task Learning (MTL) setup, predicting the election year and the manifesto's party as auxiliary tasks. Pledges can also span several sentences, so we use hierarchical models that incorporate contextual information. Lastly, we evaluate the models in a Zero-Shot Learning (ZSL) framework across countries and languages. Our results indicate that year and party have predictive power even in ZSL, while context introduces some noise. We finally discuss the linguistic features of pledges.


Introduction
Before any election, political parties publish manifestos that summarize their pledges to the voters. The exact nature of those pledges varies. A single-issue party might campaign on the same promise year after year, but most parties will adapt to the shifting trends and needs of the electorate. However, there is a difference between pledging and fulfilling. Political scientists are highly interested in whether pledges were fulfilled, a question that is gaining growing interest in the broader scientific community (Naurin et al., 2019). Several approaches exist, but they are primarily confined to manual analysis of individual countries or elections. They indicate that governmental parties mostly fulfill their election pledges (Naurin et al., 2019; Thomson et al., 2017). However, there are too many elections worldwide to analyze all campaign pledges manually. We need automated ways to identify pledges and systematically hold governments accountable.
Checking whether a pledge was fulfilled still requires manual work by trained political scientists, but the first step, identifying pledges, is a problem well suited to NLP, for at least two reasons. First, NLP can automate pledge identification to distinguish pledges from irrelevant content. This allows the study of pledge fulfillment at scale. An average election manifesto in our corpus has 418 sentences, but only 118 of them (27.5%) contain a pledge. The rest is filler material. It takes several days to train an annotator, who then spends around 6-8 hours on a single manifesto identifying those 27.5% of pledges. Cutting down on this laborious first part frees up time to focus on the more complex issue of determining whether those pledges were fulfilled. Second, NLP methods can help us understand the linguistic style and communication strategies associated with election pledges. This interpretation is necessary for the social sciences to understand how political messages are structured and conveyed.
This paper presents neural pledge identification models to address these two points. Our work is part of a larger interdisciplinary project, "Mixed methods for analyzing political parties' promises to voters during election campaigns." We use a data set of almost 13k sentences from election manifestos covering the last 25 years and 11 parties from Sweden and India. Each sentence is annotated as including a pledge ("pledge") or not ("non-pledge"). We implement several deep neural models based on BERT (Devlin et al., 2018). We use its Swedish, English (for the Indian data), and multi-lingual (mBERT) versions. We feed BERT's output into customized attention mechanisms to detect specific pledge-related patterns. We compare our neural models with a Logistic Regression baseline that the deep models easily outperform.

Corpus   Text                                                                       Class
Swedish  Vi i Centerpartiet är stolta över vad vi uppnått i regeringen.             non-pledge
         (In the Center Party we are proud of what we achieved in the government.)
         Barnkonventionen ska göras till svensk lag.                                pledge
         (The Convention on the Rights of the Child shall be made Swedish law.)
Indian   They have neither competence nor commitment.                               non-pledge
         Five new IITs will be established before 2005.                             pledge

Table 1: Examples of pledges and non-pledges from Swedish and Indian manifestos.
However, pledge identification cannot rely on signal words or expressions alone. References to the environment might be core pledges for one party, but just commentary for another. Specific issues will be pledge-worthy in one year (think pandemic responses), but not in others. To measure the effects of these confounds (i.e., election year and party), we adopt a Multi-Task Learning (MTL) framework. The main task is to classify sentences as pledge or non-pledge, with auxiliary tasks predicting the year, the party, or both. We identify the conditions under which MTL models with year and party improve the models' performance, indicating when these two factors are useful confounds. There seem to be stark differences between countries, though: even a multi-lingual approach (which has access to more training data) does not improve on language-specific approaches.
We are also interested in zero-shot learning, i.e., training models on data from one country and testing them on another. This would allow us to work on pledges from new countries directly, without any previous manual annotation. It turns out that the models perform reasonably well despite the challenging conditions. However, the differences between test countries indicate that pledges are not as universal as we might think.
Surprisingly, we also find that incorporating a context of any sort (that is, one or more sentences preceding the target text) does not help but hurts performance. Presumably, this happens because pledges are rare, and context introduces more noise than signal.
We are also interested in the nature of pledges, i.e., their linguistic features and patterns. To gain those insights, we extract the Information Gain value (Forman, 2003) of 1-4-grams and visualize the model's decisions via the Sampling and Occlusion (SOC) algorithm (Jin et al., 2019). SOC provides a hierarchical view of BERT's most informative linguistic patterns in the classification.
Our data and our models are available at https:

Contributions

The contributions of this paper are: 1) We provide a new, multi-lingual corpus of election manifestos from Swedish and Indian parties, annotated at the sentence level as pledges or non-pledges; 2) We are the first to apply neural models to the task of election pledge classification, accounting for confounds; 3) We provide insights into the linguistic features of election pledges and the models' interpretation.

Data
We collect and annotate a corpus of election manifestos from two countries: Sweden and India. The texts are in Swedish and English, respectively. We provide some examples in Table 1. The Swedish data contain 5098 instances from nine parties and six elections, ranging from 1994 to 2014. The pledge rate per manifesto is 32.09%. These texts are also part of the corpus of the Manifesto Project (MP) (Volkens et al., 2012; Merz et al., 2016, Section 7).
For all manifestos, we adopted the annotation scheme of the Comparative Party Pledges Project (CPPP) of Naurin et al. (2019) and Thomson et al. (2017). This is a large international political science project whose annotation scheme is the most appropriate for identifying campaign promises, which is the focus of our experimental designs. In particular, following the CPPP scheme, we further distinguish between broad and narrow pledges, i.e., between generic and detailed commitments to undertake specific actions. Based on this distinction, we ran additional experiments included in the Appendix. We have 23.32% narrow and 8.77% broad pledges in the Swedish data. The Indian texts contain 7729 sentences from two parties and five election cycles from 1999 to 2019. Here, the annotators only distinguished sentences including a narrow pledge from non-pledge sentences, with a pledge rate of 24.52%.
In total, we have 12827 sentences and 3531 pledges (27.53%). Since we only have binary labels for the Indian data, we combine broad and narrow pledges in the Swedish corpus. Table 2 shows the corpus statistics.

Annotation process
In the CPPP scheme, an election pledge is a statement that can be tested for fulfillment. Annotators must therefore assess whether a statement refers to an action or outcome that is verifiable, in the sense that we can objectively determine whether it was achieved. This definition also requires annotators to have contextual knowledge of the country and specific information about the political situation in each election campaign.
We therefore trained Swedish and Indian annotators to label the Swedish and Indian manifestos for our study, respectively. Four people were involved in the annotation of the manifestos. Two domain experts, one for each data set, conducted the training. The two annotators interacted with the two respective domain experts throughout the annotation process to handle complicated cases.
To test agreement on the Indian data, three trained annotators labeled 100 sentences. Krippendorff's α and Fleiss's κ are both 0.65. On the Swedish data, two trained annotators likewise labeled 100 sentences, with Krippendorff's α and Cohen's κ both at 0.61.
In both cases, the agreement can be considered 'substantial' (Landis and Koch, 1977). Our results are consistent with those reported by Naurin et al. (2019).
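As a concrete reference, chance-corrected agreement for two annotators can be computed as sketched below. `cohens_kappa` is our own minimal illustration, not the project's actual annotation tooling.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    po = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label marginals.
    ca, cb = Counter(labels_a), Counter(labels_b)
    pe = sum((ca[c] / n) * (cb[c] / n) for c in set(labels_a) | set(labels_b))
    return (po - pe) / (1 - pe)
```

A κ of 0.61-0.65, as reported above, falls in the 0.61-0.80 band that Landis and Koch (1977) label 'substantial'.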

Methods
We have three experimental conditions: 1) Swedish texts alone, 2) Indian texts alone, and 3) Swedish and Indian texts together (multilingual condition).
In conditions 1) and 2) we evaluate the models on test sets from the same country (standard test split), or from the respective other country, i.e., a Zero-Shot Learning (ZSL) condition. This is not possible in the third condition, where the models are trained on data from both countries. We use this last condition to see whether performance improves with access to more training data and whether pledges are comparable across countries.
As baselines, we train two Logistic Regression (LR) models with regularization parameter C = 1, based on TF-IDF-weighted Bag-Of-Words (BOW) features over 1- to 3-grams, with a document frequency range between 0.001 and 0.75. We feed the first model with plain n-gram tokens.
However, we also hypothesized that pledges could be expressed through formal grammatical patterns, such as specific Part-of-Speech (PoS) sequences or verb forms (future tense, modal verbs). Therefore, we trained a second LR model fed with tokens incorporating PoS information. Tables 3 and 4 show the performance. We evaluate our models with standard metrics: precision, recall, and F1-measure averaged over the two classes.
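A minimal sketch of the first LR baseline, using the hyperparameters stated above (C = 1, 1- to 3-grams, document frequency cut-offs of 0.001 and 0.75); the toy sentences and labels are invented for illustration, and the paper's exact preprocessing may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for manifesto sentences; 1 = pledge, 0 = non-pledge.
texts = [
    "Five new institutes will be established before the next election.",
    "We will reduce taxes for all families.",
    "Child care shall be made a legal right.",
    "They have neither competence nor commitment.",
    "We are proud of what we achieved in government.",
    "Our party has always stood for fairness.",
]
labels = [1, 1, 1, 0, 0, 0]

# TF-IDF-weighted BOW over 1- to 3-grams, then LR with C = 1.
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3), min_df=0.001, max_df=0.75),
    LogisticRegression(C=1),
)
baseline.fit(texts, labels)
preds = baseline.predict(texts)
```

The PoS-augmented variant would only change the tokenization step, feeding tokens such as `will_MD` instead of plain words.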

Neural models
For the first two experimental conditions, we consider separate, mono-lingual Swedish and English BERT models and the multi-lingual (mBERT) version. In the third experimental condition, where we merge the two data sets, we can only use the multi-lingual BERT.

Single-Task Learning. Our base models are binary classifiers, i.e., single-task (STL) models. Standard BERT classifiers perform the task with a fully connected layer on top of BERT's output. In contrast, we reframe BERT's [CLS] token representation as a single-row matrix and feed it into a single-layer, single-head Transformer (Vaswani et al., 2017). Our pilot studies found that this specialized structure allows us to detect specific pledge patterns from the BERT representation more effectively than a standard dense output layer alone. Finally, the Transformer is connected to a dense output layer for the prediction.
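The STL head described above can be sketched as follows. `PledgeHead` is a hypothetical name, and BERT itself is omitted to keep the example self-contained: the input is assumed to be a precomputed [CLS] vector.

```python
import torch
import torch.nn as nn

class PledgeHead(nn.Module):
    """Sketch of the STL architecture: BERT's [CLS] vector is reframed as a
    one-row sequence, passed through a single-layer, single-head Transformer,
    and then through a dense output layer."""
    def __init__(self, hidden=768, n_classes=2):
        super().__init__()
        self.attn = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=1, dim_feedforward=hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, cls):                 # cls: (batch, hidden)
        x = self.attn(cls.unsqueeze(1))     # (batch, 1, hidden): one-row matrix
        return self.out(x.squeeze(1))       # (batch, n_classes) logits
```

In the full model, `cls` would come from a fine-tuned BERT encoder, so gradients flow through both the head and BERT.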
Multi-Task Learning. We implement three different MTL versions, differing in their auxiliary task combinations. We have two potential auxiliary tasks: predicting the election year and predicting the party that produced the manifesto. We add a further dense output layer to the base model to perform the MTL tasks: 1) predicting the election year, 2) the party, or 3) both. We use the mean of the task losses for error backpropagation in the MTL networks. Since all predictions are probability distributions, the loss magnitudes are bounded and no normalization is needed. Figure 1 (left) shows the scheme of the MTL models.
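The mean-of-losses objective can be sketched as below, assuming one binary main task and two softmax auxiliary tasks; `mtl_loss` and the tensor shapes are illustrative, not the paper's exact code.

```python
import torch
import torch.nn.functional as F

def mtl_loss(pledge_logits, year_logits, party_logits,
             pledge_y, year_y, party_y):
    """Mean of the task losses used for backpropagation: binary (sigmoid)
    cross-entropy for the pledge task, softmax cross-entropy for the year
    and party auxiliary tasks."""
    l_main = F.binary_cross_entropy_with_logits(pledge_logits,
                                                pledge_y.float())
    l_year = F.cross_entropy(year_logits, year_y)
    l_party = F.cross_entropy(party_logits, party_y)
    return (l_main + l_year + l_party) / 3
```

With cross-entropy losses over probability distributions, the three terms live on comparable scales, which is why a plain mean suffices without per-task weighting.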
Contextual models. We also build models that consider the sentence preceding the target text as context, allowing us to test its impact on classification performance. We incorporate the context sentence in two state-of-the-art ways: through pair-BERT, which accepts two texts as input, and through a hierarchical model. In the first case, the model is structurally equivalent to the base model: only the input representation for BERT changes to include two sentences, separated by the separator token [SEP]. In the second case, we stack the representations of the BERT classification tokens ([CLS]) of both context and target sentences and feed them into a Transformer connected to a dense layer that produces the output. Figure 1 (right) depicts this structure.

Settings and significance tests. To reduce the variability introduced by the models' random initialization and make our results more robust, we run ten repeats for each experimental condition and compute the overall performance. To test the significance of improvements over the base model, we use a bootstrap sampling test on all runs (Søgaard et al., 2014), with 1000 loops and a sample size of 30%.
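The bootstrap sampling test can be sketched roughly as follows. `bootstrap_test` is our own simplified version operating on aligned per-instance scores, with the 1000 loops and 30% sample size stated above; the published test may differ in detail.

```python
import random

def bootstrap_test(scores_a, scores_b, loops=1000, sample_frac=0.3, seed=0):
    """Bootstrap sampling test sketch: resample 30% of the instances with
    replacement `loops` times and count how often system B's total score
    fails to exceed system A's. The returned fraction is a p-value-style
    estimate for 'B is better than A'."""
    rng = random.Random(seed)
    n = len(scores_a)
    k = max(1, int(sample_frac * n))
    worse = 0
    for _ in range(loops):
        idx = [rng.randrange(n) for _ in range(k)]
        if sum(scores_b[i] for i in idx) <= sum(scores_a[i] for i in idx):
            worse += 1
    return worse / loops
```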
For each experiment, we run 10-fold cross-validation. In each fold, we use 80% of the texts as the training set, 10% for development, and 10% for testing. In the ZSL experiments, we use 90% and 10% of one data set for training and development, respectively, and the entire other data set as the test set.
For the main task, the loss function is the binary (sigmoid) cross-entropy; for the auxiliary tasks, it is the (softmax) cross-entropy. We use the Adam optimizer (Kingma and Ba, 2014). We select the models through early stopping, triggered when the development set's loss drops by less than 8% for five consecutive epochs. Our learning rate is 0.002, drop-out probability 0.3, and batch size 512, all manually tuned. The attention mechanisms that analyze BERT's outputs are single-layer, single-head Transformers.
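The early-stopping criterion can be sketched as below. `should_stop` is a hypothetical helper implementing the 8%/five-epoch rule under our reading of it: stop once the relative per-epoch improvement has stayed below 8% for five consecutive epochs.

```python
def should_stop(dev_losses, patience=5, min_rel_drop=0.08):
    """Early-stopping sketch: return True when the development loss has
    improved by less than `min_rel_drop` (8%) relative to the previous
    epoch for `patience` (5) consecutive epochs."""
    if len(dev_losses) <= patience:
        return False
    recent = dev_losses[-(patience + 1):]
    return all(prev - cur < min_rel_drop * prev
               for prev, cur in zip(recent, recent[1:]))
```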

Experiments
We report results on all models for each of the three experimental conditions: 1) Swedish corpus encoded with Swedish and multi-lingual BERT (Table 5); 2) Indian corpus encoded with English and multi-lingual BERT (Table 6); and 3) the joint Swedish and Indian data, encoded with multi-lingual BERT (Table 7).
For each of these conditions, we train a baseline Logistic Regression model (Section 3) and an STL base model (Section 3.1), and compare them with the MTL and contextual models. Since all the neural models outperform the Logistic Regression baselines, we report significance levels for the improvement over the STL models.

Results
We see a substantial performance difference between the two BERT encodings (Swedish and mBERT) on the Swedish data. The Swedish version outperforms the multi-lingual one and achieves the best performance of all experiments (Table 5).
We do not see the same performance difference in the Indian data, where English and multi-lingual BERT produce similar outcomes, with the multi-lingual version even slightly better. Results are generally lower than those for the Swedish data (Table 6).
To interpret this performance gap, we need to consider the differences between the two corpora. As shown in Table 2, the Swedish and Indian data sets differ remarkably in terms of the number of parties and manifestos. Within each manifesto, the two data sets also contain a remarkably different number of sentences, pledges, and words in each sentence. In particular, the Indian data set contains a lower pledge rate than the Swedish data. This reduced amount of training examples prevents a direct comparison between the models trained on the two corpora.
As expected, the results of the multi-lingual model trained on the joint data set lie between the respective multi-lingual models on the two data sets separately. So while the Swedish BERT is more effective than the multi-lingual one on Swedish texts, the amount of data in the multi-lingual language model presumably counteracts the lack of annotated data in the Indian data set.
MTL vs STL. The MTL models are effective in several cases. First, they help in the ZSL conditions. This suggests that training the models to contextualize the notion of a pledge by party and year reduces overfitting. Also, when effective, MTL models improve precision. This is an expected effect, as the models learn to detect pledges as well as historical periods and political areas. This is an interesting feature for ZSL, where confidence in identified positive cases is more valuable than good recall. Although models that maximize recall would make manual pledge identification easier, for downstream pledge fulfillment verification it is preferable to start from a smaller set of texts that are likely to be true pledges.
Furthermore, in ZSL, by definition, the years and parties of the target country differ from those of the training country. Therefore, the auxiliary predictions for the training country are not relevant to the target country. This is why we frame the problem as multi-task rather than multi-input: in a multi-input setup, we could not have fed the models with test data from unseen countries/election campaigns. Nevertheless, models trained to distinguish between different contexts for years and parties can effectively transfer this knowledge to entirely different test data, improving the predictions' precision. This suggests that some generalization is possible, even in the face of different dependent variables.
We also tested the MTL models in the case of a reduced amount of data. In particular, we trained models considering the election manifestos from 2000 only. We found that the MTL contributes more strongly under those conditions. The results of these experiments are included in the Appendix.
Does Context Help? In a word, no. Although this is a disappointing outcome, we find it important to report, as it goes very much against both intuition and prior research. Bilbao-Jayo and Almeida (2018), for example, found that contextual information is helpful when classifying political topics (see Section 7). Election pledges seem to be more self-contained statements, relying on linguistic formulas that make them recognizable (and probably memorizable) regardless of their linguistic context (Section 6).
We explored two different models for incorporating the sentence preceding the target text. In both cases, though, we consistently find that the previous sentence's contextual information adds more noise than helpful signal for the prediction. The decrease ranges from moderate to drastic (up to 10 points in F1), particularly for the pair-BERT models where, by design, target and context representations are not trainable. The hierarchical models' performance is more stable, but the context still does not improve performance.

The language of pledges
To better understand the pledges' linguistic features, we follow two strategies: 1) computing the Information Gain (IG) of word n-grams, and 2) using the Sampling and Occlusion (SOC) algorithm (Jin et al., 2019).
Information Gain measures the entropy of (sequences of) terms between the different classes. The more skewed a set of terms is towards one label class at the other's expense, the higher the IG value. Tables 8 and 9 show the trigrams with the highest IG values (and relative frequencies), divided according to the class of which they are indicative, i.e., where they are more frequently found. While we computed the IG score for 1- to 5-grams, we show only trigrams here for illustration. They represent the best trade-off between meaningful and frequent chunks of text. For the complete translation of the Swedish texts, see the Appendix.
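For a binary term feature and binary class labels, IG can be computed from a 2x2 contingency table, as in the generic sketch below; this is our own illustration, not the paper's exact implementation.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a count distribution."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(n11, n10, n01, n00):
    """IG of a binary term feature for a binary class, from a 2x2 table:
    n11 = sentences with term & pledge, n10 = term & non-pledge,
    n01 = no term & pledge, n00 = no term & non-pledge."""
    n = n11 + n10 + n01 + n00
    h_class = entropy([n11 + n01, n10 + n00])     # H(class)
    n_t, n_nt = n11 + n10, n01 + n00
    h_cond = ((n_t / n) * entropy([n11, n10])     # H(class | term present)
              + (n_nt / n) * entropy([n01, n00])) # H(class | term absent)
    return h_class - h_cond
```

A term that appears only in pledges gets the maximal IG of 1 bit; a term split evenly between classes gets 0.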
These n-grams suggest that a formulaic language characterizes election pledges: stereotypical expressions mark specific sentences as pledges. For example, in the Swedish data set, the bullet point is a clear marker introducing statements that contain some form of commitment. We also find expressions indicating volition ("Vi vill också" - "We also want..."), consequences ("så att det" - "So that..."), the future ("will be set", "will be launched"), and determined temporal horizons ("in five years", "over the next"). In contrast, in both the Indian and the Swedish data, references to political entities such as parties ("Alliansen"), congresses ("National Congress"), and even countries ("Sverige", "India") are associated with non-pledge texts: they more probably refer to broad political positions or to claims about the past ("has always been", "ska vara ett" - "should be one").
Interestingly, the phrase "skarpa förslag" does not signal pledges, even though it means "specific policy proposals" (which are essentially the same as pledges). This indicates that the phrase merely introduces pledges or provides strong language for untestable policy statements (such as "we promise safety to all children" or "we will put forward strict legislation to make our country safe again").
Given the relatively limited frequency of the selected n-grams, we did not measure the IG stratification by party and/or election year. However, given the relative MTL models' success, we hypothesize that, with more data, it will be possible to identify specific trends for political areas and historical moments.
Aware that the patterns detected by the neural models are not necessarily interpretable in terms of human common sense, we also wanted to highlight the words that the models find most influential for their output. These patterns can feed back into the interpretation of pledge structures and mechanisms by social scientists. To this end, we use the Sampling and Occlusion (SOC) algorithm (Jin et al., 2019), a post-hoc explanation algorithm that measures the importance of specific words in a sentence by considering the prediction difference after replacing each word with a MASK token. Since the outcomes depend on the context words, but Jin et al. (2019) are interested in the relevance of single words, they do not use the whole context but sample words from it. In this way, they reduce the weight of the context, emphasizing that of the word itself.

Figures 2 and 3 show four examples of correctly classified sentences: two pledges and two non-pledges, in Swedish and English respectively (the same as shown in Table 1). The model interprets the red words as indicative of pledges, the blue ones of non-pledges. They cannot be interpreted as representative of the models' overall functioning. Even so, they show how generic words such as "stolta" ("proud") are indicative of non-pledges, while expressions indicating commitment ("ska göras till" - "to be made to") and concrete topics ("Barnkonventionen" - "Convention on Children's Rights") are signals for pledges.
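The core occlusion step (SOC without the context-sampling refinement) can be sketched as follows. `occlusion_importance` and the toy scoring function are illustrative only; in the paper, the scorer is the trained BERT classifier.

```python
def occlusion_importance(tokens, score_fn, mask="[MASK]"):
    """Simplified occlusion sketch: the importance of each token is the
    drop in the pledge score when that token is replaced by a MASK token.
    `score_fn` maps a token list to a pledge probability."""
    base = score_fn(tokens)
    importances = []
    for i in range(len(tokens)):
        occluded = tokens[:i] + [mask] + tokens[i + 1:]
        importances.append(base - score_fn(occluded))
    return importances

# Toy scorer (an assumption for illustration): "will" signals a pledge.
def toy_score(tokens):
    return tokens.count("will") / len(tokens)

imps = occlusion_importance(["we", "will", "build"], toy_score)
```

SOC proper additionally samples replacement context words so that a token's score reflects the token itself rather than its particular neighbors.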

Related Work
In political science, the elections that we consider have been extensively studied by Håkansson and Naurin (2016), Lindvall et al. (2020), and Adhikari et al. (2020). Moreover, applying NLP methods to the analysis of political parties' statements has recently developed into an active field of research, with various groups investing in creating dedicated corpora and annotating them for specific purposes.
The Manifesto Project (MP) (Volkens et al., 2012; Merz et al., 2016) collects electoral programs from more than 50 countries for democratic elections since 1945, making it a notable initiative within the field. It provides data on different manifesto aspects in several countries and over time.
Recently, the Comparative Party Pledges Project (CPPP) of Naurin et al. (2019) has added detailed qualitative coding of what exactly pledges are made of (Naurin and Thomson, 2020).

Subramanian et al. (2018) study the MP data, addressing the identification of fine- vs. coarse-grained positions taken by political parties. Despite the different classification task, similarly to our study, they adopt hierarchical models that encode the texts' structure, finding that contextual information improves the models' performance. However, they train bi-LSTM networks from scratch, while we rely on pre-trained BERT language models. Bilbao-Jayo and Almeida (2018) also work on the MP corpus, applying multi-input Convolutional Neural Networks (CNNs) that take into account the statements' context, analogously to our study. They seek to classify the texts according to seven topics corresponding to general areas of interest.
We partially use the same data as the MP, as we study Swedish manifestos included in that data set. However, we are specifically interested in the identification of election pledges. This is similar to the task studied by Subramanian et al. (2019a). They focus on eleven Australian federal election cycles and distinguish rhetorical (broad) from detailed (narrow) pledges. The annotation of the Swedish texts considers this distinction, while the annotated Indian texts of our corpus do not (Section 2). Subramanian et al. (2019a) use a bidirectional Gated Recurrent Unit (biGRU) to carry out the prediction over ordinal classes.
From a methodological point of view, our approach is related to that of Abercrombie et al. (2019), which also uses BERT. They work on motions tabled in the UK Parliament and find that BERT effectively detects specific categories of proposals in the politicians' speeches.
Concerning the MTL methods, our study is analogous to that of Subramanian et al. (2019b). They consider texts from the 2016 Australian election and propose a new annotation scheme for different speech acts. They perform the classification task using biGRU networks with ELMo embeddings (Peters et al., 2018), relying on an MTL framework in which the auxiliary task is party prediction: this is also one of our experimental conditions.

Conclusion
We propose deep neural models that combine pretrained language models and trainable attention mechanisms to identify election pledges in party manifestos. We find that these models outperform a non-neural baseline. Even in zero-shot cross-lingual conditions (with some contribution by the MTL methods), the performance of the multilingual models indicates that we could identify pledges in low-resource languages.
Finally, we gained some insight into election pledges' linguistic profile. They are self-contained statements, independent of the context in which they appear. They are likely to be characterized by formulaic expressions that express commitment, intentions, and temporal terms concerning concrete topics. These results stem from close interdisciplinary cooperation between political scientists and NLP researchers.
Pledge identification is the first step for future downstream NLP tasks within the theoretical framework of political science, which is typically interested in societal developments and explanations such as pledge fulfillment and power distribution in democracies. Examples include the fine-grained study of topics, biases, and the temporal evolution of election pledges. Our results provide a blueprint for successful future research in that vein.