Multilingual estimation of political-party positioning: From label aggregation to long-input Transformers



Introduction
A widely used tool in computational political science is the so-called 'scaling analysis': a set of methods for representing political platforms as numbers on a certain scale, such as left-right, authoritarian-libertarian, or conservative-progressive (Laver et al., 2003; Slapin and Proksch, 2008; Diermeier et al., 2012; Lauderdale and Clark, 2014; Barberá, 2015). A wide variety of scales have been proposed in the literature, some based on political-theoretic considerations (Jahn, 2011), others more data-driven (Gabel and Huber, 2000; Albright, 2010; Rheault and Cochrane, 2020).
One well-established scoring scheme of this kind is the Standard Right-Left Scale, also known as the RILE score (Budge, 2013; Volkens et al., 2013).
It was developed in the framework of the Manifesto Research on Political Representation project (MARPOR), formerly known as the Comparative Manifestos Project (CMP), which collects, annotates, and makes available a large collection of party platforms from different countries. The RILE score is a deductive, first-principle-based method for describing party positions, geared towards the widest possible applicability across time and countries (Budge, 2013). For this very reason, it is rather conservative and inflexible and has been repeatedly criticised (see, e.g., Flentje et al., 2017). Despite this, it is widely used in computational political science for model validation (Rheault and Cochrane, 2020), as a dependent variable in regression analyses (Greene and O'Brien, 2016), or as a basis for party-stance analysis (Däubler and Benoit, 2022).
A major practical drawback of the RILE score is the fact that it is computed from the labels manually assigned by MARPOR annotators to all statements in party manifestos (see Section 2 for details). This procedure is expensive and time-consuming, which raises the question of whether we can adequately approximate the RILE score using natural-language-processing methods, especially in a multilingual setting. This would make it possible to efficiently analyse political texts that have not yet been covered by the MARPOR project due to timing constraints (e.g. manifestos from upcoming elections), accidental gaps (e.g., Indonesia and the Philippines are not part of the dataset, and coverage of many countries, such as South Africa, is incomplete), or lack of resources (there are very few annotated manifestos from before 2000).
This work is a first step in this direction. Our contributions are the following: 1. Previous work on the computational analysis of party positioning targeted a limited number of texts from a single country or several countries. We scale the analysis up to 41 countries and 27 languages, including comparatively low-resource languages (such as Georgian and Armenian) that have not been tackled before.
2. We contrast the label-aggregation approach, based on a statement-level classifier mimicking the work of a human annotator, with long-input Transformer models that predict the scores directly from raw manifesto texts.
3. In the label-aggregation setting, we further compare the performance of multilingual-modelling-based and machine-translation-based approaches. While the former is more straightforward in the sense that a single base model can be used directly without any preprocessing, MT systems are easier to train for less-resource-rich languages, and only a single-language classifier is needed for predictions.
4. We evaluate the generalisability of the models along two dimensions: local (moving to new countries) and temporal (moving from the past to the future). These correspond to different real-life research scenarios. We show that our methods deal reasonably well with both cases.
The paper is structured as follows: In § 2, we provide more information on the MARPOR annotations and on how the RILE score is computed. The exact problem statement, different operationalization strategies, and the experimental setup are presented in § 3, while the results of the study are given in § 4. Additional discussion is provided in § 5. Section 6 surveys related work. Section 7 concludes the paper and discusses directions for future research.

MARPOR categories and political scales
Categories The annotations of the manifestos created in the framework of the Comparative Manifestos Project follow the project codebook (Volkens et al., 2020). Each statement of a given manifesto is annotated with a category representing a specific policy domain (e.g. Military or Sustainability). These categories can be identified via their names and numbers (e.g., 103, Anti-Imperialism); see Appendix B for the major categories with numbers. A key feature of MARPOR categories is that they are not stance-neutral. Thus, category 201, Freedom and Human Rights, or subtypes thereof, are assigned to 'favourable mentions of importance of personal freedom and civil rights' (Volkens et al., 2020, 12). Some categories form binary oppositions (e.g. Constitutionalism: Positive vs. Constitutionalism: Negative), and some are purely one-sided (e.g. Freedom and Democracy have positive loadings and do not have negative counterparts). As a result, it is possible to derive inferences about the political stances of different parties from category counts alone. This provides a straightforward operationalization of the political-science notion of issue salience, which is commonly used to analyse political positioning (Epstein and Segal, 2000): the number of occurrences of a category correlates with how important the issue is for a party.
In total, there are 143 different categories, with 56 major categories, 32 sub-categories of the major categories, 54 additional categories, and the residual category 0.

Right-Left scale A prominent way of analysing party positioning is the Standard Right-Left Scale, a.k.a. the RILE score (Budge, 2013; Volkens et al., 2013). Originally developed in the framework of the MARPOR project, it has been consistently used in its publications and remains a standard reference scale for party positioning, despite a number of proposals to improve or replace it, using both theory-based and data-driven approaches (cf. Cochrane, 2015; Mölder, 2016; Flentje et al., 2017).

Operationalization
Label aggregation As outlined above, we aim at automatically estimating the positions of political parties on the Left-Right scale. An approach that closely mirrors the traditional MARPOR procedure would be to automatically label the sentences in the manifestos with MARPOR categories and aggregate the labels according to Eq. 1. Unfortunately, classifying the sentences is difficult, as we will show below. Reasons include the large number of labels, their uneven distribution, and the country-specific nature of manifestos.
However, the predicted categories arguably do not have to be perfect: it may be sufficient for high-quality scaling analysis if the mistakes are uncorrelated, so that, for example, the number of sentences mistakenly classified as left-leaning or neutral is close to the number of sentences mistakenly classified as right-leaning or neutral. One way to further raise the signal-to-noise ratio is to predict more high-level labels. To compute the RILE score, we do not require specific categories, but only a three-way classification (R[ight], L[eft], O[ther]), which is much more tractable. This approach can easily be mapped onto other dimensions as long as there is a list of MARPOR categories belonging to both poles of the scale.
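Once categories are collapsed into the three coarse classes, the aggregation step becomes trivial. A minimal sketch follows; the function name is ours, and it assumes that Eq. 1 (not reproduced in this section) reduces to the share of right-coded statements minus the share of left-coded statements when only the R/L/O distinction is kept:

```python
from collections import Counter

def rile_from_coarse_labels(labels):
    """Approximate a manifesto's RILE score from per-statement coarse
    labels 'R', 'L', 'O': the share of right-coded statements minus
    the share of left-coded statements, yielding a value in [-1, 1].
    Illustrative sketch; the full Eq. 1 operates on fine-grained
    category percentages before collapsing them into R/L/O."""
    counts = Counter(labels)
    n = len(labels)
    if n == 0:
        return 0.0
    return (counts["R"] - counts["L"]) / n
```

A manifesto with two right-coded, one left-coded, and one neutral statement would thus score (2 - 1) / 4 = 0.25.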

Direct prediction
As an alternative, we can define a function T → [−1, 1] that directly maps a text to its RILE score and approximate it with a neural regression model. Until recently, such an approach was infeasible due to the restrictions on input length in state-of-the-art embedding models: 512 or 1024 tokens depending on the model, which is not enough to analyse longer texts. However, a new generation of long-input Transformers (LITs), based on lightweight variants of the self-attention mechanism, has increased the input limit to 4096 tokens or more (Tay et al., 2021). This still does not give us a way to compute a score for a whole text, but averages of RILE scores for 4095-token chunks of manifestos correlate nearly perfectly with gold manifesto-level scores (Spearman's r > 0.99), which makes by-chunk estimation a good proxy.
An additional motivation to pursue this avenue is that it not only removes the need to classify the labels of individual statements but also saves researchers the effort of identifying statements in the first place. This is a non-trivial problem as, according to the MARPOR codebook, any sequence of words with a distinct meaning can be considered a statement. E.g., the sentence All well-meaning citizens should strive to maintain the world peace can be construed as a single example of the category Peace, or all well-meaning citizens can be assigned its own label of Civic mindedness. In line with previous work, our aggregation-based approach assumes that statement boundaries are known, but in practice they will have to be predicted together with the labels, or the coding scheme must be simplified, e.g. by assigning a single 'majority' label to each sentence. By virtue of working with raw text spans, LITs do not have to make such compromises.

Problem settings
We consider two settings, corresponding to two different research scenarios. In the LEAVE-ONE-COUNTRY-OUT (X-COUNTRY) setting, we train the model on all the data from n − 1 countries (split into training and development sets) and evaluate it on the held-out country. This corresponds to the situation when manifestos from a country not yet covered by the MARPOR project, such as Indonesia, need to be analysed. This is repeated for all countries.
In the OLD-VS.-NEW (X-TIME) setting, we train the model on all data from before 2019 and evaluate it on the data from 2019-2021. This corresponds to the situation when new data from an already covered country become available.⁵

Dataset
We use the annotated subset of the latest release of the MARPOR dataset (version 2022a; Lehmann et al., 2022a), augmented with the separately curated South American dataset (Lehmann et al., 2022b).⁶ We excluded manifestos annotated before the year 2000 to obtain a more uniform training dataset. Furthermore, to ensure comparability between the two approaches to cross-lingual modelling - preprocessing using machine translation and using a multilingual encoder (see § 3.4 below) - we excluded languages for which no pretrained free NMT system was readily available. This leaves us with 1314 manifestos from 41 different countries in 27 different languages.
In the X-COUNTRY setting, the rolling test set includes all of the data, while in the X-TIME setting it is much smaller (163,714 vs. 1,062,302 statements in the training set, i.e. around 13%) and has a weaker geographical coverage: only 18 countries have manifestos from 2019 and later.
The data for LITs have the same general train-test splits, but the sentences in them were consecutively concatenated into text chunks of no more than 4095 tokens (see Section 3.1), with a RILE score computed for each chunk based on its gold MARPOR labels. Chunks of fewer than 1000 tokens were discarded.⁷
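The chunking procedure can be sketched as follows. The greedy strategy (start a new chunk as soon as the next sentence would overflow the token limit) is our assumption; the text only specifies the length bounds and that sentences are concatenated consecutively:

```python
def build_chunks(sentences, labels, n_tokens, max_len=4095, min_len=1000):
    """Greedily concatenate consecutive sentences into chunks of at
    most `max_len` tokens, carrying each sentence's gold coarse label
    so a per-chunk RILE score can be computed afterwards. Chunks
    shorter than `min_len` tokens are discarded, as in the paper.
    Names and the greedy strategy are illustrative assumptions."""
    chunks = []
    cur_sents, cur_labels, cur_len = [], [], 0
    for sent, lab, n in zip(sentences, labels, n_tokens):
        if cur_len + n > max_len and cur_sents:
            if cur_len >= min_len:          # keep only long-enough chunks
                chunks.append((" ".join(cur_sents), list(cur_labels)))
            cur_sents, cur_labels, cur_len = [], [], 0
        cur_sents.append(sent)
        cur_labels.append(lab)
        cur_len += n
    if cur_len >= min_len:                  # flush the final chunk
        chunks.append((" ".join(cur_sents), list(cur_labels)))
    return chunks
```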

Models
The MARPOR dataset is multilingual, which raises the challenge of language transfer. The two current approaches in this case are using a multilingual encoder or machine-translating all the data into a pivot language, usually English (Litschko et al., 2022; Srinivasan and Choi, 2022).
Label aggregation Here we experiment with both options. For the MULTILINGUAL-ENCODER TRACK (XLM-ENC), we extract the representation of the CLS token from XLM-RoBERTa base (in the X-COUNTRY setting) and XLM-RoBERTa base and large (in the X-TIME setting).⁸ Throughout, the classification head is a 2-layer MLP with the inner dimension of 1024 and tanh activation after the first layer.
In the X-COUNTRY setting, the model was then repeatedly trained for two epochs using cross-entropy loss and the AdamW optimiser (Loshchilov and Hutter, 2019) with a learning rate of 10⁻⁵. In the X-TIME setting, the general setup is the same, but the model was trained for five epochs with a checkpoint selected based on the dev-set accuracy.

⁵ … manifestos of smaller parties that did not win any seats in previous elections and were not included in the dataset. The converse - NEW-VS.-OLD - would permit running a historical analysis of party positioning within a country. We have not addressed this scenario due to the scarcity of annotations from before 2000.
⁶ All data are available on the project web page: https://manifestoproject.wzb.eu/datasets.
⁷ Statistics of the datasets are shown in Tables 8 (by country) and 9 (by language) in Appendix C.
⁸ The necessity to train 41 different models on the full dataset in the X-COUNTRY setting made it impractical to use the large model.
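The classification head described above can be sketched as follows. This is a NumPy stand-in for the actual PyTorch module: the inner dimension (1024), the tanh activation, and the 3-way output follow the text, while the input size (768, the XLM-RoBERTa base hidden size) and the Gaussian initialisation are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

class ClassificationHead:
    """2-layer MLP over a CLS embedding: a hidden layer of width 1024
    with tanh activation, followed by a linear map to class logits.
    Forward pass only; training (cross-entropy + AdamW in the paper)
    is omitted from this sketch."""
    def __init__(self, d_in=768, d_hidden=1024, n_classes=3):
        self.W1 = rng.normal(0.0, 0.02, (d_in, d_hidden))
        self.b1 = np.zeros(d_hidden)
        self.W2 = rng.normal(0.0, 0.02, (d_hidden, n_classes))
        self.b2 = np.zeros(n_classes)

    def __call__(self, x):
        h = np.tanh(x @ self.W1 + self.b1)  # (batch, 1024)
        return h @ self.W2 + self.b2        # unnormalised logits
```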
For the MACHINE-TRANSLATION TRACK (MT), all manifestos are translated into English, for which the best MT systems and arguably the best pretrained encoders are available. Current MT systems, however, are still rather noisy, especially for non-WEIRD (Henrich et al., 2010) languages, which offsets the benefits of a stronger base model.
We use the EasyNMT toolkit, which gives access to the Opus-MT models (Tiedemann and Thottingal, 2020). A cursory inspection of the translated sentences shows that translation quality does vary across languages. However, even for manifestos whose source languages are difficult to translate (e.g. Georgian), the results produced by the classifier are still acceptable.
The translated sentences are encoded using pooled representations from all-mpnet-base-v2, a version of MPNet (Song et al., 2020) fine-tuned following the SBERT methodology (Reimers and Gurevych, 2019) and available on HuggingFace. The same classification head was then used as in the XLM-ENC approach, with the same training parameters.
For each model, we aggregate the predicted labels across the sentences of each manifesto and compute its RILE score according to Eq. 1.

Direct prediction
We experiment with two long-input encoder models: Longformer (Beltagy et al., 2020) and BigBird (Zaheer et al., 2020). They are only available for English, and we apply them to the translated dataset. We use the embedding of the last layer's CLS token as input to a regression head. In the training step, each chunk receives a gold label computed from its sentences using Eq. 1. The final RILE score of each manifesto is the average of the regression values of its chunks. The regression head is similar to the classification head described above, with the final softmax layer replaced by a single node with tanh activation mapping the output into the [−1, 1] range. The systems are trained using MSE loss.
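The regression head and the chunk-averaging step can be sketched in the same NumPy style as before (forward pass only; dimensions other than the single tanh output node are assumptions carried over from the classification head):

```python
import numpy as np

rng = np.random.default_rng(0)

class RegressionHead:
    """Same shape as the classification head, but the final layer is a
    single node with tanh activation, so each chunk's prediction lands
    in [-1, 1]. The real model is a PyTorch module trained with MSE
    loss; this sketch shows only the forward computation."""
    def __init__(self, d_in=768, d_hidden=1024):
        self.W1 = rng.normal(0.0, 0.02, (d_in, d_hidden))
        self.b1 = np.zeros(d_hidden)
        self.w2 = rng.normal(0.0, 0.02, d_hidden)
        self.b2 = 0.0

    def __call__(self, x):
        h = np.tanh(x @ self.W1 + self.b1)
        return np.tanh(h @ self.w2 + self.b2)  # (batch,) chunk scores

def manifesto_rile(chunk_scores):
    """Final manifesto-level score: the mean over its chunks."""
    return float(np.mean(chunk_scores))
```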

From regression to classification with LITs
A possible concern about the direct computation of RILE scores, as we frame the task for LITs, is that the models may fail to implicitly recreate the labelling-and-aggregation pipeline and instead learn spurious shortcuts by observing correlations between surface properties of texts and their RILE scores, which would then hurt test performance.
To address this concern, we carry out an additional experiment where we make the models' task more comparable to what a human political analyst would do. We train the LITs in a binned-regression setting: the range of RILE scores is split into five regions, corresponding to hard left [−1, −0.6), centre left [−0.6, −0.2), centrist [−0.2, 0.2), centre right [0.2, 0.6), and hard right [0.6, 1]. The models are then trained to predict these classes instead of real-valued RILE scores, using cross-entropy loss.
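The five intervals above translate directly into a binning function (each bin is closed on the left, matching the bracket notation in the text; names are ours):

```python
import bisect

BIN_EDGES = [-0.6, -0.2, 0.2, 0.6]
BIN_NAMES = ["hard left", "centre left", "centrist",
             "centre right", "hard right"]

def rile_bin(score):
    """Map a RILE score in [-1, 1] to one of the five political
    'camps'. bisect_right makes each interval closed on the left and
    open on the right, except the last, which includes 1."""
    return BIN_NAMES[bisect.bisect_right(BIN_EDGES, score)]
```

E.g. a score of exactly 0.2 falls into the centre-right bin, since the centrist interval is open on the right.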

Evaluation metrics
For the label-aggregation models, we first diagnose the performance of the label classifiers using the weighted macro-averaged F1 score.
We then evaluate both the label-aggregation and the direct-prediction models on the target task of predicting the RILE score. We use Spearman's correlation coefficient, which shows whether our scores are monotonically related to those computed from gold annotations using Eq. 1. Additionally, we look at the absolute values of errors and their directionality.
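Both diagnostics are easy to compute from paired score lists. A sketch (rank extraction via double argsort assumes no ties, which is adequate for continuous RILE scores; function names are ours):

```python
import numpy as np

def spearman_r(gold, pred):
    """Spearman correlation as Pearson correlation of ranks: equals 1
    whenever predictions are any monotone transform of the gold
    scores, which is exactly the property the evaluation targets."""
    g = np.argsort(np.argsort(gold))
    p = np.argsort(np.argsort(pred))
    return float(np.corrcoef(g, p)[0, 1])

def error_summary(gold, pred):
    """Directionality diagnostics: mean signed error (negative values
    indicate a systematic leftward bias) and mean absolute error."""
    d = np.asarray(pred, dtype=float) - np.asarray(gold, dtype=float)
    return float(d.mean()), float(np.abs(d).mean())
```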
We evaluate the performance of the LIT-based classifiers in the binned-regression setting using accuracy and F1 score.

Results
The main results of the experiments are summarised in Tables 2 and 3. Sections 4.1 and 4.2 discuss the results, while § 4.3 provides some detail about the strengths and weaknesses of the models.

Predicting MARPOR categories
As Table 2 shows, predicting the fine-grained MARPOR categories directly is a very hard task, both in the X-COUNTRY and the X-TIME setting. Our models easily beat the majority-class baseline but only achieve an accuracy above 50% in the X-TIME setting with the XLM-ENC encoder.
Aggregating labels into the three RILE-relevant classes makes the task predictably simpler: the baseline F1 score rises from nearly zero to 0.44/0.49 (Other becomes the dominant category), but so does the performance of the models, to accuracies and F1 scores of 0.7 and above. However, there is still ample room for improvement. Interestingly, while using machine translation leads to consistent improvements in the X-COUNTRY setting, the X-TIME setting is better served by the multilingual encoder.

Computing RILE scores
Label aggregation In agreement with our working hypothesis, Table 3 shows that even noisy labels can be used to calculate manifesto-wide scale values that are largely in agreement with gold values. When predicting RILE via label aggregation, the best results are attained with the multilingual encoder, both in the X-COUNTRY and in the X-TIME setting. Somewhat surprisingly, aggregating the labels, even though it leads to a smaller number of surface-level classification mistakes, does not improve the eventual RILE scores in the X-COUNTRY setting (r = 0.72 from aggregated labels vs. 0.73 from all labels) and gives only a modest boost in the X-TIME setting (0.9 vs. 0.88).

Long-input Transformers
The performance of LITs is vastly uneven. In the X-COUNTRY setting, both models struggle: by-chunk RILEs from Longformer are essentially uncorrelated with the gold ones, while BigBird's predictions show a non-negligible correlation (0.55), which is still much worse than the label-aggregation results. In the X-TIME setting, while Longformer's predictions are still extremely noisy (r = 0.35), BigBird's are comparable to what the label-aggregation approach achieves in the X-COUNTRY setting (0.71). As we discuss below, however, this correlation is somewhat misleading: while producing scores that are monotonically aligned with the correct ones, BigBird predicts values that are very close to zero and thus differ greatly in their absolute values from the gold scores.

LIT-based classifiers
The results of applying the better-performing LIT, BigBird, to the task of 5-way stance classification are shown in Table 4. Unlike RILE scores, by-chunk stance labels cannot be averaged, so for the final prediction each manifesto is assigned its majority class. The performance of the BigBird-based model in this setting is reasonable, with F1 scores ≈ 0.7.
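The majority-class aggregation over chunks amounts to a one-line vote (the tie-breaking rule, first-encountered label wins, is an implementation detail the text does not specify):

```python
from collections import Counter

def manifesto_stance(chunk_labels):
    """Assign a manifesto the majority class over its chunks'
    predicted stance labels. Counter.most_common is stable, so ties
    go to the label that appeared first, an arbitrary choice made
    only for this sketch."""
    return Counter(chunk_labels).most_common(1)[0][0]
```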

Regression to the mean
The distributions of gold RILE scores and those predicted in the X-COUNTRY setting by the best-performing label-aggregation pipeline and the best-performing LIT are shown in Figure 1. The plots make it clear that both models are very conservative: predicted values cluster closer to the mean RILE score than in the gold data. BigBird is especially affected by this, which we take to indicate that it suffers from a lack of training data: the training dataset was big enough to correctly estimate the mean of the distribution but not big enough to approximate the correct dispersion.
The predictions of the label-aggregation model based on XLM-ENC approximate the dispersion much better.However, the model still fails to account for the heavy right tail in the gold data and presents a more symmetric picture.In terms of RILE scores, this corresponds to a left skew: the model often presents right-leaning manifestos (those with positive RILE scores) as more centrist.
A more detailed picture of the relationship between the gold RILE scores and those predicted by the label-aggregation model is shown, for both settings, in Figure 2, which also presents the density of the prediction errors. Consistent with Figure 1, the density of the X-COUNTRY error distribution has a slightly heavier left tail. To characterize this behavior, we can look at the cases where the sign of the prediction is flipped, i.e. the upper-left and lower-right quadrants of the scatterplot. While the UL quadrant is nearly empty, the LR quadrant is populated not only near the x = 0 asymptote but also further to the right. This suggests that in the cross-country and cross-lingual setting, the hardest aspect of the problem is the correct identification of right-wing statements across countries. One of the challenges associated with right-wing labels is their differing distributions across countries. While the variation in the cumulative share of left-wing labels in manifestos is bounded roughly between 0.2 and 0.3, with the same labels dominant everywhere, the variability of right-wing labels is much higher and their share is lower on average. See Figure 4 in Appendix E for details and Lachat (2018), Fielitz and Laloire (2021), and Jahn (2022) for more in-depth analyses.
As the bottom panel of Figure 2 shows, the magnitude of errors in the X-TIME setting is considerably lower, with only a handful of sign-flip errors. This indicates that when a model has access to in-country data, the estimation of political positioning becomes easier, and the identification of right-wing tendencies is no longer a major hurdle.
The 5-way LIT-based party-stance classifier also suffers from the regression-to-the-mean problem, as can be seen in Table 5: the centrist category is overpredicted, while two extreme categories, which are rare in the data, are never predicted correctly.

Classifier errors and scaling analysis
One of the surprising results in Tables 2 and 3 is that the low accuracy of the models trying to predict all MARPOR labels directly does not translate into low quality of the respective RILE scores in the X-COUNTRY setting. This suggests that the errors of the models are not random: the models are more likely to substitute, e.g., another Left-category label for a true Left-category label than to replace a label from the Left set with a label from the Right set. A confusion matrix for the 3 coarse-grained labels (computed based on the fine-grained labels predicted by XLM-ENC in the X-COUNTRY setting), shown in Table 6, demonstrates that this is indeed the case.

Discussion
Our results show that multilingual automatic analysis of political-party positioning is at least partially feasible. It is possible to provide a high-level overview of the party system in a new country with a reasonable degree of precision, and even better results can be achieved with some amount of in-country data: the RILE scores computed using our method demonstrate a remarkably high correlation with the gold scores. Interestingly, the main obstacle to the success of our method seems to be not the language barrier, which is bridged well by either the off-the-shelf MT systems or the multilingual encoder, but the differences in political culture across countries: the models struggle to correctly identify right-wing statements in the manifestos.
In practical terms, using long-input Transformers instead of sentence-level classifiers offers a way to greatly simplify the analysis and obviate the problems of subsentence identification in the input, as such models are able to make holistic judgements about long spans of text. In terms of performance, LITs struggle on the task of directly estimating RILE compared to label-aggregation models, with the best model only approaching a reasonable level of performance. However, this must be taken with a grain of salt, since the label-aggregation models have the advantage of gold statement boundaries. Furthermore, our binned-regression experiment shows that LITs are promising candidates for coarse-grained party-positioning analysis in terms of political 'camps'. For all models, the tails of the distribution remain hard to identify, with extreme categories rarely predicted correctly and centre-left/centre-right labels often mistaken for centrist.

Related work
The work on computational analysis of political documents traditionally employs bag-of-words methods, such as those popularised by Laver et al. (2003) and Slapin and Proksch (2008). Glavaš et al. (2017) introduce distributional semantics into left-right analysis by using multilingual word alignment in the embedding space and a graph-based score-propagation algorithm. This approach is then built upon by Nanni et al. (2022). Rheault and Cochrane (2020) adapt the word2vec methodology to the analysis of parliamentary speeches in a single-language setting via the use of trained party vectors, whose dimensionality they reduce using PCA; they then interpret one of the resulting axes as the left-right scale. Vafa et al. (2020) instead develop a methodology for identifying the political position of lawmakers on the progressive-to-moderate dimension with a bag-of-words-based topic-modelling approach.
The use of contextualised embeddings for political analysis has not yet become mainstream. Abercrombie et al. (2019) test a wide range of methods, from unigram statistics to BERT-based classifiers, for assigning MARPOR labels to debate motions from the UK parliament. Dayanik et al. (2022) use several pre-trained single-language BERT models for the task of political-statement classification in five languages. Facing the same issues of label-frequency imbalance and rare labels, they mitigate them to some degree by using the hierarchical organisation of MARPOR labels; they do not try to compute RILE scores. Ceron et al. (2022) introduce sentence transformers (Reimers and Gurevych, 2019) into the problem space and fine-tune the embedding model itself in order to learn a politically informative distance measure between manifesto texts. Ceron et al. (2023) further extend this method to analyse interparty differences with regard to major policy domains, such as Law and Order or Sustainability and Agriculture.
More generally, our work falls into the domain of zero-shot classification, with test data coming from a country or a time period not covered by the training data. The question of whether machine translation (Schäfer et al., 2022) or multilingual encoders (Litschko et al., 2022) are better suited for cross-lingual transfer is still actively debated, and we explore both options. From another perspective, the task of identifying and characterising political positions from textual data abuts the larger fields of stance detection and argument mining (Küçük and Can, 2020; Reimers et al., 2019).

Conclusion
In this paper, we have proposed the first series of models that generalise the task of political-party positioning across countries and election cycles. We showed that the main challenge - predicting MARPOR labels across countries and election cycles with high accuracy - is, surprisingly, not a real barrier on the way to a highly precise multilingual scaling analysis. We experimented with the Standard Right-Left Scale (RILE score), which is widely discussed in the political-science literature, and demonstrated that party manifestos can be effectively characterized in these terms using state-of-the-art multilingual modeling techniques applied to sentence-level classification with subsequent label aggregation, and that even better results can be achieved via task-specific label clustering.
We further experimented with replacing the label-aggregation approach with long-input Transformers - using both regression and classification formulations - in order to obviate the task of identifying statement spans in manifestos. These models demonstrate promising performance but still underperform the more traditional pipeline mimicking manual analysis.
Bridging the gap between long-input models and political analysis is an important avenue for future work, together with tackling other political dimensions and further widening the scope of the analysis.

Limitations
The main limitations of our work are twofold, and both stem from our dependence on the categories and annotations produced by the MARPOR project: 1. The RILE scale that we target is computed based on the MARPOR category labels, and we do not test whether our methodology can be easily projected onto other categorisation schemes. However, given the important role of the MARPOR codebook in the political-science literature and the amount of annotated data already available, we hope that our work makes a valuable contribution to the debate.
2. In the label-aggregation pipeline, we depend not only on the labels themselves but also on the way they are applied to manifestos: following previous work (Dayanik et al., 2022; Ceron et al., 2022), we use the sub-sentence boundaries selected by MARPOR annotators in order to assign a single category to each statement. In the manifesto texts, a sentence can therefore sometimes be associated with several labels. There are several possible ways to address this issue (e.g., selecting a 'majority' label for each sentence in the training data, training a multi-label classifier, or learning splits together with labels from the training set), and they need to be explored to obtain the best possible performance in real-world settings. Using LITs removes this issue, but their performance is not competitive.
Table 5: Confusion matrix for the party stance predicted by the BigBird-based classifier in the X-COUNTRY setting. L: left, CL: centre left, C: centrist, CR: centre right, R: right.

Figure 1: The distributions of gold and predicted RILE scores in the X-COUNTRY setting.

Figure 2: Gold vs. predicted RILE scores from the label-aggregation model, for both settings, with the density of the prediction errors.

Figure 3: The distributions of gold and predicted RILE scores in the X-TIME setting.

Figure 4: Cumulative shares of left-wing and right-wing labels in manifestos from different countries. See Appendix B for the explanation of label codes.

Table 1: The MARPOR categories used for calculating the RILE score.

Table 6: Confusion matrix of coarse-grained labels used to compute the RILE score based on all MARPOR labels (the XLM-ENC + X-COUNTRY setting). True labels are in the rows, predicted labels in the columns.