An Ordinal Latent Variable Model of Conflict Intensity

Measuring the intensity of events is crucial for monitoring and tracking armed conflict. Advances in automated event extraction have yielded massive data sets of “who did what to whom” micro-records that enable data-driven approaches to monitoring conflict. The Goldstein scale is a widely-used expert-based measure that scores events on a conflictual–cooperative scale. It is based only on the action category (“what”) and disregards the subject (“who”) and object (“to whom”) of an event, as well as contextual information, like associated casualty count, that should contribute to the perception of an event’s “intensity”. This paper takes a latent variable-based approach to measuring conflict intensity. We introduce a probabilistic generative model that assumes each observed event is associated with a latent intensity class. A novel aspect of this model is that it imposes an ordering on the classes, such that higher-valued classes denote higher levels of intensity. The ordinal nature of the latent variable is induced from naturally ordered aspects of the data (e.g., casualty counts) where higher values naturally indicate higher intensity. We evaluate the proposed model both intrinsically and extrinsically, showing that it obtains comparatively good held-out predictive performance.


Introduction
On a scale from −10 for conflictual to +10 for cooperative, which of the following events should be considered more "intense": "Military injured two civilians" or "Rebels detained fifty soldiers"? Assessing armed conflict based on event descriptions is extremely sensitive and subjective. Even worse, the tremendous number of events collected every day often makes a case-by-case analysis by human domain experts unmanageable. This is particularly true for unstructured, full-text event descriptions (Lepuschitz and Stoehr, 2021). As a consequence, there is a strong demand to parse, standardize and aggregate individual events into meaningful "conflict intensity" measures (Moses et al., 1967; Azar, 1980; McClelland, 1984). Today more than ever, quantitative measures of conflict intensity are indispensable for the assessment of international relations, the provision of humanitarian aid and political decision-making (Beck et al., 2000).
One of the most frequently used measures is the Goldstein Scale (Goldstein, 1992). Major event datasets like IDEA (Bond et al., 2003), KEDS (Schrodt, 2008), GDELT (Leetaru and Schrodt, 2013), ICEWS (Boschee et al., 2015), Phoenix (Beieler, 2016) and NAVCO (Lewis et al., 2016) all rely on it. The Goldstein Scale assigns intensity scores between −10.0 and +7.0 to the 20 action categories defined by the Conflict and Mediation Event Observations (CAMEO) taxonomy (Schrodt, 2012). CAMEO specifies 204 low-level event types which are summarized into 20 high-level action categories; we only consider the high-level action categories. For example, the Goldstein Scale ranks "use unconventional mass violence" and "fight" as the most conflictual CAMEO action categories (−10.0) and "provide aid" (+7.0) as the most cooperative, see Fig. 1.
However, despite its active deployment in complex application settings, there exist few systematic evaluations of the Goldstein Scale (King and Lowe, 2003; Boschee et al., 2015; Raleigh and Kishi, 2019; Schrodt, 2019). The scale was created by human domain experts and is defined over action categories only. For this reason, it considers neither fatality counts nor the involved perpetrator and victim types (Raleigh and Kishi, 2019). This brings us back to the opening question: how would one assess the intensity of the two events without considering the difference in the number of victims, or the fact that some victims are civilians?
Conflict intensity is an abstract concept, while the involved perpetrators, victims, action categories and fatality counts are observable and easier to classify. Instead of imposing a conflict scale a priori, we argue for treating intensity as a latent variable that is correlated with these observables and that can be learned from data. We model the latent variable as ordinal. Ordinal scales are well-suited when dealing with subjective preferences that are hard to express through exact quantities (cardinal), but that are also not categorical (nominal). The Likert Scale (Likert, 1932) is a popular example in psychometrics. Ordinal scales are also used when differences between measurement levels are not equal, as in the Richter Scale (Richter, 1935) describing the intensity magnitudes of earthquakes. Ordered classes are highly interpretable (Moustaki, 2000) and a natural fit for discrete CAMEO action categories. We start with a discussion of the Goldstein Scale in §2 and descriptively evaluate it on real-world data in §3. This motivates the conceptualization of our generative model (§4). We evaluate the performance of our model intrinsically (§5) and extrinsically (§6) and find that it outperforms the original Goldstein Scale. Our proposed modeling motif can be applied widely to other domains: for instance, sentiment analysis or pain perception may be interpreted as having ordinal latent characteristics, rendered visible through observed words or pain communication (§8).

Empirical fit
The Goldstein Scale (Goldstein, 1992) was created by a panel of domain experts who ranked event descriptions based on their conflict potential. The scale was initially created for action categories defined by the World Event / Interaction Survey (WEIS) (McClelland, 1984) and later adapted to CAMEO action categories (Schrodt, 2012, 2019). However, these categories were not necessarily created with distinct levels of conflict intensity in mind in the first place (Schrodt, 2012).
Thus, we observe in Fig. 1 that some of the action categories are assigned the same Goldstein value: −10.0, −4.0 and −2.0 are each assigned twice. Moreover, the Goldstein Scale does not cover the entire range from −10.0 to 10.0. When validated on empirical data, some CAMEO action categories occur more frequently and may thus be under-specified, while others rarely occur (Schrodt, 2019). A data-driven and machine-automated scale could augment the well-established Goldstein Scale and be (re-)learned for different spatio-temporal contexts, as outlined in §6.

Fatalities, Perpetrator and Victim Types
Each Goldstein value is assigned based only on the event type, e.g. "protest" or "fight" (Raleigh and Kishi, 2019). A violent protest with 10 deaths receives the same intensity score as a protest incurring 100 deaths because both are "protest" events. This raises the question of why fatality counts are disregarded when assigning Goldstein values.
The Goldstein Scale was developed under the traditional "Westphalian-Clausewitzian" doctrine of international relations, which accounted for sovereign states interacting through military and diplomatic threats (Schrodt, 2012). For the assessment of modern-day multiethnic or asymmetric warfare, high-level diplomatic interaction types appear less relevant. Recent work in international relations bases conflict assessment on fatality counts (Chaudoin et al., 2017). Fatality counts may be considered less ambiguous to measure and easier to compare across event ontologies and datasets (Kalyvas, 2006). There were also technical reasons for disregarding fatality counts: when the Goldstein Scale was developed, extensive conflict event data with human-verified fatality counts was missing. For translating full-text event descriptions into CAMEO action categories, event resolution software such as Petrarch2 (Norris et al., 2017) and its predecessors Petrarch, BBN ACCENT (Boschee et al., 2015), VRA-Reader (King and Lowe, 2003) and Tabari (Schrodt, 2009) is used. These tools rely on lambda calculus-based semantic parsing using so-called verb dictionaries, which map predicates to action categories on the basis of regular expression matching. Other semantic roles, such as arguments or numbers in text, are not considered in the parsing. This led to the one-to-one mapping between Goldstein values and CAMEO action categories (see Fig. 1).

Conflict Event Data
We start with a descriptive analysis of the Goldstein Scale on real-world data. To this end, we consider the publicly available Nonviolent and Violent Campaigns and Outcomes (NAVCO) data collection (Lewis et al., 2016). Specifically, we use the latest release, NAVCO 3.0 (Chenoweth et al., 2018) from November 2019, which comprises 112,089 events around the world between December 1990 and December 2012. An exemplary event description is "On 19 May 2012, military (S) injured (P) two (Q) civilians (O) in Afghanistan". In NAVCO, each part of this description has been parsed by human coders into standardized, structured features. Supporting the generalizability of our approach, many other datasets such as IDEA, KEDS, GDELT, ICEWS and Phoenix comprise these features. We interpret the features as the semantic roles subject (S), predicate (P), quantifier (Q), object (O) and contextual information. Thus, each data point n is a four-element tuple ⟨s_n, p_n, q_n, o_n⟩.

Subject S: perpetrator types. NAVCO features columns termed "actor3", "actor6" and "actor9". These correspond to actor types defined by the CAMEO actor codebook. We merge the higher-level categories "actor3" and "actor6", resulting in 33 different actor types such as "government", "rebels" and "civilians". We then map all actor types into one of the 4 classes "civilian", "military", "governmental" and "political", which we describe as the subject or perpetrator of an event.¹
Predicate P: Goldstein values. NAVCO codes² each event description into one of the 20 CAMEO action categories in the column "verb10". We term this column the predicate since the event coding predominantly relies on predicates (Norris et al., 2017). Due to the one-to-one mapping between action categories and Goldstein values, the predicates possess an expert-based intensity ordering.
Quantifier Q: victim counts. Each event description is annotated with human-verified fatality as well as injury counts. We sum the two columns and add +1 so that we can take the logarithm; we refer to the resulting value as the victim count or quantifier. In Fig. 1, we plot the average number of victims associated with an event against its Goldstein value. The result confirms the intuition: events that Goldstein ranks as highly conflictual (e.g. "fight" (−10.0)) coincide with higher victim counts. This correlation is statistically significant at p ≤ 0.01 according to both the Spearman (−0.5147) and Pearson (−0.1191) correlation coefficients.
Object O: victim types. Similar to the perpetrator types, each event features victim types that we regard as objects. The types rely on the same CAMEO coding as the perpetrators, but are called "target3" and "target6" in NAVCO.

Contextual information: location and time. Finally, each event entry is annotated with the date of the event and its location at the country and sub-county level. We consider this contextual information in an extrinsic evaluation in §6.

Generative story
We operationalize conflict intensity as a latent variable that models the correlation between the observed variables subject (S), predicate (P), quantifier (Q) and object (O). Each data point is a tuple ⟨s_n, p_n, q_n, o_n⟩ representing an event. Our model is depicted in Fig. 3. We assume the following generative story. For each event, we sample a discrete event intensity class

z_n ∼ Categorical(π^(z)),

where π^(z) is a C-dimensional vector of class probabilities drawn from a Dirichlet distribution. Conditioned on the sampled intensity class z_n, we independently sample each of the observed sites of the event tuple ⟨s_n, p_n, q_n, o_n⟩:

s_n ∼ Categorical(Π^(s)_{z_n}),   o_n ∼ Categorical(Π^(o)_{z_n}),
p_n ∼ Beta(σ(ω^(p)_{z_n}), κ^(p)_{z_n} + 2),   q_n ∼ ZIGeom(σ(δ^(q)_{z_n}), σ(b^(q)_{z_n})).

Π^(s)_{z_n} and Π^(o)_{z_n} are row vectors of the C-row matrices Π^(s) and Π^(o), indexed by z_n. Each row vector is drawn independently from a Dirichlet distribution and parameterizes a Categorical over subject and object types respectively. The scalar parameters ω^(p)_{z_n}, κ^(p)_{z_n}, δ^(q)_{z_n} and b^(q)_{z_n} parametrize the distributions over Goldstein values p_n and victim counts q_n, as discussed in the following.

Figure 3: Our latent variable model for conflict intensity. Each observed event tuple ⟨s_n, p_n, q_n, o_n⟩ corresponds to the semantic roles perpetrator type, Goldstein value, victim count and victim type, as in "Military injured two civilians". The priors ω^(p), δ^(q) and b^(q) are ordinally transformed vectors, indicated by double borders.
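To make the generative story concrete, the following minimal Python sketch forward-samples one event tuple under illustrative parameter values. All function and variable names here are ours, not from the paper's implementation (which is fit with Pyro), and the parametrizations follow the distributions described in this section.

```python
import math
import random


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def generate_event(pi_z, Pi_s, Pi_o, omega_p, kappa_p, delta_q, b_q, rng):
    """Forward-sample one event tuple (s, p, q, o): draw a latent
    intensity class z, then draw each observed site conditioned on z."""
    C = len(pi_z)
    z = rng.choices(range(C), weights=pi_z)[0]
    # subject and object types: Categorical rows indexed by z
    s = rng.choices(range(len(Pi_s[z])), weights=Pi_s[z])[0]
    o = rng.choices(range(len(Pi_o[z])), weights=Pi_o[z])[0]
    # Goldstein value: Beta with mean sigmoid(omega_p[z]) and
    # concentration kappa_p[z] + 2
    mean, conc = _sigmoid(omega_p[z]), kappa_p[z] + 2.0
    p = rng.betavariate(mean * conc, (1.0 - mean) * conc)
    # victim count: zero-inflated Geometric (gate inflates zeros)
    gate, succ = _sigmoid(delta_q[z]), _sigmoid(b_q[z])
    q = 0
    if rng.random() >= gate:
        while rng.random() >= succ:
            q += 1
    return s, p, q, o
```

For example, with C = 2 illustrative classes and uniform subject/object rows, `generate_event([0.3, 0.7], [[0.25]*4]*2, [[0.25]*4]*2, [-1.0, 1.0], [1.0, 1.0], [1.0, -1.0], [0.5, -0.5], random.Random(0))` returns a tuple with p in (0, 1) and a non-negative integer victim count.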

Ordinal Latent Variable
Crucially, we want to make our latent variable Z ordinal. Higher values Z = z_n should correspond to higher intensity levels, which should mean that associated events have, on average, higher Goldstein values p_n and victim counts q_n. Goldstein values and victim counts have intrinsic intensity orderings. Similarly, one could argue that civilian victims should be considered more "intense" than military victims; however, we refrain from imposing such normative orderings on subject types s_n and object types o_n. As can be seen in eq. (3) and eq. (4), z_n indexes into the vectors ω^(p), κ^(p) and δ^(q), b^(q), which in turn parametrize P and Q. Thus, if we order these vectors, the classes of Z will be ordered as well. To this end, we define an ordered Normal prior: a sequence of random variables {λ_1, …, λ_C} is drawn according to an OrderedNormal(µ, σ) prior if it is sampled by the generative process

λ̃_c ∼ Normal(µ, σ)  for c = 1, …, C,
(λ_1, …, λ_C) = Ord(λ̃_1, …, λ̃_C),

where Ord(x) takes in a vector x and transforms it into a vector of the same dimension whose components are in strictly increasing order,

Ord(x)_1 = x_1,  Ord(x)_c = Ord(x)_{c−1} + exp(x_c)  for c = 2, …, C.

This transform is a smooth bijection, which means that the p.d.f. of the ordered Normal is well-defined (App. A.1). It is closely related to the latent cutoff points in ordinal regression (Wooldridge, 2010).
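The Ord transform can be realized as the standard cumulative-exponential construction (a minimal sketch; this is one common way to obtain a smooth, order-preserving bijection, as used for ordered vectors in probabilistic programming libraries):

```python
import math


def ord_transform(x):
    """Map an unconstrained vector to a strictly increasing one:
    keep the first component and add exp() of each subsequent
    unconstrained component, so every step is strictly positive."""
    out = [x[0]]
    for xi in x[1:]:
        out.append(out[-1] + math.exp(xi))
    return out
```

For instance, `ord_transform([0.0, 0.0, 0.0])` yields `[0.0, 1.0, 2.0]`; any input vector, however disordered, maps to a strictly increasing output.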
Ordinal Beta. We choose Beta(σ(ω^(p)_{z_n}), κ^(p)_{z_n} + 2) for modeling Goldstein values p_n. ω^(p) is an ordered vector sampled from an ordered Normal. κ^(p) is an unordered vector whose elements are sampled independently from a Gamma distribution. Both vectors are indexed by z_n to select the scalars ω^(p)_{z_n} and κ^(p)_{z_n} respectively. σ(·) is the element-wise logistic sigmoid function, which ensures that parameters range between 0 and 1.
σ(ω^(p)_{z_n}) can be seen as the mean and κ^(p)_{z_n} + 2 as the concentration parameter of the Beta distribution. This parametrization via mean and concentration translates into the more familiar Beta(α, β) parametrization. The offset κ^(p)_{z_n} + 2 ensures that either α or β is always ≥ 1.
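The translation from the mean–concentration parametrization to the standard Beta(α, β) form can be sketched as follows (the function name is ours):

```python
def beta_alpha_beta(mean, concentration):
    """Convert a (mean, concentration) parametrization of the Beta
    distribution to the familiar Beta(alpha, beta):
    alpha = mean * concentration, beta = (1 - mean) * concentration."""
    return mean * concentration, (1.0 - mean) * concentration
```

Since the concentration is κ + 2 ≥ 2, we have α + β ≥ 2, so at least one of α, β is always ≥ 1, which avoids Beta densities with spikes at both endpoints.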
Ordinal Zero-inflated Geometric. For modeling victim counts q_n, we rely on a Zero-inflated Geometric distribution, denoted ZIGeom(σ(δ^(q)_{z_n}), σ(b^(q)_{z_n})), with two sigmoid-transformed ordered Normals as priors. δ^(q) and b^(q) correspond to the "gate", i.e. the inflated probability of sampling zero, and the success probability respectively. Importantly, here we reverse the ordering to λ_c < λ_{c−1} for all c ∈ {2, …, C}, since a higher δ^(q)_{z_n} corresponds to lower victim counts and thus lower event intensity.
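A minimal sketch of the resulting probability mass function (illustrative names; `gate` is the zero-inflation probability and `p_success` the Geometric success probability):

```python
def zig_pmf(k, gate, p_success):
    """P(Q = k) under a Zero-inflated Geometric: a zero is emitted
    either through the gate or through the Geometric itself; counts
    k >= 1 arise only from the Geometric, parameterized here as the
    number of failures before the first success."""
    geom = (1.0 - p_success) ** k * p_success
    if k == 0:
        return gate + (1.0 - gate) * geom
    return (1.0 - gate) * geom
```

Note that zeros have two sources, so the gate inflates P(Q = 0) beyond what the Geometric alone would assign, matching the many zero-victim events in the data.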

Technical Details
Implementation. We implement our model using the probabilistic programming API Pyro and perform MCMC inference (Metropolis et al., 1953) with the gradient-based NUTS sampler. We rely on Pyro's parallel enumeration to marginalize out the discrete latent variable during inference. To sample the latent from its posterior, we rely on Pyro's infer_discrete method.

Intrinsic Evaluation
We evaluate our model by performing an imputation task. To this end, we split the 112,089 data points into a 70% training and a 30% held-out set.

Imputation Procedure. We fit the model on the training set, considering T = 1000 posterior samples of the parameters θ^(t) after a warm-up period of 200 samples. Next, we move on to the held-out set of entirely unseen data points. We remove one observed site from all N event tuples and refer to the missing site as y; e.g., we remove all Goldstein values and obtain y_n = p_n. We use the remaining three-way event tuples x_n = ⟨s_n, q_n, o_n⟩ jointly with the posterior parameters θ^(t) to make predictions of y_n. To predict y_n, we first sample the discrete latent variable z_n^(t) according to p(z_n^(t) | x_n, θ^(t)). Second, we sample from p(y_n | z_n^(t), x_n, θ^(t)).

Figure 5: Intrinsic evaluation on the held-out set: we impute removed values y_n of one site, e.g. P, based on the values x_n of the remaining sites, e.g. S, Q and O. We present mean results of the posterior predictive density (PPD) and point predictions (F1 / MSE) over 10 runs. We compare three variants of our model with C = 5 latent classes: a majority vote (majority) fitted only to the single site that is removed at test time, an unfitted variant (random prior) and a fitted variant (model). In addition, we train Bayesian linear regression (LR) models to directly predict y_n from x_n. In contrast to LR, our model compresses the observed values into a single latent Z, predicting y_n via x_n → z_n → y_n. Still, our model significantly outperforms all baselines as tested by t-tests with α = 0.05.
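The two-step prediction, first sampling z from its posterior over the C classes and then sampling the removed site y given z, can be sketched in a few lines of Python. The likelihood and sampler arguments are illustrative stand-ins for the model's fitted conditionals, not the paper's actual code:

```python
import random


def impute_site(pi_z, x_lik_per_z, y_sampler_per_z, rng):
    """Two-step imputation sketch: score each latent class c by
    pi_z[c] * p(x | z = c) (Bayes' rule up to normalization),
    sample z from the normalized weights, then sample the removed
    site y from p(y | z)."""
    weights = [pi * lik for pi, lik in zip(pi_z, x_lik_per_z)]
    z = rng.choices(range(len(weights)), weights=weights)[0]
    return z, y_sampler_per_z[z]()
```

In the full procedure, this draw is repeated for every posterior parameter sample θ^(t), yielding T predictions per held-out event.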
Evaluation Metrics. We compute the log-likelihood of p(y_n | z_n^(t), x_n, θ^(t)), averaged over data points, known as the posterior predictive density (PPD)³ (Gelman et al., 1996). After exponentiating, the metric ranges between 0 and ∞:

PPD = exp( (1/N) Σ_{n=1}^N log( (1/T) Σ_{t=1}^T p(y_n | z_n^(t), x_n, θ^(t)) ) ).

We also approximate the posterior predictive mean (1/T) Σ_{t=1}^T E[y_n | z_n^(t), x_n, θ^(t)]. This allows comparing predicted and true values based on the F1 score and the mean squared error (MSE).
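The PPD, an average over T posterior samples per point followed by a geometric mean over the N points, can be computed as follows (a sketch; the log-mean-exp step is for numerical stability):

```python
import math


def posterior_predictive_density(log_liks):
    """PPD sketch. log_liks[n][t] holds log p(y_n | z_n^(t), x_n, theta^(t)).
    For each held-out point, average the T per-sample likelihoods
    (via log-mean-exp), then take the geometric mean over the N points."""
    per_point = []
    for row in log_liks:
        m = max(row)
        per_point.append(m + math.log(sum(math.exp(l - m) for l in row) / len(row)))
    return math.exp(sum(per_point) / len(per_point))
```

A PPD of 1.0 means the predictive density assigns, on (geometric) average, probability one to the held-out values; higher is better.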
Baselines. We compare our model against several baselines. Instead of fitting one model to all four observed sites ⟨s_n, p_n, q_n, o_n⟩ on the training set, we fit four separate models to each site individually, which resembles a majority vote. Further, we consider an unfitted, random prior variant of our model. As another strong baseline, we train four Bayesian linear regression (LR) models to directly predict y_n from x_n respectively.

Results. Our model can learn the optimal number of intensity classes C that best fits the data and consequently best imputes the removed site. To this end, we fit the model with C ∈ {3, …, 7} classes on the training set and evaluate it on the held-out set. We find that C = 5 yields the highest PPD overall, as shown in Fig. 4. Fig. 5 presents the results of the imputation task on the held-out set. We find that our model, with the optimal C = 5 classes, outperforms all baselines on all metrics.

³The inner sum over T samples corresponds to a discretized integral over the probability density function of the posterior parameters. exp((1/N) Σ_{n=1}^N log(·)) represents the geometric mean over the N data points.

Extrinsic Evaluation
The Goldstein Scale was purposefully designed to rank individual events in isolation of context. However, context strongly affects the assessment of intensity in real settings (Gutierrez-Sanin and Wood, 2017). In countries where high-intensity and high-fatality events rarely occur, a global intensity scale may not be sensitive enough. Our data-driven model has the advantage that it can be refitted to different contexts. Event locations and times are readily available in the NAVCO dataset. This allows us to filter the data by country and refit the model. To derive an intensity scale, we consider the posterior mean estimate z̄_n = E[z_n^(t) | s_n, p_n, q_n, o_n, θ^(t)] ≈ (1/T) Σ_{t=1}^T z_n^(t) of the latent intensity level per event.⁴ We average all z̄_n by month and linearly interpolate missing months to construct a time series. We evaluate the resulting country-specific intensity time series, termed Z for simplicity, in two tasks.
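The linear interpolation of months without events can be sketched as follows (illustrative helper; it assumes interior gaps, i.e. the first and last months are observed):

```python
def interpolate_gaps(series):
    """Linearly fill interior None gaps in a monthly series of
    averaged intensities, e.g. [1.0, None, None, 4.0] becomes
    [1.0, 2.0, 3.0, 4.0]."""
    out = list(series)
    i = 1
    while i < len(out):
        if out[i] is None:
            j = i
            while out[j] is None:  # find the next observed month
                j += 1
            step = (out[j] - out[i - 1]) / (j - i + 1)
            for k in range(i, j):
                out[k] = out[i - 1] + step * (k - i + 1)
            i = j
        i += 1
    return out
```

This keeps the time series equally spaced, which the autoregressive models in the next section require.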

Auto-regressive forecasting
First, we test if knowledge of the time series Z improves autoregressive forecasting of the Goldstein P and victim count Q time series for fixed locations. We start by taking the first difference of all time series and verify their stationarity based on the Augmented Dickey-Fuller (ADF) test (Fuller, 1976) with a significance level of p < 0.05. We compare an autoregression (AR) and a vector autoregression (VAR) model and determine their optimal orders (time lag, in months) based on the Bayesian Information Criterion (BIC). We fit and evaluate the models via time series cross-validation to avoid any leakage of held-out data into the training set. In Tab. 1, we report results in mean squared error (MSE) on the validation sets, averaged over all folds. Indeed, we find that our intensity time series Z holds predictive information on both P and Q. A VAR trained on P and Z even outperforms a VAR trained on all four time series S, P, Q and O in most cases. In addition, we run the Granger test (Granger, 1969), which tests the null hypothesis that forecasting a variable P based on its own history only is at least as accurate as forecasting based on its own history plus the history of another variable Z.

⁴We are not facing the problem of label switching (Stephens, 2000) since the ordering transformation (see eq. (8)) represents an identifiability constraint. We may re-scale the set of sampled posterior latents z_n to cover a desired range (e.g. from 0.0 to 1.0, as we did with the Goldstein Scale).

Table 1: Auto-regressive forecasting: we test how much predictive information our intensity time series Z holds on future developments of Goldstein values P and victim counts Q. In all cases, including Z in a vector autoregression (VAR) outperforms the autoregression based solely on P or Q in terms of mean squared error (MSE) (reported values are ×10⁻²) on the held-out set. A VAR trained on Z and P or Q even outperforms a VAR trained on S, P, Q and O in most cases. If the p value of the corresponding Granger (Grang) test is small, we can reject the null hypothesis that Z does not add predictive information, which is a stricter requirement than the reduction in MSE.
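The leakage-free evaluation protocol can be sketched as an expanding-window split (an illustrative helper; the fold sizing is an assumption, not taken from the paper):

```python
def ts_cv_splits(n, n_folds, min_train):
    """Expanding-window time series cross-validation: each fold
    trains on an initial segment [0, train_end) and validates on
    the next block, so no future (held-out) observations ever
    leak into the training window."""
    fold_size = (n - min_train) // n_folds
    splits = []
    for f in range(n_folds):
        train_end = min_train + f * fold_size
        val_end = train_end + fold_size
        splits.append((list(range(train_end)), list(range(train_end, val_end))))
    return splits
```

Every validation index lies strictly after all training indices of its fold, which is the property that ordinary shuffled cross-validation would violate for time series.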

Conflict intensity and online attention
Extrinsically evaluating intensity scales is difficult because there exist no objective ground-truth measures. A growing body of work analyzes the correlation between shifts in online behavior or media attention and conflict intensity (Wolfsfeld et al., 2013; Chykina and Crabtree, 2018; Timoneda and Wibbels, 2022). We download time series of Google search keywords using the Google Trends Anchorbank (G-TAB) (West, 2020). We constrain searches to the category of world news (1209) and use the country name as the search keyword. Next, we fit our model to a subset of the NAVCO dataset filtered by the respective country and obtain an intensity time series Z. Fig. 6 shows the comparison between our intensity, the Goldstein and the Google Trends time series for Syria between 2004 and 2013.⁵ We find that Google Trends is more strongly correlated with our intensity time series (Pearson 0.427) than with the Goldstein time series (Pearson 0.318).

Related Work
Modeling International Relations. Douglass et al. (2022) qualitatively refer to international relations as a latent variable, supporting our quantitative operationalization. O'Connor et al. (2013) present an unsupervised, probabilistic model for extracting international relations from political context using parsers and topic models. They extract "source-receiver-time-step-predicate" quadruples and model these data using a time-smoothed latent variable representing event types. For evaluating their inferred event types, O'Connor et al. (2013) consider the Goldstein Scale as a gold standard. Schein et al. (2015, 2016) decompose four-way tensors (senders, receivers, CAMEO-coded action categories, time steps) to infer multilateral relations using Bayesian Poisson tensor factorization. They evaluate event types through an exploratory analysis and forecasting, but do not consider intensity. Neural network-based approaches model relationships of novel characters on an ordinal 5-point scale (Srivastava et al., 2016) or via clustering (Iyyer et al., 2016). Han et al. (2019) transfer these models to non-ordered country relations which are not validated against event coding schemes. Further work classifies ally-enemy relationships in a signed network using a graph neural network.

Model Limitations and Future Work
The latent variable model presented in this work represents a general framework for learning interpretable, ordinal scales from any set of observed variables of arbitrary distribution type. Observed variables could come from any tabular dataset or from semantic parsing (Makarov, 2018), but should feature continuous or discrete ranking scores. A crucial limitation of our model is the requirement that the observed variables be correlated. Considering Fig. 1 again: if Goldstein values and victim counts were not correlated, our model could neither infer one from the other nor learn a latent correlate. Harnessing this correlation, data imputation (§5) and anomaly detection represent promising applications, as large conflict event datasets are typically noisy (Ward et al., 2016). Conversely, if Goldstein values and victim counts were perfectly correlated, considering both would be superfluous, since one would incorporate all the information of the other. Assuming civilian victims are "more intense" than military victims, we could impose an additional ordering on victim types O, modeled by a Binomial. One could also argue for conditioning observed variables on each other, e.g. making victim counts Q conditionally dependent on O or even P. We plan to incorporate multiple latents to model multi-dimensional intensity concepts (e.g. CAMEO QuadClass). Future models could also be conditioned on location and include a temporal component to account for surprisal in intensity perception. Further improvement will be enabled through better extraction of event and victim types as well as counts from full text.

Table 2: Evaluation through predictive data imputation on held-out data. We compare three models: a majority vote baseline fitted only to the respective observed site (major), an unfitted version (random prior) and a fitted version (model) of our latent variable model, all with the optimal number of latent classes C = 5.
The table shows mean results of the posterior predictive density (PPD) and point predictions (F1 / MSE) over 10 runs. Our model significantly outperforms the baselines in all metrics as tested by t-tests with α = 0.05.