Evaluating Evaluation Measures for Ordinal Classification and Ordinal Quantification

Ordinal Classification (OC) is an important classification task where the classes are ordinal. For example, an OC task for sentiment analysis could have the following classes: highly positive, positive, neutral, negative, highly negative. Clearly, evaluation measures for an OC task should penalise misclassifications by considering the ordinal nature of the classes. Ordinal Quantification (OQ) is a related task where the gold data is a distribution over ordinal classes, and the system is required to estimate this distribution. Evaluation measures for an OQ task should also take the ordinal nature of the classes into account. However, for both OC and OQ, there are only a small number of known evaluation measures that meet this basic requirement. In the present study, we utilise data from the SemEval and NTCIR communities to clarify the properties of nine evaluation measures in the context of OC tasks, and six measures in the context of OQ tasks.


Introduction
In NLP and many other experiment-oriented research disciplines, researchers rely heavily on evaluation measures. Whenever we observe an improvement in the score of our favourite measure, we either assume or hope that this implies that we have managed to move our system a little towards what we ultimately want to achieve. Hence it is of utmost importance to examine whether evaluation measures are measuring what we want to measure, and to understand their properties.
This paper concerns evaluation measures for Ordinal Classification (OC) and Ordinal Quantification (OQ) tasks. In an OC task, the classes are ordinal, not nominal. For example, Task 4 (Sentiment Analysis in Twitter) Subtask C in SemEval-2016/2017 is defined as: given a set of tweets about a particular topic, estimate the sentiment conveyed by each tweet towards the topic on a five-point scale (highly negative, negative, neutral, positive, highly positive) (Nakov et al., 2016; Rosenthal et al., 2017). On the other hand, an OQ task involves a gold distribution of labels over ordinal classes and the system's estimated distribution. For example, Task 4 Subtask E of the SemEval-2016/2017 workshops is defined as: given a set of tweets about a particular topic, estimate the distribution of the tweets across the five ordinal classes already mentioned above (Nakov et al., 2016; Rosenthal et al., 2017). The Dialogue Breakdown Detection Challenge (Higashinaka et al., 2017) and the Dialogue Quality subtasks of the NTCIR-14 Short Text Conversation and the NTCIR-15 Dialogue Evaluation (Zeng et al., 2020) tasks are also OQ tasks.

Clearly, evaluation measures for OC and OQ tasks should take the ordinal nature of the classes into account. For example, in OC, when a highly positive item is misclassified as highly negative, that should be penalised more heavily than when it is misclassified as positive. Surprisingly, however, there are only a small number of known evaluation measures that meet this requirement. In the present study, we use data from the SemEval and NTCIR communities to clarify the properties of nine evaluation measures in the context of OC tasks, and six measures in the context of OQ tasks. Some of these measures satisfy the aforementioned basic requirement for ordinal classes; others do not.
Section 2 discusses prior art. Section 3 provides formal definitions of the measures we examine, as this is of utmost importance for reproducibility. Section 4 describes the data we use to evaluate the measures. Sections 5 and 6 report on the results on the OC and OQ measures, respectively. Finally, Section 7 concludes this paper.
Prior Art

Evaluating Ordinal Classification
As we have mentioned in Section 1, Task 4 Subtask C of the SemEval-2016/2017 workshops is an OC task with five ordinal classes (Nakov et al., 2016; Rosenthal et al., 2017). While SemEval also features other OC tasks with fewer classes (e.g., Task 4 Subtask A from the same years, with three classes), we use the Subtask C data, as having more classes should enable us to see more clearly the difference between measures that consider ordinal classes and those that do not. (Note that if there are only two classes, OC reduces to nominal classification.) Subtask C used two evaluation measures that consider the ordinal nature of the classes: macroaveraged Mean Absolute Error (MAE^M) and the standard Mean Absolute Error (MAE^µ) (Baccianella et al., 2009).
At ACL 2020, Amigó et al. (2020) proposed a measure specifically designed for OC, called the Closeness Evaluation Measure (CEM_ORD), and discussed its axiomatic properties. Their meta-evaluation experiments primarily focused on comparing it with other measures in terms of how each measure agrees simultaneously with all of a set of preselected "gold" measures. However, while their results showed that CEM_ORD is similar to all of these gold measures, the outcome may differ if a different set of gold measures is chosen. Indeed, in the context of evaluating information retrieval evaluation measures, it has been demonstrated that a similar meta-evaluation approach called unanimity (Amigó et al., 2018) depends heavily on the choice of gold measures. Moreover, while Amigó et al. (2020) reported that CEM_ORD also performs well in terms of consistency of system rankings across different data (which they refer to as "robustness"), experimental details were not provided in their paper. Hence, to complement their work, the present study conducts extensive and reproducible experiments for OC measures. Our OC meta-evaluation experiments cover nine measures, including MAE^M, MAE^µ, and CEM_ORD.

Evaluating Ordinal Quantification
As we have mentioned in Section 1, Task 4 Subtask E of the SemEval-2016/2017 workshops is an OQ task with five ordinal classes (Nakov et al., 2016; Rosenthal et al., 2017). Subtask E used Earth Mover's Distance (EMD), remarking that this is "currently the only known measure for ordinal quantification" (Nakov et al., 2016; Rosenthal et al., 2017). Subsequently, however, Sakai (2018a) proposed a new suite of OQ measures based on Order-aware Divergence (OD), and compared them with Normalised Match Distance (NMD), a normalised version of EMD. Sakai utilised data from the Third Dialogue Breakdown Detection Challenge (DBDC3) (Higashinaka et al., 2017), which features three ordinal classes, and showed that his Root Symmetric Normalised OD (RSNOD) measure behaves similarly to NMD. However, his experiments relied on the run submission files from his own team, as he did not have access to the entire set of DBDC3 submission files. On the other hand, the organisers of DBDC3 (Tsunomori et al., 2020) compared RSNOD, NMD, and the official measures of DBDC (namely, Mean Squared Error and Jensen-Shannon Divergence, which ignore the ordinal nature of the classes) using all the run submission files from DBDC3. They reported that RSNOD was the overall winner in terms of system ranking consistency and discriminative power, i.e., the ability of a measure to obtain many statistically significant differences (Sakai, 2006, 2007, 2014). In addition to the aforementioned two Subtask E data sets from SemEval, the present study utilises three data sets from the Dialogue Quality (DQ) Subtasks of the recent NTCIR-15 Dialogue Evaluation (DialEval-1) Task (Zeng et al., 2020). Each DQ subtask is defined as: given a helpdesk-customer dialogue, estimate the probability distribution over the five-point Likert-scale Dialogue Quality ratings (see Section 4). Our OQ meta-evaluation experiments cover six measures, including NMD and RSNOD.

Classification Measures
In the OC tasks of SemEval-2016/2017, a set of topics was given to the participating systems, where each topic is associated with N tweets. (N varies across topics.) Given a set C of ordinal classes represented by consecutive integers, each OC system yields a |C| × |C| confusion matrix for each topic. From this, we can calculate evaluation measures described below. Finally, the systems are evaluated in terms of mean scores over the topic set.
Let $c_{ij}$ denote the number of items (e.g., tweets) whose true class is $j$, classified by the system into Class $i$. Let $c_{i\bullet} = \sum_{j \in C} c_{ij}$, $c_{\bullet j} = \sum_{i \in C} c_{ij}$, and $N = \sum_{i \in C} \sum_{j \in C} c_{ij}$. Furthermore, let $C^{+} = \{ j \in C \mid c_{\bullet j} > 0 \}$. That is, $C^{+}$ is the set of gold classes that are not empty. We compute the MAEs as follows:

$$\mathit{MAE}^{M} = \frac{1}{|C^{+}|} \sum_{j \in C^{+}} \frac{1}{c_{\bullet j}} \sum_{i \in C} c_{ij}\,|i-j| \quad (1)$$

$$\mathit{MAE}^{\mu} = \frac{1}{N} \sum_{i \in C} \sum_{j \in C} c_{ij}\,|i-j| \quad (2)$$
Unlike the original formulation of MAE^M by Baccianella et al. (2009), ours explicitly handles cases where there are empty gold classes (i.e., $j$ s.t. $c_{\bullet j} = 0$). Empty gold classes actually do exist in the SemEval data used in our experiments. It is clear from the weights used above ($|i-j|$) that the MAEs assume equidistance, although this is not guaranteed for ordinal classes. Hence Amigó et al. (2020) propose the following alternative:

$$\mathit{CEM}_{\mathit{ORD}} = \frac{\sum_{i \in C} \sum_{j \in C} c_{ij}\,\mathit{prox}_{ij}}{\sum_{j \in C} c_{\bullet j}\,\mathit{prox}_{jj}} \quad (3)$$

where $\mathit{prox}_{ij} = -\log_{2}(\max\{0.5, K_{ij}\}/N)$, and $K_{ij}$ is the number of gold items whose classes lie between Classes $i$ and $j$ (inclusive), with the items of the gold class $j$ counted by half:

$$K_{ij} = \frac{c_{\bullet j}}{2} + \sum_{\min(i,j) \le k \le \max(i,j),\; k \ne j} c_{\bullet k} \quad (4)$$

Our formulation of $\mathit{prox}_{ij}$ with a max operator ensures that it is a finite value even if $K_{ij} = 0$.

We also consider Weighted κ (Cohen, 1968). We first compute the expected agreements when the system and gold labels are independent: $e_{ij} = c_{i\bullet} c_{\bullet j}/N$. Weighted κ is then defined as:

$$\kappa = 1 - \frac{\sum_{i \in C} \sum_{j \in C} w_{ij}\, c_{ij}}{\sum_{i \in C} \sum_{j \in C} w_{ij}\, e_{ij}} \quad (5)$$

where $w_{ij}$ is a predefined weight for penalising misclassification. In the present study, we follow the approach of the MAEs (Eqs. 1-2) and consider Linear Weighted κ: $w_{ij} = |i-j|$. However, it should be noted here that κ is not useful if the OC task involves baseline systems such as the ones included in the aforementioned SemEval tasks: that is, a system that always returns Class 1, a system that always returns Class 2, and so on. It is easy to mathematically prove that κ returns a zero for all topics for all such baseline systems: if the system always returns Class $a$, then $c_{aj} = c_{\bullet j}$ and $e_{aj} = c_{a\bullet} c_{\bullet j}/N = c_{\bullet j}$ for all $j$ (and all other rows are zero), so the numerator and denominator of Eq. 5 coincide. We also consider applying Krippendorff's α (Krippendorff, 2018) to OC tasks. The α is a measure of data label reliability, and can handle any type of classes by plugging in an appropriate distance function. Instead of the $|C| \times |C|$ confusion matrix, the α requires a $|C| \times N$ class-by-item matrix that contains label counts $n_i(u)$, representing the number of labels which say that item $u$ belongs to Class $i$. For an OC task, $n_i(u) = 2$ if both the gold and system labels for $u$ are $i$; $n_i(u) = 1$ if either the gold or system label (but not both) for $u$ is $i$; $n_i(u) = 0$ if neither label says $u$ belongs to $i$.
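To make these definitions concrete, the following sketch computes MAE^M, MAE^µ, and Linear Weighted κ from a confusion matrix. This is our own illustrative code (function names and layout are not from the cited works); `c[i][j]` holds the number of items of gold class `j` classified into class `i`.

```python
import numpy as np

def mae_measures(c):
    """Return (MAE^M, MAE^mu) for a confusion matrix c[i][j]
    (gold class j classified into class i)."""
    c = np.asarray(c, dtype=float)
    idx = np.arange(c.shape[0])
    w = np.abs(idx[:, None] - idx[None, :])   # w[i][j] = |i - j|
    col = c.sum(axis=0)                       # gold marginals c_{.j}
    gold = np.nonzero(col > 0)[0]             # C+: non-empty gold classes
    mae_m = np.mean([(w[:, j] * c[:, j]).sum() / col[j] for j in gold])
    mae_mu = (w * c).sum() / c.sum()
    return mae_m, mae_mu

def linear_weighted_kappa(c):
    """Weighted kappa with w_{ij} = |i - j|."""
    c = np.asarray(c, dtype=float)
    idx = np.arange(c.shape[0])
    w = np.abs(idx[:, None] - idx[None, :])
    # expected counts under independence of system and gold labels
    e = np.outer(c.sum(axis=1), c.sum(axis=0)) / c.sum()
    return 1.0 - (w * c).sum() / (w * e).sum()
```

For a baseline that always returns one class (e.g., the 2 × 2 matrix `[[3, 2], [0, 0]]`), `linear_weighted_kappa` returns 0, illustrating the limitation discussed above.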
Thus, this matrix ignores which labels are from the gold data and which are from the system.
For comparing two complete sets of labels (one from the gold data and the other from the system), the definition of Krippendorff's α is relatively simple. Let $n_i = \sum_u n_i(u)$; this is the total number of labels that Class $i$ received from the two sets of labels. The observed coincidence for Classes $i$ and $j$ ($i, j \in C$, $i \ne j$) is given by $O_{ij} = \sum_u n_i(u) n_j(u)$, while the expected coincidence is given by $E_{ij} = n_i n_j/(2N-1)$. The α is defined as:

$$\alpha = 1 - \frac{\sum_{i \in C} \sum_{j \ne i} O_{ij}\,\delta^{2}_{ij}}{\sum_{i \in C} \sum_{j \ne i} E_{ij}\,\delta^{2}_{ij}} \quad (6)$$

where, for ordinal data,

$$\delta^{2}_{ij} = \Bigl(\sum_{k=\min(i,j)}^{\max(i,j)} n_k - \frac{n_i + n_j}{2}\Bigr)^{2} \quad (7)$$

and for interval data, $\delta^{2}_{ij} = |i-j|^{2}$ (Krippendorff, 2018). We shall refer to these two versions of α as α-ORD and α-INT, respectively. Unlike κ, the α's can evaluate the aforementioned baseline systems without any problems.
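A minimal sketch of this computation for exactly two complete label sets (our own code, not an official implementation; the `metric` argument selects between the ordinal and interval distance functions):

```python
import numpy as np

def krippendorff_alpha(gold, system, n_classes, metric="ordinal"):
    """Krippendorff's alpha for two label sets over classes 0..n_classes-1."""
    N = len(gold)
    n = np.zeros((n_classes, N))          # n[i][u]: labels saying item u is class i
    for u, (g, s) in enumerate(zip(gold, system)):
        n[g, u] += 1
        n[s, u] += 1
    n_i = n.sum(axis=1)                   # total labels per class
    O = n @ n.T                           # observed coincidences
    E = np.outer(n_i, n_i) / (2 * N - 1)  # expected coincidences
    d2 = np.zeros((n_classes, n_classes)) # squared class distances (diagonal stays 0)
    for i in range(n_classes):
        for j in range(n_classes):
            if i == j:
                continue
            if metric == "interval":
                d2[i, j] = (i - j) ** 2
            else:                         # ordinal distance based on label counts
                lo, hi = min(i, j), max(i, j)
                d2[i, j] = (n_i[lo:hi + 1].sum() - (n_i[i] + n_i[j]) / 2) ** 2
    return 1.0 - (O * d2).sum() / (E * d2).sum()
```

Perfect agreement yields α = 1, and systematic disagreement yields a negative α, for both the ordinal and interval variants.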
The three measures defined below ignore the ordinal nature of the classes. That is, they are axiomatically incorrect as OC evaluation measures.
First, let us consider two different definitions of "Macro F1" found in the literature (Opitz and Burst, 2019); to avoid confusion, we give them different names in this paper. For each $j \in C^{+}$, let $\mathit{Prec}_j = c_{jj}/c_{j\bullet}$ if $c_{j\bullet} > 0$, and $\mathit{Prec}_j = 0$ if $c_{j\bullet} = 0$ (i.e., the system never chooses Class $j$). Let $\mathit{Rec}_j = c_{jj}/c_{\bullet j}$. Also, for any nonnegative values $p$ and $r$, let $f1(p, r) = 2pr/(p+r)$ if $p + r > 0$, and let $f1(p, r) = 0$ if $p = r = 0$. Then:

$$\mathit{F1}^{M} = \frac{1}{|C^{+}|} \sum_{j \in C^{+}} f1(\mathit{Prec}_j, \mathit{Rec}_j) \quad (8)$$

$$\mathit{HMPR} = f1\Bigl(\frac{1}{|C^{+}|} \sum_{j \in C^{+}} \mathit{Prec}_j,\; \frac{1}{|C^{+}|} \sum_{j \in C^{+}} \mathit{Rec}_j\Bigr) \quad (9)$$

HMPR stands for the Harmonic mean of Macroaveraged Precision and macroaveraged Recall. Opitz and Burst (2019) recommend what we call F1^M over what we call HMPR. Again, note that our formulations use $C^{+}$ to clarify that empty gold classes are ignored. Finally, we also consider Accuracy:

$$\mathit{Accuracy} = \frac{1}{N} \sum_{j \in C} c_{jj} \quad (10)$$

From Eqs. 2 and 10, it is clear that MAE^µ and Accuracy ignore class imbalance (Baccianella et al., 2009), unlike the other measures.
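These nominal measures can be sketched as follows (again our own illustrative code, reusing the confusion-matrix convention `c[i][j]` from above):

```python
import numpy as np

def f1(p, r):
    """Harmonic mean of p and r, defined as 0 when both are 0."""
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

def nominal_measures(c):
    """Return (F1^M, HMPR, Accuracy) for a confusion matrix c[i][j]."""
    c = np.asarray(c, dtype=float)
    row = c.sum(axis=1)                  # system marginals c_{j.}
    col = c.sum(axis=0)                  # gold marginals c_{.j}
    gold = np.nonzero(col > 0)[0]        # C+: non-empty gold classes
    prec = [c[j, j] / row[j] if row[j] > 0 else 0.0 for j in gold]
    rec = [c[j, j] / col[j] for j in gold]
    f1_m = np.mean([f1(p, r) for p, r in zip(prec, rec)])  # mean of per-class F1
    hmpr = f1(np.mean(prec), np.mean(rec))                 # F1 of macroaveraged P and R
    accuracy = np.trace(c) / c.sum()
    return f1_m, hmpr, accuracy
```

The two "Macro F1" variants coincide on a perfect diagonal matrix but generally differ, which is precisely why we keep the two names apart.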

Quantification Measures
In an OQ task, a comparison of an estimated distribution and the gold distribution over $|C|$ ordinal classes yields one effectiveness score, as described below. The systems are then evaluated by mean scores over the test instances, e.g., topics (Nakov et al., 2016; Rosenthal et al., 2017) or dialogues (Zeng et al., 2020). Let $p_i$ denote the estimated probability for Class $i$, so that $\sum_{i \in C} p_i = 1$. Similarly, let $p^{*}_i$ denote the true probability. We also denote the entire probability distributions by $p$ and $p^{*}$, respectively. Let $cp_i = \sum_{k \le i} p_k$, and $cp^{*}_i = \sum_{k \le i} p^{*}_k$. Normalised Match Distance (NMD), used in the NTCIR Dialogue Quality Subtasks (Zeng et al., 2020), is given by (Sakai, 2018a):

$$\mathit{NMD}(p, p^{*}) = \frac{1}{|C|-1} \sum_{i \in C} |cp_i - cp^{*}_i| \quad (11)$$

This is simply a normalised version of the EMD used in the OQ tasks of SemEval (see Section 2.2) (Nakov et al., 2016; Rosenthal et al., 2017). We also consider two measures that can handle OQ tasks from Sakai (2018a). First, a Distance-Weighted sum of squares for Class $i$ is defined as:

$$DW_i = \sum_{j \in C} |i-j| (p_j - p^{*}_j)^{2} \quad (12)$$

Note that the above assumes equidistance. Let $C^{*} = \{ i \in C \mid p^{*}_i > 0 \}$. That is, $C^{*}$ is the set of classes with a positive gold probability. Order-aware Divergence is defined as:

$$OD(p \| p^{*}) = \frac{1}{|C^{*}|} \sum_{i \in C^{*}} DW_i \quad (13)$$

with its symmetric version $SOD(p, p^{*}) = (OD(p \| p^{*}) + OD(p^{*} \| p))/2$. Root (Symmetric) Normalised Order-aware Divergence is defined as:

$$\mathit{RNOD}(p, p^{*}) = \sqrt{\frac{OD(p \| p^{*})}{|C|-1}} \quad (14)$$

$$\mathit{RSNOD}(p, p^{*}) = \sqrt{\frac{SOD(p, p^{*})}{|C|-1}} \quad (15)$$

The other three measures defined below ignore the ordinal nature of the classes (Sakai, 2018a); they are axiomatically incorrect as OQ measures. Normalised Variational Distance (NVD) is essentially the Mean Absolute Error (MAE):

$$\mathit{NVD}(p, p^{*}) = \frac{1}{2} \sum_{i \in C} |p_i - p^{*}_i| \quad (16)$$

Root Normalised Sum of Squares (RNSS) is essentially the Root Mean Squared Error (RMSE):

$$\mathit{RNSS}(p, p^{*}) = \sqrt{\frac{\sum_{i \in C} (p_i - p^{*}_i)^{2}}{2}} \quad (17)$$

The advantages of RMSE over MAE are discussed in Chai and Draxler (2014). The Kullback-Leibler divergence (KLD) for system and gold probability distributions over classes is given by:

$$\mathit{KLD}(p \| p^{*}) = \sum_{i \in C} p_i \log_{2} \frac{p_i}{p^{*}_i} \quad (18)$$

As this is undefined if $p^{*}_i = 0$, we use the more convenient Jensen-Shannon divergence (JSD) instead, which is symmetric (Lin, 1991):

$$\mathit{JSD}(p, p^{*}) = \frac{1}{2} \mathit{KLD}\Bigl(p \,\Big\|\, \frac{p+p^{*}}{2}\Bigr) + \frac{1}{2} \mathit{KLD}\Bigl(p^{*} \,\Big\|\, \frac{p+p^{*}}{2}\Bigr) \quad (19)$$
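Under our reading of the definitions above (the normalisation constants and function names below are our own rendering of Sakai (2018a), not a reference implementation), the OQ measures can be sketched as:

```python
import numpy as np

def nmd(p, ps):
    """Normalised Match Distance between system p and gold ps."""
    p, ps = np.asarray(p, float), np.asarray(ps, float)
    return np.abs(np.cumsum(p) - np.cumsum(ps)).sum() / (len(p) - 1)

def rnod(p, ps, symmetric=False):
    """RNOD; with symmetric=True, RSNOD."""
    p, ps = np.asarray(p, float), np.asarray(ps, float)
    idx = np.arange(len(p))
    def od(a, b):                        # order-aware divergence (b plays the gold role)
        support = np.nonzero(b > 0)[0]   # C*: classes with positive gold probability
        return np.mean([(np.abs(idx - i) * (a - b) ** 2).sum() for i in support])
    d = (od(p, ps) + od(ps, p)) / 2 if symmetric else od(p, ps)
    return np.sqrt(d / (len(p) - 1))

def nvd(p, ps):
    return np.abs(np.asarray(p, float) - np.asarray(ps, float)).sum() / 2

def rnss(p, ps):
    return np.sqrt(((np.asarray(p, float) - np.asarray(ps, float)) ** 2).sum() / 2)

def jsd(p, ps):
    p, ps = np.asarray(p, float), np.asarray(ps, float)
    m = (p + ps) / 2
    def kld(a, b):                       # KL divergence; 0 log 0 treated as 0
        mask = a > 0
        return (a[mask] * np.log2(a[mask] / b[mask])).sum()
    return (kld(p, m) + kld(ps, m)) / 2
```

All of these return 0 for identical distributions and 1 when the system and gold distributions are point masses on the two opposite extreme classes, which makes their scores directly comparable in range.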
Table 2: System ranking similarity in terms of Kendall's τ for each OC task. Correlation strengths are visualised in colour (τ ≥ 0.8, 0.6 ≤ τ < 0.8, and τ < 0.6) to clarify the trends.

Data

Table 1 provides an overview of the SemEval and NTCIR task data that we leveraged for our OC and OQ meta-evaluation experiments. From SemEval-2016/2017 Task 4 (Sentiment Analysis in Twitter) (Nakov et al., 2016; Rosenthal et al., 2017), we chose Subtask C as our OC tasks, and Subtask E as our OQ tasks, for the reason given in Section 2.1 (we do not use the Arabic data from 2017, as only two runs were submitted to Subtasks C and E (Rosenthal et al., 2017)). Moreover, for the OQ meta-evaluation experiments, we also utilise the DQ (Dialogue Quality) subtask data from NTCIR-15 DialEval-1 (Zeng et al., 2020). As these subtasks require participating systems to estimate three different dialogue quality score distributions, namely, A-score (task accomplishment), E-score (dialogue effectiveness), and S-score (customer satisfaction), we shall refer to the subtasks as DQ-A, DQ-E, and DQ-S hereafter. We utilise both Chinese and English DQ runs for our OQ meta-evaluation (22 runs in total), as the NTCIR task evaluates all runs using gold distributions that are based on the Chinese portion of the parallel dialogue corpus (Zeng et al., 2020). As the three NTCIR data sets are larger than the two SemEval data sets both in terms of sample size and the number of systems, we shall focus on the OQ meta-evaluation results with the NTCIR data; the results with Sem16T4E and Sem17T4E can be found in the Appendix.

Meta-evaluation with Ordinal Classification Tasks

System Ranking Similarity

Table 2 shows, for each OC task, the Kendall's τ rank correlation values (Sakai, 2014) between two system rankings for every pair of measures. We can observe that: (A) the α's, the two "Macro F1" measures (F1^M and HMPR), MAE^M, and κ produce similar rankings; (B) MAE^µ and Accuracy (i.e., the two measures that ignore class imbalance) produce similar rankings, which are drastically different from those of Group A; and (C) CEM_ORD produces a ranking that is substantially different from those of the above two groups, although it is closer to those of Group A.

Table 3: System ranking consistency for the OC tasks. / /♠/♣/♥/♦/‡/† means "statistically significantly outperforms the worst 8/7/6/5/4/3/2/1 measure(s)," respectively. V_E2 is the residual variance computed from each 1000 × 9 trial-by-measure matrix of τ scores, which can be used for computing effect sizes. For example, from Part (a), the effect size for the difference between α-ORD and CEM_ORD can be computed as (0.962 − 0.806)/√0.00211 = 3.40.
In fact, the system rankings according to MAE^M and MAE^µ were completely different even in the official results. For example, in the 2016 results (Table 12 in Nakov et al. (2016)), while the baseline run that always returns neutral is ranked at 10 among the 12 runs according to MAE^M, the same run is ranked at the top according to MAE^µ. Similarly, in the 2017 results (Table 10 in Rosenthal et al. (2017)), a run ranked at 10 (tied with another run) among the 20 runs according to MAE^M is ranked at the top according to MAE^µ. Our results shown in Table 2 generalise these known discrepancies between the rankings.

System Ranking Consistency
For each measure, we evaluate its system ranking consistency (or "robustness" (Amigó et al., 2020)) across two topic sets as follows (Sakai, 2021): (1) randomly split the topic set in half, produce two system rankings based on the mean scores over each topic subset, and compute a Kendall's τ score for the two rankings; (2) repeat the above 1,000 times and compute the mean τ; (3) conduct a randomised paired Tukey HSD test at α = 0.05 with 5,000 trials on the mean τ scores to discuss statistical significance.

Table 3 (a) and (c) show the consistency results with the OC tasks. For example, Part (a) shows that when the 100 topics of Sem16T4C were randomly split in half 1,000 times, κ statistically significantly outperformed all other measures, as indicated by the corresponding symbol in the table. Table 3 (b) and (d) show variants of these experiments where only 10 topics are used in each topic subset, to discuss the robustness of the measures to small sample sizes. If we take the averages of (a) and (c), the top three measures are the two α's and κ, while the worst two measures are CEM_ORD and Accuracy; we obtain the same result if we take the averages of (b) and (d). Thus, although Amigó et al. (2020) reported that CEM_ORD performed well in terms of "robustness," this is not confirmed in our experiments.
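Steps (1) and (2) of this procedure can be sketched as follows (our own code; the Tukey HSD step is omitted, and the tie-free Kendall's τ is implemented directly to keep the sketch self-contained):

```python
import numpy as np

def kendall_tau(x, y):
    """Kendall's tau for two score vectors (no tie correction)."""
    n = len(x)
    s = sum(np.sign((x[i] - x[j]) * (y[i] - y[j]))
            for i in range(n) for j in range(i + 1, n))
    return s / (n * (n - 1) / 2)

def ranking_consistency(scores, n_trials=1000, seed=0):
    """scores: (n_systems, n_topics) matrix of per-topic scores for one measure.
    Returns the mean tau over random half-splits of the topic set."""
    rng = np.random.default_rng(seed)
    n_topics = scores.shape[1]
    taus = []
    for _ in range(n_trials):
        perm = rng.permutation(n_topics)
        r1 = scores[:, perm[:n_topics // 2]].mean(axis=1)  # ranking from first half
        r2 = scores[:, perm[n_topics // 2:]].mean(axis=1)  # ranking from second half
        taus.append(kendall_tau(r1, r2))
    return float(np.mean(taus))
```

A measure whose per-topic scores induce the same system ordering on any topic subset obtains a mean τ of 1; noisier measures obtain lower values.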
Recall that κ has a practical inconvenience: it cannot distinguish between baseline runs that always return the same class. While SemEval16T4C contains one such run (which always returns neutral), SemEval17T4C contains as many as five such runs (each always returning one of the five ordinal classes). This is probably why κ performs well in Table 3(a) and (b) but not in (c) and (d).

Discriminative Power
In the information retrieval research community, discriminative power (Sakai, 2006, 2007, 2014, 2020) has been used to compare evaluation measures from a statistical viewpoint. Given a set of systems, a p-value for the difference in means is obtained for every system pair (preferably with a multiple comparison procedure (Sakai, 2018b)); highly discriminative measures are those that can obtain many small p-values. While highly discriminative measures are not necessarily correct, we do want measures to be sufficiently discriminative so that we can draw some useful conclusions from experiments. Again, we use randomised paired Tukey HSD tests with 5,000 trials for obtaining the p-values.

Figure 1 shows the discriminative power curves for the OC tasks. Curves that are closer to the origin (i.e., those with small p-values for many system pairs) are considered good. We can observe that: (i) CEM_ORD, Accuracy, MAE^M, and MAE^µ are the least discriminative measures in both tasks; (ii) among the other measures that perform better, κ performs consistently well. Again, the fact that κ distinguishes itself from the others in the SemEval16T4C results probably reflects the fact that this data set contains only one run that always returns the same class, which cannot be handled properly by κ.

Recommendations for OC Tasks

Table 4 summarises the properties of the nine measures we examined in the context of OC tasks. Column (IV) shows that, for example, the Group A measures produce similar rankings. Based on this table, we recommend (Linear Weighted) κ as the primary measure for OC tasks if the tasks do not involve multiple baseline runs that always return the same class. Such runs are unrealistic, so this limitation may not be a major problem. On the other hand, if the tasks do involve such baseline runs (as in SemEval), we recommend α-ORD as the primary measure. In either case, it would be good to use both κ and α-ORD to examine OC systems from multiple angles. According to our consistency and discriminative power experiments, using α-INT instead of α-ORD (i.e., assuming equidistance) does not seem beneficial for OC tasks.

Table 5: System ranking similarity in terms of Kendall's τ for each OQ task (NTCIR). Correlation strengths are visualised in colour (τ ≥ 0.9, 0.8 ≤ τ < 0.9, and τ < 0.8) to clarify the trends.
Meta-evaluation with Ordinal Quantification Tasks

System Ranking Similarity

Table 5 shows, for each OQ task from NTCIR, the Kendall's τ between two system rankings for every pair of measures. It is clear from the "NMD" column that NMD is an outlier among the six measures. In other words, among the axiomatically correct measures for OQ tasks, RNOD and RSNOD are the ones that produce rankings similar to those produced by well-known measures such as JSD and NVD (i.e., normalised MAE; see Eq. 16). Also, in Table 5(I) and (III), it can be observed that the ranking by RSNOD lies somewhere between that by NMD (let us call it "Group X") and those by the other measures ("Group Y"). However, this is not true in Table 5(II), nor with our SemEval results (see Appendix Table 8).

Table 6: System ranking consistency for the OQ tasks (NTCIR). ♣/♥/♦/‡/† means "statistically significantly outperforms the worst 5/4/3/2/1 measure(s)," respectively. V_E2 is the residual variance computed from each 1000 × 6 trial-by-measure matrix of τ scores, which can be used for computing effect sizes. For example, from Part (a), the effect size for the difference between RNOD and NMD can be computed as (0.909 − 0.717)/√0.00130 = 5.33 (i.e., over five standard deviations apart).

System Ranking Consistency

Table 6 shows the system ranking consistency results with the OQ tasks from NTCIR. These experiments were conducted as described in Section 5.2. If we take the averages of (a), (c), and (e) (i.e., experiments where the 300 dialogues are split in half), the worst measure is NMD, followed by RSNOD. We obtain the same result if we take the averages of (b), (d), and (f) (i.e., experiments where two disjoint sets of 10 dialogues are used). Hence, among the axiomatically correct measures for OQ tasks, RNOD appears to be the best in terms of system ranking consistency, and introducing symmetry (compare Eqs. 14 and 15) may not be a good idea from a statistical stability point of view. Note that, for comparing a system distribution with a gold distribution, symmetry is not a requirement.

Discriminative Power

Figure 2 shows the discriminative power curves for the OQ tasks from NTCIR. We can observe that: (i) NMD performs extremely poorly in (I) and (III), which is consistent with the full-split consistency results in Table 6(a) and (e); (ii) RNOD outperforms RSNOD in (I) and (III). Although RSNOD appears to perform well in (II), if we consider the 5% significance level (i.e., 0.05 on the y-axis), the number of statistically significantly different pairs (out of 231) is 117 for RNOD, 116 for RSNOD, NMD, and NVD, and 115 for RNSS and JSD. That is, RNOD performs well in (II) also. These results also suggest that introducing symmetry to RNOD (i.e., using RSNOD instead) is not beneficial.

Conclusions
We conducted extensive evaluations of nine measures in the context of OC tasks and six measures in the context of OQ tasks, using data from SemEval and NTCIR. As we have discussed in Sections 5 and 6, our recommendations are as follows.
OC tasks Use (Linear Weighted) κ as the primary measure if the task does not involve multiple runs that always return the same class (e.g., one that always returns Class 1, another that always returns Class 2, etc.). Otherwise, use α-ORD (i.e., Krippendorff's α for ordinal classes) as the primary measure. In either case, use both measures.
OQ tasks Use RNOD as the primary measure, and NMD as a secondary measure.
All of our evaluation measure score matrices are available from https://waseda.box.com/ACL2021PACKOCOQ, to help researchers reproduce our work. Among the above recommended measures, recall that Linear Weighted κ and RNOD assume equidistance (i.e., they rely on w_ij = |i − j|), while α-ORD and NMD do not. Hence, if researchers want to avoid relying on the equidistance assumption (i.e., satisfy the ordinal invariance property (Amigó et al., 2020)), α-ORD can be used for OC tasks and NMD can be used for OQ tasks. However, we do not see relying on equidistance as a practical problem. For example, note that Linear Weighted κ is just an instance of the Weighted κ family: if necessary, the weight w_ij can be set for each pair of Classes i and j according to practical needs. Similarly, w_ij = |i − j| (Eq. 12) for RNOD (and other equidistance-based measures) may be replaced with a different weighting scheme (e.g., something similar to the prox_ij weights of CEM_ORD) if need be.
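As an illustration of this flexibility, the sketch below (our own code, with an invented confusion matrix) swaps the linear weights for quadratic ones in a Weighted κ; any task-specific w_ij matrix could be plugged in the same way:

```python
import numpy as np

def weighted_kappa(c, w):
    """Weighted kappa for confusion matrix c and penalty weight matrix w."""
    c = np.asarray(c, dtype=float)
    e = np.outer(c.sum(axis=1), c.sum(axis=0)) / c.sum()  # expected counts
    return 1.0 - (w * c).sum() / (w * e).sum()

idx = np.arange(3)
linear = np.abs(idx[:, None] - idx[None, :])  # w_ij = |i - j|
quadratic = linear ** 2                       # w_ij = |i - j|^2

c = [[8, 1, 0],
     [1, 7, 2],
     [0, 1, 10]]
k_lin = weighted_kappa(c, linear)       # penalises all misclassifications linearly
k_quad = weighted_kappa(c, quadratic)   # penalises distant misclassifications more
```

Because this illustrative matrix contains only near-diagonal errors, the quadratic weighting rewards it more than the linear one; a matrix with many distant misclassifications would show the opposite effect.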
Our final and general remark is that it is of utmost importance for researchers to understand the properties of evaluation measures and ensure that they are appropriate for a given task. Our future work includes evaluating and understanding evaluation measures for tasks other than OC and OQ.

Appendix
For completeness, this appendix reports on the OQ experiments based on SemEval16T4E and SemEval17T4E, which we omitted in the main body of the paper. However, we view the OQ results based on the three NTCIR data sets as more reliable than these additional results, as the SemEval score matrices are much smaller than those from NTCIR (see Table 1). Table 8 shows the system ranking similarity results with SemEval16T4E and SemEval17T4E; this table complements Table 5 in the paper. Table 9 shows the system ranking consistency results with SemEval16T4E and SemEval17T4E; this table complements Table 6 in the paper. Figure 3 shows the discriminative power curves for SemEval16T4E and SemEval17T4E; this figure complements Figure 2 in the paper.

Table 9: System ranking consistency for the OQ tasks (SemEval). ♣/♥/♦/‡/† means "statistically significantly outperforms the worst 5/4/3/2/1 measure(s)," respectively. V_E2 is the residual variance computed from each 1000 × 6 split-by-measure matrix of τ scores, which can be used for computing effect sizes. For example, from (I) Left, the effect size for the difference between JSD and RNOD can be computed as (0.934 − 0.847)/√0.00175 = 2.08 (i.e., about two standard deviations apart).