Predicting cross-linguistic adjective order with information gain

Languages vary in their placement of multiple adjectives before, after, or surrounding the noun, but they typically exhibit strong intra-language tendencies on the relative order of those adjectives (e.g., the preference for 'big blue box' in English, 'grande boîte bleue' in French, and 'alṣundūq al'azraq alkabīr' in Arabic). We advance a new quantitative account of adjective order across typologically-distinct languages based on maximizing information gain. Our model addresses the left-right asymmetry of French-type ANA sequences with the same approach as AAN and NAA orderings, without appeal to other mechanisms. We find that, across 32 languages, the preferred order of adjectives largely mirrors an efficient algorithm of maximizing information gain.


Introduction
Languages that allow multiple sequential adjective modifiers tend to exhibit strong tendencies on the relative order of adjectives, as in 'big blue box' vs. 'blue big box' in English (Dixon, 1982). To date, most of the research on adjective ordering has focused on preferences in pre-nominal languages like English, where adjectives precede the modified noun (Futrell et al., 2020a), or in post-nominal languages like Arabic, where adjectives follow the noun. This research usually posits a metric, such as information locality (Futrell et al., 2020b) or subjectivity (Scontras et al., 2017), which governs the preferred distance between a noun and its adjectives. Because these theories predict only the relative linear distance between noun and adjective, they cannot be straightforwardly applied to mixed languages like French, where adjectives regularly appear both before and after the modified noun, at least not without added assumptions about hierarchical distance (Cinque, 1994). Instead, these mixed languages are often modeled with constraints on which adjective classes or functions can appear before or after a noun (Cinque, 2010; Fox and Thuilier, 2012).
Traditional accounts of adjective ordering in the linguistics literature often assume a tree structure in which the target measure is the hierarchical distance from noun (N) to adjective (A). According to syntactic accounts, ordering regularities are predicted by a universal hierarchy of lexical semantic classes (e.g., color adjectives are hierarchically closer to the modified noun than size adjectives; Cinque, 1994; Scott, 2002). Alternative accounts use aspects of adjective meaning to predict adjective order, making appeal to notions like 'inherentness' (Whorf, 1945) or 'definiteness of denotation' (Martin, 1969). Recently, Scontras et al. (2017) provide experimental evidence that their synthesis of semantic predictors into a continuum based on subjectivity reliably predicts ordering preferences in English; followup studies have found subjectivity to be a reliable predictor in other languages as well, including Tagalog (Samonte and Scontras, 2019), Mandarin (Shi and Scontras, 2020), Arabic, and Spanish (Rosales Jr. and Scontras, 2019). Explanations for the role of subjectivity in adjective ordering show how subjectivity-based orderings are more efficient than alternative orderings, thereby maximizing communicative success (Simonič, 2018; Hahn et al., 2018; Franke et al., 2019).

Other efficiency-based approaches to adjective order quantify efficiency with information-theoretic measures of word distributions such as surprisal or entropy (Cover and Thomas, 2006; Levy, 2008). Models in this vein have a long conceptual history in the field, originating with the idea that semantic closeness between words is reflected in syntactic closeness in a surface realization (Sweet, 1900; Jespersen, 1922; Behaghel, 1932). Modern quantitative incarnations include integration cost (Dyer, 2017) and information locality (Futrell et al., 2020b), both generalizations of the widely-accepted principle of dependency distance minimization (Liu et al., 2017; Temperley and Gildea, 2018).
Crucially, while previous approaches are able to model symmetrical structures within the noun phrase, as in the mirror-image A₁A₂N orders of English and the NA₂A₁ orders of Arabic, a hierarchical approach cannot model the left-right asymmetry of Romance A₁NA₂ without an appeal to other mechanisms.
We advance an information-theoretic factor that predicts adjective ordering across the three typological 'templates' of adjective order (pre-nominal AAN, mixed ANA, and post-nominal NAA), based on information gain (IG), a measure of the reduction in uncertainty attained by transforming a dataset. IG is used in machine learning for ordering the nodes of a decision tree (Quinlan, 1986; Norouzi et al., 2015), where nodes are most often ordered in a greedy fashion such that the information gain of each node is maximized. By analogy, we view the noun phrase as a decision tree for reducing a listener's uncertainty about a speaker's intended meaning. Each word acts as a node in the decision tree; preferred adjective orders thus reflect an efficient ordering of nodes.

Empirical background
Empirical investigations of adjective ordering have focused on the cross-linguistic stability of these preferences across a host of unrelated languages (e.g., Dixon, 1982; Hetzron, 1978; Sproat and Shih, 1991). For example, where English speakers prefer 'big blue box' to 'blue big box', Mandarin speakers similarly prefer dà-de lán-de xiāng-zi 'big blue box' to lán-de dà-de xiāng-zi 'blue big box' (Shi and Scontras, 2020). In post-nominal languages, we find the mirror-image of the English pattern, such that adjectives that are preferred closer to the noun in pre-nominal languages are also preferred closer to the noun in post-nominal languages. For example, speakers of Arabic prefer alṣundūq al'azraq alkabīr 'the box blue big' to alṣundūq alkabīr al'azraq 'the box big blue'.
In support of the cross-linguistic stability of adjective ordering preferences, Leung et al. (2020) present a latent-variable model capable of accurately predicting adjective order in 24 languages from seven different language families, achieving a mean accuracy of 78.9% on an average of 1335 sequences per language. Importantly, the model succeeds even when the training and testing languages are different, thus demonstrating that different languages rely on similar preferences. However, Leung et al.'s study was limited to AAN and NAA templates. There has been very little corpus-based empirical work on ordering preferences in the mixed ANA template, where adjectives both precede and follow the modified noun.

While Leung et al. (2020) learn adjective order by training on observed adjective pairs, an alternative strategy is to posit one or more a priori metrics as an underlying motivation for adjective order (e.g., Malouf, 2000, in part). This approach allows for the study of why adjective orders might have come about. To that end, Futrell et al. (2020a) report an accuracy of 72.3% for English triples based on a combination of subjectivity and information-theoretic measures derived from the distribution of adjectives and nouns.
To our knowledge, the current study is the first attempt at predicting adjective order across all three templates, with an eye not only to raw accuracy, but in hopes of illuminating the functional pressures which might contribute to word ordering preferences in general.

Information gain

Picture of communication
We assume that a speaker is trying to communicate a meaning m to a listener, with the meaning represented as a binary vector, where each dimension of the vector corresponds to a feature. Multiple features can be true simultaneously. For example, a speaker might have in mind a vector like the one in Figure 1, which has value 1 in the dimensions for 'is-big' (f₀), 'is-grey' (f₁), and 'is-elephant' (f₂), and 0 for all other features. A meaning of this sort would be conveyed by the noun phrase 'big grey elephant'. We call m a feature vector and the set of feature vectors M.
The listener does not know which meaning m the speaker has in mind; the listener's state of uncertainty can be represented as a probability distribution over all possible feature vectors, P(m), corresponding to the prior probability of encountering a given feature vector. We call this distribution the listener distribution L.
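As a concrete sketch of this setup (the meanings and their frequencies below are invented for illustration), the listener distribution can be estimated as the relative frequency of each feature vector:

```python
from collections import Counter

# Each meaning is the set of features that hold of the referent, i.e. the
# dimensions of its binary vector with value 1 (toy data, for illustration).
meanings = [
    frozenset({"is-big", "is-grey", "is-elephant"}),
    frozenset({"is-big", "is-grey", "is-elephant"}),
    frozenset({"is-small", "is-grey", "is-mouse"}),
    frozenset({"is-big", "is-blue", "is-box"}),
]

# The listener distribution L: prior probability P(m) of each feature vector.
counts = Counter(meanings)
L = {m: c / len(meanings) for m, c in counts.items()}

print(L[frozenset({"is-big", "is-grey", "is-elephant"})])  # 0.5
```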
By conveying information, each word in a sequence causes a change in the listener's prior distribution. Suppose as in Figure 1 that a listener starts with probability distribution L, then hears a word w conveying a feature (f₂), resulting in the new distribution L′. The amount of change from L to L′ is properly measured using the Kullback-Leibler (KL) divergence D_KL[L′ ‖ L] (Cover and Thomas, 2006); this divergence thus measures the amount of information about meaning conveyed by the word.
Another measure of the change induced by a word is the information gain, an extension of KL divergence to include the notion of negative evidence. Let L̄ represent the listener's probability distribution over feature vectors conditional on the negation of w. By taking a weighted sum of the positive and negative KL divergence, we recover information gain (Quinlan, 1986):

    IG(L, w) = (|L′| / |L|) · D_KL[L′ ‖ L] + (|L̄| / |L|) · D_KL[L̄ ‖ L]    (1)

where |L| indicates the number of elements in the support of L with non-zero probability. Information gain represents the information conveyed by a word and also the information conveyed by its negation.
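A minimal implementation of Eq. 1 might look as follows, representing distributions as dicts from feature vectors to probabilities. The helper names are our own, and the example assumes w occurs in some but not all vectors of the support:

```python
import math

def kl(p, q):
    """D_KL[p || q], for distributions given as dicts over feature vectors."""
    return sum(pm * math.log2(pm / q[m]) for m, pm in p.items() if pm > 0)

def partition(L, w):
    """Split L on word/feature w into L' (w present) and L-bar (w absent),
    each renormalized to sum to 1. Assumes w splits the support non-trivially."""
    pos = {m: p for m, p in L.items() if w in m}
    neg = {m: p for m, p in L.items() if w not in m}
    z_pos, z_neg = sum(pos.values()), sum(neg.values())
    return ({m: p / z_pos for m, p in pos.items()},
            {m: p / z_neg for m, p in neg.items()})

def information_gain(L, w):
    """Eq. 1: (|L'|/|L|) D_KL[L'||L] + (|L-bar|/|L|) D_KL[L-bar||L],
    where |.| counts support elements with non-zero probability."""
    pos, neg = partition(L, w)
    n = len(L)
    return (len(pos) / n) * kl(pos, L) + (len(neg) / n) * kl(neg, L)

# Toy check: a feature that splits four equiprobable vectors in half
# conveys one full bit of information gain.
L = {frozenset({"a", "b"}): 0.25, frozenset({"a"}): 0.25,
     frozenset({"b"}): 0.25, frozenset(): 0.25}
print(information_gain(L, "a"))  # 1.0
```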
Below, we discuss how information gain relates to other information-theoretic quantities, and why it is useful for us for predicting adjective order across typological templates.

Relationship to other quantities
Our IG quantity in Eq. 1 is drawn from the ID3 algorithm for generating decision trees (Quinlan, 1986). The goal of ID3 is to produce a classifier for some random variable (call it L) which works by successively evaluating some set of binary features in some order. The optimal order of these features is given by greedily maximizing information gain, where information gain for a feature f is a measure of how much the entropy of L is decreased by partitioning the dataset into positive and negative subsets based on whether f is present or absent.
Our application of information gain to word order comes from treating each word as a binary indicator for the presence or absence of the associated feature, and then applying the ID3 algorithm to determine the optimal order of these features.
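Under that analogy, a greedy ID3-style ordering can be sketched as below. One assumption worth flagging: after each word we descend into the positive partition only (the vectors consistent with what has been said), rather than recursing into both branches as a full ID3 tree would. The function names and toy distribution are illustrative:

```python
import math

def kl(p, q):
    # D_KL[p || q] over dicts from feature vectors to probabilities.
    return sum(pm * math.log2(pm / q[m]) for m, pm in p.items() if pm > 0)

def split(L, w):
    # Partition L on feature w; renormalize each side (empty side -> {}).
    pos = {m: p for m, p in L.items() if w in m}
    neg = {m: p for m, p in L.items() if w not in m}
    zp, zn = sum(pos.values()), sum(neg.values())
    return ({m: p / zp for m, p in pos.items()} if zp else {},
            {m: p / zn for m, p in neg.items()} if zn else {})

def ig(L, w):
    # Information gain as in Eq. 1, with support-size weighting.
    pos, neg = split(L, w)
    return (len(pos) / len(L)) * kl(pos, L) + (len(neg) / len(L)) * kl(neg, L)

def greedy_order(L, words):
    """Order words by greedy IG maximization, descending into the positive
    partition after each choice (single-branch ID3)."""
    order, current, remaining = [], dict(L), list(words)
    while remaining:
        best = max(remaining, key=lambda w: ig(current, w))
        order.append(best)
        remaining.remove(best)
        current, _ = split(current, best)
    return order

# 'a' partitions this skewed toy distribution more decisively than 'b',
# so greedy IG maximization places it first.
L = {frozenset({"a", "b"}): 0.5, frozenset({"a"}): 0.25,
     frozenset({"b"}): 0.125, frozenset(): 0.125}
print(greedy_order(L, ["a", "b"]))  # ['a', 'b']
```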
The first term of Eq. 1, the divergence D_KL[L′ ‖ L], measures the amount of information about L conveyed by the word w and has been the subject of a great deal of study in psycholinguistics. In particular, Levy (2008) shows that if the word w and the context c can be reconstructed perfectly from the updated belief state L′, then the amount of information conveyed by w reduces to nothing other than the surprisal of word w in context c:

    D_KL[L′ ‖ L] = −log P(w | c)

Importantly for our purposes, the positive evidence term D_KL[L′ ‖ L] alone is unlikely to make useful predictions about cross-linguistic word-order preferences, because surprisal is invariant to reversal of word order across a language as a whole (Levy, 2005; Futrell, 2019): the same surprisal values would be measured for a given language and for a language containing all the same sentences in reverse order. Therefore, such metrics are unable to predict any a priori asymmetries in word-order preferences between pre- and post-nominal positions.
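The reversal invariance is easy to illustrate with a toy 'language model' that assigns each sentence its relative frequency in a corpus: reversing every sentence in the corpus leaves each sentence's total surprisal unchanged. The corpus below is invented for illustration:

```python
import math
from collections import Counter

# Toy corpus; P(sentence) is estimated by relative frequency.
corpus = ["big blue box", "big blue box", "blue box", "big box"]

def surprisal(corp, sentence):
    """Total surprisal of a sentence: -log2 P(sentence)."""
    counts = Counter(corp)
    return -math.log2(counts[sentence] / len(corp))

# The same corpus with every sentence reversed word-by-word.
reversed_corpus = [" ".join(reversed(s.split())) for s in corpus]

# Each sentence carries identical total surprisal in the reversed language,
# so a purely surprisal-based metric cannot distinguish the two orders.
for s in set(corpus):
    r = " ".join(reversed(s.split()))
    assert math.isclose(surprisal(corpus, s), surprisal(reversed_corpus, r))
```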

Negative evidence
The new feature of information gain, which has not been presented in previous information-theoretic models of language, is the negative evidence term D_KL[L̄ ‖ L], indicating the change in the listener's belief about L given the negation of the features indicated by word w, a quantity related to extropy (Lad et al., 2015). For example, consider académie militaire 'military academy' in French. Let L represent a listener's belief state after having heard the noun académie 'academy'. Upon hearing the adjective militaire 'military', L is partitioned into L′, the portion of L in which militaire is a feature, and L̄, the portion of L in which militaire is not a feature. Put another way, L̄ is the probability distribution over non-military academies.
The negative evidence portion of information gain is of primary interest to us because it breaks the symmetry under word-order reversal that we would have if we used the positive evidence term alone. Therefore, information gain can predict left-right asymmetrical word-order preferences such as the order of adjectives in ANA templates; it also maps onto a well-known decision rule for ordering the nodes of decision trees.

Data
Our study relies on two types of source data, both extracted from the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies (Ginter et al., 2017), a set of Common Crawl and Wikipedia text data across a variety of languages, automatically parsed according to the Universal Dependencies scheme with UDPipe (Straka and Straková, 2017). First, we extract noun phrases (NPs) containing at least one adjective as the source of feature vectors (§4.3). Second, we extract triples, instances of a noun and two dependent adjectives in any order, where the three words are sequential in the surface order and neither the noun nor the adjectives have any other dependents.
We restrict triples in this way to minimize the effect that other dependents might have on order preferences. For example, while single-word adjectives tend to precede the noun in English, as in 'the nice people', adjectives in larger right-branching phrases often follow: 'the people nice to us' (Matthews, 2014), a trend also seen in Romance. Similarly, conjunctions have been shown to weaken or neutralize preferences (Fox and Thuilier, 2012; Rosales Jr. and Scontras, 2019). NPs and triples extracted from the Wikipedia dumps are used to generate feature vectors and to train our regression (§4.4). We use triples from the Common Crawl dumps to perform hold-out accuracy testing.

Normalization
Because our source data are extracted from dumps of automatically-parsed text, they contain a large amount of noise, such as incorrectly assigned syntactic categories, HTML, nonstandard orthography, and so on. To combat this noise, we extract all lemmas marked as ADJ and NOUN in all Universal Dependencies (UD) v2.7 corpora (Zeman et al., 2020) for a given language (the idea being that the UD corpora are of higher quality) and include only NPs and triples in which the adjectives and nouns are in the UD lists. All characters are case-normalized, where applicable.

Feature vectors
Each NP attested in the Wikipedia corpus for a given language corresponds to a feature vector with value 1 in the dimension associated with each adjective or noun lemma. For example, an NP such as "the best room available" generates a vector containing 1 for 'is-available', 'is-best', and 'is-room'.
The relative count of each NP in the Wikipedia corpus yields a probability distribution on feature vectors. It is this distribution which is transformed by partitioning on each lemma in a triple.
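The construction might be sketched as follows, with invented NP counts standing in for the Wikipedia extraction (real inputs are lemmatized UD parses):

```python
from collections import Counter

# Invented NP counts standing in for the Wikipedia extraction; inputs are
# lemmatized, so 'best' and 'room' become features 'is-best' and 'is-room'.
np_counts = Counter({
    ("best", "available", "room"): 3,
    ("best", "room"): 2,
    ("large", "room"): 5,
})

# Relative counts yield the probability distribution over feature vectors.
total = sum(np_counts.values())
L = {frozenset(f"is-{lemma}" for lemma in np_): c / total
     for np_, c in np_counts.items()}

# Partitioning on a lemma from a triple: here every NP contains 'room',
# so the positive side of the partition keeps all of the mass.
pos = {m: p for m, p in L.items() if "is-room" in m}
print(sum(pos.values()))  # 1.0
```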

Evaluation
For a given typological template (AAN, ANA, or NAA) there are two competing variants; our tasks are to (i) predict which of the variants will be attested in a corpus and (ii) show a cross-linguistic consistency in how that prediction comes about.
Because we are limiting our study to the two competing variants within each template, the position of the noun is invariant, leaving only the relative order of the two adjectives to determine the order of a triple. Our problem thus reduces to whether the information gain of the first linear adjective is greater than that of the second.
In the case of AAN and ANA triples, the IG of each adjective is calculated by partitioning the entire set of feature vectors L on each of the two adjectives. In the case of NAA triples, however, IG is calculated by partitioning only those feature vectors which 'survive' the initial partition by the noun, and are therefore part of L′. Thus we calculate IG(L, a) before the noun and IG(L′, a) after.

Rather than simply implement the ID3 algorithm and choose adjectives based on their raw information gain, we train a logistic regression to predict surface orders based on the difference of IG between the attested first and second adjective, a method previously used by Morgan and Levy (2016) and Futrell et al. (2020a). The benefits of this approach are two-fold: we are able to account for bias in the distribution of adjectival IGs, and we can more easily deconstruct how strong information gain is as a predictor of adjective order.
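The template-dependent IG computation can be sketched as follows, with helper names and toy vectors of our own invention; for NAA triples the base distribution is L′, the renormalized positive partition on the noun:

```python
import math

def kl(p, q):
    # D_KL[p || q] over dicts from feature vectors to probabilities.
    return sum(pm * math.log2(pm / q[m]) for m, pm in p.items() if pm > 0)

def split(L, w):
    # Partition L on feature w; renormalize each side (empty side -> {}).
    pos = {m: p for m, p in L.items() if w in m}
    neg = {m: p for m, p in L.items() if w not in m}
    zp, zn = sum(pos.values()), sum(neg.values())
    return ({m: p / zp for m, p in pos.items()} if zp else {},
            {m: p / zn for m, p in neg.items()} if zn else {})

def ig(L, w):
    # Information gain as in Eq. 1, with support-size weighting.
    pos, neg = split(L, w)
    return (len(pos) / len(L)) * kl(pos, L) + (len(neg) / len(L)) * kl(neg, L)

def adjective_igs(L, noun, a1, a2, template):
    """IG of a triple's two adjectives. For NAA the noun is heard first, so
    IG is computed on L' (the vectors surviving the partition by the noun)."""
    base = split(L, noun)[0] if template == "NAA" else L
    return ig(base, a1), ig(base, a2)

# Toy distribution: once 'is-room' is known, 'is-big' conveys nothing
# (all rooms here are big), while 'is-nice' still partitions L'.
L = {frozenset({"is-room", "is-big", "is-nice"}): 0.25,
     frozenset({"is-room", "is-big"}): 0.25,
     frozenset({"is-box", "is-big"}): 0.25,
     frozenset({"is-box"}): 0.25}

big_naa, nice_naa = adjective_igs(L, "is-room", "is-big", "is-nice", "NAA")
big_aan, _ = adjective_igs(L, "is-room", "is-big", "is-nice", "AAN")
```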
Within each template, for each attested triple τ, let π₁ be the lexicographically-sorted first permutation of τ and π₂ be the second, with α₁ being the first linear adjective in π₁ and α₂ being the first linear adjective in π₂. Our dependent variable p is whether π₁ is attested in the corpus, and our independent variable is the difference between the information gain of α₁ and α₂. We train the coefficients β₀ and β₁ in a logistic regression of the form

    p = σ(β₀ + β₁ · (IG(π₁) − IG(π₂)))

where IG(πᵢ) denotes the information gain of the first linear adjective of πᵢ and σ is the logistic function. A positive value for β₁ tells us that permutations in which the larger-IG adjective is placed first tend to be attested. The value of β₀ tells us whether there is a generalized bias towards a positive or negative IG(π₁) − IG(π₂). The accuracy we achieve by running the logistic regression on held-out testing data tells us the effectiveness of an IG-based algorithm at predicting adjective order.
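The regression's decision rule can be sketched directly; the coefficients below are invented for illustration, not fitted values from our data:

```python
import math

def sigmoid(x):
    # Logistic function mapping a real score to a probability.
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative coefficients only (not fitted values from the paper's data).
beta0, beta1 = 0.1, 18.0

def p_pi1_attested(ig_alpha1, ig_alpha2):
    """P that pi_1 (the lexicographically-first permutation) is attested,
    given the IG of the first linear adjective of each permutation."""
    return sigmoid(beta0 + beta1 * (ig_alpha1 - ig_alpha2))

# With beta1 > 0, the permutation whose first adjective has the larger IG
# is predicted to be the attested order.
assert p_pi1_attested(0.9, 0.4) > 0.5
assert p_pi1_attested(0.2, 0.7) < 0.5
```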

Reporting results
We report results for languages from which at least 5k triples could be analyzed, and for templates representing at least 10% of a language's triples in UD corpora. The count of analyzable triples for each language is a product of those available in the 2017 CoNLL Shared Task, those with sufficiently large UD v2.7 corpora, and those that meet our extraction requirements (§4.1). Because we are interested in exploring a cross-linguistic predictor of adjective order, we report macro-average accuracies and β₁ coefficients. That is, each language's accuracy and coefficient are calculated independently and are then averaged. We report both type- and token-accuracy, using the latter in our analysis based on the intuition that the ordering preference for a commonly-occurring triple is stronger than that for a rarer one.
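The distinction between type and token accuracy can be illustrated with invented prediction results (triple, corpus count, whether the predicted order matched the attested one):

```python
# Invented prediction results: (triple, corpus count, correctly predicted).
results = [
    ("big blue box", 90, True),
    ("tall old tree", 10, False),
]

# Type accuracy weights each distinct triple equally; token accuracy weights
# by corpus frequency, so commonly-occurring triples dominate.
type_acc = sum(ok for _, _, ok in results) / len(results)
token_acc = (sum(c for _, c, ok in results if ok)
             / sum(c for _, c, _ in results))

print(type_acc, token_acc)  # 0.5 0.9
```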

Results
We extracted and analyzed at least 5k triples from 32 languages across a variety of families. Because some languages contain triples in two typological templates, we report results for 44 sets of triples. Table 1 reports language-specific results and means for each template, including the number of triples analyzed, regression coefficient β₁ and p-value, token and type accuracy, and 95% confidence intervals. Figure 2 shows a plot of accuracy and β₁ coefficient for each language, categorized by template.
As reported in Table 1, we find above-chance (> 50%) accuracy for all languages tested. We accurately predict 65.6% of AAN triples, 73.7% of ANA triples, and 68.0% of NAA triples, for a comprehensive accuracy across all languages of 68.7%. Overlapping 95% confidence intervals across template means suggest that IG-based prediction performs equally well across templates. Though we cannot make a direct comparison to other studies due to a lack of standardized datasets, our cross-linguistic accuracy of 68.7% based on a single predictor compares reasonably favorably to a previous analysis of English AAN triples, which achieved 72.3% accuracy using a combination of predictors (Futrell et al., 2020a).
The high performance on Vietnamese ANA triples (96.2%) is largely due to the algorithm correctly predicting that the highly-frequent adjective nhiều 'many' should be placed before the noun, while most other adjectives are placed after. The learned β₁ coefficient is not significantly different between AAN (18.591) and ANA (31.313) triples, though that of NAA (4.140) triples is significantly smaller than the other two. More generally, of the 44 datasets tested, β₁ is positive in 41 (93.2%), suggesting that there is a strong preference to maximize information gain. Further, of the three instances of a negative β₁, two (Croatian and Indonesian ANA) do not reach significance, perhaps due to a paucity of data. The sole significant negative β₁ is from Basque ANA triples.

Asymmetries
The preference for one variant of an ANA triple over the other is an asymmetry without a straightforward explanation in a distance-based model; there is no clear mapping from ANA onto the other templates, which means that an adjective's relative distance to the noun is not informative. Our algorithm is novel in that the placement of the adjectives is governed by greedy IG, not distance to the noun, an innovation that allows us to break the symmetry between the adjectives in ANA triples. Similarly, IG makes no a priori prediction as to whether a mirror or same order will emerge between AAN and NAA triples: both pre- and post-nominal behavior is a product of ordering adjectives such that information gain is maximized, and IG itself is fundamentally derived from the distribution of adjectives and nouns that populate a language's possible feature vectors for conveying meaning.
Another left-right asymmetry that has been posited in the linguistics literature holds that dependents placed before the head in a surface realization (e.g., the adjectives in an AAN triple) follow a more rigid ordering than those placed after (e.g., the adjectives in an NAA triple; Hawkins, 1983). Both noun modifiers in general and adjectives specifically have been reported to follow this pattern, with a largely-universal pre-nominal ordering and a mirror, same, or 'free' post-nominal order (Hetzron, 1978). However, there is as yet no large-scale empirical evidence for this claim.
In an effort to empirically assess the claim that post-nominal orderings are more flexible than pre-nominal orderings, Table 2 reports the average prevalence of adjective pairs attested in both possible orders (e.g., A₁A₂N and A₂A₁N, where N can be any noun) within each template in our dataset. At 95% confidence the difference between AAN and NAA does not reach significance, though the rate for ANA is significantly lower than the other two. More generally, the mean rate of just 1.6% across templates reinforces the notion that ordering preferences are quite robust regardless of template, at least for our normalized triples from the languages analyzed here.

Ablation
Equation 1 defines information gain as the weighted sum of two terms, the positive evidence D_KL[L′ ‖ L] and the negative evidence D_KL[L̄ ‖ L]. The positive evidence alone is akin to surprisal, a well-studied quantity in psycholinguistics (§3.2). By ablating the IG formulation into its two constituent terms, we can show empirically that the proportionally-combined positive and negative evidence yields more accurate and consistent results than either term alone.

Table 3 shows the mean accuracy and polarity proportion of the β₁ coefficient across languages and templates. The polarity of β₁ tells us whether maximizing IG (positive) or minimizing IG (negative) is the better strategy. Thus a polarity percentage close to 0 or 1 indicates more consistent behavior across templates. For example, while the accuracy of using only positive evidence, D_KL[L′ ‖ L], is 0.565, that accuracy is realized due to a 0.000 rate of positive β₁ coefficients; that is, the 56.5% accuracy is achieved by minimizing IG, placing the adjective with the lower IG first. On the other hand, while using only positive evidence to predict NAA triples yields the same accuracy, 0.565, the coefficient polarity proportion of 0.769 means that in most NAA cases IG should be maximized. The three templates together reflect a modest accuracy (0.566) and an inconsistent coefficient polarity proportion (0.273).
Using only negative evidence, D_KL[L̄ ‖ L], yields even worse accuracies and coefficients as inconsistent as with positive evidence alone. Accuracy across templates is little better than chance at 0.535, and the average coefficient polarity proportion of 0.273 likewise demonstrates that using negative evidence alone does not produce consistent behavior across templates.
The full IG calculation, including both positive and negative evidence, yields the highest accuracy across templates (0.687), as well as the highest for each template-AAN (0.657), ANA (0.737) and NAA (0.680). IG also demonstrates the most consistent behavior across languages and templates: at a rate of 0.932, maximizing IG yields the highest accuracy, regardless of whether adjectives precede or follow the noun.

An efficient algorithm
The goal of algorithms such as ID3 is to produce a decision tree which divides a dataset into equal-sized and mutually-exclusive partitions, thereby creating a shallow tree (Quinlan, 1986). While finding the smallest possible binary decision tree is NP-complete (Hyafil and Rivest, 1976), ID3's locally-optimal approach has proven quite effective at producing shallow trees capable of accurate classification (Dobkin et al., 1996).
By analogy, the ordering of adjectives in a noun phrase by maximizing information gain likewise produces a tree with balanced positive and negative partitions at each node. Specifically, adjectives that minimize the entropy of both the positive and negative evidence are placed before adjectives which are less 'decisive' at partitioning feature vectors.

Summary
We have taken a novel approach to the problem of predicting the surface order of adjectives across languages, casting it as a decision tree operating on a probability distribution over binary feature vectors. As each adjective is uttered, probability mass is partitioned into positive and negative subsets: those vectors which contain the feature and those that do not. The information gained by this partition can be used to order adjectives in a greedy manner, similarly to well-known algorithms for ordering nodes in a decision tree.
An IG-based approach allows us to provide the first quantitative information-theoretic account predicting the order of ANA triples. Further, with this approach we need not stipulate mirror or same orders for AAN and NAA triples. Because IG is not a distance metric between adjective and noun, and because IG incorporates negative evidence, both ANA and pre- or post-nominal asymmetries emerge within an IG framework, without appeal to other mechanisms.
Our results show that information gain is a good predictor of adjective order across languages. Importantly, IG-based prediction follows a consistent pattern across the three typological templates, namely that adjectives which maximize information gain tend to be placed first.