Measuring Language Closeness by Modeling Regularity

This paper addresses the problem of measuring similarity between languages, where the term language covers any of the senses denoted by language, dialect or linguistic variety, as defined by any theory. We argue that to devise an effective way to measure the similarity between languages one should build a probabilistic model that tries to capture as much regular correspondence between the languages as possible. This approach yields two benefits. First, given a set of language data, for any two models, this gives a way of objectively determining which model is better, i.e., which model is more likely to be accurate and informative. Second, given a model, for any two languages we can determine, in a principled way, how close they are. The better models will be better at judging similarity. We present experiments on data from three language families to support these ideas. In particular, our results demonstrate the arbitrary nature of terms such as language vs. dialect, when applied to related languages.


Introduction
In the context of building and applying NLP tools to similar languages, language varieties, or dialects, we are interested in principled ways of capturing the notion of language closeness.
Starting from scratch to develop resources and tools for languages that are close to each other is expensive; the hope is that the cost can be reduced by making use of pre-existing resources and tools for related languages, which are richer in resources.
In the context of this workshop, we assume that we deal with some method, "Method X," that is applied to two (or more) related languages. For example, Method X may involve adapting/porting a linguistic resource from one language to another; or may be trying to translate between the languages; etc. We also assume that the success of Method X directly depends in some way on how similar-or close-the languages are: that is, the similarity between the languages is expected to be a good predictor of how successful the application of the method will be. Thus, in such a setting, it is worthwhile to devote some effort to devising good ways of measuring similarity between languages. This is the main position of this paper.
We survey some of the approaches to measuring inter-language similarity in Section 2. We assume that we are dealing with languages that are related genetically (i.e., etymologically). Related languages may be (dis)similar on many levels; in this paper, we focus on similarity on the lexical level. This is admittedly a potential limitation, since, e.g., for Method X, similarity on the level of syntactic structure may be more relevant than similarity on the lexical level. However, as is done in other work, we use lexical similarity as a "general" indicator of relatedness between the languages. Most of the surveyed methods begin with alignment at the level of individual phonetic segments (phones), which is seen as an essential phase in the process of evaluating similarity. Alignment procedures are applied to the input data, which are sets of words, drawn from the related languages, that are judged to be similar (cognate).
Once an alignment is obtained using some method, the natural question arises: how effective is the particular output alignment?
Once the data is aligned (and, hopefully, aligned well), it becomes possible to devise measures for computing distances between the aligned words.
One of the simplest such measures is the Levenshtein edit distance (LED), a crude count of the edit operations needed to transform one word into the other. Averaging across the LEDs between individual word pairs gives an estimate of the distance between the languages. The question then arises: how accurate is the obtained distance? LED has obvious limitations. LED charges an edit operation for substituting similar as well as dissimilar phones, regardless of how regular (and hence, probable) a given substitution is. Conversely, LED charges nothing for substituting a phone x in language A for the same phone in language B, even if x in A regularly (e.g., always!) corresponds to y in B. More sophisticated variants of LED have therefore been proposed, which try to take into account some aspects of the natural alignment setting (such as assigning different weights to different edit operations, e.g., by saying that it is cheaper to transform t into d than t into w).
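To make the limitation concrete, plain LED can be sketched in a few lines (a minimal Python illustration, not code from any of the works surveyed here):

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein edit distance: unit cost for every
    insertion, deletion and substitution, regardless of how
    (ir)regular a substitution is across the two languages."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]
```

Note that the substitution cost `(ca != cb)` is 0 or 1 no matter whether the substitution is a regular correspondence between the languages, which is exactly the weakness discussed above.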
Thus, in pursuit of effective similarity measures, we are faced with a sequence of steps: procedures for aligning data produce alignments; from the individual word-level alignments we derive distance measures; averaging distances across all words we obtain similarity measures between languages; we then require methods for comparing and validating the resulting language distance measures. At various phases, these steps involve subjectivity-typically in the form of gold standards. We discuss the kinds of subjectivity encountered with this approach in detail in Section 2.1.
As an alternative approach, we advocate viewing closeness between languages in terms of regularity in the data: if two languages are very close, it means that either the differences between them are very few, or, if they are many, they are very regular. As the number of differences grows and their nature becomes less regular, the languages grow more distant. The goal then is to build probabilistic models that capture regularity in the data; to do this, we need to devise algorithms to discover as much regularity as possible.
This approach yields several advantages. First, a model assigns a probability to observed data. This has deep implications for this task, since it allows us to quantify uncertainty in a principled fashion, rather than commit to ad-hoc decisions and prior assumptions. We will show that probabilistic modeling requires us to make fewer subjective judgements. Second, the probabilities that the models assign to data allow us to build natural distance measures. A pair of languages whose data have a higher probability under a given model are closer than a pair with a lower probability, in a well-defined sense. This also allows us to define distance between individual word pairs. The smarter the model-i.e., the more regularity it captures in the data-the more we will be able to trust in the distance measures based on the model. Third-and equally important for this problem setting-this offers a principled way of comparing methods: if model X assigns higher probability to real data than model Y, then model X is better, and can be trusted more. The key point here is that we can then compare models without any "ground truth" or gold-standard, pre-annotated data.
One way to see this is by using the model to predict unobserved data. We can withhold one word pair (w_A, w_B) from languages A and B before building the model (so the model does not see the true correspondence); once the model is built, show it w_A, and ask what is the corresponding word in B. Theoretically, this is simple: the best guess for ŵ_B is simply the one that maximizes the probability of the pair p_M(w_A, ŵ_B) under the model, over all possible strings ŵ_B in B. Measuring the distance between w_B and ŵ_B tells how good M is at predicting unseen data. Now, if model M_1 consistently predicts better than M_2, it is very difficult to argue that M_1 is in any sense the worse model; it is able to predict better only because it has succeeded in learning more about the data and the regularities in it.
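As a sketch, assuming the model exposes a joint log-probability function over word pairs, the held-out prediction step reduces to an argmax over candidate target strings. (The function name and interface below are illustrative, not from the paper; in practice the search over all target strings requires dynamic programming, so we restrict it here to a finite candidate set.)

```python
def predict_target(w_a, candidates, pair_logprob):
    """Held-out prediction sketch: given a source word w_a and a
    model's joint log-probability function pair_logprob(w_a, w_b),
    return the candidate target string with maximal probability
    under the model."""
    return max(candidates, key=lambda w_b: pair_logprob(w_a, w_b))
```

A toy scoring function that rewards symbol-by-symbol agreement already illustrates the idea; a real model would supply p_M learned from the corpus.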
Thus we can compare different models for measuring linguistic similarity. And this can be done in a principled fashion-if the distances are based on probabilistic models.
The paper is organized as follows. We continue with a discussion of related work. In Section 3 we present one particular approach to modeling, based on information-theoretic principles. In Section 4 we show some applications of these models to several linguistic data sets, from three different language families. We conclude with plans for future work, in Section 5.

Related work
In this section we survey related work on similarity measures between languages, and contrast the principles on which this work relies against the principles which we advocate.

Subjectivity
Typically, alignment-based approaches use several kinds of inputs that have a subjective nature.
One such input is the data itself, which is to be aligned. For a pair of closely related dialects, deciding which words to align may appear "self-evident." However, as we take dialects/languages that are progressively more distant, such judgements become progressively less self-evident; therefore, in all cases, we should keep in mind that the input data itself, comprised of lists of putatively related words, is a source of subjectivity in measuring similarity.
Another source of subjectivity in some of the related work is gold-standard alignments, which accompany the input data. Again, for very close languages, the "correct" alignment may appear to be obvious. However, we must recognize that this necessarily involves subjective judgements from the creators of the gold-standard alignment.
Further, many alignment methods pre-suppose a one-to-one correspondence between phones. On one hand, this is due to limitations of the methods themselves (though there exist methods for aligning phones in other than one-to-one fashion); on the other hand, it violates the accepted linguistic understanding that phones need not correspond in a one-to-one fashion even among close languages. Another potential source of subjectivity comes in the form of prior assumptions or restrictions on permissible alignments.[5] Another common assumption is insistence on consonant-to-consonant and vowel-to-vowel alignments. More relaxed assumptions may come in the form of prior probabilities of phone alignments. Although these may appear "natural" in some sense, it is important to keep in mind that they are ad hoc, and reflect a subjective judgement which may not be correct.
After alignment and computation of language distance, the question arises: which of the distance measures is more accurate? Again, one way to answer this question is to resort to gold standards. For example, this can be done via phylogenetic clustering: if method A says language l_1 is closer to l_2 than to l_3, and method B says the opposite (that l_1 is closer to l_3), and if we "know" the latter to be true, from a gold standard, then we can prefer method B. Further, if we have a gold-standard tree for the group of languages, we can apply tree-distance measures to check how the trees generated by a given method differ from the gold standard. The method that deviates least from the gold standard is then considered best.

[5] One-to-one alignment is actually one such restriction.

Levenshtein-based algorithms
The Levenshtein algorithm is a dynamic-programming approach for aligning a word pair (A, B) using a least expensive set of the insertion, deletion and substitution operations required to transform A into B. While the original Levenshtein edit distance is based on these three operations without any restrictions, later algorithms adapt the method with additional edit operations or restrictions. Wieling et al. (2009) compare several alignment algorithms applied to dialect pronunciation data. These algorithms include several adaptations of the Levenshtein algorithm and the Pair Hidden Markov Model. They evaluate the algorithms by comparing the resulting pairwise alignments to alignments generated from a set of manually corrected multiple alignments. Standard Levenshtein edit distance is used for comparing the output of each algorithm to the gold-standard alignment, to determine which algorithm is preferred.
All of the Levenshtein-based alignment algorithms evaluated by Wieling et al. (2009) forbid aligning vowels with consonants.
VC-sensitive Levenshtein algorithm: uses the standard Levenshtein algorithm, prohibits aligning vowels with consonants, and assigns unit cost to all edit operations. The only sense in which it captures regularities is the assumption that the same symbol in two languages represents the same sound, which results in assigning a cost of 0 to aligning a symbol to itself. It also prevents the algorithm from finding vowel-to-consonant correspondences (found in some languages), such as u-v, u-l, etc.
Levenshtein algorithm with Swap: adds an edit operation to enable the algorithm to capture phenomena such as metathesis, via a transposition: aligning ab in A to ba in B costs a single edit operation. This algorithm also forbids aligning vowels to consonants, except in a swap.
Levenshtein algorithm with generated segment distances based on phonetic features: The above algorithms assign unit cost to all edit operations, regardless of how the segments are related. Heeringa (2004) uses a variant where the distances are obtained from differences between the phonetic features of the segment pairs. This is subjective, in that one could choose from several possible feature sets.
Levenshtein algorithm with generated segment distances based on acoustic features: To avoid the subjectivity of feature selection, Heeringa (2004) experiments with assigning different costs to different segment pairs based on how phonetically close they are; segment distances are calculated by comparing spectrograms of recorded pronunciations. These algorithms do not attempt to discover regularity in the data, since they consider only one word pair at a time, using no information about the rest of the data.
Levenshtein algorithm with distances based on PMI: Wieling et al. (2009) use Pointwise Mutual Information (PMI) as the basis for segment distances. They assign different costs to different segment pairs, and use the entire dataset for each alignment. PMI for outcomes x and y of random variables X and Y is defined as:

PMI(x, y) = log [ p(x, y) / ( p(x) p(y) ) ]

PMI is calculated using estimated probabilities of the events. Since greater PMI indicates a higher tendency of x and y to co-occur, it is reversed and normalized to obtain a dissimilarity measure to be used as the segment distance. Details about this method are in (Wieling and Nerbonne, 2011).
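A minimal sketch of PMI-based segment distances, estimated from a set of aligned segment pairs (the exact reversal and normalization used by Wieling et al. may differ from the plain negation shown here):

```python
import math
from collections import Counter

def pmi_distances(aligned_pairs):
    """Estimate p(x), p(y) and p(x, y) from aligned segment pairs,
    compute PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ), and negate
    it so that regularly co-occurring segments get a small distance."""
    joint = Counter(aligned_pairs)
    src = Counter(x for x, _ in aligned_pairs)
    tgt = Counter(y for _, y in aligned_pairs)
    n = len(aligned_pairs)
    dist = {}
    for (x, y), c in joint.items():
        pmi = math.log2((c / n) / ((src[x] / n) * (tgt[y] / n)))
        dist[(x, y)] = -pmi  # higher PMI -> smaller distance
    return dist
```

For example, a t that regularly aligns with d across the data receives a smaller t-d distance than a rare t-w alignment, which is precisely the regularity plain LED ignores.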

Other distance measures
Ellison and Kirby (2006) present a distance measure based on comparing intra-language lexica only, arguing that there is no well-founded common language-independent phonetic space to be used for comparing word forms across languages. Instead, they infer the distances by estimating how likely meanings in language A are to be confused for each other, and comparing this to the confusion probabilities in language B. Given a lexicon containing mappings from a set of meanings M to a set of forms F, the confusion probability P(m_1 | m_2; L) for each pair of meanings (m_1, m_2) in L is the probability of confusing m_1 for m_2. This probability is formulated based on an adaptation of the neighborhood activation model, and depends on the edit distance between the corresponding forms in the lexicon. Following this approach, they construct a confusion probability matrix for each language, which can be viewed as a probability distribution. Inter-language distances are then calculated as the distance between the corresponding distributions, using the symmetric Kullback-Leibler distance and the Rao distance. The inferred distances are used to construct a phylogenetic tree of the Indo-European languages. The approach is evaluated by comparing the resulting taxonomy to a gold-standard tree, to which it is reported to be a good fit.
As with the other presented methods, although this method can be seen as measuring distances between languages, two problems remain. First, the distances do not reflect the genetic differences, similarities and regularities between the languages in a transparent, easily interpretable way. Second, they offer no direct way to compare competing approaches, except indirectly, using (subjective) gold standards.

Methods for measuring language closeness
We now discuss an approach which follows the proposal outlined in Section 1, and allows us to build probabilistic models for measuring closeness between languages. Other approaches that rely on probabilistic modeling would serve equally well. A comprehensive survey of methods for measuring language closeness may be found in (Wieling and Nerbonne, 2015). Work that is probabilistically oriented, similarly to our proposed approach, includes (Bouchard-Côté et al., 2007; Kondrak, 2004) and others. We next review two types of models (some of which are described elsewhere), which are based on information-theoretic principles. We discuss how these models suit the proposed approach in the next section.

1-1 symbol model
We begin with our "basic" model, described in prior work, which makes several simplifying assumptions that the subsequent, more advanced models relax (Wettig et al., 2012; Wettig et al., 2013).[7] The basic model is based on alignment, similarly to much of the related work mentioned above: for every word pair in our data set (the "corpus") it builds a complete alignment of all symbols. The basic model considers pairwise alignments only, i.e., two languages at a time; we call them the source and the target languages. Later models relax this restriction by using N-dimensional alignment, with N > 2 languages aligned simultaneously. The basic model allows only 1-1 symbol alignments: one source symbol[8] may correspond to one target symbol, or to the empty symbol (which we mark as "."). More advanced models align substrings of more than one symbol to each other. The basic model also ignores context, whereas in reality symbol correspondences are heavily conditioned on their context. Finally, the basic model treats the symbols as atoms, whereas more advanced models treat the symbols as vectors of distinctive features.
We distinguish between the raw, observed data and the complete data, i.e., complete with the alignment; the hidden data is where the insertions and deletions occur. For example, if we ask what is the "correct" alignment between Finnish vuosi and Khanty al (cognate words from these two Uralic languages, both meaning "year"), many alignments are possible. From among all alignments, we seek the best alignment: one that is globally optimal, i.e., one that is consistent with as many regular sound correspondences as possible. This leads to a chicken-and-egg problem: on one hand, if we had the best alignment for the data, we could simply read off a set of rules, by observing which source symbol frequently corresponds to which target symbol. On the other hand, if we had a complete set of rules, we could construct the best alignment, by using dynamic programming (à la one of the above-mentioned methods, since the costs of all possible edit operations are determined by the rules). Since at the start we have neither, the rules and the alignment are bootstrapped in tandem.

[7] The models can be downloaded from etymon.cs.helsinki.fi
[8] In this paper, we equate symbols with sounds: we assume our data to be given in phonetic transcription.
Following the Minimum Description Length (MDL) principle, the best alignment is the one that can be encoded (i.e., written down) in the shortest space. That is, we aim to code the complete data, for all word pairs in the given language pair, as compactly as possible. To find the optimal alignment, we need (A) an objective function, i.e., a way to measure the quality of any given alignment, and (B) a search algorithm, to sift through all possible alignments for one that optimizes the objective.
We can use various methods to code the complete data. Essentially, they all amount to measuring how many bits it costs to "transmit" the complete set of alignment "events", where each alignment event is a pair of aligned symbols

e = (σ : τ) ∈ (Σ ∪ {., #}) × (T ∪ {., #})

drawn from the source alphabet Σ and the target alphabet T, respectively.[9] One possible coding scheme is "prequential" coding, i.e., the Bayesian marginal likelihood, see, e.g., (Kontkanen et al., 1996); another is the normalized maximum likelihood (NML) code (Rissanen, 1996), used in (Wettig et al., 2012).
Prequential coding gives the total code length for data D as:

L(D) = log2 (N + K − 1)! − log2 (K − 1)! − Σ_e log2 c(e)!   (2)

Here, c(e) denotes the count of event e, K is the total number of event types, and N = Σ_e c(e) is the total number of events. To find the optimal alignments, the algorithm starts by aligning the word pairs randomly, and then iteratively searches for the best alignment of one word pair at a time, given the rest of the data. To do this, we first exclude the current alignment from our complete data. The best alignment in the re-aligning process is found using a dynamic programming matrix, with the source word's symbols as the rows and the target word's symbols as the columns. Each possible alignment of the word pair corresponds to a path from the top-left cell of the matrix to the bottom-right cell. Each cell V(σ_i, τ_j) holds the cost of aligning the sub-string σ_1..σ_i with τ_1..τ_j, and is computed as:

V(σ_i, τ_j) = min { V(σ_i, τ_j−1) + L(. : τ_j),  V(σ_i−1, τ_j) + L(σ_i : .),  V(σ_i−1, τ_j−1) + L(σ_i : τ_j) }   (3)

[9] Note that the alphabets need not be the same, or even have any symbols in common. We add a special end-of-word symbol, always aligned to itself: (# : #). Empty alignments (. : .) are not allowed.
where L(e) is the cost of coding event e. The cost of aligning the full word pair is then found in the bottom-right cell, and the corresponding path is chosen as the new alignment, which is registered back into the complete data.
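The recurrence in Equation 3 can be sketched as follows (a simplified illustration: the event-cost function L is passed in as a parameter, whereas in the actual model it is derived from the current event counts over the whole corpus, and the chosen path would also be read back into the complete data):

```python
def align_cost(src, tgt, L):
    """Word-pair realignment DP: V[i][j] is the cheapest code
    length for aligning src[:i] with tgt[:j]. L(s, t) gives the
    cost in bits of an alignment event; '.' is the empty symbol."""
    n, m = len(src), len(tgt)
    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == j == 0:
                continue
            opts = []
            if j > 0:
                opts.append(V[i][j - 1] + L('.', tgt[j - 1]))   # insertion
            if i > 0:
                opts.append(V[i - 1][j] + L(src[i - 1], '.'))   # deletion
            if i > 0 and j > 0:
                opts.append(V[i - 1][j - 1] + L(src[i - 1], tgt[j - 1]))
            V[i][j] = min(opts)
    return V[n][m]  # cost of the best full alignment
```

With a unit cost function this reduces to ordinary Levenshtein distance; the point of the model is that L is instead learned from the regularities in the rest of the data.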
We should mention that, due to the vulnerability of the algorithm to local optima, we use simulated annealing with 50 random restarts.

Context model
The context model is described in detail in (Wettig et al., 2013). We use a modified version of this model to achieve a faster run-time.
One limitation of the basic model described above is that it uses no information about the context of the sounds, thus ignoring the fact that linguistic sound change is regular and highly dependent on context. The 1-1 model also treats the symbols of the words as atoms, ignoring how phonetically close two sounds are. The context model addresses both of these issues.
Each sound is represented as a vector of distinctive phonetic features. Since we are using MDL as the basis of the model here, we need to code (i.e., transmit) the data. This can be done by coding one feature at a time on each level.
To code a feature F on a level L, we construct a decision tree. First, we collect all instances of the sounds in the data on the corresponding level that have the current feature, and then build a count matrix based on how many instances take each value. For example, a count matrix for the feature V (vertical articulation of a vowel) might show that there are 10 close vowels, 25 mid-close vowels, etc.
This serves as the root node of the tree. The tree can then query features of the sounds in the current context by choosing from a set of candidate contexts. Each candidate is a triplet (L, P, F ), representing Level, Position, and Feature respectively.
L can be either source or target, since we are dealing with a pair of language varieties at a time. P is the position of the sound being queried, relative to the current sound, and F is the feature being queried. Examples of a position are previous vowel, previous position, itself, etc. The tree expands depending on the possible responses to the query, resulting in child nodes with their own count matrices. The idea here is to make the matrices in the child nodes as sparse as possible, in order to code them with fewer bits.
This process continues until the tree cannot be expanded any further. Finally, the data in each leaf node is coded using prequential coding as before, with the same cost as in Equation 2.
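The prequential cost of a leaf node's counts can be sketched as follows (an illustrative reconstruction of the Equation 2 cost under a uniform prior; the model's exact prior may differ):

```python
import math

def prequential_cost(counts):
    """Prequential (Bayesian marginal likelihood) code length, in
    bits, for a vector of event counts c(e) over K event types:
    L = log2 (N+K-1)! - log2 (K-1)! - sum_e log2 c(e)!
    where N is the total number of events. Uses lgamma, since
    lgamma(n + 1) = ln n!."""
    K = len(counts)
    N = sum(counts)
    cost = math.lgamma(N + K) - math.lgamma(K)
    cost -= sum(math.lgamma(c + 1) for c in counts)
    return cost / math.log(2)  # convert nats to bits
```

A sparse count vector costs fewer bits than a uniform one, which is why the tree-building step tries to make the child-node matrices as sparse as possible.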
The code length for the complete data consists of the cost of encoding the trees plus the cost of encoding the data given the trees. The search algorithm remains the same as in the 1-1 model, but uses the constructed trees to calculate the cost of events.
This method spends much time rebuilding the trees on each iteration, so its run-time is very high. In the modified version used in this paper, the trees are not allowed to expand initially, when the model has just started and everything is random due to simulated annealing. Once the simulated annealing phase is complete, the trees are expanded normally, to their full depth. Our experiments show that this results in trees that are as good as the original ones.

Normalized Compression Distance
The cost of coding the data for a language pair under a model reflects the amount of regularity the model discovered, and thus is a means of measuring the distance between these languages. However, the cost also depends on the size of the data for the language pair; thus, a way of normalizing the cost is needed to make costs comparable across language pairs. We use the "Normalized Compression Distance" (NCD), described in (Cilibrasi and Vitanyi, 2005), to achieve this.
Given a model that can compress the data of a language pair (a, b) with cost C(a, b), the NCD of (a, b) is:

NCD(a, b) = [ C(a, b) − min( C(a), C(b) ) ] / max( C(a), C(b) )

where C(a) and C(b) denote the costs of coding the data of each language alone. Since the NCDs of different pairs are comparable under the same model, NCD can be used as a distance measure between language varieties.
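Assuming the model also yields monolingual coding costs for each language (an assumption here; the paper's exact monolingual costs may be obtained differently), the NCD computation reduces to a one-liner:

```python
def ncd(cost_ab, cost_a, cost_b):
    """Normalized Compression Distance (Cilibrasi & Vitanyi, 2005):
    cost_ab is the code length of the language pair under the model;
    cost_a and cost_b are the code lengths of each language alone.
    The result lies near [0, 1]; lower means closer."""
    return (cost_ab - min(cost_a, cost_b)) / max(cost_a, cost_b)
```

For instance, a pair whose joint cost barely exceeds the cheaper monolingual cost gets an NCD near 0, i.e., the model found the two languages highly mutually predictable.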

Prediction of unobserved data
The models mentioned above are also able to predict unobserved data, as described in Section 1 (Wettig et al., 2013). For the basic 1-1 model, since no information about the context is used, prediction simply means looking for the most probable symbol in the target language for each symbol of w_A. For the context model, a more sophisticated dynamic-programming heuristic is needed to predict the unseen word (Hiltunen, 2012). The predicted word ŵ_B is then compared to the real corresponding word w_B to measure how well the model performed on the task. Feature-wise Levenshtein edit distance is used for this comparison. The edit distances for all word pairs are normalized, resulting in the Normalized Feature-wise Edit Distance (NFED), which can serve as a measure of model quality.

Experiments
To illustrate the principles discussed above, we experiment with the two principal model types, the baseline 1-1 model and the context-sensitive model, using data from three different language families.

Data
We use data from the StarLing databases (Starostin, 2005) for the Turkic and Uralic language families, and for the Slavic branch of the Indo-European family. For dozens of language families, StarLing has rich data sets (going beyond Swadesh-style lists, as in some other lexical data collections built for judging language and dialect distances). The databases are under constant development, and their quality varies. Some datasets (most notably the IE data) are drawn from multiple sources, which use different notation, transcription, etc., and are not yet unified. The data we chose for use is particularly clean.
For the Turkic family, StarLing at present contains 2017 cognate sets; we use 19 (of the total 27) languages, which have a substantial amount of attested word-forms in the data collection.

Model comparison
We first demonstrate how the "best" model can be chosen from among several models, in a principled way. This is feasible if we work with probabilistic models-models that assign probabilities to the observed data. If the model is also able to perform prediction (of unseen data), then we can measure the model's predictive power and select the best model using predictive power as the criterion. We will show that in the case of the two probabilistic models presented above, these two criteria yield the same result.
We ran the baseline 1-1 model and the context model on the entire Turkic dataset, i.e., the 19 × 18 language pairs (with 50 restarts for each pair, a total of 17,100 runs). For each language pair and each model, we select the best of the 50 runs, according to the cost the model assigns to this language pair. Figure 1 shows the costs obtained by the best runs: each point denotes a language pair; the X-coordinate is the cost according to the 1-1 model, and the Y-coordinate is the cost of the context model. The figure shows that all 19 × 18 points lie below the diagonal (x = y), i.e., for every language pair, the context model finds a code with lower cost, as is expected, since the context model is "smarter": it uses more information from the data, and hence finds more regularity in it.
Next, for each language pair, we take the run that found the lowest cost, and use it to impute unseen data, as explained in Section 3, yielding NFED, the distance from the imputed string to the true string. The results are shown in Figure 2; this time the X and Y values lie between 0 and 1, since NFED is normalized. (In the figure, the points are linked with line segments as follows: for any pair (a, b), the point (a, b) is joined by a line to the point (b, a). This is done for easier identification, since the point (a, b) displays the legend symbol for only language a.) Overall, many more points lie below the diagonal (approximately 10% of the points are above). The context model performs better, and would therefore be a safer/wiser choice if we wish to measure language closeness; this agrees with the result obtained using the raw compression costs.
The key point here is that this comparison method can accommodate any probabilistic model: for any new candidate model we check, over the same datasets, what probability values the model assigns to each data point. Probabilities and (compression) costs are interchangeable: information theory tells us that for a data set D and model M, the probability P of data D under model M and the cost (code length) L of D under M are related by L_M(D) = − log P_M(D). If the new model assigns higher probability (or lower cost) to the observed data, it is preferable, obviating the need for gold standards or subjective judgements.
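The probability-to-cost conversion can be expressed directly (a trivial illustrative helper, not code from the paper):

```python
import math

def cost_bits(prob):
    """Code length in bits corresponding to probability prob:
    L = -log2 P, so higher probability means lower cost, and
    model comparison by probability or by cost is equivalent."""
    return -math.log2(prob)
```

Comparing two candidate models then amounts to comparing either their probabilities or their total code lengths over the same dataset; the ranking is identical.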

Language closeness
We next explore various datasets using the context model, the better of the two models we have available.
Uralic: We begin with Uralic data from StarLing.[11] The Uralic database contains data from more than one variant of many languages: we extracted data for the top two dialects, in terms of counts of available word-forms, of Komi, Udmurt, and the other languages for which multiple variants are available. It is striking that the pairs that score below Finnish/Estonian are all "true" dialects, whereas those that score above are not. E.g., the Mansi variants Pelym and Sosva (Honti, 1998), and Demjanka and Vakh Khanty (Abondolo, 1998), are mutually unintelligible. The same is true for North and Lule Saami.

[11] We use data from the Finno-Ugric sub-family. The language codes are: est:Estonian, fin:Finnish, khn:Khanty, kom:Komi, man:Mansi, mar:Mari, mrd:Mordva, saa:Saami, udm:Udmurt.
Turkic: We compute NCDs for the Turkic languages under the context model. Some of the Turkic languages are known to form a much tighter dialect continuum (Johanson, 1998), which is evident from the NCDs in Table 3. E.g., Tofa is most closely related to the Tuvan language and forms a dialect continuum with it (Johanson, 1998). Turkish and Azerbaijani closely resemble each other and are mutually intelligible. In the table we highlight language pairs with NCD ≤ 0.30.
Slavic: We analyzed data from StarLing for 9 Slavic languages. The NCDs are shown in Table 1. Of all pairs, the normalized compression costs for (cz, slk) and (lsrb, usrb) fall below the .30 mark, and indeed these pairs have high mutual intelligibility, unlike all other pairs. When the data from Table 1 are fed into the Neighbor-Joining algorithm (Saitou and Nei, 1987), it draws the phylogeny in Figure 3, which clearly separates the languages into the 3 accepted branches of Slavic: East (ru, ukr), South, and West.

Conclusions and future work
We have presented a case for using probabilistic modeling when we need reliable quantitative measures of language closeness. Such needs arise, for example, when one attempts to develop methods whose success directly depends on how close the languages in question are. We attempt to demonstrate two main points. One is that using probabilistic models provides a principled and natural way of comparing models, to determine which candidate model we can trust more when measuring how close the languages are. It also lets us compare models without having to build gold-standard datasets; this is important, since gold standards are subjective, not always reliable, and expensive to produce. We are really interested in regularity, and the proof of a model's quality is in its ability to assign high probability to observed and unobserved data. The second main point of the paper is showing how probabilistic models can be employed to measure language closeness. Our best-performing model seems to provide reasonable judgements of closeness when applied to languages/linguistic variants from very different language families. For all of the Uralic, Turkic and Slavic data, the pairs that fell below the 0.30 mark on the NCD axis are known to have higher mutual intelligibility, while those that are above the mark have lower or no mutual intelligibility.[13] Of course, we do not claim that 0.30 is a magic number; for a different model the line of demarcation may fall elsewhere entirely. However, it shows that the model (which we selected on the basis of its superiority according to our selection criteria) is quite consistent in predicting the degree of mutual intelligibility, overall.

[13] We should note that the NCDs produce excellent phylogenies also for the Turkic and Uralic data; these are not included here due to space constraints.
Incidentally, these experiments demonstrate, in a principled fashion, the well-known arbitrary nature of the terms language vs. dialect: this distinction is simply not supported by real linguistic data. More importantly, probabilistic methods require us to make fewer subjective judgements, with no ad-hoc priors or gold standards, which in many cases are difficult to obtain and justify; instead, they rely on the observed data as the ultimate and sufficient truth.