Measuring the Measuring Tools: An Automatic Evaluation of Semantic Metrics for Text Corpora

Similarity metrics for text corpora are becoming critical due to the tremendous growth in the number of generative models. These similarity metrics measure the semantic gap between human and machine-generated text at the corpus level. However, standard methods for evaluating the characteristics of these metrics have yet to be established. We propose a set of automatic measures for evaluating the characteristics of semantic similarity metrics for text corpora. Our measures allow us to sensibly compare and identify the strengths and weaknesses of these metrics. We demonstrate the effectiveness of our evaluation measures in capturing fundamental characteristics by applying them to a collection of classical and state-of-the-art metrics. Our measures reveal that recent metrics are becoming better at identifying semantic distributional mismatch, while classical metrics are more sensitive to perturbations at the surface-text level.


Introduction
While there has been a long-standing interest in developing semantic similarity metrics (Rayson and Garside, 2000), measuring how close two text corpora are remains an open problem (Pillutla et al., 2021). Specifically, recent advances in generative language models have led to an increased interest in the study of content similarity between human and generated language, as a means of comparing the quality of generative models (Mille et al., 2021; Gehrmann et al., 2022).
While one can reasonably measure the semantic distance between two individual sentences (e.g., by calculating the cosine distance between the sentence embeddings), measuring the dissimilarity between two text corpora remains a challenge (Naeem et al., 2020). Corpus-level metrics seek to assess semantic similarity at the group level, for instance, assessing generated text fidelity, diversity, and coverage compared to the reference corpus (Sajjadi et al., 2018). Thus, one common approach for measuring the semantic dissimilarity between two corpora is to compare the densities of their sentences in the embedding space (Pillutla et al., 2021).
However, there are no standard automatic procedures for evaluating the precision and robustness of such similarity metrics. The semi-manual standard approach is to correlate the results of these metrics with human judgement. However, leveraging manual human judgements to construct numeric metrics has significant weaknesses. As we explain in Section 2, human judgements are expensive to obtain, are difficult to aggregate consistently from individual text instances into a corpus-level metric in a way that reflects all relevant aspects of the texts, and can be subjective and non-robust.
Therefore, in this paper, we adopt a middle ground between validating the metric against human judgement on real data and evaluating the metric on synthetic distributions, by building "controllable-distance real data corpora" (Section 3). By precisely controlling the content of test corpora, we devise a unified evaluation of desired metric characteristics on real data. This technique aggregates many small-difference judgements, each of which should correspond to what a human would logically decide, into an overall evaluation of the distance metric in terms of desirable properties. The middle ground thus attempts to reflect human logical judgement in an inexpensive way, while avoiding some of the weaknesses described, such as lack of consistency.
To summarize, our contributions are as follows. First, we present a set of text-similarity evaluation measures that allows researchers to compare the robustness of their newly proposed metrics against existing metrics using the same standards. Second, we evaluate classical and state-of-the-art similarity metrics and show that our benchmark performs well in capturing their known properties. Finally, we provide a pip-installable Python package to compute an extensive set of text dissimilarity metrics, using a unified and simple API.

Literature Review
The most widely-used method to assess the quality of text similarity metrics investigates the correlation between the scores given by the metric and human judgements. However, human judgement, even at the sentence level, has several shortcomings, mainly that it is expensive and can be inconsistent and subjective (Popescu-Belis, 2003; Lin and Och, 2004; Graham et al., 2017). Also, superficial aspects of the sentences, such as text length or syllables per sentence, may influence human judgements of semantic similarity (Novikova et al., 2017). Furthermore, though humans may be able to judge the relative similarity of a pair of sentences, they are usually limited in their ability to make large-scale assessments of a similar type when comparing two corpora (i.e., two distributions of sentences) consistently and reliably.
In an attempt to standardize metric evaluation, several competitions and standard datasets containing compared data and human assessments were created for specific tasks, such as translation (Guo et al., 2018; Mathur et al., 2020). However, there is currently a lack of benchmarks against which to assess the semantic similarity between corpora.
Text similarity metrics can be thought of as belonging to several broad and overlapping classes (see, e.g., Wang and Dong 2020), which partially depend on the form of the text representation (e.g., token-based or vector embedding). Here, we investigate metrics from three of these classes, comparing corpora based on the following aspects: lexicographical (statistical properties of words and tokens), distributional (densities of sentences represented in the embedding space), and discriminatability (the ability to classify sentences as belonging to one corpus or the other). The metrics we use are summarized in Table 1.
Lexicographical Statistics These methods compare various distributional properties of a target text Q with respect to reference samples P, based on statistical measures T(P) and T(Q) operating on the surface text level, e.g., sentences, words, word-parts, tokens, etc. Such commonly-used measures include resemblance in vocabulary distribution (Kilgarriff, 2001), likelihood of repetition (Pillutla et al., 2021), and n-gram matching (Papineni et al., 2002). However, these metrics tend to be overly sensitive or easily misled by adversarial samples or text peculiarities. In general, χ²-based metrics calculate the distance between observed and expected frequencies of categorical variables. The metric in (Kilgarriff, 2001), denoted here as CHI, calculates E, the average (between P and Q) frequencies of the n most common tokens in the combined vocabulary of P and Q, then sums the χ² statistics comparing each of P and Q to the expected E, across tokens. Here, for both CHI and ZIPF, below, we use the top n = 5000 tokens. In contrast, the ZIPF metric (Holtzman et al., 2019) compares the use of vocabulary using Zipf's law, which suggests that the frequency of a given word in human text is inversely proportional to its frequency rank. The Zipfian coefficient is fitted to a given corpus; the further it is from 1, the more the observed corpus differs from the 'ideal' theoretical distribution (Holtzman et al., 2019). We can thus use |z_P − z_Q| as a distance metric between corpora P and Q.
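As a concrete illustration of a lexicographic metric, the following sketch fits the Zipfian coefficient of each corpus by least squares in log-log rank-frequency space and takes the absolute difference of the two coefficients. The function names and the fitting procedure are our own assumptions, not the paper's implementation.

```python
# Sketch of a ZIPF-style corpus distance (our reading of Holtzman et al., 2019).
from collections import Counter
import math

def zipf_coefficient(tokens, top_n=5000):
    """Fit frequency ~ rank^(-z) by least squares on log-log values."""
    freqs = sorted(Counter(tokens).values(), reverse=True)[:top_n]
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope  # z is the negated slope of the log-log fit

def zipf_distance(tokens_p, tokens_q, top_n=5000):
    """Absolute difference of the fitted Zipfian coefficients."""
    return abs(zipf_coefficient(tokens_p, top_n) - zipf_coefficient(tokens_q, top_n))
```

On a synthetic corpus whose token frequencies decay as 1/rank, the fitted coefficient comes out close to 1, matching the 'ideal' Zipfian behavior described above.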
Distributional Metrics These metrics quantify the distributional relationship between the reference and target corpora in the embedded vector space, thereby capturing semantics beyond superficial token-level statistics. Here P and Q denote the reference and target corpora in the embedding space. Given samples from these, we can use the sample density estimates P̂ and Q̂ to approximate the true unknown corpus population distributions P and Q. The Fréchet Inception Distance (FID, Heusel et al. 2017) is computed by fitting a continuous multivariate Gaussian to each of P̂ and Q̂, and then calculating the Wasserstein-2 distance between them. However, FID is sensitive both to the addition of spurious modes and to mode dropping (Lucic et al., 2018). Also, while FID is able to detect distributional distances in the high-dimensional space, it cannot shed light upon the nature of this distance. Due to these weaknesses of FID, we additionally consider a metric denoted PR (precision and recall) proposed in computer vision (Sajjadi et al., 2018; Kynkäänniemi et al., 2019), which is inspired by the notion of precision and recall. Intuitively, the precision captures the average resemblance of the target samples to the reference corpus, while the recall captures how well the target covers the reference.
To obtain a single distance value using the method in (Kynkäänniemi et al., 2019), we calculate the F_1 measure based on the returned precision and recall, denoted here by PR. Naeem et al. (2020) proposed an improved estimation of these precision and recall notions by mitigating the overestimation of manifolds caused by outliers, and the underestimation of similarity when the target and reference are taken from the same distribution. Similarly to PR, we calculate the F_1 to obtain a similarity value using this method, denoted as DC. MAUVE (Pillutla et al., 2021) is a recently-developed metric that estimates the gap between human and generated text by computing the area under the information divergence frontiers in a quantized embedding space, using the KL-divergence.
Discriminatability Metrics Similar to the distributional metrics, discriminative metrics calculate the distance using the embeddings of the individual sentences in the two corpora. However, they do not aim to specifically capture the overlap between the distributions induced by the compared corpora. Rather, they frame the relationship in classification terms, i.e., to what extent can sentences in one corpus be distinguished from sentences in the other, using a discriminative model. CLASSIFIER: Following (Lopez-Paz and Oquab, 2016), we measure the similarity between corpora using a binary classifier. We used an SVM (Cortes and Vapnik, 1995) trained on samples of both source corpora to predict corpus membership on a test set of unseen samples. Higher test accuracy indicates higher inter-corpus distance.
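A minimal sketch of such a classifier two-sample test, using scikit-learn's SVC on pre-computed sentence embeddings. The kernel, split ratio, and function name are our assumptions rather than the paper's exact setup.

```python
# Sketch of the CLASSIFIER metric (a two-sample test in the spirit of
# Lopez-Paz and Oquab, 2016): train an SVM to tell the corpora apart;
# held-out accuracy serves as the distance.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def classifier_distance(emb_a, emb_b, seed=0):
    """Held-out accuracy of an SVM separating the two embedding sets."""
    X = np.vstack([emb_a, emb_b])
    y = np.concatenate([np.zeros(len(emb_a)), np.ones(len(emb_b))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    clf = SVC().fit(X_tr, y_tr)
    return clf.score(X_te, y_te)  # ~0.5 for indistinguishable corpora, ~1.0 for distinct
```

For well-separated embedding clouds the score approaches 1; for two samples from the same distribution it hovers around chance level, which is exactly the behavior the text describes.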
While CLASSIFIER is a model-based metric that uses the entire corpus distribution, IRPR (information-retrieval precision and recall) is an example of an instance-based (individual-sentence) corpus distance metric. Inspired by Zhao et al. (2017), we calculate the dissimilarity between corpora as follows. For each embedded sentence in corpus A, we find its closest neighbor in B by cosine distance. The average of these distances is then computed to obtain the "precision" value. The same procedure in reverse, from B to A, gives the "recall" value. We calculate the F_1 score of the precision and recall to obtain a single value. Note that the CLASSIFIER metric represents model-based discriminative approaches, while IRPR represents instance-based discriminative methods.
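The IRPR computation just described can be sketched in numpy as follows. We work with best-match cosine similarities rather than distances so that the F_1 combination is well-behaved (per the text, the final distance is 1 minus the similarity); `irpr_similarity` is a hypothetical name, and this is our simplification, not the authors' code.

```python
# Sketch of IRPR: best-match cosine similarity in each direction, combined by F1.
import numpy as np

def _directed_similarity(src, tgt):
    """Mean cosine similarity of each src row to its nearest tgt row."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    return (src @ tgt.T).max(axis=1).mean()

def irpr_similarity(emb_a, emb_b):
    precision = _directed_similarity(emb_a, emb_b)  # A -> B direction
    recall = _directed_similarity(emb_b, emb_a)     # B -> A direction
    return 2 * precision * recall / (precision + recall)  # F1
```

Note that although each direction alone is asymmetric, the F_1 combination of the two directions is symmetric in its arguments.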
The values calculated by CHI, IRPR, PR, DC, and MAUVE capture the similarity rather than the distance between two corpora (for all these metrics, v ∈ [0, 1]). To make these metrics represent distances, we take 1 − v.
Our model selection considered the trade-off between embedding quality and calculation time. The code, as well as the scripts to reproduce the experiments, is available online.

Known Similarity Corpora
Most of the metric quality measures we propose are primarily based on the notion of known-similarity corpora (KSC) introduced by Kilgarriff (2001). The KSC set is created by mixing samples from two different source corpora A and B in gradually-changing proportions. The KSC set, denoted KSC(A, B), consists of k corpora {c_1, c_2, ..., c_k}, each of size n ≥ k − 1, where corpus c_i, i = 1, ..., k, is constructed by sampling n(k−i)/(k−1) observations from A and the remaining n(i−1)/(k−1) from B (see Figure 1). The sampling resolution gradation between consecutive corpora is a fixed 1/(k−1).

We now introduce some notation on the KSC set, used to define the measures in Section 4. Let [k] = {1, 2, ..., k}, and let d(a, b) denote the distance from corpus a to corpus b according to metric d. For given source corpora A and B, for each ℓ ∈ [k − 1] we define the ℓ-distant corpora set as D_ℓ(A, B, d) = {d(c_i, c_j) : i, j ∈ [k], j − i = ℓ}. To pool results across ℓ, we further define D(A, B, d) — D for short — as the union of D_ℓ(A, B, d) over all ℓ ∈ [k − 1], i.e., the set of values of distance d over all corpora pairs in KSC(A, B). Note that because we do not require the distance metrics considered to be 'metrics' in the mathematical sense, they may not be symmetric (i.e., possibly d(a, b) ≠ d(b, a)). However, since KSC enforces a pairwise order on pairs (c_i, c_j) by requiring j > i, D_ℓ(A, B, d) is properly defined. Some of the metrics d have a pre-defined range (e.g., CHI, MAUVE, DC, and PR only return values in [0, 1]), while others have no preset scale or operation range. Therefore, to allow sensible comparison of distance metrics with different operation ranges and across source corpora, we obtain z-scores by normalizing the metric values pooled across all D_ℓ(A, B, d). In the following analysis, unless specified otherwise, D_ℓ will always denote the normalized rather than raw distances.

Datasets Selection The measures described in Section 4 are applicable to any pair of textual datasets with differently-distributed textual content, allowing the corpora in the KSC set to be distinguishable. To ensure that each pair of source corpora was in fact different enough, in the following experiments we use pairs of human text corpora from different domains, rather than pairing a human corpus with a machine-generated version of itself. For our experiments we selected four public datasets (ATIS, Hemphill et al. 1990; yahoo; banking77, Casanueva et al. 2020; clinc150, Larson et al. 2019) containing short user utterances from different domains, summarized in Table 2.
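The KSC construction above can be sketched in a few lines; `build_ksc` is a hypothetical helper name, and sampling without replacement within each source corpus is our assumption.

```python
# Sketch of KSC construction: corpus c_i draws n*(k-i)/(k-1) samples from A
# and the remainder from B, so the mixture shifts gradually from A to B.
import random

def build_ksc(corpus_a, corpus_b, k, n, seed=0):
    rng = random.Random(seed)
    ksc = []
    for i in range(1, k + 1):
        n_a = round(n * (k - i) / (k - 1))  # share drawn from A
        c_i = rng.sample(corpus_a, n_a) + rng.sample(corpus_b, n - n_a)
        ksc.append(c_i)
    return ksc
```

By construction, c_1 contains only samples from A and c_k only samples from B, with the intermediate corpora interpolating between them in fixed steps of 1/(k−1).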

Metric Robustness Measures
We now describe our measures of desirable properties for distance metrics, given the normalized D_ℓ on the KSC sets. With the first three measures (Monotonicity, Separability, and Linearity), we aim to capture three attributes of well-behaved metrics that can be understood by considering the top row of scatter plots in Figure 2, which show the relation between the D_ℓ sets and ℓ. In these scatter plots, a steep regression line, low vertical variability around it, and linearity are all desirable properties for the distance metric, and are captured by these measures.

Metric Monotonicity
A well-behaved distance metric d should have a natural monotonic relationship with the separation levels of the KSC. We use Spearman's rank correlation between ℓ and D_ℓ, which we denote ρ(d), to assess monotonicity. Spearman's correlation is defined as the Pearson correlation between the rank orders of two variables, and measures the strength of their monotonic, rather than linear, association. As can be seen in Table 3, MAUVE and CHI achieve the best monotonicity results, followed by DC and FID.
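Computing ρ(d) reduces to a one-liner with scipy; here `levels[i]` holds the separation ℓ of the pair that produced `distances[i]` (the function name is ours).

```python
# Monotonicity measure rho(d): Spearman rank correlation between the
# separation level and the pooled distances, via scipy.
from scipy.stats import spearmanr

def monotonicity(levels, distances):
    """levels[i] is the separation level of the pair behind distances[i]."""
    rho, _ = spearmanr(levels, distances)
    return rho
```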

Metric Separability
It is desirable that (1) for a given ℓ, D_ℓ has low variability, and (2) for different ℓ_2 > ℓ_1, the samples D_ℓ1 and D_ℓ2 are distinguishable (e.g., by a two-sample difference test), particularly as ℓ_2 − ℓ_1 grows. Here, we measure how grouping by ℓ explains the variability in D_ℓ across ℓ. We perform a one-way fixed-effects analysis of variance (ANOVA) with ℓ as the unordered categorical treatment and D_ℓ as the numeric response. Often, an F-test is performed; if its p-value is low, a significant amount of the variance in the response (D_ℓ) can be explained by the treatment (ℓ). Since the F-test for any reasonable metric d should be significant, we instead use the similar ω² effect-size measure (Hays, 1963), which is bounded by ±1, to better assess them. It is defined as

ω² = (SS_treatment − df_treatment ⋅ MS_error) / (SS_total + MS_error),

where SS and MS are the sums of squares and mean squares, and df is the degrees of freedom, on a dataset of size n (here, n = |D(A, B, d)|). In the following we denote this measure as W(d).
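The ω² effect size can be computed directly from the per-ℓ groups of distance values; this is the standard one-way ANOVA formula, with a function name of our own choosing.

```python
# One-way ANOVA omega-squared effect size, computed from groups of distance
# values (one group per separation level).
def omega_squared(groups):
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    ss_total = sum((x - grand) ** 2 for g in groups for x in g)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ms_within = (ss_total - ss_between) / (n - len(groups))  # error mean square
    df_between = len(groups) - 1
    return (ss_between - df_between * ms_within) / (ss_total + ms_within)
```

Tightly clustered, well-separated groups yield ω² close to 1; indistinguishable groups yield values near (or slightly below) zero.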

Metric Linearity
Here we examine to what extent linear changes in the corpus content (ℓ) are manifested as linear changes in the distance function. To do so, we calculate the coefficient of determination (R²), where higher values indicate stronger linearity. This measure is denoted by L(d). Looking at the results in Table 3, we see that MAUVE achieves the best results, followed by DC and FID. It appears that this measure is more affected by the source corpora and by the resolution than the other measures.
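L(d) similarly reduces to the squared correlation of a linear regression of distance on ℓ; a scipy sketch (names ours):

```python
# Linearity measure L(d): coefficient of determination (R^2) of the linear
# regression of distance on separation level, via scipy.
from scipy.stats import linregress

def linearity(levels, distances):
    return linregress(levels, distances).rvalue ** 2
```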

Metric Time Efficiency
The time complexity of a metric is commonly perceived as less important, and is thus seldom reported (Sai et al., 2022). This aspect is becoming ever more important, especially due to the growing interest in time-consuming divergence frontier methods (Djolonga et al., 2020). Such metrics perform multiple measurements to estimate the area under a curve (similar to precision-recall curves for binary classification), with tunable but increasing resolution. We measure the time performance of a metric, T(d), in terms of 100 similarity-measurement operations per second ([100op/sec]) on a standard CPU machine. Note that the measurements reported in Table 3 do not include the sentences' embedding time. Predictably, methods that operate on the token level and avoid complex density estimation tend to achieve the best time performance. Among the distributional metrics, MAUVE achieves the best results, followed by FID. PR and DC produce similar results since both are based on similar manifold calculations.

Metric Accuracy
The assessment measures described above use the observed values of the metric distances (or similarities) between the KSC corpora; however, the actual values of the distances may not be known. Nevertheless, we still have partial information about the ordering of these values, which we will use to define an accuracy measure. This requires the notion of a 'judgement', defined as follows.

Comparing paired corpora distances
Suppose we do not know the observed values of d in D(A, B, d) for the paired corpora in KSC(A, B), pooled across ℓ. Nevertheless, we can still assume that certain pairwise distances are larger than others. For instance, the proportions of observations from A in c_2 and c_3 are more similar than the respective proportions in c_1 and c_4. Moreover, the interval of the first pair is 'contained' in that of the second, and thus the first pair should have the smaller distance. Thus, it should be true that d(c_2, c_3) ≤ d(c_1, c_4) in expectation (across repeated random sampling). In general, whenever the interval of one corpus pair contains (⊂) the interval of another, we expect the contained pair to have a smaller distance.
Given two pairs (c_i, c_j) and (c_q, c_r) of corpora, we can only reliably predict which of d(c_i, c_j) or d(c_q, c_r) is larger in expectation (a decision we call a 'judgement') if the interval of one pair contains the other's; formally, (c_i, c_j) contains (c_q, c_r), written (q, r) ⊂ (i, j), if i ≤ q and r ≤ j and i < r. The set J contains all and only such judgements; the judgement d(c_q, c_r) ≤ d(c_i, c_j) is correct when the second interval contains the first. For instance, consider comparing the pairwise distances of (c_1, c_6) and (c_5, c_7): even though the second interval's length (7 − 5 = 2) is smaller than the first's (6 − 1 = 5), because it is not contained in the first, we cannot necessarily say that d(c_5, c_7) ≤ d(c_1, c_6), since inter-corpus distance need not be proportional to interval length. This gives the most probabilistically-logical partial order on the similarities between corpora in a KSC collection that can be obtained without knowledge of the true pairwise d-distances between corpora.
Figure 3 shows a tree representation of the KSC-set pair containment relations, from which the set of judgements J can be extracted. The leaves are the KSC collection and the inner nodes (circles) represent the corpora tuples (c_i, c_j). The set J contains all judgements such that each node (i, j) is judged against all of its descendant nodes. Namely, if there is a path from node a to node b, there is a judgement between the two nodes, and the judgement is correct if d(b) ≤ d(a). The size of the judgement set grows quickly with k; for instance, |J| = 339 if k = 7, and 6053 if k = 12.

Accuracy
The metric accuracy is defined as the rate of correct judgements; formally,

A(d) = (1/|J|) Σ_{u∈J} 1(d(c_q, c_r) ≤ d(c_i, c_j)),

where u = ((c_q, c_r), (c_i, c_j)) is a judgement in J and 1(⋅) is the indicator function. Further, we propose a weighted version of the accuracy metric that assigns more weight to harder judgements. We define the hardness of judgement u as w(u) = 1/(ℓ_2 − ℓ_1), where ℓ_2 = j − i and ℓ_1 = r − q, and ℓ_2 > ℓ_1 by the definition of J. Formally,

A_w(d) = (Σ_{u∈J} w(u) 1(d(c_q, c_r) ≤ d(c_i, c_j))) / (Σ_{u∈J} w(u)).

While A and A_w are correlated, as one may expect, A_w typically returns lower values (see Table 3).
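A sketch of the judgement enumeration and the resulting accuracy measures. We implement containment as strict interval nesting, which is one plausible reading of the rule above; the paper's exact judgement-set construction (and hence its reported |J| values) may follow a slightly different enumeration, and all names here are ours.

```python
# Sketch of the accuracy measures A and A_w over a KSC judgement set.
from itertools import combinations

def judgements(k):
    """All ((q, r), (i, j)) where interval (q, r) is strictly inside (i, j)."""
    pairs = list(combinations(range(1, k + 1), 2))
    return [((q, r), (i, j))
            for (i, j) in pairs for (q, r) in pairs
            if i <= q and r <= j and (q, r) != (i, j)]

def accuracy(dist, k, weighted=False):
    """dist[(i, j)] holds the measured distance d(c_i, c_j)."""
    num = den = 0.0
    for (q, r), (i, j) in judgements(k):
        # hardness weight 1/(l2 - l1); l2 > l1 holds for strict containment
        w = 1.0 / ((j - i) - (r - q)) if weighted else 1.0
        num += w * (dist[(q, r)] <= dist[(i, j)])
        den += w
    return num / den
```

A metric whose measured distances grow exactly with interval length makes every judgement correctly (accuracy 1), while a metric that inverts the ordering gets every judgement wrong (accuracy 0).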

All distances were calculated on samples drawn independently for each corpus in KSC(A, B), so that no instance is shared between corpora. This was done to prevent perfect judgements by naively counting the number of common instances (e.g., by defining d(c_i, c_j) = |c_i Δ c_j|, where Δ denotes the symmetric difference). MAUVE, followed closely by FID, CHI and DC, achieves the highest accuracy results across resolutions and source corpora.

Size Robustness
We are also interested in capturing the sensitivity of a metric to sample size. To accomplish this, we need to quantify the convergence pace of d(a_s, b_s) to the asymptotic distance d(A, B), where a_s, b_s are samples from corpora A, B of increasing size s. Specifically, in our experiments s ∈ S = {50, 250, 450, ..., 2850}. The middle plot in Figure 2 shows convergence patterns of the different metrics to the asymptotic distance. The asymptotic distance is estimated by the mean of repeated (10) calculations of the distance on samples of size 3000 each from A and B, rather than on the full corpora. To quantify the metric size robustness, S(d), we calculate the mean absolute error |d(a_s, b_s) − d(A, B)| over all s ∈ S, normalized by the asymptotic distance:

S(d) = (1/|S|) Σ_{s∈S} |d(a_s, b_s) − d(A, B)| / d(A, B).

Similar to previous measures, the normalization is performed to remove the influence of metric scale and operation range. While our results demonstrate (Figure 2) that most of the metrics examined require around 1000 samples to closely estimate the asymptotic distance between the source corpora, their measured accuracy (A(d) and A_w(d)) is still fairly high even on the small corpora within the KSC, and can capture relative differences in corpus content.
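Under our reading of the formula, the size-robustness score reduces to a normalized mean absolute error; the function and argument names are hypothetical.

```python
# Size-robustness score: mean absolute deviation from the asymptotic
# distance, normalized by that distance.
def size_robustness(dists_by_size, asymptotic):
    """dists_by_size maps sample size s to the measured d(a_s, b_s)."""
    errs = [abs(d - asymptotic) for d in dists_by_size.values()]
    return sum(errs) / (len(errs) * asymptotic)
```

A metric that converges quickly to its asymptotic value scores near 0; larger scores indicate stronger sample-size sensitivity.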

Imbalance Robustness
Similarity metrics are frequently used to compare datasets of unequal sample size. In particular, when comparing real and generated corpora, the generated corpus is usually much larger than the real one. The imbalance robustness measure quantifies the effect of corpus-size imbalance on the metric's performance (see Figure 2, bottom).
Unsurprisingly, asymmetric metrics such as PR and DC are most affected by size imbalance. While PR, DC, and MAUVE were all originally designed to measure the disparity between human and generated data (and are thus asymmetric in the reference P and target Q), it seems that MAUVE overcomes the sensitivity to datasets of very unequal sizes. Interestingly, imbalance causes some metrics (CLASSIFIER and MAUVE) to underestimate the distance, while others (FID) overestimate it. When we compare the convergence patterns of PR and DC, both are similarly asymmetric. When we increase the reference size, PR diverges from the true asymptotic distance, while DC converges to it. The Imbalance Robustness score I(d) is calculated similarly to the size robustness score, except that |b_s| = N − s.

KSC Parameters As shown in Section 4.6, most metrics require at least n = 1000 samples to capture the true distance between two source domain corpora; however, our experiments use n = 100. This is because our measures are relative: we do not aim to calculate the true asymptotic distance between two domains, but to measure the metrics' robustness in detecting small changes in the compared corpora. Furthermore, if n is large, k must also be large to ensure the k corpora in the KSC set have small enough absolute consecutive differences. Note that small consecutive differences in KSC corpora are needed so that the measures in Section 4 have a high enough resolution and a large enough sample size of D_ℓ to properly differentiate the metric properties. In particular, this ensures the judgements (Section 4.5.1) used in the accuracy measures (A and A_w) are not too 'easy' to make correctly, in which case they would be less useful as a tool. For instance, a metric with 100% accuracy makes all correct judgements, e.g., that d(c_2, c_3) ≤ d(c_1, c_4). If k = 5, the gap (in expectation) between the compared pair distances is large, so the judgement is easy, and thus all metrics may have full accuracy. When k increases, the absolute consecutive differences between corpora fall, and thus the difficulty of the judgement increases. Some metrics will fail to make the judgement correctly (in a given random KSC), decreasing their accuracy; this allows us to better differentiate between the more and less accurate metrics. However, setting k too high results in a computationally prohibitive number |J| of judgements. Therefore, we opted to use the smaller n that is still sufficient to capture the quality and robustness of the investigated metrics.

Increasingly Fine-tuned Corpora
Here, we qualitatively investigate the metrics' ability to discriminate between generated and human text using the following procedure: we generated a sequence of equal-size synthetic corpora IFC = (g_1, g_2, ..., g_n) by sampling from a language model gradually fine-tuned on a specific source corpus A. Namely, in each iteration, a fine-tuning step is performed by training the language model for a single epoch on 1000 sentences randomly drawn from A, followed by a generation process to synthesize a corpus g_i containing around 1000 sentences. The name IFC, or "Increasingly Fine-tuned Corpora", was chosen to parallel the name KSC ("Known Similarity Corpora").
For each generated corpus g_i, we estimated the distance from A, i.e., d_i = d(A, g_i), ∀i ∈ [n]. While the true distances between these synthetic corpora and A are unknown, an effective metric should capture the decreasing distance between A and g_i as i increases, namely d_1 ≻ d_2 ≻ ⋅⋅⋅ ≻ d_n. Given our results showing low imbalance robustness for some metrics, we maintained same-size corpora when calculating corpora distances.
The results presented in Figure 5 show the gap between human and generated text captured by each metric in each iteration. To calculate the average self-distance of the reference corpus A, we take the mean distance between two randomly sampled sub-corpora r_1 and r_2 from A, i.e., d(A, A) = E_{r_1, r_2 ∼ A}[d(r_1, r_2)].
In our experiments we used two datasets representing different text domains: the banking77 dataset mentioned above, and a news dataset. The IFC set for the banking77 dataset was generated in an iterative two-step procedure similar to the one described in LAMBADA (Anaby-Tavor et al., 2020). This procedure first generates sentences conditioned on the label, then filters out sentences that are out-of-domain or incorrectly labeled. The IFC set for the news dataset, in contrast, was generated by fine-tuning the pre-trained GPT-2 medium model (Radford et al., 2019).
The results in Figure 5 show that CHI is less effective than the other metrics in capturing the gradual nature of the IFC. They also show that FID and IRPR remain able to discriminate between the original and generated corpora, even after many fine-tuning iterations. Interestingly, the ZIPF distance increases with the iterations. This indicates that the generated text, despite becoming semantically closer to the original over iterations, becomes less 'natural', in that its token frequencies deviate from those of human text and the reference corpus. This can be explained, at least in part, by the TTR measure. TTR is a standard word-diversity measure, calculated by dividing the number of unique words in a text by the total word count; a high TTR indicates significant lexical variation. Indeed, in the IFC of banking77, g_1's TTR of 0.295 is closer to the original dataset's TTR of 0.299 than g_40's TTR of 0.322.
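For reference, TTR as described above is simply the ratio of unique to total words:

```python
# Type-token ratio (TTR): unique words divided by total word count.
def ttr(tokens):
    return len(set(tokens)) / len(tokens)
```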

Conclusions
In this work, we propose a principled set of automatic measures for evaluating the quality of text dissimilarity metrics. By testing various metrics using our measures, we show that the measures do a good job of capturing the metrics' known characteristics, increasing our confidence in these measures; also, overall, recent metrics exhibit more favorable traits than their predecessors. The radar chart in Figure 4 shows that our measure scores correlate well with the recency of the compared distributional metrics (MAUVE ≻ DC ≻ PR ∼ FID), as well as with their known relative strengths.

Limitations
Although one of the main motivations for comparing corpora is to measure the semantic gap between human and generated short text, we used pairs of human text corpora from different domains to maintain controllably-distinct corpora in the KSC set. Future efforts to develop human and machine-generated benchmark pairs (Mille et al., 2021) will allow follow-up work to quantitatively measure the characteristics of semantic metrics on pairs of human and generated corpora using the approach devised in this paper.
Also, for more straightforward comparisons, we used only a single sentence-embedding model. However, as other studies have shown (e.g., GPT-2 (Radford et al., 2019) in Pillutla et al. (2021) and BERT (Devlin et al., 2018) in Lo (2019)), the quality of a corpus distance metric can be affected by the choice of embedding. In future extensions of this work, we plan to allow for multiple embeddings to obtain a more refined evaluation of the metrics.
An important limitation of this work is that it considers only English corpora of short text samples.We examined only a limited set of metrics and datasets, both of which we intend to extend.
In addition, we note that while our experiments calculate all KSC-based measures using a single KSC collection (the same n and k values), it could be favourable to use different n and k for different measures. For instance, the time performance is calculated using a single small dataset (n = 100). In future work, the time scalability of metrics can be more closely investigated by comparing their time performance on increasing corpus sizes.
As indicated in Section 4, creating KSC collections with large k creates an excessive number of judgements (e.g., for k > 15, |J| > 50,000), thus limiting the scalability of our method to smaller k, and thus smaller n, if high resolution is required. This would preclude comparing the robustness of metrics that require large samples. We intend to rectify this in future work by creating representative smaller judgement sets through careful sampling from the complete set.
As mentioned in Section 2, some of the investigated metrics were adapted to return a single value summarizing the distance between two corpora (e.g., averaging the precision and recall by the F 1 score).Further work is required to build measures that can compare metrics returning multiple values.

Figure 1 :
Figure 1: Construction of a k = 6 known-similarity corpora (KSC) collection from source corpora A and B. The corpus c_i is constructed by drawing n(k−i)/(k−1) and n(i−1)/(k−1) samples from A and B, respectively. The adjacent densities denote the distributions of the source and KSC-set corpora.
Let d(a, b) denote the distance from corpus a to b, according to metric d. Let D(A, B, d) — D for short — be the set of values of distance d for corpora pairs in KSC(A, B).

Figure 2 :
Figure 2: Top: Distance values (non-normalized) of corpora pairs in D_ℓ versus ℓ (n = 100, k = 12, |J| = 6053), pooled across 5 repetitions of KSC samples. The blue line indicates the regression and its confidence interval at 95%. Middle: Distance values calculated on corpora a_s and b_s of increasing size s, sampled from sources A and B, respectively. Bottom: Distance between imbalanced corpora a_s and b_s, where |b_s| = N − s and N = 2900. The x-axis represents s ∈ {50, 250, 450, ..., 2850} (repetitions = 10). In the middle and bottom figures, the green horizontal line indicates the asymptotic distance d(A, B). In all figures, A=clinc150 and B=banking77.

Figure 3 :
Figure 3: A tree representation of the judgements performed on the KSC collection given a metric d(⋅, ⋅), for calculating the accuracy (A, Section 4.5) measures. The leaves are the KSC collection and the inner nodes (circles) represent the corpora tuples (c_i, c_j). The set J contains all judgements such that each node (i, j) is judged against all of its descendant nodes. Namely, if there is a path from node a to node b, there is a judgement between the two nodes, and the judgement is correct if d(b) ≤ d(a).

Figure 4 :
Figure 4: Leading metrics characterization radar chart. Mean results from Table 3 for A=clinc150, B=banking77, and k = 12, excluding time efficiency to maintain scale.

Figure 5 :
Figure 5: Similarity between the reference corpus and iteratively fine-tuned corpora g_i. Orange dots show the similarity between samples of generated text in iteration i and the source dataset. The blue line indicates the regression and its confidence interval at 95%. The green horizontal line indicates the mean estimate of the distance between two random samples of the original corpus. The top figure shows iterative generation on an unlabeled news-headlines dataset. The bottom shows iterative conditional generation using LAMBADA (Anaby-Tavor et al., 2020) trained on the banking77 dataset.

Table 1 :
Summary of investigated text similarity metrics.

Table 2 :
Datasets used as source corpora in our benchmark. Although some of the datasets are partitioned and annotated with labels, in our experiments, unless mentioned otherwise, we ignored those labels.

Table 3 :
Summary of metric evaluation scores on two pairs of source datasets at low (k = 7) and high (k = 12) resolution KSC (n = 100). Best results with differences below .015 are marked in bold. T(d) units are [100op/sec]. MUV. stands for MAUVE and CLS. for CLASSIFIER. In the top table, A=clinc150 and B=banking77; in the bottom table, A=atis and B=yahoo. The average results of 5 repetitions are reported for all measures except size and imbalance robustness, for which the number of repetitions is 10. More statistical details are provided in Figure 6 in the Appendix.