Exact Paired-Permutation Testing for Structured Test Statistics

Significance testing—especially the paired-permutation test—has played a vital role in developing NLP systems by providing confidence that the difference in performance between two systems (i.e., the test statistic) is not due to luck. However, practitioners rely on Monte Carlo approximation to perform this test due to the lack of a suitable exact algorithm. In this paper, we provide an efficient exact algorithm for the paired-permutation test for a family of structured test statistics. Our algorithm runs in O(GN (log GN)(log N)) time, where N is the dataset size and G is the range of the test statistic. We found that our exact algorithm was 10x faster than the Monte Carlo approximation with 20000 samples on a common dataset.


Introduction
How confident can we be that system U is more accurate than system V? Questions of this form are widespread in natural language processing (Dietterich, 1998; Koehn, 2004; Ojala and Garriga, 2010; Clark et al., 2011; Berg-Kirkpatrick et al., 2012; Dror et al., 2018), and statistical hypothesis testing provides answers (Lehmann and Romano, 2005). In this paper, we study the paired-permutation test (Good, 2000), a commonly used hypothesis test in NLP because it makes no assumptions about the distribution of the data or the evaluation metric used to compare the two systems (Yeh, 2000; Dror et al., 2018, 2020; Deutsch et al., 2021). The paired-permutation test checks whether a test statistic is significant by evaluating the probability that a value at least as large as the observed statistic would occur if system outputs were randomly swapped. Thus, an exact algorithm for evaluating the paired-permutation test involves a summation over all 2^N possible swaps. Without any assumptions on the test statistic, we can only compute this sum exactly in O(2^N) time. Thus, practitioners often resort to running a Monte Carlo (MC) approximation, which replaces the summation with K ≪ 2^N randomly sampled swaps. Although the MC approximation is often practical, it unfortunately introduces additional error when determining the significance of a test (Serlin, 2000; Koehler et al., 2009). This paper proposes a family of additively structured, integer-valued test statistics. Test statistics of this form admit an efficient exact algorithm that leverages the fast Fourier transform (Cooley and Tukey, 1965; Cormen et al., 2022) to run in O(GN (log GN)(log N)) time, where N is the size of the dataset and G is the range of the test statistic. We compare the efficiency of our exact method to the MC approximation for comparing part-of-speech taggers on the Universal Dependencies dataset (Nivre et al., 2018).
Surprisingly, our exact algorithm is faster than the MC approximation: given 10000 sentences, our algorithm is 10x faster than MC with 20000 samples and 3x faster than MC with 5000 samples, taking ≈ 0.1 seconds.

Paired-Permutation Testing
The paired-permutation test (Good, 2000), the focus of this work, is a common null hypothesis significance test that has a natural application to many problems in NLP (Peyrard et al., 2021). The test attempts to reject the null hypothesis, described below, at significance level α; typically α = 0.05.
Preliminaries. Suppose we want to compare the performances of two systems U and V where each system was evaluated on a dataset of the same N entries. We place the entries of U and V into a pair of arrays of length N denoted u and v.
The null hypothesis. The goal of a paired-permutation test is to test whether the entries u and v are independent of the labels U and V themselves. The reason that this is the question we ought to care about is that, fundamentally, if the system label (in this case, U or V) provides no information (in the sense of mutual information) about the entry, then we should not prefer one system to another. And, from basic information theory, we know that two random variables (RVs) have no mutual information iff they are independent. So, independence of the system's label and the system's set of entries is the right thing to inquire about. In the language of frequentist testing, the hypothesis that a system's labels and individual entries are independent is known as the null hypothesis. And, under a paired-permutation test, the goal is to ascertain whether the data (the observed entries u and v) provide enough evidence to reject the null hypothesis, i.e., to conclude that the label of a system shares information with the quality of its individual entries, and that they are indeed dependent.
The null distribution. Next, in order to attempt to reject the null hypothesis, we require a distribution over (hypothetical) pairs of entries u′ and v′ whose individual entries are independent of the system labels U and V. This is achieved through the construction of RVs U∅ and V∅, whose joint distribution can be used to sample our hypothetical u′ and v′. Traditionally, P[U∅, V∅] is referred to as the null distribution. A paired-permutation test provides a simple recipe for constructing such an RV pair. The first step is to make an entry-wise independence assumption: we define the joint probability P[U∅, V∅] = ∏_{n=1}^{N} P[U∅_n, V∅_n]. This means that the prediction a system makes for the n-th entry is independent of the m-th entry when n ≠ m. In the second step, we further define the entry-wise joint distribution as

P[U∅_n = u_n, V∅_n = v_n] = P[U∅_n = v_n, V∅_n = u_n] = 1/2.

In words, P[U∅_n, V∅_n] is a uniform distribution over swapping U and V's prediction for the n-th entry. All in all, this definition of P[U∅, V∅] as the null distribution gives us a uniform distribution over all 2^N ways of swapping the labels and the individual entries of the observed predictions u and v. And, importantly, the joint distribution P[U∅, V∅] encodes the fact that the sampled entries are independent of the system label.
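To make the null distribution concrete, here is a minimal Python sketch (the helper name is ours, not from the paper's pseudocode) that draws one hypothetical pair (u′, v′) by flipping an independent fair coin per entry:

```python
import numpy as np

def sample_null(u, v, rng):
    """Draw one (u', v') from the null distribution: each paired entry
    is independently swapped with probability 1/2."""
    u, v = np.asarray(u), np.asarray(v)
    swap = rng.random(len(u)) < 0.5      # fair coin per entry n
    u_prime = np.where(swap, v, u)       # swapped entries take v's value
    v_prime = np.where(swap, u, v)
    return u_prime, v_prime
```

Note that every sample preserves the per-entry multiset {u_n, v_n}, which is exactly the invariant shared by all 2^N swap patterns.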
The test statistic and the p-value. The final ingredient we need in a null hypothesis test is a test statistic, whose job it is to provide a summary of samples (u′, v′) ∼ P[U∅, V∅] and thereby facilitate comparison of samples from the null distribution P[U∅, V∅] with the observed entries (u, v). In this work, we define a test statistic as a function t(u, v). In principle, we can choose any test statistic t that allows us to distinguish u and v, i.e., we have t(u, v) = 0 ⟺ u = v. Now, given observed entries u and v, the p-value is defined as

p(u, v) = P[t(U∅, V∅) ≥ t(u, v)],

where t(u, v) is the observed effect. In words, the p-value is the probability of observing a test statistic t(u′, v′) with a value at least as large as the observed effect when (u′, v′) are sampled from the null distribution. Recall that the system labels and entries are independent under the null distribution by construction, so the p-value tells us, under the independence assumption, how likely such a large test statistic would have been observed. The test says that we have sufficient evidence to reject the null hypothesis when p < α. These concepts are depicted in Fig. 2.

We now discuss a common special case of the paired-permutation test where the test statistic has a particular structure. In §4, we show how to exploit this structure to develop an efficient algorithm to exactly compute the test. The specific assumption we make is that the test statistic is an integer-valued, additively decomposable function. Formally, this assumption means that we can rewrite t as

t(u, v) = h(g(u, v))    (3)

for some function h and additively decomposable function g(u, v) def= Σ_{n=1}^{N} g(u_n, v_n), such that g is an integer-valued function with a range of size O(G). The structure of (3) will allow us to derive an efficient algorithm for evaluating the test. To this end, define the RVs Z_n def= g(U∅_n, V∅_n) with PMFs f_n, and let S def= Σ_{n=1}^{N} Z_n. Under the null distribution, each Z_n takes the value →ξ_n def= g(u_n, v_n) (stay) or ←ξ_n def= g(v_n, u_n) (swap), each with probability 1/2, so the domain of each PMF, dom(f_n), contains at most two elements. Let S def= dom(S). Clearly, |S| = O(GN), as we have a sum over N RVs Z_n, each with domain size O(G).
The following theorem shows that we can evaluate the distribution of t(U∅, V∅) from the distribution of S, which we will show in the next section is efficient to compute.
Theorem 1. For any test statistic t that factorizes as in (3) with h and g, the distribution of the test statistic under the null distribution decomposes as

P[t(U∅, V∅) = τ] = Σ_{ξ∈S} f_S(ξ) 1{h(ξ) = τ}.

Proof. Since t(U∅, V∅) = h(Σ_{n=1}^{N} g(U∅_n, V∅_n)) = h(S), the claim follows by summing f_S over all ξ with h(ξ) = τ. ∎

Algorithm 1: Monte Carlo approximation algorithm for the paired-permutation test. (Sketch: compute the observed effect t(u, v), sample K random stay-or-swap actions from each local PMF f_n, and count how often the resulting statistic meets or exceeds the observed effect.)

Example.
A common test statistic is the difference in accuracy, in which each entry u_n ∈ {1, ..., C}^G, where C is the number of classes and, in this case, G is the maximum length of an entry sequence (or one if each entry has a binary accuracy value). Then g(u_n, v_n) ∈ {−G, ..., G} is the difference in the number of correct predictions between the individual entries u_n and v_n. We can additionally define the function h as either h(x) = x or h(x) = |x|, depending on whether we want a one-tailed or two-tailed significance test.
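As a minimal sketch of such a statistic (the encoding is ours: each entry is represented as a 0/1 token-level correctness vector rather than raw tags):

```python
import numpy as np

def g(u_n, v_n):
    """Per-entry difference in correct-prediction counts; with 0/1
    correctness vectors of length <= G, g(u_n, v_n) lies in {-G,...,G}."""
    return int(np.sum(u_n)) - int(np.sum(v_n))

def t(u, v, h=lambda x: x):
    """Additively decomposable statistic t(u, v) = h(sum_n g(u_n, v_n));
    h = identity gives a one-tailed statistic, h = abs a two-tailed one."""
    return h(sum(g(u_n, v_n) for u_n, v_n in zip(u, v)))
```
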
A Monte Carlo paired-permutation test. To the best of our knowledge, no practical exact algorithm for the paired-permutation test has been given in the literature. Thus, most practical implementations of the paired-permutation test use an MC approximation, whereby one randomly samples swaps from the null distribution P[U∅, V∅] to approximate the p-value. We give this MC algorithm as monte_carlo in Alg. 1, which runs in O(KN) time, where K is the number of samples taken.
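A minimal Python version of this MC approximation (a sketch of the idea behind Alg. 1 with our own names, not a transcription of its pseudocode):

```python
import numpy as np

def monte_carlo_p(u, v, t, K=10000, seed=0):
    """Approximate the p-value with K random swap patterns: count how
    often the resampled statistic meets or exceeds the observed one."""
    rng = np.random.default_rng(seed)
    t_obs = t(u, v)                       # observed effect
    hits = 0
    for _ in range(K):
        swap = rng.random(len(u)) < 0.5   # stay/swap per entry
        u_p = [v[n] if s else u[n] for n, s in enumerate(swap)]
        v_p = [u[n] if s else v[n] for n, s in enumerate(swap)]
        hits += t(u_p, v_p) >= t_obs
    return hits / K
```

Each iteration costs O(N) statistic evaluations, giving the O(KN) total runtime noted above.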

An Exact Paired-Permutation Test
In this section, we describe two exact, efficient algorithms for computing the p-value under the paired-permutation test for any structured test statistic (see (3)). Our algorithms hinge on an important theorem in probability: the PMF of the sum of independent RVs is the convolution of their individual PMFs (Ross, 2008, p. 252). Let f_S denote the PMF of S. Since the RVs Z_n are independent, we have that

f_S = f_1 ∗ ··· ∗ f_N,

where ∗ is the discrete convolution operator. For functions f_i, f_j ∈ S → R, f_i ∗ f_j ∈ S → R is given by the following expression

(f_i ∗ f_j)(ξ) = Σ_{ξ′∈S} f_i(ξ′) f_j(ξ − ξ′).

Pseudocode for this algorithm is given as exact_perm_test in Alg. 2. We omit the details of evaluating the convolution in exact_perm_test and discuss methods for efficient convolution in the remainder of this section.
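As a quick numerical illustration of this convolution identity (using NumPy's dense `np.convolve`; supports are consecutive integers starting at 0, and offsets add under convolution):

```python
import numpy as np

# Z1 is uniform on {0, 1}; Z2 is uniform on {0, 2}.  Their PMFs as
# dense arrays over consecutive integer supports:
f1 = np.array([0.5, 0.5])
f2 = np.array([0.5, 0.0, 0.5])

# The PMF of Z1 + Z2 over {0, 1, 2, 3} is the convolution of f1 and f2,
# which here comes out uniform on {0, 1, 2, 3}.
f_sum = np.convolve(f1, f2)
```
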
Theorem 2. For any two entries u and v, and any test statistic t that factorizes as in (3), the p-value is given by

p(u, v) = Σ_{ξ∈S} f_S(ξ) 1{h(ξ) ≥ t(u, v)}.

Algorithm 3: Dynamic programming algorithm to compute the PMF of S as the N-fold convolution of the PMFs f_1, ..., f_N (each iteration loops over ξ ∈ dom(f_n), with |dom(f_n)| ≤ 2, and ξ ∈ dom(F_{n−1}), and finally returns F_N). Note that we only need to store F_n and F_{n−1} at any given time.
In the following subsections, we present O(GN^2)-time and O(GN (log GN)(log N))-time algorithms for performing this N-fold convolution.
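Given the PMF of S, the sum in Theorem 2 is short to evaluate. A minimal sketch, assuming a dense array f_S aligned with an explicit integer support list (names are ours):

```python
def p_value_from_pmf(f_S, support, h, t_obs):
    """Exact p-value: total probability mass of values xi in the support
    of S with h(xi) >= t(u, v), per Theorem 2."""
    return float(sum(p for xi, p in zip(support, f_S) if h(xi) >= t_obs))
```
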

Convolution by Dynamic Programming
Our first approach builds a dynamic program (DP) that takes advantage of the sparsity of our RVs to efficiently construct the PMF f_S. We do this by constructing a PMF array F_n ∈ R^S for n ∈ {0, ..., N} (we use n = 0 as an initialisation base case) such that F_n(ξ) = (f_1 ∗ ··· ∗ f_n)(ξ). As we apply each convolution, we know that f_n is only non-zero at →ξ_n and ←ξ_n, and so we can run each convolution in O(GN) time. The pseudocode for this approach is given as convolve_DP in Alg. 3.

Theorem 3. convolve_DP computes f_S in O(GN^2) time and O(GN) space.

Proof. The proof of correctness of convolve_DP is given in App. A. Each F_n has O(GN) elements, and so convolve_DP clearly runs in O(GN^2) time. Furthermore, at any iteration, we only require F_n and F_{n−1}, and so we can execute convolve_DP in O(GN) space. ∎
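A dense-array sketch of this left-to-right DP follows; `np.convolve` plays the role of one convolution step (the paper's version additionally exploits that each f_n has at most two non-zero entries, which this sketch does not):

```python
import numpy as np

def convolve_dp(pmfs):
    """Fold the PMFs together one at a time: F_n = F_{n-1} * f_n,
    starting from F_0, a point mass at zero."""
    F = np.array([1.0])          # F_0: the identity element for convolution
    for f_n in pmfs:
        F = np.convolve(F, f_n)  # one DP step
    return F
```
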

Convolution by FFT
Our second approach uses the fast Fourier transform (FFT; Cooley and Tukey, 1965; Cormen et al., 2022) to evaluate the convolutions. Using this method means that each convolution takes O(GN log GN) time and O(GN) space. We further exploit the commutativity and associativity of convolution to perform the N-fold convolution in O(log N) rounds of convolutions using a recursive program. The pseudocode for this approach is given as convolve_FFT in Alg. 4.

Theorem 4. For any RVs Z_1, ..., Z_N with PMFs f_1, ..., f_N, convolve_FFT computes f_S = f_1 ∗ ··· ∗ f_N in O(GN (log GN)(log N)) time.

Proof. The correctness of convolve_FFT is due to Cooley and Tukey (1965). The recursion of convolve_FFT halves the number of PMFs at each level, and so there are O(log N) levels, each costing O(GN log GN). Therefore, convolve_FFT runs in O(GN (log GN)(log N)) time. ∎

The correctness and complexity bounds of exact_perm_test then follow from Theorem 2 and Theorem 4; specifically, Line 7 can be executed using convolve_FFT.
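A sketch of this balanced recursion, with the pairwise convolutions implemented via NumPy's real FFT (helper names are ours):

```python
import numpy as np

def fft_convolve(a, b):
    """Linear convolution of two real arrays via the FFT."""
    n = len(a) + len(b) - 1
    return np.fft.irfft(np.fft.rfft(a, n) * np.fft.rfft(b, n), n)

def convolve_fft(pmfs):
    """N-fold convolution by recursively halving the list of PMFs,
    giving O(log N) levels of FFT convolutions."""
    if len(pmfs) == 1:
        return np.asarray(pmfs[0], dtype=float)
    mid = len(pmfs) // 2
    return fft_convolve(convolve_fft(pmfs[:mid]), convolve_fft(pmfs[mid:]))
```

Each level of the recursion convolves arrays of total length O(GN), which is where the O(GN log GN) per-level cost comes from.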

Experiments
We demonstrate the efficiency of our exact algorithms by simulating paired-permutation tests between the accuracies of two systems. In order to have some control over the p-value, N, and G (the maximum length of a sentence), we randomly generate our two system outputs from a measured distribution. Specifically, we use the Stanza (Qi et al., 2020) part-of-speech tagging accuracy statistics when evaluating on the English Universal Dependencies (UD) test set (Nivre et al., 2018).3 We sample our outputs from the normal distribution whose mean and standard deviation match the rates of Stanza's observed accuracy. We further sample the length of each sample sentence according to the distribution of lengths in the test set. These distributions are provided in Tab. 1.

Metric           Mean     Standard Dev.
Accuracy         0.9543   0.1116
Sentence length  12.08    10.60
Tab. 1: Accuracy and sentence-length statistics on the English UD test set.

We show that, empirically, the exact test is more efficient than the MC approximation; this is evinced in Fig. 1, where we compare the runtime of exact_perm_test using convolve_DP and convolve_FFT against monte_carlo for various sample sizes (K ∈ {5000, 10000, 20000, 40000}). We note that using convolve_DP is already more efficient than running monte_carlo with K = 40000 and K = 20000 (up to N ≈ 8000). Furthermore, convolve_FFT is much faster, and we observe a speed-up between 3x and 30x, depending on the number of samples K. Indeed, using convolve_FFT allows us to perform an exact paired-permutation test for N = 10000 in approximately one-tenth of a second.

2 We note that the log N factor in the space complexity may be eliminated by tail recursion elimination (Muchnick, 1998).
3 The code and pre-trained model are both freely accessible at https://github.com/stanfordnlp/stanza.

Conclusion
We presented an algorithm to compute the exact p-value of a paired-permutation test for the case of a family of structured test statistics, including the difference in accuracy. Our algorithm runs in O(GN (log GN )(log N )) time and requires O(GN ) space. We empirically show that our exact algorithm is faster than Monte Carlo approximation techniques. The theory of our work is extensible to a more general class of test statistics which we discuss in App. B. We hope that this work encourages the use of exact paired-permutation tests in future NLP research.

Ethical Concerns
We foresee no ethical concerns in this work.

A Proof of Correctness of convolve_DP
We prove the correctness of convolve_DP using the following lemma.

Lemma 1. For n ∈ {1, ..., N}, let f_{:n} def= f_1 ∗ ··· ∗ f_n. Then, after the n-th iteration of convolve_DP, F_n(ξ) = f_{:n}(ξ) for all ξ ∈ S.

Proof. By induction on n. The base case F_0 is the point mass at zero, the identity element for convolution. For the inductive step, assume F_{n−1}(ξ) = f_{:n−1}(ξ) for all ξ ∈ S; then f_{:n}(ξ) = Σ_{ξ′∈dom(f_n)} f_n(ξ′) F_{n−1}(ξ − ξ′). This is exactly the construction in the for-loop between Line 7 and Line 8. Therefore, F_n(ξ) = f_{:n}(ξ). ∎
As F_N contains the N-fold convolution f_1 ∗ ··· ∗ f_N, convolve_DP is correct.

B Paired-Permutation Test for Higher-order Test Statistics
In this section, we extend our approach for the paired-permutation test to test statistics that are functions of m additively decomposable functions. In symbols, this assumption means that we can rewrite t as

t(u, v) = h(g_1(u, v), ..., g_m(u, v))    (12)

for some function h and integer-valued, additively decomposable functions g_i (i.e., g_i(u, v) def= Σ_{n=1}^{N} g_i(u_n, v_n)). We now define →ξ_n and ←ξ_n as m-tuples,

→ξ_n def= (g_1(u_n, v_n), ..., g_m(u_n, v_n)),  ←ξ_n def= (g_1(v_n, u_n), ..., g_m(v_n, u_n)),

and so each RV Z_n has the same PMF as in (4b). We can then define an analogous function to exact_perm_test for the case of m additively decomposable functions. We give pseudocode for this as exact_perm_test_m in Alg. 5. The convolution algorithms convolve_DP and convolve_FFT can both be used to perform the convolution step in Line 7.
Theorem 5. For any two entries u and v, and any test statistic t that factorizes as in (12), the p-value is given by p(u, v) = Σ_{ξ∈S} f_S(ξ) 1{h(ξ) ≥ t(u, v)}, where ξ now ranges over m-tuples.

Example.
A common example of a test statistic that requires multiple additively decomposable functions is the difference in F1 scores. As with accuracy, each entry u_n ∈ {1, ..., C}^G, where G is the maximum length of an entry sequence. Let tp(u_n) and in(u_n) be the number of true positive and incorrect predictions made in entry u_n, respectively. Then the difference in F1 scores can be given as

t(u, v) = 2 tp(u) / (2 tp(u) + in(u)) − 2 tp(v) / (2 tp(v) + in(v)),

where tp(u) def= Σ_{n=1}^{N} tp(u_n) and in(u) def= Σ_{n=1}^{N} in(u_n).

Algorithm 5: Pseudocode to find the exact p-value for the paired-permutation test for test statistics comprised of m additively decomposable functions (compute the local effects →ξ_n = (g_1(u_n, v_n), ..., g_m(u_n, v_n)) for each n, convolve the local PMFs, and sum up the PMF to get p).

We can therefore use four additively decomposable functions, g_1 to g_4, that decompose such that g_1(u_n, v_n) = g_3(v_n, u_n) = tp(u_n) and g_2(u_n, v_n) = g_4(v_n, u_n) = in(u_n). Our h function then takes four arguments and can be defined as

h(x_1, x_2, x_3, x_4) = 2x_1 / (2x_1 + x_2) − 2x_3 / (2x_3 + x_4).

We can additionally apply an absolute value to h to check for the absolute difference in F1 scores; doing this would make the significance test two-tailed rather than one-tailed.
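In Python, this four-argument h is direct (a sketch; the inputs are the aggregate tp and in counts of each system, and we assume neither denominator is zero):

```python
def h_f1(x1, x2, x3, x4):
    """Difference in F1 scores, with F1 = 2*tp / (2*tp + incorrect):
    (x1, x2) are system U's summed tp/in counts, (x3, x4) system V's."""
    return 2 * x1 / (2 * x1 + x2) - 2 * x3 / (2 * x3 + x4)
```

Wrapping the return value in `abs(...)` gives the two-tailed variant described above.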