Replicating and Extending “Because Their Treebanks Leak”: Graph Isomorphism, Covariants, and Parser Performance

Søgaard (2020) obtained results suggesting that the fraction of trees in the test data that are isomorphic to trees in the training set accounts for a non-trivial amount of the variation in parser performance. As in many other statistical analyses in NLP, the results were based on evaluating linear regressions. However, the study had methodological issues and was undertaken with a small sample size, leading to unreliable results. We present a replication study in which we also bin sentences by length and find that only a small subset of sentences vary in performance with respect to graph isomorphism. Further, the correlation observed between parser performance and graph isomorphism in the wild disappears when controlling for covariants. However, in a controlled experiment, where covariants are kept fixed, we do observe a correlation. We suggest that conclusions drawn from statistical analyses like this need to be tempered and that controlled experiments can complement them by more readily teasing factors apart.


Introduction
We undertake a replication study of Søgaard (2020), which introduced graph isomorphism (DUG: directed unlabelled graph isomorphism) as a means of explaining differences in parser performance across treebanks. It measures the proportion of graphs in the test set that were also observed in the training data. It is intuitive that this is likely related to parser performance.
However, DUG has two important covariants. The size of the training data impacts DUG because the smaller a treebank is, the less likely it is that there will be much crossover between training and test data. DUG is also tied to the mean sentence length in the test data: shorter sentences are much more likely to have a tree structure already seen in training, as there are fewer possible trees, and the reverse is true for longer sentences, e.g. there are 12,826,228 possible trees for a sentence with 20 tokens.
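
The counts quoted here and below match the number of unlabelled rooted trees on n nodes (OEIS A000081). Assuming that is the intended quantity, a minimal sketch of the standard Euler-transform recurrence for this sequence reproduces the figures (our illustration; the original study only quotes the counts):

```python
# Count unlabelled rooted trees on n nodes (OEIS A000081) via the
# standard Euler-transform recurrence:
#   a(n+1) = (1/n) * sum_{k=1..n} [ sum_{d|k} d*a(d) ] * a(n-k+1)
def rooted_tree_counts(n_max):
    a = [0] * (n_max + 1)
    a[1] = 1
    for n in range(1, n_max):
        total = 0
        for k in range(1, n + 1):
            # s = sum of d * a(d) over the divisors d of k
            s = sum(d * a[d] for d in range(1, k + 1) if k % d == 0)
            total += s * a[n - k + 1]
        a[n + 1] = total // n  # the sum is always divisible by n
    return a

a = rooted_tree_counts(21)
print(a[5], a[20], a[21])  # 9  12826228  35221832
```

These are exactly the figures cited in this paper for sentences of length 5, 20, and 21, respectively.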

Related Work
There is a long history of investigating the causes of variance in parser performance. The effect of training data size on parser performance is well attested (Sagae et al., 2008; Falenska and Çetinoğlu, 2017; Strzyz et al., 2019; Dehouck et al., 2020). Sentence length has also been observed to impact performance (McDonald and Nivre, 2011). One likely factor behind this is that different sentence lengths have different dependency distance distributions (Ferrer-i-Cancho and Liu, 2014), which in turn affects parsing, as longer dependencies are typically harder to parse (Anderson and Gómez-Rodríguez, 2020; Falenska et al., 2020). Others have offered explanations based on linguistic characteristics such as morphological complexity (Dehouck and Denis, 2018; Çöltekin, 2020), part-of-speech bigram perplexity (Berdicevskis et al., 2018), and word order freedom (Gulordava and Merlo, 2016).
The history of reproduction and replication in NLP is not so well established, with only a few studies in recent years, e.g. on Universal Dependency (UD) parsing (Çöltekin, 2020) and on automatic essay scoring systems (Huber and Çöltekin, 2020).
Linear techniques, such as linear regression models or correlation coefficients, are commonly used for statistical analyses of NLP systems. They have been used to model constituency parser performance (Ravi et al., 2008), to evaluate what affects annotation agreement (Bayerl and Paul, 2011), to investigate what impacts statistical MT systems (Guzman and Vogel, 2012), and to examine what impacts performance on span-identifying tasks (Papay et al., 2020).

Table 1: Issues with using a multivariable linear model and cross-validation (CV) to evaluate explained variance. The first set of columns (Original) uses the exact same settings as the original paper (namely one CV split and the original seed) on the original data (CoNLL18), and the predictions from UDPipe 1.2 and UDPipe 2.0 for the extended data. The DUG explained variance is much smaller for the new data. The second set of columns shows the same analysis averaged over 10 different seeds for the CV splits. The explained variances are almost all negative, which means the linear fit failed.

Our analysis will be impactful in a broader sense, as the conclusions here can be applied in many subareas of NLP, namely: the sensitive handling of covariants by using partial coefficients, controlled experiments, or signal subtraction; a strong adherence to visualising data; and considering whether the phenomena under consideration are likely to be sensitive to sentence length, as is often the case in NLP, and if so undertaking a sentence-length binning analysis to complement coarser analyses.

Original paper
Søgaard (2020) attempted to explain differences in parser performance across treebanks by using DUG and also undirected unlabelled graph isomorphism (UUG). Two graphs are isomorphic if there is a renaming of vertices that makes them equal. The first step in calculating DUG (or UUG) is to collect the set of unique graphs that occur in the training data. In the original paper, this set of graphs is referred to as the isomorphisms. Once the training isomorphisms are obtained for a given treebank, the number of graphs in the test data that are members of one of these equivalence classes is counted. The final value is the proportion of test instances that are isomorphic to the training data, giving a value between 0 (all test instances are unique) and 1 (no unique test instances). The analysis was undertaken using a small sample of treebanks from the CoNLL 2018 shared task, using the LAS of the top performing system for each treebank to measure parser performance (Zeman et al., 2018). The impact of DUG (or UUG) on parsing performance was evaluated by fitting a linear regression to the data with DUG as the predictor variable. A number of other potential measurements that could explain parser performance were also taken into consideration, but only as alternative explanations and not as covariants; the exception was the size of the training data, which was used as a covariant. The explained variance and absolute error for each linear regression fit were reported using three-fold cross-validation. The results suggested that DUG was the most strongly correlated of the measurements evaluated. We show that this result does not hold up when accounting for covariants, that the cross-validation method used with the linear regression is not robust for an analysis like this, and that by controlling the main covariants of DUG we can observe a more trustworthy correlation with parser performance.
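
To make this procedure concrete, the following is a minimal sketch reconstructed from the description above (not the original implementation). Trees are represented as CoNLL-U style head lists, and the AHU canonical form for rooted unordered trees serves as the exact isomorphism test for DUG:

```python
from collections import defaultdict

def canonical_form(heads):
    """AHU canonical encoding of a dependency tree given as a head list
    (heads[i] is the head of token i+1; 0 marks the root)."""
    children = defaultdict(list)
    root = None
    for dep, head in enumerate(heads, start=1):
        if head == 0:
            root = dep
        else:
            children[head].append(dep)

    def encode(node):
        # Two rooted trees are isomorphic iff their encodings are equal.
        return "(" + "".join(sorted(encode(c) for c in children[node])) + ")"

    return encode(root)

def dug(train_trees, test_trees):
    """Proportion of test trees isomorphic to some training tree."""
    train_isomorphisms = {canonical_form(t) for t in train_trees}
    hits = sum(canonical_form(t) in train_isomorphisms for t in test_trees)
    return hits / len(test_trees)

train = [[0, 1, 2], [0, 1, 1]]    # a chain; a root with two dependents
test = [[0, 1, 2], [0, 1, 1, 1]]  # the chain recurs; the 4-token tree is new
print(dug(train, test))           # 0.5
```

For UUG one would canonicalise the unrooted tree instead, e.g. by rooting it at its centre(s) before encoding.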

Analysis and results
We evaluate directed unlabelled graph isomorphism (DUG), as it was more strongly related to parser performance in the original paper.
Main covariants. We focus on the two main covariants of DUG: training data size (in sentences) and the mean sentence length of the test data, L_test.

Data and parsers
The data from the original paper consists of 33 UD treebanks, with LAS taken from the respective top performing parser from the CoNLL 2018 shared task (Zeman et al., 2018). Note that these systems are all variations of the biaffine graph-based parser of Dozat and Manning (2017). For replication, we also use a neural transition-based system, UDPipe 1.2 (Straka et al., 2016), using UD 2.4 models and UD v2.5 (Zeman et al., 2019), and a neural graph-based system, UDPipe 2.0 (Straka, 2018), using UD 2.6 models and UD v2.7 (Zeman et al., 2020). This results in 94 treebanks for UDPipe 1.2 and 90 for UDPipe 2.0; the difference is due to issues running the web-based UDPipe 2.0 on larger files.

Reproduction and replication
In the original paper, the analysis focuses on fitting a multivariable linear regression to the data to control for covariants. However, the models only used training size plus one other variable as features. Further, cross-validation was used so as to avoid over-fitting. While over-fitting isn't directly an issue, the metrics that are typically reported overestimate the variance explained by a linear model, e.g. explained variance, η², or R² (Lane et al., 2007). Averaging η² over different splits can potentially offset this positive bias, but it requires a certain amount of data to be reliable. In Table 1, we show the results using the original data from Søgaard (2020). The values shown in the left-most column are exact reproductions of the original values; only the value for L_test differs, as the original paper appears to have used a normalised value. We also show η² for the linear model using all variables, which is negative, i.e. the fit failed.
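
The instability is easy to demonstrate in isolation. The following sketch uses synthetic data of roughly the CoNLL 2018 sample size (a stand-in, not the paper's measurements) to show how the CV-averaged explained variance swings with the seed used for the splits:

```python
# Seed sensitivity of CV-averaged explained variance on a small sample.
# Synthetic data: a weak linear signal plus noise, n ~ 33 treebanks.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n = 33
X = rng.normal(size=(n, 2))                        # e.g. size and DUG
y = 0.5 * X[:, 0] + rng.normal(scale=1.0, size=n)  # weak signal, much noise

for seed in range(5):
    cv = KFold(n_splits=3, shuffle=True, random_state=seed)
    scores = cross_val_score(LinearRegression(), X, y,
                             scoring="explained_variance", cv=cv)
    print(seed, round(scores.mean(), 3))
# The averaged score varies widely across seeds and can go negative,
# i.e. the model fits worse than simply predicting the mean.
```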
We next show the results using UDPipe 1.2 and 2.0. While the values for training size on its own and with L_test are similar, the high η² for training size with DUG is no longer observed. This seems to be due to specious results borne of serendipitous splits for the smaller sample from CoNLL 2018.
We then tested this same procedure using different seeds to shuffle the cross-validation splits. The results are almost exclusively negative, i.e. the linear models failed to fit the data at all. This further highlights an issue with using this methodology when the sample size is small, as the random split can have a large impact on the statistical metrics.

Extending the analysis
As the linear models performed so poorly, we measured the correlation coefficients (Spearman's ρ) for each of the variables with respect to LAS, and also the potential covariants with respect to DUG. These are reported in Table 2 and visualised in Figures 5 and 6 in the Appendix for the CoNLL 2018 data and the UDPipe 2.0 data. Interestingly, DUG has the highest p-value for all systems, far from statistical significance. However, DUG appears to be strongly correlated with both covariants, especially L_test, with ρ > 0.9 and p < 0.001 for all datasets and systems. Also of note is that training data size is convincingly correlated with LAS, but based on the linear models it doesn't appear to be predictive of parser performance. Based on this and on the visualisation of the data in Figures 5 and 6 in the Appendix (as well as visualisations of training size vs. LAS in the literature, see §2), it seems clear that the relation between these variables is not linear but logarithmic. We show LAS against training data size on a logarithmic scale in Figure 4 in the Appendix.
Table 3 shows the results of the limited linear model and cross-validation technique using 10 different seeds, as above, and using log training size. With this change, the explained variances are all positive and relatively high, that is, the models manage to fit the data, unlike in the original setup. This one change offsets the failure of the linear model technique. However, it also suggests that DUG is not a useful feature: training size with L_test outperforms training size with DUG for all datasets except CoNLL18, and the models which use all features are worse than just using training data size and L_test, with the CoNLL18 model resulting in a negative explained variance, again meaning the fit failed. For CoNLL18, training data size with DUG does outperform the model using L_test.
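
Both points are easy to illustrate on synthetic data with a logarithmic size-LAS relation (a stand-in for the real measurements): Spearman's ρ is rank-based and therefore unchanged by the log transform, whereas the linear model's fit improves dramatically with log(size):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
n = 90                                            # ~ number of treebanks
size = rng.integers(500, 100_000, size=n).astype(float)
las = 40 + 5 * np.log(size) + rng.normal(scale=3, size=n)

# Rank-based, so identical whether we use size or log(size):
print(spearmanr(size, las))

# The transform matters for the linear model: compare in-sample
# explained variance with the raw vs. log feature.
for label, feat in [("raw size", size), ("log size", np.log(size))]:
    A = np.c_[np.ones(n), feat]
    beta = np.linalg.lstsq(A, las, rcond=None)[0]
    resid = las - A @ beta
    print(f"{label}: explained variance = {1 - resid.var() / las.var():.2f}")
```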

Sentence length binning
We analyse the relation between test sentence length and DUG by binning the data with respect to sentence length. This entails taking each sentence of length l for each treebank, in both the training and test data, and calculating DUG and the corresponding LAS based on these subsets. Figure 1 shows some of these bins (for sentences of length 5, 12, and 21 tokens) for UDPipe 2.0. A full visualisation of each bin from 3 to 30 tokens is shown in Figure 7 in the Appendix.
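
A sketch of the binning itself follows, reusing the dug helper from the earlier sketch (the per-bin LAS computation is analogous and omitted):

```python
from collections import defaultdict

def bin_by_length(trees):
    """Group head lists by sentence length."""
    bins = defaultdict(list)
    for heads in trees:
        bins[len(heads)].append(heads)
    return bins

def binned_dug(train_trees, test_trees, lengths=range(3, 31)):
    """DUG computed separately for each sentence-length bin."""
    train_bins = bin_by_length(train_trees)
    test_bins = bin_by_length(test_trees)
    return {l: dug(train_bins[l], test_bins[l])
            for l in lengths if train_bins[l] and test_bins[l]}
```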
DUG is almost exclusively 1.0 for shorter sentences, as can be seen in Figure 1 for sentence length 5. The number of possible directed trees for sentences with fewer tokens is too small for there not to be crossover: there are only 9 possible unlabelled trees for sentences of length 5 (Sloane, 1996). Conversely, for longer sentences, DUG is almost exclusively 0.0, as the number of possible tree structures is considerable (35,221,832 for sentences of length 21).
For a small subset of sentence lengths, from 9 to 14, there is a meaningful spread of values for DUG, with a broadly linear relation with respect to LAS. Based on this result, i.e. that only certain sentence lengths are suitable for using DUG, we considered a focused version of DUG, i.e. a variant calculated considering only sentences of length 9 to 14 in the training and test data. We then analysed how this measurement correlates with parser performance. Table 4 shows the correlations for focused DUG with respect to LAS, training size, and L_test. While the correlation between focused DUG and LAS is much higher than for DUG and LAS, this is due to the focused version being much more strongly correlated with training size (ρ = 0.91 with a p-value less than 0.001 for both datasets), and its correlation with L_test is much diminished. Also, this focused version of DUG improves performance for the linear model when used only with training data size, but L_test improves it much more. Using all three is again worse than just using training data size with L_test; however, focused DUG doesn't lower the performance as much as the full variant does.

Controlling covariants
Having established that DUG does not improve linear models predicting LAS and that DUG is strongly correlated with training treebank size and L_test, we attempted to find a signal by removing the background signals associated with these variables. We applied a linear fit to the training data size and LAS and then divided the LAS scores by the predicted values of that fit. Then we applied a linear fit to L_test and these normalised values and again divided these values out. Finally, we evaluated these doubly normalised values against DUG. This process is shown in Figure 2 for UDPipe 2.0, and the resulting coefficients for UDPipe 1.2 and 2.0 are in Table 7 of the Appendix. Removing the signals of the covariants results in a linear fit against DUG with effectively zero gradient and a coefficient of 0.01 (p=0.926). Removing the variance associated with these covariants thus effectively removes any signal associated with DUG.
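
The procedure can be sketched as follows, on synthetic data constructed so that DUG relates to LAS only through the covariants (mirroring the shape of the real results, not reproducing them):

```python
import numpy as np
from scipy.stats import spearmanr

def remove_background(y, x):
    """Divide y by a linear fit of y on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y / (slope * x + intercept)

rng = np.random.default_rng(2)
n = 90
size = rng.integers(500, 100_000, size=n).astype(float)
l_test = rng.uniform(10, 25, size=n)
dug_vals = np.clip(1.4 - 0.05 * l_test + rng.normal(scale=0.05, size=n), 0, 1)
las = 40 + 5 * np.log(size) - 0.4 * l_test + rng.normal(scale=2, size=n)

print(spearmanr(dug_vals, las))          # correlation driven by L_test
las_bcg = remove_background(las, np.log(size))
las_bcg = remove_background(las_bcg, l_test)
print(spearmanr(dug_vals, las_bcg))      # ~0 once both backgrounds are gone
```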
To corroborate this background-subtraction analysis, we also report the partial coefficients in Table 5, which show only a weak signal for the UDPipe systems. CoNLL18 has a stronger signal, but it is negative (the opposite of the relation one would expect) and has a large p-value.

Table 5: Partial Spearman's ρ for DUG with covariants.

Figure 2: Visualisation of removing the background signal associated with the covariants: the log of training size (log(Size)) and mean test sentence length L_test. The Spearman's ρ for DUG and LAS is -0.18 (p=0.083); for DUG and LAS/bcg_size it is -0.40 (p<0.001), compared with 0.465 (p<0.001) for L_test and LAS/bcg_size; and for DUG and LAS/(bcg_size·bcg_Ltest) it is 0.01 (p=0.926).
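
For reference, a partial Spearman coefficient can be computed manually by rank-transforming all variables, regressing the ranks of DUG and LAS on the ranks of the covariants, and correlating the residuals (libraries such as pingouin provide this directly); a minimal sketch:

```python
import numpy as np
from scipy.stats import pearsonr, rankdata

def partial_spearman(x, y, covars):
    """Spearman correlation of x and y controlling for covars.
    The p-value is approximate (no df correction for the covariates)."""
    design = np.c_[np.ones(len(x)),
                   np.column_stack([rankdata(c) for c in covars])]

    def residual(v):
        rv = rankdata(v)
        beta = np.linalg.lstsq(design, rv, rcond=None)[0]
        return rv - design @ beta

    return pearsonr(residual(x), residual(y))

# e.g. partial_spearman(dug_vals, las, [np.log(size), l_test])
```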

Controlled experiment -fixing covariants
We also evaluated DUG's relation to LAS in a controlled experiment where we sampled subsets of treebanks, keeping the training data size constant as well as the sentence length of both training and test data. We trained UDPipe 1.2 models (UDPipe 2.0 is not available beyond using pre-existing models) using standard settings. We were limited to 9 treebanks, as we required a reasonable amount of data, and using only one sentence length reduces the number of usable treebanks. We combined all of the data for treebanks which had over 1,200 sentences of length 12. We then created splits such that a single 1000-sentence training set was created by randomly sampling sentences, and a number of 200-sentence test sets were created, generating as many splits as the data allowed for a given treebank. In this way we varied DUG indirectly, but by sampling from different treebanks we obtained values spanning a reasonable range (0.6 to 0.9). This results in a Spearman's ρ of 0.82 (p<0.001), visualised in Figure 3 in the Appendix. So in this rigid context, we do observe a very strong correlation between DUG and LAS, echoing the analysis from the sentence-length binning procedure.
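
The sampling scheme can be sketched as follows (pool is a hypothetical list of the length-12 sentences available for one treebank; the names are ours):

```python
import random

def make_splits(pool, train_n=1000, test_n=200, seed=42):
    """One fixed-size training set plus as many disjoint fixed-size
    test sets as the remaining data allows."""
    rng = random.Random(seed)
    shuffled = pool[:]
    rng.shuffle(shuffled)
    train = shuffled[:train_n]
    rest = shuffled[train_n:]
    tests = [rest[i:i + test_n]
             for i in range(0, len(rest) - test_n + 1, test_n)]
    return train, tests
```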

Conclusion
With this case study we have shown the value of replicating analyses in NLP. Our analysis has shown that the original results were unreliable, and it has highlighted methodological issues in the original analysis. The methodological points made here will likely be useful for statistical analyses in other areas of NLP: the need to visualise and evaluate correlations before considering linear regression techniques; the potential sensitivity to sentence length of measurements used in NLP statistical analyses; the need to control for all covariants and to evaluate their impact using partial coefficients at the very least; and, finally, that controlled experiments can help better evaluate the impact of specific measurements and can complement statistical analyses.

Appendix

The appendix mainly consists of visualisations corresponding to the statistical analyses described in the main body. Some additional information is given to supplement the main analyses in Tables 6 and 7, which give the correlations for the focused DUG analysis and the background-removal process, respectively. Figure 4 shows the logarithmic relation between LAS and the training data size for UDPipe 2.0 and UD v2.7. Figure 5 gives the visualisations for the data used in the original paper and Figure 6 gives the corresponding visualisations for UDPipe 2.0 and UD v2.7. Figure 7 expands the example plots shown in Figure 1, which only showed extreme cases, showing LAS versus DUG for every sentence-length bin from 3 to 30 tokens. This clearly shows the issue with DUG discussed in the main body.

All the data used for the analyses presented in this paper can be found in the supplementary material associated with the paper.

Figure 4: LAS with respect to training set size, on a logarithmic scale, for UDPipe 2.0 and UD v2.7.

Table 6: Partial Spearman's ρ for focused DUG (i.e. using only the measurement for sentences of length 9 to 14) with covariants.

Table 7: Correlation of DUG with LAS, and then with LAS with the background associated with size and length (L) removed. The isolated row shows the correlation of mean test sentence length with LAS after the size background is removed.