Strength in Numbers: Averaging and Clustering Effects in Mixture of Experts for Graph-Based Dependency Parsing

We review two features of mixture of experts (MoE) models, which we call averaging and clustering effects, in the context of graph-based dependency parsers learned in a supervised probabilistic framework. Averaging corresponds to the ensemble combination of parsers and is responsible for variance reduction, which helps stabilize and improve parsing accuracy. Clustering describes the capacity of MoE models to give more credit to experts believed to be more accurate on a given input. Although promising, this is difficult to achieve, especially without additional data. We design an experimental set-up to study the impact of these effects. Whereas averaging is always beneficial, clustering requires good initialization and stabilization techniques, and its advantages over mere averaging seem to eventually vanish when enough experts are present. As a by-product, we show how this leads to state-of-the-art results on the PTB and the CoNLL09 Chinese treebank, with low variance across experiments.


Introduction
Combinations of elementary parsers are known to improve accuracy. Sometimes called joint systems, they often use different representations, e.g. lexicalized constituents and dependencies (Rush et al., 2010; Green and Žabokrtský, 2012; Le Roux et al., 2019; Zhou et al., 2020). These approaches have been devised to combine the strengths, and overcome the weaknesses, of elementary systems.
In this work, however, we follow another line of research consisting of mixtures and products of similar experts (Jacobs et al., 1991; Brown and Hinton, 2001), instantiated for parsing in (Petrov et al., 2006; Petrov, 2010) and especially appealing when individual experts have high variance, typically when training involves neural networks. Indeed, Petrov (2010) used products of experts trained via Expectation-Maximization (a non-convex minimization) converging to local minima.
In this work we propose to study the combination of parsers, from a probabilistic point of view, as a mixture model, i.e. a learnable convex interpolation of probabilities. This has previously been studied in (Petrov et al., 2006) for PCFGs with the goal of overcoming locality assumptions, and we want to see if neural graph-based dependency parsers, with non-Markovian feature extractors, can also benefit from this framework. It has several advantages: it is conceptually simple and easy to implement, it is not restricted to projective dependency parsing (although we only experiment with this case), and while time and space complexity increase with the number of systems, this is hardly a problem in practice thanks to GPU parallelization.
Simple averaging models, or ensembles, can also be framed as mixture models where mixture coefficients are equal. We are able to quantify the variance reduction, both theoretically and empirically, and show that this simple model of graph-based parser combination performs better on average, and achieves higher accuracy, than single systems.
While the full mixture model is appealing, since it could in principle both decrease variance and find the optimal interpolation weights to better combine parser predictions, the non-convexity of the learning objective is a major issue which, added to the non-convexity of the potential functions, can prevent parameterization from converging to a good solution. Because it tries to specialize parsers to specific inputs, clustering does not decrease variance. More importantly, experiments indicate that useful data, that is, data with an effect on parameterization, becomes too scarce to train the clustering device.
Another drawback of finite mixture models is that inference, i.e. finding the optimal tree, becomes intractable. We tackle this issue using an alternative objective, similar to Minimum Bayes Risk (Goel and Byrne, 2000) and PCFG-LA combination (Petrov, 2010), for which decoding is exact.
Our contributions can be summarized as follows:
• We frame dependency parser combinations as finite mixture models (§2) and discuss two properties: averaging and clustering. We derive an efficient decoder (LMBR) merging predictions at the arc level (§3).
• When isolating the averaging effect, we show that the resulting systems exhibit an empirical variance reduction which corroborates theoretical predictions, and are more accurate (§4).
• We study the causes of instability in mixture learning, outline why simple regularization is unhelpful, and give an EM-inspired learning method preventing detrimental over-specialization (§5). Still, improvement over mere averaging is difficult to achieve.
• These methods obtain state-of-the-art results on two standard datasets, the PTB and the CoNLL09 Chinese dataset (§6), with low variance making them robust to initial conditions.

Notations
We write a sentence as x = x_0, x_1, ..., x_n, with x_0 a dummy root symbol, x_i the i-th word otherwise, and n the number of words. For h, d ∈ [n] with [n] = {0, ..., n}, (h, d) is the directed arc from head x_h to dependent x_d. We denote the set of all parse trees (arborescences) for x as Y(x) and the elements of this set as y ∈ Y(x), with (h, d) ∈ y if (h, d) is an arc in y. L stands for the set of arc labels. The vector of arc labels in tree y is denoted l(y) ∈ L^n. We write l(y)_hd for the label of arc (h, d) in y, or l_hd when y is clear from context.

Parsers as Experts
Experts can be any probabilistic graph-based dependency parsers, provided that we can efficiently compute the energy of a parse tree, the global energy of a sentence (the sum of all exponentiated parse tree energies, called the partition function) and the marginal probability of an arc in a sentence. In the remainder we focus on projective first- and second-order parsers, where these quantities are computed via tabular methods or backpropagation 1 .
Tree structure For a graph-based dependency parser, the tree probability is defined as:

p(y|x) = exp(s(x, y)) / Z(x),    Z(x) = Σ_{y'∈Y(x)} exp(s(x, y'))

with s(x, y) the tree energy scoring the correctness of y for x, and Z(x) the partition function.
In first-order models (Eisner, 1996), tree scores are sums of arc scores:

s(x, y) = Σ_{(h,d)∈y} s(h, d)

Eisner (1997) generalizes scores to the second order by considering pairs of adjacent siblings:

s(x, y) = Σ_{(h,d)∈y} s(h, d) + Σ_{(h,d),(h,s)∈y, d,s adjacent siblings} s(h, d, s)

For projective first- or second-order models, Z(x) and p(y|x) are efficiently calculated (Zhang et al., 2020b). Moreover, the marginal arc probability p((h, d)|x) can be efficiently calculated from the partition function by applying backpropagation from log Z(x) to s(h, d), see (Eisner, 2016; Zmigrod et al., 2020; Zhang et al., 2020a):

p((h, d)|x) = ∂ log Z(x) / ∂ s(h, d)

Tree Labelling The labelling model is also a Boltzmann distribution:

p(l|(h, d), x) = exp(s(l, h, d)) / Σ_{l'∈L} exp(s(l', h, d))

where s(l, h, d) is the score for label l on arc (h, d).
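For intuition, these quantities can be checked by brute force on a toy example. The following sketch (ours, not the implementation of the parsers cited above) enumerates all arborescences over a 3-word sentence instead of using Eisner's projective algorithm, computes the partition function and arc marginals directly, and verifies that each dependent's head marginals sum to one; arc scores are arbitrary illustrative values.

```python
import itertools, math

def is_tree(heads):
    """heads[d-1] = head of word d; check every word reaches the root 0."""
    for d in range(1, len(heads) + 1):
        seen, cur = set(), d
        while cur != 0:
            if cur in seen:          # cycle: not an arborescence
                return False
            seen.add(cur)
            cur = heads[cur - 1]
    return True

def tree_distribution(score, n):
    """All arborescences over words 1..n rooted at 0, under a Boltzmann
    distribution whose tree energy is the sum of arc scores."""
    trees = [h for h in itertools.product(
                 *[[x for x in range(n + 1) if x != d] for d in range(1, n + 1)])
             if is_tree(h)]
    energies = [sum(score[(h[d - 1], d)] for d in range(1, n + 1)) for h in trees]
    Z = sum(math.exp(e) for e in energies)            # partition function
    return [(math.exp(e) / Z, h) for e, h in zip(energies, trees)]

# toy 3-word sentence with arbitrary arc scores
n = 3
score = {(h, d): 0.1 * h - 0.3 * d
         for h in range(n + 1) for d in range(1, n + 1) if h != d}
dist = tree_distribution(score, n)

# marginal arc probability: sum of p(y|x) over trees containing the arc
marg = {(h, d): sum(p for p, heads in dist if heads[d - 1] == h)
        for h in range(n + 1) for d in range(1, n + 1) if h != d}

# each dependent has exactly one head, so its head marginals sum to 1
for d in range(1, n + 1):
    assert abs(sum(marg[(h, d)] for h in range(n + 1) if h != d) - 1.0) < 1e-9
```

In a real parser the same marginals are obtained in polynomial time, via the inside algorithm and backpropagation, rather than by enumeration.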
Following (Dozat and Manning, 2017; Zhang et al., 2020a), label predictions are independent:

p(l(y)|y, x) = Π_{(h,d)∈y} p(l_hd|(h, d), x)

Parse Probability Given the structure y and its labelling l(y), the parse probability is:

p(y, l(y)|x) = p(y|x) · p(l(y)|y, x)    (2)

Learning Potential functions s can be implemented by feed-forward neural networks or biaffine functions (Dozat and Manning, 2017), and parameterized by maximizing the log-likelihood.

Mixture and Averaging
For arborescence probabilities, a finite mixture model (MoE) is a weighted sum of the probabilities given by all experts:

p(y|x) = Σ_{k=1}^{K} ω_k(x) p_k(y|x)

where the mixture weights satisfy ∀x, ω_k(x) ≥ 0 and Σ_{k=1}^{K} ω_k(x) = 1, and can be adjusted by a gating network (Jacobs et al., 1991). We can interpret ω as a device whose role is to cluster inputs into K categories and assign each category to an expert.
By forcing ω_k(x) = 1/K, ∀x, we obtain a simpler averaging model, sometimes called an ensemble:

p(y|x) = (1/K) Σ_{k=1}^{K} p_k(y|x)

Note that MoEs combine elementary probabilities, not tree scores: each expert energy is first normalized before the combination.
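The distinction between combining probabilities and combining raw energies can be seen in a few lines. This toy sketch (illustrative numbers, two hypothetical experts over three candidate trees) shows that averaging normalized probabilities genuinely differs from normalizing averaged energies: a single over-confident expert dominates the score average more than the probability average.

```python
import math

def normalize(scores):
    """Boltzmann normalization of raw tree energies."""
    Z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / Z for s in scores]

# two experts' tree energies over the same (tiny) set of candidate trees
expert1 = [2.0, 1.0, 0.0]
expert2 = [0.0, 1.0, 4.0]

# MoE averaging: normalize each expert first, then average probabilities
p1, p2 = normalize(expert1), normalize(expert2)
mixture = [(a + b) / 2 for a, b in zip(p1, p2)]

# naive alternative: average raw energies, normalize once
score_avg = normalize([(a + b) / 2 for a, b in zip(expert1, expert2)])

assert abs(sum(mixture) - 1.0) < 1e-9
# the two combinations differ on these inputs
assert any(abs(m - s) > 1e-3 for m, s in zip(mixture, score_avg))
```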
A similar mixture is applied to labelling, i.e.:

p(l|(h, d), x) = Σ_{k=1}^{K} ω^l_k(x) p_k(l|(h, d), x)

Decoding with a Mixture Model
Learning MoEs will be covered in Section 5; we first turn to the problem of finding an appropriate tree, for instance the most probable parse tree:

y* = argmax_{y∈Y(x)} Σ_{k=1}^{K} ω_k(x) p_k(y|x)

This maximization is difficult, even in the absence of labels, since this is no longer a log-linear function of the arc scores: y* cannot be searched for in log-space among unnormalized arc scores.

MBR Decoding
In this case, a more attractive alternative is Minimum Bayes Risk (MBR) decoding (Smith and Smith, 2007), because it decomposes error in a way similar to the metrics used in dependency parsing (UAS/LAS) and is tractable. MBR requires the marginal arc probabilities, which are the weighted sums of the elementary marginals:

p((h, d)|x) = Σ_{k=1}^{K} ω_k(x) p_k((h, d)|x)

The intuition behind MBR is that instead of maximizing the probability of the parse tree, we try to minimize the risk of choosing wrong arcs, i.e. to maximize the arc marginals in the parse tree:

y* = argmax_{y∈Y(x)} Σ_{(h,d)∈y} log p((h, d)|x)

Once the marginal log-probabilities are computed, the Eisner algorithm (Eisner, 1996, 1997) or Chu-Liu-Edmonds (McDonald et al., 2005) can be applied to solve MBR.
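A minimal illustration of why MBR can disagree with picking the single most probable tree, using a hand-built toy distribution over 2-word head assignments rather than a trained parser: here the MBR tree is not the MAP tree, because two lower-probability trees share arcs and pool their marginal mass.

```python
import math

# toy distribution over head assignments for a 2-word sentence:
# heads[d-1] is the head of word d; probabilities sum to 1
dist = [(0.40, (0, 1)),   # x1 <- root, x2 <- x1
        (0.35, (2, 0)),   # x1 <- x2,  x2 <- root
        (0.25, (0, 0))]   # both words attach to root

# arc marginals p((h,d)|x)
marg = {}
for p, heads in dist:
    for d, h in enumerate(heads, start=1):
        marg[(h, d)] = marg.get((h, d), 0.0) + p

# MBR: tree maximizing the sum of log arc marginals
mbr = max(dist, key=lambda t: sum(math.log(marg[(h, d)])
                                  for d, h in enumerate(t[1], start=1)))[1]
# MAP: single most probable tree
viterbi = max(dist)[1]

assert mbr != viterbi   # MBR prefers arcs shared by many likely trees
```

In practice the argmax over trees is of course computed with Eisner or Chu-Liu-Edmonds over the log marginals, not by enumeration.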

MBR Decoding with Labels
In many dependency parsing models, decoding of arcs and labels is pipelined, see for instance (Dozat and Manning, 2017; Zhang et al., 2020a; Fossum and Knight, 2009): first arcs are decoded, and then maximization is performed over the labels of the decoded arcs:

y* = argmax_{y∈Y(x)} p(y|x),    l* = argmax_{l∈L^n} p(l|y*, x)

However, solutions found this way are not maximizers of p(l, y|x) as defined in Eq. 2. The problem is that the effect of labelling is not considered during arc decoding: a high-probability arc can get picked even with a low label score.
First we remark that each label in l* is the most probable label for its pair (h, d), denoted by L_hd = argmax_{l∈L} p(l|(h, d), x). Decoding becomes:

y* = argmax_{y∈Y(x)} p(y|x) · Π_{(h,d)∈y} p(L_hd|(h, d), x)

This way l* is deterministic with respect to y*, and (y*, l*) are maximizers of Eq. 2. We denote by L(y) the labelling with l(y)_hd = L_hd, ∀(h, d) ∈ y. This can be combined with MBR without changing the decoding algorithms, and we call this variant LMBR:

y* = argmax_{y∈Y(x)} Σ_{(h,d)∈y} log ( p((h, d)|x) · p(L_hd|(h, d), x) )

i.e. we can apply MBR with arc probabilities reparameterized by label probabilities. Experiments show that LMBR exhibits a small but consistent accuracy increase over MBR.
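The reparameterization can be sketched as follows, again on toy numbers (not from any treebank): multiplying each arc marginal by its best-label probability can change the tree selected by MBR, which is exactly the situation where pipelined decoding picks a high-probability arc with a weak label.

```python
import math

# arc marginals p((h,d)|x) for a 2-word sentence (illustrative numbers)
arc = {(0, 1): 0.65, (2, 1): 0.35, (1, 2): 0.40, (0, 2): 0.60}
# label distributions p(l|(h,d),x); labels are placeholders
label = {(0, 1): {"root": 0.90, "obj": 0.10},
         (2, 1): {"nsubj": 0.60, "obj": 0.40},
         (1, 2): {"obj": 0.95, "nsubj": 0.05},
         (0, 2): {"root": 0.55, "obj": 0.45}}
best = {a: max(d.values()) for a, d in label.items()}   # p(L_hd|(h,d),x)

cand = [(0, 1), (2, 0), (0, 0)]   # candidate head assignments (word 1, word 2)

def total(heads, table):
    return sum(table[(h, d)] for d, h in enumerate(heads, start=1))

# plain MBR uses log arc marginals; LMBR adds the best-label log-probability
mbr_tree = max(cand, key=lambda t: total(t, {a: math.log(p) for a, p in arc.items()}))
lmbr_tree = max(cand, key=lambda t: total(t, {a: math.log(arc[a] * best[a]) for a in arc}))

assert mbr_tree != lmbr_tree   # the label term changes the decoded tree
```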

Averaging and Variance Reduction
In this section we assume all experts are equally weighted. We define the variance of the system on a dataset T as the average, over sentences x ∈ T and arcs (h, d), of the variance of the marginal arc probability p((h, d)|x). We show that the variance of the MoE is smaller than the variance of the experts. We focus on structure prediction p(y|x), but the definitions apply to the labelling model as well. This is an already known result for mixture models in general, but the proof is here instantiated for a mixture of graph-based parsers. Moreover, we recover this result experimentally in Section 6.
Assuming we have a mixture of K elementary systems, we estimate the marginal probability variance with:

σ̂² = 1/(K−1) Σ_{k=1}^{K} (π(k) − π̄)²

with π(k) the probability p_k((h, d)|x) given by the k-th elementary system and π̄ = (1/K) Σ_{k=1}^{K} π(k) its average. Increasing the number of experts in the MoE decreases the variance of the system. To see this, we assume that the marginal probability of a well-trained expert, for a fixed sentence and a fixed arc, is a measurable function f_(h,d),x : R → R of a random seed S_k ∈ R, which represents the fact that p_k is the result of a learning process with many sources of randomization 2 (initialization, stochastic batches, dropout. . . ):

p_k((h, d)|x) = f_(h,d),x(S_k)

with S_k ∈ R a random seed assigned to the k-th expert at the beginning of training, seeds being assumed independent across experts.
Since in practice a pseudo-random generator is used, the value of the marginal probability for a particular sentence and arc is deterministic once the random seed is fixed. Thus, it is sufficient to represent p_k((h, d), x) as a deterministic function of the random seed S_k. Moreover, we only need this function to be measurable.
We can now view f_(h,d),x(S) as a random variable and denote its variance by σ²_(h,d),x. It is the variance of the marginal arc probability given by an expert, for (h, d) given x. For an averaging MoE, the marginal probability becomes:

p((h, d)|x) = (1/K) Σ_{k=1}^{K} f_(h,d),x(S_k)

with K the number of experts in the mixture model. Since the seeds S_k are independent, the variance of the mixture model for a particular sentence and arc is 1/K times the variance of the experts:

Σ_(h,d),x = σ²_(h,d),x / K    (4)

with Σ the variance of the mixture model. In other words, the log-variance of a mixture model decreases linearly with log K, with slope −1:

log Σ_(h,d),x = log σ²_(h,d),x − log K

Experiments in Section 6, Figure 1, show that the estimated log-variance of the averaging system decreases when the number of experts increases and that this relation is close to linear with a slope approaching −1, supporting our independence assumption.
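The 1/K behaviour is easy to reproduce in simulation under the independence assumption above. In this sketch a Gaussian random variable stands in for f_(h,d),x (the distribution is arbitrary; only seed independence matters), and the empirical variance of the K-expert average is compared across values of K.

```python
import random
import statistics

random.seed(0)

def expert_marginal(seed):
    """Stand-in for f_(h,d),x: a seed-dependent noisy marginal estimate."""
    rng = random.Random(seed)
    return min(max(rng.gauss(0.7, 0.05), 0.0), 1.0)

def mixture_variance(K, trials=2000):
    """Empirical variance of the K-expert averaged marginal across re-trainings."""
    means = [statistics.fmean(expert_marginal(random.random()) for _ in range(K))
             for _ in range(trials)]
    return statistics.variance(means)

v1, v4 = mixture_variance(1), mixture_variance(4)
# independent seeds predict Var/K, i.e. log-variance vs log K has slope -1,
# so v1 / v4 should be close to 4 (up to sampling noise)
assert 2.5 < v1 / v4 < 6.0
```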

Training with Clustering
When mixture weights are adjustable, MoE models are able to give more credit to experts believed to perform better on specific inputs. This can be exploited during parameterization. The role of ω is thus to learn how to cluster inputs into K categories, each category being assigned to an expert. 3 For an input sentence x and corresponding tree y, assuming parameterization is performed by maximizing the log-likelihood of the training set via SGD, the objective of mixture model learning with gating network ω can be written as:

L(φ, θ) = Σ_{(x,y)} log Σ_{k=1}^{K} ω_k(x; φ) p_k(y|x; θ_k)    (5)

where φ are the parameters of the gating network and θ_k the parameters of the k-th expert. The partial derivatives with respect to the gating network are:

∂L/∂φ = Σ_{(x,y)} Σ_{k=1}^{K} ( ω_k(x) p_k(y|x) / Σ_j ω_j(x) p_j(y|x) ) ∂ log ω_k(x)/∂φ    (6)

while for the expert parameters we have:

∂L/∂θ_k = Σ_{(x,y)} ( ω_k(x) / Σ_j ω_j(x) p_j(y|x) ) ∂p_k(y|x)/∂θ_k    (7)

We found that optimizing directly with equations (6) and (7) causes degeneration, i.e. one ω_k approaches 1 while the others decrease to almost 0. Indeed, gradient ascent with (6) increases ω_k for an expert k that gives high probability to training samples, while gradient ascent with (7) generates increased gradients, and in turn increased probabilities, for experts with high values of ω_k. The two processes reinforce each other and quickly result in an extreme partition between experts.
One may think that the degeneration problem can be alleviated with a smoothing prior or regularization. In practice, we tried an entropy regularizer to push ω_k towards a uniform distribution. We found that a heavy entropy penalization is required to avoid the degeneration problem, which makes ω_k too uniform to be an accurate clustering device.
Avoid Extreme Partition Thus, to alleviate the degeneration problem without forcing a strong smoothing constraint, we propose to modify Eq. (6) into:

∂L̃/∂φ = Σ_{(x,y)} Σ_{k=1}^{K} ( p_k(y|x) / Σ_j p_j(y|x) ) ∂ log ω_k(x)/∂φ    (8)

i.e. we force the weight update to be proportional to the relative probability. The advantage of Eq. (8) is that gradients are weighted by a more objective quantity, independent of the current gating weights. For an example where the p_k(x) are close to uniform, we can benefit from the averaging effect, while for an example which shows a strong preference for a particular expert, we can learn partition coefficients proportional to its correctness.

Stabilize Training Neuron dropout (Srivastava et al., 2014) is a common technique to avoid overfitting which unfortunately proved difficult in this setting. The problem is that s_k(x, y) gives very different results with or without dropout, which reflects on p_k(y|x), causing drastic changes from one evaluation to the next. To mitigate this problem, we use probabilities computed without dropout (noted p̃_k(θ)) to calculate the weighting coefficients of the gradient.
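Eq. (8) effectively weights the gating gradient by the relative probabilities r_k = p_k(y|x) / Σ_j p_j(y|x). Assuming a softmax-parameterized gating (a simplification of the gating network of Appendix C), the gradient with respect to the gating logits reduces to r_k − ω_k, which vanishes exactly when ω matches the relative probabilities; a minimal check on toy likelihood values:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# expert likelihoods p_k(y|x) for one training example (toy values)
p = [0.30, 0.32, 0.29, 0.31]
r = [v / sum(p) for v in p]          # relative probabilities r_k

# gating logits c parameterize omega = softmax(c); the responsibility-weighted
# objective sum_k r_k log omega_k has gradient r_k - omega_k w.r.t. c_k
def grad(c):
    w = softmax(c)
    return [rk - wk for rk, wk in zip(r, w)]

# at omega = r the gradient vanishes: near-uniform experts keep
# near-uniform weights instead of collapsing onto a single expert
c_star = [math.log(v) for v in r]
assert all(abs(g) < 1e-9 for g in grad(c_star))
```

This is why the update avoids the rich-get-richer loop of Eq. (6): the weighting no longer depends on the current ω.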
The final optimization process can be separated into two alternating parts, (i) optimization of the gating parameters:

φ ← φ + η Σ_{(x,y)} Σ_{k=1}^{K} ( p̃_k(y|x) / Σ_j p̃_j(y|x) ) ∂ log ω_k(x)/∂φ

and (ii) optimization of the experts:

θ_k ← θ_k + η Σ_{(x,y)} ( ω_k(x) / Σ_j ω_j(x) p̃_j(y|x) ) ∂p_k(y|x)/∂θ_k

In practice, this permitted reaching a lower loss value after training.

Experiments
Data We run experiments on two datasets for projective dependency parsing: the English Penn Treebank (PTB) with Stanford Dependencies (Marcus et al., 1993) and the CoNLL09 Chinese data (Hajič et al., 2009). We use the standard train/dev/test splits and evaluate with the UAS/LAS metrics. As is customary, punctuation is ignored in PTB evaluation.
Experts We run tests with first-order (FOP) and second-order (SOP) parsers as mixture model experts, using re-implemented versions of the CRF and CRF2o parsers of Zhang et al. (2020a). 4 For decoding, we use the LMBR decoding presented in Section 3.2, which yields a small but consistent improvement over pipelined MBR decoding.
For each input word, these systems use 3 embeddings: the first is a fixed pretrained vector 5 , the second is trainable and looked up in a table, and the third is computed by a character-level BiLSTM (CharLSTM). The first two embeddings are summed and concatenated with the character-level embedding. For FOP and SOP, contextual lexical features are the output of 3-layer BiLSTMs applied to the word embedding sequences. Arc scoring is then similar to (Dozat and Manning, 2017): lexical features are transformed for head or modifier roles by two feed-forward networks and combined to score arcs via a biaffine transformation.
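The head/modifier projection and biaffine combination can be sketched in a few lines of plain Python (dimensions, random projections and the omission of bias terms are illustrative simplifications, not the configuration of our parsers):

```python
import random

random.seed(0)
n, d_lstm, d_arc = 3, 6, 4   # toy sizes: 3 words, BiLSTM and arc dims

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rand_mat(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

# contextual BiLSTM features for positions 0..n (0 = dummy root)
h = [[random.gauss(0, 1) for _ in range(d_lstm)] for _ in range(n + 1)]

W_head, W_dep = rand_mat(d_arc, d_lstm), rand_mat(d_arc, d_lstm)
U = rand_mat(d_arc, d_arc)                     # biaffine weight (biases omitted)

head = [matvec(W_head, hi) for hi in h]        # head-role projections
dep = [matvec(W_dep, hi) for hi in h]          # modifier-role projections

# s(h, d) = head_h^T U dep_d for every candidate arc
score = {(i, j): sum(head[i][a] * U[a][b] * dep[j][b]
                     for a in range(d_arc) for b in range(d_arc))
         for i in range(n + 1) for j in range(1, n + 1) if i != j}

assert len(score) == n * n   # n candidate heads for each of the n words
```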
On PTB, in order to compare with recent parsing results, we set up BFOP and BSOP (B for BERT), variants of the FOP and SOP settings: we follow Fonseca and Martins (2020) and concatenate an additional BERT embedding (Devlin et al., 2019) (the average of the last 4 layers of the bert-base-uncased model) to the embedding vector fed to the BiLSTM layers.
Gating (mixture weights ω) is implemented by a K-class softmax over a feed-forward network whose input is the concatenation of the initial and final contextual lexical feature vectors returned by the 3-layer BiLSTM. Hyper-parameters are set as in Zhang et al. (2020a), except that the learning rate is decreased to 10^−4 and the patience (the maximum number of epochs without LAS increase on the development set) is set to 20.
We train 12 independent models for each expert type, with random seeds set to the system time.

Averaging Effect Analysis
The experimental procedure is shown in Experimental Setup 1, with M 1 , . . . , M 12 denoting the trained experts, K number of experts in the mixture model and r the number of repetitions.
Models: M_1, ..., M_12; Initialization: K, r;
repeat r times
  1. Shuffle the order of M_1, ..., M_12;
  2. Combine sequentially every K models together, creating 12/K mixture averaging models;
  3. Compute UAS and LAS of the models;
  4. Calculate the system variance of the models;
end
Experimental Setup 1: Averaging Effect

We set K from 1 to 6, with r always set to 5. We show results for PTB and CoNLL09 Chinese on dev data, for each type of mixture of experts and each number of experts, in Table 1 and Table 3. For UAS and LAS, each entry reports average, min and max, where average is the average score over all trials in this setting and max (resp. min) is the highest (resp. lowest) score obtained by an experiment in this setting. We also give the standard deviation std as a way to see the effects of variance reduction.
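Steps 1 and 2 of Experimental Setup 1 amount to the following partition procedure (a sketch; the model names are placeholders):

```python
import random

def averaging_trials(models, K, r, seed=0):
    """Repeatedly shuffle the trained experts and partition them into
    disjoint K-expert averaging ensembles (Experimental Setup 1)."""
    rng = random.Random(seed)
    ensembles = []
    for _ in range(r):
        pool = list(models)
        rng.shuffle(pool)
        # combine sequentially: len(models) // K disjoint mixtures per repetition
        ensembles.extend(pool[i:i + K]
                         for i in range(0, K * (len(pool) // K), K))
    return ensembles

groups = averaging_trials([f"M{i}" for i in range(1, 13)], K=4, r=5)
assert len(groups) == 5 * (12 // 4)
assert all(len(g) == 4 for g in groups)
```

Each group is then evaluated as one averaging MoE, and the scores across groups and repetitions give the averages and variances reported in the tables.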
Finally, the last row gives the average relative error reduction (R.E.R.) from single-expert mode (K = 1) to ensemble mode with K = 6.

Clustering Effect Analysis
We conduct the clustering effect analysis on the mixture model with 6 experts. Preliminary experiments showed that, as in most non-convex problems, good initialization is very important. For that reason we use already trained experts as starting points 6 , although the mixture could benefit from more diversely trained experts. We leave this for future work. The procedure is described in Experimental Setup 2, and the whole procedure is repeated 5 times to compute average performance.
Models: M_1, ..., M_12; Initialization: K = 6;
repeat r times
  1. Select randomly K models, creating a mixture model;
  2. Fine-tune the mixture model with the gating network;
  3. Compute UAS and LAS of the mixture model after fine-tuning;
end
Experimental Setup 2: Clustering Effect

Scores on the development set before and after fine-tuning are shown in Table 4. Note that, because shuffling might give different candidate sets than in the averaging experiments, UAS and LAS results are not exactly the same as the K = 6 results in Table 1, Table 2 and Table 3.

Discussion
Averaging Tables 1 to 3 show that UAS and LAS generally increase on average with the number of models in the mixture, and that the ensemble often performs better on average than the best single system in each category (notable exceptions: UAS for FOP and the models with BERT on PTB).
Averaging generally decreases the standard deviation, which is evident for (B)FOP. For (B)SOP the decreasing trend is less clear. However, we still find that the smallest standard deviation is usually obtained with a high number of experts (K = 5, 6). We remark that on PTB similar performance on dev was achieved by FOP and SOP, with a slightly better UAS for SOP, as expected from the capacity of the model to better represent structures. This corroborates the findings of (Falenska and Kuhn, 2019), but contradicts the results on CoNLL09, where SOP always gives the best results, in line with the observations of Fonseca and Martins (2020). For BERT experiments on PTB, BSOP achieves better performance than BFOP with one or two experts. However, when the number of experts increases, BFOP outperforms BSOP.
We complement our discussion with Figure 1, 7 which depicts variance reduction by number of experts in log-scale: almost linear for all models, as predicted by our independence assumption.
We note that UAS and LAS improve little or not at all from K = 5 to K = 6. This is in accordance with the variance analysis: the decrease in variance becomes smaller as the number of experts grows. Indeed, applying Eq. (4), the decrease in variance from K = 1 to K = 2 is σ²/2, while from K = 5 to K = 6 it is only σ²/30.

Clustering We found that a modest improvement in UAS and LAS (0.01%-0.06% absolute) can be achieved by clustering (except for FOP on CoNLL09 Chinese). Average performance generally benefits from clustering, while a tiny decrease (0.01%) is observed for BFOP on PTB.
Since FOP, SOP, BFOP and BSOP are all strong learners on PTB and CoNLL09 Chinese, i.e. UAS and LAS approach 99% on the training data for all models, we can assume that an expert belonging to any of these models can learn efficiently most of the training data, as opposed to just a portion of it. Thus, only a few training instances can be significantly better covered by clustering. Moreover, as averaging already achieves a considerable improvement (around 0.2%-0.6% absolute), a biased ω_k obtained from clustering may harm the gain from averaging.

Results on Test
Tables 5 and 6 show test results on PTB and CoNLL09, with comparisons to recent models. We show test results of SOP and CSOP with 6 experts for PTB and CoNLL09. Additionally, for PTB we show BFOP, CBFOP, BSOP and CBSOP with 6 experts, for comparison with recent BERT-based parsers, often more sophisticated than our approach. We give the results with the same typographical conventions as Zhang et al. (2020a). Please note that, while average results keep the same semantics, max and min give the test results of the highest- and lowest-scoring systems (resp.) by LAS on the development set. We note that the results of Zhang et al. (2020a) correspond to our model with K = 1.
For averaging models, we apply significance t-tests (Dror et al., 2018) at level α = 0.05 to FOP, BFOP, SOP and BSOP with K = 6 against K = 1. For PTB and CoNLL09, the p-value is always smaller than 0.005. We note that for parsers without BERT, averaging achieves a considerable improvement with SOP and gives a new SOTA. We also point out that, even though FOP and SOP find equivalently good models on dev, SOP models seem to generalize better. For parsers with BERT, a simple averaging of BSOP achieves performance comparable (or even better, in the case of LAS) to more involved methods such as (Mohammadshahi and Henderson, 2021). It remains to be seen whether they can also benefit from MoEs.
Regarding clustering, even if we obtained an average improvement on dev, test data hardly benefits from it. Still, we note a small improvement of UAS for SOP on CoNLL09. Finally, we stress that the best-performing settings on PTB test, namely BSOP and CBSOP, did not perform better than BFOP and CBFOP on development data on average (although max systems were similar): second-order models seem to handle unseen data slightly better.

Parallel Training and Decoding
Training averaging ensembles can be parallelized given sufficient GPUs, since each expert is trained independently. For fine-tuning with clustering, most of the training could in principle be parallelized as well, although for the sake of simplicity we did not implement such a training procedure: the training time of the clustering model thus increases linearly with the number of experts. As we only need a few epochs for fine-tuning, the overall training time is comparable to training a single expert.
For decoding, calculations are performed in parallel as well. First, marginal probabilities for arcs and labels are computed for every expert in parallel. Then they are combined either as a simple average or as a weighted sum. Finally, we apply the decoding algorithm (LMBR) once over the combined probability. The overhead is thus quite limited: for instance, with K = 6 the overall decoding time is only around 10% higher than with a single expert.

Related Work
Ensembling parsers has shown good results in shared tasks (Che et al., 2018) 8 and was framed as a combination of experts in (Petrov, 2010). In this work we show how this is related to mixtures and distinguish averaging and clustering effects.
The use of mixture model for syntactic parsing was introduced in (Petrov et al., 2006) for PCFG models, where it provided an access to non-local features unreachable to mere PCFGs. However, now that powerful non-Markovian feature extractors (i.e. BiLSTMs or Transformers) are widely used, the expected gain is more difficult to characterize, but we hypothesize that it is related to the softmax bottleneck (Yang et al., 2018) implied by using different exponential models in all predictions, even when richly parameterized.
We modelled parser combinations with finite mixture models, but more sophisticated parsing models (Kim et al., 2019) use infinite mixture models. In this case it might be more difficult to discriminate between averaging and clustering. Our mixture is essentially a latent variable model where the latent variables range over experts. Although inspired by EM with neural networks, similarly to (Nishida and Nakayama, 2020), other methods based on the ELBO and sampling could also be used (Corro and Titov, 2019; Zhu et al., 2020).

8 Ensembling is also widely used in Machine Translation shared tasks, such as WMT.

Conclusion
We framed dependency parser combination as a finite mixture model, showed that this model presents two distinct properties -an averaging effect and a clustering effect- and devised an efficient decoding method. Moreover, we studied the impact of the averaging effect, namely variance reduction during training and, consequently, better accuracy. We investigated the reasons for instability when learning mixture models, and proposed an EM-inspired method to avoid over-specialization. When used for fine-tuning, this method may improve accuracy over averaging. As a by-product, this method gives state-of-the-art results when combined with first-order and second-order projective parsers on two standard datasets.
This work can be expanded in future research: the increase in parameters can be seen as overparameterization, and many parameters must be redundant. A potentially fruitful avenue of research is the investigation of the subnetwork hypothesis, i.e. whether distillation could give a smaller network with similar performance.
A Marginal Probabilities of the Mixture

The marginal arc probability of the mixture is obtained by summing over all trees containing the arc:

p((h, d)|x) = Σ_{y∈Y(x), (h,d)∈y} p(y|x) = Σ_{y∈Y(x), (h,d)∈y} Σ_{k=1}^{K} ω_k(x) p_k(y|x)

By changing the order of summation, we have:

p((h, d)|x) = Σ_{k=1}^{K} ω_k(x) Σ_{y∈Y(x), (h,d)∈y} p_k(y|x)

The inner sum is exactly p_k((h, d)|x). Thus, we have:

p((h, d)|x) = Σ_{k=1}^{K} ω_k(x) p_k((h, d)|x)

B Quick Gradient Analysis of Gating Network
We start from Eq. (6). For a mixture model with well-trained experts, most of the data are equivalent for all experts, which means the p_k(y|x) have similar values across experts. To see quickly why the gradient approaches 0 in this case, we assume further that p_k(y|x) takes the same value p(y|x) for every expert on such equivalent data. Thus, Eq. (6) becomes:

Σ_k ( ω_k(x) p(y|x) / Σ_j ω_j(x) p(y|x) ) ∂ log ω_k(x)/∂φ = Σ_k ω_k(x) ∂ log ω_k(x)/∂φ

With a little more deduction, we have:

Σ_k ω_k(x) ∂ log ω_k(x)/∂φ = Σ_k ∂ω_k(x)/∂φ = ∂/∂φ Σ_k ω_k(x) = 0

As the gradient is continuous w.r.t. the p_k, for data which give similar probabilities on all experts, the gradient approaches zero. Thus, when training with Eq. (6), only a small part of the data, the part which shows a strong preference for particular experts, is used to train the gating network.
When training with Eq. (8), all the data is useful for training the gating network. In fact, the gradient of Eq. (8) becomes zero when:

ω_k(x) = p_k(y|x) / Σ_j p_j(y|x)

Thus, for data which are equivalent for all experts, a uniform weight will be learnt, while for data with a strong preference for particular experts, a biased weight proportional to the probability correctness on each expert can be learnt.

C Gating Network Structure, hyper-parameters of training
The gating network structure is similar to the structure of the parsing model.

Embedding The embedding for word x_i is a concatenation of two parts, the normal word embedding and the CharLSTM embedding. When a pre-trained embedding is available, the first part is the sum of the word embedding computed by the network and the external pretrained embedding:

emb(x_i) = WordEMB(x_i) + PreEMB(x_i)

We suppose that PreEMB has the same size as WordEMB.

BiLSTM The embedding vectors are then passed to 3 layers of BiLSTM, with the output at position i noted h_i.
Coefficient Extractor The coefficient extractor consists of one layer of LSTM (Hochreiter and Schmidhuber, 1997) and one layer of MLP. The last hidden state of the LSTM is passed to the MLP, which compresses the vector to the number of experts in the mixture model. Two groups of coefficient extractors are used to calculate separately the combination weights for arcs and labels. We note the outputs of the MLPs as C ∈ R^K, with:

C_arc = MLP_arc(LSTM_arc(h_0, ..., h_n))
C_label = MLP_label(LSTM_label(h_0, ..., h_n))

The outputs of the MLPs are passed to a softmax to calculate the weight of each expert:

[ω_1, ..., ω_K] = Softmax(C_arc)
[ω^l_1, ..., ω^l_K] = Softmax(C_label)

Model hyper-parameters for fine-tuning are shown in Table 7. We also use Adam (Reddi et al., 2018) for training, with the learning rate set to 2e−4 (10 times smaller than the learning rate used for training the experts). The patience is set to 20 instead of the original value of 100. For fine-tuning, we found that the best score is usually achieved in fewer than 20 epochs and does not increase later.

D Implementation Differences
We implement the CRF and CRF2o models of Zhang et al. (2020a) with two tiny technical differences. The first one is that the CharLSTM (Lample et al., 2016) component in Zhang et al. (2020a) treats the beginning-of-sentence token <bos> (the special token representing the beginning of the sentence) as five separate characters: <, b, o, s, >. Our implementation treats it as one special character for the CharLSTM.
Another difference is that Zhang et al. (2020a) treat the length of every sentence as n + 2 by considering two special tokens <bos> and <eos> (although in practice, only <bos> is added to every sentence). In our implementation, we keep the length of a sentence as the number of words n, because the log-probabilities of arcs and labels only consider the words of the sentence, without special tokens. Thus our batch size can be a little higher than in Zhang et al. (2020a).
One final difference concerns MBR decoding: Zhang et al. (2020a) maximize the sum of marginal arc probabilities, while our implementation of MBR maximizes their product, i.e. the sum of their logarithms.

E Variance Reduction on CoNLL09
We note that the label variances of FOP and SOP on CoNLL09 Chinese are so similar that their curves overlap.