Structural Knowledge Distillation: Tractably Distilling Information for Structured Predictor

Knowledge distillation is a critical technique to transfer knowledge between models, typically from a large model (the teacher) to a smaller one (the student). The objective function of knowledge distillation is typically the cross-entropy between the teacher and the student’s output distributions. However, for structured prediction problems, the output space is exponential in size; therefore, the cross-entropy objective becomes intractable to compute and optimize directly. In this paper, we derive a factorized form of the knowledge distillation objective for structured prediction, which is tractable for many typical choices of the teacher and student models. In particular, we show the tractability and empirical effectiveness of structural knowledge distillation between sequence labeling and dependency parsing models under four different scenarios: 1) the teacher and student share the same factorization form of the output structure scoring function; 2) the student factorization produces more fine-grained substructures than the teacher factorization; 3) the teacher factorization produces more fine-grained substructures than the student factorization; 4) the factorization forms from the teacher and the student are incompatible.


Introduction
Deeper and larger neural networks have led to significant improvement in accuracy in various tasks, but they are also more computationally expensive and unfit for resource-constrained scenarios such as online serving. An interesting and viable solution to this problem is knowledge distillation (KD) (Buciluǎ et al., 2006; Ba and Caruana, 2014; Hinton et al., 2015), which can be used to transfer the knowledge of a large model (the teacher) to a smaller model (the student). In the field of natural language processing (NLP), for example, KD has been successfully applied to compress massive pretrained language models such as BERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020) into much smaller and faster models without significant loss in accuracy (Tang et al., 2019; Sanh et al., 2019; Tsai et al., 2019; Mukherjee and Hassan Awadallah, 2020).
A typical approach to KD is letting the student mimic the teacher model's output probability distributions on the training data by using the cross-entropy objective. For structured prediction problems, however, the output space is exponentially large, making the cross-entropy objective intractable to compute and optimize directly. Take sequence labeling for example. If the size of the label set is L, then there are L^n possible label sequences for a sentence of n words and it is infeasible to compute the cross-entropy by enumerating the label sequences. Previous approaches to structural KD either choose to perform KD on local decisions or substructures instead of on the full output structure, or resort to Top-K approximation of the objective (Kim and Rush, 2016; Kuncoro et al., 2016; Wang et al., 2020a).
In this paper, we derive a factorized form of the structural KD objective based on the fact that almost all the structured prediction models factorize the scoring function of the output structure into scores of substructures. If the student's substructure space is polynomial in size and the teacher's marginal distributions over these substructures can be tractably estimated, then we can tractably compute and optimize the factorized form of the structural KD objective. As will be shown in the paper, many widely used structured prediction models satisfy the assumptions and hence are amenable to tractable KD. In particular, we show the feasibility and empirical effectiveness of structural KD with different combinations of teacher and student models, including those with incompatible factorization forms. We apply this technique to structural KD between sequence labeling and dependency parsing models under four different scenarios.
1. The teacher and student share the same factorization form of the output structure scoring function.
2. The student factorization produces more fine-grained substructures than the teacher factorization.
3. The teacher factorization produces more fine-grained substructures than the student factorization.
4. The factorization forms from the teacher and the student are incompatible.
In all the cases, we empirically show that our structural KD approaches can improve the student models. In the few cases where previous KD approaches are applicable, we show our approaches outperform these previous approaches. With unlabeled data, our approaches can further improve student models' performance. In a zero-shot cross-lingual transfer case, we show that with sufficient unlabeled data, student models trained by our approaches can even outperform the teacher models.

Structured Prediction
Structured prediction aims to predict a structured output such as a sequence, a tree or a graph. In this paper, we focus on structured prediction problems with a discrete output space, which include most of the structured prediction tasks in NLP (e.g., chunking, named entity recognition, and dependency parsing) and many structured prediction tasks in computer vision (e.g., image segmentation). We further assume that the scoring function of the output structure can be factorized into scores of a polynomial number of substructures. Consequently, we can calculate the conditional probability of the output structure y given an input x as follows:

P(y|x) = exp(Score(y, x)) / Z(x) = (1/Z(x)) · ∏_{u∈y} exp(Score(u, x))    (1)

where Y(x) represents all possible output structures given the input x, Score(y, x) = Σ_{u∈y} Score(u, x) is the scoring function that evaluates the quality of the output y, Z(x) = Σ_{y'∈Y(x)} exp(Score(y', x)) is the partition function, and u ∈ y denotes that u is a substructure of y. We define the substructure space U(x) = ∪_{y∈Y(x)} {u | u ∈ y} as the set of substructures of all possible output structures given input x.

Take sequence labeling for example. Given a sentence x, the output space Y(x) contains all possible label sequences of x. In the linear-chain CRF, a popular model for sequence labeling, the scoring function Score(y, x) is computed by summing up all the transition scores and emission scores:

Score(y, x) = Σ_i [S_t((y_{i−1}, y_i), x) + S_e(y_i, x)]

where i ranges over all the positions in sentence x, and the substructure space U(x) contains all possible position-specific labels {y_i} and label pairs {(y_{i−1}, y_i)}.
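To make the factorization concrete, here is a minimal toy sketch of a linear-chain scoring function with hypothetical transition and emission scores; it computes P(y|x) by brute-force enumeration of Y(x), which is only feasible at this toy scale:

```python
import itertools, math

def score(y, trans, emit):
    """Factorized score: sum of per-substructure emission and transition scores."""
    s = sum(emit[i][y[i]] for i in range(len(y)))
    s += sum(trans[y[i-1]][y[i]] for i in range(1, len(y)))
    return s

def prob(y, trans, emit, n_labels):
    """P(y|x) = exp(Score(y, x)) / Z(x), with Z(x) computed by brute-force
    enumeration of the output space (illustration only)."""
    n = len(y)
    Z = sum(math.exp(score(list(yp), trans, emit))
            for yp in itertools.product(range(n_labels), repeat=n))
    return math.exp(score(y, trans, emit)) / Z

trans = [[0.5, -0.2], [0.1, 0.3]]             # hypothetical transition scores
emit = [[1.0, 0.2], [0.3, 0.8], [0.4, 0.6]]   # hypothetical emission scores
total = sum(prob(list(yp), trans, emit, 2)
            for yp in itertools.product(range(2), repeat=3))
assert abs(total - 1.0) < 1e-9                # P(.|x) is a proper distribution
```

The real models in the paper never enumerate Y(x); this sketch only illustrates the factorized definition that the later dynamic programs exploit.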

Knowledge Distillation
Knowledge distillation is a technique that trains a small student model by encouraging it to imitate the output probability distribution of a large teacher model. The typical KD objective function is the cross-entropy between the output distributions predicted by the teacher model and the student model:

L_KD = − Σ_{y∈Y(x)} P_t(y|x) · log P_s(y|x)    (2)

where P_t and P_s are the teacher's and the student's distributions respectively.
During training, the student jointly learns from the gold targets and the distributions predicted by the teacher by optimizing the following objective function:

L = λ · L_KD + (1 − λ) · L_target

where λ is an interpolation coefficient between the target loss L_target and the structural KD loss L_KD. Following Clark et al. (2019) and Wang et al. (2020a), one may apply teacher annealing in training by decreasing λ linearly from 1 to 0. Because KD does not require gold labels, unlabeled data can also be used in the KD loss.
When performing knowledge distillation on structured prediction, a major challenge is that the structured output space is exponential in size, leading to intractable computation of the KD objective in Eq. 2. However, if the scoring function of the student model can be factorized into scores of substructures (Eq. 1), then we can derive the following factorized form of the structural KD objective.
L_KD = − Σ_{u∈U_s(x)} P_t(u|x) · Score_s(u, x) + log Z_s(x),  where  P_t(u|x) = Σ_{y∈Y(x)} 1_{u∈y} · P_t(y|x)    (3)

and 1_{condition} is 1 if the condition is true and 0 otherwise. From Eq. 3, we see that if U_s(x) is polynomial in size and P_t(u|x) can be tractably estimated, then the structural KD objective can be tractably computed and optimized. In the rest of this section, we will show that this is indeed the case for some of the most widely used models in sequence labeling and dependency parsing, two representative structured prediction tasks in NLP. Based on the difference in score factorization between the teacher and student models, we divide our discussion into four scenarios.
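The factorized form can be checked numerically on a toy problem: the exact cross-entropy over the exponential output space equals the sum of teacher substructure marginals weighted by student substructure scores, plus the student log-partition. A minimal sketch with hypothetical unary and pairwise student scores:

```python
import itertools, math

L, n = 2, 3
emit_s = [[0.2, 1.1], [0.7, 0.1], [0.5, 0.9]]   # hypothetical student scores
trans_s = [[0.3, -0.4], [0.2, 0.6]]

def score_s(y):
    return (sum(emit_s[i][y[i]] for i in range(n))
            + sum(trans_s[y[i-1]][y[i]] for i in range(1, n)))

ys = [list(t) for t in itertools.product(range(L), repeat=n)]
Zs = sum(math.exp(score_s(y)) for y in ys)
# an arbitrary teacher distribution over the full output space
w = [math.exp(0.1 * k) for k in range(len(ys))]
Pt = [v / sum(w) for v in w]

# left side: exact cross-entropy over all structures (exponential in general)
lhs = -sum(p * (score_s(y) - math.log(Zs)) for p, y in zip(Pt, ys))

# right side: factorized form using teacher marginals over student substructures
rhs = math.log(Zs)
for i in range(n):            # unary substructures (emission scores)
    for a in range(L):
        m = sum(p for p, y in zip(Pt, ys) if y[i] == a)
        rhs -= m * emit_s[i][a]
for i in range(1, n):         # pairwise substructures (transition scores)
    for a in range(L):
        for b in range(L):
            m = sum(p for p, y in zip(Pt, ys) if (y[i-1], y[i]) == (a, b))
            rhs -= m * trans_s[a][b]
assert abs(lhs - rhs) < 1e-9  # both forms of the KD objective agree
```

The brute-force marginals here stand in for the tractable estimates (e.g., forward-backward) that the actual models provide.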

Teacher and Student Share the Same Factorization Form
Case 1a: Linear-Chain CRF ⇒ Linear-Chain CRF In this case, both the teacher and the student are linear-chain CRF models. An example application is to compress a state-of-the-art CRF model for named entity recognition (NER) that is based on large pretrained contextualized embeddings to a smaller CRF model with static embeddings that is more suitable for fast online serving.
For a CRF student model described in section 2.1, if we absorb the emission score S_e(y_i, x) into the transition score S_t((y_{i−1}, y_i), x) at each position i, then the substructure space U_s(x) contains every two adjacent labels {(y_{i−1}, y_i)} for i = 1, . . . , n, with n being the sequence length, and the substructure score is defined as Score((y_{i−1}, y_i), x) = S_t((y_{i−1}, y_i), x) + S_e(y_i, x). The substructure marginal P_t((y_{i−1}, y_i)|x) of the teacher model can be computed by:

P_t((y_{i−1}, y_i)|x) = α(y_{i−1}) · exp(Score_t((y_{i−1}, y_i), x)) · β(y_i) / Z_t(x)    (4)

where α(y_{i−1}) and β(y_i) are forward and backward scores that can be tractably calculated using the classical forward-backward algorithm.
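The teacher's pairwise marginals described above can be computed with the forward-backward algorithm; the sketch below uses hypothetical toy scores and works directly in probability space (a real implementation would use log-sum-exp for numerical stability):

```python
import math

def forward_backward_pairwise(scores):
    """Pairwise teacher marginals P_t((y_{i-1}, y_i) | x) for a linear-chain CRF.
    scores[i][a][b] is the combined transition+emission score of the label pair
    (y_{i-1}=a, y_i=b); scores[0][0][b] holds the score of the initial label y_0=b."""
    n, L = len(scores), len(scores[0][0])
    alpha = [[0.0] * L for _ in range(n)]   # forward scores
    beta = [[1.0] * L for _ in range(n)]    # backward scores (beta at n-1 is 1)
    for b in range(L):
        alpha[0][b] = math.exp(scores[0][0][b])
    for i in range(1, n):
        for b in range(L):
            alpha[i][b] = sum(alpha[i-1][a] * math.exp(scores[i][a][b])
                              for a in range(L))
    for i in range(n - 2, -1, -1):
        for a in range(L):
            beta[i][a] = sum(math.exp(scores[i+1][a][b]) * beta[i+1][b]
                             for b in range(L))
    Z = sum(alpha[n-1][b] for b in range(L))
    # marginal of the pair at positions (i-1, i): alpha * pair score * beta / Z
    return [[[alpha[i-1][a] * math.exp(scores[i][a][b]) * beta[i][b] / Z
              for b in range(L)] for a in range(L)] for i in range(1, n)]

scores = [
    [[0.5, 1.2]],                  # initial scores of y_0 (hypothetical)
    [[0.1, 0.7], [0.3, -0.2]],     # pair scores for (y_0, y_1)
    [[0.4, 0.0], [0.6, 0.2]],      # pair scores for (y_1, y_2)
]
marg = forward_backward_pairwise(scores)
for table in marg:                 # each pairwise marginal table sums to 1
    assert abs(sum(sum(row) for row in table) - 1.0) < 1e-9
```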
Compared with the Posterior KD and Top-K KD of linear-chain CRFs proposed by Wang et al. (2020a), our approach calculates and optimizes the KD objective exactly, while their two KD approaches perform KD either heuristically or approximately. At the formulation level, our approach is based on the marginal distributions of two adjacent labels, while the Posterior KD is based on the marginal distributions of a single label.
Case 1b: Graph-based Dependency Parsing ⇒ Dependency Parsing as Sequence Labeling In this case, we use the biaffine parser proposed by Dozat and Manning (2017) as the teacher and the sequence labeling approach proposed by Strzyz et al. (2019) as the student for the dependency parsing task. The biaffine parser is one of the state-of-the-art models, while the sequence labeling parser provides a good speed-accuracy tradeoff. There is a big gap in accuracy between the two models and therefore KD can be used to improve the accuracy of the sequence labeling parser.
Here we follow the head-selection formulation of dependency parsing without the tree constraint. The dependency parse tree y is represented by y 1 , . . . , y n , where n is the sentence length and y i = (h i , l i ) denotes the dependency head of the i-th token of the input sentence, with h i being the index of the head token and l i being the dependency label. The biaffine parser predicts the dependency head for each token independently. It models separately the probability distribution of the head index P t (h i |x) and the probability distribution of the label P t (l i |x). The sequence labeling parser is a MaxEnt model that also predicts the head of each token independently. It computes Score((h i , l i ), x) for each token and applies a softmax function to produce the distribution P s ((h i , l i )|x).
Therefore, these two models share the same factorization in which each substructure is a dependency arc specified by y_i. U_s(x) thus contains all possible dependency arcs among tokens of the input sentence x. The substructure marginal predicted by the teacher can be easily derived as:

P_t((h_i, l_i)|x) = P_t(h_i|x) · P_t(l_i|x)    (5)

Note that in this case, the sequence labeling parser uses a MaxEnt decoder, which is locally normalized for each substructure. Therefore, the structural KD objective in Eq. 3 can be reduced to the following form without the need for calculating the student partition function Z_s(x):

L_KD = − Σ_i Σ_{(h_i, l_i)} P_t((h_i, l_i)|x) · log P_s((h_i, l_i)|x)
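The reduced, locally normalized form of the objective can be sketched for a single token as follows, with hypothetical teacher head/label distributions and student arc scores:

```python
import math

def substructure_kd_loss(p_head_t, p_label_t, student_scores):
    """Local KD loss for one token i: cross-entropy between the teacher's arc
    marginal P_t((h_i, l_i)|x) = P_t(h_i|x) * P_t(l_i|x) and the locally
    normalized student softmax over Score_s((h_i, l_i), x). Toy inputs."""
    H, L = len(p_head_t), len(p_label_t)
    log_z = math.log(sum(math.exp(student_scores[h][l])
                         for h in range(H) for l in range(L)))
    return -sum(p_head_t[h] * p_label_t[l] * (student_scores[h][l] - log_z)
                for h in range(H) for l in range(L))

p_head = [0.3, 0.7]    # hypothetical teacher head distribution for token i
p_label = [0.6, 0.4]   # hypothetical teacher label distribution for token i
# if the student matches the teacher's product marginal exactly,
# the cross-entropy loss equals the entropy of the teacher marginal
match = [[math.log(p_head[h] * p_label[l]) for l in range(2)] for h in range(2)]
loss = substructure_kd_loss(p_head, p_label, match)
entropy = -sum(p_head[h] * p_label[l] * math.log(p_head[h] * p_label[l])
               for h in range(2) for l in range(2))
assert abs(loss - entropy) < 1e-9
```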
In all the cases except Case 1a and Case 3, the student model is locally normalized and hence we can follow this form of objective.

Student Factorization Produces More Fine-grained Substructures than Teacher Factorization

Case 2a: Linear-Chain CRF ⇒ MaxEnt In this case, we use a linear-chain CRF model as the teacher and a MaxEnt model as the student. Previous work (Yang et al., 2018; Wang et al., 2020a) shows that a linear-chain CRF decoder often leads to better performance than a MaxEnt decoder for many sequence labeling tasks. Still, the simplicity and efficiency of the MaxEnt model is desirable. Therefore, it makes sense to perform KD from a linear-chain CRF to a MaxEnt model. As mentioned in Case 1a, the substructures of a linear-chain CRF model are consecutive labels {(y_{i−1}, y_i)}. In contrast, a MaxEnt model predicts the label probability distribution P_s(y_i|x) of each token independently and hence the substructure space U_s(x) consists of every individual label {y_i}. To calculate the substructure marginal of the teacher P_t(y_i|x), we can again utilize the forward-backward algorithm:

P_t(y_i|x) = α(y_i) · β(y_i) / Z_t(x)

where α(y_i) and β(y_i) are forward and backward scores.
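On a toy chain, the α(y_i)·β(y_i)/Z identity for the teacher's unary marginals can be verified against brute-force enumeration (all scores below are hypothetical):

```python
import math

# toy check of P_t(y_i|x) = alpha(y_i) * beta(y_i) / Z for a 3-token,
# 2-label linear-chain CRF with hypothetical scores
trans = [[0.3, -0.1], [0.2, 0.5]]
emit = [[1.0, 0.4], [0.2, 0.9], [0.6, 0.3]]

# brute-force marginal of y_1 = 0
num, Z = 0.0, 0.0
for y0 in range(2):
    for y1 in range(2):
        for y2 in range(2):
            s = (emit[0][y0] + emit[1][y1] + emit[2][y2]
                 + trans[y0][y1] + trans[y1][y2])
            Z += math.exp(s)
            if y1 == 0:
                num += math.exp(s)
brute = num / Z

# forward score alpha(y_1=0): all prefixes ending with label 0 at position 1
alpha = sum(math.exp(emit[0][y0] + trans[y0][0] + emit[1][0]) for y0 in range(2))
# backward score beta(y_1=0): all suffixes continuing from label 0 at position 1
beta = sum(math.exp(trans[0][y2] + emit[2][y2]) for y2 in range(2))
assert abs(alpha * beta / Z - brute) < 1e-9
```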
Case 2b: Second-Order Dependency Parsing ⇒ Dependency Parsing as Sequence Labeling The biaffine parser is a first-order dependency parser, which scores each dependency arc in a parse tree independently. A second-order dependency parser scores pairs of dependency arcs with a shared token. The substructures of second-order parsing are therefore all the dependency arc pairs with a shared token. It has been found that second-order extensions of the biaffine parser often have higher parsing accuracy (Wang et al., 2019; Zhang et al., 2020; Wang et al., 2020d; Wang and Tu, 2020). Therefore, we may take a second-order dependency parser as the teacher to improve a sequence labeling parser.
Here we consider the second-order dependency parser of Wang and Tu (2020). It employs mean field variational inference to estimate the probabilities of arc existence P t (h i |x) and uses a first-order biaffine model to estimate the probabilities of arc labels P t (l i |x). Therefore, the substructure marginal can be calculated in the same way as Eq. 5.

Teacher Factorization Produces More Fine-grained Substructures than Student Factorization

Case 3: MaxEnt ⇒ Linear-Chain CRF Previous work (Wu and Dredze, 2019) has shown that multilingual BERT (M-BERT) has strong zero-shot cross-lingual transferability in NER tasks. Many such models employ a MaxEnt decoder. In scenarios requiring fast speed and low computation cost, however, we may want to distill knowledge from such models to a model with much cheaper static monolingual embeddings while compensating for the performance loss with a linear-chain CRF decoder. As described in Case 1a, the substructures of a linear-chain CRF model are consecutive labels {(y_{i−1}, y_i)}. Because of the label independence and local normalization in the MaxEnt model, the substructure marginal of the MaxEnt teacher is calculated by:

P_t((y_{i−1}, y_i)|x) = P_t(y_{i−1}|x) · P_t(y_i|x)
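Because the MaxEnt teacher treats token labels as independent, its pairwise marginal is just a product of per-token distributions; a tiny sketch with hypothetical distributions:

```python
# teacher substructure marginals for a CRF student from a MaxEnt teacher:
# the pairwise marginal is the product of independent per-token marginals
p_prev = [0.2, 0.5, 0.3]   # hypothetical P_t(y_{i-1} | x)
p_cur = [0.6, 0.1, 0.3]    # hypothetical P_t(y_i | x)
pair = [[p_prev[a] * p_cur[b] for b in range(3)] for a in range(3)]
# the product of two distributions is itself a distribution over label pairs
assert abs(sum(sum(row) for row in pair) - 1.0) < 1e-9
```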

Factorization Forms From Teacher and Student are Incompatible
Case 4: NER as Parsing ⇒ MaxEnt Very recently, Yu et al. (2020) propose to solve the NER task as graph-based dependency parsing and achieve state-of-the-art performance. They represent each named entity with a dependency arc from the first token to the last token of the named entity, and represent the entity type with the arc label. However, for the flat NER task (i.e., there is no overlap between entity spans), the time complexity of this method is higher than that of commonly used sequence labeling NER methods. In this case, we take a parsing-based NER model as our teacher and a MaxEnt model with the BIOES label scheme as our student. The two models adopt very different representations of NER output structures. The parsing-based teacher model represents an NER output of a sentence with a set of labeled dependency arcs and defines its score as the sum of arc scores. The MaxEnt model represents an NER output of a sentence with a sequence of BIOES labels and defines its score as the sum of token-wise label scores. Therefore, the factorization forms of these two models are incompatible.
Computing the substructure marginal of the teacher P_t(y_i|x), where y_i ∈ {B_l, I_l, E_l, S_l, O | l ∈ L} and L is the set of entity types, is much more complicated than in the previous cases. Take y_i = B_l for example. P_t(y_i = B_l|x) represents the probability of the i-th word being the beginning of a multi-word entity of type l. In the parsing-based teacher model, this probability is proportional to the summation of exponentiated scores of all the output structures that contain a dependency arc of label l with the i-th word as its head and with its length larger than 1. It is intractable to compute such marginal probabilities by enumerating all the output structures, but we can tractably compute them using dynamic programming. See the supplementary material for a detailed description of our dynamic programming method.
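As a toy reference for what the dynamic program computes, the marginal P_t(y_i = B_l|x) can be defined by brute-force enumeration over all sets of non-overlapping labeled spans; the scorer `arc_score` below is a hypothetical stand-in for the teacher's arc scores, and this enumeration is exponential, which is exactly why the paper replaces it with dynamic programming:

```python
import itertools, math

def span_sets(n):
    """All sets of non-overlapping spans (i, j), 0 <= i <= j < n (flat NER)."""
    spans = [(i, j) for i in range(n) for j in range(i, n)]
    for r in range(len(spans) + 1):
        for combo in itertools.combinations(spans, r):
            used = [t for (i, j) in combo for t in range(i, j + 1)]
            if len(used) == len(set(used)):   # keep only non-overlapping sets
                yield combo

def marginal_B(arc_score, n, labels, pos, lab):
    """Brute-force P_t(y_pos = B_lab | x): probability that a multi-word entity
    of type `lab` starts at token `pos` under the parsing-based teacher."""
    num = Z = 0.0
    for spans in span_sets(n):
        for labeling in itertools.product(labels, repeat=len(spans)):
            s = sum(arc_score(i, j, l) for (i, j), l in zip(spans, labeling))
            w = math.exp(s)
            Z += w
            if any(i == pos and j > i and l == lab
                   for (i, j), l in zip(spans, labeling)):
                num += w
    return num / Z

# hypothetical toy arc scorer: longer spans and PER labels score higher
arc_score = lambda i, j, l: 0.5 * (j - i) + (0.3 if l == "PER" else 0.1)
p = marginal_B(arc_score, 4, ["PER", "ORG"], 0, "PER")
assert 0.0 < p < 1.0
```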

Settings
Datasets We use the CoNLL 2002/2003 datasets (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003) for Case 1a, 2a and 4, and use the WikiAnn datasets (Pan et al., 2017) for Case 1a, 2a, 3, and 4. The CoNLL datasets contain the corpora of four Indo-European languages. We use the same four languages from the WikiAnn datasets. For cross-lingual transfer in Case 3, we use the four Indo-European languages as the source for the teacher model and additionally select four languages from different language families as the target for the student models. We use the standard training/development/test split for the CoNLL datasets. For WikiAnn, we follow the sampling of Wang et al. (2020a) with 12,000 sentences for English and 5,000 sentences for each of the other languages. We split the datasets by 3:1:1 for training/development/test. For Case 1b and 2b, we use Penn Treebank (PTB) 3.0 and follow the same pre-processing pipeline as in Ma et al. (2018). For unlabeled data, we sample sentences that belong to the same languages as the labeled data from the WikiAnn datasets for Case 1a, 2a and 4, and we sample sentences from the target languages of the WikiAnn datasets for Case 3. We use the BLLIP corpus as the unlabeled data for Case 1b and 2b.
Models For the student models in all the cases, we use fastText (Bojanowski et al., 2017) word embeddings and character embeddings as the word representation. For Case 1a, 2a and 4, we concatenate the multilingual BERT, Flair (Akbik et al., 2018), fastText embeddings and character embeddings (Santos and Zadrozny, 2014) as the word representations for stronger monolingual teacher models (Wang et al., 2020c). For Case 3, we use M-BERT embeddings for the teacher. Also for Case 3, we fine-tune the teacher model on the training set of the four Indo-European languages from the WikiAnn dataset and train student models on the four additional languages. For the teacher models in Case 1b and 2b, we simply use the same embeddings as the student, because there is already a huge performance gap between the teacher and student in these settings and hence we do not need strong embeddings for the teacher to demonstrate the utility of KD.

Baselines We compare our approach (Struct. KD) with the posterior KD (Pos. KD) of Wang et al. (2020a), who also propose Top-K KD but have shown that it is inferior to Pos. KD. For experiments using unlabeled data in all the cases, in addition to labeled data, we use the teacher's prediction on the unlabeled data as pseudo labeled data to train the student models. This can be seen as the Top-1 KD method. In Case 2a and 3, where we perform KD between CRF and MaxEnt models, we run a reference baseline that replaces the CRF teacher or student model with a MaxEnt model and performs token-level KD (Token KD) of MaxEnt models that optimizes the cross-entropy between the teacher and student label distributions at each position.
Training For MaxEnt and linear-chain CRF models, we use the same hyper-parameters as in Akbik et al. (2018). For dependency parsing, we use the same hyper-parameters as in Wang and Tu (2020) for teacher models and Strzyz et al. (2019) for student models. For M-BERT fine-tuning in Case 3, we mix the training data of the four source datasets and train the teacher model with the AdamW optimizer (Loshchilov and Hutter, 2018) with a learning rate of 5×10^−5 for 10 epochs. We tune the KD temperature in {1, 2, 3, 4, 5} and the loss interpolation annealing rate in {0.5, 1.0, 1.5}. For all experiments, we train the models for 5 runs with a fixed random seed for each run.

Tables 3 and 4 show the effectiveness of Struct. KD in both cases. In Case 1a, our approach is stronger on average than both Top-WK KD and Pos. KD as well as the mixture of the two approaches. In Case 2a, Struct. KD not only outperforms Token KD, but also makes the MaxEnt student competitive with the CRF student without KD (87.32 vs. 87.36).

Amount of Unlabeled Data
We compare our approaches with the baselines with different amounts of unlabeled data for Case 1a, 1b and 3, which are cases that apply in-domain unlabeled data for NER and dependency parsing, and cross-lingual unlabeled data for NER. We experiment with more unlabeled data for Case 1b than for the other two cases because the labeled training data of PTB is more than 10 times larger than the labeled NER training data in Case 1a and 3. Results are shown in Figure 1. The experimental results show that our approaches consistently outperform the baselines, though the performance gaps between them become smaller when the amount of unlabeled data increases. Comparing the performance of the students with the teachers, we can see that in Case 1a and 1b, the gap between the teacher and the student remains large even with the largest amount of unlabeled data. This is unsurprising considering the difference in model capacity between the teacher and the student. In Case 3, however, we find that when using 30,000 unlabeled sentences, the CRF student models can even outperform the MaxEnt teacher model, which shows the effectiveness of CRF models on NER.

Temperature in Structural Knowledge Distillation
A frequently used KD technique is dividing the logits of probability distributions of both the teacher and the student by a temperature in the KD objective (Hinton et al., 2015). Using a higher temperature produces softer probability distributions and often results in higher KD accuracy. In structural KD, there are two approaches to applying the temperature to the teacher model, either globally to the logit of P t (y|x) (i.e., Score t (y, x)) of the full structure y, or locally to the logit of P t (u|x) of each student substructure u. We empirically compare these two approaches in Case 1a with the same setting as in Section 4.1. Table 5 shows that the local approach results in better accuracy for all the languages. Therefore, we use the local approach by default in all the experiments.
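The local variant described above divides each substructure logit by the temperature before normalizing; a minimal sketch with a hypothetical logit table:

```python
import math

def soften_local(substructure_logits, T):
    """Apply the KD temperature locally: divide each substructure logit by T,
    then renormalize with softmax. A minimal sketch with hypothetical logits."""
    exp = [math.exp(s / T) for s in substructure_logits]
    Z = sum(exp)
    return [e / Z for e in exp]

p1 = soften_local([2.0, 0.5, -1.0], 1.0)
p4 = soften_local([2.0, 0.5, -1.0], 4.0)
assert max(p4) < max(p1)              # higher temperature -> softer distribution
assert abs(sum(p4) - 1.0) < 1e-9      # still a proper distribution
```

The global variant would instead divide the full-structure score Score_t(y, x) by T before computing the marginals; the table in the text compares the two empirically.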

Comparison of Teachers
In Case 2a and Case 4, we use the same MaxEnt student model but different types of teacher models. Our structural KD approaches in both cases compute the marginal distribution P_t(y_i|x) of the teacher at each position i following the substructures of the MaxEnt student, which is then used to train the student substructure scores. We can evaluate the quality of the marginal distributions by taking their modes as label predictions and evaluating their accuracy. In Table 6, we compare the accuracy of the CRF teacher and its marginal distributions from Case 2a, the NER-as-parsing teacher and its marginal distributions from Case 4, and the MaxEnt teacher which is the KD baseline in Case 2a. First, we observe that for both CRF and NER-as-parsing, predicting labels from the marginal distributions leads to lower accuracy. This is to be expected because such predictions do not take into account correlations between adjacent labels. While predictions from marginal distributions of the CRF teacher still outperform MaxEnt, those of the NER-as-parsing teacher clearly underperform MaxEnt. This provides an explanation as to why Struct. KD in Case 4 has equal or even lower accuracy than the Token KD baseline in Case 2a in Table 3.
Related Work

Structured Prediction
In this paper, we use sequence labeling and dependency parsing as two example structured prediction tasks. In sequence labeling, a lot of work applies the linear-chain CRF and achieves state-of-the-art performance in various tasks. In dependency parsing, recent work (2020) showed that second-order parsing is four times slower than the simple head-selection first-order approach (Dozat and Manning, 2017). Such a speed-accuracy tradeoff as seen in sequence labeling and dependency parsing also occurs in many other structured prediction tasks. This makes KD an interesting and very useful technique that can be used to circumvent this tradeoff to some extent.

Knowledge Distillation in Structured Prediction
KD has been applied in many structured prediction tasks in the fields of NLP, speech recognition and computer vision, with applications such as neural machine translation. In KD for structured prediction tasks, how to handle the exponential number of structured outputs is a main challenge. To address this difficult problem, recent work resorts to approximation of the KD objective. Kim and Rush (2016) proposed sequence-level distillation through predicting the K-best sequences of the teacher in neural machine translation. Kuncoro et al. (2016) proposed to use multiple greedy parsers as teachers and generate the probability distribution at each position through voting. Very recently, Wang et al. (2020a) proposed structure-level knowledge distillation for linear-chain CRF models in multilingual sequence labeling. During the distillation process, teacher models predict the Top-K label sequences as the global structural information or the posterior label distribution at each position as the local structural information, which is then used to train the student. Besides approximate approaches, an alternative is to use models that make local decisions and perform KD on these local decisions. Anderson and Gómez-Rodríguez (2020) formulated dependency parsing as a head-selection problem and distilled the distribution of the head node at each position. Tsai et al. (2019) proposed MiniBERT by distilling the output distributions of the MaxEnt classifiers of M-BERT models. Besides the output distribution, Mukherjee and Hassan Awadallah (2020) further distilled the hidden representations of teachers.

Conclusion
In this paper, we propose structural knowledge distillation, which transfers knowledge between structured prediction models. We derive a factorized form of the structural KD objective and make it tractable to compute and optimize for many typical choices of teacher and student models. We apply our approach to four KD scenarios with six cases for sequence labeling and dependency parsing. Empirical results show that our approach outperforms baselines without KD as well as previous KD approaches. With sufficient unlabeled data, our approach can even boost the students to outperform the teachers in zero-shot cross-lingual transfer.

A Dynamic Programming for Case 4
We describe how the marginal distribution over BIOES labels at each position of the input sentence can be tractably computed based on the NER-as-parsing teacher model using dynamic programming.
Given an input sentence x with n words, we first define the following functions.
• − → DP(i, l) represents the summation of scores of all possible labeling sequences of the subsentence from the first token to the i-th token while a span ends with the i-th token with a label l.
• − → DP(i, F) represents the summation of scores of all possible labeling sequences of the subsentence from the first token to the i-th token while there is no arc pointing to the i-th token.
• ←−DP(i, l) represents the summation of scores of all possible labeling sequences of the subsentence from the i-th token to the last token while a span starts with the i-th token with a label l.
• ←−DP(i, F) represents the summation of scores of all possible labeling sequences of the subsentence from the i-th token to the last token while there is no arc coming from the i-th token.
We can compute the values of these functions for all values of i and l using dynamic programming, starting from the sentence boundaries as base cases and recursing over possible span start and end positions, where Score(y_{i,j} = l) is the score assigned by the teacher model to the dependency arc from i to j with label l (i.e., y_{i,j} = l denotes that there is a dependency arc of label l from the i-th word to the j-th word). After dynamic programming, we can compute the substructure marginals of the teacher P_t(y_i|x) from the following quantities:
• DP(X, i) represents the summation of scores of all possible labeling sequences in which the i-th token is labeled as X, where X can be one of B_l, I_l, E_l, O, S_l.
• Z(x) represents the summation of scores of all possible labeling sequences given the input sentence x, which can be calculated as Z(x) = −→DP(n, F) + Σ_l −→DP(n, l).
The edge cases are: P_t(y_n = B_l|x) = 0, P_t(y_1 = I_l|x) = P_t(y_n = I_l|x) = 0, and P_t(y_1 = E_l|x) = 0.

B Speed and Model Size Comparison
An important goal of KD is to produce faster and smaller models. In Table 7, we show a comparison of the running speed and model size between the teacher and student models on the CoNLL English test set from Case 2a. It can be seen that the student model is about 24 times faster and 25 times smaller than the teacher model.

C Detailed Experimental Results
In this section, we present detailed experimental results. For significance testing, we use the Almost Stochastic Dominance (ASD) test, which is a high-quality comparison between deep neural networks. We evaluate with a significance level of 0.05. For the significance test over averaged scores, we treat the average over languages with the same random seed as one sample of the averaged score. In the tables, we use † to indicate that our approaches are significantly stronger than the models trained without KD or with Top-1 KD, and ‡ to indicate that our approaches are significantly stronger than other KD approaches.
C.1 Results of NER task
Tables 8, 9 and 10 present the KD results of experiments with labeled and unlabeled datasets. Our approaches outperform the baselines significantly in most of the cases. Note that in some cases, our approaches perform slightly worse than other approaches (for example, on the de dataset in Case 1a in Table 9 with 30k unlabeled sentences) while still being stronger according to the ASD test. The possible reason is that the variances of our approaches are much larger than those of the other approaches, and ASD indicates that our approaches are probably better.

C.2 Results of Parsing task
Tables 11 and 12 present the results of the parsing experiments. Our structural KD approaches significantly outperform the other approaches in all cases. UAS and LAS in these tables are standard dependency parsing metrics, referring to the unlabeled attachment score and labeled attachment score respectively.

Table 12: The accuracy of the parsing task with unlabeled data (in thousands). Note that all our approaches are significantly stronger than the baseline.