Structured Refinement for Sequential Labeling

Filtering out target-irrelevant information by hierarchically refining hidden states has been shown to be effective for obtaining informative representations. However, previous work relies on locally normalized attention that ignores possible labels at other time steps, which limits its capacity for modeling long-term dependency relations. In this paper, we propose to extend previous work with globally normalized attention, i.e., structured attention, to leverage structural information for more effective representation refinement. We also propose two implementation tricks to accelerate CRF computation and an initialization trick for Chinese character embeddings to further improve performance. We provide extensive experimental results on various datasets to show the effectiveness and efficiency of our proposed method.


Introduction
Sequential labeling tasks, e.g., named entity recognition (NER) and part-of-speech (POS) tagging, play an important role in natural language processing. Figure 1 shows two examples of sequential labeling tasks. Early studies focused on introducing rich features to improve performance. For example, to handle out-of-vocabulary words with morphological features, Lample et al. (2016) and Ma and Hovy (2016) leveraged character-level features, whereas Heinzerling and Strube (2019) exploited subword-level features. Moreover, introducing long-term dependency features has also been found beneficial for sequential labeling: Jie and Lu (2019) explicitly exploited dependency relations with additional annotations, while Chen et al. (2019) endeavored to learn these relations implicitly with more complex encoders.

* This work was done when the first author was at NAIST.

Figure 1: Two examples of sequential labeling tasks: NER (top) and POS tagging (bottom). For NER, "the Senate Finance Committee" is a named entity of type ORG (organization). The prefixes B-, I-, and E- indicate that a word is located at the beginning, intermediate, or end of the current named entity, while O signifies that a word is outside any named entity. In the case of POS tagging, each tag is a part-of-speech category; for instance, NN represents a singular noun and VBN a verb in past participle form.
However, as Tishby and Zaslavsky (2015) pointed out, features are not created equal: only target-relevant features are profitable for improving model performance. Recently, Cui and Zhang (2019) proposed a hierarchically-refined label attention network (LAN), which explicitly leverages label embeddings and captures long-term label dependency relations through multiple refinement layers.
Individually picking the most likely label at each time step is undoubtedly critical; however, considering the entire label sequence is also indispensable. We find that the locally normalized attention, which Cui and Zhang (2019) used to leverage information from label embeddings, can eventually hurt performance: since it only considers the current time step and ignores labels at other time steps, we presume its ability to capture long-term dependency relations is limited.
On the other hand, Kim et al. (2017) combined neural networks with probabilistic graphical models to obtain structural distributions as an alternative to conventional attention mechanisms. Their method relies on attending to cliques of linear-chain conditional random fields (CRF). These inferred inner structures, represented as marginal probabilities, are not the targets of their tasks but only serve as latent variables; thus they impose no direct supervision on the attention weights. In contrast, since we repeatedly refine these inferred structures to obtain the final outputs, we compute structured attention over the target labels instead, without introducing unobserved variables. In this paper, we propose a novel structured refinement mechanism that combines representation refinement and structured attention. Following and extending Cui and Zhang (2019), we hierarchically refine hidden representations with globally normalized structured attention, i.e., the marginal probabilities of a CRF. Besides, to impose direct supervision on the target structures, we share the label embeddings and the transition matrix of the CRF across layers. Our method can be considered as iteratively reconstructing hidden representations with only label embeddings, and it is thus capable of filtering out target-irrelevant information.
Besides, we propose a character embedding initialization trick to enhance performance on Chinese datasets and two CRF implementation tricks to accelerate computation.
Our contributions are fourfold: (a) we propose a novel structured refinement network that combines representation refinement and structured attention for sequential labeling tasks, (b) we propose an initialization trick for Chinese character embeddings, (c) we propose two implementation tricks to accelerate CRF training and decoding, and (d) we demonstrate the effectiveness and efficiency of our model through extensive experiments for NER and POS tagging on various datasets.

Baseline
Formally speaking, given a token sequence $\{x_t\}_{t=1}^{n}$, the aim of sequential labeling tasks is to find the most probable label sequence $\{y_t\}_{t=1}^{n}$.

Label Attention Network
The label attention network (Cui and Zhang, 2019) consists of an embedding layer followed by several alternating encoding and refinement layers. The decoding layer is a bidirectional LSTM followed by a refinement layer.
Embedding Layer Cui and Zhang (2019) employed the concatenation of word and character-based word representations as the token representation $x_t = [w_t, c_t]$: they convert words to word embeddings $w_t \in \mathbb{R}^{d_w}$ and use a character-level bidirectional LSTM to build character-based word embeddings $c_t \in \mathbb{R}^{d_c}$.
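To make the construction of $x_t = [w_t, c_t]$ concrete, the following is a minimal PyTorch sketch, not the authors' implementation; the class name TokenEncoder, vocabulary sizes, and tensor layouts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TokenEncoder(nn.Module):
    """Sketch of x_t = [w_t, c_t]: word embedding concatenated with a char-BiLSTM summary."""
    def __init__(self, n_words, n_chars, d_w=100, d_char=30, d_c=100):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, d_w)
        self.char_emb = nn.Embedding(n_chars, d_char)
        # bidirectional LSTM over characters; d_c // 2 per direction
        self.char_lstm = nn.LSTM(d_char, d_c // 2, bidirectional=True, batch_first=True)

    def forward(self, word_ids, char_ids):
        # word_ids: (n,) token indices; char_ids: (n, max_chars) character indices per token
        w = self.word_emb(word_ids)                          # w_t: (n, d_w)
        _, (h_n, _) = self.char_lstm(self.char_emb(char_ids))
        c = torch.cat([h_n[0], h_n[1]], dim=-1)              # final fwd/bwd states: (n, d_c)
        return torch.cat([w, c], dim=-1)                     # x_t = [w_t, c_t]: (n, d_w + d_c)
```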
Encoding Layer They utilize an independent bidirectional LSTM for each layer $l$ as follows: $h^{(l)}_t = \mathrm{BiLSTM}\big(\tilde{h}^{(l-1)}_t\big)$, where $\tilde{h}^{(l-1)}_t$ is the refined representation from the last refinement layer. In particular, the hidden vector $\tilde{h}^{(0)}_t$ is the token representation $x_t$. After this, they employ a refinement layer, called the "label-attention inference sublayer" in the original paper, to refine hidden states. They make use of the attention mechanism (Vaswani et al., 2017) to produce the attention matrix $\alpha^{(l)}_{t,j}$ as in Equation 2, and further calculate the label-aware hidden states $\hat{h}^{(l)}_t \in \mathbb{R}^{d_h}$, which jointly encode information from the token representation subspace and the label representation subspace.
where $Q^{(l)}, K^{(l)}, V^{(l)} \in \mathbb{R}^{d_h \times d_h}$ are all parameters, and $v_{y_j} \in \mathbb{R}^{d_h}$ is the embedding of label $y_j \in \mathcal{Y}$. In practice, they use multiple heads to capture representations from multiple aspects in parallel. After that, they concatenate the hidden state $h^{(l)}_t$ and the label-aware hidden state $\hat{h}^{(l)}_t$ as the refined representation $\tilde{h}^{(l)}_t \in \mathbb{R}^{2d_h}$, and feed it into the next encoding layer.
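For illustration, below is a single-head sketch of this label-attention refinement; the multi-head details of Cui and Zhang (2019) are omitted, and the class name and shapes are our own assumptions.

```python
import torch
import torch.nn as nn

class LabelAttention(nn.Module):
    """Single-head sketch of the label-attention inference sublayer of LAN."""
    def __init__(self, d_h, num_labels):
        super().__init__()
        self.q = nn.Linear(d_h, d_h, bias=False)             # Q^(l)
        self.k = nn.Linear(d_h, d_h, bias=False)             # K^(l)
        self.v = nn.Linear(d_h, d_h, bias=False)             # V^(l)
        self.label_emb = nn.Embedding(num_labels, d_h)       # label embeddings v_{y_j}

    def forward(self, h):
        # h: (n, d_h) hidden states from the encoding layer
        labels = self.label_emb.weight                        # (num_labels, d_h)
        # locally normalized attention over labels at each time step (cf. Equation 2)
        scores = self.q(h) @ self.k(labels).T / labels.size(-1) ** 0.5
        alpha = torch.softmax(scores, dim=-1)                 # (n, num_labels)
        h_hat = alpha @ self.v(labels)                        # label-aware hidden states
        return torch.cat([h, h_hat], dim=-1)                  # refined representation: (n, 2*d_h)
```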
Decoding Layer Similar to the encoding layer, the decoding layer contains a bidirectional LSTM and a refinement layer, but at this layer the attention matrix $\alpha^{(L+1)}_{t,j}$ only serves as the label probability distribution used to predict the most probable label sequence.
Proposed Method

Structured Refinement
A notable highlight of the model of Cui and Zhang (2019) is that, although it is not equipped with the commonly used CRF (Lample et al., 2016; Ma and Hovy, 2016), it still achieves remarkable performance. Precisely because it abandons the computationally expensive CRF, their model obtains a significant acceleration in both training and decoding. However, we find that the time-step-independent attention, i.e., the softmax operation in Equation 2, only considers the labels at the current time step and ignores all possible label combinations at other time steps; performance is thus eventually degraded since its ability to capture long-term dependency relations is local and limited. We therefore bring the CRF back and use its marginal probabilities to construct refined representations. We claim that replacing the attention matrix $\alpha^{(l)}_{t,j}$ with the globally normalized marginal probability captures long-term dependency relations more effectively.
The potential function of the CRF is defined as
$$\psi\big(y_{t-1}, y_t, h^{(l)}_t\big) = \exp\big(A_{y_{t-1}, y_t} + v_{y_t}^{\top} h^{(l)}_t\big),$$
where $A \in \mathbb{R}^{|\mathcal{Y}| \times |\mathcal{Y}|}$ is the transition matrix, $A_{y_{t-1}, y_t}$ denotes the transition score from label $y_{t-1}$ to label $y_t$, and $v_{y_t}$ is the embedding of label $y_t$. The conditional probability of a specific label sequence $y$ can be described as
$$p\big(y \mid h^{(l)}\big) = \frac{\prod_{t=1}^{n} \psi\big(y_{t-1}, y_t, h^{(l)}_t\big)}{Z\big(h^{(l)}\big)},$$
where $Z(h^{(l)})$ is the global normalization term, commonly known as the partition function. Furthermore, the marginal probability is defined as follows:
$$\mu_t\big(y_j, h^{(l)}\big) = \sum_{y:\, y_t = y_j} p\big(y \mid h^{(l)}\big).$$
The marginal probability stands for the sum of the probabilities of all possible label sequences that emit label $y_j$ at time step $t$. Calculating the marginal probability requires enumerating all possible structures, and it can thus be called a globally normalized probability, or structured attention.
We replace the locally normalized attention $\alpha^{(l)}_{t,j}$ in Equation 3 with our globally normalized one, i.e., $\mu_t(y_j, h^{(l)})$. Furthermore, we employ a residual connection (He et al., 2016) and layer normalization (Ba et al., 2016), instead of concatenation, to construct the refined representation $\tilde{h}^{(l)}_t = \mathrm{LayerNorm}\big(h^{(l)}_t + \hat{h}^{(l)}_t\big)$, which is then fed into the next layer.
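A minimal sketch of this refinement step is given below; it assumes the CRF marginals have already been computed (see the next subsection) and uses the expectation of the label embeddings under the marginals as the label-aware states. The class name and shapes are illustrative, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class StructuredRefinement(nn.Module):
    """Sketch: refine hidden states with CRF marginals used as structured attention."""
    def __init__(self, d_h, num_labels):
        super().__init__()
        self.label_emb = nn.Embedding(num_labels, d_h)        # shared across layers
        self.norm = nn.LayerNorm(d_h)

    def forward(self, h, marginals):
        # h: (n, d_h) hidden states; marginals: (n, num_labels) CRF marginals mu_t(y_j, h)
        h_hat = marginals @ self.label_emb.weight              # expected label embedding per step
        return self.norm(h + h_hat)                            # residual connection + layer norm
```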

Computing Marginal Probability
The conventional method to compute the marginal probability $\mu_t(y_j, h^{(l)})$ requires running the forward-backward algorithm. Fortunately, as Eisner (2016) indicates, merely computing the log-partition function, $\log Z(h^{(l)})$, and differentiating it with an automatic differentiation library yields the equivalent marginal probability efficiently. Thus, we use the torch.autograd.grad function of PyTorch to compute the marginal probability as follows:
$$\mu_t\big(y_j, h^{(l)}\big) = \frac{\partial \log Z\big(h^{(l)}\big)}{\partial\, \big(v_{y_j}^{\top} h^{(l)}_t\big)}.$$
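The following sketch illustrates the idea on a single unbatched sentence; the function names and tensor shapes are our own assumptions, and batching, masking, and start/stop transitions are omitted.

```python
import torch

def crf_log_partition(emissions, transitions):
    """Forward algorithm in log space; returns log Z for one sentence.

    emissions: (n, num_labels) unary scores, e.g., v_{y_j}^T h_t.
    transitions: (num_labels, num_labels) transition scores A[y', y].
    """
    alpha = emissions[0]
    for t in range(1, emissions.size(0)):
        # alpha[j] = logsumexp_i(alpha[i] + A[i, j]) + emissions[t, j]
        alpha = torch.logsumexp(alpha.unsqueeze(1) + transitions, dim=0) + emissions[t]
    return torch.logsumexp(alpha, dim=0)

def crf_marginals(emissions, transitions):
    """Marginals mu_t(y_j) as gradients of log Z w.r.t. the unary scores (Eisner, 2016)."""
    log_z = crf_log_partition(emissions, transitions)
    # create_graph=True keeps the marginals differentiable for the refinement layers above
    (marginals,) = torch.autograd.grad(log_z, emissions, create_graph=True)
    return marginals                                            # (n, num_labels); rows sum to 1
```

For example, with `emissions = torch.randn(5, 7, requires_grad=True)` and `transitions = torch.randn(7, 7)`, `crf_marginals(emissions, transitions)` returns a (5, 7) tensor whose rows each sum to one.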

Training and Decoding
We train our model by maximizing the log-likelihood with the back-propagation algorithm.
The objective function is defined as follows:
$$\mathcal{L} = \sum_{(x,\, y)} \log p\big(y \mid x\big).$$
We apply the Viterbi algorithm (Forney, 1973) to efficiently search for the most probable label sequence at the decoding stage.
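For completeness, here is a minimal sketch of Viterbi decoding for a single sentence (no batching or padding; shapes follow the marginal-computation sketch above):

```python
import torch

def viterbi_decode(emissions, transitions):
    """Return the most probable label sequence for one sentence."""
    n, num_labels = emissions.shape
    score = emissions[0]                                        # best score ending in each label
    backpointers = []
    for t in range(1, n):
        # candidate[i, j]: best path ending with label i at t-1 followed by label j at t
        candidate = score.unsqueeze(1) + transitions + emissions[t]
        score, best_prev = candidate.max(dim=0)
        backpointers.append(best_prev)
    best_last = int(score.argmax())
    path = [best_last]
    for best_prev in reversed(backpointers):                    # follow the back-pointers
        path.append(int(best_prev[path[-1]]))
    return list(reversed(path))
```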

Complexity and Implementation Tricks
One concern regarding our proposed method is its computational complexity, as it requires computing not only the partition function but also the marginal probability. Calculating the partition function, as in Equation 8, is the well-known bottleneck of CRF computation, and it is commonly achieved by reducing the chain of potential matrices with matrix multiplications. Similar to Rush (2020), we make use of the associative property of matrix multiplication to accelerate computation: the product of matrices $A$, $B$, $C$, and $D$ is equivalent to the product of $AB$ and $CD$, i.e., $(AB)(CD) = ABCD$.
Leveraging the power of the GPU to compute $AB$ and $CD$ in parallel, and recursively applying this trick, we can reduce the time complexity of obtaining the partition function from linear, i.e., $O(\max_{i \in B} |x_i|)$, to logarithmic in the sentence length, where $|x_i|$ is the length of the $i$-th sentence in batch $B$. Moreover, instead of padding the sequence length $|x_i|$ out to the nearest power of two as Rush (2020) does, we pre-compile the argument indices of the matrix multiplications to handle the variable sentence lengths within a batch. Our method effectively avoids out-of-memory errors since we do not waste memory on padding. This pre-compiling trick can further reduce the time complexity to $O(\max_i \log |x_i|)$. We release our CRF implementation with these two tricks as an independent library 1 for future study and use.
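A simplified sketch of the associativity trick for one sentence is shown below; it performs the pairwise reductions in log space and assumes the unary scores have already been folded into the pairwise potentials. Batching and the pre-compiled index handling are omitted, and the function names are our own.

```python
import torch

def log_matmul(a, b):
    """Log-space matrix 'product': logsumexp over the shared label dimension."""
    return torch.logsumexp(a.unsqueeze(-1) + b.unsqueeze(-3), dim=-2)

def log_partition_parallel(potentials):
    """Reduce a chain of log-potential matrices pairwise, as in (AB)(CD) = ABCD.

    potentials: (n-1, num_labels, num_labels) log potentials for one sentence of length n,
    with unary scores already added into the pairwise potentials.
    """
    while potentials.size(0) > 1:
        tail = None
        if potentials.size(0) % 2 == 1:                         # carry the last matrix over
            tail, potentials = potentials[-1:], potentials[:-1]
        # multiply adjacent pairs in parallel on the GPU: (AB), (CD), ...
        potentials = log_matmul(potentials[0::2], potentials[1::2])
        if tail is not None:
            potentials = torch.cat([potentials, tail], dim=0)
    return torch.logsumexp(potentials[0], dim=(0, 1))           # log Z
```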

Character Embeddings Initialization
We describe a trick for initializing Chinese character embeddings. The most striking difference between Chinese and English is that the minimal semantic units, i.e., sememes, of Chinese are characters instead of words or subwords. The character vocabulary of Chinese, e.g., around 2,000 characters on the OntoNotes 5.0 Chinese dataset, is markedly larger than that of English, e.g., around 100 characters on the OntoNotes 5.0 English dataset. Existing models (Zhang and Yang, 2018; Li et al., 2020a) generally focus on introducing additional pre-trained character embeddings on top of lexicon embeddings and attempt to selectively leverage information from both according to different word segmentation schemes. However, we notice that most of these characters already exist in the word vocabulary as single-character words. We therefore employ a randomly initialized orthogonal matrix 2 to project the pre-trained word embeddings into the same dimension as the character embeddings, and use these projected embeddings for initialization.
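A sketch of this initialization is shown below, assuming hypothetical containers for the character vocabulary and the pre-trained word vectors; the function name and the fallback initialization are our own choices.

```python
import torch
import torch.nn as nn

def init_char_embeddings(char_vocab, word_vectors, d_char=30):
    """Project pre-trained word vectors of single-character words into the
    character-embedding space with a random orthogonal matrix."""
    # char_vocab: dict mapping a character to its index (hypothetical)
    # word_vectors: dict mapping a word to its pre-trained vector (hypothetical)
    d_w = len(next(iter(word_vectors.values())))
    proj = torch.empty(d_w, d_char)
    nn.init.orthogonal_(proj)                                   # randomly initialized orthogonal matrix
    emb = torch.empty(len(char_vocab), d_char)
    nn.init.normal_(emb, std=0.1)                               # fallback for characters without a word vector
    for ch, idx in char_vocab.items():
        if ch in word_vectors:                                  # character exists as a single-character word
            emb[idx] = torch.as_tensor(word_vectors[ch]) @ proj
    return emb                                                  # used to initialize nn.Embedding.weight
```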

Datasets
We conduct experiments on the CoNLL 2003 (Tjong Kim Sang and De Meulder, 2003) and the OntoNotes 5.0 (Weischedel et al., 2013) datasets for English NER, and on the OntoNotes 5.0 and the OntoNotes 4.0 datasets for Chinese NER experiments. We also conduct experiments on the Wall Street Journal (WSJ) dataset (Marcus et al., 1993) and the Universal Dependencies (UD) v2.2 English dataset for POS tagging experiments.
The only data pre-processing we perform is replacing tokens consisting of digits with a special token, and we convert labels to the IOBES labeling scheme (Ramshaw and Marcus, 1995; Ratinov and Roth, 2009) for the NER datasets. The dataset statistics are provided in Table 1.

Hyper-parameter Settings
Following Cui and Zhang (2019) and Jie and Lu (2019), 100-dimensional GloVe (Pennington et al., 2014) word embeddings are utilized for all English experiments, and 300-dimensional fastText (Mikolov et al., 2018) word embeddings are employed for Chinese experiments. The dimension of character embeddings is 30, and the hidden state dimension $d_c$ of the character-level bidirectional LSTM is 100, i.e., 50 in each direction. We apply dropout (Srivastava et al., 2014) to token representations with a rate of 0.5.
For the encoding and refinement layers, the hidden state dimension $d_h$ of the bidirectional LSTMs is 600, i.e., 300 in each direction. We apply dropout with a rate of 0.5 to the hidden states $h^{(l)}_t$ before feeding them into the refinement layers. The number of refinement layers $L$ is set to 1. We optimize our model with stochastic gradient descent (SGD) using a decaying learning rate $\eta_\tau = \eta_0 / (1 + 0.075 \cdot \tau)$, where $\tau$ is the index of the current epoch; the initial learning rate $\eta_0$ is 0.05 for Chinese experiments without contextual word representations and 0.1 for all other experiments. The weight decay rate is $10^{-8}$, the momentum is 0.15, the batch size is 10, the number of epochs is 100, and gradients with norms exceeding 5 are clipped.
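For reference, the schedule $\eta_\tau = \eta_0 / (1 + 0.075 \cdot \tau)$ can be realized, for instance, with PyTorch's LambdaLR scheduler; the snippet below is only an illustrative sketch, with a placeholder model standing in for the full network.

```python
import torch

model = torch.nn.Linear(10, 10)   # placeholder model; stands in for the full network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.15, weight_decay=1e-8)
# LambdaLR multiplies the initial lr by the returned factor: eta_tau = eta_0 / (1 + 0.075 * tau)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: 1.0 / (1 + 0.075 * epoch))

for epoch in range(100):
    # ... run one training epoch, clipping gradient norms at 5, e.g.:
    # torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    scheduler.step()                                            # decay after each epoch
```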
In addition, since pre-trained contextualized word embeddings are widely accepted as a new fundamental utility of natural language processing, we also conduct experiments with ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019). In these settings, tokens are represented as $x_t = [w_t, c_t, e_t]$, where $e_t$ is the contextual word representation.
The ELMo vectors are obtained by averaging the output vectors over all layers of ELMo. For English experiments, we use the original checkpoint, and for Chinese experiments we use the checkpoints provided by Che et al. (2018).
The BERT representations are the averages of all BERT subword embeddings over the last four layers. Following Li et al. (2020b) and Li et al. (2020a), we utilize the bert-large-cased and hfl/chinese-bert-wwm checkpoints for English and Chinese experiments, respectively.

Evaluation
NER experiments are evaluated with F1 scores, and POS tagging experiments are evaluated with accuracy scores. All of our experiments were run 4 times with different random seeds, and the averaged scores are reported in the following tables. Our models 3 are implemented with the deep learning framework PyTorch (Paszke et al., 2019), and we ran experiments on a GeForce GTX 1080Ti with 11 GB memory.

Table 2 compares the performance of our proposed method and the baselines on the OntoNotes 5.0 English dataset. Our model significantly outperforms Cui and Zhang (2019) and Jie and Lu (2019) by 0.49 and 0.13 F1 points, respectively. These results demonstrate that our model can filter irrelevant information more effectively than Cui and Zhang (2019). Notably, the model of Jie and Lu (2019) relies on external dependency annotations, whereas our model requires no external knowledge 4. When employing ELMo, our model outperforms Jie and Lu (2019) by 0.11 F1 points.

Named Entity Recognition
On the CoNLL 2003 English dataset, our model performs worse than these baseline models, but, with ELMo, it outperforms Jie and Lu (2019) and

Discussion
Influence of Weight Tying The major difference between our method and Kim et al. (2017) is that we use only observed labels, while they employ unobserved labels as latent variables. In the actual implementation, this difference is reflected in whether the label embeddings and the transition matrix of the CRF are shared across layers. Intuitively, completely relying on unobserved variables would implicitly perform clustering in the latent representation space, which might introduce noise. Besides, the state transitions in different layers may obey different dynamics, so sharing the transition matrix across layers might affect performance.
We conducted experiments on the OntoNotes 5.0 English dataset to compare the performance of all the above-mentioned settings, as reported in Table 7. Notably, our model, with both the label embeddings $\{v_{y_j}\}_{j=1}^{m}$ and the transition matrix $A$ shared, surpasses all separated models. These results support our claim that tying the weights of the label embeddings and the label transition matrix can indeed leverage annotation information and is thus better than completely relying on unobserved variables. Besides, we did not notice significant performance changes when varying the number of labels $|\mathcal{Y}|$. Furthermore, the number of parameters of our shared model is, in fact, the smallest, even compared to LAN (about 10.0 million parameters).

Influence of Connection Mechanism Table 8 shows the comparison on the OntoNotes 5.0 English dataset, measuring the influence of the two connection mechanisms. We find that the residual connection works better than the concatenation connection, which might be because the residual connection makes training smoother by preventing a chaotic loss surface (Li et al., 2018).

Table 8: Influence of the connection mechanism (F1 on the OntoNotes 5.0 English dataset): Concatenation 88.54; Residual 88.65.

Influence of Parameter Size As shown in Table 9, we did not observe a performance increase as the number of refinement layers grows. Therefore, we claim that one refinement layer is enough for our model, while Cui and Zhang (2019) need two refinement layers. Our hypothesis is that the long-term dependency modeling capacity of the first-order CRF is relatively limited, and we leave the use of higher-order CRFs as future work.

Table 9: Influence of parameter size, where $L$ is the number of refinement layers, $d_h$ is the hidden state dimension of the bidirectional LSTM, and $|\theta|$ represents the number of parameters.
Training and Decoding Speeds We report the training and decoding speeds on the OntoNotes 5.0 English dataset. We demonstrate the efficiency of our CRF implementation tricks by comparing them with a widely used library, pytorch-crf 5. According to Table 10, our CRF implementation tricks remarkably accelerate both training and decoding. In particular, with our CRF implementation, our computation-intensive model even achieves a higher training speed than BiLSTM-CRF with pytorch-crf. We therefore claim that the efficiency of our model is acceptable.

Related Work
Early-stage research on NER and POS tagging focused on introducing rich features; for example, Yang et al. (2016) conducted experiments on the influence of discrete manual features, and Chiu and Nichols (2016) and Ma and Hovy (2016) introduced morphological features by employing convolutional networks to encode character-level features, while Lample et al. (2016) chose a bidirectional LSTM.
Other research aimed at leveraging syntactic information: Li et al. (2017) and Jie and Lu (2019) proposed to run external parsers first and directly encode the resulting syntactic information. Further work attempted to infer dependency relations among words implicitly; for example, Strubell et al. (2017) introduced iterated dilated convolutional networks as an alternative to BiLSTMs, and Zhang et al. (2018) and Liu et al. (2019b) designed encoders that maintain and update global representations along with local token representations.
Recently, Li et al. (2020b) unified flat and nested NER by formulating them as a machine reading comprehension task. Yu et al. (2020) proposed to enumerate all possible spans and to utilize a biaffine classifier to assign category labels to them.
Besides, the widespread use of contextual word representations, e.g., ELMo (Peters et al., 2018), Flair (Akbik et al., 2018), and BERT (Devlin et al., 2019), greatly improves the performance of NER models and they are accepted as new fundamental techniques of natural language processing.
Intuitively speaking, the refinement mechanism provides models with additional chances to revise previous decisions. In existing work, this method has been successfully applied to various tasks, e.g., text classification, sequential labeling (Cui and Zhang, 2019; Lyu et al., 2019), machine translation, and question answering (Nema et al., 2019). Our work is not the first attempt at introducing a refinement mechanism to sequential labeling tasks: Cui and Zhang (2019) relied on locally normalized attention to softly refine hidden representations layer by layer, while Liu et al. (2019a) chose to discretely filter out target-irrelevant semantic aspects, which can thus be considered a hard refinement mechanism.

Conclusion
Motivated by structured attention, we enhanced the previous refinement mechanism by replacing the locally normalized attention with our globally normalized attention. Experimental results on various tasks and datasets demonstrate that structured refinement is capable of filtering out target-irrelevant information by capturing long-term dependency relations. Besides, we remarkably accelerated training and decoding through two implementation tricks for the CRF, and obtained further performance improvements with an initialization trick for Chinese character embeddings. We leave the use of higher-order CRFs as future work.