Subsequence Based Deep Active Learning for Named Entity Recognition

Active Learning (AL) has been successfully applied to Deep Learning in order to drastically reduce the amount of data required to achieve high performance. Previous works have shown that lightweight architectures for Named Entity Recognition (NER) can achieve optimal performance with only 25% of the original training data. However, these methods do not exploit the sequential nature of language and the heterogeneity of uncertainty within each instance, requiring the labelling of whole sentences. Additionally, this standard method requires that the annotator has access to the full sentence when labelling. In this work, we overcome these limitations by allowing the AL algorithm to query subsequences within sentences, and propagate their labels to other sentences. We achieve highly efficient results on OntoNotes 5.0, only requiring 13% of the original training data, and CoNLL 2003, requiring only 27%. This is an improvement of 39% and 37% compared to querying full sentences.


Introduction
The availability of large datasets has been key to the success of deep learning in Natural Language Processing (NLP). This has galvanized the creation of larger datasets in order to train larger deep learning models. However, creating high quality datasets is expensive due to the sparsity of natural language, our inability to label it efficiently compared to other forms of data, and the amount of prior knowledge required to solve certain annotation tasks. This problem has motivated the development of new Active Learning (AL) strategies, which aim to efficiently train models by automatically identifying the best training examples from large amounts of unlabeled data (Wei et al., 2015; Wang et al., 2017; Tong and Koller, 2002). This tremendously reduces human annotation effort, as far fewer instances need to be labeled manually. Code is made available at: https://github.com/puria-radmard/RFL-SBDALNER
To minimise the amount of data needed to train a model, AL algorithms iterate between training a model and querying information rich instances from a pool of unlabelled data for human annotation (Huang et al., 2014). This has been shown to work well when queries are 'atomic': a single annotation requires a unit of labour and describes the instance to be annotated entirely. Conversely, each instance of structured data, such as a sequence, requires multiple annotations. Hence, such query selection methods can result in a waste of annotation budget (Settles, 2011).
For example, in Named Entity Recognition (NER), each sentence is usually considered an instance. However, because each token has a separate label, annotation budgeting is typically done on a token basis (Shen et al., 2017). Budget wasting may therefore arise from the heterogeneity of uncertainty across each sentence; a sentence can contain multiple subsequences (of tokens) of which the model is certain on some and uncertain on others. By making the selection at a sentence level, although some budget is spent on annotating uncertain subsequences, the remaining budget may be wasted on annotating subsequences for which an annotation is not needed.
It can therefore be desirable for annotators to label subsequences rather than the full sentences. This gives a greater flexibility to AL strategies to locate information rich parts of the input with improved efficiency -and reduces the cognitive demands required of annotators. Annotators may in fact perform better if they are asked to annotate shorter sequences, because longer sentences can cause boredom, fatigue, and inaccuracies (Rzeszotarski et al., 2013).
In this work, we aim to improve upon the efficiency of AL for NER by querying for subsequences within each sentence, and propagating labels to unseen, identical subsequences in the dataset. This strategy simulates a setup in which annotators are presented with these subsequences, and do not have access to the full context, ensuring that their focus is centred on the tokens of interest.
We show that AL algorithms for NER tasks that use subsequences, allowing training on partially labelled sentences, are more budget-efficient than those that only query full sentences. This improvement is furthered by generalising existing acquisition functions (§ 4.1) for use with sequential data. We test our approaches on two NER datasets, OntoNotes 5.0 and CoNLL 2003. On OntoNotes 5.0, Shen et al. (2017) achieve state-of-the-art performance with 25% of the original dataset querying full sentences, while we require only 13% of the dataset querying subsequences. On CoNLL 2003, we show that the AL strategy of Shen et al. (2017) requires 50% of the dataset to achieve the same results as training on the full dataset, while ours requires only 27%.
Contributions of this paper are: 1. Improving the efficiency of AL for NER by allowing querying of subsequences over full sentences; 2. An entity based analysis demonstrating that subsequence querying AL strategies tend to query more relevant tokens (i.e., tokens belonging to entities); 3. An uncertainty analysis of the queries made by both full sentence and subsequence querying methods, demonstrating that querying full sentences leads to selecting more tokens to which the model is already certain.

Related Work
AL algorithms aim to query information rich data points to annotators in order to improve the performance of the model in a data efficient way. Traditionally, these algorithms choose data points which lie close to decision boundaries (Pinsler et al., 2019), where uncertainty is high, so that the model learns more useful information. These measures of uncertainty, computed through acquisition functions, are therefore vital to AL. Key functions include predictive entropy (MaxEnt) (Gal et al., 2017), mutual information between model posterior and predictions (BALD) (Houlsby et al., 2011; Gal et al., 2017), and the certainty of the model when making label predictions (here called LC) (Mingkun Li and Sethi, 2006). These techniques ensure all instances used for training, painstakingly labelled by experts, have maximum impact on model performance. There has been exploration of uncertainty and deep learning based AL for NER (Chen et al., 2015; Shen et al., 2017; Settles and Craven, 2008; Fang et al., 2017). These approaches, however, treat each sentence as a single query instead of a collection of individually labelled tokens. In these methods, the acquisition functions that score sentences aggregate token-wise scores (through summation or averaging).
Other works forgo this aggregation, querying single tokens at a time (Tomanek and Hahn, 2009; Wanvarie et al., 2011; Marcheggiani and Artières, 2014). These works show that AL for NER can be improved by taking the single token as the unit query, using semi-supervision (Reddy et al., 2018; Iscen et al., 2019) to train on partially labelled sentences (Muslea et al., 2002). However, querying single tokens is inapplicable in practice because either a) annotators have access to the full sentence when queried but can only label one token, which would lead to frustration as they must read the full sentence but annotate only a single token, or b) annotators only have access to the token of interest, which means they would not have enough information to label tokens differently based on their context, leading to annotators labelling any unique token with the same label. Moreover, if the latter approach were somehow possible, we would be able to reduce the annotation effort to the annotation of only the unique tokens forming the dataset: its dictionary. Furthermore, all of these past works use Conditional Random Fields (CRFs) (Lafferty et al., 2001), which have since been surpassed as the state-of-the-art for NER (and most NLP tasks) by deep learning models (Devlin et al., 2019).
In this work we follow the approach where annotators only have access to subsequences of multiple tokens. However, instead of making use of single tokens, we query more than one token at a time, providing enough context to the annotators. This also allows the propagation of these annotations to identical subsequences in the dataset, further reducing the total annotation effort.

Background

Active Learning Algorithms
Most AL strategies are based on a repeating score, query, and fine-tune cycle. After initially training an NER model with a small pool of labelled examples, the following is repeated: (1) score all unlabelled instances, (2) query the highest scoring instances and add them to the training set, and (3) fine-tune the model using the updated training set (Huang et al., 2014).
To describe this further, notation and the proposed training process are introduced, with details in the following sections. First, the sequence tagging dataset, denoted $D = \{(x^{(n)}, y^{(n)})\}_{n=1}^{N}$, consists of a collection of sentences and ground truth labels. The $i$-th token of the $n$-th sentence, $x^{(n)}_i$, has a label $y^{(n)}_i = c$ with $c$ belonging to $C = \{c_1, \ldots, c_K\}$. We also differentiate between the labelled and unlabelled datasets, $D_L$ and $D_U$, which are initially empty and equal to $D$, respectively. Finally, we fix $A$ as the total number of tokens queried in each iteration.

Acquisition Functions
Instances in the unlabelled pool are queried using an acquisition function. This function aims to quantify the uncertainty of the model when generating predictive probabilities over possible labels for each instance. Instances with the highest predictive uncertainty are deemed the most informative for model training. Previously used acquisition functions such as Least Confidence (LC) and Maximum Normalized Log-Probability (MNLP) (Shen et al., 2017; Chen et al., 2015) are generalised for variable length sequences. Letting $\hat{y}^{(n)}_{<i}$ be the history of predictions prior to the $i$-th input, the next output probability will be $p(y^{(n)}_i \mid \hat{y}^{(n)}_{<i}, x^{(n)})$. Then, we define the token-wise LC score as:

$$\phi^{(n)}_i = -\log p(\hat{y}^{(n)}_i \mid \hat{y}^{(n)}_{<i}, x^{(n)}) \qquad (1)$$

The LC acquisition function for a sequence of length $\ell$ is then defined as:

$$\mathrm{LC}(x^{(n)}) = \sum_{i=1}^{\ell} \phi^{(n)}_i \qquad (2)$$

and, for MNLP, as:

$$\mathrm{MNLP}(x^{(n)}) = \frac{1}{\ell} \sum_{i=1}^{\ell} \phi^{(n)}_i \qquad (3)$$

Note that this is similar to LC except for the normalization factor $1/\ell$. The formulation above can be applied to other types of commonly used acquisition functions, such as Maximum Entropy (MaxEnt) (Gal et al., 2017), by simply defining:

$$\phi^{(n)}_i = -\sum_{c \in C} p(y^{(n)}_i = c \mid \hat{y}^{(n)}_{<i}, x^{(n)}) \log p(y^{(n)}_i = c \mid \hat{y}^{(n)}_{<i}, x^{(n)}) \qquad (4)$$

as the token score. Given the task of quantifying uncertainty amongst the unlabelled pool of data, both of these metrics, LC and MaxEnt, provide intuitive interpretations. Eq. (1) scores highly tokens for which the predicted label has lowest confidence, while eq. (4) scores highly tokens for which the whole probability mass function has higher entropy. Both therefore assign higher scores to more uniform predictive distributions, which indicate underlying uncertainty. Finally, given the similarity in performance between MNLP and Bayesian Active Learning by Disagreement (BALD) (Houlsby et al., 2011) on NER tasks (Shen et al., 2017), and the computational complexity required to calculate BALD relative to the other acquisition functions, we do not compare against BALD.
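As a concrete sketch of these token-wise scores, the following NumPy helpers compute the LC score of eq. (1) and the MaxEnt score of eq. (4) from a matrix of predictive probabilities (function names and the example array are illustrative, not from the paper):

```python
import numpy as np

def lc_token_scores(probs):
    """Token-wise LC score (eq. 1): negative log-probability of the
    greedily predicted label at each position.
    `probs` has shape (sequence_length, num_classes)."""
    return -np.log(probs.max(axis=-1))

def maxent_token_scores(probs):
    """Token-wise MaxEnt score (eq. 4): entropy of the full predictive
    distribution at each position."""
    return -(probs * np.log(probs)).sum(axis=-1)

# A near-uniform predictive distribution scores higher than a peaked one
# under both metrics.
probs = np.array([[0.90, 0.05, 0.05],   # confident token
                  [0.34, 0.33, 0.33]])  # uncertain token
```

Under both scores, the second (near-uniform) token receives a strictly higher value than the first, matching the intuition that uniform predictive distributions indicate uncertainty.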

Subsequence Acquisition
In this section we describe how we build on past works, and the core contribution of this paper. Our work forms a more flexible AL algorithm that operates on subsequences, as opposed to full sentences (Shen et al., 2017). This is achieved by generalising acquisition functions for subsequences (§ 4.1), scoring and querying subsequences within sentences (§ 4.2), and performing label propagation to unseen sentences to avoid multiple annotations of repeated subsequences (§ 4.3).

Subsequence Acquisition Functions
Since this work focuses on the querying of subsequences, we generalise the previously defined LC and MNLP into a family of acquisition functions applicable to both full sentences and subsequences. For a (sub)sequence $s$ of length $\ell$, with token-wise scores $\phi_i$ as in eq. (1):

$$\mathrm{LC}_{\alpha}(s) = \frac{1}{\ell^{\alpha}} \sum_{i \in s} \phi_i \qquad (5)$$

Special cases are $\alpha = 0$ and $\alpha = 1$, which return the original definitions of LC in eq. (2) and MNLP in eq. (3). As noted by Shen et al. (2017), LC for sequences biases acquisition towards longer sentences. The tuneable normalisation factor in eq. (5) over the sequence of scores mediates the balance of shorter and longer subsequences selected. This generalisation can be applied to other types of commonly used acquisition functions, such as MaxEnt and BALD, by modifying the token-wise score.
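The LC_α family of eq. (5) reduces to a one-line aggregation over token-wise scores. A minimal sketch (the function name and inputs are illustrative):

```python
import numpy as np

def lc_alpha(token_scores, alpha):
    """LC_alpha (eq. 5): sum of token-wise scores, normalised by the
    (sub)sequence length raised to the power alpha. alpha = 0 recovers
    LC (eq. 2), alpha = 1 recovers MNLP (eq. 3)."""
    token_scores = np.asarray(token_scores, dtype=float)
    return float(token_scores.sum() / len(token_scores) ** alpha)
```

With token scores [1.0, 2.0, 3.0], α = 0 gives the unnormalised sum 6.0 (LC), while α = 1 gives the mean 2.0 (MNLP); intermediate values of α interpolate between the two length biases.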

Subsequence Selection
Each sentence $x^{(n)}$ can be broken into a set of subsequences $S^{(n)}$, where all elements $s \in S^{(n)}$ can be efficiently scored by first computing the token scores, then aggregating as required. Once this has been done for all sentences in $D_U$, a query set $S_Q \subset \cup_n S^{(n)}$ of non-overlapping (mutually disjoint) subsequences is found. The requirement of non-overlapping subsequences avoids the problem of relabelling tokens, but disallows simply choosing the highest scoring subsequences (since these can overlap). Instead, at each round of querying, we perform a greedy selection, repeatedly choosing the highest scoring subsequence that does not overlap with previously selected subsequences. Adjustments can be made to reflect practical needs, such as restricting the length of viable subsequences to $[\ell_{\min}, \ell_{\max}]$. This is because longer subsequences are easier to label, while shorter subsequences are more efficient in querying uncertain tokens, and so the selection is only allowed to operate within these bounds.
Additionally, it is easy to imagine a scenario in which a greedy selection method does not select the maximum total score that can be generated from a sentence. This scenario is illustrated in Table 1, where lengths are restricted to $\ell_{\min} = \ell_{\max} = 3$ for simplicity. Note that tokens can become unselectable in future rounds if they are not inside a span of unlabelled tokens of at least size $\ell_{\min}$. When the algorithm has queried all subsequences in this size range, it starts to query shorter subsequences by relaxing the length constraint. In practice, however, model performance on the validation set converges before all subsequences of valid length have been exhausted. Nonetheless, when choosing subsequences of size $[\ell_{\min}, \ell_{\max}] = [4, 7]$, these are exhausted only when roughly 90% and 80% of tokens have been labelled for the OntoNotes 5.0 and CoNLL 2003 datasets respectively.
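The greedy non-overlapping selection described above can be sketched as follows, assuming each candidate is a (sentence id, start, end) span with a precomputed LC_α score (all names and the data layout are illustrative):

```python
def greedy_select(scored_spans, budget):
    """Greedy non-overlapping subsequence selection: repeatedly take the
    highest-scoring span whose tokens are all still unclaimed, until the
    token budget A is spent. `scored_spans` is a list of
    ((sent_id, start, end), score) tuples with `end` exclusive."""
    taken = set()            # (sent_id, position) pairs already claimed
    selected, spent = [], 0
    for (sent_id, start, end), score in sorted(scored_spans,
                                               key=lambda t: -t[1]):
        positions = {(sent_id, i) for i in range(start, end)}
        if positions & taken or spent + len(positions) > budget:
            continue         # overlaps a chosen span, or over budget
        taken |= positions
        selected.append((sent_id, start, end))
        spent += end - start
    return selected
```

For example, given spans (0, 0, 3) with score 5.0, (0, 2, 5) with 4.0, and (0, 5, 8) with 3.0 and a budget of 6 tokens, the middle span is skipped because it overlaps the top-scoring one, illustrating how tokens can become 'trapped'.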

Subsequence Label Propagation
Since a subsequence querying algorithm can result in partially labelled sentences, it raises the question of how unlabelled tokens should be handled. In previous work based on the use of CRFs (Tomanek and Hahn, 2009; Wanvarie et al., 2011; Marcheggiani and Artières, 2014), this was solved by using semi-supervision on tokens for which the model showed low uncertainty. However, for neural networks, the use of model generated labels could lead to the model becoming over-confident, harming performance and biasing uncertainty scores (Arazo et al., 2020). Hence, we ensure that backpropagation only occurs from labelled tokens.
Our final contribution to the AL algorithm is the use of another semi-supervision strategy where we propagate uniquely labelled subsequences in order to minimise the number of annotations needed. When queried for a subsequence, the annotator (in this case an oracle) is not given the contextual tokens in the remainder of the sentence. For this reason, given an identical subsequence, a consistent annotator will provide the same labels. Therefore, the proposed algorithm maintains a dictionary that maps previously queried subsequences to their provided labels. Once a queried subsequence and its label are added to the dictionary, all other matching subsequences in the unlabelled pool are given the same, but temporary, labels.
The tokens retain these temporary labels until they are queried themselves. After scoring and ranking members of S, the algorithm disregards sequences that exactly match members of this dictionary, which is updated during the querying round. However, if tokens belonging to these previously seen subsequences are encountered in a different context, meaning as part of a different subsequence, they may also be queried. For example, in Table 1, if the subsequence "shop to buy" had been previously queried elsewhere in the dataset, the red subsequence will not be considered for querying, as it retains its temporary labels. Instead, the green subsequence could be queried, in which case the temporary labels of tokens 6 and 7 will be overwritten by new, permanent labels.
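A minimal sketch of the propagation step, assuming sentences are stored as lists of (token, label) pairs with `None` for unlabelled tokens; for simplicity this sketch does not distinguish temporary from permanent labels, and never overwrites an existing label:

```python
def propagate_labels(query_tokens, oracle_labels, unlabeled_pool, dictionary):
    """Record the queried subsequence in the dictionary B, then give every
    identical unlabeled subsequence in the pool the same (temporary)
    labels. `dictionary` maps token tuples to label tuples."""
    key = tuple(query_tokens)
    dictionary[key] = tuple(oracle_labels)
    k = len(key)
    for sentence in unlabeled_pool:
        tokens = [tok for tok, _ in sentence]
        for start in range(len(tokens) - k + 1):
            if tuple(tokens[start:start + k]) == key:
                for offset, label in enumerate(dictionary[key]):
                    tok, existing = sentence[start + offset]
                    if existing is None:   # only fill unlabelled positions
                        sentence[start + offset] = (tok, label)
    return dictionary
```

In a fuller implementation, labels written here would be flagged as temporary so that a later direct query can overwrite them with permanent annotations, as described above.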
Therefore, the value of $\ell_{\min}$ becomes a trade-off between the improved resolution of the acquisition function and the erroneous propagation of shorter, more frequent label subsequences to identical ones in different contexts.

Table 1: Subsequences from a sentence using $\ell_{\min} = \ell_{\max} = 3$, $\alpha = 1$. Besides the token index $j$, the top three rows show the tokens, labels, and token-wise scores. If $y^{(n)}_j = \mathrm{X}$, the corresponding token is unlabelled, hence its score is considered when selecting the next query. Below these, the subsequences constituting $S^{(n)}$ are displayed with their LC$_1$ scores. In this case "shop to buy" will be chosen, since it maximises LC$_1$, but 'traps' its surrounding tokens until $\ell_{\min}$ is lowered to 2 and "shoes ." may be considered.

Subsequence Active Learning Algorithm
Finally, we summarise the proposed AL algorithm. Given a set of unlabelled data $D_U$, we initially randomly select a proportion of sentences from $D_U$, label them, and add these to $D_L$. A dictionary B is also initialised. Using these labelled sentences we train a model. Then, the following training cycle is repeated until $D_U$ is empty (or an early stopping condition is reached): 1. Find all consecutive unlabelled subsequences in $D_U$, and score them using a pre-defined acquisition function.
2. Select the top scoring non-overlapping subsequences S Q that do not appear in B, such that the number of tokens in S Q is A, and query them to the annotators. Update D L and D U . As each sequence is selected, add it to B, mapping it to its true labels.
3. Provide all occurrences of the keys of B in D U with their corresponding temporary labels. These will not be included in D L as these are temporary.
4. Finetune the model on sentences with any label, temporary and permanent.
Repeat this process until convergence.
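Step 1 of the cycle, finding every consecutive unlabelled subsequence of valid length, can be sketched as follows (the data layout and names are illustrative):

```python
def enumerate_spans(pool, min_len, max_len):
    """Enumerate all candidate subsequences: spans of length in
    [min_len, max_len] whose tokens are all still unlabelled.
    `pool` is a list of sentences, each a list of (token, label) pairs
    with label None for unlabelled tokens; spans are (sent_id, start, end)
    with `end` exclusive."""
    spans = []
    for sid, sentence in enumerate(pool):
        n = len(sentence)
        for start in range(n):
            for length in range(min_len, max_len + 1):
                end = start + length
                if end > n:
                    break
                if all(lab is None for _, lab in sentence[start:end]):
                    spans.append((sid, start, end))
    return spans
```

Note how a labelled token in the middle of a sentence splits the remaining unlabelled tokens into separate runs, which is exactly how tokens become 'trapped' when a run is shorter than the minimum length.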
Experimental Setup

Datasets
As in previous works (Shen et al., 2017), we use the two following NER datasets: OntoNotes 5.0. This dataset is used to compare results with the full sentence querying baseline (Weischedel et al., 2013), and comprises text from news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, and talk shows. It is a BIO formatted dataset with a total of $K = 37$ classes and 99,333 training sentences, with an average sentence length of 17.8 tokens in its training set.
CoNLL 2003. This is a dataset, also in BIO format, with only 4 entity types (LOC, MISC, PER, ORG) resulting in K = 9 labels (Tjong Kim Sang and De Meulder, 2003). This dataset is made from a collection of news wire articles from the Reuters Corpus (Lewis et al., 2004). The average sentence length is 12.6 tokens in its training set. A full list of class types and entity lengths and frequencies for both datasets can be found in the Appendix.

NER Model
Following the work of Shen et al. (2017), a CNN-CNN-LSTM model with combined letter- and token-level embeddings was used; see the Appendix for an overview of the model, hyperparameter settings, and validation. Furthermore, the AL algorithm used in Shen et al. (2017), following the same procedure, serves as one of the baselines. This represents an algorithm equivalent to the one proposed, but which can only query full sentences and does not use label propagation.

Model Training and Evaluation
As the evaluation measure we use the F1 score. After the first round of random subsequence selection, the model is trained. After subsequent selections the model is finetuned: training is resumed from the previous round's parameters. In all cases, model training was stopped either after 30 epochs were completed, or if the F1 score on the validation set had monotonically decreased for 2 epochs. This validation set is made up of a randomly selected 1% of sentences of the original training set.
After finetuning, the model reloads its parameters from the round-optimal epoch, and its performance is evaluated on the test set. Furthermore, the AL algorithms were also stopped after all hyperparameter variations using that dataset and acquisition function family had converged to the same best F1 score, which we denote $F_1^*$. For the OntoNotes 5.0 dataset, the $F_1^*$ value was achieved after 30% of the training set was labelled, and for the CoNLL 2003 dataset after 40%.

Active Learning Setup & Evaluation
We choose $\ell_{\min} = 4$ to give a realistic context to the annotator, and to avoid significant propagation of common subsequences. The upper bound of $\ell_{\max} = 7$ was chosen to ensure subsequences were properly utilised, since the average sentence length of both datasets is roughly twice this size. For the OntoNotes 5.0 dataset, $A$ = 10,000 tokens are queried every round, whereas for the CoNLL 2003 dataset $A$ = 2,000 tokens. These represent roughly 0.5% and 1% of the available training set respectively.
We evaluate the efficacy and efficiency of the tested AL strategies in three ways. First, model performance over the course of the algorithm was evaluated using the end-of-round F1 score on the test set. We compare the proportion of the dataset's tokens labelled when the model achieves 99% of the $F_1^*$ score (i.e. $0.99 \times F_1^*$). We also quantify the rate of improvement of model performance during training using the normalised Area Under the Curve (AUC) score of each F1 test curve. The normalisation ensures that the resulting AUC score is in the range [0, 1], and is achieved by dividing the AUC score by the size of the dataset. This implies that methods that converge faster to their best performance will have a higher normalised AUC. Second, we consider how quickly the algorithms can locate and query relevant tokens (named entities). Third, we evaluate their ability to extract the most uncertain tokens from the unlabelled pool.
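The normalised AUC metric can be sketched as follows, assuming the x-axis is the fraction of tokens labelled, so that dividing the trapezoidal area by the budget range covered keeps the score in [0, 1] (the function name is illustrative):

```python
def normalised_auc(frac_labelled, f1_scores):
    """Normalised AUC of an F1-vs-budget curve: trapezoidal area divided
    by the budget range covered, so the result lies in [0, 1] and
    faster-converging AL methods score higher."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(zip(frac_labelled, f1_scores),
                                  zip(frac_labelled[1:], f1_scores[1:])):
        area += (y0 + y1) / 2.0 * (x1 - x0)  # trapezoid between points
    return area / (frac_labelled[-1] - frac_labelled[0])
```

A method that is at its best F1 from the start scores 1.0, while a linearly improving curve from 0 scores 0.5, so the metric rewards early convergence.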
Results & Discussion

Active Learning Performance

Figure 1 shows the LC$_\alpha$ performance curves for $\alpha = 0$, $\alpha = 1$, and the best performing value for each acquisition class (based on the normalised AUC score, Table 3) for full sentence querying (FS), and only the best performing $\alpha$ values for subsequence querying (SUB). The figure also shows the performance of training on the complete training set (No AL), and when both sentences and subsequences are randomly selected by the acquisition function. The equivalent figures for MaxEnt$_\alpha$ are available in the Appendix, and follow similar trends. The performance of each curve, quantified in terms of the normalised AUC, is summarised in Table 3. Table 2 shows further analysis of the best results in Figure 1, with 'best' referring to the acquisition function and optimal $\alpha$. These results first show that subsequence querying methods are more efficient than querying full sentences, achieving their final F1 with substantially less annotated data, and with higher normalised AUC scores. For OntoNotes 5.0, querying subsequences reduces the final proportion required by 38.8%. For CoNLL 2003, this reduction is 36.6%. Altogether, subsequence querying holds improved efficiency over the full sentence querying baseline.

Table 2: Best results from Figure 1, when the models are trained using 100% of the training set and with active learning (AL), using the best hyperparameter setting of the acquisition function for full sentence and subsequence querying, based on normalised AUC score.
As a point of interest, full sentence querying can be easily improved by optimising $\alpha$ alone. For the OntoNotes 5.0 dataset, using LC$_1$, 24.2% of tokens are required to achieve $F_1^*$. This, however, can be improved by 9.33%, to requiring only 22.0%, by choosing $\alpha = 0.7$. For CoNLL 2003, using LC$_1$ for full sentences, 50.0% of the dataset was required, but when using LC$_{0.7}$, only 40.7% of the tokens.

Entity Recall
This section and the next aim to understand some of the underlying mechanisms that allow the subsequence querying methods to achieve results substantially better than the full sentence baseline; namely, the ability of the different methods to extract the tokens the model is most uncertain about. Given that the majority of tokens in both datasets have the same label ("O", signifying no entity), it is likely that tokens belonging to entities, particularly rarer classes, trigger higher model uncertainty. Querying full sentences at a time, the AL algorithm spends much of its token budget for that round labelling non-entity tokens while attempting to locate the more informative entities. Subsequence querying methods, not faced with this wasteful behaviour, allow the AL algorithm to query entity tokens more quickly, locating and labelling the majority of entity tokens faster over the course of training.
The proportion of tokens belonging to entities that the AL algorithm has queried against the round number is plotted in Figure 2 for OntoNotes 5.0. For both datasets, the random querying methods contain a distribution of token classes that reflect the dataset at large, producing roughly linear curves for this figure. Curves for all methods that employ an uncertainty based acquisition function are concave, and the AUC reflects the ranking of model performance for each querying method. This relation suggests that shortly after initialisation, better performing algorithm variations query entity tokens faster. In later stages of finetuning this rate is reduced, likely because after labelling a large proportion of them, the remaining entity tokens cause little uncertainty for the model. In a practical setting where querying may have to be stopped before model performance has converged (i.e. due to accumulated cost of annotations), it is greatly beneficial to ensure that the model is exposed to a high number of relevant tokens, because this increases the likelihood of locating entity tokens belonging to underrepresented classes at an early stage.

Uncertainty Score Analysis
Finally, this section compares the scores of tokens in the queried set $S_Q$ for each querying method. Comparing the distribution and development of these scores provides direct insight into the core assumptions of why full sentence querying is outperformed. Figure 3 shows the difference in score distributions for sentence versus subsequence querying, against querying round number, for rounds preceding model performance convergence. First, it is seen that decreasing the individual query size (full sentence to subsequence) increases the median uncertainty extracted in the earlier rounds. Second, Figure 3 provides evidence for the mechanism suggested earlier: aggregating the token scores across full sentences means querying both the highly uncertain tokens and the tokens that provide little uncertainty. Querying high scoring sentences like this can cause a distribution with two peaks, as seen in the figure. As the model becomes increasingly certain about its predictions, high scores are localised within smaller subsequences, and the coarse sensitivity of full sentence querying means it forfeits the higher scoring tokens. These differences were also observed when comparing subsequence querying methods with sub-optimal $\alpha$. This figure only analyses behaviour up to the point where 9% of the training set's tokens have been queried. Instead, Figure 4 shows how the mean of token-wise scores evolves for different querying methods on the OntoNotes 5.0 dataset until convergence.

Table 3: Normalised AUC scores for model performance (F1 score on test set) for $\alpha = 0$, $\alpha = 1$, and its optimal value in each case. Each pair of differences between the optimised acquisition function for full sentences and subsequences (indicated by a †) is significantly different (two-sided unpaired t-test, p-value < 0.05).

Figure 3: Score distributions for the OntoNotes 5.0 dataset, made on the 1st, 5th, 10th, and 15th scoring rounds. This corresponds to scores after training on 1%, 3.2%, 6.1%, and 9.0% of the utilised training set.
This clearly shows that the scores queried by subsequence methods converge faster over the full course of the algorithm than those of full sentence querying. This is consistent with Figure 1 in terms of initial rate and final time of model performance convergence, namely that model performance plateaus alongside the uncertainty score.
Keeping track of query scores in this way could also be useful in industrial applications. When training on a very semantically specific corpus, there may not be enough fully labelled sentences to build a test set. In that case, observing the rate of score convergence can be used as an early stopping criterion for the AL algorithm (Zhu et al., 2010).

Conclusion & Future Work
In this study we have employed subsequence querying methods to improve the efficiency of AL for NER tasks. We have seen that these methods outperform full sentence querying in terms of the annotations required for optimal model performance, requiring 38.8% and 36.6% fewer tokens for the OntoNotes 5.0 and CoNLL 2003 datasets respectively. Optimal results for subsequence querying (and full sentence querying) were achieved by generalising previously used AL acquisition functions, defining a larger family of acquisition functions for sequential data.
The analysis of § 6.3 suggests that a full sentence querying causes noisy acquisition functions due to the tokens in the queried sentences that were not highly scored. This added noise reduces the budget efficiency, and a subsequence querying method eliminates a large part of this effect. This efficiency also translated into a faster recall of named entities in the dataset to be queried ( § 6.2).
Limitations and future work: Limitations of this study are largely centred on the use of an oracle to provide tokens with their labels. With human annotators, the cropped context of subsequence queries may produce more inaccuracies than annotating full sentences. We leave to future work the evaluation of these querying methods with human annotators. Such studies will help reveal how context affects label accuracy; how this, in turn, affects optimal hyperparameters in the subsequence selection process (such as optimal query length); what further accommodations must be made to effectively optimise worker efficiency; and how to deal with unreliable labels. There are also ways to incorporate model generated labelling methods for more robust semi-supervision into our framework, which we leave to future work. Finally, there are other tasks for structured data, such as audio, video, and image segmentation, where part of an instance may be queried. A generalisation of the strategy demonstrated for the NER case may allow for more efficient active learning querying methods for these other types of data.

A Model Architecture
The model architecture is built of three sections. First, a character-level convolutional neural network (CNN) (LeCun and Bengio, 1998) encoder extracts character-level features $w^{\mathrm{char}}_j$ for each token $x^{(i)}_j$ in a sentence. Then, a latent token embedding $w^{\mathrm{emb}}_j$ corresponding to that token is generated. The full representation of the token is the concatenation of the two vectors: $w^{\mathrm{full}}_j := (w^{\mathrm{char}}_j, w^{\mathrm{emb}}_j)$. The token-level embeddings $w^{\mathrm{emb}}$ are initialised using word2vec (Ling et al., 2015), and updated during training and finetuning, as per the baseline paper. A second, token-level CNN encoder is used to generate $\{h^{\mathrm{token}}_j\}_{j=1}^{\ell_i}$ from the token-level representations $\{w^{\mathrm{full}}_j\}_{j=1}^{\ell_i}$, where $\ell_i$ is the sentence length. The final token-level encoding is defined by another concatenation: $h^{\mathrm{Enc}}_j := (h^{\mathrm{token}}_j, w^{\mathrm{full}}_j)$. Finally, an LSTM tag decoder is used to generate the token-level probability mass functions over the $K$ possible token classes in $C$.

C Dataset Analysis
Here, we cluster similar labels in the BIO format, reducing the $K$ raw classes to $K^{(r)} = (K + 1)/2$ class groups $c^{(r)}_1, \ldots, c^{(r)}_{K^{(r)}}$. Therefore, $c^{(r)}_1$ corresponds exactly to $c_1$, the empty label, while $c^{(r)}_k$, $k > 1$, groups the raw labels $c_{2k-2}$ and $c_{2k-1}$. Figures 5 and 7 show the distribution of these class groups for the OntoNotes 5.0 and CoNLL 2003 datasets respectively. For the former, counts range from 199 tokens for the 'LANGUAGE' class to 46,698 tokens for the 'ORG' class. The full available training set totals 1,766,955 tokens in 99,333 sentences; this is partitioned into a train and validation set during experimentation. A further test set comprises 146,253 tokens in 8,057 sentences. The latter's training set contains 172,210 tokens in 13,689 sentences, and its test set has 42,141 tokens in 3,091 sentences.

Figure: F1 score on test set achieved each round (top) and against time (bottom) in each case, using round-optimal model parameters. All subsequence experiments here use $\ell_{\min} = 3$, $\ell_{\max} = 6$.
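The clustering of BIO labels into class groups amounts to stripping the B-/I- prefix, a sketch of which is (the function name is illustrative):

```python
def bio_group(label):
    """Collapse a BIO tag into its class group: "B-ORG" and "I-ORG"
    both map to "ORG", while "O" stays its own (empty) group, giving
    K_r = (K + 1) / 2 groups from K raw labels."""
    return label if label == "O" else label.split("-", 1)[1]
```

For example, for CoNLL 2003 this maps the K = 9 raw labels onto (9 + 1) / 2 = 5 groups: the empty label plus LOC, MISC, PER, and ORG.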