Larger-Context Tagging: When and Why Does It Work?

The development of neural networks and pretraining techniques has spawned many sentence-level tagging systems that achieve superior performance on typical benchmarks. However, a relatively less discussed topic is what happens when more context information is introduced into current top-scoring tagging systems. Although several existing works have attempted to shift tagging systems from sentence-level to document-level, there is still no consensus about when and why it works, which limits the applicability of the larger-context approach in tagging tasks. In this paper, instead of pursuing a state-of-the-art tagging system by architectural exploration, we focus on investigating when and why larger-context training, as a general strategy, can work. To this end, we conduct a thorough comparative study on four proposed aggregators for collecting context information and present an attribute-aided evaluation method to interpret the improvement brought by larger-context training. Experimentally, we set up a testbed based on four tagging tasks and thirteen datasets. Hopefully, our preliminary observations can deepen the understanding of larger-context training and inspire more follow-up work on the use of contextual information.


Introduction
The rapid development of deep neural models has led to impressive performance on sequence tagging tasks, which aim to assign a label to each token of an input sequence (Sang and De Meulder, 2003; Lample et al., 2016; Ma and Hovy, 2016). More recently, the use of unsupervised pre-trained models (Akbik et al., 2018, 2019; Peters et al., 2018; Devlin et al., 2018), especially contextualized ones, has driven state-of-the-art performance to a new level. Among these works, researchers frequently choose the sentence as the boundary for tagging tasks (i.e., sentence-level tagging) (Huang et al., 2015; Chiu and Nichols, 2015; Ma and Hovy, 2016; Lample et al., 2016). Undoubtedly, as an interim choice, the sentence-level setting has enabled numerous successful tagging systems; nevertheless, the task itself need not be defined at the sentence level, and this definition mainly serves to simplify the learning process for machine learning models. Naturally, it would be interesting to see what happens if larger-context information (e.g., information from neighboring sentences) is introduced to modern top-scoring systems, which have shown superior performance under the sentence-level setting. A small number of works have made seminal explorations in this direction: some show significant improvement from larger context (Luo et al., 2020; Xu et al., 2019), while others do not (Hu et al., 2019, 2020; Luo et al., 2018). Therefore, it is still unclear when and why larger-context training is beneficial for tagging tasks. In this paper, we try to figure it out by asking the following three research questions:
Q1: How do different ways of integrating larger-context information influence the system's performance? The rapid development of neural networks provides us with diverse flavors of neural components to aggregate larger-context information, which, for example, can be structured as a sequential topology by recurrent neural networks (RNNs) (Ma and Hovy, 2016; Lample et al., 2016) or as a graph topology by graph neural networks (Kipf and Welling, 2016; Schlichtkrull et al., 2018).
Understanding the discrepancies among these aggregators can help us reach a more general conclusion about the effectiveness of larger-context training. To this end, we study larger-context aggregators with three different structural priors (defined in Sec. 3.2) and comprehensively evaluate their efficacy.
arXiv:2104.04434v1 [cs.CL] 9 Apr 2021
Q2: Can larger-context training easily play to its strengths with the help of recently arising contextualized pre-trained models (Akbik et al., 2018, 2019; Peters et al., 2018; Devlin et al., 2018) (e.g., BERT)? The contextual modeling power of these pre-trained methods makes it worth looking at their effect on larger-context training. In this work, we take BERT as a case study and assess its effectiveness quantitatively and qualitatively.
Q3: If improvements can be observed, where does the gain come from, and how do different characteristics of datasets affect the amount of gain? Instead of simply figuring out whether larger-context training works, we also try to interpret its gains. Specifically, we propose to use fine-grained evaluation to explain where the improvement comes from and why different datasets exhibit discrepant gains.
Overall, the first two questions aim to explore when larger-context training can work, while the third question addresses why. Experimentally, we try to answer these questions by conducting a comprehensive analysis that involves four tagging tasks and thirteen datasets. Our main observations are summarized in Sec. 8 (footnote 1). Furthermore, we show that, with the help of these observations, it is easier to adapt larger-context training to modern top-performing tagging systems with significant gains. We summarize our contributions below:
1) By asking three research questions, we try to bridge the gap between the growing number of top-performing sentence-level tagging systems and the insufficient understanding of larger-context training, encouraging future research to explore more larger-context tagging systems.
2) We systematically investigate four aggregators for larger context and present an attribute-aided evaluation methodology to interpret their relative advantages and why they can work (Sec. 3.2).
3) Based on some of our observations, we adapt larger-context training to five modern top-scoring systems in the NER task and observe that all larger-context enhanced models achieve significant improvement (Sec. 6). Encouragingly, with the help of larger-context training, the performance of Akbik et al. (2018) on the WB (OntoNotes5.0-WB) dataset can be improved by 10.78 F1.
Footnote 1: Putting the conclusion at the end can help the reader understand it better, since more contextual information about the experiments has been introduced.

Task, Dataset, and Model
We first define the tagging task and then describe several popular datasets as well as typical methods for this task.

Task Definition
Sequence tagging aims to assign one of the pre-defined labels to each token in a sequence. In this paper, we consider four types of concrete tasks: Named Entity Recognition (NER), Chinese Word Segmentation (CWS), Part-of-Speech (POS) tagging, and Chunking.

Datasets
The datasets used in our paper are naturally ordered, without random shuffling, according to the papers that constructed them, except for the WNUT-2016 dataset.
Named Entity Recognition (NER): We consider two well-established benchmarks: CoNLL-2003 (CN03) and OntoNotes 5.0. OntoNotes 5.0 is collected from six different genres: broadcast conversation (BC), broadcast news (BN), magazine (MZ), newswire (NW), telephone conversation (TC), and web data (WB). Since each domain of OntoNotes 5.0 has its own characteristics, we follow previous works (Durrett and Klein, 2014; Chiu and Nichols, 2016; Ghaddar and Langlais, 2018) that utilize different domains of this dataset, which also paves the way for our fine-grained analysis.
Chinese Word Segmentation (CWS): We use four mainstream datasets from SIGHAN2005 and SIGHAN2008, in which CITYU is in traditional Chinese, while PKU, NCC, and SXU are in simplified Chinese.
Chunking (Chunk): CoNLL-2000 (CN00) is a benchmark dataset for text chunking.
Part-of-Speech (POS): We use the Penn Treebank (PTB) III dataset for POS tagging.

Neural Tagging Models
Despite the emergence of many architectural explorations (Ma and Hovy, 2016; Lample et al., 2016; Yang et al., 2018; Peters et al., 2018; Akbik et al., 2018; Devlin et al., 2018) for sequence tagging, two general frameworks can be summarized: (i) cEnc-wEnc-CRF consists of a word-level encoder, a sentence-level encoder, and a CRF layer (Lafferty et al., 2001); (ii) ContPre-MLP is composed of a contextualized pre-trained layer followed by an MLP or CRF layer. In this paper, we first take both frameworks as study objects for our three research questions and instantiate them as two specific models: CNN-LSTM-CRF (Ma and Hovy, 2016) and BERT-MLP (Devlin et al., 2018).

Larger-Context Tagging

Sentence-level Tagging
Let $S = s_1, \cdots, s_k$ represent a sequence of sentences, where sentence $s_i$ contains $n_i$ words: $s_i = w_{i,1}, \cdots, w_{i,n_i}$. Sentence-level tagging models predict the label for each word $w_{i,t}$ sentence by sentence (within a given sentence $s_i$). CNN-LSTM-CRF, for example, first converts each word $w_{i,t} \in s_i$ into a vector by different word-level encoders wEnc(·):
$$\mathbf{w}_{i,t} = \mathrm{wEnc}(w_{i,t}) = \mathrm{Lookup}(w_{i,t}) \oplus \mathrm{CNN}(w_{i,t}), \tag{1}$$
where $\oplus$ denotes the concatenation operation, and $\mathrm{Lookup}(w_{i,t})$ can be pre-trained context-free (e.g., GloVe) or context-dependent (e.g., BERT) word representations.
The concatenated representation is then fed into the sentence encoder sEnc(·) (e.g., an LSTM layer) to derive a contextualized representation for each word:
$$\mathbf{h}_{i,t} = \mathrm{sEnc}(\cdot) = \mathrm{LSTM}^{(s)}(\mathbf{w}_{i,t}, \mathbf{h}_{i,t-1}, \theta), \tag{2}$$
where the lower case "s" of $\mathrm{LSTM}^{(s)}$ represents a sentence-level LSTM. Finally, a CRF layer is used to predict the label for each word.

Contextual Information Aggregators
Instead of predicting entity tags sentence by sentence, more contextual information from neighboring sentences can be introduced in diverse ways. In the following, we elaborate on how to extend sentence-level tagging to a larger-context setting. The high-level idea is to introduce more contextual information into the word- or sentence-level encoder defined in Eq. 1 and Eq. 2.
Here, we propose four larger-context aggregators, whose architectures are illustrated in Fig. 1.
Bag-of-Word Aggregator (bow) calculates a fused representation $r$ for a sequence of sentences:
$$r = \mathrm{BOW}(s_1, \cdots, s_k),$$
where BOW(·) is a function that computes the average of all word representations of the input sentences. Afterward, $r$ is injected into the word encoder as additional information.
More precisely, the word-level encoder and sentence-level encoder can be rewritten as:
$$\mathbf{w}^{bow}_{i,t} = \mathrm{wEnc}(w_{i,t}) \oplus r,$$
$$\mathbf{h}^{bow}_{i,t} = \mathrm{LSTM}^{(S)}(\mathbf{w}^{bow}_{i,t}, \mathbf{h}^{bow}_{i,t-1}, \theta),$$
where the upper case "S" of $\mathrm{LSTM}^{(S)}$ denotes the larger-context encoder, which uses an LSTM to deal with a sequence of sentences ($S = s_1, \cdots, s_k$) instead of a single sentence.
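The bow aggregator described above can be illustrated with a small Python sketch. It is not the authors' implementation: word vectors are plain lists of floats, the helper names `bow` and `inject` are hypothetical, and the toy vectors are made up for illustration.

```python
from typing import List

Vector = List[float]


def bow(sentences: List[List[Vector]]) -> Vector:
    """Average all word vectors over a window of sentences (the fused vector r)."""
    words = [w for sent in sentences for w in sent]
    if not words:
        raise ValueError("the window contains no words")
    dim = len(words[0])
    return [sum(w[d] for w in words) / len(words) for d in range(dim)]


def inject(sentences: List[List[Vector]]) -> List[List[Vector]]:
    """Append r to every word vector, mirroring the concatenation with r."""
    r = bow(sentences)
    return [[w + r for w in sent] for sent in sentences]


# Toy window of k = 2 sentences with 2-dimensional word vectors (made-up values).
window = [[[1.0, 0.0], [3.0, 2.0]], [[2.0, 4.0]]]
fused = bow(window)
augmented = inject(window)
print(fused)            # the fused representation r
print(augmented[1][0])  # first word of the second sentence with r appended
```

Every word in the window receives the same vector r, so the aggregator adds document-level information without modelling word order across sentences.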
Sequential Aggregator (seq) first concatenates all sentences $s_i \in S$ and then encodes them with a larger-context encoder $\mathrm{LSTM}^{(S)}$. Formally, the seq aggregator can be represented as:
$$\mathbf{h}^{seq}_{i,t} = \mathrm{LSTM}^{(S)}(\mathbf{w}^{seq}_{i,t}, \mathbf{h}^{seq}_{i,t-1}, \theta),$$
where $\mathbf{w}^{seq}_{i,t}$ is defined as in Eq. 1, and $\mathrm{Lookup}(w_{i,t})$ is GloVe. Then, a CRF decoder is used to predict the tag for each word.
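The concatenation step of the seq aggregator can be sketched as follows. This is only an illustration under stated assumptions: the text does not say whether windows overlap, so the sketch assumes non-overlapping windows of at most k consecutive sentences, and the function name `make_windows` is hypothetical.

```python
from typing import List, Tuple

Sentence = List[str]


def make_windows(doc: List[Sentence], k: int) -> List[Tuple[List[str], List[int]]]:
    """Group consecutive sentences into non-overlapping windows of at most k sentences.

    For each window, return the concatenated tokens and the end offset of every
    sentence, so that predicted tags can be mapped back to their sentences.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    windows = []
    for start in range(0, len(doc), k):
        tokens: List[str] = []
        ends: List[int] = []
        for sent in doc[start:start + k]:
            tokens.extend(sent)
            ends.append(len(tokens))
        windows.append((tokens, ends))
    return windows


# Toy document with three sentences.
doc = [["EU", "rejects", "German", "call"], ["Peter", "Blackburn"], ["BRUSSELS"]]
for tokens, ends in make_windows(doc, k=2):
    print(tokens, ends)
```

With k = 1 the function returns one window per sentence, which corresponds to the sentence-level setting.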
Graph Aggregator (graph) incorporates non-local bias into tagging models. Each word $w_i$ is conceptualized as a node. For edge connections, we define the following types of edges between pairs of nodes (i.e., $w_i$ and $w_j$) to encode various structural information in the context graph: i) if $|i - j| = 1$; ii) if $w_i = w_j$. In practice, the graph aggregator first collects contextual information over a sequence of sentences and generates the word representation. Afterwards, the contextual vector $g$ is introduced into the larger-context encoder, i.e., $\mathrm{LSTM}^{(S)}$.
Contextualized Sequential Aggregator (cPre-seq) is an extension of the seq aggregator that uses contextualized pre-trained models, such as BERT (Devlin et al., 2018), Flair (Akbik et al., 2018), and ELMo (Peters et al., 2018), as a word encoder. Here, cPre-seq is instantiated with BERT to get the word representation, followed by a larger-context encoder $\mathrm{LSTM}^{(S)}$. We keep the length of the larger context for the cPre-seq aggregator within 512. cPre-seq can be formalized as:
$$\mathbf{h}^{cPre}_{i,t} = \mathrm{LSTM}^{(S)}(\mathrm{BERT}(w_{i,t}), \mathbf{h}^{cPre}_{i,t-1}, \theta). \tag{9}$$

Experiment: When Does It Work?
The experiments in this section are designed to answer the first two research questions, Q1 and Q2 (Sec. 1). Specifically, we investigate whether larger-context training can achieve improvement and how different aggregator structures and contextualized pre-trained models influence it.

Settings and Hyper-parameters
We adopt CNN-LSTM-CRF as a prototype and augment it with larger-context information through four categories of aggregators: bow, seq, graph, and cPre-seq. We use Word2Vec (Mikolov et al., 2013) (trained on a simplified Chinese Wikipedia dump) as non-contextualized embeddings for the CWS task, and GloVe (Pennington et al., 2014) for the NER, Chunk, and POS tasks.
The window size k (the number of sentences) of the larger-context aggregators is explored over k ∈ {1, 2, 3, 4, 5, 6, 10} for seq, bow, and cPre-seq. We take the best performance that a larger-context aggregator achieves with window size k > 1 as its final performance. We use the result from the model with the best validation set performance, terminating training when the performance on the development set has not improved for 20 epochs.
For the POS task, we adopt dataset-level accuracy as the evaluation metric, while for the other tasks we use the corpus-level F1 score (Sang and De Meulder, 2003).
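As an illustration of the corpus-level F1 mentioned above, the following Python sketch pools span counts over all sentences before computing precision and recall. The `(start, end, label)` span encoding and the function name `corpus_f1` are assumptions made for this example, and the toy spans are invented.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, label): a hypothetical span encoding


def corpus_f1(gold: List[Set[Span]], pred: List[Set[Span]]) -> float:
    """Corpus-level F1: counts are pooled over all sentences before computing P and R."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same sentences")
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # exact-match spans
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    if tp == 0:
        return 0.0
    precision = tp / n_pred
    recall = tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Toy example: two sentences, one mislabelled span in the second sentence.
gold = [{(0, 1, "ORG")}, {(0, 2, "PER"), (3, 4, "LOC")}]
pred = [{(0, 1, "ORG")}, {(0, 2, "PER"), (3, 4, "ORG")}]
print(corpus_f1(gold, pred))
```

A span counts as correct only if both its boundaries and its label match, which is why the mislabelled span lowers precision and recall at the same time.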

Exp-I: Effect of Structured Typologies
Tab. 1 illustrates the relative improvement of the four larger-context training settings (k > 1) over sentence-level tagging (k = 1). To examine whether larger-context aggregation yields a significant improvement over sentence-level tagging, we use the Wilcoxon signed-rank test (Wilcoxon et al., 1970) at the p = 0.05 level. Results are shown in the last column of Tab. 1. We find that the improvements brought by the four larger-context aggregators are statistically significant (p < 0.05), suggesting that introducing larger context can significantly improve the performance of sentence-level models.
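For readers unfamiliar with the Wilcoxon signed-rank test, a minimal pure-Python sketch of an exact two-sided version is given below. The text does not say how the test was computed, so the choices here (dropping zero differences, average ranks for ties, exact enumeration of the null distribution) are assumptions, and the scores are made-up toy values rather than results from Tab. 1. In practice a library routine such as `scipy.stats.wilcoxon` would normally be used.

```python
from typing import Dict, List


def wilcoxon_signed_rank(x: List[float], y: List[float]) -> float:
    """Exact two-sided p-value of the Wilcoxon signed-rank test for paired samples."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank |d| in ascending order; tied values share their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks2 = [0] * n  # ranks multiplied by 2, so averaged ranks stay integers
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for pos in range(i, j + 1):
            ranks2[order[pos]] = (i + 1) + (j + 1)  # 2 * mean of ranks i+1 .. j+1
        i = j + 1
    observed = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    # Null distribution of the positive-rank sum: each sign pattern is equally likely.
    counts: Dict[int, int] = {0: 1}
    for r in ranks2:
        nxt = dict(counts)
        for s, c in counts.items():
            nxt[s + r] = nxt.get(s + r, 0) + c
        counts = nxt
    total = sum(ranks2)
    extreme = sum(c for s, c in counts.items()
                  if abs(2 * s - total) >= abs(2 * observed - total))
    return extreme / 2 ** n


# Made-up scores on six datasets: sentence-level vs. larger-context.
sentence_level = [90.0, 85.0, 70.0, 60.0, 88.0, 75.0]
larger_context = [91.0, 87.0, 73.0, 64.0, 93.0, 81.0]
p_value = wilcoxon_signed_rank(larger_context, sentence_level)
print(p_value)
```

When every paired difference has the same sign, as in the toy data, only the two most extreme sign patterns are at least as extreme as the observation, which gives the smallest p-value attainable for that number of pairs.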

Results
We detail the main observations from Tab. 1: 1) For most of the datasets, introducing larger-context information brings gains regardless of how it is introduced (e.g., bow or graph), indicating the efficacy of larger contextual information. Impressively, the performance on dataset WB is significantly improved by 7.26 F1 with the cPre-seq aggregator ($p = 5.1 \times 10^{-3} < 0.05$).
2) Overall, compared with the bow and graph aggregators, the seq aggregator achieves a larger improvement on average, which can be further enhanced by introducing contextualized pre-trained models (e.g., BERT).
3) Incorporating larger-context information with some aggregators can also lead to a performance drop on some datasets (e.g., using the graph aggregator on dataset MZ leads to a 0.16 performance drop), which suggests the importance of a better match between datasets and aggregators.

Table 1: The relative improvement (the performance difference between a model with a larger-context aggregator (e.g., bow) and the one without it) on tasks CWS, NER, Chunk, and POS. "norm" denotes the normal setting (k = 1). The values in red are cases where the performance of larger-context tagging (k > 1) is lower than that of sentence-level tagging (k = 1). "Signi." denotes the p-value of the significance test. "Emb.", "Non-Con.", "Con.", and "Agg." are the abbreviations of "Embeddings", "Non-Contextualized", "Contextualized", and "Aggregator", respectively. The values in pink indicate that the value is less than zero.

Exp-II: Effect of BERT
To answer research question Q2 (Can the larger-context approach easily play to its strengths with the help of recently arising contextualized pre-trained models?), we elaborate on how the cPre-seq and seq aggregators influence performance.
Results: Fig. 2 illustrates the relative improvement achieved by two larger-context methods, seq (blue bar) and cPre-seq (red bar), on four different tagging tasks. We observe that: 1) In general, aggregators equipped with BERT cannot guarantee a better improvement; whether they help is dataset-dependent. 2) Task-wise, cPre-seq improves performance on all datasets of the NER, Chunk, and POS tasks. By contrast, seq is beneficial to all datasets of the CWS task. This could be attributed to differences in language and in the characteristics of the tasks. Specifically, for most non-CWS datasets (7 out of 9), cPre-seq performs better than seq (p < 0.05).

Experiment: Why Does It Work?
Experiments in this section are designed for research question Q3, interpreting where the gains of the larger-context approach come from and why different datasets exhibit diverse improvements. To achieve this goal, we use the concept of interpretable evaluation (Fu et al., 2020a), which allows us to perform fine-grained evaluation of one or multiple systems.

Attribute Definition
The first step of interpretable evaluation is attribute definition. The high-level idea is that, given one attribute, the test set of each tagging task is partitioned into several interpretable buckets based on it, and the F1 score (accuracy for POS) is calculated for each bucket. Next, we explicate the general attributes defined in this paper.
We first introduce some notation to facilitate the definitions of our attributes. We define $x$ as a token and the bold form $\mathbf{x}$ as a span, which occurs in a test sentence $X = \mathrm{sent}(\mathbf{x})$. We additionally define two functions: oov(·), which counts the number of words outside the training set, and ent(·), which tallies the number of entity words. Based on this notation, we introduce some feature functions that compute different attributes for each span or token. In the following, we give the attribute definitions for NER.
Training set-independent Attributes
• $\phi_{eLen}(\mathbf{x}) = |\mathbf{x}|$: entity span length
Training set-dependent Attributes
• $\phi_{eFre}(\mathbf{x}) = \mathrm{Fre}(\mathbf{x})$: entity frequency
• $\phi_{eCon}(\mathbf{x}) = \mathrm{Con}(\mathbf{x})$: label consistency of entity
where $\mathrm{Fre}(\mathbf{x})$ calculates the frequency of input $\mathbf{x}$ in the training set, and $\mathrm{Con}(\mathbf{x})$ quantifies how consistently a given span is labeled with a particular label. $\mathrm{Con}(\mathbf{x})$ can be formulated as:
$$\mathrm{Con}(\mathbf{x}) = \frac{|\{\varepsilon \mid \mathrm{lab}(\varepsilon) = \mathrm{lab}(\mathbf{x}),\ \mathrm{str}(\varepsilon) = \mathrm{str}(\mathbf{x}),\ \varepsilon \in E^{tr}\}|}{|\{\varepsilon \mid \mathrm{str}(\varepsilon) = \mathrm{str}(\mathbf{x}),\ \varepsilon \in E^{tr}\}|},$$
where $E^{tr}$ denotes the entities in the training set, lab(·) denotes the label of the input span, and str(·) represents the surface string of the input span. Similarly, we can extend the above two attributes to the token level, obtaining $\phi_{tFre}(x)$ and $\phi_{tCon}(x)$.
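The two training set-dependent attributes can be sketched in Python as follows. This is an illustrative reading of the definitions above, not the authors' code: entities are represented as hypothetical `(surface string, label)` pairs, Fre is taken as a raw count (the text does not state whether it is normalised), and spans unseen in training are assigned a consistency of 0 by assumption.

```python
from collections import Counter
from typing import List, Tuple

Entity = Tuple[str, str]  # (surface string, label): a hypothetical representation


def build_stats(train_entities: List[Entity]) -> Tuple[Counter, Counter]:
    """Count surface strings and (surface string, label) pairs in the training set."""
    by_string = Counter(surface for surface, _ in train_entities)
    by_pair = Counter(train_entities)
    return by_string, by_pair


def fre(surface: str, by_string: Counter) -> int:
    """Fre(x): how often the surface string occurs as an entity in training."""
    return by_string[surface]


def con(surface: str, label: str, by_string: Counter, by_pair: Counter) -> float:
    """Con(x): share of training entities with this surface string that carry this label."""
    seen = by_string[surface]
    if seen == 0:
        return 0.0  # assumption: unseen spans get consistency 0
    return by_pair[(surface, label)] / seen


# Made-up training entities for illustration.
train = [("Washington", "PER"), ("Washington", "LOC"),
         ("Washington", "LOC"), ("Paris", "LOC")]
by_string, by_pair = build_stats(train)
print(fre("Washington", by_string))
print(con("Washington", "LOC", by_string, by_pair))
```

An ambiguous string such as the toy "Washington" gets a consistency below 1, which is the kind of span the fine-grained analysis groups into the low-consistency buckets.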
Attributes for the CWS task can be defined in a similar way. Specifically, the entity (or token) in the NER task corresponds to the word (or character) in the CWS task. Note that we omit word density for the CWS task since it equals one for any sentence.

Attribute Buckets
We break down all test examples into different attribute buckets according to the given attribute. Take the entity length (eLen) attribute of the NER task as an example. First, we calculate each test sample's entity length attribute value. Then, we divide the test entities into N attribute buckets (N = 4 by default) such that the numbers of test samples in all attribute intervals (buckets) are equal, and calculate the performance for the entities falling into the same bucket.
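The equal-size bucketing step can be sketched as below. The function name `equal_size_buckets` is hypothetical, and the handling of ties is an assumption: in this sketch, examples with identical attribute values may fall into adjacent buckets, which the text does not specify. The bucket names follow the XS/S/L/XL labels used in the figures.

```python
from typing import List, Sequence

BUCKET_NAMES = ["XS", "S", "L", "XL"]


def equal_size_buckets(values: Sequence[float], n: int = 4) -> List[List[int]]:
    """Split example indices into n buckets of (nearly) equal size, ordered by value."""
    if n < 1:
        raise ValueError("n must be at least 1")
    order = sorted(range(len(values)), key=lambda i: values[i])
    buckets = []
    for b in range(n):
        lo = b * len(order) // n
        hi = (b + 1) * len(order) // n
        buckets.append(order[lo:hi])
    return buckets


# Made-up entity lengths for eight test entities.
lengths = [1, 3, 2, 1, 5, 2, 4, 1]
for name, members in zip(BUCKET_NAMES, equal_size_buckets(lengths)):
    print(name, members)
```

Performance is then computed separately on the examples whose indices fall in each bucket.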

Exp-I: Breakdown over Attributes
To investigate where the gains of larger-context training come from, we conduct a fine-grained evaluation with the evaluation attributes defined in Sec. 5.1. We use the cPre-seq larger-context aggregation method as the base model. Fig. 3 shows the relative improvement of the cPre-seq larger-context aggregation method on the NER (7 datasets) and CWS (4 datasets) tasks. The relative improvement is the performance of cPre-seq larger-context tagging minus that of sentence-level tagging.

Results
Our findings from Fig. 3 are: 1) Test spans with lower label consistency can benefit much more from the larger-context training.As shown in Fig. 3 (a,b,i,j), test spans with lower label consistency (NER:eCon,tCon=S/XS, CWS: wCon,cCon=S/XS) can achieve higher relative improvement using the larger-context training, which holds for both NER and CWS tasks.
2) The NER task achieves more gains on lower- and higher-frequency test spans, while the CWS task obtains more gains on lower-frequency test spans. As shown in Fig. 3 (c,d,k,l), in the NER task, test spans with higher or lower frequency (NER: eFre=XS/XL; tFre=XS/XL) achieve larger improvements with the help of more contextual sentences, while for the CWS task only the test spans with lower frequency achieve more gains.
3) Test spans of the NER task with lower entity density obtain larger improvement with the help of larger-context training. In terms of entity density, shown in Fig. 3 (e), an evaluation attribute specific to the NER task, larger-context training is not good at dealing with test spans with high entity density (NER: eDen=XL/L), while doing well on test spans with low entity density (NER: eDen=XS/S).
4) Larger-context training achieves more gains on short entities in the NER task and on long words in the CWS task. As shown in Fig. 3 (f,m), the dark blue boxes can be seen for the short entities (eLen=XS/S) of the NER task and the long words (wLen=XL/L) of the CWS task.
5) Both NER and CWS tasks achieve more gains on spans with higher OOV density. For the OOV density shown in Fig. 3 (h,o), the test spans with higher OOV density (NER, CWS: dOov=L/XL) achieve more gains from larger-context training, which holds for both NER and CWS tasks.

Exp-II: Quantifying and Understanding Dataset Bias
Different datasets (e.g., CN03) may match different information aggregators (e.g., cPre-seq). Figuring out how different datasets influence the choice of aggregator is a challenging task. We try to approach this goal by (i) designing diverse measures that can characterize a given dataset from different perspectives, and (ii) analyzing the correlation between different dataset properties and the improvements brought by different aggregators.
Dataset-level Measure: Given a dataset $E$ and an attribute $p$ as defined in Sec. 5.1, the dataset-level measure can be defined as:
$$\zeta_p(E) = \frac{1}{|E^{te}|} \sum_{\varepsilon \in E^{te}} \phi_p(\varepsilon),$$
where $E^{te} \in E$ is a test set that contains entities/tokens in the NER task or words/characters in the CWS task, and $\phi_p(\cdot)$ is a function (as defined in Sec. 5.1) that computes the attribute value for a given span. For example, $\zeta_{sLen}(\mathrm{CN03})$ represents the average sentence length of CN03's test set.
Correlation Measure: Statistically, we define a variable $\rho$ to quantify the correlation between a dataset-level attribute and the relative improvement of an aggregator: $\rho = \mathrm{Spearman}(\zeta_p, f_y)$, where Spearman denotes Spearman's rank correlation coefficient (Mukaka, 2012). $\zeta_p$ represents the dataset-level attribute values on all datasets with respect to attribute $p$ (e.g., eLen), while $f_y$ denotes the relative improvements of larger-context training on the corresponding datasets with respect to a given aggregator $y$ (e.g., cPre-seq).
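The two measures can be illustrated with a short pure-Python sketch. It is a simplified reading of the definitions above rather than the authors' code: the function names are hypothetical, ties receive average ranks, and the numbers are made-up toy values, not entries from the paper's tables. A library routine such as `scipy.stats.spearmanr` would normally be used in practice.

```python
from typing import List, Sequence


def dataset_measure(attribute_values: Sequence[float]) -> float:
    """Dataset-level measure: average attribute value over one dataset's test spans."""
    return sum(attribute_values) / len(attribute_values)


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ascending ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rho is undefined for constant input")
    return cov / (vx * vy) ** 0.5


# Made-up values: one dataset-level measure and one gain per dataset.
zeta = [0.9, 0.5, 0.7, 0.3]
gains = [0.2, 1.5, 0.8, 2.1]
rho = spearman(zeta, gains)
print(rho)
```

A strongly negative rho, as in the toy data, means that datasets with a lower value of the attribute tend to gain more from the aggregator, which is how the correlations are read when matching datasets to aggregators.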
Results: Tab. 2 displays (using spider charts) the measure $\zeta_p$ of seven datasets with respect to diverse attributes, and the correlation measure $\rho$ in the NER task. Takeaways: We can conduct a similar analysis for the bow and graph aggregators. Due to limited pages, we detail them in our appendix and highlight the suitable NER datasets for each aggregator as follows.

Adapting to Top-Scoring Systems
Beyond the above quantitative and qualitative analysis of our instantiated typical tagging models (Sec. 2.3), we are also curious about how well modern top-scoring tagging systems perform when equipped with larger-context training.
To this end, we choose the NER task as a case study and first re-implement existing top-performing models for different NER datasets separately, and then adapt the larger-context approach to them based on the seq or cPre-seq aggregator (footnote 7), which have shown superior performance in our analysis above.
Footnote 7: Training all four aggregators for all tagging tasks is much more costly; here we choose these two since they can obtain better performance at a relatively lower cost.
Settings: We collect five top-scoring tagging systems (Luo et al., 2020; Lin et al., 2019; Chen et al., 2019; Yan et al., 2019; Akbik et al., 2018) that were most recently proposed. Among these five models, for Akbik et al. (2018) we use the cPre-seq aggregator for larger-context training, since this model originally relies on a contextualized pre-trained layer. Besides, from the analysis in Sec. 5.4 we know the suitable datasets for the cPre-seq aggregator: CN03, WB, TC, BC, and NW. For the other four models, we use the seq aggregator for larger-context training, and the matched datasets are WB, TC, and BC.
Results: Tab. 3 shows the relative improvement of larger-context training on five modern top-scoring models in the NER task. We observe that larger-context training achieves consistent gains on all chosen datasets, which holds for both the seq and cPre-seq aggregators. Notably, larger-context training achieves a sharp improvement on WB, which holds for all five top-scoring models. For example, with the help of larger-context training, the performance can be improved significantly (10.78 F1 using Akbik et al. (2018) and 7.18 F1 using Luo et al. (2020)). This suggests that modern top-scoring NER systems can also benefit from larger-context training.

Related Work
Our work touches on the following research topics for tagging tasks.
Sentence-level Tagging: Existing works have achieved impressive performance at sentence-level tagging through extensive structural explorations with different types of neural components. Regarding sentence encoders, recurrent neural nets (Huang et al., 2015; Chiu and Nichols, 2015; Ma and Hovy, 2016; Lample et al., 2016; Li et al., 2019; Lin et al., 2020) and convolutional neural nets (Strubell et al., 2017; Yang et al., 2018; Chen et al., 2019; Fu et al., 2020a) were widely used, while transformers were also studied to obtain sentential representations (Yan et al., 2019; Yu et al., 2020). Some recent works consider NER as a span classification task (Li et al., 2019; Jiang et al., 2019; Mengge et al., 2020; Ouchi et al., 2020), unlike most works that view it as a sequence labeling task. To capture morphological information, some previous works introduced character- or subword-aware encoders with unsupervised pre-trained knowledge (Peters et al., 2018; Akbik et al., 2018; Devlin et al., 2018; Akbik et al., 2019; Yang et al., 2019; Lan et al., 2019).
Document-level Tagging: Document-level tagging introduces more contextual features to improve tagging performance. Some early works introduced non-local information (Finkel et al., 2005; Krishnan and Manning, 2006) (Hu et al., 2019, 2020), chemical NER (Luo et al., 2018), disease NER (Xu et al., 2019), and Chinese patent (Li and Xue, 2014, 2016). Compared with these works, instead of proposing a novel model, we focus on investigating when and why larger-context training, as a general strategy, can work.
Interpretability and Robustness of Sequence Labeling Systems: Recently, there has been a popular trend that aims to (i) perform a glass-box analysis of sequence labeling systems (Fu et al., 2020b; Agarwal et al., 2020), understanding their generalization ability and quantifying their robustness (Fu et al., 2020c); (ii) evaluate them interpretably (Fu et al., 2020a), making it possible to know what a system is good or bad at and where one system outperforms another; and (iii) conduct reliable analysis (Ye et al., 2021) for test sets with fewer samples. Our work is based on the technique of interpretable evaluation, which provides a convenient way for us to diagnose different systems.

Discussion
We summarize the main observations from our experiments and try to provide preliminary answers to our proposed research questions: (i) How do different ways of integrating larger-context information influence the system's performance? Overall, introducing larger-context information brings gains regardless of how it is introduced (e.g., seq, graph

A Aggregator Setting
Tab. 4 illustrates the window size k (k > 1) at which each larger-context aggregator achieves the best performance. The window size k at which seq achieves the best performance is used to set the document length of the graph aggregator.

B Quantifying and Understanding Dataset Bias
In this section, we will supplement some analyses related to Sec. 5.3.

B.1 Data-level Measure
Tab. 5 gives the data-level measure $\zeta_p$ in seven (four) datasets with respect to eight (seven) attributes in the NER (CWS) task. The data-level measure $\zeta_p$ is used to compute the correlation measure in Sec. 5.3.

B.2 Results
Tab. 6 displays (using spider charts) the measure $\zeta_p$ of seven datasets with respect to diverse attributes, and the correlation measure $\rho$ in the NER task. We have given a detailed analysis of seq and cPre-seq in the main text; here, we provide suggestions for choosing datasets for the bow and graph aggregators.
Tab. 7 illustrates the measures $\zeta_p$ in four CWS datasets with respect to seven attributes (e.g., wCon) and the correlation measure $\rho$. We can conduct an analysis for CWS similar to that for NER. We highlight the suitable CWS datasets for each aggregator as follows:
• bow: NCC and SXU.

Figure 2: Illustration of the relative improvement (%) achieved by two larger-context methods (i.e., seq and cPre-seq) on four different tagging tasks. The red and blue bars represent the improvements from seq and cPre-seq, respectively. The error bars represent 95% confidence intervals of the relative improvement, computed with the Bootstrap method (Efron and Tibshirani, 1986).

Figure 3: The relative increase (∈ [0, 1]) of the cPre-seq larger-context training on NER (a−h) and CWS (i−o) tasks based on their evaluation attributes. "co" denotes the CoNLL-2003 dataset. To facilitate observation, we divide the attribute value range into four categories: extra-small (XS), small (S), large (L), and extra-large (XL). The darker blue implies more significant improvement, while the darker red suggests that larger context leads to worse performance. For the attribute names, "e", "t", "w", and "c" refer to "entity", "token", "word", and "character", respectively.

Table 2: Illustration of measures $\zeta_p$ in seven datasets (CN03, TC, NW, WB, MZ, BN, BC) with respect to eight attributes (e.g., eCon) and correlation measure $\rho$ in the NER task. A higher absolute value of $\rho$ (e.g., |-0.714|) indicates that the improvement of the corresponding aggregator (e.g., seq) correlates strongly with the corresponding attribute (e.g., eCon). The number with the highest absolute value in each column is colored green. "cPre" represents "cPre-seq", and the values in grey denote correlation values that do not pass a significance test (p = 0.05). "Attr." denotes attributes.

Based on these correlations, which passed the significance test (p < 0.05), between the dataset-level measure (w.r.t. a certain attribute, e.g., eCon) and the gains from larger-context training (w.r.t. an aggregator, e.g., seq), we find that:
(1) Regarding the cPre-seq aggregator, its gains are negatively correlated with $\zeta_{eCon}$, $\zeta_{tCon}$, $\zeta_{eFre}$, and $\zeta_{eDen}$ with larger correlation values. Therefore, the cPre-seq aggregator is more appropriate for the WB, TC, BC, and NW datasets, since these four datasets have a lower value of $\zeta_p$ with respect to the attributes eCon (TC, WB), tCon (TC, WB), eFre (NW, TC), and eDen (BC, WB, TC). Additionally, since the cPre-seq aggregator obtains the highest positive correlation with $\zeta_{dOov}$, and $\zeta_{dOov}(\mathrm{CN03})$ as well as $\zeta_{dOov}(\mathrm{BC})$ achieve the highest values, the cPre-seq aggregator is suitable for CN03 and BC.
(2) Regarding the seq aggregator, its gains are negatively correlated with $\zeta_{eCon}$, $\zeta_{tCon}$, and $\zeta_{eDen}$. Therefore, the seq aggregator is better at dealing with datasets WB, TC, and BC, since these datasets have a lower $\zeta_p$ value on one of the attributes (eCon, tCon, and eDen).

Table 3: The relative improvement of larger-context training on top-scoring models in the NER task. "cPre" represents "cPre-seq". "norm" denotes the normal setting (k = 1). The testing datasets are chosen based on the analysis in Sec. 5.4.

(1) Regarding the bow aggregator, its gains are negatively correlated with $\zeta_{tCon}$ and $\zeta_{eDen}$ with larger correlation values. Therefore, the bow aggregator is more appropriate for datasets WB, TC, and BC, since these datasets have a lower value of $\zeta_p$ with respect to the attributes tCon (TC, WB) and eDen (BC, WB, TC). Additionally, the bow aggregator obtains the highest positive correlation with $\zeta_{tFre}$, $\zeta_{eLen}$, and $\zeta_{sLen}$. Besides, $\zeta_{tFre}(\mathrm{MZ})$, $\zeta_{eLen}(\mathrm{NW})$, and $\zeta_{sLen}(\mathrm{NW})$ also achieve the highest values, suggesting that the bow aggregator is suitable for MZ and NW.
(2) Regarding the graph aggregator, its gains are negatively correlated with $\zeta_{eCon}$, $\zeta_{eFre}$, $\zeta_{tFre}$, and $\zeta_{eLen}$ with larger correlation values. Therefore, the graph aggregator is more appropriate for datasets WB, TC, NW, and CN03, since these four datasets have a lower value of $\zeta_p$ with respect to the attributes eCon (TC, WB), eFre (NW, TC), tFre (CN03, WB), and eLen (CN03).

Table 5: The data-level measure $\zeta_p$ in seven (four) datasets with respect to eight (seven) attributes in the NER (CWS) task. The values of wFre and cFre on the CWS task need to be multiplied by $10^{-7}$.

Table 6: Illustration of measures $\zeta_p$ in seven datasets (CN03, TC, NW, WB, MZ, BN, BC) with respect to eight attributes (e.g., eCon) and correlation measure $\rho$ in the NER task. A higher absolute value of $\rho$ (e.g., |-0.714|) indicates that the improvement of the corresponding aggregator (e.g., seq) correlates strongly with the corresponding attribute (e.g., eCon). The number with the highest absolute value in each column is colored green. "cPre" represents "cPre-seq", and the values in grey denote correlation values that do not pass a significance test (p = 0.05).

Table 7: Illustration of measures $\zeta_p$ in four datasets (CITYU, NCC, SXU, PKU) with respect to seven attributes (e.g., wCon) and correlation measure $\rho$ in the CWS task. A higher absolute value of $\rho$ (e.g., |-0.657|) indicates that the improvement of the corresponding aggregator (e.g., bow) correlates strongly with the corresponding attribute (e.g., wCon). The number with the highest absolute value in each column is colored green. "cPre" represents "cPre-seq", and the values in grey denote correlation values that do not pass a significance test (p = 0.05).