Foreseeing the Benefits of Incidental Supervision

Real-world applications often require improving models by leveraging *a range of cheap incidental supervision signals*. These could include partial labels, noisy labels, knowledge-based constraints, and cross-domain or cross-task annotations, all statistically associated with gold annotations without matching them exactly. However, we currently lack a principled way to measure the benefit of these signals to a given target task, and the common practice is to evaluate these benefits through exhaustive experiments with various models and hyperparameters. This paper studies whether we can, *in a single framework, quantify the benefits of various types of incidental signals for a given target task without going through combinatorial experiments*. We propose a unified PAC-Bayesian motivated informativeness measure, PABI, that characterizes the uncertainty reduction provided by incidental supervision signals. We demonstrate PABI's effectiveness by quantifying the value added by various types of incidental signals to sequence tagging tasks. Experiments on named entity recognition (NER) and question answering (QA) show that PABI's predictions correlate well with learning performance, providing a promising way to determine, ahead of learning, which supervision signals would be beneficial.


Introduction
The supervised learning paradigm, where direct supervision signals are available in high quality and large amounts, has been struggling to fulfill the needs of many real-world AI applications. As a result, researchers and practitioners often resort to datasets that are not collected directly for the target task but capture some phenomena useful for it (Pan and Yang, 2009; Vapnik and Vashist, 2009; Roth, 2017; Kolesnikov et al., 2019).¹ However, it remains unclear how to predict the benefits of these incidental signals for our target task beforehand. For example, given two incidental signals that are relevant to the target task, it is still difficult to predict which one is more beneficial. Therefore, the common practice is often trial and error: perform experiments with different combinations of datasets and learning protocols, often exhaustively, to measure the impact on a target task (Liu et al., 2019; Khashabi et al., 2020). Not only is this very costly, but this trial-and-error approach can also be hard to interpret: if we don't see improvements, is it because the incidental signals themselves are not useful for our target task, or because the learning protocols we have tried are inappropriate?

* Part of this work was done while the author was at Allen Institute for AI.
¹ Our code is publicly available at https://github.com/CogComp/PABI.

Figure 1: An example of NER with various incidental supervision signals: partial labels (some missing labels in structured outputs), noisy labels (some incorrect labels), auxiliary labels (labels of another task, e.g., named entity detection in the figure), and constraints in structured learning (e.g., the BIO constraint, where I-X must follow B-X or I-X (Ramshaw and Marcus, 1999), in the figure).
The difficulties of foreseeing the benefits of various incidental supervision signals, including partial labels, noisy labels, constraints,² and cross-domain signals, are two-fold. First, it is hard to provide a unified measure because of the intrinsic differences among different signals (e.g., how do we predict and compare the benefit of learning from partial data and the benefit of knowing some constraints on the target task?). Second, it is hard to provide a practical measure supported by theory. Previous attempts are either not practical (Baxter, 1998; Ben-David et al., 2010), too heuristic (Gururangan et al., 2020), or applicable to only one type of signal, e.g., noisy labels (Natarajan et al., 2013; Zhang et al., 2021).

Table 1: Comparison between PABI and prior works: (1) Cross-type: PABI is a unified measure that can measure the benefit of different types of incidental signals (e.g., comparing a noisy dataset and a partially annotated dataset). (2) Mixed-type: PABI can measure the benefit of mixed incidental signals (e.g., a dataset that is both noisy and partially annotated). (3) PABI is derived from PAC-Bayesian theory but is also easy to compute in practice; PABI is shown to have similar or better predictive capability for signals' benefits (see Figs. 2 and 3 and Sec. 4.1). Papers are denoted as: CST'11 (Cour et al., 2011), HH'12 (Hovy and Hovy, 2012), LD'14 (Liu and Dietterich, 2014), AL'88 (Angluin and Laird, 1988), NDRT'13 (Natarajan et al., 2013), RVBS'17 (Rolnick et al., 2017), VW'17 (Van Rooyen and Williamson, 2017), WNR'20 (Wang et al., 2020), NHFR'19 (Ning et al., 2019), B'17 (Bjerva, 2017), GMSLBDS'20 (Gururangan et al., 2020).

In this paper, we propose a unified PAC-Bayesian motivated informativeness measure (PABI) to quantify the value of incidental signals for sequence tagging tasks. We suggest that the informativeness of various incidental signals can be uniformly characterized by the uncertainty reduction they provide. Specifically, in the PAC-Bayesian framework, the informativeness is based on the Kullback-Leibler (KL) divergence between the prior and the posterior, where incidental signals are used to estimate a better prior (one that is closer to the gold posterior) to achieve better generalization performance. Furthermore, we provide a more practical entropy-based approximation of PABI.
In practice, PABI first computes the entropy of the prior estimated from incidental signals, and then computes the relative decrease from the entropy of the prior without any information (i.e. the uniform prior) as the informativeness of incidental signals.
A unified informativeness measure like PABI has long been needed. For instance, it might be obvious that we can expect better learning performance if the training data is less noisy and more completely annotated, but what if we want to compare the benefits of a noisy dataset to those of a dataset that is only partially labeled? PABI enables this kind of comparison to be done analytically, without experiments, for a wide range of incidental signals, such as partial labels, noisy labels, constraints, auxiliary labels (labels of another task), and cross-domain signals, in sequence tagging tasks in NLP. A specific example for NER is shown in Fig. 1, and the advantages of PABI are summarized in Table 1.
Finally, our experiments on two NLP tasks, NER and QA, show a strong positive correlation between PABI and the relative improvement for various incidental signals. This strong positive correlation indicates that the proposed unified, theory-motivated measure PABI can serve as a good indicator of final learning performance, providing a promising way to determine beforehand which signals are helpful for a target task.

Related Work
Many practical measures have been proposed to quantify the benefits of specific types of signals. For example, a widely used measure for partial signals in structured learning is the partial rate (Cid-Sueiro, 2012; Cour et al., 2011; Hovy and Hovy, 2012; Liu and Dietterich, 2014; Van Rooyen and Williamson, 2017; Ning et al., 2019); a widely used measure for noisy signals is the noise ratio (Angluin and Laird, 1988; Natarajan et al., 2013; Rolnick et al., 2017; Van Rooyen and Williamson, 2017). Ning et al. (2019) propose to use the concaveness of the mutual information at different percentages of annotation to quantify the strength of constraints in structured learning; others, in NLP, have quantified the contribution of constraints experimentally (Chang et al., 2008, 2012). Bjerva (2017) proposes to use conditional entropy or mutual information to quantify the value of auxiliary signals. As for domain adaptation, domain similarity can be measured by the performance gap between domains (Wang et al., 2019) or by language-model-based measures in NLP, such as vocabulary overlap (Gururangan et al., 2020). Among these, the most relevant work is Bjerva (2017). However, their conditional entropy or mutual information is based on token-level label distributions, which cannot be used for incidental signals involving multiple tokens or inputs, such as constraints and cross-domain signals. At the same time, in the cases that both PABI and mutual information can handle, PABI behaves similarly to mutual information, as shown in Fig. 2, and PABI can further be shown to be a strictly increasing function of the mutual information. The key advantage of our proposed measure is that PABI is a unified measure, motivated by PAC-Bayesian theory, for a broader range of incidental signals, in contrast to these practical measures for specific types of incidental signals.
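As a concrete reference point for the Bjerva (2017) baseline discussed above, the entropy-normalized mutual information I(Y; Ỹ)/H(Y) between token-level gold and auxiliary labels can be estimated from co-occurrence counts. A minimal sketch in Python (the function name and the toy label pairs are ours, not from the paper):

```python
from collections import Counter
from math import log

def normalized_mi(pairs):
    """Entropy-normalized mutual information I(Y; Y~) / H(Y),
    estimated from (gold_label, auxiliary_label) pairs."""
    n = len(pairs)
    p_joint = Counter(pairs)
    p_y = Counter(y for y, _ in pairs)
    p_a = Counter(a for _, a in pairs)
    mi = sum(c / n * log((c / n) / ((p_y[y] / n) * (p_a[a] / n)))
             for (y, a), c in p_joint.items())
    h_y = -sum(c / n * log(c / n) for c in p_y.values())
    return mi / h_y

# Perfectly predictive auxiliary labels give 1.0.
assert abs(normalized_mi([("B", "X"), ("I", "Y"), ("O", "Z"), ("B", "X")]) - 1.0) < 1e-9
```

Auxiliary labels that are independent of the gold labels drive this quantity toward zero, matching the intuition that they carry no information about the target task.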
There has also been a line of theoretical work that attempts to exploit incidental supervision signals. Among these, the most relevant part is task relatedness. Ben-David and Borbely (2008) define task relatedness based on the richness of transformations between inputs for different tasks, but their analysis is limited to cases where the data come from the same classification problem and the inputs lie in different subspaces. Juba (2006) proposes to use the joint Kolmogorov complexity (Wallace and Dowe, 1999) to characterize relatedness, but it is still unclear how to compute the joint Kolmogorov complexity in real-world applications. Mahmud and Ray (2008) further propose to use conditional Kolmogorov complexity to measure task relatedness and provide an empirical analysis for decision trees, but it is unclear how to use their relatedness for other models, such as deep neural networks. Thrun and O'Sullivan (1998) propose to cluster tasks based on the similarity between the task-optimal distance metrics of k-nearest neighbors (KNN), but their analysis is tied to KNN, and it is again unclear how to use their relatedness for other models. Many other works provide good qualitative analyses showing that various incidental signals are helpful, but they do not quantify to what extent these types of incidental signals can help (Abu-Mostafa, 1993; Baxter, 1998; Ben-David et al., 2010; Balcan and Blum, 2010; Natarajan et al., 2013; London et al., 2016; Van Rooyen and Williamson, 2017; Ciliberto et al., 2019; Wang et al., 2020). Compared to these theoretical analyses, PABI can be easily used in practice to quantify the benefits of a broader range of incidental signals.

PABI: A Unified PAC-Bayesian Informativeness Measure
We start with notations and preliminaries. Let X be the input space, Y be the label space, and Ŷ be the prediction space. Let D denote the underlying distribution on X × Y. Let ℓ : Y × Ŷ → R+ be the loss function that we use to evaluate learning algorithms, and let S = {(x_i, y_i)}_{i=1}^m denote a set of m training samples drawn i.i.d. from D. In the common supervised learning setting, we usually assume that the concept that generates the data comes from a concept class C. In this paper, we assume C is finite, with size |C|, which is common in sequence tagging tasks because of the finite vocabulary and label set. We want to choose a predictor c : X → Ŷ from C such that it generalizes well to unseen data with respect to ℓ, as measured by the generalization error R_D(c); the corresponding training error on S is R_S(c). More generally, instead of predicting with a single concept, we can specify a distribution over the concept class. Let P denote the space of probability distributions over C. General Bayesian learning algorithms (Zhang, 2006) in the PAC-Bayesian framework (McAllester, 1999a,b; Seeger, 2002; McAllester, 2003b,a; Maurer, 2004; Guedj and Shawe-Taylor, 2019) aim to choose a posterior π_λ ∈ P over the concept class C based on a prior π_0 ∈ P and training data S, where λ is a hyperparameter that controls the tradeoff between the prior and the data likelihood. In this setting, the training error and the generalization error need to be generalized to take the distribution into account, as L_S(π_λ) = E_{c∼π_λ}[R_S(c)] and L_D(π_λ) = E_{c∼π_λ}[R_D(c)], respectively. When the posterior is one-hot (exactly one entry of the distribution is 1), we recover the original definitions of training and generalization error, as in the PAC framework (Valiant, 1984).
We are now ready to derive the proposed informativeness measure, PABI, motivated by PAC-Bayes. The generalization error bound in the PAC-Bayesian framework (Catoni, 2007; Guedj and Shawe-Taylor, 2019) says that, with probability 1 − δ over S, the generalization error L_D(π_{λ*}) is bounded, up to two constants B and C, by a term that grows with the square root of D_KL(π*||π_0), where π_{λ*} is the posterior distribution with the optimal hyperparameter λ*, π* ∈ P is the gold posterior that generates the data, and D_KL(π*||π_0) denotes the KL divergence from π_0 to π*. This is based on Theorem 2 in Guedj and Shawe-Taylor (2019).
As shown in the generalization bound, the generalization error is controlled by the KL divergence D_KL(π*||π_0) from the prior distribution to the gold posterior distribution. Therefore, we propose to utilize incidental signals to improve the prior distribution from π_0 to π̃_0, so that it is closer to the gold posterior distribution π*. Correspondingly, we can define PABI, the informativeness measure for incidental supervision signals, by measuring the improvement with regard to the gold posterior, in the KL divergence sense.
Definition 3.1 (PABI). Suppose we use incidental signals to improve the prior distribution from π_0 to π̃_0. The informativeness measure for incidental signals, PABI, is defined as

S(π_0, π̃_0) = √(1 − D_KL(π*||π̃_0) / D_KL(π*||π_0)).   (1)

Remark. Note that S(π_0, π̃_0) = 0 if π̃_0 = π_0, while if π̃_0 = π*, then S(π_0, π̃_0) = 1. This result is consistent with our intuition that the closer π̃_0 is to π*, the more benefit we can gain from incidental signals. The square root function is used in PABI for two reasons: first, the generalization bounds in both the PAC-Bayesian and PAC (see Appx. A.1) frameworks involve a square root; second, in our later experiments, we find that the square root significantly improves the Pearson correlation between the relative performance improvement and PABI. It is worth noting that the square root is not crucial for our framework, because our goal is to compare the benefits of different incidental supervision signals, where relative values are expressive enough. In this sense, any strictly increasing function on [0, 1] of the current formulation would be acceptable.
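Definition 3.1 can be sanity-checked numerically on small finite concept classes. The sketch below (all names ours) computes S(π_0, π̃_0) from explicit distributions and verifies the two boundary cases noted in the remark:

```python
from math import log, sqrt

def kl(p, q):
    """KL divergence D_KL(p || q) for discrete distributions as lists."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def pabi(prior, improved_prior, gold_posterior):
    """S(pi_0, pi~_0) = sqrt(1 - KL(pi* || pi~_0) / KL(pi* || pi_0))."""
    return sqrt(1 - kl(gold_posterior, improved_prior) / kl(gold_posterior, prior))

# Four concepts; the gold posterior is one-hot on concept 0.
uniform = [0.25] * 4
gold = [1.0, 0.0, 0.0, 0.0]
assert pabi(uniform, uniform, gold) == 0.0   # no improvement -> 0
assert pabi(uniform, gold, gold) == 1.0      # prior equals gold posterior -> 1
better = [0.7, 0.1, 0.1, 0.1]                # partially informative prior
assert 0.0 < pabi(uniform, better, gold) < 1.0
```

The improved prior must keep positive mass on the support of the gold posterior, otherwise the KL divergence is infinite and the signal is (correctly) useless.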
We focus on the setting where the gold posterior π* is one-hot, i.e., π* concentrates on the true concept c* ∈ C, though the definition of PABI can handle a general gold posterior. However, π* is unknown in practice, which makes Eq. (1) hard to compute. In the following, we provide an approximation Ŝ of PABI.
Definition 3.2 (Approximation of PABI). Assume that the original prior π_0 is uniform and the gold posterior π* is one-hot, concentrated on the true concept c* in the finite concept class C. Let H(·) be the entropy function. The approximation Ŝ of PABI is defined as

Ŝ(π_0, π̃_0) = √(1 − H(π̃_0) / ln|C|).   (2)

The uniform prior π_0 is usually used when we have no information about which concept in the class generates the data. The intuition behind Ŝ is that it measures how much entropy the incidental signals reduce, compared with the noninformative prior π_0. Ŝ can be computed from data and is thus practical. To see how this approximation works, first note that D_KL(π*||π_0) = ln|C|, because π* is one-hot and π_0 is uniform over the finite concept class C. Let π_c be the one-hot distribution concentrated on concept c, for each c ∈ C. The approximation is that we estimate π* by π_c, where c follows π̃_0: D_KL(π*||π̃_0) ≈ E_{c∼π̃_0} D_KL(π_c||π̃_0), and E_{c∼π̃_0} D_KL(π_c||π̃_0) = H(π̃_0). Therefore, S(π_0, π̃_0) ≈ √(1 − H(π̃_0)/ln|C|) = Ŝ(π_0, π̃_0). As shown in Appx. A.1, the approximation of PABI and PABI are equivalent in the non-probabilistic cases with a finite concept class, indicating the quality of this approximation. Some analysis of extensions and general limitations of PABI can be found in Appx. A.2 and A.3.
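The entropy-based approximation Ŝ is straightforward to compute for any explicit prior over a finite concept class. A minimal sketch (assuming a uniform original prior, as in Definition 3.2; the clamp to zero only guards against floating-point round-off):

```python
from math import log, sqrt

def pabi_hat(improved_prior):
    """S^(pi_0, pi~_0) = sqrt(1 - H(pi~_0) / ln|C|), assuming a uniform
    original prior pi_0 over a finite concept class C."""
    size = len(improved_prior)
    entropy = -sum(p * log(p) for p in improved_prior if p > 0)
    return sqrt(max(0.0, 1 - entropy / log(size)))

# Incidental signals rule out half of an 8-concept class:
# entropy drops from ln 8 to ln 4, so S^ = sqrt(1 - ln4/ln8).
ruled_out_half = [0.25] * 4 + [0.0] * 4
assert abs(pabi_hat(ruled_out_half) - sqrt(1 - log(4) / log(8))) < 1e-9
assert pabi_hat([1.0] + [0.0] * 7) == 1.0   # prior already one-hot -> 1
```

A prior that remains uniform yields Ŝ = 0: signals that do not concentrate the prior carry no measurable information.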

Examples for PABI
In this section, we show some examples of sequence tagging tasks³ in NLP for PABI. We consider two types of incidental signals: signals with a conditional probability distribution (P(y | x)) different from that of gold signals (e.g., noisy signals), and signals with the same task as gold signals but a different marginal distribution of x (P(x)) (e.g., cross-domain signals). Following the categorization of transfer learning (Pan and Yang, 2009), they are denoted as inductive signals and transductive signals, respectively. In our following analysis, we study tasks with a finite concept class, which is common in sequence tagging. For simplicity, we focus on simple cases where the number of incidental signals is large enough. How different factors (including base model performance, size of incidental signals, data distributions, algorithms, and cost-sensitive losses) affect PABI is discussed in Appx. A.4. We derive PABI for partial labels in detail; the derivations for the others are similar. More examples and details can be found in Appx. A.5.

Notation | Description
D        | target domain with gold signals
D̃        | source domain with incidental signals
c(x)     | gold system on gold signals
c̃(x)     | perfect system on incidental signals
ĉ(x)     | silver system trained on incidental signals
η        | difference between the perfect system and the gold system in the target domain
η_1      | difference between the silver system and the perfect system in the source domain
η̃_1      | difference between the silver system and the perfect system in the target domain
η_2      | difference between the silver system and the gold system in the target domain

Table 2: Summary of core notations in the estimation process for transductive signals.

Examples with Inductive signals
Partial labels. In sequence tagging, the label for each example is a sequence, and in this case some of its entries are unknown. Let V be the vocabulary of input words, L be the label set for the task, and n be the length of the input sentence, so that y ∈ Ŷ = Y = L^n. Assuming that the ratio of unknown labels in the data is η_p ∈ [0, 1], the size of the reduced concept class is |C̃| = |L|^{n|V|^n η_p}. Therefore, Ŝ(π_0, π̃_0) = S(π_0, π̃_0) = √(1 − η_p). This is consistent with the widely used partial rate, since it is a monotonically decreasing function of the partial rate (Cid-Sueiro, 2012).
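In the non-probabilistic case, the closed form √(1 − η_p) can be checked against the generic class-size form √(1 − ln|C̃|/ln|C|) on a toy example (the tiny label set and position count below are illustrative):

```python
from math import log, sqrt

def pabi_partial(eta_p):
    """Closed form for partial labels: S^ = sqrt(1 - eta_p)."""
    return sqrt(1 - eta_p)

def pabi_from_class_sizes(reduced_size, full_size):
    """Generic non-probabilistic form: sqrt(1 - ln|C~| / ln|C|)."""
    return sqrt(1 - log(reduced_size) / log(full_size))

# Toy setting: |L| = 3 labels, 4 token positions in total, 1 of them
# unlabeled (eta_p = 0.25): |C| = 3**4 labelings, |C~| = 3**1 remain.
assert abs(pabi_partial(0.25) - pabi_from_class_sizes(3 ** 1, 3 ** 4)) < 1e-9
```

The agreement holds because each unknown token position contributes a full factor of |L| to the reduced class, so ln|C̃|/ln|C| equals the partial rate exactly.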
Noisy labels. For each token, P(y | ỹ) is determined by the noise rate η_n ∈ [0, 1], i.e., P(y = ỹ) = 1 − η_n, and the probability of each other label is η_n/(|L| − 1), where L is the label set. From this we obtain the corresponding probability distribution of labels over the tokens in all inputs (π̃_0 over the concept class). In this way, Ŝ(π_0, π̃_0) = √(1 − H_n/ln|L|), where H_n = −(1 − η_n) ln(1 − η_n) − η_n ln(η_n/(|L| − 1)) is the per-token entropy induced by the noise. This is consistent with the widely used noise rate (Natarajan et al., 2013), since it is a monotonically decreasing function of the noise rate. In practice, the noise rate can easily be estimated with some aligned data,⁴ and noise with more complex patterns (e.g., input-dependent noise) is left as future work.
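Under the uniform-noise model above, the per-token noise entropy determines Ŝ, and the measure decreases in the noise rate up to the point where the noisy labels become uniformly random. A small sketch (function name ours):

```python
from math import log, sqrt

def pabi_noisy(eta_n, num_labels):
    """S^ for uniform label noise: per-token entropy of the noise
    distribution, normalized by ln|L|, plugged into sqrt(1 - .)."""
    if eta_n == 0.0:
        return 1.0
    h = -(1 - eta_n) * log(1 - eta_n) - eta_n * log(eta_n / (num_labels - 1))
    return sqrt(max(0.0, 1 - h / log(num_labels)))

# More noise -> less informative, below the uniform-guess point.
rates = [0.0, 0.1, 0.2, 0.3]
scores = [pabi_noisy(r, 19) for r in rates]   # e.g. 18 entity types + O
assert all(a > b for a, b in zip(scores, scores[1:]))
assert pabi_noisy(18 / 19, 19) < 1e-6         # uniform noise carries no signal
```

At η_n = (|L| − 1)/|L| the noisy label distribution is uniform over L, the per-token entropy reaches ln|L|, and the informativeness drops to zero.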

Examples with Transductive Signals
For transductive signals, such as cross-domain signals, we can first extend the concept class C to an extended concept class C_e with the corresponding extended input space X_e. After that, we can use incidental signals to estimate a better prior distribution π̃_0^e over the extended concept class C_e, and then obtain the corresponding π̃_0 over the original concept class by restricting the concepts from C_e to C. In this way, the informativeness of transductive signals can still be measured by S(π_0, π̃_0) or Ŝ(π_0, π̃_0). The restriction step is similar to Roth and Zelenko (2000).
However, how to compute H(π̃_0) is still unclear; we now provide a way to estimate it. To better illustrate the estimation process for transductive signals, we summarize the core notations in Table 2. For simplicity, we use c(x) to denote the gold system on the gold signals, c̃(x) to denote the perfect system on the incidental signals, and ĉ(x) to denote the silver system trained on the incidental signals. The source domain (target domain) is the domain of the incidental signals (gold signals). We use D to denote the target domain and D̃ to denote the source domain. P_D(x) is the marginal distribution of x under D, and P_D̃(x) is defined similarly.

Algorithm 1: Confidence-Weighted Bootstrapping with Prior Probability (CWBPP). The algorithm utilizes incidental signals to improve the inference stage in semi-supervised learning.
Input: a small dataset with gold signals D = (X_1, Y_1), and a large dataset with inductive signals D̃ = (X_2, Ỹ_2), where X_1 ∩ X_2 = ∅
1  Initialize classifier ĉ = LEARN(D)  (initialize the classifier with gold signals)
2  P(Y_2 | X_2, Ỹ_2) = PRIOR(D, D̃)  (estimate the probability of gold labels for inputs in D̃)
3  while convergence criteria not satisfied do
4    Ŷ = INFERENCE(X_2; ĉ; P(Y_2 | X_2, Ỹ_2))  (get predicted labels of inputs in D̃)
5    ρ̂ = CONFIDENCE(X_2; ĉ; P(Y_2 | X_2, Ỹ_2))  (get confidence for the predicted labels)
6    D̂ = (X_2, Ŷ, ρ̂)  (get a confidence-weighted incidental dataset with predicted labels)
7    ĉ = LEARN(D + D̂)  (learn a classifier from both the gold dataset and the incidental dataset)
8  return ĉ

In our analysis, we assume c̃(x) is a noisy version of c(x) with noise rate η, and ĉ(x) is a noisy version of c̃(x) with noise rate η_1 (η̃_1) in the source (target) domain:

η = E_{x∼P_D(x)} 1(c̃(x) ≠ c(x)), η_1 = E_{x∼P_D̃(x)} 1(ĉ(x) ≠ c̃(x)), η̃_1 = E_{x∼P_D(x)} 1(ĉ(x) ≠ c̃(x)).

In practice, η is unknown, but it can be estimated via Eq. (3) from η_1 in the source domain and η_2 = E_{x∼P_D(x)} 1(ĉ(x) ≠ c(x)) (the noise rate of the silver system compared to the gold system on the target domain). Here we add an assumption: η̃_1 in the target domain is equal to η_1 in the source domain.⁵ In Appx. A.6, we show that Eq. (3) serves as an unbiased estimator of η under some assumptions, but its concentration rate depends on the size of the source data, which requires finer-grained analysis of the estimator in Eq. (3); we leave this as future work. Similar to noisy labels, the corresponding informativeness of transductive signals can then be computed by plugging the estimated η into the noisy-label formula for Ŝ in place of the noise rate. Note that we treat cross-domain signals as a special type of noisy data once η is estimated.
To justify the use of η in the informativeness measure for transductive signals, we show in Theorem A.2 (see Appx. A.7) that (informally speaking) the generalization error of a learner that is jointly trained on data from both source and target domains can be upper bounded by η (plus a function of the size of the concept class and the number of samples).
Finally, we note that although the computation cost of PABI for transductive signals is higher than that for inductive signals (where PABI does not require any training), it is still much cheaper than building combined models with joint training. For example, given T source domains and T target domains, suppose the goal is to select the best source domain for each target domain. With joint training, we need to train T × T = T² models, whereas with PABI we only need to train T + T = 2T models. Furthermore, for each model, joint training on the combination of two domains takes more time than the single-domain training used by PABI, so PABI is much cheaper overall than building combined models with joint training.

Examples with Mixed Incidental Signals
The mix of partial and noisy labels. The corresponding informativeness for the mix of partial and noisy labels is Ŝ(π_0, π̃_0) = √((1 − η_p)(1 − H_n/ln|L|)), where η_p ∈ [0, 1] denotes the ratio of unlabeled tokens, η_n ∈ [0, 1] denotes the noise ratio, and H_n = −(1 − η_n) ln(1 − η_n) − η_n ln(η_n/(|L| − 1)) is the per-token entropy induced by the noise.
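One way to combine the two rates that is consistent with both pure cases (it reduces to √(1 − η_p) when η_n = 0 and to the noisy-label formula when η_p = 0) is sketched below; this is our reading of the mixed case, with names of our own choosing:

```python
from math import log, sqrt

def pabi_mixed(eta_p, eta_n, num_labels):
    """Mixed partial+noisy sketch: unlabeled tokens contribute full
    entropy ln|L|; labeled-but-noisy tokens contribute the per-token
    noise entropy H(eta_n)."""
    if eta_n == 0.0:
        h_noise = 0.0
    else:
        h_noise = (-(1 - eta_n) * log(1 - eta_n)
                   - eta_n * log(eta_n / (num_labels - 1)))
    return sqrt(max(0.0, (1 - eta_p) * (1 - h_noise / log(num_labels))))

# Boundary checks: each pure case is recovered.
assert abs(pabi_mixed(0.3, 0.0, 19) - sqrt(1 - 0.3)) < 1e-12   # pure partial
assert pabi_mixed(0.0, 0.2, 19) > pabi_mixed(0.3, 0.2, 19)     # partiality hurts
```

The product form simply averages the per-token entropy over the unlabeled fraction (full entropy) and the labeled fraction (noise entropy) before normalizing by ln|L|.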
The mix of partial labels and constraints. For BIO constraints with partial labels, we can use dynamic programming with sampling, as in Ning et al. (2019), to estimate H(π̃_0) and hence Ŝ.

Experiments
In this section, we verify the effectiveness of PABI for various inductive and transductive signals on NER and QA. More details about the experimental settings are in Appx. A.8.

Figure 2: Correlations between informativeness (ranging from 0 to 1) and relative performance improvement for NER with various inductive signals (signals with a conditional probability distribution (P(y | x)) different from that of gold signals). On one hand, as shown in (a)-(c), PABI has foreseeing ability similar to that of measures for specific signals. On the other hand, as shown in (d)-(f), PABI can measure the benefits of mixed inductive signals and compare different types of inductive signals, which cannot be handled by existing frameworks. For individual inductive signals, the baselines (gray points) are: one minus the partial rate for partial labels (Ning et al., 2019), one minus the noise rate for noisy labels (Natarajan et al., 2013), and entropy-normalized mutual information for auxiliary labels (Bjerva, 2017). For NER with various inductive signals (f) (with all PABI points from (a)-(e)), Pearson's correlation and Spearman's rank correlation are 0.92 and 0.93. Note that the relative improvement for NER (with informativeness 0.90 but relative improvement 0.70) with auxiliary labels (c) is smaller than expected, mainly due to the imbalanced label distribution (88% O among all BIO labels). More discussion of the imbalanced distribution can be found in Appx. A.4.
Learning with various inductive signals. In this part, we analyze the informativeness of inductive signals for NER. We use OntoNotes NER (18 types of named entities) (Hovy et al., 2006) as the main task. We randomly sample 10% of the sentences (30,716 words) of the development set as the small gold signals and 90% of the sentences (273,985 words) of the development set as the large incidental signals. We use two-layer NNs with 5-gram features as our basic model. The lower bound for our experiments is the result of the model trained with the small gold OntoNotes NER annotations and bootstrapped on the unlabeled text of the large gold OntoNotes NER set, which is 38% F1; the upper bound is the result of the model trained with both the small and the large gold OntoNotes NER annotations, which is 61% F1. To utilize inductive signals, we propose a new bootstrapping-based algorithm, CWBPP, shown in Algorithm 1, where inductive signals are used to improve the inference stage by approximating a better prior. It is an extension of CoDL (Chang et al., 2007) that covers various inductive signals.
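The CWBPP loop can be illustrated with a deliberately tiny stand-in: a count-based token "classifier", a uniform-noise prior, and prediction-probability confidence weights. Everything below (the data, the helper names, the simplistic model) is illustrative and not the paper's actual implementation, which performs structured inference over sentences:

```python
from collections import Counter, defaultdict

def cwbpp(gold, incidental, noise_rate, labels, rounds=3):
    """Toy token-level sketch of Confidence-Weighted Bootstrapping with
    Prior Probability. `gold` holds (word, label) pairs; `incidental`
    holds (word, noisy_label) pairs."""

    def learn(weighted_examples):
        # "Classifier" = weighted label counts per word (a stand-in for LEARN).
        counts = defaultdict(Counter)
        for word, label, weight in weighted_examples:
            counts[word][label] += weight
        return counts

    def prior(noisy_label, label):
        # PRIOR under uniform noise: mass 1 - eta on the observed label,
        # the rest spread over the remaining labels.
        if label == noisy_label:
            return 1.0 - noise_rate
        return noise_rate / (len(labels) - 1)

    gold_weighted = [(w, y, 1.0) for w, y in gold]
    model = learn(gold_weighted)
    for _ in range(rounds):
        relabeled = []
        for word, noisy in incidental:
            # INFERENCE: combine smoothed model counts with the prior.
            scores = {y: (model[word][y] + 1.0) * prior(noisy, y) for y in labels}
            pred = max(scores, key=scores.get)
            # CONFIDENCE: normalized score of the predicted label.
            conf = scores[pred] / sum(scores.values())
            relabeled.append((word, pred, conf))
        model = learn(gold_weighted + relabeled)   # LEARN on D + D-hat
    return lambda word: max(labels, key=lambda y: model[word][y])

labels = ["PER", "O"]
gold = [("alice", "PER"), ("the", "O")]
incidental = [("alice", "PER"), ("bob", "PER"), ("a", "O"), ("the", "O")]
classify = cwbpp(gold, incidental, noise_rate=0.1, labels=labels)
assert classify("bob") == "PER" and classify("the") == "O"
```

The structure mirrors Algorithm 1: a fixed prior estimated once from the incidental labels, repeated inference with confidence weighting, and retraining on the union of the gold and relabeled data.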
We experiment on NER with various inductive signals, including three types of individual signals (partial labels, noisy labels, and auxiliary labels) and two types of mixed signals (signals with both partial and noisy labels, and signals with both partial labels and constraints). As shown in Fig. 2, there is a strong correlation between the relative improvement and PABI for various inductive signals. For individual signals in Fig. 2(a)-2(c), PABI has foreseeing ability similar to that of the measures for specific signals, i.e., 1 − η_p for partial labels, 1 − η_n for noisy labels, and entropy-normalized mutual information⁶ (I(Y; Ỹ)/H(Y)) for auxiliary labels. For mixed signals in Fig. 2(d)-2(e), the strong correlation is quite promising because the benefits of mixed signals cannot be quantified by existing frameworks. Finally, the strong positive correlation across different types of signals in Fig. 2(f) indicates that it is feasible to compare the benefits of different incidental signals with PABI, which cannot be addressed by existing frameworks.

Figure 3: Correlation between informativeness measures (baselines or PABI) and relative performance improvement (via joint training or pre-training) for cross-domain NER and cross-domain QA. The correlation between the relative improvement and PABI is stronger than for the other baselines. Red points indicate the results with PABI, which is based on η in Eq. (3); gray points indicate the results with the naive informativeness measure η_2; black points indicate the results with the vocabulary overlap baseline (Gururangan et al., 2020). The specific correlations can be found in Table 3.
Learning with cross-domain signals. In this part, we consider the benefits of cross-domain signals for NER and QA. For NER, we consider four datasets: OntoNotes, CoNLL, Twitter (Strauss et al., 2016), and GMB (Bos et al., 2017). We aim to detect person names because the only type shared by the four datasets is person.⁷ In our experiments, Twitter NER serves as the main dataset and the other three serve as cross-domain datasets. There are 85 sentences in the small gold training set, 756 sentences (9 times the size of the gold signals) in the large incidental training set, and 851 sentences in the test set. For QA, the SQuAD dataset serves as the main dataset and the other datasets serve as cross-domain datasets. We randomly sample 700 QA pairs as the small gold signals, about 6.2K QA pairs as the large incidental signals (9 times the size of the small gold signals), and 21K QA pairs as the test data. We tried larger datasets for both gold and incidental signals (keeping the ratio between the two sizes at 9), and the results are similar as long as the size of the gold signals is not too large.
We use BERT as our basic model and consider two strategies for making use of incidental signals: joint training and pre-training. For NER, the lower bound is the result with only the small gold Twitter annotations, which is 61.51% F1, and the upper bound is the result with both the small and large gold Twitter annotations, which is 78.31% F1. For QA, the lower bound is the result with only the small gold SQuAD annotations, which is a 26.45% exact match. The upper bound for joint training is the result with both the small gold SQuAD annotations and the large gold SQuAD annotations, which is a 50.72% exact match. Similarly, the upper bound for pre-training is a 49.24% exact match.
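The experiments report relative improvement against these lower and upper bounds; a natural normalization (our assumption about the exact definition used) is:

```python
def relative_improvement(score, lower_bound, upper_bound):
    """Hypothetical normalization of a result between the lower bound
    (gold data only) and the upper bound (all data gold-labeled)."""
    return (score - lower_bound) / (upper_bound - lower_bound)

# With the QA joint-training bounds from the text (exact match);
# 40.0 is a made-up intermediate score for illustration.
gain = relative_improvement(40.0, 26.45, 50.72)
assert 0.0 < gain < 1.0
```

Under this reading, reaching the lower bound gives 0 and reaching the upper bound gives 1, which makes results comparable across tasks with different metric scales.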
The relation between the relative improvement (pre-training or joint training) and the informativeness measures (baselines or PABI) is shown in Fig. 3 and Table 3. There is a strong positive correlation between the relative improvement and PABI for cross-domain signals. Compared to the naive baseline η_2, the adjustment from η_1 is crucial (Eq. (3)), indicating that directly using η_2 is not a good choice. We also show the vocabulary overlap baseline (Gururangan et al., 2020) in Fig. 3 and Table 3.
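The reported Pearson and Spearman correlations can be reproduced for any list of (informativeness, relative improvement) pairs with a few lines of pure Python (the numbers below are made up for illustration; the rank helper does not handle ties):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman rank correlation (no tie handling)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(xs), ranks(ys))

# Made-up informativeness scores vs. relative improvements:
info = [0.2, 0.5, 0.7, 0.9]
gain = [0.15, 0.40, 0.65, 0.80]
assert pearson(info, gain) > 0.9
assert abs(spearman(info, gain) - 1.0) < 1e-9   # same ordering
```

Spearman is the more forgiving diagnostic here: for source-selection it only matters that PABI orders the candidate signals correctly, not that the relationship is linear.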

Conclusion and Future Work
Motivated by PAC-Bayesian theory, this paper proposes a unified framework, PABI, to characterize the benefits of incidental supervision signals by how much uncertainty they reduce in the hypothesis space. We demonstrate the effectiveness of PABI in foreseeing the benefits of various signals, i.e., partial labels, noisy labels, auxiliary labels, constraints, cross-domain signals, and combinations of them, for solving NER and QA. To the best of our knowledge, PABI is the first informativeness measure that can handle various incidental signals and combinations of them; PABI is motivated by the PAC-Bayesian framework and can easily be computed in real-world tasks. Because the recent success of natural language modeling has given rise to many explorations of knowledge transfer across tasks and corpora (Bjerva, 2017; Phang et al., 2018; Zhu et al., 2019; Liu et al., 2019; He et al., 2020; Khashabi et al., 2020), PABI is a concrete step towards explaining some of these observations. We conclude by pointing out several interesting directions for future work.
First, PABI can also provide guidance in designing learning protocols. For instance, in a B/I/O sequence chunking task,⁸ missing labels make it a partial-annotation problem, while treating missing labels as O introduces noise. Since the informativeness of partial signals is larger than that of noisy signals with the same partial/noisy rate (see the examples above), treating missing labels as partial annotation is preferable, as earlier experimental work has demonstrated. We plan to explore more in this direction to apply PABI to designing better learning protocols.
Second, we need to acknowledge that our current exploration of auxiliary labels is still limited. The results for auxiliary labels with a different label set (Fig. 2(c)) are limited by the imbalanced label distribution (Appx. A.4). For more complex cases, such as part-of-speech (PoS) tagging for NER, we can currently only treat them as cross-sentence constraints, and the results are also limited (Appx. A.5).
In the future, we will work more in this direction to better quantify the value of auxiliary signals.
Another interesting direction is to link PABI with the generalization bound. It might be too hard to directly link PABI with the generalization bound for all types of incidental signals, but it is possible for some specific types. For example, for partial and noisy labels, PABI can be directly expressed in the generalization bound, as in Cour et al. Finally, we plan to evaluate PABI in more applications, such as textual entailment and image classification, and on more types of signals, such as cross-lingual and cross-modal signals.
A.1 PABI in the PAC Framework
We propose to reduce the concept class from $C$ to $\tilde{C}$ by using incidental signals. Then PABI in the PAC framework can be written as

$$S(C, \tilde{C}) = 1 - \frac{\ln|\tilde{C}|}{\ln|C|}. \qquad (4)$$

It turns out that Eq. (4) is a special case of Eq. (1) when $\pi^*$ is one-hot over $C$, $\pi_0$ is uniform over $C$, and $\tilde{\pi}_0$ is uniform over $\tilde{C}$. Specifically, suppose $\pi^*$ is one-hot over $C$, $\pi_0$ is uniform over $C$, and $\tilde{\pi}_0$ is uniform over $\tilde{C}$. We have $D_{KL}(\pi^* \| \pi_0) = \ln|C|$ and $D_{KL}(\pi^* \| \tilde{\pi}_0) = \ln|\tilde{C}|$. It follows that

$$S(\pi_0, \tilde{\pi}_0) = 1 - \frac{D_{KL}(\pi^* \| \tilde{\pi}_0)}{D_{KL}(\pi^* \| \pi_0)} = 1 - \frac{\ln|\tilde{C}|}{\ln|C|}.$$

At the same time, we have

$$\hat{S}(\pi_0, \tilde{\pi}_0) = 1 - \frac{H(\tilde{\pi}_0)}{H(\pi_0)} = 1 - \frac{\ln|\tilde{C}|}{\ln|C|}.$$

As shown in the above derivation, the three informativeness measures, PABI in Eq. (1), the approximation of PABI in Eq. (2), and PABI in the PAC framework in Eq. (4), are equivalent, i.e., $S(\pi_0, \tilde{\pi}_0) = \hat{S}(\pi_0, \tilde{\pi}_0) = S(C, \tilde{C})$, in the non-probabilistic case with a finite concept class. The equivalence among the three measures further indicates that both PABI and the approximation of PABI are reasonable.
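As a sanity check, this equivalence can be verified numerically: for a finite concept class, the KL-based, entropy-based, and counting-based measures coincide when $\pi^*$ is one-hot and the priors are uniform. The sketch below is our own illustration; the class sizes are arbitrary.

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions (natural log)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

C_size, C_tilde_size = 1000, 100   # |C| and |C~| (arbitrary illustration)

# pi* is one-hot over C; pi_0 is uniform over C; pi~_0 is uniform over C~,
# where C~ is the subset of C consistent with the incidental signals.
pi_star = [1.0] + [0.0] * (C_size - 1)
pi_0 = [1.0 / C_size] * C_size
pi_tilde_0 = [1.0 / C_tilde_size] * C_tilde_size + [0.0] * (C_size - C_tilde_size)

S_kl = 1 - kl(pi_star, pi_tilde_0) / kl(pi_star, pi_0)         # Eq. (1)
S_entropy = 1 - entropy(pi_tilde_0) / entropy(pi_0)            # Eq. (2)
S_pac = 1 - math.log(C_tilde_size) / math.log(C_size)          # Eq. (4)

print(S_kl, S_entropy, S_pac)  # all three equal 1 - ln(100)/ln(1000) = 1/3
```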
It is worthwhile to notice that the size of the concept class also plays an important role in the lower bound on the generalization error, as shown in the following theorem, indicating that PABI based on the reduction of the concept class is a reasonable measure. Theorem A.1. Let $C$ be a concept class with VC dimension $d > 1$. Then, for any $m \geq 1$ and any learning algorithm $A$, there exists a distribution $D$ over $X$ and a target concept $c \in C$ such that

$$P_{S \sim D^m}\left[R_D(c_S) > \frac{d-1}{32m}\right] \geq \frac{1}{100},$$

where $c_S$ is a concept consistent with $S$ returned by $A$. This is Theorem 3.20 in Chapter 3.4 of Mohri et al. (2018).
However, PABI restricted to the PAC framework cannot handle probabilistic cases. For example, incidental signals can reduce the probability of some concepts even though the concept class itself is not reduced. In this case, $S(C, \tilde{C})$ is zero, but we still benefit from the incidental signals.

A.2 PABI in the Parametric Concept Class
In practice, algorithms are often based on a parametric concept class. The two informativeness measures in the PAC-Bayesian framework, $S(\pi_0, \tilde{\pi}_0)$ and $\hat{S}(\pi_0, \tilde{\pi}_0)$, can be easily adapted to handle a parametric concept class. Given a parametric space $C_w$, we can convert the probability distribution $\pi(C_w)$ over the parametric concept class into a probability distribution $\pi(C)$ over the finite concept class $C = \{c : V^n \to L^n\}$ by clustering concepts in the parametric space according to their outputs on all inputs: the concepts in each cluster produce the same outputs on all inputs as one concept in the finite concept class $C$. We then merge the probabilities of the concepts in each cluster to obtain the probability distribution $\pi(C)$ over the finite concept class $C$. This merging approach can be applied to any concept class that differs from the finite concept class $C$, including nonparametric and semi-parametric concept classes. In practice, we can use sampling algorithms, such as Markov chain Monte Carlo (MCMC) methods, to simulate this clustering strategy.
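To make the clustering-and-merging step concrete, here is a minimal sketch (our own illustration, with a toy linear-threshold class and plain Monte Carlo sampling in place of MCMC): parameter samples are grouped by their output pattern on all inputs, and the samples' probability mass is merged into a distribution over the induced finite concept class.

```python
import random
from collections import Counter

random.seed(0)

# Toy setup: inputs are points on a 1-D grid; concepts are threshold
# classifiers c_w(x) = 1 if x >= w else 0, with parameter w.
inputs = [i / 10 for i in range(11)]          # all inputs (finite here)

def outputs(w):
    """Output pattern of the concept with parameter w on all inputs."""
    return tuple(int(x >= w) for x in inputs)

# Sample parameters from a prior (uniform on [0, 1]); with MCMC we would
# instead draw correlated samples from an unnormalized posterior.
samples = [random.uniform(0, 1) for _ in range(10_000)]

# Cluster samples by output pattern and merge their probability mass.
counts = Counter(outputs(w) for w in samples)
pi_C = {pattern: n / len(samples) for pattern, n in counts.items()}

# pi_C is now a distribution over the finite concept class induced by
# the parametric class: the distinct threshold patterns on the grid.
print(len(pi_C), sum(pi_C.values()))
```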

A.3 Limitations of PABI
Different informativeness measures are based on different assumptions, so we analyze them in detail to understand their limitations in applications.
The informativeness measure $S(C, \tilde{C})$ cannot handle probabilistic signals or infinite concept classes. There are various probabilistic incidental signals, such as soft constraints and probabilistic co-occurrences between an auxiliary task and the main task. An example of probabilistic co-occurrences between part-of-speech (PoS) tagging and NER is that adjectives have a 95% probability of having the label O in NER. As for infinite concept classes, most classifiers are based on infinite parametric spaces, so $S(C, \tilde{C})$ cannot be applied to them.
The informativeness measure $S(\pi_0, \tilde{\pi}_0)$ is hard to compute for some complex cases. In practice, we can use the estimated posterior distribution over the gold data, which is asymptotically unbiased, to estimate it. Another approximation is the informativeness measure $\hat{S} = 1 - H(\tilde{\pi}_0)/H(\pi_0)$. However, $\hat{S}$ is not directly linked to the generalization bound, so more work is needed to guarantee its reliability for some complex probabilistic cases. We leave the theoretical guarantees for more complex cases as future work.
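As a small illustration (our own, with made-up distributions), the approximation $\hat{S} = 1 - H(\tilde{\pi}_0)/H(\pi_0)$ only requires entropy estimates of the prior and of the signal-conditioned posterior, and it handles the probabilistic case where signals reshape probabilities without eliminating any concept:

```python
import math

def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Hypothetical distributions over a small concept class: pi_0 is the
# prior; pi_tilde_0 is the posterior after observing incidental signals,
# which concentrates mass without removing any concept from the class.
pi_0 = [0.25, 0.25, 0.25, 0.25]
pi_tilde_0 = [0.70, 0.10, 0.10, 0.10]

S_hat = 1 - entropy(pi_tilde_0) / entropy(pi_0)
print(round(S_hat, 4))  # positive: the signals reduced uncertainty
```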

A.4 Discussion of Factors in PABI
In this subsection, we consider the impact of the following factors on PABI: base model performance, the size of incidental signals, data distribution, the algorithm, and cost-sensitive losses.
Base model performance. From the generalization bounds in both the PAC and PAC-Bayesian frameworks, we can see that the relative improvement in the bound from reducing $C$ is small when $m$ is large. In practice, the measured relative improvement is the real improvement plus some noise, so the real improvement dominates when $m$ is small, and the noise dominates when $m$ is large. Therefore, PABI may not work well when $m$ is large and the performance on the target task is already good enough.
The size of incidental signals. Our previous analysis is based on a strong assumption that the incidental signals are large enough (ideally $\tilde{m} \to \infty$). A more realistic PABI is based on $\tilde{C}$ with $\tilde{m}$ examples:

$$S(C, \tilde{C}) = \frac{\ln|C_{\tilde{m}}| - \ln|\tilde{C}_{\tilde{m}}|}{\ln|C|} = \left(1 - \frac{\ln|\tilde{C}_{\tilde{m}}|}{\ln|C_{\tilde{m}}|}\right) \frac{\ln|C_{\tilde{m}}|}{\ln|C|},$$

where $C_{\tilde{m}}$ denotes the restriction of the concept class $C$ to the $\tilde{m}$ examples, and similarly for $\tilde{C}_{\tilde{m}}$. Note that the ratio of the intrinsic information in incidental signals is independent of the size $\tilde{m}$, so $\ln|\tilde{C}_{\tilde{m}}|/\ln|C_{\tilde{m}}| = \ln|\tilde{C}|/\ln|C|$ holds for our signals. For example, $\ln|\tilde{C}_{\tilde{m}}|/\ln|C_{\tilde{m}}| = \eta_p$ for partial data with unknown ratio $\eta_p$ does not depend on the size $\tilde{m}$. (1) When $\tilde{m}$ is large enough, $S(C, \tilde{C}) = 1 - \ln|\tilde{C}|/\ln|C|$.
(2) When the sizes of different incidental signals are all $\tilde{m}$, the relative improvement is independent of $\tilde{m}$, because $\ln|C_{\tilde{m}}|/\ln|C|$ is the same constant for different incidental signals. Our experiments are based on this case and thus do not really rely on the assumption that incidental signals are large enough. (3) When the incidental signals we are comparing are not large enough and have different sizes, we need to use $S(C, \tilde{C}) = (1 - \ln|\tilde{C}|/\ln|C|) \times \tilde{m}/|V|^n$ to incorporate that difference. We can replace $|V|^n$ with some reasonable $M$, e.g., the largest size of the incidental signals, so that PABI covers a larger range of values in $[0, 1]$. In the future, we need to explore more in this direction.
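A toy computation of the size-adjusted PABI in case (3) (our own illustration; the intrinsic ratios, signal sizes, and normalizer M are made up) looks like this:

```python
def pabi_size_adjusted(intrinsic_ratio, m_tilde, M):
    """Size-adjusted PABI: (1 - ln|C~|/ln|C|) * m~/M.

    intrinsic_ratio stands in for ln|C~|/ln|C| (e.g., the unknown-partial
    rate eta_p for partial labels); M normalizes the signal size.
    """
    return (1 - intrinsic_ratio) * m_tilde / M

M = 50_000                      # e.g., the largest incidental-signal size
# Two hypothetical signals: cleaner but smaller vs. noisier but larger.
small_clean = pabi_size_adjusted(intrinsic_ratio=0.2, m_tilde=10_000, M=M)
large_noisy = pabi_size_adjusted(intrinsic_ratio=0.6, m_tilde=50_000, M=M)

print(small_clean, large_noisy)  # 0.16 vs 0.4: size can outweigh purity
```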
Data distribution. As for the distribution of examples, both PAC and PAC-Bayesian bounds are distribution-free (see Chapter 2.1 of Mohri et al. (2018)). However, if we consider the joint distribution between examples and labels, such as an imbalanced label distribution, the situation is different. A specific type of joint data distribution corresponds to a restricted concept class $C'$. Therefore, PABI is expected to work well if the reduction from $C'$ is similar to the reduction from $C$ with incidental signals, i.e., $S(C', \tilde{C}') = 1 - \ln|\tilde{C}'|/\ln|C'| \approx 1 - \ln|\tilde{C}|/\ln|C|$.

Algorithm. Different algorithms make different assumptions on the concept class. For example, SVM aims to find the maximum-margin hyperplane (see Chapter 5.4 of Mohri et al. (2018)). Therefore, a specific algorithm is in fact based on a restricted concept class $C'$ (e.g., concepts with a margin in the SVM case). Similarly, PABI is expected to work well if the reduction from $C'$ is similar to the reduction from $C$ with incidental signals. For the same reason, we cannot compare the benefits of various incidental signals across different algorithms. If the algorithm is not expressive enough to take advantage of the incidental signals, we may also not be able to use PABI.
Cost-sensitive loss. For loss functions other than the 0-1 loss, there are similar generalization bounds in the PAC and PAC-Bayesian frameworks (using the complexity of the concept class and the sample size) (Bartlett et al., 2006; Ciliberto et al., 2016). Therefore, PABI can also be used (possibly with minor modifications) for cost-sensitive loss functions.

A.5 More Examples with Incidental Signals
In this subsection, we show more examples with incidental signals, including within-sentence constraints, cross-sentence constraints, auxiliary labels, cross-lingual signals, cross-modal signals, and the mix of cross-domain signals and constraints.
Within-Sentence Constraints. For within-sentence constraints, we show three types of common constraints in NLP: BIO constraints, assignment constraints, and ranking constraints.
• BIO constraints are widely used in sequence tagging tasks, such as NER. Under BIO constraints, I-X must follow B-X or I-X, where "X" is a finer type such as PER (person) or LOC (location). We consider a simple case here with only the three labels B, I, and O. Let $r_n$ denote the fraction of label sequences of length $n$ that satisfy the BIO constraint. We have $\ln|C| = |V|^n \ln|L|^n$ and $\ln|\tilde{C}| = |V|^n (\ln|L|^n + \ln r_n)$. Therefore,

$$\hat{S}(\pi_0, \tilde{\pi}_0) = S(\pi_0, \tilde{\pi}_0) = S(C, \tilde{C}) = 1 - \frac{\ln|L|^n + \ln r_n}{\ln|L|^n}.$$

• Assignment constraints can be used in various types of semantic parsing tasks, such as semantic role labeling (SRL). Assume we need to assign $d$ agents to $d'$ tasks such that the agent nodes and the task nodes form a bipartite graph (without loss of generality, assume $d \leq d'$). Each agent is represented by a feature vector in $V_f$. Without the constraint there are $(d')^d$ labelings, while the constraint leaves $d'!/(d'-d)!$ valid assignments, so

$$\hat{S}(\pi_0, \tilde{\pi}_0) = S(\pi_0, \tilde{\pi}_0) = 1 - \frac{\ln\left(d'!/(d'-d)!\right)}{d \ln d'}.$$

This informativeness does not rely on the choice of $V_f$, where $V_f$ denotes the discrete feature space for arguments.
• Ranking constraints can be used in ranking problems, such as temporal relation extraction. For a ranking problem with $t$ items, there are $d = t(t-1)/2$ pairwise comparisons in total. Its structure is a chain following the transitivity constraints, i.e., if A < B and B < C, then A < C. Without the constraint there are $2^d$ labelings of the pairwise comparisons, while only the $t!$ total orders are consistent with transitivity, so

$$\hat{S}(\pi_0, \tilde{\pi}_0) = S(\pi_0, \tilde{\pi}_0) = 1 - \frac{\ln t!}{\frac{t(t-1)}{2} \ln 2}.$$

This informativeness does not rely on the choice of $V_f$, where $V_f$ denotes the discrete feature space for events.
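The counting arguments behind such constraint-based measures can be checked by brute force on small instances. The sketch below (our own check; the sequence length and item count are arbitrary) enumerates all label sequences to compute the BIO-valid fraction, and verifies that exactly the t! transitive orientations survive the ranking constraint.

```python
import math
from itertools import product, permutations

# --- BIO constraint: fraction of valid sequences over labels {B, I, O} ---
def bio_valid(seq):
    """I may only follow B or I (it cannot start a span or follow O)."""
    prev = 'O'
    for lab in seq:
        if lab == 'I' and prev not in ('B', 'I'):
            return False
        prev = lab
    return True

n = 6
total = 3 ** n
valid = sum(bio_valid(seq) for seq in product('BIO', repeat=n))
r_n = valid / total
S_bio = -math.log(r_n) / (n * math.log(3))  # = 1 - (ln|L|^n + ln r_n)/ln|L|^n
print(valid, total, round(S_bio, 4))

# --- Ranking constraint: transitive orientations of t items = total orders ---
t = 4
pairs = [(i, j) for i in range(t) for j in range(i + 1, t)]

def consistent(bits):
    """bits[k] == 1 means item i precedes item j for pair k; check whether
    the orientation agrees with at least one total order (transitivity)."""
    for order in permutations(range(t)):
        pos = {item: p for p, item in enumerate(order)}
        if all((pos[i] < pos[j]) == bool(b) for (i, j), b in zip(pairs, bits)):
            return True
    return False

n_consistent = sum(consistent(bits) for bits in product((0, 1), repeat=len(pairs)))
print(n_consistent, math.factorial(t))  # both 24: only total orders survive
```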

Cross-sentence Constraints.
For cross-sentence constraints, we consider a common example: global statistics based on 2-tuples of tokens, i.e., pairs of tokens in different sentences that must have the same labels. Suppose we can group words into $K$ groups with probability $p$. In this way, we have $\hat{S}(\pi_0, \tilde{\pi}_0) \approx \sqrt{p}$. The approximation holds as long as $|L|$, $|V|$, and $n$ are not all too small. For example, as shown in Table 4, the percentage of 5-gram words with unique NER labels is 99.37%, so ideally the corresponding PABI will be $\sqrt{0.9937} = 0.9968$. It is worthwhile to note that $k$-gram words with unique labels can also be caused by the low frequency of the appearance of the $k$-grams. In our experiments, we only consider the $k$-grams with unique labels that appear at least twice in the data. We experiment on NER with three types of cross-sentence constraints: uni-gram words with unique NER labels, bi-gram words with unique NER labels, and 5-gram part-of-speech (PoS) tags with unique NER labels. The results are shown in Fig. 4.

Auxiliary labels. For auxiliary labels, we show two examples as follows:

• For a multi-class sequence tagging task, we use the corresponding detection task as auxiliary signals. Given a multi-class sequence tagging task with $C$ labels in the BIO format (Ramshaw and Marcus, 1999), we have 3 labels for detection and $2C+1$ labels for classification. Thus,

$$\hat{S}(\pi_0, \tilde{\pi}_0) = S(\pi_0, \tilde{\pi}_0) = S(C, \tilde{C}) = \frac{\ln 3}{\ln(2C+1)}.$$

• For a fine-grained sequence tagging task, we can use a coarser tagging task as auxiliary signals. In the BIO setting with $C'$ coarse classes, we have

$$\hat{S}(\pi_0, \tilde{\pi}_0) = S(\pi_0, \tilde{\pi}_0) = S(C, \tilde{C}) = \frac{\ln(2C'+1)}{\ln(2C+1)}.$$

Note that PABI is consistent with the entropy-normalized mutual information (see footnote 6), because $\hat{S}(\pi_0, \tilde{\pi}_0) = I(Y; \tilde{Y})/H(Y)$ for auxiliary labels.
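These closed forms are cheap to evaluate. A small sketch (our own; the 99.37% figure comes from the 5-gram discussion above, the class count C = 18 is an assumed OntoNotes-style setting, and the detection formula assumes uniform labels) computes the cross-sentence-constraint value and the detection-as-auxiliary informativeness:

```python
import math

# Cross-sentence constraint: PABI ~ sqrt(p), where p is the probability
# that words can be grouped to share labels (99.37% for 5-grams, Table 4).
p = 0.9937
print(round(math.sqrt(p), 4))  # 0.9968, as quoted in the text

# Auxiliary labels: detection (3 BIO labels) as auxiliary signal for a
# C-class tagging task with 2C+1 BIO-format labels, assuming uniform labels.
def detection_pabi(num_classes):
    return math.log(3) / math.log(2 * num_classes + 1)

print(round(detection_pabi(18), 4))  # e.g., C = 18 entity types
```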
Cross-lingual signals. For cross-lingual signals, we can use multilingual BERT to get $\hat{c}$ in the extended input space $(V \cup V')^n$. After that, $\eta_1$ and $\eta_2$ can be computed accordingly.
Cross-modal signals. For cross-modal signals, we only consider the case where the labels of the gold and incidental signals are the same and the inputs of the gold and incidental signals are aligned. A common situation is a video with visual, acoustic, and textual information. In this case, the images and speech related to the texts can be used as cross-modal information. We can use cross-modal mappings between speech/images and texts (e.g., Chung et al. (2018)) to estimate $\eta_1$ and $\eta_2$ for cross-modal signals.
The mix of cross-domain signals and constraints. Let $\tilde{c}$ denote the perfect system on cross-domain signals that satisfies the constraints on inputs of gold signals, and $\hat{c}$ denote the model trained on cross-domain signals that satisfies the constraints on inputs of gold signals. In this way, we can estimate $\eta_1$ and $\eta_2$ by enforcing the constraints in the inference stage.

A.6 Derivation of Equation (3)
For simplicity, we use $Y$ to denote $c(x)$, $\tilde{Y}$ to denote $\tilde{c}(x)$, and $\hat{Y}$ to denote $\hat{c}(x)$. We then re-write the definitions of $\eta$, $\eta_1$, and $\eta_2$ as $\eta = E_{x \sim P_D(x)} 1(c(x) \neq \tilde{c}(x)) = P(Y \neq \tilde{Y})$, $\eta_1 = E_{x \sim P_D(x)} 1(\hat{c}(x) \neq \tilde{c}(x)) = P(\hat{Y} \neq \tilde{Y})$, and $\eta_2 = E_{x \sim P_D(x)} 1(\hat{c}(x) \neq c(x)) = P(\hat{Y} \neq Y)$. Note that $L$ is the label set for the task. Considering all three systems in the target domain, and assuming that the error events of $Y$ and $\hat{Y}$ relative to $\tilde{Y}$ are independent, with errors uniformly distributed over the remaining $|L| - 1$ labels, we have

$$1 - \eta_2 = (1 - \eta)(1 - \eta_1) + \frac{\eta \eta_1}{|L| - 1},$$

which gives Eq. (3):

$$\eta_2 = \eta + \eta_1 - \frac{|L|}{|L| - 1} \eta \eta_1.$$
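Under the assumption that the model's errors and the incidental noise are independent and uniform over the other labels, the three error rates satisfy $\eta_2 = \eta + \eta_1 - \frac{|L|}{|L|-1}\eta\eta_1$. This relation can be sanity-checked by simulation; the sketch below (our own check, with arbitrary rates and label-set size) draws $\tilde{Y}$, corrupts it independently into $Y$ and $\hat{Y}$, and compares the empirical $P(\hat{Y} \neq Y)$ against the closed form.

```python
import random

random.seed(0)

L = 9            # label-set size |L|
eta = 0.3        # P(Y != Y~): noise between gold and incidental concepts
eta1 = 0.2       # P(Y^ != Y~): error of the trained model vs. incidental

def corrupt(y_tilde, rate):
    """Flip y_tilde to one of the other |L|-1 labels with probability rate."""
    if random.random() < rate:
        return random.choice([l for l in range(L) if l != y_tilde])
    return y_tilde

trials = 200_000
mismatches = 0
for _ in range(trials):
    y_tilde = random.randrange(L)
    y = corrupt(y_tilde, eta)       # gold label
    y_hat = corrupt(y_tilde, eta1)  # model prediction (independent error)
    mismatches += (y_hat != y)

empirical = mismatches / trials
closed_form = eta + eta1 - L / (L - 1) * eta * eta1
print(round(empirical, 3), round(closed_form, 3))  # should closely agree
```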

A.7 PABI for Transductive Signals
Assumption I: $\tilde{c}(x)$ is a noisy version of $c(x)$ with a noise ratio $\eta$ in both the target and source domains.

A concept is a function $c: X \to \{0, 1\}$. The probability according to the distribution $D$ that a concept $c$ disagrees with a labeling function $f$ (which can also be a concept) is defined as $R_D(c) = E_{x \sim D}[\ell(f(x), c(x))]$. Note that here $\ell(y, c(x)) = |y - c(x)|$ is the loss function and $y$ is the gold label for $x$. We denote by $R_\alpha(c)$ ($\alpha \in [0, 1]$) the corresponding weighted combination of true source and target errors, measured with respect to $\tilde{D}$ and $D$, as follows: $R_\alpha(c) = \alpha R_D(c) + (1 - \alpha) R_{\tilde{D}}(c)$.

Theorem A.2. Let $C$ be a concept class of VC dimension $d$ for binary classification. Let $S^+$ be a labeled sample of size $m$ generated by drawing $\beta m$ points ($S$) from $D$ according to $c$ and $(1 - \beta) m$ points ($\tilde{S}$) from $\tilde{D}$ (the distribution of incidental signals) according to $\tilde{c}$. Let $\hat{c} = \arg\min_{c \in C} R_{S^+, \frac{1}{2}}(c) = \arg\min_{c \in C} \frac{1}{2} R_S(c) + \frac{1}{2} R_{\tilde{S}}(c)$ be the empirical joint error minimizer, $c^*_T = \arg\min_{c \in C} R_D(c)$ the target error minimizer, and $c^* = \arg\min_{c \in C} R_{\tilde{D}}(c) + R_D(c)$ the joint error minimizer. Under Assumption I, and assuming that $C$ is expressive enough so that both the target error minimizer and the joint error minimizer achieve zero error, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$,

$$R_D(\hat{c}) \leq \eta + 2\sqrt{\frac{1}{\beta(1-\beta)}} \sqrt{\frac{d \log(2em/d) + \log(4/\delta)}{2m}}.$$

Proof.
Lemma A.4. For a fixed concept $c$ from $C$ with VC dimension $d$, if a random labeled sample $S^+$ of size $m$ is generated by drawing $\beta m$ points ($S$) from $D$ and $(1 - \beta) m$ points ($\tilde{S}$) from $\tilde{D}$, and labeling them according to $f_D$ and $f_{\tilde{D}}$ respectively, then for any $\delta \in (0, 1)$, with probability at least $1 - \delta$ (over the choice of the samples),

$$|R_{S^+, \alpha}(c) - R_\alpha(c)| \leq \sqrt{\frac{\alpha^2}{\beta} + \frac{(1-\alpha)^2}{1-\beta}} \sqrt{\frac{d \log(2em/d) + \log(4/\delta)}{2m}},$$

where $R_{S^+, \alpha}(c) = \alpha R_S(c) + (1 - \alpha) R_{\tilde{S}}(c)$ and $e$ is the base of the natural logarithm.

A.8 Details of Experimental Settings
In this subsection, we briefly highlight some important settings in our experiments; more details can be found in our released code.
NER with individual inductive signals. For partial labels, we experiment on NER with four different partial rates: 0.2, 0.4, 0.6, and 0.8. For noisy labels, we experiment on NER with seven different noisy rates: 0.1-0.7. For auxiliary labels, we experiment with two auxiliary tasks: named entity detection and coarse NER (CoNLL annotations with 4 types of named entities (Sang and De Meulder, 2003)).
NER with mixed inductive signals. A more complex case is the comparison between mixed inductive signals. For the first type of mixed signals, we experiment on the combinations of three unknown partial rates (0.2, 0.4, and 0.6) and four noisy rates (0.1, 0.2, 0.3, and 0.4). For the second type of mixed signals, we experiment on the combinations of the BIO constraint and five unknown partial rates (0.2, 0.4, 0.6, 0.8, and 1.0).
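To illustrate how such signals can be synthesized from gold annotations, here is a minimal sketch (our own construction, not the paper's exact pipeline; the label set and sentence are made up) that produces partial and noisy versions of a BIO tag sequence at given rates:

```python
import random

random.seed(0)
LABELS = ['O', 'B-PER', 'I-PER', 'B-LOC', 'I-LOC']

def make_partial(tags, partial_rate):
    """Drop each label with prob. partial_rate (None marks a missing label)."""
    return [None if random.random() < partial_rate else t for t in tags]

def make_noisy(tags, noisy_rate):
    """Replace each label with a different random label with prob. noisy_rate."""
    return [random.choice([l for l in LABELS if l != t])
            if random.random() < noisy_rate else t for t in tags]

gold = ['B-PER', 'I-PER', 'O', 'O', 'B-LOC', 'O']
partial = make_partial(gold, partial_rate=0.4)
noisy = make_noisy(gold, noisy_rate=0.2)
print(partial)
print(noisy)
```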
NER with various inductive signals. After we put the three types of individual inductive signals and the two types of mixed inductive signals together, we still see a correlation between PABI and the relative performance improvement in experiments in Fig. 2(f).
NER with cross-domain signals. Because we only focus on person names, many sentences in the original dataset do not include any entities. We randomly sample sentences so that 50% of the sentences contain no entities and 50% contain at least one entity. $\eta_1$ and $\eta_2$ are computed using sentence-level accuracy.
QA with cross-domain signals. For consistency, we only keep one answer for each question in all datasets. Another thing worth noting is that the most informative QA dataset is not always the same for different main QA datasets. For example, for NewsQA, the most informative QA dataset is SQuAD, while the most informative QA dataset for SQuAD is QAMR.
Experimental settings for learning with various inductive signals. The 2-layer NN we use in CWBPP (Algorithm 1) has a hidden size of 4096, ReLU activations, and a cross-entropy loss. For the embeddings, we use 300-dimensional GloVe embeddings (Pennington et al., 2014). The training batch size is 10000, and the optimizer is Adam (Kingma and Ba, 2015) with a learning rate of 3e-4. When we initialize the classifier with gold signals (line 1), the number of training epochs is 20. After that, we conduct bootstrapping for 5 iterations (lines 3-7). The confidence for predicted labels is the predicted probability of the classifier (line 5). In each iteration of bootstrapping, we further train the classifier on the joint data for 1 epoch (line 7). It usually takes several minutes to run the experiment for each setting on one GeForce RTX 2080 GPU.
Experimental settings for learning with cross-domain signals. For BERT, we use the pre-trained BERT-base PyTorch implementation (Wolf et al., 2020) with common parameter settings. Specifically, for NER, the pre-trained BERT-base is case-sensitive, the max length is 256, the batch size is 8, the number of epochs is 4, and the learning rate is 5e-5. For QA, the pre-trained BERT-base is case-insensitive, the max length is 384, the batch size is 16, the number of epochs is 4, and the learning rate is 5e-5. It usually takes less than half an hour to run the experiment for each setting on one GeForce RTX 2080 GPU.