Automatic Rule Induction for Efficient Semi-Supervised Learning



Introduction
Large-scale pretrained neural networks can struggle to generalize from small amounts of labeled data (Devlin et al., 2019), motivating approaches that leverage both labeled and unlabeled data. This is partially due to the black-box and correlational nature of neural networks, which confers the additional difficulties of uninterpretability (Bolukbasi et al., 2021) and unreliability (Sagawa et al., 2020).
A growing body of research seeks to ameliorate these issues by augmenting neural networks with symbolic components: heuristics, logical formulas, program traces, network templating, blacklists, etc. (Arabshahi et al., 2018; Galassi et al., 2020; Wang et al., 2021). In this paper, we refer to these components as rules. Symbolic reasoning has attractive properties: rules need little or no data to generalize systematically, and rules are inherently interpretable with respect to their constituent operations.
In this paper we propose a general-purpose framework for the automatic discovery and integration of symbolic rules into pretrained models. The framework contrasts with prior neuro-symbolic NLP research in two ways. First, we present a fully automatic rule generation procedure, whereas prior work has largely focused on manually crafted rules (Mekala and Shang, 2020; Awasthi et al., 2020; Li et al., 2021) or semi-manual rule generation procedures (Boecking et al., 2020; Galhotra et al., 2021; Zhang et al., 2022). With these existing techniques, practitioners must formulate and implement their rules by hand, creating a second-order "rule annotation" burden on top of the data labeling process.
Second, the proposed framework is general purpose and can be applied to any classification dataset. This contrasts with prior research that proposes task- and domain-specific symbolic logic, through weak supervision signals (Ratner et al., 2017; Awasthi et al., 2020; Safranchik et al., 2020), special loss functions (Xu et al., 2018), model architectures (Seo et al., 2021), and prompt templates (Schick and Schütze, 2020a).
Our framework consists of two steps. First, we generate symbolic rules from data. This involves training low-capacity machine learning models on a reduced feature space, extracting artifacts from these models which are predictive of the class labels, then converting these artifacts into rules. After the rule induction step, we use the induced rules to amplify training signal in the unlabeled data. In particular, we adopt a rule-augmented self-training procedure, using an attention mechanism to aggregate the predictions of a backbone classifier (e.g. BERT) and the rules.
We evaluate the ARI framework across nine text classification and relation extraction tasks. The results suggest that the proposed algorithm can exceed state-of-the-art semi-supervised baselines, and that these gains may be because the model learns to rely more heavily on rules for difficult-to-predict examples. We also show that the proposed rule induction strategy can rival human-crafted rules in terms of quality. Last, we demonstrate the interpretability of the overall system. In summary, the contributions of this paper are:
• Methods for automatically inducing and filtering symbolic rules from data.
• A self-training algorithm and attention mechanism for incorporating these rules into pretrained NLP models.
• Evidence suggesting the proposed framework can be layered beneath a number of existing algorithms to boost performance and interpretability.

The ARI Framework
The proposed rule induction framework seeks to automatically induce symbolic rules from labeled data. Next, the rules can be used to amplify training signal on the unlabeled data. These steps are depicted in Fig. 1. More formally, assume we are given a target classification task consisting of labeled classification data L = {(x_i, y_i)}_{i=1}^{M} and unlabeled data U = {x_i}_{i=M+1}^{M+N}, where each x_i is a text string and y_i ∈ {1, ..., K}. Our proposed method uses the labeled data L to generate a set of symbolic prediction functions ("rules") R = {r_j}_{j=1}^{R} that take the text and output a label or abstain: r_j(x) ∈ {−1} ∪ {1, ..., K}. We then train a joint system which models P(y | x; L, U, R), i.e., an estimator which utilizes the labeled data, unlabeled data, and rules to make reliable and interpretable predictions.

Rule Induction
We begin by explaining our rule induction technique. Concretely, the goal is to generate a set of prediction functions which use the text to output a label or abstain. We operationalize this as a three-stage pipeline. First, we featurize the text. Second, we use these features to construct rule-based predictor functions. Last, we filter the rules in order to block them from firing on risky examples (to maximize precision).

An open-source implementation of the framework is available at: https://github.com/microsoft/automatic-rule-induction

Figure 1: Overview of the proposed Automatic Rule Induction (ARI) framework. First, rules are induced from labeled data (top, shown with real example rules). Second, the rules are integrated into pre-trained NLP models via an attention mechanism and a self-training procedure (bottom).
Text Featurization. In the first step, the input text x_j is converted into a binary or continuous feature vector ϕ(x_j) ∈ R^d that is more amenable to symbolic reasoning than the raw text.
1. Ngram (ϕ_N). We adopt a bag-of-words model of the text, converting each string into a binary vector reflecting the presence or absence of words in a vocabulary of size V.
2. PCA (ϕ_P). Intuitively, if we only have a small amount of labeled data, then common ngrams may be spuriously correlated with the labels. To tackle this issue, we follow Arora et al.; Yang et al. (2021) by subtracting off a vector of shared information from each feature matrix. Specifically, we construct an ngram feature matrix P ∈ R^{(M+N)×d} from both the labeled and unlabeled texts in a dataset, i.e., the jth row P_j is the ngram feature vector of the jth text. The singular value decomposition (SVD) of this matrix is P = UΣV^T. The first principal component v, defined as the first column of V ∈ R^{d×d}, captures the most "common" part of all samples (e.g., common words). We then remove the projection of each feature vector ϕ_N(x) onto v:

ϕ_P(x) = ϕ_N(x) − (ϕ_N(x) · v) v

We hypothesize that this can help remove common information that is shared across many texts, in order to isolate the most unique and salient lexical phenomena.
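The featurization step can be sketched as follows. The helper names, toy texts, and vocabulary are illustrative, not from the paper's released code; the sketch builds binary ngram features (ϕ_N) and then removes each row's projection onto the first principal component (ϕ_P) via SVD:

```python
import numpy as np

def ngram_featurize(texts, vocab):
    """Binary bag-of-words features (phi_N): 1 if the vocab word occurs."""
    feats = np.zeros((len(texts), len(vocab)))
    for i, t in enumerate(texts):
        words = set(t.lower().split())
        for j, w in enumerate(vocab):
            if w in words:
                feats[i, j] = 1.0
    return feats

def remove_common_component(P):
    """Subtract each row's projection onto the first right-singular
    vector v of P (the 'common' direction shared across texts)."""
    _, _, Vt = np.linalg.svd(P, full_matrices=False)
    v = Vt[0]                      # first principal direction, shape (d,)
    return P - np.outer(P @ v, v)  # phi_P(x) = phi_N(x) - (phi_N(x) . v) v

texts = ["the movie was great", "the movie was awful", "the plot was great"]
vocab = ["the", "movie", "great", "awful", "plot"]
P = ngram_featurize(texts, vocab)
P_pca = remove_common_component(P)
# every row of P_pca is now orthogonal to the removed common direction v
```

After this step, a very frequent word like "the" no longer dominates the feature geometry, which is the intent of the common-component removal.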
Rule Discovery. Armed with a featurization of the texts in L, we proceed by generating symbolic rules from the features which are capable of predicting the labels with high precision. In practice, these rules are artifacts of low-capacity machine learning models. We experiment with two rule generation algorithms.
The first rule generation algorithm uses a linear model and can be applied to ngram-based (binary) feature spaces. It involves training a simple linear model m(x_j) = σ(W ϕ(x_j)) containing one matrix of parameters W ∈ R^{K×V} that predicts class labels from the input features. It is trained using a cross-entropy loss function and an l_2 regularization term (Tibshirani, 1996). Note that in this case σ represents an element-wise sigmoid function (Mao et al., 2014). Next, we select the R largest weights in W and create one rule from each weight. If a selected weight w_{i,k} corresponds to feature f_i and label k, then we create a rule r that predicts label k if the ith dimension of ϕ(x_j) is 1, otherwise abstaining:

r(x_j) = k if ϕ(x_j)_i = 1, else −1

The second rule generation algorithm uses decision trees and can be applied to ngram- or PCA-based (binary or continuous) feature spaces. Intuitively, we want to find regions inside the range of each feature (or combination of features) that are predictive of the labels. We accomplish this by training a random forest classifier containing R decision trees at a depth of D (we use D = 3 in the experiments). To make a rule from each decision tree, we apply a confidence threshold τ to the predicted label distribution in order to control the boundary between prediction and abstainment. In other words, if a decision tree t_i outputs a probability distribution p̂ over the labels, i.e., t_i(ϕ(x_j)) = p̂_{i,j}, then we construct a rule r_i such that:

r_i(x_j) = argmax_k p̂_{i,j,k} if max_k p̂_{i,j,k} ≥ τ, else −1

Note that due to the bagged construction of the random forest, we hypothesize that these decision trees will yield rules which can be aggregated for robust supervision signal.
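A minimal sketch of both rule-discovery strategies. It assumes a weight matrix W from an already-trained linear model and a tree-like predict-proba function; all names and toy values here are hypothetical, not the paper's implementation:

```python
import numpy as np

def rules_from_linear_weights(W, n_rules):
    """Turn the n_rules largest weights of a trained linear model
    W (shape K x V: classes x features) into keyword-style rules.
    Each rule fires label k when feature i is present, else abstains (-1)."""
    top = np.argsort(W, axis=None)[::-1][:n_rules]
    rules = []
    for idx in top:
        k, i = np.unravel_index(idx, W.shape)
        # capture k and i by value in the closure
        rules.append(lambda feats, k=int(k), i=int(i): k if feats[i] == 1 else -1)
    return rules

def rule_from_tree(tree_predict_proba, tau):
    """Wrap a decision tree's predicted label distribution with a
    confidence threshold tau: predict the argmax if confident, else abstain."""
    def rule(feats):
        p = tree_predict_proba(feats)
        return int(np.argmax(p)) if np.max(p) >= tau else -1
    return rule

# toy weights: 2 classes, 4 ngram features; feature 3 strongly indicates class 1
W = np.array([[0.2, 0.9, 0.1, 0.0],
              [0.1, 0.0, 0.3, 1.5]])
rules = rules_from_linear_weights(W, n_rules=2)
x = np.array([0, 0, 0, 1])          # only feature 3 is present
print([r(x) for r in rules])        # prints [1, -1]: top rule fires, other abstains
```

The same wrapper pattern applies to real decision trees: any function that maps features to a label distribution can be thresholded into an abstaining rule.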
Rule Filtering. Since rules are allowed to abstain from making predictions, we can introduce dynamic filtering mechanisms that block rules from firing on examples where the rule is likely to make errors. This helps increase the precision of our rules and increase the fidelity of our downstream rule integration activities.
• Training accuracy. The rules are not perfect predictors and can make errors on the training set. We randomly sample a proportion of these errors (50% in the experiments) and replace the incorrectly predicted value with abstainment (-1).
• Semantic coverage. We design a filter to ensure that the "covered" subset of examples (examples where at least one rule fires) resembles the training set. In detail, after a rule r_i fires on input text x_j, predicting label r_i(x_j) = l, we use the Sentence BERT framework (Reimers and Gurevych, 2019) and a pretrained mpnet model (Song et al., 2020) to obtain embeddings for the input sentence x_j and all training samples that have the same label as the rule's prediction: {x_i ∈ L : y_i = l}. We then compute the cosine similarity between the input's embedding and the training set embeddings. If the maximum of these similarities is below some threshold (0.8 in the experiments) then we block the rule r_i from firing and replace its prediction l with abstainment (-1).
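The semantic coverage filter can be sketched as follows. Toy two-dimensional vectors stand in for the Sentence-BERT embeddings, and the function name is illustrative:

```python
import numpy as np

def semantic_coverage_filter(pred_label, x_emb, train_embs, train_labels,
                             threshold=0.8):
    """Block a rule's prediction when the input is unlike every training
    example with that label; otherwise let the prediction through."""
    same = train_embs[train_labels == pred_label]
    if len(same) == 0:
        return -1
    # cosine similarity between the input and each same-label training example
    sims = same @ x_emb / (np.linalg.norm(same, axis=1) * np.linalg.norm(x_emb))
    return pred_label if sims.max() >= threshold else -1

train_embs = np.array([[1.0, 0.0], [0.0, 1.0]])
train_labels = np.array([0, 1])
x_near = np.array([0.9, 0.1])   # close to the label-0 training example
x_far = np.array([-1.0, 0.0])   # unlike anything labeled 0
print(semantic_coverage_filter(0, x_near, train_embs, train_labels))  # 0
print(semantic_coverage_filter(0, x_far, train_embs, train_labels))   # -1
```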

Rule Integration
After we have induced weak symbolic rules {r_i}_{i=1}^{R} from the labeled data L, we can leverage the rules and unlabeled data U for extra training signal.
Our method is inspired by recent work in weak supervision and semi-supervised learning (Karamanolakis et al., 2021;Du et al., 2020). It consists of a backbone classification model (e.g. BERT) and a proposed rule aggregation layer. The aggregation layer uses an attention mechanism to combine the outputs of the backbone model and rules. The parameters of the backbone and aggregator are jointly trained via a self-training procedure over the labeled and unlabeled data.
In more detail, the backbone model b(·) is a standard BERT-based classifier with a prediction head attached to the [CLS] embedding. This classifier outputs a probability distribution over the possible labels.
The aggregation layer a(·) is trained to optimally combine the predictions of the backbone model and rules. It does so via the following attention mechanism. The layer first initializes trainable embeddings e_j for each rule r_j, and an embedding e_s for the backbone. Next, it computes dot-product attention scores between these embeddings and an embedded version of the input text (h_i). The final model prediction is a weighted sum of the backbone and rule predictions, where the weights are determined by the attention scores.
Specifically, if the set of rules activated on input x_i is R_i = {r_j ∈ R : r_j(x_i) ≠ −1}, and the function g(·) ∈ R^K returns a one-hot encoding of its input, then the rule aggregation layer computes a probability distribution over the labels:

a(x_i) = (1/Q) [ s_i^s b(x_i) + Σ_{r_j ∈ R_i} s_i^j g(r_j(x_i)) + u ]    (1)

where the attention scores are calculated as

s_i^j = σ(p(h_i)^T e_j),    s_i^s = σ(p(h_i)^T e_s)

Note that p is a multi-layer perceptron that projects the input representation h_i into a shared embedding space, Q is a normalizing factor to ensure a(x_i) is a probability distribution, and σ(·) is the sigmoid function. Following Karamanolakis et al. (2021), the quantity u is a uniform smoothing term.
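A numpy sketch of the aggregation layer under these definitions. The embeddings and inputs are toy values, the backbone is represented only by its output distribution, and h plays the role of the projected representation p(h_i):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_hot(k, K):
    g = np.zeros(K)
    g[k] = 1.0
    return g

def aggregate(h, backbone_probs, rule_preds, rule_embs, backbone_emb, u=0.1):
    """Attention-weighted mix of the backbone's distribution and the
    one-hot predictions of the rules that fire; the uniform smoothing
    mass u is spread evenly over the K labels."""
    K = len(backbone_probs)
    out = sigmoid(h @ backbone_emb) * backbone_probs + u * np.ones(K) / K
    for pred, e in zip(rule_preds, rule_embs):
        if pred == -1:              # abstaining rules are skipped
            continue
        out += sigmoid(h @ e) * one_hot(pred, K)
    return out / out.sum()          # divide by Q so the result is a distribution

h = np.array([0.5, -0.2])
backbone_probs = np.array([0.7, 0.3])
a = aggregate(h, backbone_probs,
              rule_preds=[1, -1],   # second rule abstains
              rule_embs=[np.array([1.0, 0.0]), np.array([0.0, 1.0])],
              backbone_emb=np.array([0.3, 0.3]))
print(a)                            # a valid probability distribution over 2 labels
```

In the real system the rule embeddings and the projection network would be trainable parameters; here they are fixed toy vectors to show the data flow.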
In order to train the overall system, we first pretrain the backbone on the labeled data L. Next we iteratively co-train the backbone and aggregation layer: we train the aggregator (freezing the parameters of the backbone), then train the backbone (freezing the aggregator). The process is as follows:

1. Train the backbone s using labeled data L and a cross-entropy loss function, where b(x_i)_{y_i} denotes the probability assigned to the ground-truth class y_i:

ℓ_stu^sup = − Σ_{(x_i, y_i) ∈ L} log b(x_i)_{y_i}

2. Repeat until convergence:

(a) Train the aggregator t on labeled data using a cross-entropy loss function:

ℓ_tea^sup = − Σ_{(x_i, y_i) ∈ L} log a(x_i)_{y_i}

(b) Train the aggregator on unlabeled data U with a minimum entropy objective (Grandvalet and Bengio, 2004). This encourages the aggregator to learn attention scores that favor rule agreement, because the aggregator will be encouraged to output more focused probability distributions, thereby placing less importance on spurious rules that disagree:

ℓ_tea^unsup = − Σ_{x_i ∈ U} a(x_i)^T log a(x_i)

where log a(x_i) ∈ R^K denotes the element-wise logarithm of the probability distribution a(x_i).

(c) Train the backbone on labeled data using ℓ_stu^sup.

(d) Train the backbone on unlabeled data by distilling from the aggregator, i.e., train the backbone to mimic the aggregator's output:

ℓ_stu^unsup = − Σ_{x_i ∈ U} a(x_i)^T log b(x_i)

Once trained, one can use the outputs of either the backbone or aggregator for inference. If one uses the aggregator, they receive the benefit of improved interpretability: one could inspect the attention scores s_i^j to understand what proportion of the system's decision was due to each rule.
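The two unsupervised objectives, minimum entropy for the aggregator and distillation for the backbone, can be sketched numerically as follows. Function names are illustrative; in the real system these would be differentiable losses over model outputs:

```python
import numpy as np

EPS = 1e-12  # numerical floor so log never sees an exact zero

def entropy_loss(agg_probs):
    """Minimum-entropy objective on unlabeled data: the loss is the mean
    entropy of the aggregator's output, so minimizing it favors confident,
    rule-agreeing distributions."""
    return float(np.mean(-np.sum(agg_probs * np.log(agg_probs + EPS), axis=1)))

def distill_loss(student_probs, teacher_probs):
    """Distillation step: cross-entropy of the backbone (student) against
    the frozen aggregator's (teacher's) output distribution."""
    return float(np.mean(-np.sum(teacher_probs * np.log(student_probs + EPS),
                                 axis=1)))

focused = np.array([[0.95, 0.05]])  # confident aggregator output
uniform = np.array([[0.50, 0.50]])  # maximally uncertain output
print(entropy_loss(focused) < entropy_loss(uniform))  # True
```

The distillation loss is minimized when the student distribution matches the teacher's, which is exactly the mimicking behavior the training loop asks for.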

Experiments
We perform experiments across 9 datasets and tasks, finding that the ARI rule induction framework can improve the performance of state-of-the-art semi-supervised text classification algorithms. Note that concrete examples of the human-readable rules succeeding (and failing) are given in the Appendix.

Experimental Setup
We evaluate our framework on nine benchmark NLP classification datasets that are popular in the few-shot learning and weak supervision literature (Ratner et al., 2017; Awasthi et al., 2020; Zhang et al., 2021a; Cohan et al., 2019). These tasks are as follows: AGNews: using news headlines to predict article topic; CDR: using scientific paper excerpts to predict whether drugs induce diseases; ChemProt: using paper excerpts to predict the functional relationship between chemicals and proteins; IMDB: movie review sentiment; SciCite: classifying citation intent in Computer Science papers; SemEval: relation classification from web text; SMS: text message spam detection; TREC: conversational question intent classification; Youtube: internet comment spam detection. Table 1 shows dataset statistics. Our benchmarks cover a range of discourse domains and classification types. Unless otherwise stated we consider a 5% / 95% split between labeled data and unlabeled data. We construct this split by randomly partitioning the total training data and removing labels from the 95% split. Following Gao et al. (2020); Zhang et al. (2022), we subsample each validation set so that it roughly matches the size of the training set, in order to better simulate label scarcity.
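The labeled/unlabeled partition described above can be sketched as follows; the function name, seed, and sizes are illustrative, not from the paper's experiment code:

```python
import numpy as np

def make_split(n_examples, labeled_frac=0.05, seed=0):
    """Randomly partition example indices into a small labeled set and a
    large unlabeled set (the latter's labels are simply discarded)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_examples)
    n_labeled = int(round(n_examples * labeled_frac))
    return idx[:n_labeled], idx[n_labeled:]

labeled_idx, unlabeled_idx = make_split(1000)
print(len(labeled_idx), len(unlabeled_idx))  # 50 950
```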
All reported results are the average of ten experimental trials, each with different random splits, seeds, and initializations. For each trial, we continuously train our models for 12,500 steps using a batch size of 32, and we stop the training process early based on validation set performance. For each method (baseline and proposed), we conducted a minimal hyperparameter search (details in the Appendix) to establish the best validation performance before running inference over the test set. We ran all experiments on Microsoft Azure cloud compute using NVIDIA V100 GPUs (32G VRAM). All algorithms were implemented using the Pytorch and Wrench frameworks (Paszke et al., 2017;Zhang et al., 2021a). We report binary F1 score for binary classification tasks and macro-weighted F1 for multiclass classification tasks.

Baselines
We experiment with our ngram and PCA-style featurization schemes, as well as our linear model (linear) and decision tree (tree)-based rule generation methods. We compare against the following baselines: BERT: directly fine-tuning a BERT model on the available supervised data (Devlin et al., 2019). Weak Ensemble: It is possible that traditional ML models like regressions and decision trees achieve good performance in these low-resource settings, and that the proposed ARI framework simply takes advantage of these models. We accordingly train several weak models (BERT, regression, and random forest, using the same hyperparameters as were used to obtain rules) and ensemble their predictions for comparison. LMFT: training a BERT model on the unlabeled data with its original language modeling objective before fine-tuning on the supervised data (Howard and Ruder, 2018; Gururangan et al., 2020). Self-Train: iteratively self-training towards the predictions of a frozen model on the unlabeled data (Nigam and Ghani, 2000; Lee et al., 2013). Snuba: We use the Snuba algorithm (Varma and Ré, 2018) to automatically generate weak labels over the unlabeled data, then a generative label model to aggregate these labels for training. PET: a prompt-based few-shot learning algorithm that reformulates classification as cloze-style questions (Schick and Schütze, 2020a).

Table 2: Semi-supervised learning performance on nine classification datasets. Following Tatiana and Valentin (2021), we report the geometric mean in the "Avg." column. We denote the highest and second-highest performance (excluding the expert rules model) in bold and italic respectively. Note that T-Few was pretrained on AGNews, TREC, and IMDB. See Table 11 for accuracies and comparison against the PR-BOOST algorithm (Zhang et al., 2022).

We also compare against two oracles. The first is called ASTRA and is a state-of-the-art weak supervision algorithm that uses manually designed rules and an iterative self-training procedure (Karamanolakis et al., 2021). For this oracle we use previously published heuristic labeling functions from the weak supervision literature (Zhang et al., 2021a).
The rules were manually constructed using domain expertise and, being expertly crafted, suggest an upper bound on performance. The second oracle is called T-Few (Liu et al., 2022) and represents a state-of-the-art prompting approach using a large 3 billion parameter model (30 times larger than the rest of the models considered in this paper).

Experiment Results
Overall results. Table 2 presents our main results. The proposed ARI framework achieves the best performance on 5 out of 9 datasets, and the ARI variations beat the baselines in terms of average performance. Our results suggest that LMFT does not always improve performance over standard BERT fine-tuning, and can sometimes hurt it (CDR). This is in line with previous research findings (Vu et al., 2021; Du et al., 2020). Self-Train achieves overall better performance than BERT, but underperforms on ChemProt and overperforms on SemEval. PET achieves strong results on AGNews and Youtube, but fails on many other datasets. This might be due to its sensitivity to prompts and label words in the scientific domains, which is typical for prompt-based models (Gao et al., 2020). Additionally, due to implementation differences in this prior work, we tested PET after a fixed number of training steps instead of the early-stopping validation technique employed by the other algorithms (Section 3.1).
For ARI, decision-tree based methods give the best results overall, while there is no clear winner between PCA and Ngram-based models. Considering that we also removed stop words in the Ngram features, using PCA to remove common components might not make a big difference to the rules. The performance of ARI is close to ASTRA which uses manually crafted expert rules, showing the potential of automatic rules. Surprisingly, ARI is better than ASTRA on SciCite and SMS by a nontrivial margin. This suggests that automatic rules have the potential to rival human-generated rules. See the Appendix for further results and analysis.
Our results suggest that for prompting methods like PET and T-Few to outperform ARI, one needs bigger models with more linguistic capacity, like the 3B-parameter T-Few (30x larger than e.g. BERT). ARI and PET leverage smaller models, which are faster and use less memory but also have reduced capacity, and therefore less effective prompts, as prior research has noted (Liu et al., 2021). In the Appendix, we observe that ARI continues to outperform PET when more powerful backbone encoders like DeBERTaV3 (He et al., 2021) are used (Table 10).

Robustness. We further test our method's robustness to the number of labeled examples in Fig. 2. We vary the fraction of labeled data between 2% and 40% on the ChemProt and Youtube datasets. The results suggest that ARI can reliably outperform the baselines across this range, especially when labeled data is scarce. Standard supervised BERT fine-tuning becomes increasingly competitive as the fraction of labeled data exceeds 40%.
Filter Ablations. We provide ablation results on rule filtering methods in Table 3. We pick the best performers among the three rule-generation methods in Table 2 and then vary the filters. All three methods show performance gains when the filters are applied individually, and combining the filters appears to further improve performance in some cases.

Mean Subtraction. As noted above, because stop words are already removed from the ngram features, we hypothesized that subtracting off the common component would make little difference to the procedure. To validate this hypothesis, we enabled mean subtraction for ARI on 3 datasets. Table 5 gives the results, and we can observe a slight decrease in performance, -0.32% on average, when mean subtraction is enabled.
Transferability of Rules. One possible limitation of ARI is that the rules overfit to their training data and are limited to a specific setting. We conclude that while the learned rules are adapted to their training data, they are indeed transferable to other domains to some degree, although we note that this work assumes access to in-domain training sets, and that out-of-domain or zero-shot generalization is outside the paper's scope.

Interpretability
As discussed in Section 2.2, the behavior of the aggregation layer a(·) can be traced to individual rules, which are themselves human-readable and interpretable. This is because the output of a(·) is a linear combination of attention scores and rule predictions (Equation 1). In other words, if the attention score for rule r_j on example x_i is s_i^j, then the strength of rule r_j's contribution to the model's final prediction is exactly s_i^j / Q. See the Appendix for case studies showing the impact of individual rules on model behavior.
To further demonstrate the system's interpretability, we grouped examples according to their difficulty and measured the cumulative effect of rules on model behavior (i.e., Σ_j s_i^j / Q) for each category. Following Swayamdipta et al. (2020), we used the entropy of BERT's predicted label distribution as a measure of example difficulty: we ranked examples according to this measure, then split them into hard (above the 75th percentile), medium (25th-75th percentile), and easy (below the 25th percentile) groups. The results are given in Table 6.
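The entropy-based difficulty grouping can be sketched as follows (toy probabilities; the function name is illustrative and the bucket names follow the percentile scheme described above):

```python
import numpy as np

def difficulty_buckets(pred_probs):
    """Rank examples by the entropy of the model's predicted label
    distribution and split into easy/medium/hard by percentile."""
    ent = -np.sum(pred_probs * np.log(pred_probs + 1e-12), axis=1)
    lo, hi = np.percentile(ent, [25, 75])
    return np.where(ent > hi, "hard", np.where(ent < lo, "easy", "medium"))

probs = np.array([[0.98, 0.02],   # confident prediction -> easy
                  [0.75, 0.25],
                  [0.60, 0.40],
                  [0.50, 0.50]])  # maximally uncertain -> hard
print(difficulty_buckets(probs))
```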

Related Work
Our research draws on a number of related areas of research, including Neuro-Symbolic computation, semi-supervised learning, and weak supervision.
Neuro-symbolic approaches seek to unite Symbolic AI, which from the 1950s until the mid-1990s was the dominant paradigm of AI research (Crevier, 1993; Russell and Norvig, 2002), with statistical machine learning and neural networks. For example, there is work that uses discrete parses to template neural network components (Arabshahi et al., 2018; Mao et al., 2019; Yi et al., 2018). There is also work that seeks to embed symbolic knowledge into network parameters via special loss functions (Xu et al., 2018; Seo et al., 2021) or carefully curated datasets (Lample and Charton, 2019; Clark et al., 2020; Saeed et al., 2021) and architectures (Trask et al., 2018). Other related work seeks to incorporate logical constraints into text generation models (Wang et al., 2021; Lu et al., 2020).
Our framework is further inspired by semi-supervised learning research that leverages labeled and unlabeled data. Our baseline PET model comes from a family of algorithms that leverage prompting and model ensembling for greater data efficiency (Schick and Schütze, 2020a,b). There is also research on pulling in demonstration examples from the training set (Gao et al., 2020), automatic prompt generation (Zhang et al., 2021b; Li and Liang, 2021), and leveraging extra datasets and tasks for data augmentation when data is scarce (Du et al., 2020; Vu et al., 2021).
Our self-training approach is similar to the knowledge distillation literature (Hinton et al., 2015;Gou et al., 2021) where a "student" model is trained to imitate the predictions of a "teacher" model. In our case, the teacher is not a separate model but a frozen student plus rule aggregation layer.
Another close body of research taps into weak sources of supervision like regular expressions, keywords, and knowledge base alignment (Mintz et al., 2009;Augenstein et al., 2016;Ratner et al., 2017). Researchers have incorporated these weak supervision signals into self-training procedures like ours (Karamanolakis et al., 2021), as well as constructing procedural generators for boosting weak supervision signals (Zhang et al., 2021a) and interactive pipelines for machine-assisted rule construction (Zhang et al., 2022;Galhotra et al., 2021;Maheshwari et al., 2020). There is also research on automatically generating weak labeling functions (Varma and Ré, 2018;Maheshwari et al., 2021) which shares our bag-of-words featurization and regression scoring mechanism.

Conclusion
In this paper, we proposed Automatic Rule Induction (ARI), a simple and general-purpose framework for the automatic discovery and integration of symbolic rules into pretrained NLP models. Our results span nine sequence classification and relation extraction tasks and suggest that ARI can improve state-of-the-art algorithms with no manual effort and minimal computational overhead.
Future work could investigate layering ARI beneath other few-shot and semi-supervised algorithms, and improving the underlying rule generation strategies, particularly with causal mechanisms (Feder et al., 2021).

Limitations
ARI is not without limitations. First, we observe that hyperparameter selection is key for quality rule generation (Feurer and Hutter, 2019). Second, as other research has noted (Dodge et al., 2019), few-shot evaluation protocols remain immature, as they rely on small, high-variance training sets and static test sets. Last, our procedure works by extrapolating correlations in small training sets, which may result in overfitting and undermine robustness to distribution shift (Sagawa et al., 2020). The results in Table 2, however, suggest the end-to-end ARI system is not any more susceptible to spurious correlations than other ML/DL-based methods, i.e., susceptibility is not a big enough issue to prevent SOTA or near-SOTA performance. We hypothesize this may be due to two reasons. First, the rules are heavily regularized: our rule selection model has a strong l2 penalty, our decision trees are generated as part of a stochastic random forest, and the PCA subtraction may also have a regularizing effect. Second, ARI is a hybrid system (neural + rule) which can learn to favor the pre-trained student model when spurious rules fire. Our min-entropy loss function on unlabeled data is designed to encourage such behavior. Concrete examples of spurious rules being ignored can be found in Appendix E.

Ethical Considerations
Adding symbolic components to neural systems is a promising way to improve AI trust. Symbolic mechanisms are inherently more interpretable and controllable than black-box function approximators. These components can be reviewed by independent panels and modified to fit the considerations and sensitivities of particular applications.
Microsoft has been 100% carbon neutral since 2012, is committed to being carbon negative by 2030 and removing all of its historical emissions by 2050. This extends to the Microsoft Azure cloud compute engine used for our experiments, which runs on majority renewable energy (clo, 2020).

Acknowledgements
We thank Pengcheng He, Giannis Karamanolakis, Hannes Schulz, Yu Shi, Robert Gmyr, Yuwei Fang, Shuohang Wang, and many others for their advice.

Teacher vs. Student Performance
As described in Section 3.3, one can use either the teacher or student model for ARI inference. Table 8 has the results and suggests that their performance is similar, and that there is no clear winner.

Table 8: Relative performance of the teacher and student model, using the same filter settings as in Table 2.

Table 9 gives the performance of the rules by themselves, using the best combination of filters for downstream performance (described in Section 3.3). Interestingly, we find that the rules do not always outperform BERT, even on the small number of examples they fire on. We hypothesize that the contextualized nature of the teacher's embedding mechanism may be helping it further determine when rules should be applied.