Adversarial Scrubbing of Demographic Information for Text Classification

Contextual representations learned by language models can often encode undesirable attributes, like demographic associations of the users, while being trained for an unrelated target task. We aim to scrub such undesirable attributes and learn fair representations while maintaining performance on the target task. In this paper, we present an adversarial learning framework “Adversarial Scrubber” (AdS), to debias contextual representations. We perform theoretical analysis to show that our framework converges without leaking demographic information under certain conditions. We extend previous evaluation techniques by evaluating debiasing performance using Minimum Description Length (MDL) probing. Experimental evaluations on 8 datasets show that AdS generates representations with minimal information about demographic attributes while being maximally informative about the target task.


Introduction
Automated systems are increasingly being used for real-world applications like filtering college applications (Basu et al., 2019), determining credit eligibility (Ghailan et al., 2016), making hiring decisions (Chalfin et al., 2016), etc. For such tasks, predictive models are trained on data coming from human decisions, which are often biased against certain demographic groups (Mehrabi et al., 2019; Blodgett et al., 2020; Shah et al., 2020). Biased decisions based on demographic attributes can have lasting economic, social and cultural consequences.
Natural language text is highly indicative of demographic attributes of the author (Koppel et al., 2002; Burger et al., 2011; Nguyen et al., 2013; Verhoeven and Daelemans, 2014; Weren et al., 2014; Rangel et al., 2016; Blodgett et al., 2016). Language models can often encode such demographic associations even without having direct access to them. Prior work has shown that intermediate representations in a deep learning model encode demographic associations of the author or the person being spoken about (Blodgett et al., 2016; Elazar and Goldberg, 2018; Elazar et al., 2021). Therefore, it is important to ensure that decision functions do not make predictions based on such representations.
In this work, we focus on removing demographic attributes encoded in data representations while training text classification systems. To this end, we present "Adversarial Scrubber" (ADS) to remove information pertaining to protected attributes (like gender or race) from intermediate representations during training for a target task (like hate speech detection). Removal of such features ensures that any prediction model built on top of those representations will be agnostic to demographic information during decision-making.
ADS can be used as a plug-and-play module when training any text classification model to learn fair intermediate representations. The framework consists of 4 modules: Encoder, Scrubber, Bias discriminator and Target classifier. The Encoder generates a contextual representation of the input text. Taking these encoded contextual representations as input, the Scrubber tries to produce fair representations for the target task. The Bias discriminator and Target classifier predict the protected attribute and target label respectively from the Scrubber's output. The framework is trained end-to-end in an adversarial manner (Goodfellow et al., 2014).
We provide theoretical analysis to show that under certain conditions the Encoder and Scrubber converge without leaking information about the protected attribute. We evaluate our framework on 5 dialogue datasets, 2 Twitter-based datasets and a Biographies dataset with different target task and protected attribute settings. We extend previous evaluation methodology for debiasing by measuring the Minimum Description Length (MDL) (Voita and Titov, 2020) of labels given representations, instead of probing accuracy. MDL provides a finer-grained evaluation benchmark for measuring debiasing performance. We compute MDL using off-the-shelf classifiers (MLPClassifier modules from scikit-learn), making it easier to reproduce. Upon training with the ADS framework, we observe a significant gain in MDL for protected attribute prediction as compared to fine-tuning for the target task. Our contributions are:
• We present Adversarial Scrubber (ADS), an adversarial framework to learn fair representations for text classification.
• We provide theoretical guarantees to show that Scrubber and Encoder converge without leaking demographic information.
• We extend previous evaluation methodology for adversarial debiasing by framing performance in terms of MDL.
• Experimental evaluations on 8 datasets show that models trained using ADS generate representations on which probing networks achieve near-random performance for protected attribute inference, while performing similarly to the baselines on the target task.
• We show that ADS is scalable and can be used to remove multiple protected attributes simultaneously.

Related Work
Contextual representations learned during training for a target task can be indicative of features unrelated to the task. Such representations can often encode undesirable demographic attributes, as observed in unsupervised word embeddings (Bolukbasi et al., 2016) and sentence embeddings (May et al., 2019). Prior work has analysed bias in different NLP systems like machine translation (Park et al., 2018; Stanovsky et al., 2019; Font and Costa-Jussa, 2019; Saunders and Byrne, 2020), NLI (Rudinger et al., 2017), text classification (Dixon et al., 2018; Kiritchenko and Mohammad, 2018; Sap et al., 2019; Liu et al., 2021), and language generation (Sheng et al., 2019), among others. Debiasing sensitive attributes for fair classification was introduced as an optimization problem by Zemel et al. (2013). Since then, adversarial training (Goodfellow et al., 2014) frameworks have been explored for protecting sensitive attributes in NLP tasks (Zhang et al., 2018; Elazar and Goldberg, 2018; Liu et al., 2020).
Our work is most similar to Elazar and Goldberg (2018), which achieves fairness through blindness: learning intermediate representations that are oblivious to a protected attribute. We compare the performance of ADS with Elazar and Goldberg (2018) in our experiments.

Adversarial Scrubber
ADS takes text documents {x_1, x_2, ..., x_n} as input from a dataset D with corresponding target labels {y_1, y_2, ..., y_n}. Every input x_i is also associated with a protected attribute z_i ∈ {1, 2, ..., K}. Our goal is to construct a model f(x) that does not rely on z_i while making the prediction y_i = f(x_i). The framework consists of 4 modules: (i) Encoder h(·) with weights θ_h, (ii) Scrubber s(·) with weights θ_s, (iii) Bias discriminator d(·) with weights θ_d and (iv) Target classifier c(·) with weights θ_c, as shown in Figure 1. The Encoder receives a text input x_i and produces an embedding e_i = h(x_i), which is forwarded to the Scrubber. The goal of the Scrubber is to produce a representation u_i = s(h(x_i)) such that y_i can be easily inferred from u_i by the Target classifier c, but u_i does not carry the information required for the Bias discriminator d to predict the protected attribute z_i. Our setup also includes a Probing network q, which helps in evaluating the fairness of the learned representations.
Algorithm 1: For each mini-batch, the Bias discriminator d is updated using the gradients:

θ_d ← θ_d − η ∇_{θ_d} L_d,    (1)

and then the Encoder h, Scrubber s, and Target classifier c are updated using the gradients:

(θ_h, θ_s, θ_c) ← (θ_h, θ_s, θ_c) − η ∇_{(θ_h, θ_s, θ_c)} L_s,    (2)

where η is the learning rate and the losses L_d and L_s are defined below. In the rest of this section, we describe ADS assuming a single Bias discriminator. However, ADS can easily be extended to incorporate multiple discriminators for removing several protected attributes (discussed in Section 6.1).

Scrubber: The Scrubber receives the input representation h(x_i) from the Encoder and generates the representation u_i = s(h(x_i)). The goal of the Scrubber is to produce representations such that the Bias discriminator finds it difficult to predict the protected attribute z_i. To this end, we consider two loss functions.

Entropy loss: the Encoder and Scrubber parameters are jointly optimized to increase the entropy of the prediction probability distribution, H(d(u_i)).
δ-loss: The δ-loss function penalizes the model if the discriminator assigns a high probability to the correct protected-attribute class. For every input instance, we form an output mask m_i ∈ R^{1×K}, where K is the number of protected attribute classes: m_i^(k) = 1 if z_i = k and 0 otherwise. The Encoder and Scrubber minimize the δ-loss defined as:

δ(o_i) = softmax_gumbel(o_i) · m_i^⊤,    (3)

where o_i are the raw logits of the Bias discriminator and softmax_gumbel(·) is the Gumbel softmax function (Jang et al., 2017). In our experiments, we use a combination of the entropy and δ losses.
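The two Scrubber loss components can be sketched numerically. Below is a minimal NumPy illustration (not the paper's implementation, which operates inside the training graph); `entropy_loss`, `gumbel_softmax`, and `delta_loss` are illustrative names, and the temperature `tau` is an assumed hyperparameter:

```python
import numpy as np

def entropy_loss(probs, eps=1e-12):
    """Negative entropy of the discriminator's prediction distribution.
    Minimizing this quantity maximizes H(d(u_i)), pushing predictions
    toward uniform over the K protected-attribute classes."""
    probs = np.clip(probs, eps, 1.0)
    return np.sum(probs * np.log(probs))  # = -H(d(u_i))

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Gumbel-softmax relaxation (Jang et al., 2017): softmax of
    (logits + Gumbel(0,1) noise) / tau."""
    rng = rng if rng is not None else np.random.default_rng(0)
    g = rng.gumbel(size=logits.shape)   # i.i.d. Gumbel(0, 1) noise
    y = (logits + g) / tau
    y = y - y.max()                     # numerical stability
    e = np.exp(y)
    return e / e.sum()

def delta_loss(logits, mask, tau=1.0, rng=None):
    """delta(o_i): Gumbel-softmax probability mass that the discriminator
    assigns to the true protected-attribute class (mask is one-hot)."""
    return float(gumbel_softmax(logits, tau, rng) @ mask)
```

Minimizing `entropy_loss` pushes the discriminator's distribution toward uniform, while `delta_loss` directly penalizes probability mass on the true class.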
Target classifier: The Target classifier predicts the target label y i from u i by optimizing the cross entropy loss: L c (c(u i ), y i ).
The Scrubber, Target classifier, and Encoder parameters are updated simultaneously to minimize the following loss:

L_s = L_c(c(u_i), y_i) − λ_1 H(d(u_i)) + λ_2 δ(o_i),    (4)

where λ_1 and λ_2 are positive hyperparameters.
Bias discriminator: The Bias discriminator, which predicts the protected attribute z_i, is trained to reduce the cross-entropy loss:

L_d = − Σ_{k=1}^{K} m_i^(k) log d(u_i)^(k),    (5)

where K is the number of protected attribute classes and d(u_i)^(k) is the probability the discriminator assigns to class k.
Training: The Bias discriminator and Scrubber (along with Target classifier and Encoder) are trained in an iterative manner as shown in Algorithm 1. First, the Bias discriminator is updated using gradients from the loss in Equation 1. Then, the Encoder, Scrubber and Target classifier are updated simultaneously using the gradients shown in Equation 2.
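The alternating schedule of Algorithm 1 can be sketched as follows. This is a schematic Python outline, not the released implementation: the two update callables stand in for framework-specific optimizer steps on L_d (Equation 1) and L_s (Equation 2):

```python
def ads_training_epoch(batches, update_discriminator, update_scrubber_side):
    """One epoch of the alternating schedule in Algorithm 1.

    For each mini-batch: (1) update the Bias discriminator d with a
    gradient step on L_d; (2) jointly update the Encoder h, Scrubber s,
    and Target classifier c with a gradient step on L_s. The callables
    are placeholders that perform one optimizer step and return the
    current loss value."""
    history = []
    for batch in batches:
        loss_d = update_discriminator(batch)   # Equation 1: step on L_d
        loss_s = update_scrubber_side(batch)   # Equation 2: step on L_s
        history.append((loss_d, loss_s))
    return history
```

The ordering matters: the discriminator is strengthened on the current representations before the Scrubber side is updated against it.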
Probing Network: Elazar and Goldberg (2018) showed that in an adversarial setup, even when the discriminator achieves random performance for predicting z, it may still be possible to retrieve z using a separately trained classifier. Therefore, to evaluate the amount of information related to y and z present in the representations u, we use a probing network q. After ADS is trained, we train q on the representations h(x) and s(h(x)) to predict y and z (q is trained to predict each label separately). We consider a representation to leak information if z can be predicted from it with above-random performance: if the prediction performance of q for z is significantly above the random baseline, the protected attribute is not successfully guarded.
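The probing check can be sketched with scikit-learn, mirroring the paper's use of MLPClassifier as the probe; `probe_leakage` is an illustrative helper, and the hidden width and iteration budget are assumptions:

```python
import numpy as np
from collections import Counter
from sklearn.neural_network import MLPClassifier

def probe_leakage(train_reps, train_z, test_reps, test_z, seed=0):
    """Train a probing network q on frozen representations to predict the
    protected attribute z, and compare its accuracy to the majority-class
    baseline. Accuracy well above the baseline signals information leakage."""
    q = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=seed)
    q.fit(train_reps, train_z)
    acc = q.score(test_reps, test_z)
    majority = Counter(test_z).most_common(1)[0][1] / len(test_z)
    return acc, majority
```

For representations that still encode z (as in the synthetic check below, where one coordinate carries z), the probe recovers z almost perfectly even if an in-training adversary had been fooled.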

Theoretical Analysis
Proposition 1. Minimizing L s is equivalent to increasing Bias discriminator loss L d .
Proof: Both the entropy and δ-loss components of L_s act to increase the Bias discriminator loss. The discriminator cross-entropy loss L_d can be written as:

L_d = − Σ_{k=1}^{K} m_i^(k) log d(u_i)^(k),

where m_i is the one-hot output mask for the protected attribute z_i. Increasing the entropy H(d(u_i)) pushes the prediction distribution toward uniform, reducing the probability assigned to the true class and thereby increasing L_d. The same holds true for the δ-loss component: minimizing δ(o_i) reduces the probability assigned to the true output class, which increases the cross-entropy loss L_d (detailed proof provided in Appendix A.2 due to space constraints). Minimizing the entropy and δ-loss components of the Scrubber loss L_s therefore increases L_d for a fixed Bias discriminator. Assuming our framework converges to (θ*_s, θ*_h, θ*_d) using gradient updates from L_s, we have:

L_d(θ*_s, θ*_h, θ*_d) ≥ L_d(θ_s, θ_h, θ*_d),    (6)

where (θ_s, θ_h) can be any Scrubber and Encoder parameter setting.
Proposition 2. Let the discriminator loss L_d be convex in θ_d and continuously differentiable for all θ_d. Let us assume the following: (a) (θ_h^(0), θ_s^(0)) are Encoder and Scrubber parameters for which the Scrubber output representation s(h(x)) does not have any information about z; (b) θ_d^(0) is the optimal Bias discriminator parameter when s(h(x)) does not have any information about z (this is achieved when d(·) always predicts the majority baseline for z), and ∀(θ_s, θ_h) the following holds true:

min_{θ_d} L_d(θ_s, θ_h, θ_d) ≤ L_d(θ_h^(0), θ_s^(0), θ_d^(0));

(c) the adversarial framework converges with parameters θ*_s, θ*_h and θ*_d. Then L_d(θ*_s, θ*_h, θ*_d) = L_d(θ_h^(0), θ_s^(0), θ_d^(0)), which implies that the Bias discriminator loss does not benefit from updates of θ_s and θ_h.
Proof: As the Bias discriminator converges to θ*_d, we have:

∇_{θ_d} L_d(θ*_s, θ*_h, θ*_d) = 0,

which, by convexity of L_d in θ_d, means that θ*_d minimizes L_d given (θ*_s, θ*_h). θ_h and θ_s are updated using gradients from L_s (Equation 4). Since the Encoder and Scrubber parameters converge to θ*_h and θ*_s respectively, from Proposition 1 (Equation 6) we have:

L_d(θ*_s, θ*_h, θ*_d) ≥ L_d(θ_s, θ_h, θ*_d), ∀(θ_s, θ_h).

Combining these with assumption (b), we can show that L_d(θ*_s, θ*_h, θ*_d) = L_d(θ_h^(0), θ_s^(0), θ_d^(0)).

Proposition 3. Let us assume that the Bias discriminator d(·) is strong enough to achieve optimal accuracy in predicting z from s(h(x)), and that the assumptions in Proposition 2 hold true. Then, the Encoder and Scrubber converge to (θ*_h, θ*_s) without leaking information about the protected attribute z.
Proof: An optimal Bias discriminator d(·) minimizes the prediction entropy, thereby increasing the entropy and δ-losses. Given (θ_h^(0), θ_s^(0)), for any other discriminator θ_d we have:

L_d(θ_h^(0), θ_s^(0), θ_d) ≥ L_d(θ_h^(0), θ_s^(0), θ_d^(0)),

where θ_d^(0) is the optimal Bias discriminator under assumption 2(b) (Equation 11). Then, from Proposition 2 we have L_d(θ*_s, θ*_h, θ*_d) = L_d(θ_h^(0), θ_s^(0), θ_d^(0)). As L_d does not decrease and d(·) is optimal, no additional information about z is revealed that the Bias discriminator could leverage to reduce L_d. This shows that, starting from (θ_h^(0), θ_s^(0)), the Encoder and Scrubber converge to (θ*_h, θ*_s) without leaking information about the protected attribute z.

Experiments
In this section, we describe our experimental setup and evaluate ADS on several benchmark datasets.

Dataset
We evaluate ADS on 5 dialogue datasets, 2 Twitter-based datasets and a Biographies dataset. (a) Multi-dimensional bias in dialogue systems: We evaluate ADS on 5 dialogue datasets: Funpedia, ConvAI2, Wizard, LIGHT and OpenSub, introduced by Dinan et al. (2020). These datasets are annotated with multi-dimensional gender labels: the gender of the person being spoken about, the gender of the person being spoken to, and the gender of the speaker. We consider the gender of the person being spoken about as our protected attribute. The target task in our setup is sentiment classification. For obtaining the target label, we label all instances into three classes (positive, negative and neutral) using the rule-based sentiment classifier VADER (Hutto and Gilbert, 2014). The dialogue datasets Funpedia, Wizard, ConvAI2, LIGHT and OpenSub were downloaded from the "md_gender" dataset in the huggingface library, and we use the data splits provided there. (b) Tweet classification: We experiment on two Twitter datasets. First, we consider the DIAL dataset (Blodgett et al., 2016), where each tweet is annotated with the "race" of the author, which is our protected attribute, and the target task is sentiment classification. We consider two race categories: non-Hispanic blacks and whites. Second, we consider the PAN16 dataset (Rangel et al., 2016), where each tweet is annotated with the author's age and gender, both of which are protected attributes; the target task is mention detection. We use the implementation of Elazar and Goldberg (2018) to process both Twitter datasets. (c) Biographies: Finally, we evaluate on a Biographies dataset (De-Arteaga et al., 2019), where the protected attribute is the gender of the biography's subject. Dataset statistics are reported in Table 1.

Implementation details
We use a 2-layer feed-forward neural network with ReLU non-linearity as our Scrubber network s. We use BERT-base (Devlin et al., 2019) as our Encoder h. The Bias discriminator d and Target classifier c are single-layer neural networks operating on the pooled BERT [CLS] representation after scrubbing. All models were trained using the AdamW optimizer with a learning rate of 2 × 10^-5. Hyperparameter details for different datasets are mentioned in Table 2. The z and y sections in the table report the protected attribute and the target task for each dataset; for each task we also report the number of output classes in parentheses (e.g., Sentiment (3)). The implementation of this project is publicly available here: https://github.com/brcsomnath/AdS.
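As a concrete sketch of the Scrubber's shape, the 2-layer feed-forward network with ReLU can be written in NumPy as below; the 768-dim input matches BERT-base's pooled output, while the 256-dim hidden width and the equal output width are our assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

D_IN, D_HID = 768, 256   # BERT-base pooled width; hidden width is an assumption

# Randomly initialized weights for illustration only
W1 = rng.normal(0.0, 0.02, (D_IN, D_HID)); b1 = np.zeros(D_HID)
W2 = rng.normal(0.0, 0.02, (D_HID, D_IN)); b2 = np.zeros(D_IN)

def scrubber(e):
    """u = s(e): 2-layer feed-forward map with ReLU from the Encoder's
    pooled representation e to the scrubbed representation u."""
    hidden = np.maximum(e @ W1 + b1, 0.0)   # ReLU non-linearity
    return hidden @ W2 + b2
```

In the full framework, the Bias discriminator and Target classifier would each be a single linear layer applied to `scrubber`'s output.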

Evaluation Framework
In our experiments, we compare representations obtained from 4 different settings, as shown in Figure 2. In Figure 2(a), we retrieve h(x) from a pre-trained BERT model. In Figure 2(b), we retrieve h(x) from BERT fine-tuned on the target task. In Figure 2(c), the Encoder output h(x) from ADS is evaluated. In Figure 2(d), the Scrubber output s(h(x)) is evaluated; this represents our final setup, ADS - s(h(x)).

Metrics
We report the F1-score (F1) of the probing network for each evaluation. However, previous work has shown that probing accuracy is not a reliable metric for evaluating the amount of information about an attribute encoded in representations (Hewitt and Liang, 2019). Therefore, we also report the Minimum Description Length (MDL) (Voita and Titov, 2020) of labels given representations. MDL captures the amount of effort required by a probing network to achieve a certain accuracy; it therefore provides a finer-grained evaluation benchmark that can even differentiate between probing models with comparable accuracies. We compute the online code (Rissanen, 1984) for MDL. In the online setting, blocks of labels are encoded by a probabilistic model iteratively trained on incremental blocks of data (further details about MDL are provided in Appendix A.1). We compute MDL using scikit-learn's MLPClassifier at timesteps corresponding to 0.1%, 0.2%, 0.4%, 0.8%, 1.6%, 3.2%, 6.25%, 12.5%, 25%, 50% and 100% of each dataset, as suggested by Voita and Titov (2020). A higher MDL signifies that more effort is required to achieve the probing performance. Hence, we expect the debiased representations to have a higher MDL for predicting z and a lower MDL for predicting y.

Results
The evaluation results for all datasets are reported in Table 3. For all datasets, we report performances in the 4 settings described in Section 5.3. Dialogue and Biographies datasets: First, we focus on the results on the dialogue and Biographies datasets reported in Table 3 (first two rows). We observe the following: (i) for pre-trained h(x), the MDL of predicting z is lower than that of y for these datasets, meaning that information regarding z is better encoded in the pre-trained h(x) than the target label y. (ii) In the "w/o adversary h(x)" setup, the Encoder is fine-tuned on the target task (without debiasing), upon which the MDL for y reduces significantly (the lowest MDL for all datasets is achieved in this setting), accompanied by a rise in MDL for z. However, it is still possible to predict z with an F1-score significantly above the random baseline. (iii) The "ADS - h(x)" setup achieves a similar F1-score for predicting y, but still has an F1-score for z significantly above the random baseline. (iv) "ADS - s(h(x))" performs the best in terms of guarding the protected attribute z (lowest prediction F1-score and highest MDL), achieving a near-random F1-score across all datasets. It is also able to maintain performance on the target task, as we observe only a slight drop compared to the fine-tuning performance ("w/o adversary h(x)" for predicting y).

(Table 4 caption: Comparing ADS with the existing baseline. The best and second best performances are in bold and underlined respectively. ADS - s(h(x)) achieves the best performance in both settings on the PAN16 dataset and reduces ∆_z better than the baseline on DIAL.)
DIAL & PAN16: Next, we focus on the Twitter-based datasets DIAL and PAN16, where the target task is sentiment classification/mention detection and the protected attribute is one of the demographic associations (race/gender/age) of the author. The evaluation results are reported in Table 3 (third row). For these datasets, we observe that: (i) "w/o adversary h(x)" representations have higher F1 and lower MDL for predicting z compared to "Pre-trained h(x)", showing that fine-tuning on the target task y encodes information about the protected attribute z. (ii) "ADS - h(x)" performs similarly to the "w/o adversary h(x)" representations on the target task but still leaks significant information about z, unlike the previous datasets. (iii) "ADS - s(h(x))" achieves the best performance in terms of guarding the protected variable z (achieving almost random performance on the PAN16 dataset), without much performance drop on the target task.
Comparison with Prior Work: We report two metrics following Elazar and Goldberg (2018): (i) ∆_z, which denotes the performance above the random baseline for z (50% for both PAN16 and DIAL), and (ii) Acc_y, the probing accuracy on the target task. Our framework cannot be directly compared with Elazar and Goldberg (2018) as they used an LSTM Encoder; therefore, we report the baseline Encoder performances as well. In Table 4, we observe that it is possible to retrieve z and y from "w/o adversary BERT" with higher performance compared to "w/o adversary LSTM". This indicates that BERT encodes more information pertaining to both y and z than LSTM. On the DIAL dataset, ADS reduces ∆_z by an absolute margin of 25%, compared to 9.7% by Elazar and Goldberg (2018), while the absolute drop in Acc_y is 3.5% compared to their 3.6%. On the PAN16 dataset, ADS achieves the best ∆_z and Acc_y performance in both setups, with protected attributes age and gender respectively. ADS - s(h(x)) also achieves performance comparable to the "w/o adversary BERT" setup, which is fine-tuned on the target task. Therefore, ADS is successful in scrubbing information about z from the representations of a stronger encoder than the one used by Elazar and Goldberg (2018).

(Table 5 caption: Evaluation results of protecting multiple attributes using ADS. Statistically significant best performances are in bold. Expected trends for a metric are shown as ↑ higher scores and ↓ lower scores. "ADS - s(h(x)) (both)" achieves the best performance.)

Scrubbing multiple protected attributes
In this experiment, we show that using ADS it is possible to guard information about multiple protected attributes. L_s in this setup is defined as:

L_s = L_c(c(u_i), y_i) + Σ_{n=1}^{N} [ −λ_1 H(d_n(u_i)) + λ_2 δ(o_i^(n)) ],

where N is the number of protected attributes and d_n(·) is the Bias discriminator corresponding to the n-th protected attribute z_n. We evaluate on the PAN16 dataset considering two protected attributes, z_1 (age) and z_2 (gender). The target task is mention prediction. We consider the subset of PAN16 that contains samples with both gender and age labels; this subset has 120K training instances and 30K test instances. Evaluation results are reported in Table 5. Similar to previous experiments, we observe that "w/o adversary h(x)" (fine-tuned BERT) leaks information about both protected attributes, age and gender. We evaluate the information leak when "ADS - s(h(x))" is retrieved from a setup with a single Bias discriminator (age/gender). We observe a significant gain in MDL for the corresponding z_n in both cases, indicating that the respective z_n is being protected. Finally, we train ADS using two Bias discriminators, and the "ADS - s(h(x)) (both)" representations achieve the best performance in guarding z_1 and z_2, while performing well on the target task. This shows that the ADS framework is scalable and can be leveraged to guard multiple protected attributes simultaneously.

(Table 6 caption: Ablation experiments on Funpedia using F1-score (F1), Precision (P) and Recall (R). Expected trends for a metric are shown as ↑ higher scores and ↓ lower scores. ADS with both loss components performs best in guarding z.)
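With multiple discriminators, the per-attribute penalties simply accumulate. A minimal NumPy sketch of the entropy part of this reduction (function names are illustrative, and the sum over attributes reflects the multi-attribute loss described above):

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a discrete distribution (natural log)."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(p))

def multi_attribute_penalty(disc_probs, lambda1=1.0):
    """Scrubber-side penalty summed over N Bias discriminators d_1..d_N:
    the Scrubber attains a lower (more negative) loss when every d_n's
    prediction distribution stays close to uniform."""
    return -lambda1 * sum(entropy(p) for p in disc_probs)
```

Each protected attribute contributes its own term, so adding an attribute only requires adding its discriminator's prediction distribution to the list.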

Efficacy of different losses
We experiment with different configurations of the Scrubber loss L_s to assess the efficacy of its individual components. We show the experimental results on the Funpedia dataset in Table 6 (with λ_1 = λ_2 = 1). We observe that most leakage of z (increase in prediction F1-score) occurs when the entropy loss is removed. Removing the δ-loss also results in a slight increase in leakage, accompanied by a gain in performance for predicting y. This shows that both losses are important for guarding z. Empirically, we found that the δ-loss is not suitable for binary protected attributes: when the Scrubber is encouraged to learn representations that do not have information about z, it instead learns to encode representations such that the Bias discriminator predicts the opposite z class. Hence, the information about z is still present and is retrievable using a probing network q. For this reason, we use the δ-loss only for Funpedia (λ_2 values in Table 2), where we considered 3 gender label classes.

Visualization
We visualize the UMAP (McInnes et al., 2018) projection of the Encoder output representations, h(x), in Figure 3. Blue and red labels indicate female and male biographies respectively. Figures 3a and 3b show representations before and after ADS training. In Figure 3a, male and female labeled instances are clearly separated in space, showing that text representations encode information relating to gender attributes. In Figure 3b, we observe that after training in our adversarial framework, male and female labeled instances are difficult to segregate. This indicates that after training with ADS, it is difficult to distinguish biography representations on the basis of gender.

Conclusion
In this work, we proposed Adversarial Scrubber (ADS) to remove demographic information from contextual representations. Theoretical analysis showed that under certain conditions, our framework converges without leaking information about protected attributes. We extend previous evaluation metrics to evaluate fairness of representations by using MDL. Experimental evaluations on 8 datasets show that ADS is better at protecting demographic attributes than baselines. We show that our approach is scalable and can be used to remove multiple protected attributes simultaneously. Future work can explore leveraging ADS towards learning fair representations in other NLP tasks.

Acknowledgement
This work was supported in part by grants NIH 1R01AA02687901A1 and NSF IIS2133595.

Ethical considerations
We propose ADS, an adversarial framework to prevent text classification modules from making biased decisions. ADS is intended to be used in scenarios where the user is already aware of the input attributes they want to protect, and it can only be trained on data where protected attributes are annotated. It is possible that representations retrieved from ADS contain sensitive information that was not defined among the protected variables. Even in such a scenario, ADS will not reveal more information than is already available in the dataset. One potential way of misusing ADS would be to define features relevant to a task (e.g., experience for a job application) as a protected attribute; the classification system may then be forced to rely on sensitive demographic information for predictions. In such cases, it is possible to flag systems by evaluating the difference in True Positive Rate (TPR) when the protected attribute is changed (the GAP^TPR_{z,y} metric (De-Arteaga et al., 2019)). All experiments were performed on publicly available data, where the identity of the author was anonymized. We did not perform any additional data annotation.
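The flagging check mentioned above can be sketched as a small function; `tpr_gap` is an illustrative name, computing the TPR difference between two protected groups in the spirit of the GAP^TPR metric of De-Arteaga et al. (2019):

```python
import numpy as np

def tpr_gap(y_true, y_pred, z, positive=1):
    """Absolute difference in true positive rate between two protected
    groups (z in {0, 1}). A large gap flags a classifier whose errors
    depend on the protected attribute z."""
    y_true, y_pred, z = map(np.asarray, (y_true, y_pred, z))
    tprs = []
    for group in (0, 1):
        pos = (z == group) & (y_true == positive)   # group's positive instances
        tprs.append((y_pred[pos] == positive).mean() if pos.any() else 0.0)
    return abs(tprs[0] - tprs[1])
```

A gap near 0 means both groups' positive instances are recovered at the same rate; a large gap suggests the model's behavior shifts with the protected attribute.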

A.1 Minimum Description Length
Minimum Description Length (MDL) measures the description length of labels given a set of representations. MDL captures the amount of effort required to achieve a certain probing accuracy, characterizing either the complexity of the probing model or the amount of data required. Estimating MDL involves a dataset {(x_1, y_1), ..., (x_n, y_n)}, where the x_i's are data representations from a model and the y_i's are task labels. Now, a sender Alice wants to transmit the labels {y_1, ..., y_n} to a receiver Bob, when both of them have access to the data representations x_i. In order to transmit the labels efficiently, Alice needs to encode the y_i's in an optimal manner using a probabilistic model p(y|x). The minimum codelength (Shannon-Huffman code) required to transmit the labels losslessly is:

L(y_{1:n} | x_{1:n}) = − Σ_{i=1}^{n} log_2 p(y_i | x_i).

There are two ways of evaluating MDL for transmitting the labels y_{1:n}: (a) variational code, which transmits p(y|x) explicitly and then uses it to encode the labels; (b) online code, which encodes the model and labels without explicitly transmitting the model. In our experiments, we evaluate the online code for estimating MDL. In the online setting, the labels are transmitted in blocks at timesteps {t_0, ..., t_n}. Alice encodes the first block of labels y_{1:t_1} using a uniform code. Bob learns a model p_{θ_1}(y|x) using the data {(x_i, y_i)}_{i=1}^{t_1}; Alice then transmits the next block of labels y_{t_1+1:t_2} using p_{θ_1}(y|x). In the next iteration, the receiver trains a new model using the larger chunk of data {(x_i, y_i)}_{i=1}^{t_2}, which encodes y_{t_2+1:t_3}. This continues till the whole set of labels y_{1:n} is transmitted. The total codelength required for transmission in this setting is:

L_online(y_{1:n} | x_{1:n}) = t_1 log_2 C − Σ_{i=1}^{n−1} log_2 p_{θ_i}(y_{t_i+1 : t_{i+1}} | x_{t_i+1 : t_{i+1}}),    (12)

where y_i ∈ {1, 2, ..., C}. The online codelength L_online(y_{1:n}|x_{1:n}) is shorter if the probing model is able to perform well using fewer training instances, thereby capturing the effort needed to achieve a given prediction performance.
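The online codelength computation can be sketched as below; for illustration the per-block "model" is just an empirical class-frequency estimate with add-one smoothing rather than the MLPClassifier probes used in the paper, and the function name, default block fractions, and smoothing choice are ours:

```python
import numpy as np

def online_codelength(labels, num_classes, fractions=(0.1, 0.2, 0.4, 1.0)):
    """Online (prequential) codelength in bits. The first block is sent
    with a uniform code (t_1 * log2 C); each later block is encoded with
    a model fit on all previously transmitted labels. The paper instead
    trains an MLPClassifier on representations at each timestep, with a
    finer fraction schedule (0.1% up to 100%)."""
    n = len(labels)
    steps = sorted({max(1, int(round(f * n))) for f in fractions} | {n})
    total = steps[0] * np.log2(num_classes)      # uniform code for block 1
    for start, end in zip(steps, steps[1:]):
        counts = np.bincount(labels[:start], minlength=num_classes) + 1
        probs = counts / counts.sum()            # add-one smoothed frequencies
        total += -np.log2(probs[labels[start:end]]).sum()
        # a model fit on more data yields shorter codes for later blocks
    return float(total)
```

With balanced, unpredictable labels the codelength stays near n·log2 C bits, while labels the model can learn compress well below that ceiling.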

A.2 Theoretical Analysis
Proposition. Minimizing δ-loss is equivalent to increasing the Bias discriminator loss L d .
Proof: The δ-loss function can be written as:

δ(o_i) = Σ_{j=1}^{K} m_i^(j) · exp((o_i^(j) + g_j)/τ) / Σ_{l=1}^{K} exp((o_i^(l) + g_l)/τ) = exp((o_i^(k) + g_k)/τ) / Σ_{l=1}^{K} exp((o_i^(l) + g_l)/τ),

where o_i^(j) is the raw logit assigned to the j-th output class, the true output class is k = z_i, τ is the Gumbel softmax temperature, and the g_j are i.i.d. samples from the Gumbel(0, 1) distribution. The cross-entropy loss of the Bias discriminator can be written as:

L_d = − log [ exp(o_i^(k)) / Σ_{l=1}^{K} exp(o_i^(l)) ].

The δ-loss is a Gumbel-softmax estimate of the probability the discriminator assigns to the true class k. Minimizing δ(o_i) drives o_i^(k) down relative to the other logits, which decreases the softmax probability of class k and hence increases L_d. Therefore, minimizing δ(o_i) increases L_d.

A.3 Implementation Details
All experiments are conducted in the PyTorch framework using an Nvidia GeForce RTX 2080 GPU with 12GB memory. We use an off-the-shelf MLPClassifier from scikit-learn as our probing network q. ADS has a total of 110M parameters (all 4 modules combined). The average runtime per epoch for each dataset is reported in Table 7.

A.4 Measuring Fairness in Representations
MDL scales linearly with the dataset size (Equation 12), which makes it hard to compare across different datasets. To make it comparable, we measure a normalized description length for transmitting 1000 labels:

L̄ = (1000 / |D|) · L_online(y_{1:n} | x_{1:n}),

where |D| is the dataset size. Performance using this measure is reported in Table 8 for all datasets. In all experiments, we report the MDL required for transmitting the labels in the training set.