Conceptor-Aided Debiasing of Large Language Models

Pre-trained large language models (LLMs) reflect the inherent social biases of their training corpus. Many methods have been proposed to mitigate this issue, but they often fail to debias or they sacrifice model accuracy. We use conceptors, a soft projection method, to identify and remove the bias subspace in LLMs such as BERT and GPT. We propose two methods of applying conceptors: (1) bias subspace projection by post-processing via the conceptor NOT operation; and (2) a new architecture, conceptor-intervened BERT (CI-BERT), which explicitly incorporates the conceptor projection into all layers during training. We find that conceptor post-processing achieves state-of-the-art (SoTA) debiasing results while maintaining LLMs' performance on the GLUE benchmark. Further, it is robust in various scenarios and can mitigate intersectional bias efficiently through its AND operation on the existing bias subspaces. Although CI-BERT's training takes all layers' bias into account and can beat its post-processing counterpart in bias mitigation, CI-BERT reduces language model accuracy. We also show the importance of carefully constructing the bias subspace. The best results are obtained by removing outliers from the lists of biased words, combining the lists (via the OR operation), and computing their embeddings using sentences from a cleaner corpus.


Introduction
LLMs such as BERT (Devlin et al., 2019) and GPT (Radford et al., 2019; Brown et al., 2020) are extremely successful in most natural language processing (NLP) tasks. However, since they are trained on texts written by humans, social bias is inherited and represented in the parameters of LLMs (Bolukbasi et al., 2016; Caliskan et al., 2022). For example, gender bias has been found in contextualized embeddings (May et al., 2019; Zhao et al., 2019). Therefore, many researchers have developed debiasing techniques to improve the social fairness of NLP. However, such debiasing often fails to debias effectively and reduces language model performance in downstream tasks (Meade et al., 2022). Furthermore, most debiasing methods neither follow Bommasani et al. (2020)'s suggestion to reduce bias in all layers nor tackle intersectional bias in an efficient way (Lalor et al., 2022).

Figure 1: The pipeline of the conceptor-aided debiasing paradigm. We first use different settings (wordlists with outlier filter, and corpora) to generate the best bias subspace (conceptor matrix), then apply it in two conceptor-aided debiasing methods and measure the debiasing performance with two evaluation metrics. The experiments are conducted on two LLMs: BERT and GPT.
In this paper, we challenge Karve et al. (2019)'s conclusion that conceptor negation fails to debias BERT stably. Instead, we are the first to show empirically that, as a soft shrinkage of the principal components of the subspace defined by a list of biased words (Liu et al., 2018), conceptors are a powerful tool for debiasing LLMs such as BERT and GPT using either post-processing or continued training. In the process, we demonstrate how the choice of corpora, subspace removal methods, and criteria for selecting the list of bias attribute words used to construct the bias subspace affects debiasing performance. Further, we are the first to show that conceptors can tackle varied types of biases (e.g. gender, race) intersectionally and efficiently through their unique logical operations.
Specifically, the attribute wordlists at the core of our method, and of the methods we build on, are sets of attribute words related to bias. These typically come in opposing pairs (e.g. 'man'/'woman', 'prince'/'princess'). Bolukbasi et al. (2016), Liang et al. (2020) and others use the first principal component (PC) to define the bias subspace, which can later be subtracted entirely to debias. We similarly construct such subspaces, but use conceptors as a 'soft' way to remove them: the PCs are downscaled, adjusted by a regularized identity map. When generating such wordlists, removing outliers in the embedding space may make them more representative of the bias. Because the embeddings are contextualized, we select the contextualized token-level word embeddings using sentences from a specific corpus. We then stack them to generate a bias subspace in the form of a conceptor matrix for the debiasing in the next step. The pipeline is shown in Figure 1.
This work contributes the following:
• Employs conceptor negation post-processing to debias LLMs such as BERT and GPT, beating most SoTA methods while retaining useful semantics and remaining robust in multiple scenarios
• Explores conceptor-intervened BERT (CI-BERT), a novel model architecture that continues training BERT after incorporating conceptors within all of BERT's layers
• Illustrates how different corpora, bias attribute wordlists, and outlier removal criteria impact debiasing performance
• Demonstrates that conceptor-aided methods generalize to different layers of LLMs and various types of biases, and can mitigate them intersectionally via the conceptors' unique logical operations

Related Work

Bias Manifestation
Multiple demographic biases are common in society. Among them, gender bias is the most well-studied in academia, given its omnipresence and bi-polarity (Bolukbasi et al., 2016; May et al., 2019; Kurita et al., 2019). Other social biases (e.g. racial, religious) are also widespread in LLMs and are attracting increasing attention (Nangia et al., 2020; Nadeem et al., 2021; Meade et al., 2022). Such social bias manifests itself in all layers of the contextualized representations of LLMs like BERT and GPT (Bommasani et al., 2020), and Kaneko and Bollegala (2021) show that debiasing all layers is more effective. Moreover, Lalor et al. (2022) indicate the importance of addressing varied biases in different dimensions. Thus, a new challenge arises: how to adapt current methods or develop novel paradigms to mitigate the bias in each layer and across multiple social dimensions.

Debiasing Techniques and Challenges
We collect the mainstream SoTA debiasing methods (overviews: Meade et al. (2022); Xie and Lukasiewicz (2023)), each with typical examples: (1) Bias Subspace Projection (BSP): the classic method of bias subspace subtraction first captures the bias subspace determined by attribute words in the corpora and then projects the bias direction out of the language embeddings. This can be done by post-processing as either hard projection (Bolukbasi et al., 2016; SENTENCEDEBIAS, Liang et al., 2020) or soft projection (Karve et al., 2019). Some variants attain a similar goal by training a linear classifier (INLP, Ravfogel et al., 2020) or fine-tuning LLMs (Kaneko and Bollegala, 2021).
(2) Counterfactual Data Augmentation (CDA): rebalancing the training corpus by swapping bias attribute words (e.g. he/she) and then training on the augmented data (Webster et al., 2020).
(3) Dropout Regularization (DROPOUT): in combination with additional pre-training, increasing the dropout rate inside transformer-based language models can lead to lower bias (Webster et al., 2020).
(4) SELF-DEBIAS: by using specific templates to encourage LLMs to generate toxic output and then modifying the model's original output distribution with a decoding algorithm, Schick et al. (2021) make use of the internal knowledge of the language model to debias in a post-hoc manner.
Further, it is common to combine multiple such methods. For instance, Zhao et al. (2019) and Liang et al. (2020) combine the techniques of data augmentation and hard debiasing. However, per the discussion in Meade et al. (2022), these methods often neither debias as well as they claim (e.g. CDA, DROPOUT, SENTENCEDEBIAS) nor maintain the model's capability on downstream tasks (e.g. CDA, DROPOUT, INLP). Worse, some techniques like CDA and DROPOUT increase the bias measured on SEAT, a test of language bias which we describe in Section 5. This dilemma challenges us to develop new methods that further reduce bias while retaining meaningful semantics. Last, the majority of debiasing methods ground the bias in a wordlist; different lists can lead to different debiasing performance (Antoniak and Mimno, 2021).

Conceptors in NLP
Conceptors, a soft projection method supporting conceptual abstraction and logical operations (Jaeger, 2014), have been adapted to NLP domains such as debiasing (Liu et al., 2018; Sedoc and Ungar, 2019; Karve et al., 2019), continual learning (Liu et al., 2019a), and semantic information enrichment (Liu et al., 2019b). Conceptor negation is a soft shrinkage of the PCs of a subspace, such as that of stop words or, in our case, of the target words defining the bias directions (Liu et al., 2018). It therefore has the potential to debias better than hard projection (e.g., Bolukbasi et al., 2016) while retaining enough semantics. Mathematically, conceptors can capture, conjoin, and negate bias concepts through logical operations, and thus can deal with intersectional bias efficiently.
Although Karve et al. (2019) showed that debiasing conceptors can successfully debias both static embeddings, such as GloVe, word2vec, and fastText, and contextual embeddings, such as ELMo (Peters et al., 2018), they state that the performance on BERT is far less consistent and effective than on other word representations. We discover that this is the result of their having selected the wrong set of attribute words, which leads to a poor bias subspace (we also fixed coding issues in their implementation). Another difference is that the BERT embeddings of attribute words should be averaged if the words are split into multiple subwords after tokenization (Liang et al., 2020; Kaneko and Bollegala, 2021).

The Mechanism of Conceptors
Let us take a closer look at the mathematics of conceptors. Consider a set of vectors {x_i}_{i=1}^{n}, x_i ∈ R^d. A conceptor matrix C is the d × d matrix that minimizes

  (1/n) Σ_{i=1}^{n} ∥x_i − C x_i∥² + α^{−2} ∥C∥²_F,

where ∥ • ∥_F is the Frobenius norm and α^{−2} is a scalar hyper-parameter called the aperture. It can be shown that C has the closed-form solution

  C = R (R + α^{−2} I)^{−1}, with R = (1/n) X X^⊤,

where X is a data collection matrix whose i-th column is x_i. Intuitively, C is a soft projection matrix onto the linear subspace where the typical components of the samples x_i lie, so it can capture the components that all representations roughly share. Therefore, unlike PCA projection, which removes the first several principal components (PCs) completely, the conceptor method softly downscales the PCs, adjusted by a regularized identity map (Figure 2).

Conceptors support Boolean operations such as NOT (¬), AND (∧), and OR (∨). For two arbitrary conceptors C_1 and C_2, we have

  ¬C = I − C,
  C_1 ∧ C_2 = (C_1^{−1} + C_2^{−1} − I)^{−1},
  C_1 ∨ C_2 = ¬(¬C_1 ∧ ¬C_2).

These logical operations are feasible if C_1 and C_2 are created from sets of equal sizes (Jaeger, 2014), as shown in Figure 3. This reveals the potential for debiasing by combining different conceptors learned from different bias subspaces, which is helpful both in combining different wordlists for the same bias (e.g. gender) and different wordlists for different protected classes (e.g. gender and race).
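As a minimal illustration, the closed form and the Boolean operations can be sketched with NumPy; the aperture value and the toy data here are arbitrary choices for demonstration, not the settings used in our experiments:

```python
import numpy as np

def conceptor(X, alpha=10.0):
    """Closed-form conceptor for a data matrix X whose columns are the samples x_i."""
    d, n = X.shape
    R = X @ X.T / n                       # correlation matrix
    return R @ np.linalg.inv(R + alpha ** -2 * np.eye(d))

def NOT(C):
    return np.eye(C.shape[0]) - C

def AND(C1, C2):
    # C1 AND C2 = (C1^-1 + C2^-1 - I)^-1; pinv guards against singular operands
    return np.linalg.inv(np.linalg.pinv(C1) + np.linalg.pinv(C2) - np.eye(C1.shape[0]))

def OR(C1, C2):
    # De Morgan: C1 OR C2 = NOT(NOT(C1) AND NOT(C2))
    return NOT(AND(NOT(C1), NOT(C2)))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 100))             # toy "word embeddings", columns are samples
C = conceptor(X)
```

Note that a conceptor's eigenvalues lie in [0, 1), so multiplying by ¬C shrinks, rather than entirely removes, the leading directions of the captured subspace.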

Bias Subspace Setting
We explore the impact on the quality of the bias subspace of different choices of attribute wordlists, of the corpora used to find their embeddings, and of how the wordlists are combined and filtered to remove outliers.

Corpora We compare three corpora: (1) the Brown Corpus (Francis and Kucera, 1979), a collection of text samples of mixed genres; (2) the Stanford Sentiment Treebank (SST; Socher et al., 2013), a polarized dataset of 10,662 movie reviews; and (3) a Reddit Corpus (Liang et al., 2020), a dataset collected from discussion forums about relationships, electronics, and politics. These corpora let us see how the language formality and topic breadth of texts impact the debiasing: the Brown corpus is formal and contains 15 genres, the Reddit corpus is informal with 3 domains, and the SST corpus is informal with only one domain. They are used to provide embeddings for the attribute words.

Combining and Filtering Wordlists
We compare five ways of using three different wordlists to create conceptor bias subspaces.
The three wordlists are gender words originating from different sources: the pronouns wordlist is a set of common terms that are specific to particular genders, such as 'daughter' or 'son'; the extended wordlist, an extension of the former, contains less frequent words such as 'cowgirls' or 'fiancees'; and the propernouns wordlist comprises proper nouns like 'Tawsha' and 'Emylee'.
There are five methods of using these three wordlists to generate a bias subspace. We can use each of them individually (their subspaces are named after them: pronouns, extended, and propernouns, respectively). We can also combine them in two ways: either by concatenating them into a single list that generates a corresponding subspace (named all); or by running the conceptor OR operation, a Boolean operation of conceptors described in Section 2.3, on the three corresponding conceptor matrices to generate what can be viewed as a union of the three bias subspaces (named or).
Unlike Karve et al. (2019), to study the effects of removing outliers from the wordlists, we first project the LLM's embeddings of the words in the wordlist into a 2-dimensional UMAP (McInnes et al., 2018) space, shown in Figure 4, and then filter the outliers by percentile on their (x, y)-coordinates. The outliers are defined as the points that fall outside 1.5 times the inter-percentile range (IR), the difference between the p-th and (1−p)-th percentiles. We iterate p from 0.1 to 1.0 with step size 0.1 to generate different wordlists and then test how well each debiases. Our goals are to detect the negative effect of outliers on debiasing performance and to explore which percentile is optimal for debiasing.
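The percentile filter can be sketched as follows, assuming the 2-D UMAP coordinates of the wordlist embeddings have already been computed (the function name `percentile_filter` and the planted toy outlier are ours):

```python
import numpy as np

def percentile_filter(coords, p=0.4):
    """Keep 2-D points (e.g. UMAP coordinates of wordlist embeddings) that
    fall within 1.5x the inter-percentile range (IR) on both axes, where IR
    is the difference between the p-th and (1-p)-th percentiles."""
    q = sorted([100 * p, 100 * (1 - p)])
    lo = np.percentile(coords, q[0], axis=0)
    hi = np.percentile(coords, q[1], axis=0)
    ir = hi - lo
    inside = (coords >= lo - 1.5 * ir) & (coords <= hi + 1.5 * ir)
    return inside.all(axis=1)

rng = np.random.default_rng(0)
coords = np.vstack([rng.normal(size=(100, 2)), [[50.0, 50.0]]])  # one clear outlier
keep = percentile_filter(coords, p=0.25)
```

The surviving words (e.g. `[w for w, k in zip(words, keep) if k]`) then define the filtered wordlist used to build the bias subspace.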

Debiasing Methods
We propose and explore two kinds of conceptor-aided debiasing: conceptor post-processing and conceptor-intervened continued training, abbreviated as P.P. and C.T. respectively in the tables.

Conceptor Bias Subspace Construction
We construct the conceptor negation matrix ¬C as demonstrated in Algorithm 1, where the matrix X is a stack of the within-sentence contextualized embeddings of the words. The words are determined by the attribute wordlists and the sentences come from the specified corpus, as described in Section 4.1. Note that we do not need the "difference space" of bipolar bias, since the conceptor projection matrix is applied to the original space; in this way the conceptor method differs from so-called hard debiasing (Bolukbasi et al., 2016). To ensure contextualization we remove sentences of fewer than four words. Also, following Kaneko and Bollegala (2021), if a word is split into multiple sub-tokens, its contextualized embedding is computed by averaging the contextualized embeddings of its constituent sub-tokens, which differs from previous conceptor work.
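The sub-token averaging and the stacking of X can be sketched as below, assuming the per-sentence hidden states from the LLM and the word-to-sub-token position mapping are already available (the helper names are ours, not the paper's API):

```python
import numpy as np

def word_embedding(hidden_states, subtoken_positions):
    """Average the contextualized sub-token embeddings of one attribute word.

    hidden_states: (seq_len, dim) array from one layer of the LLM for one
    sentence; subtoken_positions: indices of the word's sub-tokens after
    tokenization (a word split into several sub-tokens yields several indices).
    """
    return hidden_states[subtoken_positions].mean(axis=0)

def build_X(occurrences):
    """Stack one column per word occurrence, giving the matrix X of Algorithm 1.

    occurrences: list of (hidden_states, subtoken_positions) pairs, one per
    occurrence of an attribute word in a corpus sentence.
    """
    return np.stack([word_embedding(h, pos) for h, pos in occurrences], axis=1)
```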
Conceptor Negation and Post-Processing Next, we post-process the sentence embeddings t, which contain attribute words and target words, by taking the matrix product with ¬C to subtract the bias subspace, rendering debiased embeddings t*, as demonstrated in the last part of Algorithm 1. Each BERT layer manifests a different level of bias (Bommasani et al., 2020). To maximize the effectiveness of ¬C, we want ¬C to be generated from the corresponding layer. Therefore, we are the first to test debiasing performance using different conceptor matrices generated from different layers of the language model, and to explore whether conceptor post-processing generalizes well to each layer of LLMs and to different LLMs (BERT and GPT).
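The projection step itself is a single matrix product; a minimal sketch, where the diagonal C is a toy stand-in for a learned conceptor that almost fully captures one 'bias direction':

```python
import numpy as np

def postprocess(embeddings, C):
    """Debias embeddings from a given layer with that layer's conceptor C:
    t* = (I - C) t, applied row-wise. Since the negated conceptor is
    symmetric, right-multiplying each row by it is equivalent to applying
    it to each embedding vector."""
    notC = np.eye(C.shape[0]) - C
    return embeddings @ notC

# Toy stand-in conceptor: axis 0 is (almost fully) the bias direction.
C = np.diag([0.95, 0.0, 0.0])
t = np.array([[1.0, 2.0, 3.0]])
t_star = postprocess(t, C)   # bias axis shrunk to 5%, other axes untouched
```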
Intersectional Debiasing Importantly, conceptors can not only mitigate different types of biases, such as gender and race, separately; they can also conjoin and negate these bias concepts thanks to their logical operations. Societal biases naturally co-exist along multiple dimensions: "African male", say, rather than "African" and "male" separately. Conceptors can therefore tackle them intersectionally and efficiently, reusing the previously constructed bias subspaces via the logical operations to construct new mixed conceptors.
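Reusing the Boolean operations from Section 2.3, building a mixed conceptor from existing single-bias conceptors can be sketched as follows; the random matrices are stand-ins for conceptors learned from real gender and race wordlists, and the analogous AND combination of the same matrices is obtained by swapping the combiner:

```python
import numpy as np

def conceptor(X, alpha=10.0):
    R = X @ X.T / X.shape[1]
    return R @ np.linalg.inv(R + alpha ** -2 * np.eye(X.shape[0]))

def NOT(C):
    return np.eye(C.shape[0]) - C

def AND(C1, C2):
    return np.linalg.inv(np.linalg.pinv(C1) + np.linalg.pinv(C2) - np.eye(C1.shape[0]))

def OR(C1, C2):
    return NOT(AND(NOT(C1), NOT(C2)))

rng = np.random.default_rng(1)
C_gender = conceptor(rng.normal(size=(8, 200)))  # stand-in for the gender conceptor
C_race = conceptor(rng.normal(size=(8, 200)))    # stand-in for the race conceptor

# Mixed conceptor covering both bias subspaces; its negation debiases both at once.
C_mixed = OR(C_gender, C_race)
notC_mixed = NOT(C_mixed)
```

By the Loewner (abstraction) ordering of conceptors, the OR-combined matrix covers at least as much of the space as either operand, which is what lets a single negation handle both biases.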

Conceptor Intervention and Continued Training
The varying levels of bias across BERT layers suggest the possible utility of an alternative approach to mitigating the bias. Accordingly, we construct a new architecture, Conceptor-Intervened BERT (CI-BERT), by placing the corresponding conceptor matrix after each layer of BERT (Figure 5). We then continue training the entire model so that the model weights incorporate the bias negation captured by the conceptors in each layer. We can thus take the biases in all layers into account and mitigate the layerwise bias simultaneously. The CI-BERT architecture can be used in three ways. We can load the original pre-trained weights into CI-BERT and directly render the language embeddings (Type I; CI-BERT × original weights). Alternatively, we can continue training the model using CI-BERT to get newly trained weights; we can then load these weights back into either the original off-the-shelf BERT architecture (Type II; BERT × trained weights) or the new CI-BERT architecture (Type III; CI-BERT × trained weights).
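The dataflow of the intervention can be sketched with a toy NumPy model; this is only a stand-in for the real CI-BERT (which inserts the product inside actual transformer layers), with hypothetical names and a tanh stand-in for each layer's transform:

```python
import numpy as np

class ToyCIModel:
    """Toy stand-in for CI-BERT's dataflow: after each layer's transform,
    the hidden state is multiplied by that layer's negated conceptor
    (I - C_l). In real CI-BERT, continued training then lets the weights
    absorb the per-layer bias negation."""

    def __init__(self, layer_weights, layer_conceptors):
        assert len(layer_weights) == len(layer_conceptors)
        self.weights = layer_weights
        self.neg = [np.eye(C.shape[0]) - C for C in layer_conceptors]

    def forward(self, h):
        for W, notC in zip(self.weights, self.neg):
            h = np.tanh(h @ W)   # stand-in for one transformer layer
            h = h @ notC         # conceptor intervention after the layer
        return h
```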

Sentence Encoder Association Test
The Sentence Encoder Association Test (SEAT) (May et al., 2019) is an extension of the Word Embedding Association Test (WEAT) (Caliskan et al., 2017). It measures bias at the sentence level for different kinds of bias (Meade et al., 2022).
SEAT uses two types of words: attribute words W_a (e.g. he/she) and target words W_t (e.g. occupations), which we expect to be gender-neutral. That is, the associations between w_a/w′_a ∈ W_a and w_t ∈ W_t should show no difference in the sentence-template representations of LLMs.
Denote the sentence sets of the attribute words as A and A′, and of the target words as T and T′. The test statistic is

  s(T, T′, A, A′) = Σ_{t∈T} c(t, A, A′) − Σ_{t′∈T′} c(t′, A, A′),

where, for each sentence s, c(s, A, A′) is the difference of the mean cosine similarity of s with the sentences from A and from A′:

  c(s, A, A′) = mean_{a∈A} cos(s, a) − mean_{a′∈A′} cos(s, a′).

The amount of bias is given by the effect size

  d = ( μ({c(t, A, A′)}_{t∈T}) − μ({c(t′, A, A′)}_{t′∈T′}) ) / σ({c(s, A, A′)}_{s∈T∪T′}),

where μ and σ denote the mean and standard deviation, respectively. The smaller the absolute value of d, the less bias has been detected. The one-sided p-value measures the likelihood that a random resampling of the sentence sets that contain the attribute words would generate the observed test statistic.
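The effect-size computation can be sketched as follows, assuming the sentence-template embeddings have already been extracted from the model (helper names are ours):

```python
import numpy as np

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def c(s, A, Ap):
    """Association of sentence embedding s: difference of its mean cosine
    similarity to the attribute sets A and A' (rows are embeddings)."""
    return np.mean([cos(s, a) for a in A]) - np.mean([cos(s, a) for a in Ap])

def effect_size(T, Tp, A, Ap):
    """SEAT/WEAT effect size d over target sets T, T' and attribute sets A, A'."""
    assoc = np.array([c(t, A, Ap) for t in T])
    assoc_p = np.array([c(t, A, Ap) for t in Tp])
    pooled = np.concatenate([assoc, assoc_p])
    return (assoc.mean() - assoc_p.mean()) / pooled.std(ddof=1)
```

Swapping the two target sets negates d, so only its absolute value matters when comparing debiasing methods.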

Gender Co-Reference Resolution
As described by Gonen and Goldberg (2019), SEAT can detect only the presence, not the absence, of bias. To further understand how the conceptor-aided methods perform at debiasing, we adopt an end task: gender co-reference resolution.
WinoBias (Zhao et al., 2018) provides gender-balanced co-reference tests to evaluate LLMs' neutrality towards pronouns referring to occupations. The tests include pro-stereotypical (PRO) scenarios, where gender pronouns match gender-conforming occupations (e.g., her/nurse), and anti-stereotypical (ANTI) scenarios, where gender pronouns are paired with occupations that do not conform to the stereotype. The bias is measured by the average and absolute difference in F1 scores between the PRO and ANTI subsets.
Algorithm 1 The conceptor-aided debiasing procedure

Require: large language model M_θ (with parameters θ), bias attribute wordlist W, and corpus S.
1: X ← [ ]
2: for each word w ∈ W do
3:   for each sentence s ∈ S do
4:     if w occurs in s then
5:       w_c ← the embedding of w inside M_θ(s)  // get contextualized word embedding
6:       X ← [X; w_c]  // stack as a matrix
7:     end if
8:   end for
9: end for
10: C ← XX^⊤(XX^⊤ + I)^{−1}  // construct conceptor bias subspace
    // note that different X_i yield different C_i
11: // cross bias subspaces with the AND operator (for intersectional debiasing)
12: // unite bias subspaces with the OR operator (for robust debiasing)
13: ¬C ← I − C  // make the negation conceptor matrix with the NOT operator
14: for each new sentence t do
15:   t* ← ¬C t  // debias the sentence by projection
16: end for

Based on this, de Vassimon Manela et al. (2021) develop two intuitive metrics, skew and stereotype, to better probe model fairness. In their formulas, skew is computed from the difference between the male and female F1-scores within the same test condition, and stereotype from the difference between the pro- and anti-stereotypical F1-scores for the same gender, where superscripts M and F denote male and female respectively and F1 stands for the F1-score.
It has been shown that there is an approximate trade-off between these two biases. The authors argue that the T2 test set of WinoBias is better than the T1 test set at revealing bias, as the latter is less ambiguous to LLMs. Therefore, we report only T2 here.

Debiasing Results
This section aims to answer these questions:
• What is the best setting for bias subspace generation within conceptor-aided debiasing?
• Given the best setting, can the conceptor-aided methods debias LLMs effectively while retaining useful semantics?
For comparison, the performance of SoTA debiasing methods (Meade et al., 2022) is included in the tables.

Models
To investigate the generalization of conceptor debiasing, we explored typical LLM families at different scales: BERT-T (bert-tiny), BERT (bert-base-uncased), BERT-L (bert-large-uncased), GPT2 (gpt2), GPT2-L (gpt2-large), and GPT-J (gpt-j). We did not test GPT-3 and ChatGPT since their embedding models (e.g. text-embedding-ada-002) do not support contextualized embeddings at the token level. However, given the similar modeling, once we have such embeddings, the conceptor techniques can be transferred.

Bias Subspace Construction with Robustness Boosted via OR Operator
We construct the conceptor bias subspaces upon the different combinations of corpora, wordlist selections, and outlier removal.
To evaluate the corpora, testing on the last layer of BERT, we compare the debiasing results on SEAT of three different corpora: Brown, SST, and Reddit. Table 8 shows that Brown stably provides the best debiasing result even when using different wordlist subspaces. The SST corpus is a close second, while Reddit is by far the worst; the style of the Reddit corpus is likely the least similar to that of the SEAT evaluations.
To evaluate alternative methods of constructing the bias wordlist subspace, we use the five subspaces described in Section 4.1. Among them, the or subspace is the most robust; see Tables 9, 10 and 11. Combining the pronouns, extended, and propernouns subspaces with or represents the union of the distinct concepts (and hence subspaces) of each of the wordlists; it thus outperforms both the individual wordlists and the all subspace, which simply concatenates all the wordlists, giving a less precise subspace definition.
To evaluate wordlist outlier removal, we define the outliers by the UMAP filter discussed in Section 4.1 and generate different percentages of the words used to capture bias. For example, the all subspace has 2071 words at the 0.5−1.0 percentiles, 2061 at the 0.4 percentile, 1601 at the 0.3 percentile, 430 at the 0.2 percentile, and 82 at the 0.1 percentile (Table 6). We observe that including fewer words often leads to higher debiasing performance, presumably due to the removal of outliers. However, an extremely small percentile, say 0.1, harms the effectiveness of debiasing because too few words are left (Tables 9, 10 and 11). Similar results are obtained using t-SNE (Van der Maaten and Hinton, 2008).
In conclusion, the optimal setting for BERT-T is "sst-0.5-or" (SST corpus; percentile 0.5; or subspace); similarly, for BERT it is "brown-0.4-or" (Brown; percentile 0.4; or subspace). For the other models, unless otherwise mentioned, the default is "brown-1.0-or". Henceforth, these settings are used for conceptor debiasing on the respective models.

Post-Processing Debias via NOT Operator
For general debiasing via conceptor negation post-processing, the performance is excellent. The SEAT score of BERT decreases from 0.620 to around 0.350−0.400 with the Brown Corpus (Table 9), and can be as low as 0.311 using the setting "brown-0.4-or", outperforming the debiasing results of CDA, DROPOUT, and SENTENCEDEBIAS (Table 1). The success of debiasing is further verified by WinoBias (Table 2), where the skew bias drops from 38.3 to 22.3 without any additional fine-tuning. Although the stereotype bias increases, this is not only expected, since the two biases trade off, but also acceptable, as they now reach a good balance (de Vassimon Manela et al., 2021).
The debiasing conceptors are robust and generalizable: as shown in Table 1, the debiasing performance is consistent across different scales of BERT and GPT models. Note that the settings for BERT-L, GPT2-L, and GPT-J were not tuned (i.e. default settings), which means they could likely reach much lower SEAT scores. Moreover, conceptors can mitigate the bias in almost all scenarios: no matter which corpus, bias subspace, or wordlist threshold is used (Tables 9, 10 and 11); no matter which LLM (Tables 1, 15, 16, 17 and 19); no matter which layer (Tables 12, 13 and 18); and no matter which type of bias (Tables 3, 21 and 22).

Intersectional Debias via AND Operator
Table 3 empirically shows that conceptors can not only mitigate different types of biases but can also intersect the existing bias subspaces (e.g. gender and race) to create a mixed conceptor matrix in an efficient way, rather than debiasing gender and race separately. Furthermore, to assess intersectional debiasing, we employ the I1-I5 intersectional bias tests introduced by Tan and Celis (2019), which adapt SEAT to examine the privilege associated with the combination of being African/European American and being male or female. The results demonstrate that the intersected conceptor formed via the AND operator can effectively reduce multi-dimensional bias, lowering the SEAT score from 0.673 to 0.434, while its conceptor counterparts focused solely on single-dimensional bias only reduce the score to 0.613 and 0.635, respectively.

Conceptor-Intervention Debias
We use the CI-BERT architecture to continue training the models to obtain new weights. We then evaluate the combinations of architectures and weights as an ablation study (Types I, II, and III). Among them, Type III can outperform conceptor post-processing (Table 1) as well as Types I and II (Table 4). Compared to the SEAT score after post-processing, Type I outperforms it at each layer of BERT-T but underperforms it at most layers of BERT (Tables 13 and 18). In short, using CI-BERT with the newly trained weights can yield the lowest bias in the model and is promising to beat post-processing. For example, when using the setting "brown-0.4-or", the lowest SEAT score is 0.280, beating the post-processing result of 0.311 and more than half of the SoTA methods. This is verified again by gender co-reference resolution. To further study the feasibility and robustness of CI-BERT continued training with respect to model properties, we experiment on both BERT-T and BERT and plot the average SEAT curve along the training steps (Figure 6). Both can beat their post-processing counterparts at some steps during the early training stage, but then the bias fluctuates and increases again, perhaps because the model re-learns the bias during continued training, or over-saturates the conceptor bias projections into its weights.
In comparison, the continually trained CI-BERT lowers the bias more stably in the smaller BERT model. We suspect this is related to model complexity. The debiasing projection of the last layer's conceptor matrix acts on the last hidden state and is thus generated transitively from all the prior layers. Currently, we embed all layers' conceptor matrices, which may lead to overlapping and redundant debiasing projections from the prior layers.

Maintaining Meaningful Semantics
To understand how conceptor debiasing impacts downstream natural language understanding (NLU) tasks, the GLUE benchmark (Wang et al., 2018), comprising nine different tasks, is used to evaluate the model after debiasing (Table 5). While there seems to be no consensus on a quantitative threshold for the trade-off between language modeling capability and debiasing performance, a small decrease may be acceptable depending on the downstream tasks. We believe that, in an ideal scenario, the performance on the GLUE benchmark should not significantly decline after debiasing.
Conceptor post-processing of BERT retains and even improves the useful semantics for downstream tasks (increasing the average GLUE score by 1.77) without damaging the model's abilities, outperforming all other listed SoTA debiasing methods. Even when scaling to BERT-L, the GLUE score is still slightly higher. In comparison, the average GLUE score of conceptor continued-training BERT is relatively low, although not the worst among all the methods. This indicates that the continued-training method, while still capable of outperforming its post-processing counterpart in debiasing under the same setting, may sacrifice NLU abilities.
Since GPT is an autoregressive model, we adopt its SequenceClassification counterpart for the GLUE benchmark, following the method of Meade et al. (2022). The scores of GPT2 and GPT-J decrease slightly, by 0.11-0.14, an affordable cost, while GPT2-L increases slightly by 0.05. Notice that even when trained on the original BERT architecture, the average GLUE score still drops by about 0.3 points. Thus, the lower GLUE score here is not entirely caused by CI-BERT, though the actual reason is hard to determine due to training randomness (McCoy et al., 2019).

Conclusion and Future Work
We have shown that conceptor-aided debiasing can successfully mitigate bias in LLMs (e.g., BERT, GPT) via the NOT operation. Specifically, conceptor post-processing outperforms many state-of-the-art debiasing methods in both debiasing effectiveness and semantic retention. We also tested a new architecture, conceptor-intervened BERT (CI-BERT), which, in combination with continued training, takes all layers' biases into account and shows promise in outperforming its post-processing counterpart, though possibly at the cost of increased instability and worse semantic retention. In all cases, the best conceptor matrices are generally obtained when the bias subspace is constructed using (1) a cleaner corpus, (2) the union of different related wordlists (e.g. pronouns, roles, and names) via the conceptor OR operation, and (3) removal of outliers from the wordlists. We further show that conceptor-aided debiasing is robust across different LLMs, various layers of the models, and varied types of biases. Moreover, conceptors can reuse the current conceptor matrices to construct a new conceptor matrix that mitigates intersectional bias efficiently via the AND operation.
In future research we plan to make CI-BERT and intersectional conceptors more robust and effective.

Limitations
We list several limitations of our work below.
1) We only test binary bias. We test only biases in pairs via SEAT and WinoBias, for example 'male'/'female' or 'young'/'old'. However, it is widely recognized that categories such as gender and race can be multi-polar.
2) Our results are limited to English, and both the corpora and the wordlists skew towards North American social biases. Our entire experiment is conducted in English. In addition, the Brown and SST corpora were collected entirely in a North American environment, as were the wordlists. Therefore, they are expected to skew towards North American social biases. When models are debiased in a North American context, it is necessary to understand how effective the debiasing remains when transferred to other cultures.
3) The generalization of conceptor-aided debiasing techniques could be tested more exhaustively. This work has tested it on gender and race, but it could also be tested on other types of bias, such as religious bias and hate speech.

Ethical Considerations
The definition and recognition of bias are subtle.For example, we have used simple traditional binary definitions of male and female to examine gender bias.This, of course, ignores a much wider variety of gender identities, thus introducing an implicit bias to the analysis.Similarly, studies on racial bias rely on possibly problematic definitions of race.Core to our debiasing method is the selection of the wordlists.Each wordlist carries its own implicit definitions of gender, race, and other important dimensions.Care should be used to ensure that they represent the desired categories.To this end, it is often useful to involve people from the communities whose language is being debiased to better represent their values and belief systems.
One should also be careful in the use of debiasing. Removing signals about race or gender is often beneficial in reducing discrimination or in producing better translations. It may also remove key features of models needed for analyses. For example, removing gender or race 'signal' from the model may severely hamper the use of that model in gender studies or work on critical race theory. "White-washing" models is not always a benefit; sometimes one wants to see the bias inherent in a corpus.

B Model Checkpoints
We use the Hugging Face Transformers package (Wolf et al., 2019) in our experiments. The models and checkpoint names are given in Table 7.

D GLUE Details
Before being evaluated on GLUE, each model is fine-tuned for three epochs with the following settings: batch_size 32, maximum_sequence_length 128, and learning_rate 2e−5, the same as Meade et al. (2022).
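As a rough illustration, the fine-tuning budget implied by these settings can be captured in a small configuration sketch. The dictionary keys follow common Hugging Face naming conventions and are our own assumption, not the authors' released code:

```python
# Hypothetical sketch of the GLUE fine-tuning configuration described above.
glue_finetune_config = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 32,
    "max_seq_length": 128,
    "learning_rate": 2e-5,
}

def total_steps(num_examples: int, cfg: dict) -> int:
    """Optimizer steps for one fine-tuning run (no gradient accumulation assumed)."""
    steps_per_epoch = -(-num_examples // cfg["per_device_train_batch_size"])  # ceiling division
    return steps_per_epoch * cfg["num_train_epochs"]
```

For example, a task with 3,200 training examples would take 300 optimizer steps under this schedule.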

E Full Bert-Base-Uncased Model Results
• Table 8 shows the gender debiasing results for different types of corpora, using the last layer of "bert-base-uncased" as a benchmark.
• Tables 9, 10, and 11 show the post-processing gender debiasing results for different percentiles of the wordlist on three different corpora: Brown, SST, and Reddit, respectively.
• Tables 12 and 13 show the post-processing and conceptor-intervened gender debiasing results for each layer on two different corpora: Brown and SST, respectively.
• Table 14 contains GLUE results for the gender-debiased model.

F Full Bert-Tiny Model Results
• Tables 15, 16, and 17 show the post-processing gender debiasing results for different percentiles of the wordlist on three different corpora: Brown, SST, and Reddit, respectively.
• Table 18 shows the post-processing and conceptor-intervened gender debiasing results for each layer on the SST corpus.

G Full GPT2 Model Debiasing Results
• Table 19 shows the post-processing gender debiasing results for different percentiles of the wordlist on the Brown corpus.

H Full Other LLMs' GLUE Results
• Table 20 contains GLUE results for the gender-debiased models.

I Full Intersectional Debiasing Results
• Tables 21 and 22 contain the full intersectional debiasing results.

Figure 3 :
Visualizing the Boolean operations on two conceptor matrices. The OR (AND) operator leads to the smallest (largest) conceptor ellipsoid, in pink, that contains (is contained in) both operands (He and Jaeger, 2018). In our case, the result is then negated by the NOT operator to debias.
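For reference, the conceptor algebra behind this figure can be sketched in a few lines of NumPy. The formulas follow He and Jaeger (2018); the aperture value, data shapes, and the assumption that all conceptors involved are invertible are illustrative simplifications:

```python
import numpy as np

def conceptor(X: np.ndarray, aperture: float = 1.0) -> np.ndarray:
    """Conceptor of data X (rows = samples): C = R (R + aperture^-2 I)^-1,
    where R is the sample correlation matrix of X."""
    n, d = X.shape
    R = X.T @ X / n
    return R @ np.linalg.inv(R + aperture ** -2 * np.eye(d))

def NOT(C: np.ndarray) -> np.ndarray:
    """Negation: the soft projection onto the complement of C's ellipsoid."""
    return np.eye(C.shape[0]) - C

def AND(C1: np.ndarray, C2: np.ndarray) -> np.ndarray:
    """Conjunction (assumes C1, C2 invertible): (C1^-1 + C2^-1 - I)^-1."""
    d = C1.shape[0]
    return np.linalg.inv(np.linalg.inv(C1) + np.linalg.inv(C2) - np.eye(d))

def OR(C1: np.ndarray, C2: np.ndarray) -> np.ndarray:
    """Disjunction via De Morgan's law: NOT(NOT(C1) AND NOT(C2))."""
    return NOT(AND(NOT(C1), NOT(C2)))
```

Because OR and AND are duals under NOT, combining two bias subspaces with OR and then negating the result yields a single soft projection away from both.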

Figure 5 :
Conceptor-Intervened BERT (CI-BERT). Each layer X of the model takes the matrix product (blue circle) of its hidden states with the conceptor-X generated from that layer. The resulting model can be used directly or continually trained.
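At their core, both conceptor post-processing and the per-layer CI-BERT intervention amount to multiplying embeddings by a negated conceptor. A minimal NumPy sketch, in which the wordlist embeddings, the aperture, and the single bias direction are illustrative assumptions:

```python
import numpy as np

def negated_bias_conceptor(bias_embeddings: np.ndarray, aperture: float = 1.0) -> np.ndarray:
    """NOT-conceptor of the bias subspace spanned by the given wordlist
    embeddings (rows = samples): a soft projection away from that subspace."""
    n, d = bias_embeddings.shape
    R = bias_embeddings.T @ bias_embeddings / n
    C = R @ np.linalg.inv(R + aperture ** -2 * np.eye(d))
    return np.eye(d) - C  # NOT C

def debias(embeddings: np.ndarray, not_c: np.ndarray) -> np.ndarray:
    """Apply the soft projection to each row embedding: x -> x (NOT C)."""
    return embeddings @ not_c
```

Directions with strong variance in the bias wordlist embeddings are shrunk towards zero, while directions orthogonal to the bias subspace pass through almost unchanged.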
• Does conceptor post-processing mitigate bias and beat SoTA?
• Does embedding conceptors into LLMs via continued training beat post-processing?
• What roles can the conceptor operators (NOT, OR, AND) play in the debiasing pipeline?
To help comparison, the SoTA debiasing results from Meade et al. (

Figure 6 :
SEAT score curve of CI-BERT continued training. We compare the results with the original embeddings and the post-processed embeddings. We test on the last layer of BERT-T (top) and BERT (bottom).

Table 1 :
SEAT effect size of the gender-debiased BERT and GPT models. Effect sizes closer to 0 indicate less biased sentence representations (bolded values). Statistically significant effect sizes at p < 0.01 are denoted by *. The final column is the average absolute SEAT score of the first six columns. Default means using the default setting: Brown corpus, no wordlist filtering, and OR subspace; tuned means using the optimal combination of corpus, wordlist percentile, and conceptor bias subspace. P.P. stands for post-processing, while C.T. stands for continued training. The full version is in Appendixes E and G.
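For context, the SEAT effect size reported in this and the following tables is the WEAT-style Cohen's d of May et al. (2019), computed on sentence representations. A minimal NumPy sketch, using toy vectors rather than our actual sentence templates:

```python
import numpy as np

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def association(w, A, B) -> float:
    """s(w, A, B): mean cosine similarity of w to attribute set A minus to B."""
    return np.mean([cos(w, a) for a in A]) - np.mean([cos(w, b) for b in B])

def seat_effect_size(X, Y, A, B) -> float:
    """Effect size (Cohen's d) between target sets X and Y with respect to
    attribute sets A and B; values closer to 0 indicate less bias."""
    sX = [association(x, A, B) for x in X]
    sY = [association(y, A, B) for y in Y]
    return (np.mean(sX) - np.mean(sY)) / np.std(sX + sY, ddof=1)
```

An effect size near 0 (the bolded values in the table) means the two target sets X and Y associate equally with the attribute sets A and B.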

Table 3 :
SEAT effect size of the race-, gender-, and intersectionally debiased BERT model, where the absolute average SEAT scores for gender, race, and intersectional bias are computed across 6, 7, and 5 tests, respectively. The full version is in Appendix I. § indicates the conceptor matrix generated by negating the AND of the gender and race conceptor matrices.

Table 4 :
The ablation study of the architecture and weights of CI-BERT, evaluated by SEAT (the same as Table 1).

Table 5 :
GLUE validation set results for the gender-debiased BERT and GPT models. The full version is in Appendixes E and H.

Table 7 :
Model and checkpoint names used in our experiments.

Table 8 :
SEAT effect size of gender debiasing. The impact of different corpora on bert-base-uncased models. Effect sizes closer to 0 indicate less biased sentence representations (bolded values). Statistically significant effect sizes at p < 0.01 are denoted by *. Note that "conceptor-X (subspace)" indicates that the conceptor negation matrix is generated from layer X of the language model in combination with the subspace of the specific attribute wordlist. The top-3 best performances are colored in orange.

Table 9 :
SEAT effect size of gender debiasing. The impact of different percentiles of the wordlist (using UMAP clustering) on the Brown Corpus, bert-base-uncased models. The top-3 best performances are colored in orange.

Table 10 :
SEAT effect size of gender debiasing. The impact of different percentiles of the wordlist (using UMAP clustering) on the SST Corpus, bert-base-uncased models. The top-3 best performances are colored in orange.

Table 11 :
SEAT effect size of gender debiasing. The impact of different percentiles of the wordlist (using UMAP clustering) on the Reddit Corpus, bert-base-uncased models. The top-3 best performances are colored in orange.

Table 12 :
SEAT effect size of gender debiasing from CI-BERT, Type I. The conceptor-intervened performance of different layers' conceptors on the Brown Corpus, bert-base-uncased models. The setting is "brown-0.4-or". The layer(s) of CI-BERT that outperform the conceptor post-processing of the same layer(s) are colored in orange.

Table 13 :
SEAT effect size of gender debiasing from CI-BERT, Type I. The conceptor-intervened performance of different layers' conceptors on the SST Corpus, bert-base-uncased models. The setting is "sst-0.9-extended". The layer(s) of CI-BERT that outperform the conceptor post-processing of the same layer(s) are colored in orange.

Table 14 :
GLUE validation set results for the gender-debiased BERT model. We use the F1 score for MRPC, the Spearman correlation for STS-B, and the Matthews correlation for CoLA. For all other tasks, we report accuracy. All scores are averaged over three runs. The model is "bert-base-uncased". The top-3 best performances are colored in orange.

Table 15 :
SEAT effect size of gender debiasing. The impact of different percentiles of the wordlist (using UMAP clustering) on the Brown Corpus, bert-tiny models. The top-3 best performances are colored in orange.

Table 16 :
SEAT effect size of gender debiasing. The impact of different percentiles of the wordlist (using UMAP clustering) on the SST Corpus, bert-tiny models. The top-3 best performances are colored in orange.

Table 17 :
SEAT effect size of gender debiasing. The impact of different percentiles of the wordlist (using UMAP clustering) on the Reddit Corpus, bert-tiny models. The top-3 best performances are colored in orange.

Table 18 :
SEAT effect size of gender debiasing from CI-BERT, Type I. The conceptor-intervened performance of different layers' conceptor matrices on the SST Corpus, bert-tiny models. The layer(s) of CI-BERT that outperform the conceptor post-processing of the same layer(s) are colored in orange.

Table 19 :
SEAT effect size of gender debiasing. The impact of different percentiles of the wordlist (using UMAP clustering) on the Brown Corpus, gpt-2 models. The top-3 best performances are colored in orange.

Table 20 :
GLUE validation set results for other LLMs. We use the F1 score for MRPC, the Spearman correlation for STS-B, and the Matthews correlation for CoLA. For all other tasks, we report accuracy.

Table 21 :
BERT intersectional gender debiasing, where the intersected conceptor indicates the conceptor matrix generated by negating the AND of the gender and race conceptor matrices.

Table 22 :
BERT intersectional race debiasing, where the intersected conceptor indicates the conceptor matrix generated by negating the AND of the gender and race conceptor matrices.