You Are My Type! Type Embeddings for Pre-trained Language Models



Introduction
Pre-trained language models (PLMs) based on transformers (Vaswani et al., 2017) have achieved state-of-the-art results in several downstream NLP tasks (Devlin et al., 2019; Liu et al., 2020). Being trained in a self-supervised fashion, such models convey, to a certain extent, linguistic (Puccetti et al., 2021; Lin et al., 2019) and factual knowledge (Rogers et al., 2020; Meng et al., 2022). Being able to faithfully extract the desired knowledge is a crucial aspect that has sparked lots of interest (Petroni et al., 2019; Bouraoui et al., 2020).
However, querying the PLM for information is not always reliable and requires more than a manually-written prompt as input (Petroni et al., 2020). This is opposed to a standard knowledge graph (KG), where users formulate a structured SPARQL query specifying exactly what to expect in the output. For example, the query "SELECT ?x WHERE { wd:Q76 wdt:P26 ?x }" returns the spouse of Barack Obama, "Michelle Obama". In the PLM setting, the SPARQL query could be replaced by a natural-language prompt, such as "The spouse of Barack Obama is [MASK]". While the predictions for the prompt are reasonable (left-hand side of Figure 1), they do not reflect the requirement of getting instances of a specific type (names of people) in the output. In fact, among BERT's top-1 predictions on prompts where the desired output type is a MUSICAL INSTRUMENT (e.g., "Philip Glass plays [MASK]"), more than half follow different types such as SPORT ("plays football") and CHARACTER ("plays Hamlet"), instead of the expected "plays piano". Indeed, differently from the KG with typed entities, the type information is absent from the input prompt, thus bringing no guarantee about the expected type.
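To make the setting concrete, the probing setup can be reproduced with a few lines of code. The sketch below (assuming the Hugging Face transformers library and the bert-base-cased checkpoint; the paper's own code is in the repository linked below) queries BERT with the instrument prompt and prints the top-5 predictions, mirroring the left-hand side of Figure 1:

```python
# Minimal probing sketch: top-5 [MASK] predictions with log probabilities.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased").eval()

inputs = tokenizer("Philip Glass plays [MASK].", return_tensors="pt")
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]
log_probs = logits.log_softmax(dim=-1)
for score, idx in zip(*log_probs.topk(5)):
    print(f"{tokenizer.convert_ids_to_tokens(idx.item())}: {score.item():.2f}")
```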
While several works try to remedy this by engineering prompts to satisfy a desired type (Jiang et al., 2020; Shin et al., 2020; Zhong et al., 2021), or by relying on external sources to enrich the prompt (Petroni et al., 2020), these approaches do not fully exploit the latent concepts encoded in the PLM (Dalvi et al., 2022). To fill this gap, we introduce the notion of Type Embeddings (TEs). Similar to how positional embeddings in a PLM encode information about the position of a token in an input (Wang and Chen, 2020), TEs encode the expected type information of the output. The definition of a TE requires neither supervised training nor external resources, as it simply uses the existing PLM token embeddings, e.g., people names, to obtain type information, e.g., for PERSON. TEs can then be naturally injected into the input embedding layer of a PLM to embody the expected type in the output (right-hand side of Figure 1). Driving the model towards the expected type can help in applications exploiting PLMs, such as data integration (Cappuzzo et al., 2020), data cleaning (Narayan et al., 2022), rule induction (Cui and Chen, 2021), and fact-checking (Lee et al., 2020).
Our contributions can be summarized as follows:

Figure 1: Top-5 predictions of BERT (with log probabilities) for a given prompt (left) and the changes when adding type information (right). Tokens following the desired type are colored. The correct answer is underlined.
• We introduce TYPE EMBEDDINGS (TEs), which, similar to positional embeddings, can be added to the input of PLMs and effectively encode type information. We show how to compute these embeddings using only labeled tokens that adhere to the specific type; the main idea is to remove the first singular vector of the token embedding matrix (Section 3).
• We propose methods to analyze type embeddings and evaluate their effectiveness by (i) measuring their semantic similarity to instances of the type, (ii) assessing the sensitivity of tokens to a given type, and (iii) analyzing layer-wise type classification (Section 4).
• We inject type embeddings into PLMs and show an increase in performance on a factual probing dataset (LAMA) and the alleviation of "type bias" for a prompt by steering the output type with TEs (Section 5). We conclude the paper by discussing future directions, including the extension of our approach from types to more generic concepts (Section 6). Data and code for the paper are available at https://github.com/MhmdSaiid/TypeEmbedding.

Related Work
PLMs have been largely studied in recent years, with most analyses focusing on the attention mechanism (Voita et al., 2019; Vig and Belinkov, 2019) and on the role of embeddings (Rogers et al., 2020; Li et al., 2021; Clark et al., 2019).
However, none of those efforts study the notion of types that we introduce. One exception is a pair of recent studies on how concepts are encoded in PLMs. One work analyzes BERT by clustering contextual representations across layers, followed by a manual annotation to label clusters with meaningful concepts (Dalvi et al., 2022). Another work treats the feedforward network of a transformer as a key-value memory and studies how certain vectors encode concepts in the vocabulary space (Geva et al., 2022). Our effort differs in two ways. First, we do not require the labeling of artifacts from the PLM; rather, we rely on user-specified tokens to model their common type. Second, we focus on type, which is one semantic concept, leaving others, such as syntactic, morphological, and lexical concepts, to future work (Section 6).
Our approach is related to the interpretation of a neural net's internal state in terms of a concept defined through a vector (Kim et al., 2018; Schrouff et al., 2021). A Concept Activation Vector (CAV) is derived from example images as the normal to the hyperplane separating, in the model's activation at a certain layer, examples with the target concept from examples without it. By Testing with a CAV (TCAV), one can, for example, identify the importance of the color 'red' in fire-engine images for a neural network. We use CAVs on textual input, rather than on images, to measure how sensitive the model is to a type after adding its TE (Section 4.3). However, while CAV is a sensitivity measurement tool, TEs steer the target type in the model's output. A work sharing the same spirit as ours uses a vector to steer the output of a PLM for style transfer between sentences (Subramani et al., 2022). However, our method requires only 10 tokens per type, as opposed to 100 labeled sentences for style transfer, and it also works with GPT.
Our work introduces a new kind of type embedding to enrich the input to the PLM, in analogy to positional embeddings (Wang and Chen, 2020; Wang et al., 2021a). We show the benefit of our solution on the LAMA benchmark (Petroni et al., 2019), which contains cloze statements to query the PLM for a masked token. To enhance a PLM's performance on such a task, previous works improve prompts by mining or paraphrasing new prompts (Jiang et al., 2020), by adding trigger tokens (Shin et al., 2020), by finding vectors for prompts in the embedding space without restriction to the PLM's vocabulary (Zhong et al., 2021), or by combining multiple prompts (Qin and Eisner, 2021). As we simply add the type embedding to the input, our work is also different from approaches that pre-train an adapter to enhance PLMs' factual knowledge (Wang et al., 2021b) or rely on information retrieval to provide additional context for the prompt (Petroni et al., 2020). Finally, unlike approaches that modify the underlying model by triggering the neurons responsible for a prediction (Dai et al., 2022) or by producing an alternative model with edited facts (De Cao et al., 2021), we steer the output while leaving the model unchanged.

Type Embedding
In this section, we propose how to compute TEs from PLM token embeddings (Section 3.1) and how to use them (Section 3.2). Following the work on latent concepts in BERT (Dalvi et al., 2022), we focus on this model and report results on other PLMs in Table A2 in the Appendix.

Obtaining the TE
Given a type $t$, let the matrix $P_t \in \mathbb{R}^{n \times d}$ hold the token embeddings for $n$ different tokens, where $d$ is the dimension of the token embeddings. The $n$ tokens are instances of a specific type $t$. We call these tokens positively typed tokens.
For our analysis of $P_t$, we apply Singular Value Decomposition (SVD). The SVD of an $m \times n$ matrix $M$ factorizes it into $M = U \Sigma V^T$, where $U$ is an $m \times m$ unitary matrix, $\Sigma$ is an $m \times n$ diagonal matrix, and $V$ is an $n \times n$ unitary matrix. We call the column vectors of $U$ and $V$ singular vectors. The diagonal values of $\Sigma$ are called singular values.
Assuming that $M$ is a matrix where each row contains the features of a data point, the first singular vector of $V$, corresponding to the highest singular value, gives the direction of maximum variance of the covariance matrix. In other words, it is the vector that contains the "common part" of all data points.
The SVD of our matrix is $P_t = U \Sigma V^T$. The first column of the matrix $V$, $v^{(1)}$, is the first singular vector, which encodes information common to all $n$ tokens. We hypothesize that this vector, unlike the other singular vectors, contains non-type-related information and needs to be removed from the input to promote the type information encoded in the other singular vectors (more details in Section 4.1). A similar observation has been made for multilingual representations (Roy et al., 2020), where removing $r$ singular vectors leaves semantic-related information in the input representations (Yang et al., 2021). Thus, the embedding to be added to promote type $t$ is $E_t = -\lambda v^{(1)}$, where $\lambda$ is a multiplier that is tuned on a hold-out dataset.
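The computation is straightforward to implement. The following is a minimal sketch under the definitions above (PyTorch; the token list and $\lambda$ shown are illustrative placeholders, not the KG-sampled tokens used in the experiments):

```python
# Sketch of Section 3.1: E_t = -lambda * v1, where v1 is the first
# right-singular vector of the matrix of positively typed token embeddings.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")

def type_embedding(typed_tokens, lam=1.0):
    emb = model.bert.embeddings.word_embeddings.weight.detach()  # (V, d)
    ids = tokenizer.convert_tokens_to_ids(typed_tokens)  # tokens must be in vocab
    P_t = emb[ids]                                       # (n, d)
    _, _, Vh = torch.linalg.svd(P_t, full_matrices=False)  # P_t = U diag(S) Vh
    return -lam * Vh[0]                                  # first right-singular vector

# Illustrative PERSON-like tokens (placeholders).
E_t = type_embedding(["John", "Mary", "Peter", "Anna", "Paul",
                      "David", "Laura", "Mark", "Sarah", "James"], lam=2.0)
```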
In practice, a type embedding is derived from a small set of tokens that are instances of the same type. Those can be provided by users or obtained from existing typed resources such as KGs. In the rest of the paper, the TEs are computed based on weighted sampling from KG entities. We query DBpedia (Auer et al., 2007) for tokens adhering to a specific type, keeping only those in the PLM's vocabulary, and use their node degree as the sampling weight.
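One possible way to gather such tokens is sketched below: a SPARQL query against the public DBpedia endpoint that approximates node degree by the number of outgoing triples. The exact query and weighting in the released code may differ, and random.choices samples with replacement for simplicity:

```python
# Sketch: degree-weighted sampling of typed tokens (here dbo:City) from DBpedia.
import random
from SPARQLWrapper import SPARQLWrapper, JSON
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label (COUNT(*) AS ?deg) WHERE {
  ?e a dbo:City ; rdfs:label ?label ; ?p ?o .
  FILTER(lang(?label) = 'en')
} GROUP BY ?label ORDER BY DESC(?deg) LIMIT 500
""")
sparql.setReturnFormat(JSON)
rows = sparql.query().convert()["results"]["bindings"]

# Keep only labels that are single tokens of the PLM's vocabulary.
vocab = tokenizer.get_vocab()
candidates = [(r["label"]["value"], int(r["deg"]["value"]))
              for r in rows if r["label"]["value"] in vocab]
tokens, weights = zip(*candidates)
sample = random.choices(tokens, weights=weights, k=10)  # weighted sample
```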

Using the TE
Assuming that a user has obtained the TE for the expected output type, the TE is simply added to the [MASK] input embedding, in analogy to token and positional embeddings. Figure 2 shows an example for a prediction where we enforce a YEAR type.
Depending on the task at hand, the TE can be added to one or more tokens. We found it more effective to add it only to the [MASK] token for MLM tasks, while for text generation it is more effective to add the TE to all tokens in the prompt.
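Injection amounts to adding $E_t$ to the input embedding at the [MASK] position and feeding the result through the model via its inputs_embeds interface; position and segment embeddings are still added internally by BERT. A minimal sketch:

```python
# Sketch of Section 3.2: inject the TE at the [MASK] position.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased").eval()

def predict_with_te(prompt, E_t, k=5):
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        emb = model.bert.embeddings.word_embeddings(inputs.input_ids)
        emb[0, mask_pos] += E_t  # add the type embedding to [MASK]
        logits = model(inputs_embeds=emb,
                       attention_mask=inputs.attention_mask).logits
    top = logits[0, mask_pos].topk(k)
    return tokenizer.convert_ids_to_tokens(top.indices.tolist())
```

For example, with a YEAR TE (call it E_year, a hypothetical variable here), predict_with_te("Mozart was born in [MASK].", E_year) should rank year tokens higher, as in Figure 2.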
While we focus on MLM, we report preliminary results for text generation in Section 6.

Analysis of TEs
Having obtained a TE, we propose a series of analysis methods to assess its validity. We use the TE as a simple type retriever (Section 4.1), study the distribution of singular vectors (Section 4.2), analyze the effect of the TE w.r.t. the output and quantify the model's sensitivity w.r.t. typed tokens (Section 4.3), perform layer-wise classification to identify the desired type (Section 4.4), and measure the TCAV of a model equipped with a TE (Section 4.5).

Similarity
As the TE is computed from token embeddings, the vector $E_t$ lives in the subspace formed by these embeddings. Therefore, we can use the TE to sort token embeddings by distance (through cosine similarity) as a qualitative confirmation that it reflects the desired type. Table 1 shows examples of TEs for three types (cities, years, and occupations) and the most similar token embeddings of BERT. This suggests that TEs could act as a standalone type retriever, to sort tokens according to type and to analyze any biases in the tokens from which the TE is computed. Applying the method to the first singular vector $v^{(1)}$ (i.e., $-E_t$), we observe that the top retrieved tokens ('.', 'and', 'the', ...) relate to syntax, suggesting that the first singular vector encodes syntactic aspects, in agreement with other work on multilingual representations (Roy et al., 2020) showing that such vectors encode non-semantic-related information (Yang et al., 2021).
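The retriever is a one-liner over the embedding matrix; a minimal sketch:

```python
# Sketch of Section 4.1: rank vocabulary tokens by cosine similarity to a TE.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")

def nearest_tokens(E_t, k=10):
    emb = model.bert.embeddings.word_embeddings.weight.detach()  # (V, d)
    sims = torch.nn.functional.cosine_similarity(emb, E_t.unsqueeze(0), dim=-1)
    return tokenizer.convert_ids_to_tokens(sims.topk(k).indices.tolist())
```

Calling nearest_tokens(-E_t) reproduces the check on the first singular vector discussed above.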

Distribution of Singular Vectors
To understand the bias imposed by the first singular vector, we analyze the distributions of the singular vectors, as it has been shown that singular vectors whose distributions deviate from a Gaussian contain bias (Shin et al., 2018).
From Figure 3, we see that the distribution of the singular vector $v^{(1)}$, corresponding to the largest singular value, clearly deviates from a Gaussian distribution, while the others do not. This is indicated by the high kurtosis values of the first singular vectors. This suggests that this singular vector could represent a common bias that affects tokens (Shin et al., 2018). Note that, since each singular vector is of dimension $d$, we report the mean of the singular vector to plot the histogram.
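Under the same definitions, the deviation can be quantified directly: the sketch below, one plausible reading of the check behind Figure 3, computes the excess kurtosis (zero for a Gaussian) of the entries of the top right-singular vectors:

```python
# Sketch of Section 4.2: excess kurtosis of the top right-singular vectors.
import torch
from scipy.stats import kurtosis  # Fisher definition: Gaussian -> 0

def singular_vector_kurtosis(P_t, top=4):
    _, _, Vh = torch.linalg.svd(P_t.detach(), full_matrices=False)
    return [kurtosis(Vh[i].numpy()) for i in range(top)]
```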

Effect of TE
We introduce two metrics for measuring TE's effectiveness.
Adversarial Accuracy. We expect that adding a TE to BERT causes the PLM to be more "type aware" in the associated task, i.e., adding the TE conveys type-related tokens in the output. For example, in an MLM task, adding the TE should rank higher the tokens following the associated type. In an NLG task, adding a TE should convey more type-related tokens in the generated text. We focus on the former and leave the latter for future work.
To validate this hypothesis, we check if the score of a positively typed token in an MLM task for a model with the associated TE is greater than that of a standard BERT model. Formally, given a model $M_t$ with an MLM head that has been equipped with a TE $E_t$ promoting a specific type $t$, we denote by $P^{(x)}_{M_t}(pr)$ the normalized output score of the token $x$ with model $M_t$ and prompt $pr$. To assess the effectiveness of the TE, we compare this normalized probability to that of an adversary, a BERT model without any equipped TE. We define the metric adversarial accuracy (AA) as:

$$AA = \frac{\left|\{x \in X_t^+ : P^{(x)}_{M_t}(pr) > P^{(x)}_{M_\emptyset}(pr)\}\right|}{|X_t^+|},$$

where $M_\emptyset$ is a model without any TE, and $X_t^+$ is a set of tokens adhering to the type $t$. A higher value indicates that the TE is able to promote PLM tokens following type $t$.
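A minimal sketch of AA with $pr = [MASK]$ follows (assuming a model, tokenizer, and E_t as in the earlier sketches; the injection is done as in Section 3.2):

```python
# Sketch of adversarial accuracy (AA): fraction of typed tokens whose
# normalized [MASK] score increases once the TE is injected.
import torch

def adversarial_accuracy(model, tokenizer, E_t, typed_tokens):
    inputs = tokenizer(tokenizer.mask_token, return_tensors="pt")  # pr = [MASK]
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    ids = tokenizer.convert_tokens_to_ids(typed_tokens)

    def mask_probs(te):
        with torch.no_grad():
            emb = model.bert.embeddings.word_embeddings(inputs.input_ids)
            if te is not None:
                emb[0, mask_pos] += te
            logits = model(inputs_embeds=emb,
                           attention_mask=inputs.attention_mask).logits
        return logits[0, mask_pos].softmax(dim=-1)

    p_te, p_plain = mask_probs(E_t), mask_probs(None)  # M_t vs. the adversary
    return (p_te[ids] > p_plain[ids]).float().mean().item()
```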
Adversarial Sensitivity. We also expect that adding the TE should make tokens following the type more sensitive to the input TE. In other words, adding the TE in an MLM setting should cause these tokens to be more salient w.r.t. the input. To validate this hypothesis, we compare the sensitivity of a token w.r.t. the input in two models, with and without a TE. If the former is greater than the latter, then the model is more sensitive to the typed token.
More formally, given a model $M$, the output score of a token $x$ is $P^{(x)}_M(X_{[MASK]})$, where $X_{[MASK]}$ is the input embedding of the [MASK] token (other input tokens are omitted for brevity). With a first-order Taylor series expansion around the zero vector $0$, we obtain the sensitivity $S^{(x)}_M = \nabla_E \, P^{(x)}_M(X_{[MASK]} + E)\big|_{E=0} \cdot E_t$. $S_M$ is reminiscent of metrics used in the neural network pruning literature (LeCun et al., 1989; Molchanov et al., 2017). However, the metric is applied w.r.t. a vector rather than the usual scalar, and we do not take the absolute value of the metric, as we focus on comparing the sensitivities of models and not on measuring an absolute effect.
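The directional derivative above can be computed with a single backward pass. A minimal sketch under the same assumed setup:

```python
# Sketch of the sensitivity S_M: gradient of the token's [MASK]
# probability w.r.t. a perturbation E at E = 0, dotted with E_t.
import torch

def sensitivity(model, tokenizer, token_id, E_t, inject=True):
    inputs = tokenizer(tokenizer.mask_token, return_tensors="pt")  # pr = [MASK]
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    base = model.bert.embeddings.word_embeddings(inputs.input_ids).detach()
    delta = torch.zeros_like(E_t, requires_grad=True)   # perturbation at E = 0
    onehot = torch.zeros(base.shape[:2]).unsqueeze(-1)  # selects the [MASK] slot
    onehot[0, mask_pos] = 1.0
    shift = (E_t if inject else torch.zeros_like(E_t)) + delta
    logits = model(inputs_embeds=base + onehot * shift,
                   attention_mask=inputs.attention_mask).logits
    prob = logits[0, mask_pos].softmax(dim=-1)[token_id]
    (grad,) = torch.autograd.grad(prob, delta)
    return torch.dot(grad, E_t).item()  # directional derivative along E_t
```

AS then counts the typed tokens for which sensitivity(..., inject=True) exceeds sensitivity(..., inject=False).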
Finally, to test a TE, we compare the sensitivity to that of a standard BERT model. Similarly to AA, we define adversarial sensitivity as the number of positively typed tokens whose sensitivity increased after adding the TE over the number of positively typed tokens in a set $X_t^+$. More formally:

$$AS = \frac{\left|\{x \in X_t^+ : S^{(x)}_{M_t} > S^{(x)}_{M_\emptyset}\}\right|}{|X_t^+|}.$$

For both measures, we report results over a sample of 100 tokens, making sure that every one is an instance of type $t$ and that none of them has been used to derive the TE. We then compute the accuracy 10 times to get the mean and standard deviation. To make sure that any change in the scores is due only to the TE, we set $pr = [MASK]$. This simple prompt neglects any contextual information that might affect PLM tokens, thus ensuring that any change is due to the TE.
Results for mean and standard deviation are reported in Table 2 for both AA and AS. For AA, TEs perform well in promoting tokens respecting a certain type. We observe a lower score for type CITY, which is likely due to (a) the large cardinality of the CITY type, making it more difficult to model all required aspects of cities, and (b) the coincidence of some city tokens with people names, such as Morris, Salem, and Riley.
For AS, the TE has a small error margin. As we cannot expect token embeddings to capture all intricacies of a certain type, there are examples where the model fails the sensitivity test. Examples of failing tokens that did not show improvement in type sensitivity are Salvador and Blair for CITY, Cherokee and Romani for LANGUAGE, and general and vicar for OCCUPATION.

Layer-wise Classification
As TEs are added at the input of the model, we postulate that adding TEs should help BERT identify the types of input prompts more efficiently. For this, we train a layer-wise linear classifier on embeddings of input prompts, where positive instances are prompts belonging to a certain type $t$ and negative instances are prompts of other types (examples in Table 3). For each type, we sample 100 positive and negative instances from other LAMA datasets (negative instances are sampled randomly from the remaining types) and train a layer-wise linear classifier. We repeat each experiment 10 times and report the mean accuracy on a test set of the same size. Prompts appearing in the train set do not appear again in the test set. Results in Figure 4 show that adding the TE gives most layer classifiers an increase in F1-score. The highest increase is usually at a layer in the middle, in agreement with other work (Dalvi et al., 2022), possibly because this is where a type is formed (Geva et al., 2021; Jawahar et al., 2019). The highest increase is for LANGUAGE, likely due to the smaller cardinality of the type compared to CITY and ORGANIZATION. From these classifiers we obtain the CAVs needed for TCAV in the following section.
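A sketch of the probe follows (the prompt lists are placeholders; in the experiments they are sampled from the LAMA datasets of Table 3):

```python
# Sketch of Section 4.4: one linear classifier per layer over the
# mean-pooled hidden states of the input prompt.
import torch
from transformers import BertTokenizer, BertModel
from sklearn.linear_model import LogisticRegression

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
encoder = BertModel.from_pretrained("bert-base-cased",
                                    output_hidden_states=True).eval()

def layer_features(prompts):
    feats = []
    for p in prompts:
        with torch.no_grad():
            hs = encoder(**tokenizer(p, return_tensors="pt")).hidden_states
        feats.append(torch.stack([h.mean(dim=1).squeeze(0) for h in hs]))
    return torch.stack(feats)  # (N, layers + 1, d)

pos = ["The official language of Iitti is [MASK]."]  # type-t prompts
neg = ["Philip Glass plays [MASK]."]                 # other-type prompts
X, y = layer_features(pos + neg), [1] * len(pos) + [0] * len(neg)
for layer in range(X.shape[1]):
    clf = LogisticRegression(max_iter=1000).fit(X[:, layer].numpy(), y)
    # In the experiments, F1 is computed on a held-out test set.
```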

TCAV Sensitivity
A Concept Activation Vector (CAV) is a vector in the direction of the activation values of a concept's set of examples (Kim et al., 2018). For example, given images showing the concept of the red color (positive samples) and images without it (negative samples), a linear classifier is trained on the activations at each layer to separate positive and negative samples. The normal to the hyperplane separating the samples is the CAV. By using CAVs (with directional derivatives), one can measure the sensitivity of an input w.r.t. a concept by gauging the sensitivity of model predictions to changes in inputs towards the direction of the concept. Thus, given a set of datapoints representing a certain concept, Testing with CAVs (TCAV) provides a means to compute the model's conceptual sensitivity across the input (Kim et al., 2018). As a final analysis measure, we posit that a model equipped with a TE should have higher TCAV values across layers. For this, we compute layer-wise TCAV using the CAVs from Section 4.4. Figure 5 shows the TCAV values for types CITY and LANGUAGE, comparing a vanilla BERT model (k=0) and one equipped with a TE (k>0) for the last 4 layers. As TCAV computes the model's conceptual sensitivity across a set of inputs, we observe that with the right TE, the importance of the type becomes more salient, i.e., the sensitivity of model predictions w.r.t. types, such as CITY at a certain layer, increases for a prompt and a TE associated with that type.
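A sketch of the layer-wise score follows; the CAV is assumed to be the (unit-normalized) weight vector of the layer's linear probe from Section 4.4, and the top logit at [MASK] serves as the prediction score:

```python
# Sketch of Section 4.5: TCAV as the fraction of prompts whose
# prediction gradient at a layer points along the CAV.
import torch

def tcav_score(model, tokenizer, prompts, cav, layer):
    hits = 0
    for p in prompts:
        model.zero_grad()
        inputs = tokenizer(p, return_tensors="pt")
        mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
        out = model(**inputs, output_hidden_states=True)
        h = out.hidden_states[layer]
        h.retain_grad()                           # keep the gradient at this layer
        out.logits[0, mask_pos].max().backward()  # top prediction's score
        hits += int(torch.dot(h.grad[0, mask_pos], cav) > 0)
    return hits / len(prompts)
```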

Experiments
The LAMA benchmark (Petroni et al., 2019) contains cloze statements to test PLMs' factual knowledge. First, we apply TEs to BERT and show an increase in precision for most datasets (Section 5.1).
We then enforce a change in the output with TEs (Section 5.2). Finally, we show the impact of the tokens that encode the TE (Section 5.3).

LAMA
We focus on the GRE and TREx datasets (ElSahar et al., 2018), grouped by expected output type (examples in Table 3); TEs are computed by weighted sampling from KG entities (10, by default). To tune the λ value of a TE, we use a hold-out dataset of 5% for each dataset and choose the value that maximizes precision. We report results on a BERT BASE CASED model. Further experiments with other PLMs show similar trends (results in Table A2 in the Appendix).

Intrinsic Evaluation. We compare BERT with TE (BTE) against standard BERT (B). As we assume that the user knows the desired output type, we also report results for a baseline BERT + Token Type (BTo), which adds the expected type label (e.g., "the year") before the [MASK] token. We also report on a baseline PostTE, which uses the TE at the output for re-ranking: the initial output score is added to the cosine similarity between the token embedding and the type embedding, controlled by a hyperparameter that adjusts the importance of the similarity score.
We choose the range of the hyperparameter to vary from 0 to 30, as in similar work on natural language generation (Pascual et al., 2021). We also tested another baseline where we add the tokens used to derive the TE before the [MASK] token, as a signal of the desired type (Shin et al., 2020), but the results are lower than BTo. Aggregated (macro) precision@k (P@k) results over all datasets are reported in Table 4 (full results in Table A3 in the Appendix). On average, our proposal clearly improves the results. We see improvements across most of the types using TEs. However, we do observe a reduction of precision for a few types, with the main reason being the greedy selection of a non-optimal value of λ. For type MANUFACTURER, setting λ = 1 (rather than λ = 2) improves the results. For type SPECIALIZATION, while desired outputs such as mathematics and physics do exist in the KG samples, other nodes in the KG, such as teenager, Greek, and Sir, have greater node degree and thus got selected in the sample for obtaining the TE. For the GROUP data, the tuned value of λ for the TE was 0, meaning that adding the TE would hurt performance. Analyzing the predictions, we believe this is due to the bias in the TE imposed by the KG: most samples are related to sports groups (such as FIFA, UEFA, and CONCACAF), thus producing a TE biased towards sports groups, which negatively impacts the predictions. We discuss other sampling methods in Section 5.3. Finally, the YEAR dataset shows lower performance.
We believe this is due to BERT's inability to precisely capture numeracy (Wallace et al., 2019).

Compared to PostTE, our method of using the TE at the input produces better results, as using the TE at the output does not allow for the fusion of factual and type knowledge in the model. PostTE does push typed tokens to higher rankings (also indicating the effectiveness of TEs in modeling type), but adding TEs to the input is better in terms of performance. Adding TEs to the input is also more universal: the output is usually controlled by the experiment type (binary classification, MLM, NLG, ...), which might not always make it clear how to insert the TE, whereas the input is always fixed. One thing to note is that, with PostTE, 22 of the 38 datasets used had an optimal value of λ of zero, meaning that for most datasets it did not improve the results, as opposed to our method, for which only 5 of the 38 datasets had an optimal λ = 0.

Extrinsic Evaluation. We evaluate our model against two supervised baselines. The first one, LPAQA (Jiang et al., 2020), uses mining-based methods to identify possible prompts for a given relation. The second baseline, OptiPrompt (Zhong et al., 2021), searches for real-valued input vectors that maximize the likelihood of the gold label on the training set using a gradient-based search algorithm. Results in Table 5 show that our approach does better with fewer prompts, as LPAQA requires at least 10X more prompts per example. For OptiPrompt, the supervised approach produces better results than our unsupervised method. However, the approach requires training data, which is not always available. In fact, the authors of the paper use only TREx relations, as they can query the KG for more data, which is not the case for the Google-RE datasets. Also, as the method uses 1000 data points for training, the authors had to rely on another KG to gather more samples. Our approach requires only 10 tokens per type. Finally, while training enhances performance, it also encodes certain regularities that models could exploit, such as being prone to over-predicting the majority class label, as reported for OptiPrompt, unlike our approach, which keeps the model parameters intact.
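For reference, the PostTE baseline discussed above reduces to a one-line re-ranking of the output distribution; a minimal sketch (mu playing the role of the output-side hyperparameter):

```python
# Sketch of the PostTE baseline: re-rank [MASK] scores by adding the
# cosine similarity between each vocabulary embedding and the TE.
import torch

def post_te_rerank(mask_logits, word_embeddings, E_t, mu=1.0, k=10):
    scores = mask_logits.softmax(dim=-1)               # (V,) output scores
    sims = torch.nn.functional.cosine_similarity(
        word_embeddings, E_t.unsqueeze(0), dim=-1)     # (V,) type similarity
    return (scores + mu * sims).topk(k).indices
```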

Switching Types in Prompts
The LAMA authors provide manually written prompts that adhere to the desired type. For example, to get the PLACE OF BIRTH (PoB) of a person, they use the prompt "[X] was born in [Y].", while for the DATE OF BIRTH (DoB) of a person they use the prompt "[X] (born [Y])". These prompts follow from how sentences about date and place of birth are written in Wikipedia pages. In this experiment, we ponder whether TEs can enforce a different type given one of these two prompt structures. We use DoB prompts with the expected outcomes of PoB, where the goal is to steer the type of the output to a different type. For example, given "Barack (born [MASK])" (prompt for DoB), we set as expected output "Honolulu" (the PoB answer). We remove examples for which the expected output is not in the BERT vocabulary and are left with 1139 prompts.
We then add the TE for CITY during inference. The results are shown in Table 6. As expected, without any TE, the precision score is zero, as the output type is heavily influenced by the prompt. Adding $E_{city}$ to the input steers the model to change type, and it outputs cities. However, the scores are still lower than those of PoB prompts. Since the prompt is biased towards a certain type, better results can be obtained by removing the projection of the year information onto the city TE. Our optimized TE is then $E_{city} - \frac{\|E_{city}\|_2}{\|E_{year}\|_2} E_{year}$, which indeed shows improved results in Table 6.

Token Sampling
We study the impact of how tokens for TEs are sampled by (i) changing the sampling method, and (ii) varying the number of tokens used.
Sampling Methods. We evaluate our default strategy against alternative ways of obtaining tokens: (i) weighted sampling with node degrees as weights (BTE), (ii) using the Top-10 tokens w.r.t. node degree (Top10), (iii) using the Bottom-10 tokens (Bot10), and (iv) sampling uniformly without relying on node degree (Unif). We repeat the experiment of Section 5.1 with every sampling strategy and show the results in Table 7. More detailed results are in Table A4 in the Appendix.
We observe that Top10 and weighted sampling obtain comparable performance. While Top10 gets better results for COUNTRY, ORGANIZATION, and GENRE, other types such as YEAR, SPECIALIZATION, and MANUFACTURER show lower precision because of the bias coming from the most popular KG samples. For example, Top10 samples only years in the 21st century, specializations related to titles (duke, Sultan, and Sir, rather than mathematics and physics), and it is biased towards car manufacturers (Fiat and Honda). Weighted sampling reduces such bias. For FOOTBALL POSITION, Unif does better, as it has more variety in the sample, with more tokens related to American football positions (quarterback and guard) rather than soccer positions only (goalkeeper and midfielder).
In some cases, the bias in the KG reflects the bias in the test data. For OCCUPATION, the TE using Top10 does encode some bias, as most tokens are related to artistic positions (musician, actor), but this improves the results, as the same bias also occurs among the expected outputs.
Varying Size of Samples. To study the effect of the number of tokens used in deriving the TE, we repeat the experiment of Section 5.1 while varying the number of tokens n. Results are reported in Table 8. We observe that results peak between 10 and 20 samples, but even a small number of samples significantly improves the results compared to the original BERT without TE (n=0).

Conclusion and Future Work
We have introduced TEs as additional input for PLMs to better encode type information, proposed methods to analyze TEs, and tested them on the LAMA dataset. While initial results are promising, we identify two directions of research.
More Precise Type Embeddings. Further analysis of the examples can lead to better TEs. One direction is to also use negative samples to compute the TE. This implies learning a vector that separates between samples, as CAVs do. However, adding negative samples can bring more bias into the TE. This could be alleviated by performing statistical hypothesis testing, as with CAVs (Kim et al., 2018). Another way to improve the effectiveness of our proposal is to combine vectors. Assuming a taxonomy of the types, different TEs can be combined, for example by subtracting from the one at hand, say PERSON, those that are not super- or subtypes, such as CITY and YEAR, as discussed for DoB in Table 6.
From Types to Concepts. While we focus on types and TEs, our approach can be extended to more generic concepts, as long as their tokens are in the PLM's vocabulary. This could help alleviate the stereotypical and toxic content found in PLMs (Ousidhoum et al., 2021). To test this idea, we report an example for the task of natural language generation, where we "de-toxify" text generated by an autoregressive language model. We use a distilled GPT-2 model (Radford et al., 2019) and the RealToxicityPrompts dataset, which contains 100K sentence-level prompts derived from a corpus of English text (Gehman et al., 2020). We feed 10K samples to the model, thus producing the generated texts. We then measure the toxicity of such texts with the Perspective API. We consider a text toxic if the toxicity probability returned by the API is >0.5 and obtain 460 toxic prompts. We then compute a "toxicity concept embedding" using 6 manually picked tokens that convey toxicity. To de-toxify the generated text, we set the multiplier λ to negative values. Instead of adding the embedding to the [MASK] token only, we found better results when adding it to all tokens in the prompt. We believe that adding the TE to all tokens helps to 'preserve' type information along the lengthy generation procedure, as opposed to MLM, which decodes one token. We also test a sample of non-toxic prompts (of the same size as the toxic prompts) to show the effect of our embedding. In addition to toxicity, we measure fluency (the perplexity of generated continuations according to a larger PLM) and diversity (the mean number of distinct uni-/bi-/tri-grams, normalized by the length of the text for each prompt), as in other works on text generation (Liu et al., 2021). Results in Table 9 show a large reduction in the toxicity probability with λ = -1, with higher diversity but slightly lower fluency for the toxic prompts. Setting λ = -2 decreases the toxicity probability further, but at the expense of fluency. For the non-toxic prompts, the toxicity results are nearly the same, with minor differences in fluency and diversity. Considering that a "concept vector" steers the generation of the PLM without any form of fine-tuning, it is promising to study the use of "plug-and-play" concept vectors. Examples are reported in Table B1 in the Appendix.
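A minimal sketch of this NLG variant follows (greedy decoding for brevity; the concept embedding E_c is assumed to be built from the toxic seed tokens analogously to Section 3.1, over GPT-2's own embedding matrix lm.transformer.wte.weight, with a negative λ):

```python
# Sketch: add a concept embedding to all prompt tokens of distilGPT-2
# at every decoding step (greedy decoding for simplicity).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("distilgpt2")
lm = AutoModelForCausalLM.from_pretrained("distilgpt2").eval()

def generate_with_concept(prompt, E_c, max_new_tokens=30):
    ids = tok(prompt, return_tensors="pt").input_ids
    prompt_len = ids.shape[1]
    with torch.no_grad():
        for _ in range(max_new_tokens):
            emb = lm.transformer.wte(ids)   # token embeddings; positions
            emb[:, :prompt_len] += E_c      # are added internally by GPT-2
            next_id = lm(inputs_embeds=emb).logits[0, -1].argmax()
            ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    return tok.decode(ids[0])
```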

Limitations
Encoding types requires a set of tokens and their embeddings. As we turn to PLMs, we are restricted by the tokens in their vocabulary, which limits the number of possible types for TEs. In addition, while we use TEs for a factual dataset, the TE encodes only type information and no factual information. While results improve for LAMA with TEs, the interaction between the type information and the factual knowledge of the PLM is not understood. Finally, one cannot decide on a clear sampling method to use for computing the TEs (assuming the existence of a knowledge source such as a KG): the best sampling is heavily dependent on the distribution of the gold labels in the test dataset.

A LAMA
Dataset statistics are reported in Table A1. Detailed results on the datasets are reported in Table A3. A full inference run on all LAMA datasets takes on average approximately 5 minutes on Google Colab with a Tesla P100 and a batch size of 32. We vary λ from 0 to 5. We repeat the experiment of Section 5.1 with every sampling strategy and report the results in Table A4.

Figure 2: Input representation for a PLM. The YEAR type embedding (green box) is added to the [MASK] token.

Figure 3: Distribution of the mean of the singular vectors across different types. We report the singular vectors with the top-4 singular values. The distribution of $v^{(1)}$ has the highest kurtosis.

Figure 4: F1 scores of three classifiers trained and tested on layer-wise embeddings of the CITY (Ci), LANGUAGE (L), and ORGANIZATION (Org) datasets.

Table 2: Mean and standard deviation (in parentheses) of AA and AS for different types (k = 1).

Table 3: Examples of LAMA datasets grouped by output types COUNTRY (top) and CITY (bottom).

Table 5: Mean precision over all LAMA datasets compared to extrinsic baselines. Unsupervised BTE outperforms LPAQA, which uses supervised learning. Supervised OptiPrompt obtains higher precision, as it searches for prompts in the embedding space.

Table 6: Precision in predicting PoB (place of birth) for DoB (date of birth) prompts by adding the CITY TE (k = 5). Results with the TE are comparable to the PoB prompt.

Table 7: Mean over all datasets for every sampling method.

Table 8: Average precision over the datasets while varying the number of samples n used to compute the TE.

Table 9: Results of detoxifying texts generated from a distilled GPT-2 model. λ indicates the value of the multiplier of the TE (λ = 0 for the original PLM).
Recoverable excerpt of Table A1 for the LANGUAGE type (relation, #facts, example prompt; the first row's relation ID and count did not survive extraction):
— | — | The Pirate Bay was written in [MASK].
P103 | 954 | The native language of Jan Davidsz. de Heem is [MASK].
P1412 | 921 | Leone Caetani used to communicate in [MASK].
P37 | 707 | The official language of Iitti is [MASK].
P364 | 780 | The original language of Do Phool is [MASK].

Table A1: LAMA datasets grouped by type. Each dataset belongs to the TREx dataset, unless otherwise stated by (GRE).