Putting Words in BERT’s Mouth: Navigating Contextualized Vector Spaces with Pseudowords

We present a method for exploring regions around individual points in a contextualized vector space (particularly, BERT space), as a way to investigate how these regions correspond to word senses. By inducing a contextualized “pseudoword” vector as a stand-in for a static embedding in the input layer, and then performing masked prediction of a word in the sentence, we are able to investigate the geometry of the BERT-space in a controlled manner around individual instances. Using our method on a set of carefully constructed sentences targeting highly ambiguous English words, we find substantial regularity in the contextualized space, with regions that correspond to distinct word senses; but between these regions there are occasionally “sense voids”—regions that do not correspond to any intelligible sense.


Introduction
Vector spaces defined over static word vectors are somewhat interpretable, as the points are limited to the vocabulary. Contextualized representations (CRs), by contrast, are mysterious: there is an unbounded number of distinct contextualized embeddings, and no obvious way to discover the word and context that would correspond to an arbitrary point in the space. Attempts have been made to characterize the information captured in contextualized representations (Rogers et al., 2020; Liu et al., 2019), but some of the techniques used (e.g., probing classifiers) have been criticized for their indirectness.
We propose a new technique called Masked Pseudoword Probing (MaPP) that allows controlled exploration of the space of a contextualized masked LM (specifically, English BERT; Devlin et al., 2019). MaPP takes advantage of the static embedding at the first layer of BERT and "hallucinates" new embeddings into this space to correspond to tokens' contextualized representations. By extending BERT's vocabulary with these pseudowords, we can use them as inputs for masked prediction of words in the sentence. The words predicted in the masked slot serve as evidence of the meaning of the pseudoword in question: for example, it may encapsulate a specific sense consistent with the original context. We can also transplant a pseudoword into new contexts to see if it generalizes as per our intuitions about word meanings.

[Figure 1: Illustration of the MaPP method as used in the specialization experiments (§5.2). In other experiments, the pseudoword is perturbed prior to step 3.]
We focus on the contextualized meanings of ambiguous verbs and prepositions, which serve as ideal test cases, as they may often be disambiguated by their objects. For example, if a pseudoword interpreted as "in" induces a distribution over its argument slot that gives most of the probability mass to locations such as "London" or "Paris", we may conclude that the pseudoword has a locative sense (table 1). We first ask: How well do pseudowords inferred from contextualized representations in BERT accord with linguistic expectations about word senses? We investigate this by deploying MaPP for sentences whose masked slot reveals the sense of the ambiguous word. Second, we ask: Is the BERT-space semantically smooth, such that arbitrary points near a pseudoword behave in semantically similar ways? We study this by navigating the space around a pseudoword (or between two pseudowords), and examining the vector's behavior via MaPP. This method allows for investigating the geometry of a contextualized representation by traveling through the BERT-space in a continuous way and exploring different regions, which we show (see §5) to correspond to distinct concepts.
Our experiments indicate a substantial regularity in the BERT-space. We see regions in the space that correspond to distinct senses. These regions can be recovered using our technique; for example, by sampling points around a pseudoword and looking at the points in the BERT-space which decode to it. Moreover, we see that between sense-regions there are often "voids" in the space that do not correspond to any intelligible sense.

Analyzing Contextual Representations
Probing representations. The information encoded in contextualized representations like BERT has been widely investigated in recent NLP research. Probing methods use CRs as inputs to probing classifiers to see how well the CRs may serve as features in predicting specific properties. The intuition is that if the CR can be used to predict a specific property, then knowledge about it is encoded in the representation. Recent classifier-based probes have focused on various linguistic properties such as morphology, parts of speech, sentence length, and syntactic and semantic relations (Liu et al., 2019; Conneau et al., 2018; Belinkov et al., 2017; Adi et al., 2016, inter alia). Closely related to ours is work that studied the extent to which the lexical semantic classes of nouns are disambiguated by CRs (Zhao et al., 2020), showing that BERT fares well in this respect.
Beyond classifier-based probes, other approaches have also been explored, such as information-theoretic probing (Pimentel et al., 2020; Voita and Titov, 2020), and structural probing (Hewitt and Manning, 2019), which evaluates whether syntax trees are embedded in a linear transformation of a CR's word representation space.
An alternative approach to probing learned representations is directly analyzing the attention weights and activation patterns (Brunner et al., 2020; Abnar and Zuidema, 2020). Criticism of some instances of this approach is found in Jain and Wallace (2019), who claimed that attention weights are less transparent than is often stipulated.
The shortcomings of probing. Some recent work has taken a more critical view of probing techniques (Belinkov, 2021). Elazar et al. (2021) argue that while probing methods might show that certain linguistic properties exist in a representation, they do not reveal whether and how this information is actually used by the model. This could be due to the disconnect between the representation itself and the probing model. Relying on classifiers to interpret representations might be problematic; they add additional confounds to the interpretability of the results, and different representations may need different classifiers (Wu et al., 2020; Zhou and Srikumar, 2021).
Another critique concerns the difference between correlation and causation (Feder et al., 2021): classifier-based probes may rely on shallow correlations in the training set, thus reflecting data artifacts that are irrelevant to the studied distinction.
Word Sense Disambiguation. Word Sense Disambiguation (WSD) aims at making explicit the semantics of a word in context, typically by identifying the most suitable meaning from a predefined sense inventory (Bevilacqua et al., 2021). Disambiguation can also be defined indirectly, through minimal pairs that contrast two senses of a word (Trott and Bergen, 2021) or through another word in the text that determines the semantic class of the word in question (Jiang and Riloff, 2021). Our work bears on this line of work as well: we are using MaPP to test whether the masked prediction indicates that the pseudoword encodes the expected sense. However, we are using carefully controlled sentences so it remains to be seen whether pseudowords can be induced to capture word senses "in the wild".
The geometry of BERT. Understanding the geometry of the BERT-space is not easy. Some attempts in this direction have been made (Coenen et al., 2019; Ethayarajh, 2019; Michael et al., 2020; Mickus et al., 2020; Xypolopoulos et al., 2021; Garí Soler and Apidianaki, 2020), but a more thorough investigation is lacking. As opposed to predictive methods such as probing, descriptive methods that rely on geometric features of the space analyze the information in CRs directly. This paper takes a different approach that views BERT as a function that is defined over a continuous space. Our proposed methodology thus allows for a more direct inspection of "gaps" between embedded tokens, that does not require an auxiliary probe and probing dataset, and instead investigates the model's behavior on arbitrary points in the input space. The paper focuses then on the interpretation of individual points, as opposed to other related work on the geometry of BERT (Coenen et al., 2019; Ethayarajh, 2019; Cai et al., 2021; Hernandez and Andreas, 2021), which mostly considers higher-level properties of the BERT space.
A naïve geometric approach to investigating the information in BERT could be to look at neighborhoods of contextualized embeddings. A vector in this space represents some word within a sentence; it lies in R^d, with d = 768. However, it is unclear how such neighborhoods should be defined.
Of course, it is possible to define a discrete neighborhood comprising contextualized embeddings close to the vector; these may represent the same sense of the same word. Still, in terms of the geometry of the space, how should we interpret a continuous neighborhood in the output space? While we could force a non-discrete outlook by generating vectors artificially (e.g., by generating points that are epsilon away from a given point), these artificial contextualized vectors are disembodied, with no obvious linguistic basis. It is therefore unclear how to interpret these artificial vectors (or the linguistic properties of tokens they might encode).

Traversing CR Spaces: A New Probing Methodology
Our motivating question is: Are word senses encoded in BERT's representations-and if so, how? As a test case, we look at highly ambiguous words, as they potentially offer complex geometric configurations of senses in the BERT space.

Masked Pseudoword Probing
We propose a novel probing technique, MaPP (Masked Pseudoword Probing), which "hallucinates" vectors to reconstruct a token's contextualized representation. MaPP allows us to "navigate" the BERT-space by looking at neighborhoods of certain word vectors in what we term the input space, an extension of the discrete space of BERT's static (decontextualized) word embeddings. By inducing and manipulating new vectors in the input space (henceforth, pseudowords), we can observe the effects on BERT's behavior via masked prediction. Mathematically, pseudowords are inverse images under the BERT function, continued from the finite space of word embeddings that BERT generally receives, which in the standard implementation contains 30k points in R^d, one per entry in BERT's vocabulary.[2] We are interested in the contextualized representation of an ambiguous word, which we call the focus token t in a sentence s.
Let z_t be the static embedding of t (i.e., the input embedding BERT receives for t), and let x_t be its contextualized representation in s. Let d be the dimension of the input embeddings (without the positional embedding).
To apply our method, we stipulate that there is a specific token called the cue token, in the j-th position in s, that disambiguates t. Under this assumption, MaPP can discover the sense of a vector x in the vicinity of x_t by masking the cue token and decoding the distribution of its fillers. These fillers serve as a proxy for x's sense.
MaPP operates in two steps, described next.

1) Pseudoword Representation. First, we find an embedding z* that best reconstructs x_t when input to BERT in place of t. Formally:

    z* = argmin_{z ∈ R^d} ||BERT(z) − x_t||²,    (1)

where BERT(z) is a forward pass of the model with the vector z replacing z_t. The solution z* is a pseudoword. The original embedding z_t of t is a solution to eq. (1). But BERT is not invertible, and there is no reason for z* and z_t to be close. Indeed, our experiments show that z* is different from t's input embedding,[3] and suggest that z* is a "disambiguated" counterpart of the focus token.
We can approximate z* using standard optimization techniques. When solving for z*, we hold BERT's parameters fixed and seek to identify the input embedding z. In standard BERT training, by contrast, the input is known, and we solve for BERT's parameters.

2) Pseudoword-Guided Prediction. After computing z*, we define a new sentence s′, identical to s except for the j-th position (the disambiguating position), where we place a mask. For example, in a sentence like "The event is in [MASK].", the focus token t is the ambiguous "in"; the cue tokens "London" and "September" would indicate locative and temporal senses of "in", respectively. Next, we replace the input embedding of t with z* or another input-space vector z in its vicinity, and predict the distribution of the slot fillers in the masked position.[4]

[2] The term "pseudoword" is also used in psycholinguistics, but refers to a different concept (Gale et al., 1992; Schütze, 1998; Shoemark et al., 2019).
[3] We looked at the distribution of the distances (Euclidean and cosine) between all of the pseudowords and their corresponding static embeddings in the input space.
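As an illustration of step 1, the objective in eq. (1) can be approximated by gradient descent on the input vector alone, with the model frozen. The sketch below is a minimal toy, not the paper's code: `contextualizer` is a hypothetical stand-in for BERT's forward pass at the focus position.

```python
import torch

torch.manual_seed(0)
d = 16  # toy dimension (BERT uses d = 768)

# Frozen stand-in for BERT's forward pass at the focus position.
contextualizer = torch.nn.Sequential(
    torch.nn.Linear(d, d), torch.nn.Tanh(), torch.nn.Linear(d, d))
for p in contextualizer.parameters():
    p.requires_grad_(False)

x_t = torch.randn(d)                    # target contextualized vector
z = torch.randn(d, requires_grad=True)  # candidate pseudoword z

opt = torch.optim.Adam([z], lr=0.05)
initial_loss = (torch.norm(contextualizer(z) - x_t) ** 2).item()
for _ in range(500):
    opt.zero_grad()
    loss = torch.norm(contextualizer(z) - x_t) ** 2  # objective of eq. (1)
    loss.backward()
    opt.step()
final_loss = loss.item()
print(initial_loss, final_loss)  # the reconstruction loss decreases
```

With the real model, `contextualizer` would be BERT run on the full sentence with z spliced into the focus position, but the optimization loop is the same.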

Pseudowords as Input Vectors.
Let us denote the standard input space for the token at t's position (i.e., input embeddings of BERT's vocabulary) with I_static ⊂ R^d (where |I_static| = 30k). By extending BERT's inputs to pseudowords, we are performing what is known in mathematical analysis as a continuation of BERT's function from the discrete input space I_static ⊂ R^d to continuous regions in R^d. This approach allows us to gain insight into the semantics encoded by different regions of the BERT space. Construing BERT as a continuous function also allows us to invert it, and obtain a point z* in the inverse image by solving an optimization problem. We note that viewing the BERT space as a continuous space, e.g., for purposes of mapping between it and other continuous spaces, is an increasingly common practice (Schuster et al., 2019; Gauthier and Levy, 2019); see further discussion in appendix A.4.
In our experiments (§5), the pseudowords will help us explore the geometry of the BERT-space by traveling across it in a "continuous" way, something that is not possible to do with the BERT vectors as discussed in §2. For example, we can study how perturbations of z* (the pseudowords) affect the prediction of the cue word.

Experimental Research Questions & The MaPP Dataset
The main hypothesis we study in this paper is: There are regular "nicely defined" regions in the BERT-space around words that correspond to distinct senses.
Such regions may be variously interpreted: around a point with a particular sense, there is a ball which contains mostly points corresponding to that sense; or, for example, points that correspond to the same sense lie on a high-dimensional manifold. Cases that we consider not "nicely defined" are, for example, points corresponding to different senses that are scattered in the space in an inseparable way (or at least inseparable by simple functions). We would like to be able to map the semantic concept of sense to the geometric properties of the BERT-space. However, since little is known about the space, fully characterizing its geometry is out of scope for this work.
We present MaPP to study this hypothesis, and in doing so introduce the concept of pseudowords. This concept opens additional research questions.

Specialization. Let z* ∈ R^d be a pseudoword obtained by solving eq. (1) for a sentence s with a focus token t and cue token at position j, holding a sense η. Does z* yield a sense distribution (determined by its slot fillers in the j-th position) that concentrates on η? That is, does a pseudoword decode to a specific sense of the focus token?

Generalization. Is it possible to transplant a pseudoword into a sentence where the context around the focus token is different, and still obtain coherent results? For example, in the sentences (a) "The pan is for cooking." and (b) "The fork is for eating.", the focus token "for" is the same, and in both cases has a PURPOSE meaning. The context, however, is different. If we transplant the pseudoword for "for" induced with sentence (a) to the position of "for" in a masked version of sentence (b) (i.e., "The fork is for [MASK]."), will we get a coherent prediction with the same sense? Or is the pseudoword obtained from one sentence limited to a specific context?
If pseudowords obtained for one sentence do not generalize to another, we propose to induce a "generalized pseudoword" trained over multiple examples with the same sense.

The MaPP Dataset
To answer the questions listed above, we manually compiled the MaPP Dataset, a controlled dataset with short sentences, designed to avoid confounds that may introduce difficulties in interpreting the results.
Each sentence contains an ambiguous word that is fully disambiguated by a specific slot in the sentence. E.g., in the sentence "The book is for reading", the ambiguous word "for" has a PURPOSE sense, strongly signaled by "reading". All sentences were reviewed by a linguist to maximize naturalness and minimize ambiguity.
The dataset consists of 3 portions, each used in different experiments. We describe each portion adjacent to the relevant experiment.
Relational words as a test case. We chose to focus our analysis on the ambiguity of relational words in English, specifically prepositions and verbs. Relational words present an interesting test case: many are highly ambiguous and encode basic semantic distinctions, such as space, time and manner (Schneider et al., 2018). We do not attempt to cover all possible senses of the selected words; instead, we have constructed our dataset to illustrate just a few clear contrasts (see further discussion in appendix A.4).

Experiments
We use MaPP to empirically evaluate the hypotheses listed above.
We conduct four types of experiments. First, we test specialization, the extent to which the induced pseudoword z* can be viewed as a sense-disambiguated version of the focus token's input embedding. Second, we venture into the immediate regions around z* by perturbing it, and examining how this affects the resulting senses. We thereby gain insight into the regularity of the region around z* with respect to the focus token's sense. Third, we examine the sense regularity of the BERT CRs by examining the senses encountered when traversing the line between two pseudowords corresponding to different senses (e.g., the locative and temporal senses of "in"). Finally, we test generalization, namely, the extent to which z* can serve as a sense-disambiguated embedding of the focus word in different contexts.

Table 2: Queries and top 5 predictions for the static embedding z versus the pseudoword z*.

Query                      | Input | Top 5 predictions
The dinner is on Monday.   | z     | fire, offer, sale, Friday, hold
                           | z*    | Sunday, Saturday, Thursday, Tuesday, Friday
The clip is about a queen. | z     | minute, year, second, day, week
                           | z*    | woman, girl, man, child, boy
To solve for z* (eq. (1)), we add a new token to the vocabulary (#TOKEN#), which corresponds to the focus token. When backpropagating the gradients, we ensure that the gradients of all parameters of BERT are zero, except the token embedding of #TOKEN#. In this way, we preserve the original BERT model while enabling us to solve for z*. We use standard gradient-based optimization: since we are solving for the input to BERT rather than model parameters, we backpropagate through the network, holding BERT's parameters fixed, and take gradients with respect to z. We use 5 random initializations and select the z* with the lowest loss.
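Concretely, restricting the update to the new #TOKEN# row can be implemented with a gradient hook that zeroes every other embedding row, combined with random restarts. This is a hedged toy sketch, not the paper's implementation: the frozen linear `proj` stands in for BERT's forward pass, and all names are illustrative.

```python
import torch

torch.manual_seed(0)
vocab, d = 8, 16
emb = torch.nn.Embedding(vocab + 1, d)   # extra row = #TOKEN#
original_rows = emb.weight[:vocab].detach().clone()

# Zero the gradient of every embedding row except the new token's.
def mask_grad(grad):
    mask = torch.zeros_like(grad)
    mask[vocab] = 1.0
    return grad * mask
emb.weight.register_hook(mask_grad)

proj = torch.nn.Linear(d, d)             # frozen stand-in for BERT
for p in proj.parameters():
    p.requires_grad_(False)
x_t = torch.randn(d)                     # target contextualized vector

best_loss, best_z = float("inf"), None
for restart in range(5):                 # 5 random initializations
    with torch.no_grad():
        emb.weight[vocab] = torch.randn(d)
    opt = torch.optim.Adam([emb.weight], lr=0.05)
    for _ in range(300):
        opt.zero_grad()
        z = emb(torch.tensor(vocab))
        loss = torch.norm(proj(z) - x_t) ** 2
        loss.backward()
        opt.step()
    if loss.item() < best_loss:
        best_loss, best_z = loss.item(), z.detach().clone()

# The rest of the vocabulary is untouched by the optimization.
print(best_loss, torch.allclose(emb.weight[:vocab], original_rows))
```

The hook makes the masking explicit; an equivalent choice is to keep the new row as a separate `nn.Parameter` and pass only it to the optimizer.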

Specialization Experiments
We test whether we can control the sense of BERT's predicted tokens using a pseudoword z*. If this is indeed the case, it supports our view of z* as a disambiguated version of its corresponding static token embedding. In this experiment, we designate highly ambiguous words, specifically verbs like "have" and prepositions like "in", as the focus tokens and apply the process described in §3.1 on these ambiguous words.

Evaluation. For each sentence we perform masked prediction and manually categorize which of the top 5 predicted words are consistent with the original sense. From these judgments we compute the accuracy: the proportion of sense-congruent predictions.[5] In total, we evaluate 470 predictions.

Results. Table 3 shows the performance of MaPP versus a Vanilla BERT baseline (using the static embeddings rather than pseudowords for masked prediction). We see that in most cases, by applying MaPP, we shift the prediction of the model to the desired sense, which establishes the validity of our technique. Further, table 4 shows that typically, after applying MaPP, the model's top prediction is not the word that was masked in the original sentence; i.e., the pseudoword is not simply memorizing a specific cue word. Table 2 illustrates two examples exhibiting a clear shift to the desired sense. Not every pseudoword behaves as expected, though, as is discussed in subsequent experiments.

ε-Perturbation
Our goal in this experiment is to "travel" in the BERT-space. Since it is not clear how to interpret a direct perturbation of the contextualized vectors, we do this via the input space. We compute a pseudoword and perturb it to obtain new points in an ε-ball around it, as schematized in figure 2. Given a pseudoword z* for a particular token in context, normalize it to a unit vector. Choose 10 random directions w by sampling uniformly from the unit sphere. For each direction w and perturbation distance ε, find a vector w′ that is ε away (in cosine distance) from z* in the direction w (i.e., w′ is on the intersection between the plane spanned by z* and w, and the unit sphere). We do this for several values of ε (see below). Each perturbed z* is fed back into the model and used for masked prediction.

[5] We also conducted a random baseline experiment, with randomly sampled vectors from R^d instead of the pseudoword. The accuracy was negligible.
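The geometric step above can be sketched directly. Here `perturb` is an illustrative helper (not the paper's code) that returns a unit vector at cosine distance ε from z* in the plane spanned by z* and a random direction:

```python
import numpy as np

def perturb(z_star, eps, rng):
    """Unit vector at cosine distance eps from z_star, in the plane
    spanned by z_star and a random direction w."""
    z = z_star / np.linalg.norm(z_star)
    w = rng.standard_normal(z.shape)
    w -= (w @ z) * z                 # component orthogonal to z
    w /= np.linalg.norm(w)
    theta = np.arccos(1.0 - eps)     # cosine similarity = 1 - eps
    return np.cos(theta) * z + np.sin(theta) * w

rng = np.random.default_rng(0)
z_star = rng.standard_normal(768)
for eps in [0.2, 0.6, 1.8]:          # eps may exceed 1 (cosine < 0)
    w_prime = perturb(z_star, eps, rng)
    z_unit = z_star / np.linalg.norm(z_star)
    print(eps, round(1.0 - w_prime @ z_unit, 6))  # recovers eps
```

Each returned vector would then replace the focus token's input embedding for masked prediction.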
Data. In this experiment we use the Basic Portion of the MaPP Dataset, as described in §5.2.
Evaluation. For ε ∈ {0, 0.2, ..., 1.8} and each direction, we examine the model's top 5 predictions for the masked token to determine which are consistent with the original sense. We measure the average accuracy over the 10 directions for each ε.
Results. The fraction of predicted words consistent with the original sense decreases gradually as the amount of perturbation ε increases (figure 3 and numeric results in appendix A.4). This matches our hypothesis that there is regularity in the BERT space, and that it is carved into regions which correspond to distinct senses. Outside of these regions (where ε is large), we occasionally encounter sense voids: regions where there is no intelligible sense compatible with the context. For example, with the query "The event is in Canada.", we see small values of ε producing names of countries, but ε ≥ 0.6 producing adjectives ("annual", "amateur", "contested", "open", "free") which are ungrammatical and nonsensical in context.

[Table 5: Vanilla BERT vs. interpolated MaPP predictions for masked minimal-pair queries, e.g., "The event is in [MASK]." ("The event is in August."); "The book is for [MASK]." ("The book is for him." / "The book is for viewing."); "The clip is about a [MASK]." ("The clip is about a minute." / "The clip is about a queen."); "The dinner is on the [MASK]."]

Interpolation
Next, we take two pseudowords representing distinct senses of an ambiguous word in minimal pair sentences, and traverse the space between them to determine what the boundary between sense regions looks like. Given two pseudowords z_1 and z_2, we simply interpolate their vectors:

    z*_α = (1 − α) z_1 + α z_2,

where 0 ≤ α ≤ 1 controls how much weight to put on one pseudoword or the other. This is depicted in figure 4.

Data. In this experiment we use the Minimal Pairs Portion of our dataset. These 40 pairs of sentences differ only in the cue to give contrasting senses of the focus word.[6] 7 different ambiguous words and 16 distinct senses appear in this portion of the dataset, with 5 sentences for each distinct sense. Several examples appear in table 1.

Evaluation. For each sentence in a minimal pair, we infer z*. Then for α ∈ {0, 0.1, 0.15, 0.2, ..., 1}, we compute z*_α, use it for masked prediction, and judge whether each of the top 5 predictions corresponds to the sense in the first sentence, the second sentence, or neither.

[6] In some cases, the two elements of a minimal pair differ syntactically as well, viz.: auxiliary vs. main verb ("I had gone/cake"); verb-particle construction vs. verb+PP ("run over the cat/bridge"); and PP vs. approximation modifier ("about a horse/minute").
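The interpolation sweep itself is straightforward; in this sketch the masked-prediction step is left as a comment, since it requires the full model, and the random vectors merely stand in for two inferred pseudowords:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 768
z1 = rng.standard_normal(d)  # e.g., pseudoword for locative "in"
z2 = rng.standard_normal(d)  # e.g., pseudoword for temporal "in"

# alpha grid used in the evaluation: {0, 0.1, 0.15, 0.2, ..., 1}
alphas = [0.0] + [round(0.1 + 0.05 * k, 2) for k in range(19)]
for a in alphas:
    z_alpha = (1 - a) * z1 + a * z2
    # feed z_alpha to the model in place of the focus token and
    # record the top 5 masked predictions (omitted here)

print(len(alphas), alphas[0], alphas[-1])
```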
Results. Figure 5 shows the overall proportion of predictions for each sense as α progresses from 0 to 1. We see a gradual trend from one sense to the other. This matches our hypothesis that there is regularity in the BERT-space: traveling on a line between two senses in the input space decodes to two distinct regions in the BERT-space.
For some individual examples, there is a sharp boundary at some α; for others there is an intermediate region where the predictions mix the two senses or are unrelated to both (see appendix A.4).
The behavior with static embeddings ("Vanilla BERT") can serve as a control for interpreting the effect of the interpolated pseudoword, as shown for several examples in table 5. In many cases Vanilla BERT prefers one of the two senses by default, but with α sufficiently close to the other sense's pseudoword, the behavior changes. In general, the transition from one sense to the other is readily apparent from the predictions. Exceptions where one of the expected senses is inadequately represented include "I had cake." (no foods are predicted in the top 5) and "The dinner is on the plate." (not as many food-oriented locations were predicted as expected).

Generalization Experiments
In this experiment we examine whether the pseudoword is specialized for a particular sense of the focus word only in a context-specific fashion, or whether the pseudoword is a valid representation of the sense in new contexts where the focus word may appear. To this end we take a pseudoword from one context and "transplant" it into a new context. We are particularly interested in transplantations where the ambiguous word has a similar meaning but is expected to yield a new distribution of masked predictions, due to the influence of the new context. For example:

(2) a. s: The book is for reading.
    b. s′: The cup is for [MASK].
Both (2a) and (2b) exemplify the 'purpose of item' sense of for, for different kinds of items that have different kinds of ordinary purposes. If the pseudoword inferred from the original sentence appropriately generalizes the meaning of for, then transplanting it into s′ should yield a word like "drinking". However, in most cases the prediction either does not change (here, for example, we still get [MASK] = "reading"), or we get an incoherent prediction for the masked token. We hypothesize that this is because the pseudoword overfit to the original context; that is, it is incapable of representing the desired sense in new contexts (especially if the meaning of the new context crucially affects what should be predicted in the masked slot).

Table 6: Generalization experiment. Comparison of @1 and @5 accuracy, over two generalization types; simple average of the pseudowords versus averaging in the loss function.

                                | @1   | @5
N (total # of predictions)      | 54   | 270
Vanilla BERT baseline (s′ only) | 31.5 | 31.9
MaPP: post hoc average          | 11.1 | 14.8
MaPP: aggregate loss (eq. (2))  | 57.4 | 53.7
We hypothesize that it is necessary to take multiple contexts into account in order to produce a flexible-context sense-like vector. One possible strategy is to compute a sense vector as a simple average of the individually learned pseudowords. We refer to this as post hoc averaging.
Another possible strategy is to train each pseudoword on multiple examples with distinct contexts: we replace eq. (1) with an aggregate loss that averages over n sentences s_1, ..., s_n containing the same focus token with the same sense η:

    z* = argmin_{z ∈ R^d} (1/n) Σ_{i=1}^{n} ||BERT_{s_i}(z) − x_t^{(i)}||²,    (2)

where BERT_{s_i}(z) is a forward pass over sentence s_i with z replacing the focus token's embedding, and x_t^{(i)} is the focus token's contextualized representation in s_i. Note that both approaches inject supervision into the process of training the sense vector by specifying which examples correspond to the same sense.

Data. In this experiment we use the Generalization Portion of the MaPP Dataset, which contains 138 sentences with 5 ambiguous words and 6 distinct senses. For each sense, there are 23 sentences (with 14 sentences used as the training set to compute the averages, and 9 as the test set).

Evaluation. For each sentence we compute two z* pseudowords with the two kinds of averaging (post hoc and aggregate loss), and compare their effectiveness at adapting the expected sense for the new context.
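The aggregate objective of eq. (2) can be sketched with the same toy setup as before: one frozen stand-in contextualizer per sentence (all names here are illustrative assumptions, not the paper's code), and a single shared vector optimized against all of them.

```python
import torch

torch.manual_seed(0)
d, n = 16, 4   # toy dimension; n sentences sharing one sense

# One frozen stand-in contextualizer per sentence s_i, plus the
# focus token's contextualized target x_i in each sentence.
contexts = [torch.nn.Linear(d, d) for _ in range(n)]
for f in contexts:
    for p in f.parameters():
        p.requires_grad_(False)
targets = [torch.randn(d) for _ in range(n)]

z = torch.randn(d, requires_grad=True)   # shared sense pseudoword

def aggregate_loss(z):
    # eq. (2): average the eq. (1) objective over the n contexts
    return sum(torch.norm(f(z) - x) ** 2
               for f, x in zip(contexts, targets)) / n

opt = torch.optim.Adam([z], lr=0.05)
initial = aggregate_loss(z).item()
for _ in range(400):
    opt.zero_grad()
    loss = aggregate_loss(z)
    loss.backward()
    opt.step()
print(initial, loss.item())  # the averaged loss decreases
```

Unlike the single-sentence case, the minimum here is generally nonzero, since one vector must compromise across contexts; that residual is exactly what makes the learned vector sense-like rather than context-specific.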
In total we evaluate 270 predictions.

Results. Table 6 shows generalization accuracies for the two techniques as well as for vanilla BERT. The aggregate-loss technique produces a correct prediction a majority of the time, while the static embedding is less accurate and post hoc averaging of pseudowords performs very poorly. These results support our intuition that it is possible to learn a pseudoword representation that generalizes over different usages of a word. An ideal pseudoword representation would be one that could serve as a sense-disambiguated embedding of the focus word; this may not be fully achievable, but the representation might be pushed in that direction by learning it over a larger, more diverse dataset.
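The contrast between the two strategies can be sketched on a toy stand-in for BERT: a frozen logistic "masked predictor" per context (the matrices, learning rate, and step count below are our illustrative choices, not the paper's). Post hoc averaging minimizes each per-context loss separately and averages the minimizers; the aggregate strategy descends on the mean of the per-context losses directly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab, n_ctx = 8, 12, 5

# Frozen toy "model": one matrix per context, mapping a pseudoword z
# to logits over the masked slot's vocabulary.
Ws = [rng.normal(size=(vocab, d)) for _ in range(n_ctx)]
ys = rng.integers(0, vocab, size=n_ctx)  # gold masked tokens

def loss_grad(W, y, z):
    """Cross-entropy of softmax(W z) at gold index y, and its gradient in z."""
    logits = W @ z
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return -np.log(p[y]), W.T @ (p - np.eye(vocab)[y])

def descend(grad_fn, z, steps=1500, lr=0.05):
    for _ in range(steps):
        z = z - lr * grad_fn(z)
    return z

# Post hoc averaging: learn one pseudoword per context, then average them.
z_posthoc = np.mean(
    [descend(lambda z, W=W, y=y: loss_grad(W, y, z)[1], np.zeros(d))
     for W, y in zip(Ws, ys)], axis=0)

# Aggregate loss (cf. eq. (2)): descend on the mean of the per-context losses.
agg_grad = lambda z: np.mean(
    [loss_grad(W, y, z)[1] for W, y in zip(Ws, ys)], axis=0)
z_agg = descend(agg_grad, np.zeros(d))

mean_loss = lambda z: np.mean([loss_grad(W, y, z)[0] for W, y in zip(Ws, ys)])
```

The design difference mirrors Table 6: the aggregate objective optimizes the mean loss directly, whereas the post hoc average of per-context minimizers need not lie anywhere near the mean-loss minimizer.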

Discussion
What is a pseudoword? The optimization problem defined in eq. (1) yields a pseudoword z * ∈ R d . We use pseudowords as input vectors to the model, although they are not constrained to the 30k vectors in BERT's vocabulary and may be arbitrary vectors in R d . We discuss the validity of this operation in appendix A.4. In practice, we find that many pseudowords behave as sense-disambiguated input vectors. While our goal is not to explore the pseudoword-space for its own sake (pseudowords are a tool to shed light on the geometry and behavior of the BERT-space), our experiments with pseudowords and artificially perturbed pseudowords reveal that the pseudoword-space contains regions that are semantically coherent as inputs to BERT.

Prospects for the MaPP technique. Our dataset is manually curated to control for specific linguistic phenomena. We expect that pseudowords may be less semantically targeted if learned from larger contexts that create more opportunities for confounds. We note also that senses are not necessarily discrete (Erk and McCarthy, 2009), and it would be worthwhile to explore how graded semantic distinctions are represented, as well as underspecified meanings. We are also interested in exploring how BERT represents tokens in sentences that permit multiple plausible interpretations. The MaPP technique can be applied to investigate the properties of other CR models as well, as it requires only that the model be a differentiable function from input token embeddings to contextualized embeddings.
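That differentiability requirement can be checked mechanically on any candidate model. A minimal sketch, using a tiny untrained transformer encoder as a stand-in for a real CR model (all sizes, positions, and names here are our assumptions for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, vocab = 16, 50

# Toy untrained masked LM: embeddings -> transformer encoder -> vocab logits.
emb = nn.Embedding(vocab, d)
layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, dim_feedforward=32,
                                   batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=1)
out = nn.Linear(d, vocab)

ids = torch.tensor([[1, 2, 3, 4, 5]])   # slot 1 will host the pseudoword
mask_pos, gold = 3, torch.tensor([9])   # slot 3 plays the [MASK] role

# Pseudoword: a free vector in R^d, initialized from an existing embedding.
z = emb.weight[7].detach().clone().requires_grad_(True)

x = emb(ids)
x = torch.cat([x[:, :1], z[None, None, :], x[:, 2:]], dim=1)
logits = out(encoder(x))
loss = nn.functional.cross_entropy(logits[:, mask_pos], gold)
loss.backward()                          # gradient flows all the way back to z
```

If `z.grad` is populated, the model is a differentiable function from input embeddings to masked-token predictions, and pseudowords can be optimized against it exactly as in eq. (1).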

Conclusion
We have presented a novel methodology and a dataset for investigating the geometry of the BERT-space, using a traversal technique that treats the input space as continuous. We showed that there is substantial regularity in the BERT-space, with regions that correspond to distinct senses. Moreover, we found evidence for "voids" in the space: regions that do not correspond to any intelligible sense. Our technique gives rise to various types of analysis, creating avenues for future work. Immediate directions that we plan to pursue are (a) examining sense representation in longer, naturally occurring sentences, and (b) extending our analysis to a multilingual setting.

A.1 The MaPP Dataset
We present here the ambiguous words together with their senses, as used in the MaPP Dataset (table 7).

A.2 Probing for Form vs. Sense

From prior work probing CRs (e.g., Liu et al., 2019), we know that these differences (in the sense and the form of relational words) are indeed encoded in them. Indeed, in some settings it is possible to separate groups of them via simple classifiers. However, this is a weak notion of the knowledge encoded in the representation. Other work, focused on probing for function-word comprehension, explored whether qualitatively different objectives lead to demonstrably different sentence representations.
To understand to what extent (and how) the form of a word versus its sense is encoded in its contextual representation, we have conducted the experiments that we describe in Section 5.

A.3 Our method vs. Other Probing Methods
Our work addresses two basic shortcomings of most probing methods. First, they strongly rely on the probing dataset used to train and evaluate a classifier. Changing the distribution of examples can shift the results of the probing experiment (e.g., Slobodkin et al., 2021). Second, probing methods give an aggregated picture at the population level, and cannot provide insight at the level of individual examples. Our method does not train a classifier, and can provide information at the instance level; it therefore does not rely on aggregation to yield a meaningful conclusion. Rather, it is designed to allow for an interpretable navigation of the BERT space. While our method does allow reporting trends at the population level by aggregation, results can be traced back to the instance level.

A.4 Further Discussion
Transfer learning using BERT. Although BERT is built as a masked language model, it is often used as a tool for transfer learning; the representations it produces are treated as vectors in a continuous space and are used with great success for various tasks such as POS tagging, NLI, multilingual alignment, prediction of brain activity patterns, and more (Schuster et al., 2019; Gauthier and Levy, 2019; Rogers et al., 2020). Our method uses pseudowords as input vectors to the model. However, the vectors given as input to BERT are ordinarily one of the 30k vectors in BERT's base vocabulary, and any other vector is considered "out of vocabulary". BERT was never meant to receive an arbitrary vector in R d , as it is defined over a discrete set; yet we break this "discrete contract" and ask how the model behaves given pseudowords and perturbations of them. To the best of our knowledge this is a novel approach to the exploration of BERT. If BERT were treated only as a masked language model, one could claim that a certain set of rules must be followed in order to infer meaningful conclusions from its outputs. In our approach, however, we choose to adopt a different view: we think of BERT as a sentence encoder, a function from a sequence of strings to a sequence of vectors. We claim that under this view there is no need to constrain the model to a discrete input space. Moreover, this "contract" has in fact already been violated, as contextualized representations are often used for tasks other than masked language modeling; the use of pseudowords as inputs is therefore a natural continuation of this idea.

Table 9: Example top-5 predictions with ε-perturbed MaPP versus vanilla BERT. ✓ indicates that the prediction has been coded as consistent with the Query sense, ✗ a wrong prediction.
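Breaking the "discrete contract" is mechanically simple in modern implementations, which expose an `inputs_embeds` path that bypasses the embedding lookup. A sketch with the Hugging Face `transformers` API, using a tiny randomly initialized BERT (no download; the sizes are ours, chosen only to keep the sketch fast, and the pseudoword here is random rather than learned):

```python
import torch
from transformers import BertConfig, BertForMaskedLM

torch.manual_seed(0)
# Tiny randomly initialized BERT; not pretrained, so predictions are noise,
# but the input mechanics are identical to the full model.
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=4, intermediate_size=64)
model = BertForMaskedLM(config)

ids = torch.tensor([[2, 5, 7, 3, 1]])
embeds = model.get_input_embeddings()(ids).clone()   # (1, 5, 32)

# Overwrite slot 1 with an arbitrary vector in R^d: a pseudoword.
embeds[0, 1] = torch.randn(32)

logits = model(inputs_embeds=embeds).logits          # (1, 5, 100)
```

Because `inputs_embeds` accepts any real-valued tensor, nothing in the model distinguishes a learned pseudoword from a legitimate vocabulary embedding.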
The expectation is that values of ε closer to 0 will be more reflective of the Query sense, while as ε increases the predictions will become more incoherent or will fail to fit the Query sense.
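One plausible construction of such a perturbation (our illustrative sketch, not necessarily the paper's exact scheme) moves the pseudoword a distance ε along a random unit direction, so that ε directly controls how far the probe strays from the learned point:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 768                          # BERT-base hidden size
z = rng.normal(size=d)           # stand-in for a learned pseudoword

def perturb(z, eps, rng):
    """Return z moved by exactly eps along a random unit direction in R^d."""
    u = rng.normal(size=z.shape)
    u /= np.linalg.norm(u)
    return z + eps * u

z_small = perturb(z, 0.1, rng)   # should stay within the same sense region
z_large = perturb(z, 10.0, rng)  # may land in an incoherent "void"
```

Sweeping ε while decoding the masked token at each step then traces how far a sense region extends around the original pseudoword.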