Interpreting Embedding Spaces by Conceptualization

One of the main methods for semantic interpretation of text is mapping it into a vector in some embedding space. Such vectors can then be used for a variety of text processing tasks. Recently, most embedding spaces are a product of training large language models. One major drawback of this type of representation is its incomprehensibility to humans. Understanding the embedding space is crucial for several important needs, including the need to explain the decision of a system that uses the embedding, the need to debug the embedding method and compare it to alternatives, and the need to detect biases hidden in the model. In this paper, we present a novel method of transforming any embedding space into a comprehensible conceptual space. We first present an algorithm for deriving a conceptual space with dynamic on-demand granularity. We then show a method for transferring any vector in the original incomprehensible space to an understandable vector in the conceptual space. We combine human tests with cross-model tests to show that the conceptualized vectors indeed represent the semantics of the original vectors. We also show how the conceptualized vectors can be used for various tasks including identifying weaknesses in the semantics underlying the original spaces and differences in the semantics of alternative models.


Introduction
Natural language processing (NLP) involves a major step of mapping input text into some semantic representation. During the last two decades, researchers have developed a class of algorithms that transform (embed) text into vectors in some semantic space. While these vectors do not capture the full meaning of the original texts, they allow us to perform a wide spectrum of textual processing tasks.
Recently, we have seen major progress in NLP thanks to the development of Large Language Models (LLMs), which are based on deep neural networks and are trained on vast amounts of text [8,21,13]. These models can then be used for generating embeddings for natural language sentences [23,18].
While these powerful embedding methods show excellent performance on a variety of tasks, they suffer from a major drawback. The dimensions of the vector space used for the embedding are internal structures in a neural network and are not comprehensible to humans. This does not present a problem when the embedding algorithm is used as a black box. Understanding the embedding space is, however, crucial for several important needs, including the need to explain the decision of a system that uses the embedding, the need to debug the embedding method and compare it to alternatives, and the need to detect biases hidden in the model [4,25,15].
The importance of interpretability has been recognized by many researchers. Several works present methods for explaining a decision of a system that uses the embedding (mainly classifiers), such as [24,14]. These methods are mainly tailored for understanding the decisions rather than the model itself. Some works [30,11] approach the problem of model understanding by training or retraining to generate a new model that is interpretable, thus sidestepping the problem of understanding the original model. Another line of work tries to find orthogonal transformations; these can help, but they provide only a limited level of interpretability and usually require an embedding matrix [9,20]. Many methods try to assign meaning to each dimension of the embedding space. For example, some probing methods use classification to assign meaning to individual dimensions of the space [5,7].
In this work we present a novel methodology for conceptualization of embedding spaces that allows humans to gain a deep understanding of the original embedding space. We have developed a method for generating an ontology, a hierarchical conceptual space, based on Wikipedia categories (or similar knowledge structures). This ontology is then used to generate an embedding space whose dimensions are human-understandable concepts. Next, we show a method for mapping any vector in the original space into the conceptual space. The original black-box embedding algorithm (usually LLM-based) is still used for the decision, so we maintain its high performance. The conceptual representation is used to understand the model's decisions, and the model itself, by feeding it various texts and examining the way it represents them.
Our method has the following features:
1. It does not assume that each dimension in the latent space corresponds to an explicit and human-understandable concept.
2. It is model agnostic: it can work with any model without additional training. Our only requirement is a black box that receives a text fragment and outputs a vector.
3. Our conceptual embedding space can be generated for any given desired size.
4. Our conceptual embedding space can be selectively deepened to specialize in specific subjects.
We evaluate our new method via a sequence of qualitative and quantitative methods. In particular, we present a novel method for evaluating the correspondence between the latent representation and its understandable counterpart using both human raters and LLMs.

Conceptualization of Embedding Spaces
In this section, we present a novel algorithm for mapping a vector that represents text in some latent embedding space to a comprehensible vector in a conceptual space. We assume that each concept has an associated textual representation and that the similarity of this representation to the input text, measured in the original embedding space, determines the strength of relatedness of the input to the particular concept.
Let T be a space of textual objects (sentences, for example). Let L = (l_1, ..., l_k) be a latent embedding space. Let f : T → L be a function that maps a text object to a vector in the latent space. Typically, f will be an LLM or LLM-based.
Our method requires two components: a space of concepts C = (c_1, ..., c_n) associated with the domain of interest, and a mapping function τ : C → T that returns a textual representation for each concept in C.
Given a vector l ∈ L (that typically represents some input text), we perform the following steps. First, we map each concept c ∈ C to a vector in L by applying f to τ(c), the textual representation of c. We thus define n vectors in L, c̄_1, ..., c̄_n, such that c̄_i ≡ f(τ(c_i)).
Our next step is to measure the similarity of each vector c̄_i to the input vector l using any given similarity measure sim. Finally, the algorithm outputs a vector in the conceptual space, using the similarities as its dimensions.
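The steps above can be sketched in a few lines of Python. All names here (ces, f, tau) are illustrative, and the toy lookup-table embedding in the usage example stands in for a real LLM encoder:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def ces(l, concepts, f, tau, sim=cosine):
    """Map a latent vector l to the conceptual space.

    concepts : the concept space C (a list of concept identifiers)
    f        : the black-box embedding function, text -> vector
    tau      : the mapping from a concept to its textual representation
    sim      : similarity measure used to weight each concept
    """
    # Embed the textual representation of every concept into L.
    concept_vecs = [f(tau(c)) for c in concepts]
    # The conceptual vector holds one similarity score per concept.
    return np.array([sim(l, v) for v in concept_vecs])
```

For instance, with a toy f defined over three texts, `ces(f("dog"), ["animal", "vehicle"], f, tau)` yields a 2-dimensional conceptual vector whose "animal" coordinate dominates.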

Figure 1: A graphical representation of the CES process: each concept c_i is mapped to a latent vector c̄_i = f(τ(c_i)) and compared to the input vector to produce the coordinates in the conceptual space.

We have thus defined a meta-algorithm CES (Conceptualizing Embedding Spaces) that, for any given embedding method f, conceptual space C, and mapping function τ from concepts to text, takes a vector in the latent space L and returns a vector in the conceptual space C. A graphical representation of the process is depicted in Figure 1.
If we use cosine similarity as sim, and use a normalised f function, we can implement CES as matrix multiplication, which can accelerate our computation. First, observe that under these restrictions, cosine similarity is equivalent to the dot product between vectors.
Let U = (u_1, ..., u_k) be the standard basis of L. We can look at the projection of U onto the C space using a function φ such that φ(u) = (sim(u, c̄_1), ..., sim(u, c̄_n)), where c̄_i = f(τ(c_i)). We can now create an n × k matrix M = [φ(u_1), ..., φ(u_k)]. Using this matrix, we define CES_{f,C,τ}(l) = M · l.
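A minimal NumPy sketch of this matrix formulation (the function names build_ces_matrix and ces_fast are our own): the rows of M are the normalised concept vectors, so multiplying by a unit-norm latent vector yields exactly the cosine similarities.

```python
import numpy as np

def build_ces_matrix(concept_vecs):
    # Rows are the concept vectors, normalised so that a dot product
    # with a unit-norm latent vector equals cosine similarity.
    M = np.stack(concept_vecs).astype(float)        # shape (n, k)
    M /= np.linalg.norm(M, axis=1, keepdims=True)
    return M

def ces_fast(M, l):
    # One matrix-vector product replaces n separate similarity computations.
    l = l / np.linalg.norm(l)
    return M @ l                                    # one similarity per concept
```

The output matches the explicit per-concept cosine loop, but a single matrix multiply is far cheaper when many inputs are conceptualized against the same concept space.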

Generating Conceptual Spaces
Our method requires a conceptual space C consisting of a set of concepts. The level of abstraction of the concepts needed for a task depends on the problem domain. In this section we present several algorithms that, given a hierarchy of concepts, generate a concept space. To allow the algorithms to return spaces with various abstraction levels, the nodes in the hierarchy should be connected by a subset relation (is-a). We chose the Wikipedia category graph as our source, as it provides wide and deep coverage of human knowledge. The Wikipedia category graph is indeed a semantic network where the nodes are the categories (concepts) and the edges are the links between a category and its parent. There is, however, a major obstacle to using the Wikipedia category graph: the links are unlabeled and do not always represent an is-a relation. For example, one of the subcategories of "Smartphones" is "Firefox OS", and one of the subcategories of "Nutrition" is "Obesity".
In the first subsection, we present a heuristic method for estimating whether a link in the graph represents an is-a relation. We then present algorithms that use this estimation to generate conceptual spaces out of the knowledge graph.

Detecting Is-a Links
Let G = (V, E) be a knowledge graph, where V is a set of concepts and E ⊆ V × V is a set of links between concepts. Let Obj(c) be the set of objects belonging to concept c. We say that c_1 is-a c_2 if Obj(c_1) ⊆ Obj(c_2). We define parents(c) = {c' ∈ V | (c', c) ∈ E} and children(c) = {c' ∈ V | (c, c') ∈ E}. Given a node c and a parent node p, we define siblings(c, p) = children(p) − {c}.
The main idea behind our method of detecting is-a links is that a set of siblings connected to a specific parent through is-a links should be similar. We estimate the similarity between a node and its siblings by the similarity between their sets of parents. Instead of using a binary decision, we chose to assign a continuous value in the range [0, 1] that will be used by our algorithms for generating conceptual spaces.
We can now define the is-a score of an edge (p, c) as the mean similarity between the parent set of c and the parent sets of its siblings:

score(p, c) = (1 / |siblings(c, p)|) · Σ_{s ∈ siblings(c, p)} sim(parents(c), parents(s)),

where sim is a set-similarity measure such as Jaccard similarity. We remove from each node the λ% (35% in our experiments) of its parent links with the lowest is-a score.
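As an illustration, the sibling-based score can be computed over a graph stored as plain dictionaries. Jaccard similarity is used here as one plausible set-similarity measure, and the dictionary representation is our assumption, not the paper's implementation:

```python
def jaccard(a, b):
    # Jaccard similarity of two sets: |intersection| / |union|.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def isa_score(parents, children, p, c):
    """Score the edge (p, c) by how similar c's parent set is to the
    parent sets of its siblings under p.

    parents  : dict mapping a concept to its set of parent concepts
    children : dict mapping a concept to the list of its children
    """
    siblings = [s for s in children[p] if s != c]
    if not siblings:
        return 1.0  # no siblings to compare against
    return sum(jaccard(parents[c], parents[s]) for s in siblings) / len(siblings)
```

On a toy graph where "cat" and "dog" share parents but "obesity" does not, the score of the edge (animal, cat) comes out higher than that of (animal, obesity), matching the intuition behind the pruning step.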

Generating a Fixed-Depth Conceptual Space
A major strength of the hierarchical representation of concepts is its multiple levels of abstraction. For our purpose, that means that we can request a concept space with a given level of granularity. Given a concept graph G, after removing edges with low is-a scores, we can define d(c), the depth of each concept (node), as the length of the shortest path from the root. We designate by C_i = {c ∈ C | d(c) = i} the set of all concepts with depth of exactly i. Table 1 shows the concepts of C_1.
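Computing d(c) and extracting C_i reduces to a breadth-first search from the root, since BFS yields shortest-path depths in an unweighted graph. A small sketch (the dictionary-based graph representation is an assumption):

```python
from collections import deque

def concepts_at_depth(children, root, i):
    """Return C_i: all concepts whose shortest-path depth from the root is i.

    children : dict mapping a concept to the list of its children
    """
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for ch in children.get(node, []):
            if ch not in depth:              # first visit = shortest path
                depth[ch] = depth[node] + 1
                queue.append(ch)
    return {c for c, d in depth.items() if d == i}
```

Because the category graph is a DAG rather than a tree, a node reachable by several paths is assigned the length of the shortest one, as the definition requires.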

A Conceptual Space with On-demand Granularity
The fixed-depth algorithm seems quite flexible, as it allows us to generate conceptual spaces of any granularity. There are, however, several difficulties in using this approach. One problem is the rapid growth of the number of nodes with the increase in depth. For example, in our implementation, |C_1| = 37, |C_2| = 706 and |C_3| = 3467. Another problem arises in domain-specific tasks, where high-granularity concepts are needed in specific subjects but not in others. Lastly, it is difficult to know ahead of time what the required granularity for a given task is. We have therefore developed an algorithm that, given a guiding set T of input texts and a desired concept-space size, generates a concept space of that size with granularity tailored to T. The main idea is to deepen categories that are strongly associated with T, thus enlarging the distances between the textual objects and allowing for more refined reasoning. We use the symbol C* to indicate a concept space that is created this way.
The algorithm starts with C 1 as its initial concept space. It then iterates until the desired size is achieved. At each iteration, the set of text examples T is embedded into the current conceptual space using CES. The concept with the largest weight is then selected for expansion. The algorithm selects its best p% children for some p, judged by their is-a score, and adds them to the current conceptual space. The algorithm is shown in Algorithm 1.

Algorithm 1 Selective deepening
C ← C_1
while |C| < size do
    emb ← CES embedding of the texts in T into the current space C
    ĉ ← concept in C with max weight in emb
    best ← p% of children(ĉ) with highest is-a score
    C ← C ∪ best
end while
return C

If the embedding is used for classification tasks, we can use the labels of the training examples in addition to their text. In such a case, we evaluate concepts using a linear combination of the weight according to the embedding of the training texts and the entropy according to the labels of the examples. As before, the node with the maximal value is chosen for expansion. The entropy of a concept is determined by the set of labeled examples whose text is embedded into a vector with the given concept assigned the highest weight. The intuition is that concepts representing texts in different classes need a refinement to allow better separation.
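A rough Python rendering of the selective-deepening loop, with the CES embedding, the child lists, and the is-a scores abstracted behind caller-supplied functions. All names here are illustrative, and summing the per-text weights to pick the concept to expand is our assumption about how "max weight" is aggregated over T:

```python
import numpy as np

def selective_deepening(c1, texts, size, p, embed_into, children, isa):
    """Grow a concept space of the requested size, tailored to the guiding texts.

    embed_into(texts, space) -> list of CES vectors, one per text
    children(c)              -> list of child concepts of c
    isa(parent, child)       -> is-a score of the edge (parent, child)
    """
    space = list(c1)
    while len(space) < size:
        emb = np.asarray(embed_into(texts, space))     # shape (|T|, |space|)
        weights = emb.sum(axis=0)                      # total weight per concept
        best_c = space[int(np.argmax(weights))]
        # Keep the top p% of the selected concept's children by is-a score.
        kids = sorted(children(best_c), key=lambda k: isa(best_c, k), reverse=True)
        keep = kids[: max(1, int(len(kids) * p / 100))]
        new = [k for k in keep if k not in space]
        if not new:                                    # nothing left to expand
            break
        space.extend(new)
    return space
```

A production version would also fall back to the next-heaviest concept when the top one has no unexpanded children; the sketch simply stops in that case.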

Empirical Evaluation
It is not easy to evaluate an algorithm whose task is to create an understandable representation that matches the original incomprehensible embedding. We performed a series of experiments, including a human study, that show that our method indeed achieves its desired goal. For all the experiments, unless otherwise specified, we used the RoBERTa sentence embedding model as our f [23,13], which for simplicity we will refer to as SRoBERTa 1. Whenever the concept space C* was used, we set size = 768 to match the dimension used by SRoBERTa, but we observed that using much smaller values yielded almost as good results. For τ, the function that maps concepts to text, we simply used the text of the concept name (with a length of 4.25 words on average in G).

Qualitative Evaluation
We first show several examples of actual conceptual representations created by our algorithm to get some insight into the way our method works. We selected 3 sentences from 3 recent CNN articles on various topics and applied the SRoBERTa embedding to the three sentences to get 3 latent vectors.
To generate the concept space C*, we need a set of example texts. We used the first 10 sentences of each article. We could now run CES and get conceptual embeddings for the 3 latent vectors. Table 2 shows the resulting embeddings. We show only the 3 top concepts (C1, C2, C3) due to lack of space. We can see that the concepts of the new embedding are understandable and intuitively capture the semantics of the input texts. In brackets, we show the depth of each concept in the concept graph.
It is important to note that the representations shown are not based on some new embedding method, but reflect what our model thinks about the way that SRoBERTa understands the text.

Evaluating Performance on Classification Tasks
To show that our representation matches the original one generated by the LLM, we first show that learning using the original embedding dimensions as features and learning using the conceptual features yield similar classifiers. Most works try to show such similarity by comparing accuracy results. This method, however, is prone to errors: two classifiers might each achieve an accuracy of 80% while agreeing on only 60% of the cases. Instead, we use a method that is used for rater agreement, reporting two numbers: the raw agreement and Cohen's kappa coefficient [6].
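To make the distinction concrete, both measures are easy to compute directly. The following sketch (the helper names are our own) computes raw agreement and Cohen's kappa for two label sequences:

```python
def raw_agreement(a, b):
    # Fraction of cases on which the two label sequences agree.
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Agreement corrected for chance, for two label sequences (lists)."""
    labels = sorted(set(a) | set(b))
    n = len(a)
    po = raw_agreement(a, b)                       # observed agreement
    pe = sum((a.count(l) / n) * (b.count(l) / n)   # chance agreement
             for l in labels)
    return (po - pe) / (1 - pe) if pe < 1 else 1.0
```

For example, for a = [0, 0, 1, 1] and b = [0, 0, 1, 0], raw agreement is 0.75 while kappa is only 0.5, since half of the expected agreement is attributable to chance. The same quantity is available as `cohen_kappa_score` in scikit-learn.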
We use the following data sets: AG News, Ohsumed, R8, Yahoo, BBC News, DBpedia 14 and 20Newsgroup. Note that we use only topical classification tasks, as the concept space we use does not include the concepts necessary for tasks like sentiment analysis. If a data set has more than 10,000 examples, we randomly sample 10,000. The results are averaged over 10 folds. We use the random forest learning algorithm with 100 trees and a max depth of 5. The conceptual space used by CES is C*, using the training set as the guiding text T.
AG News: has 4 classes, world, sports, business, and science and technology. We use only its training set (120k examples) 3.
Ohsumed: has 23 classes. It contains medical abstracts from the MeSH categories of the year 1991. (In the version used, examples with two or more categories were removed, leaving us with more than 7k samples.)
R8: a subset of the Reuters 21578 data set. It has 8 classes and approximately 7600 examples. 4
Yahoo: Comprehensive Questions and Answers from Yahoo! [36]. It has 10 classes. We use here only the test set, as it is sufficiently large with over 60K examples.
BBC News: consists of around 2000 documents from the BBC news website corresponding to stories in five topical areas from 2004-2005 [12]. It contains five classes (business, entertainment, politics, sport and technology).
DBpedia 14: 14 non-overlapping classes from DBpedia 2014 [36]. We use only the training set, which has 40K samples. Each sample is a title and content.
20 Newsgroup: approximately 20K newsgroup documents from 20 different classes. We used the version from the sklearn data sets Python library.

Table 3 shows the agreement between a classifier trained on the LLM embedding and a classifier trained on the conceptual embedding generated by CES. We report raw agreement and the kappa coefficient with standard deviation. The second column reports raw agreement with a random classifier for reference. Note that the larger the kappa coefficient is, the less likely the agreement is due to random chance, and the closer it gets to one, the more the two models agree. All the values are relatively high, indicating high agreement between the LLM embedding and CES's embedding.

Table 4: Raw agreement between human raters and the model.

While the results reported here look promising, they are not sufficient to indicate that our goal is achieved. Consider the following hypothetical algorithm. Let D be the size of the LLM embedding space. The algorithm selects D random words from the English vocabulary and assigns each to an arbitrary dimension. This hypothetical algorithm satisfies two requirements: using it will always yield 100% agreement with the original, and its generated representation will be understandable by humans. However, it is clear that it does not convey to humans any knowledge regarding the LLM representation.
In the next subsections, we describe experiments with humans and with other models that support our claim that CES generates understandable representations that indeed reflect the semantics of the LLM embedding.

Human Evaluation
We have designed a human experiment that tests the hypothesis that the conceptual representation generated by CES reflects the meaning of the LLM representation in a human-understandable way. We use a learning algorithm to generate a classifier based on the LLM-based embedding using the vector dimensions as features. We apply the resulting classifier on a test set. We then ask the human testers to classify the same test set, but instead of showing them the input text, we show them only the CES-generated concepts. We claim that if there is high agreement between the LLM-based classifier and the human raters who only have access to the conceptual representation, then the conceptual representation indeed reflects the meaning of the LLM embedding.
To allow classification by the human raters, out of the 7 data sets described in the previous subsection, we chose the 4 that have meaningful names for their classes. To make the classification task less complex for the raters, we randomly sampled two classes from each data set, thus creating a binary classification problem. For each binary data set, we set aside 20% of the examples for training a classifier based on the LLM embedding, using the same method and parameters as in the previous subsection. The resulting classifier was then applied to the remaining 80% of the data set.
Out of this test set, we sampled 10 examples on which the LLM-based classifier was right and 10 on which it was wrong (except for the Ohsumed data set, where only 7 wrong answers were found). This is the test set that was presented to the human raters. Each test case is represented by the 3 top concepts of the CES embedding, after applying feature selection on the full embedding to choose the top 20% of concepts. As before, the conceptual space is C* with size = 768 and with the training set used as T. The rater is presented with an example represented by 3 concepts and two alternative class names, and is instructed to choose one. The final human classification of a test example was computed by majority voting over 3 raters. Table 4 shows the raw agreement between the LLM-based classification and the human classification. The kappa coefficient was not computed, as the test set is too small. The results are encouraging, as they show quite high agreement. To obtain additional results, we repeat the experiment in the next subsection, replacing the human raters with LLM-based raters.

Evaluation by Other Models
We repeated the experiments of the last subsection, with the same test sets, but instead of using human raters, we used an LLM rater. The LLM rater receives the top 3 concepts, just like the human rater, and makes a decision by computing the cosine similarity between its embedding of the textual representation of each class and its embedding of the textual representation of the 3 concepts. The 3 LLMs used for rating are SBERT [23] 5, ST5 [18] 6 and SRoBERTa. Note that the two uses of SRoBERTa are quite different. The one used for the original classification is based on a training set and a smart learning algorithm, while the model used for rating just computes the similarity between the class and the 3 concepts.

One major difference between our method and the alternatives is that they try to assign meaning to each dimension of the latent space, while we map each latent vector to a conceptual space. We denote the alternative methods by Dimension Meaning Assignment (DMA). We have designed two competitors that represent the DMA approach.
The first one, which we call DMA_words, is based on a vocabulary of 10,000 frequent words 7. We represent each word by our LLM, yielding 10,000 vectors of size 768. We then map each dimension to the word with the highest weight for it, making sure that the mapping is unique. The second one, which we call DMA_concepts, is built in the same way, using the concepts in C_3 instead of words.
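The DMA construction can be sketched as follows, assuming a dict from words to their LLM vectors. The greedy tie-breaking used to keep the mapping unique is our assumption, as the text does not specify how collisions are resolved:

```python
import numpy as np

def dma_mapping(word_vecs):
    """Assign each embedding dimension a unique word (the DMA baseline).

    word_vecs : dict mapping a word to its embedding vector of size k
    Returns a list of k words, one per dimension.
    """
    words = list(word_vecs)
    W = np.stack([word_vecs[w] for w in words])    # shape (n_words, k)
    k = W.shape[1]
    mapping, used = [], set()
    for dim in range(k):
        # Highest-weight word for this dimension, skipping words already taken.
        for idx in np.argsort(-W[:, dim]):
            if words[idx] not in used:
                mapping.append(words[idx])
                used.add(words[idx])
                break
    return mapping
```

On a toy 2-dimensional vocabulary this assigns each dimension the word that peaks on it, never reusing a word, which mirrors the uniqueness constraint described above.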

Using CES for Understanding Models
One major feature of our methodology is that it allows us to gain understanding of the semantics of trained models, differences between their views of the world, and their potential knowledge gaps. We demonstrate this by comparing the views of three LLMs, SBERT, ST5 and SRoBERTa, mapped by CES to concept representations, on 4 example texts.
We have used phrases rather than full sentences to prevent the models from using context for disambiguation. The conceptual space used is C_3, since we did not have accompanying text for selective deepening. Table 6 shows the top 3 concepts of the representations generated by CES for the text "FC Barcelona" for the 3 LLMs. We can see that, while SRoBERTa and ST5 recognize the sport aspect of the input text, SBERT is not aware of it and only identifies the city. To validate this observation, we applied the 3 models to 2 additional texts, "Miami Dolphin", which is strongly related to sport, and "Politics in Spain", which is related to the city aspect. We measured the cosine similarity between the input text and the two others in each of the 3 embedding spaces. The results support our observation: the SBERT embedding is more similar to the city-aspect embedding, while the two others are more similar to the sport-text embedding. Note that the difference in the variance of values is due to the LLMs' different embedding spaces. In Table 7, for the input "Manhattan Project", we can see that ST5 recognizes the military project while SBERT knows only about the connection of Manhattan to New York. SRoBERTa recognizes both aspects.
In Table 8, for the input "Amy Winehouse", SRoBERTa indeed recognizes her as a celebrity and a musician, while ST5's main concepts are related to her drug problems. In Table 9, for the input "Donald Trump", all models are aware of the connection to the presidency, while only ST5 associates it strongly with racism.
The examples shown in this subsection lead us to believe that our conceptualization algorithm can be very useful for deciding which LLM to use for a given task and what additional training is needed for a given model.

Related Work
The problem of interpretability has received significant attention in recent years. In this section, we discuss some of the works that are most relevant to our research.
One relatively early approach tries to find orthogonal or close-to-orthogonal transformations [9,20,28,29] to improve interpretability. The advantage of these models is that, due to the orthogonality, they do not lose information. Other works [2,17,32] transform the original embedding into a sparse embedding to improve interpretability. Unlike our method, the above models provide limited explainability, and they need an embedding matrix, which limits them, especially for unbounded embedding spaces (like the sentence embeddings we use here).
A large body of research [24,14,35,22,26,10,27] is devoted to generating an explanation for the decision, mostly a classification, of a black-box model. These methods use neighboring examples or counterfactuals to give the user insight into the reasoning behind the decision.
Several works set a goal, like ours, of understanding the model itself rather than its decisions. Some works (e.g., [31]) try to assign a specific concept to each dimension. Other methods [34,33,3,7,5] also try to infer the meaning of specific dimensions or layers and to understand the semantics of internal elements of the network. A recent work [19] tries to understand sentence similarity by creating AMR graphs of the input sentences, mapping sub-embeddings to semantic aspects of the AMRs, and using AMR metrics to compute similarity. Some works [16,1,11,30] try to sidestep the problem of model interpretation by generating an understandable model based on the original model. Some of these methods [16,1] do so by mapping to a more understandable model, while others [11,30] perform training or retraining to generate the new models. All of these works were applied only to static models.

Conclusion
Previous approaches that attempted to understand latent embedding spaces, in particular those generated by LLMs, assumed that the dimensions of these spaces correspond to semantic concepts recognizable by humans. This assumption is not necessarily true, as it is quite possible that each latent dimension represents some complex combination of human-recognizable concepts.
In this work we introduce an alternative approach that maps the latent embedding space into a space of concepts that are well understood by humans and provide good coverage of human knowledge. We also present a method that generates such a conceptual space with an on-demand level of granularity.
We showed that the results of using the conceptual embedding correlate with those achieved using the original embedding. We also illuminated a problematic aspect of using such evaluation methods: any arbitrary 1-1 mapping to concepts can yield perfect correlation while not reflecting the meaning of the original embedding. We then introduced a novel method for evaluating the correspondence of the conceptual embedding to the meaning of the original embedding, both by humans and by other models.
We believe that the novel method presented in this paper provides a good way of understanding latent embedding spaces generated by strong LLMs. Such understanding can be very helpful for explanation of model decisions, for choosing the right model, and for debugging models by identifying knowledge gaps and biases.

Ethical statement
This work can have a broad impact on the interpretability of embeddings in NLP LLMs. Our method can be added as an additional check of the faithfulness of the model at hand. It can help measure embedding models on a relevant subject and can guide us in deciding whether a model needs additional training on a specific subject.