On the Transformation of Latent Space in Fine-Tuned NLP Models

We study the evolution of the latent space in fine-tuned NLP models. In contrast to the commonly used probing framework, we opt for an unsupervised method to analyze representations. More specifically, we discover latent concepts in the representational space using hierarchical clustering. We then use an alignment function to gauge the similarity between the latent space of a pre-trained model and that of its fine-tuned version. We use traditional linguistic concepts to facilitate our understanding, and also study how the model space transforms towards task-specific information. We perform a thorough analysis, comparing pre-trained and fine-tuned versions of three models across three downstream tasks. The notable findings of our work are: i) the latent space of the higher layers evolves towards task-specific concepts, ii) whereas the lower layers retain the generic concepts acquired in the pre-trained model, iii) some concepts in the higher layers acquire polarity towards the output class, and iv) these concepts can be used to generate adversarial triggers.


Introduction
The revolution of deep learning models in NLP can be attributed to transfer learning from pre-trained language models. Contextualized representations learned within these models capture rich linguistic knowledge that can be leveraged towards novel tasks, e.g. classification of COVID-19 tweets (Alam et al., 2021; Valdes et al., 2021), disease prediction (Rasmy et al., 2020), or natural language understanding tasks such as SQuAD (Rajpurkar et al., 2016) and GLUE (Wang et al., 2018).
Despite their success, the opaqueness of deep neural networks remains a cause of concern and has spurred a new area of research to analyze these models. A large body of work analyzed the knowledge learned within the representations of pre-trained models (Belinkov et al., 2017; Conneau et al., 2018; Liu et al., 2019; Tenney et al., 2019; Durrani et al., 2019; Rogers et al., 2020) and showed the presence of core-linguistic knowledge in various parts of the network. Although transfer learning using pre-trained models has become ubiquitous, very few papers (Merchant et al., 2020; Mosbach et al., 2020; Durrani et al., 2021) have analyzed the representations of the fine-tuned models. Given their massive usability, interpreting fine-tuned models and highlighting task-specific peculiarities is critical for their deployment in real-world scenarios, where it is important to ensure fairness and trust when applying AI solutions.
In this paper, we focus on analyzing fine-tuned models and investigate: how does the latent space evolve in a fine-tuned model? Different from the commonly used probing framework of training a post-hoc classifier (Belinkov et al., 2017; Dalvi et al., 2019a), we opt for an unsupervised method to analyze the latent space of pre-trained models. More specifically, we cluster contextualized representations in high-dimensional space using hierarchical clustering and term these clusters the Encoded Concepts (Dalvi et al., 2022). We then analyze how these encoded concepts evolve as the models are fine-tuned towards a downstream task. Specifically, we target the following questions: i) how do the latent spaces compare between the base (pre-trained) and fine-tuned models? ii) how does the presence of core-linguistic concepts change during transfer learning? and iii) how is the knowledge of downstream tasks structured in a fine-tuned model?
We use an alignment function (Sajjad et al., 2022) to compare the concepts encoded in the fine-tuned models with: i) the concepts encoded in their pre-trained base models, ii) human-defined concepts (e.g., parts-of-speech tags or semantic properties), and iii) the labels of the downstream task towards which the model is fine-tuned. We carried out our study using three pre-trained transformer language models, BERT (Devlin et al., 2019), XLM-RoBERTa (Conneau et al., 2020), and ALBERT (Lan et al., 2019), analyzing how their representation space evolves as they are fine-tuned towards the tasks of Sentiment Analysis (SST-2, Socher et al., 2013), Natural Language Inference (MNLI, Williams et al., 2018), and Hate Speech Detection (HSD, Mathew et al., 2020). Our analysis yields interesting insights such as:
• The latent space of the models substantially evolves from their base versions after fine-tuning.
• The latent space representing core-linguistic concepts is limited to the lower layers in the fine-tuned models, contrary to the base models where it is distributed across the network.
• We found task-specific polarity concepts in the higher layers of the Sentiment Analysis and Hate Speech Detection tasks.
• These polarized concepts can be used as triggers to generate adversarial examples.
• Compared to BERT and XLM, the representational space in ALBERT changes significantly during fine-tuning.

Methodology
Our work builds on the Latent Concept Analysis method (Dalvi et al., 2022) for interpreting the representational spaces of neural network models. We cluster contextualized embeddings to discover Encoded Concepts in the model and study the evolution of the latent space in the fine-tuned model by aligning the encoded concepts of the fine-tuned model to: i) those of its pre-trained version, ii) human-defined concepts, and iii) task-specific concepts (for the task the pre-trained model is fine-tuned on). Figure 1 presents an overview of our approach. In the following, we define the scope of a Concept and discuss each step of our approach in detail.

Concept
We define a concept as a group of words that are clustered together based on some linguistic relation, such as a lexical, semantic, syntactic, or morphological one. Formally, consider C_t(n), a concept consisting of a unique set of words {w_1, w_2, ..., w_J}, where J is the number of words in the concept, n is a concept identifier, and t is the concept type, which can be an encoded concept (ec), a human-defined concept (pos:verbs, sem:loc, ...), or a class-based concept (sst:+ive, hsd:toxic, ...). Task-specific Concepts: Another kind of concept that we use in this work is the task-specific concept, which represents the affinity of its members with respect to the task labels. Consider a sentiment classification task with two labels, "positive" and "negative". We define C_sst(+ve) as a concept containing words that appear only in sentences labeled positive. Similarly, we define C_hsd(toxic) as a concept containing words that appear only in sentences marked as toxic.
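The construction of these class-based concepts can be sketched in a few lines of Python. This is a simplified illustration (the function name and toy data are ours, not from the paper's implementation), and it treats words as surface types, whereas the actual concepts are built over contextualized word instances:

```python
def task_specific_concepts(sentences, labels):
    """Group words into class-based concepts by label affinity.

    A word joins the concept for a label only if *every* sentence it
    appears in carries that label, mirroring C_sst(+ve) / C_hsd(toxic).
    """
    seen = {}  # word -> set of labels the word co-occurs with
    for sent, label in zip(sentences, labels):
        for word in sent.split():
            seen.setdefault(word, set()).add(label)
    concepts = {}
    for word, word_labels in seen.items():
        if len(word_labels) == 1:  # word is exclusive to one label
            concepts.setdefault(next(iter(word_labels)), set()).add(word)
    return concepts

sents = ["a wonderful uplifting film",
         "a dreary boring mess",
         "wonderful acting dreary plot"]
labs = ["positive", "negative", "positive"]
concepts = task_specific_concepts(sents, labs)
# "wonderful" occurs only in positive sentences, so it joins the
# positive concept; "dreary" occurs under both labels, so it is excluded.
```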

Latent Concept Discovery
A vector representation in a neural network model is composed of feature attributes of the input words. We group the encoded vector representations using the clustering approach discussed below. The resulting clusters, which we term encoded concepts, are then matched with the human-defined concepts using an alignment function.
Formally, consider a pre-trained model M with L layers: {l_1, l_2, ..., l_L}. Given a dataset W = {w_1, w_2, ..., w_N}, we generate feature vectors, a sequence of latent representations z^l = {z^l_1, z^l_2, ..., z^l_N}, by doing a forward pass over the data for any given layer l. Our goal is to cluster the representations z^l, obtained from the task-specific training data, into encoded concepts.
We use agglomerative hierarchical clustering (Gowda and Krishna, 1978), which assigns each word to its own cluster and iteratively merges clusters based on Ward's minimum-variance criterion, using intra-cluster variance. The distance between two vector representations is calculated as the squared Euclidean distance. The algorithm terminates when the required K clusters (i.e., encoded concepts) are formed, where K is a hyper-parameter. Each encoded concept represents a latent relationship between the words it contains.
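The merge step can be made concrete with a minimal, unoptimized sketch of Ward's criterion in plain Python (the real experiments cluster hundreds of thousands of 768-dimensional vectors, for which an optimized library implementation would be used). The cost of merging clusters A and B is the increase in within-cluster variance, |A||B| / (|A| + |B|) times the squared Euclidean distance between their means:

```python
import itertools

def ward_cluster(points, k):
    """Agglomerative clustering with Ward's minimum-variance criterion.

    Each point starts in its own cluster; at every step the pair whose
    union yields the smallest increase in within-cluster variance is
    merged. Stops when k clusters remain.
    """
    clusters = [[p] for p in points]

    def mean(c):
        dim = len(c[0])
        return [sum(p[d] for p in c) / len(c) for d in range(dim)]

    def merge_cost(a, b):
        ma, mb = mean(a), mean(b)
        dist2 = sum((x - y) ** 2 for x, y in zip(ma, mb))  # squared Euclidean
        return len(a) * len(b) / (len(a) + len(b)) * dist2

    while len(clusters) > k:
        i, j = min(itertools.combinations(range(len(clusters)), 2),
                   key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Two tight pairs of 2-D points collapse into two clusters.
clusters = ward_cluster([(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)], k=2)
```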

Alignment
Once we have obtained a set of encoded concepts in the base (pre-trained) and fine-tuned models, we want to align them to study how the latent space has evolved during transfer learning. Sajjad et al. (2022) calibrated the representational space of transformer models with different linguistic concepts to generate explanations. We extend their alignment function to align latent spaces between a model and its fine-tuned version. Given a concept C_1(n) with J words, we consider it to be θ-aligned (Λ_θ) with a concept C_2(m) if they satisfy the following constraint:

(1/J) Σ_{w ∈ C_1(n)} Σ_{w′ ∈ C_2(m)} δ(w, w′) ≥ θ    (1)

where the Kronecker function δ(w, w′) is 1 if w = w′ and 0 otherwise. For example, an encoded concept is θ-aligned with the positive polarity concept if at least a θ fraction of its words appeared in positively labeled sentences. Note that here a word represents an instance based on its contextualized embedding. We similarly align C_ec with C_sst(−ve) to discover negative polarity concepts.
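For word-level (non-contextualized) concepts, the constraint in Eq. 1 reduces to a simple overlap ratio. A minimal sketch, with names of our own choosing:

```python
def theta_aligned(concept_a, concept_b, theta=0.95):
    """True if at least a theta fraction of the words in concept_a
    also occur in concept_b (the delta terms of Eq. 1 sum to the
    overlap count, which is then normalized by |concept_a|)."""
    if not concept_a:
        return False
    overlap = sum(1 for w in concept_a if w in concept_b)
    return overlap / len(concept_a) >= theta

# A full subset is aligned; a 2/3 overlap falls below theta = 0.95.
full = theta_aligned({"cat", "dog", "bird"}, {"cat", "dog", "bird", "fish"})
partial = theta_aligned({"cat", "dog", "bird"}, {"cat", "dog"})
```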
To carry out the analysis, we fine-tuned the base models for the tasks of sentiment analysis, using the Stanford Sentiment Treebank dataset (SST-2, Socher et al., 2013), natural language inference (MNLI, Williams et al., 2018), and hate speech detection (HSD, Mathew et al., 2020).

Clustering
We used the task-specific training data for clustering with both the base (pre-trained) and fine-tuned models. This enables an accurate comparison of the representational spaces generated from the same data. We do a forward pass over both the base and fine-tuned models to generate contextualized feature vectors for the words in the data and run agglomerative hierarchical clustering over these vectors. We do this for every layer independently, obtaining K clusters (a.k.a. encoded concepts) for both the base and fine-tuned models. We used K = 600 for our experiments. We carried out preliminary experiments (all the BERT-base-cased experiments) using K = 200, 400, ..., 1000, and all our experiments using K = 600 and K = 1000. We found that our results are not sensitive to these parameters and that the patterns are consistent across different cluster settings (please see Appendix B).

Human-defined Concepts
We experimented with traditional tasks that are defined to capture core-linguistic concepts such as word morphology: part-of-speech tagging using the Penn TreeBank data (Marcus et al., 1993); syntax: chunking using the CoNLL 2000 shared task dataset (Tjong Kim Sang and Buchholz, 2000); CCG supertagging using the CCG TreeBank (Hockenmaier, 2006); and semantic tagging using the Parallel Meaning Bank data (Abzianidze et al., 2017). We trained BERT-based sequence taggers for each of the above tasks and annotated the task-specific training data. Each core-linguistic task serves as a human-defined concept that is aligned with the encoded concepts to measure the representation of linguistic knowledge in the latent space. Appendix A presents details on the human-defined concepts, data statistics, and tagger accuracy.

Alignment Threshold
We consider an encoded concept to be aligned with another concept if it has at least a 95% match in the number of words; we also experimented with lower and higher thresholds. We only consider concepts that have more than 5 word types. Note that the encoded concepts are based on contextualized embeddings, where a word has different embeddings depending on the context.

Analysis
Language model pre-training has been shown to capture rich linguistic features (Tenney et al., 2019;Belinkov et al., 2020) that are redundantly distributed across the network (Dalvi et al., 2020).We analyze how the representational space transforms when tuning towards a downstream task: i) how much knowledge is carried forward and ii) how it is redistributed, using our alignment framework.

Comparing Base and Fine-tuned Models
How do the latent spaces compare between the base and fine-tuned models? We measure the overlap between the concepts encoded in the different layers of the base and fine-tuned models to gauge the extent of transformation. Figure 3 compares the concepts in the base BERT, XLM-RoBERTa, and ALBERT models versus their fine-tuned variants on the SST-2 task. We observe a high overlap of concepts in the lower layers of the model, which starts decreasing as we go deeper in the network, diminishing completely towards the end. We conjecture that the lower layers of the model retain the generic language concepts learned in the base model, whereas the higher layers now learn task-specific concepts.
Note, however, that the lower layers also do not align completely between the models, which shows that all the layers go through substantial changes during transfer learning.
Please see all results in Appendix C.1. Our subsequent results, comparing the latent space with human-defined language concepts (Section 4.2) and the task-specific concepts (Section 4.3), reinforce this hypothesis.
Comparing Architectures: The spread of the shaded area along the x-axis, particularly in XLM-R, reflects that some higher-layer latent concepts in the base model have shifted towards the lower layers of the fine-tuned model. The latent space in the higher layers now reflects task-specific knowledge that was not present in the base model. ALBERT shows a strikingly different pattern, with only the first 2-3 layers exhibiting an overlap with the base concepts. This could be attributed to the fact that ALBERT shares parameters across layers, while the other models have separate parameters for every layer. ALBERT has less of a luxury to preserve previous knowledge, and therefore its space transforms significantly towards the downstream task. Notice that the overlap is comparatively smaller even in the embedding layer (38% versus 52% for BERT and 46% for XLM-R), where words are primarily grouped based on lexical similarity.
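The layer-by-layer overlap matrices behind Figure 3 can be computed with a routine along these lines. This is a sketch under the simplifying assumption that concepts are plain word sets; the function name is ours:

```python
def layer_overlap(base_layers, tuned_layers, theta=0.95):
    """For every (tuned layer, base layer) pair, compute the percentage
    of the tuned layer's concepts that are theta-aligned with at least
    one concept of the base layer. Concepts are modeled as word sets."""
    def aligned(a, b):
        return bool(a) and len(a & b) / len(a) >= theta

    matrix = []
    for tuned in tuned_layers:
        row = []
        for base in base_layers:
            n = sum(1 for c in tuned if any(aligned(c, b) for b in base))
            row.append(100.0 * n / len(tuned))
        matrix.append(row)
    return matrix

# One base layer, one tuned layer: half of the tuned concepts survive.
m = layer_overlap([[{"a", "b"}, {"c", "d"}]],
                  [[{"a", "b"}, {"x", "y"}]])
```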

Presence of Linguistic Concepts in the Latent Space
How does the presence of core-linguistic concepts change during transfer learning? To validate our hypothesis that generic language concepts are now predominantly retained in the lower half of the network, we analyze how the linguistic concepts spread across the layers in the pre-trained and fine-tuned models by aligning the latent space to the human-defined concepts. Figure 4 shows that the latent space of the models captures POS concepts (e.g., determiners, past-tense verbs, superlative adjectives, etc.). The information is present across the layers in the pre-trained models; however, as the model is fine-tuned towards a downstream task, it is retained only in the lower and middle layers. We can conclude from this result that POS information is important for a foundational task such as language modeling (predicting the masked word), but not critically important for a sentence classification task like sentiment analysis. To strengthen our argument and confirm this further, we fine-tuned a BERT model towards the task of POS tagging itself. Figure 5 shows the extent of the alignment of the POS concepts with BERT-base and with BERT fine-tuned towards POS. Notice that more than 80% of the encoded concepts in the final layers of the BERT-POS model are now aligned with POS concepts, as opposed to the BERT-SST model, where the POS alignment (as can be seen in Figure 4) decreased to less than 5%.
Comparing Tasks and Architectures: We found these observations to hold consistently for the other tasks (e.g., MNLI and HSD) and human-defined concepts (e.g., SEM, chunking, and CCG tags) across the three architectures (i.e., BERT, XLM-R, and ALBERT) that we study in this paper.
Table 1 compares the overall presence of core-linguistic concepts across the base and fine-tuned models. We observe a consistently deteriorating pattern across all human-defined concepts. In terms of architectural differences, we again found ALBERT to show a substantial difference in the representation of POS post fine-tuning. The number of concepts not only regressed to the lower layers but also decreased significantly as opposed to the base model.

Task-specific Latent Spaces
How is the knowledge of downstream tasks structured in a fine-tuned model? Now that we have established that the latent space of the higher layers is substantially different from the base models and from linguistic concepts, we probe: what kind of knowledge is learned in the latent space of the higher layers? Previous research (Kovaleva et al., 2019; Merchant et al., 2020; Durrani et al., 2021) found that the higher layers are optimized for the task. We also noticed how the concepts learned in the top 6 layers of the BERT-POS model completely evolve towards the (POS) task labels (see Figure 5). We now extend this experiment to the sentence-level tasks and investigate the extent of alignment between the latent concepts of the fine-tuned models and their task labels. The SST-2 task predicts the sentiment (positive or negative) of a sentence. Using the class labels, we form positive and negative polarity concepts and align them with the encoded concepts: a positive polarity concept is made up of words that appeared only in positively labeled sentences, and we say an encoded concept C_ec is aligned with the positive polarity concept C_+ if 95% of the words in C_ec belong to C_+ (note that the opposite is not necessarily true). If an encoded concept is not aligned with any polarity concept, we mark it as "Neutral". Figure 6 shows that the concepts in the final layers acquire polarity towards the output classes, compared to the base model, where we see only neutral concepts throughout the network. Figure 7 shows examples of positive (top left) and negative polarity (top right) concepts.
Comparing architectures: Interestingly, the presence of polarity clusters is not always equal. The last two layers of BERT-SST are dominated by negative polarity clusters, while ALBERT showed the opposite trend, with positive polarity concepts being more frequent. We hypothesized that the imbalance in the presence of polarity clusters may reflect a prediction bias towards/against a certain class. However, we did not find clear evidence for this in a pilot experiment. We collected predictions for all three models over a random corpus of 37K sentences. The models predicted negative sentiment 69.5% (BERT), 67.4% (XLM-R), and 64.4% (ALBERT) of the time. While these numbers weakly correlate with the number of negative polarity concepts in these models, a thorough investigation is required to obtain accurate insights. We leave a detailed exploration of this for future work.
ALBERT showed the evolution of polarity clusters much earlier in the network (layer 3 onwards). This is in line with our previous results on aligning the encoded concepts of the base and fine-tuned models (Figure 3). We found that the latent space in ALBERT evolved the most, overlapping with its base model only in the first 2-3 layers. The POS-based concepts were also reduced to just the first two layers (Figure 4). Here we can see that the concepts learned within the remaining layers acquire affinity towards the task-specific labels. We found these results to be consistent with the hate speech task (see Appendix C.3), but not with the MNLI task, where we did not find the latent concepts to acquire affinity towards the task labels. This could be attributed to the complexity and nature of the MNLI task. Unlike the SST-2 and HSD tasks, where lexical triggers serve as important indicators for the model, MNLI requires intricate modeling of the semantic relationship between premise and hypothesis to predict entailment. Perhaps an alignment function that models the interaction between the concepts of premise and hypothesis is required. We leave this exploration for the future.
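The per-layer polarity labeling behind Figure 6 can be sketched as follows. This is a simplified word-set version (function and variable names are our own), applied to one layer's encoded concepts at a time:

```python
def concept_polarity(encoded_concepts, positive_words, negative_words,
                     theta=0.95):
    """Label each encoded concept positive / negative / neutral by
    theta-alignment with the class-based polarity concepts."""
    def frac_in(concept, vocab):
        return sum(1 for w in concept if w in vocab) / len(concept)

    labels = []
    for concept in encoded_concepts:
        if frac_in(concept, positive_words) >= theta:
            labels.append("positive")
        elif frac_in(concept, negative_words) >= theta:
            labels.append("negative")
        else:
            labels.append("neutral")
    return labels

labels = concept_polarity(
    [{"great", "superb"}, {"awful", "dire"}, {"the", "a"}],
    positive_words={"great", "superb", "fine"},
    negative_words={"awful", "dire", "bad"})
```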

Adversarial Triggers
The discovery of polarized concepts in the SST-2 and HSD tasks motivated us to question whether the fine-tuned model is learning the actual task or relying on lexical triggers to solve the problem. Adversarial examples have been used in the literature to highlight a model's vulnerability (Kuleshov et al., 2018; Wallace et al., 2019). We show that our polarity concepts can be used to generate such examples using the following formulation. Let C_ec(+ve) = {C^+_1, ..., C^+_M} be the set of latent concepts identified to have a strong affinity towards predicting positive sentiment, and let S = {s_1, ..., s_N} be the sentences in a dev-set that are predicted as negative by the model. We compute the flipping accuracy of each concept C^+_x using the following function:

FA(C^+_x) = (1 / N_a) Σ_{w_i ∈ C^+_x} Σ_{s_j ∈ S} γ(w_i, s_j)

where γ(w_i, s_j) = 1 if prepending w_i to the sentence s_j flips the model's prediction from negative to positive, and 0 otherwise. Here N_a is the total number of adversarial examples that were generated, which equates to |C^+_x| × |S|. The same formulation applies to the concepts that acquire affinity towards the negative class. We compute the flipping accuracy of each polarized concept on a small hold-out set. Table 2 shows the average flipping accuracy of the top-5 polarized concepts for each class (positive/negative in SST-2 and toxic/non-toxic in the HSD task) across the final three layers on the test set. We observed that just by prepending words from highly polarized concepts, we are able to flip the model's prediction up to 91.5% of the time. This shows that these models are fragile and rely heavily on lexical triggers to make predictions. In the case of the Hate Speech Detection task, we observed that while it is easy to make a non-toxic sentence toxic, it is hard to reverse the effect.
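The flipping-accuracy computation above can be sketched with any sentence classifier standing in for the fine-tuned model; here `predict` is a trivial stub of our own, not the paper's model:

```python
def flipping_accuracy(concept_words, negative_sentences, predict):
    """Fraction of (trigger, sentence) pairs for which prepending the
    trigger word flips the prediction from negative to positive."""
    total = len(concept_words) * len(negative_sentences)
    flips = sum(1 for w in concept_words for s in negative_sentences
                if predict(w + " " + s) == "positive")
    return flips / total if total else 0.0

# Stub classifier: fires "positive" whenever a strong trigger is present.
predict = lambda s: "positive" if "wonderful" in s else "negative"
fa = flipping_accuracy(["wonderful", "dull"],
                       ["slow plot", "weak acting"], predict)
# Only the "wonderful" trigger flips the two sentences, so fa == 0.5
```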
Comparing Architectures: We found ALBERT to be an outlier once again, with a high flipping accuracy, which shows that ALBERT relies on these cues more than the other models and is therefore more prone to adversarial attacks.

Related Work
A plethora of papers has been written in the past five years on interpreting deep NLP models. The work in this direction can be broadly classified into: i) post-hoc representation analysis, which probes the contextualized embeddings for the knowledge they have learned (Dalvi et al., 2017; Belinkov et al., 2020; Rogers et al., 2020; Lepori and McCoy, 2020), and ii) causation analysis, which connects input features with model behavior as a whole and at the level of individual predictions (Linzen et al., 2016; Gulordava et al., 2018; Marvin and Linzen, 2018). Our work mainly falls into the former category, although we demonstrated a causal link between the encoded knowledge and model predictions by analyzing the concepts in the final layers and demonstrating how they can be used to generate adversarial examples with lexical triggers. Recent work (Feder et al., 2021; Elazar et al., 2021) formally attempts to bridge the gap by connecting the two lines of work.
Relatively little work has been done on interpreting fine-tuned models. Zhao and Bethard (2020) analyzed the heads encoding negation scope in fine-tuned BERT and RoBERTa models. Merchant et al. (2020) and Mosbach et al. (2020) analyzed linguistic knowledge in pre-trained models and showed that while fine-tuning changes the upper layers of the model, it does not lead to "catastrophic forgetting of linguistic phenomena". Our results resonate with their findings, in that the higher layers learn task-specific concepts.
However, similar to Durrani et al. (2021), we found a depreciation of linguistic knowledge in the final layers. Mehrafarin et al. (2022) showed that the size of the datasets used for fine-tuning should be taken into account to draw reliable conclusions when using probing classifiers. A pitfall of probing classifiers is the difficulty of disentangling the probe's capacity to learn from the actual knowledge learned within the representations (Hewitt and Liang, 2019). Our work differs from all previous work on interpreting fine-tuned models: we do away with the limitations of probing classifiers by using an unsupervised approach.
Our work is inspired by the recent work on discovering latent spaces for analyzing pre-trained models (Michael et al., 2020;Dalvi et al., 2022;Fu and Lapata, 2022;Sajjad et al., 2022).Like Dalvi et al. (2022); Sajjad et al. (2022) we discover encoded concepts in pre-trained models and align them with pre-defined concepts.Different from them, we study the evolution of latent spaces of fine-tuned models.

Conclusion
We studied the evolution of the latent space of pre-trained models when fine-tuned towards a downstream task. Our approach uses hierarchical clustering to find encoded concepts in the representations. We analyzed them by comparing them with the encoded concepts of the base model, human-defined concepts, and task-specific concepts. We showed that the latent space of the fine-tuned models is substantially different from that of their base counterparts. The human-defined linguistic knowledge largely vanishes from the higher layers, which instead encode task-specific concepts relevant to solving the task. Moreover, we showed that these task-specific concepts can be used to generate adversarial examples that flip the predictions of the model up to 91% of the time in the case of the ALBERT Hate Speech model. The discovery of word-level task-specific concepts suggests that the models rely on lexical triggers and are vulnerable to adversarial attacks.

Limitations
Hierarchical clustering is memory-intensive. For instance, clustering 250k representation vectors, each of size 768, consumes 400GB of CPU memory. This limits the applicability of our approach to small and medium data sizes. Moreover, our approach is limited to word-level concepts. The models may also learn phrasal concepts to solve a task. We speculate that the low number of affinity-concept matches in the MNLI task is due to the limitation of our approach in analyzing phrasal units.

C.3 Task-specific Latent Spaces
In Section 4.3 we studied how the concepts in the SST models acquire polarity towards the task. We did not show the base models due to space limitations. Here we show the base models as well, to demonstrate that all concepts had no polarity in the base models.

D Selection of task-specific Latent clusters
Figure 22 shows some task-specific latent clusters from various models and layers.

Figure 1: Comparing the encoded concepts of a model across different layers with: i) the concepts encoded in its base model (dashed lines), ii) human-defined concepts (e.g., POS tags or semantic properties), and iii) task-specific concepts (e.g., positive or negative sentiment class).
Figure 2: Examples of encoded concepts. The size of a word is based on its frequency in the cluster, defined by the number of times different contextual representations of the word were grouped into the same cluster.

Figure 3: Comparing the encoded concepts of the base models with their SST fine-tuned versions. X-axis = base model, Y-axis = fine-tuned model. Each cell in the matrix represents a percentage (aligned concepts / total concepts in a layer) between the base and fine-tuned models. A darker color means a higher percentage. Detailed plots with actual overlap values are provided in the Appendix.

Figure 4: Alignment of the encoded concepts with POS concepts (e.g., determiners, past-tense verbs, superlative adjectives) in the base and fine-tuned SST models. The maximum possible number of concepts per layer is 600 (the total number of clusters). Note that the POS information depreciates significantly in the final layers of the SST-tuned models.

Figure 6: Aligning encoded concepts with the task-specific concepts in the base models and their corresponding SST fine-tuned models.

Figure 7: Polarity concepts in XLM-R models: positive (top left) and negative (top right) in the SST task, and a toxic concept (bottom) in the HSD task.

The concepts with high flipping accuracy can be used to generate adversarial examples.
Figure 8: Comparing encoded concepts when using 600 or 1000 clusters.

Figure 10: Comparing the latent concepts of the base models with their SST and MNLI fine-tuned versions. X-axis = base model, Y-axis = fine-tuned model.

Table 1: Overall presence (percentage of aligned concepts) of human-defined concepts in base (B) versus SST fine-tuned models.

Table 2: Flipping accuracy (%) of top-5 polarized concepts: +ve → −ve = flipping a positive sentence to negative using a negative polarity concept; nt → tx = converting a non-toxic sentence to toxic using a toxic concept.
In Figure 21, we show the same for the Hate-Speech task. We do not show the MNLI task, because we could not find polarity concepts in that task.