Can LLMs facilitate interpretation of pre-trained language models?

Work done to uncover the knowledge encoded within pre-trained language models relies on annotated corpora or human-in-the-loop methods. However, these approaches are limited in scalability and in the scope of interpretation. We propose using a large language model, ChatGPT, as an annotator to enable fine-grained interpretation analysis of pre-trained language models. We discover latent concepts within pre-trained language models by applying agglomerative hierarchical clustering over contextualized representations and then annotate these concepts using ChatGPT. Our findings demonstrate that ChatGPT produces more accurate and semantically richer annotations than human-annotated concepts. Additionally, we showcase how GPT-based annotations empower interpretation analysis methodologies, of which we demonstrate two: probing frameworks and neuron interpretation. To facilitate further exploration and experimentation in the field, we make available a substantial Transformers Concept Net (TCN) comprising 39,000 annotated concepts.


Introduction
A large body of work on interpreting pre-trained language models answers the question: What knowledge is learned within these models? Researchers have investigated the concepts encoded in pre-trained language models by probing them against various linguistic properties, such as morphological (Vylomova et al., 2017; Belinkov et al., 2017a), syntactic (Linzen et al., 2016; Conneau et al., 2018; Durrani et al., 2019), and semantic (Qian et al., 2016; Belinkov et al., 2017b) tasks, among others. Much of the methodology used in these analyses relies heavily on either having access to an annotated corpus that pertains to the linguistic concept of interest (Tenney et al., 2019; Liu et al., 2019a; Belinkov et al., 2020), or involves a human in the loop (Karpathy et al., 2015; Kádár et al., 2017; Geva et al., 2021; Dalvi et al., 2022) to facilitate such an analysis. The use of pre-defined linguistic concepts restricts the scope of interpretation to only very general linguistic concepts, while human-in-the-loop methods are not scalable. We circumvent this bottleneck by using a large language model, ChatGPT, as an annotator to enable fine-grained interpretation analysis.
Generative Pre-trained Transformers (GPT) have been trained on an unprecedented amount of textual data, enabling them to develop a substantial understanding of natural language. As their capabilities continue to improve, researchers are finding creative ways to leverage their assistance for various applications, such as question-answering in financial and medical domains (Guo et al., 2023), simplifying medical reports (Jeblick et al., 2022), and detecting stance (Zhang et al., 2023). We carry out an investigation of whether GPT models, specifically ChatGPT, can aid in the interpretation of pre-trained language models (pLMs).
A fascinating characteristic of neural language models is that words sharing any linguistic relationship cluster together in high-dimensional spaces (Mikolov et al., 2013). Recent research (Michael et al., 2020; Fu and Lapata, 2022; Dalvi et al., 2022) has built upon this idea by exploring representation analysis through latent spaces in pre-trained models. Building on the work of Dalvi et al. (2022), we aim to identify encoded concepts within pre-trained models using agglomerative hierarchical clustering (Gowda and Krishna, 1978) on contextualized representations. The underlying hypothesis is that these clusters represent latent concepts that capture the language knowledge acquired by the model. Unlike previous approaches that rely on pre-defined concepts (Michael et al., 2020; Durrani et al., 2022) or human annotation (Alam et al., 2023) to label these concepts, we leverage the ChatGPT model.
Figure 1: ChatGPT as an annotator: human annotation or taggers trained on pre-defined concepts cover only a fraction of a model's concept space. ChatGPT enables scaling up annotation to include nearly all concepts, including concepts that may not have been manually annotated before.
Our findings indicate that the annotations produced by ChatGPT are semantically richer and more accurate than the human-annotated concepts (for instance, BERT Concept Net). Notably, ChatGPT correctly labeled the majority of concepts deemed uninterpretable by human annotators. Using an LLM like ChatGPT improves both scalability and accuracy. For instance, the work in Dalvi et al. (2022) was limited to 269 concepts in the final layer of the BERT-base-cased (Devlin et al., 2019) model, while human annotations in Geva et al. (2021) were confined to 100 keys per layer. Using ChatGPT, the exploration can be scaled to the entire latent space of the models and many more architectures. We used GPT to annotate 39K concepts across 5 pre-trained language models. Building upon this finding, we further demonstrate that GPT-based annotations empower methodologies in interpretation analysis, of which we show two: i) the probing framework (Belinkov et al., 2017a), and ii) neuron analysis (Antverg and Belinkov, 2022).
Probing Framework We train probes from GPT-annotated concept representations to explore concepts that go beyond conventional linguistic categories. For instance, instead of probing for named entities (e.g., NE:PER), we can investigate whether a model distinguishes between male and female names, or probe for "Cities in the southeastern United States" instead of NE:LOC.

Neuron Analysis
Another line of work that we illustrate to benefit from GPT-annotated latent concepts is neuron analysis, i.e., discovering neurons that capture a linguistic phenomenon. In contrast to the holistic view offered by representation analysis, neuron analysis highlights the role of individual neurons (or groups of them) within a neural network (Sajjad et al., 2022). We obtain neuron rankings for GPT-annotated latent concepts using a neuron ranking method called Probeless (Antverg and Belinkov, 2022). Such fine-grained interpretation analyses of latent spaces enable us to see how neurons distribute in hierarchical ontologies. For instance, instead of simply identifying neurons associated with POS:Adverbs, we can now uncover how neurons are distributed across sub-concepts such as adverbs of time (e.g., "tomorrow") and adverbs of frequency (e.g., "daily"). Or, instead of discovering neurons for named entities (e.g., NE:PER), we can discover neurons that capture "Muslim Names" versus "Hindu Names".
To summarize, we make the following contributions in this work:
• Our demonstration reveals that ChatGPT offers comprehensive and precise labels for latent concepts acquired within pLMs.
• We showcase that GPT-based annotations of latent concepts empower methods in interpretation analysis, which we demonstrate through two applications: probing classifiers and neuron analysis.
• We release the Transformers Concept Net (TCN), an extensive dataset containing 39K annotated concepts, to facilitate the interpretation of pLMs.

Methodology
We discover latent concepts by applying clustering on feature vectors (§2.1). They are then labeled using ChatGPT (§2.2) and used for fine-grained interpretation analysis (§2.3 and §2.4). A visual representation of this process is shown in Figure 1.

Concept Discovery
Contextualized word representations learned in pre-trained language models can identify meaningful groupings based on various linguistic phenomena. These groups represent concepts encoded within pLMs. Our investigation expands upon the work done in discovering latent ontologies in contextualized representations (Michael et al., 2020; Dalvi et al., 2022). At a high level, feature vectors (contextualized representations) are first generated by performing a forward pass on the model, and are then grouped using agglomerative hierarchical clustering.
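As a concrete illustration, the clustering step can be sketched as follows. This is a minimal sketch, not the paper's exact implementation: SciPy's hierarchical clustering with Ward linkage is our assumption, and `discover_concepts` is a name we introduce.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def discover_concepts(vectors, n_clusters=600):
    """Group contextualized token vectors (one row per token occurrence)
    into latent concepts via agglomerative hierarchical clustering."""
    Z = linkage(vectors, method="ward")  # bottom-up merge tree over tokens
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    concepts = {}
    for token_idx, cluster_id in enumerate(labels):
        concepts.setdefault(cluster_id, []).append(token_idx)
    return concepts  # cluster id -> indices of token occurrences
```

Each returned cluster is a candidate latent concept: the word occurrences it contains are what would later be shown to the annotator.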

Concept Annotation
Encoded concepts capture latent relationships among words within a cluster, encompassing various forms of similarity such as lexical, syntactic, semantic, or specific patterns relevant to the task or data. Figure 2 provides illustrative examples of concepts encoded in the BERT-base-cased model. This work leverages recent advancements in prompt-based approaches, enabled by large language models such as GPT-3 (Brown et al., 2020). Specifically, we utilize a zero-shot learning strategy, where the model is solely provided with a natural language instruction describing the task of labeling the concept. We used ChatGPT with the following zero-shot prompt to annotate the latent concepts:

Assistant is a large language model trained by OpenAI
Instructions: Give a short and concise label that best describes the following list of words: ["word 1", "word 2", ..., "word N"]
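For reference, the request above can be assembled programmatically. The sketch below builds the chat messages only; the system and instruction strings follow the paper, while the role-based message structure is our assumption about the chat API format, and the actual client call is omitted.

```python
def build_annotation_prompt(words):
    """Assemble zero-shot chat messages asking the model to label one
    latent concept (a list of words from a cluster)."""
    system = "Assistant is a large language model trained by OpenAI"
    word_list = ", ".join(f'"{w}"' for w in words)
    instruction = ("Give a short and concise label that best describes "
                   f"the following list of words: [{word_list}]")
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": instruction},
    ]
```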

Concept Probing
Our large-scale annotation of the concepts in pLMs enables training probes towards fine-grained concepts that lack pre-defined annotations. For example, we can use probing to assess whether a model has learned concepts that involve biases related to gender, race, or religion. By tracing the input sentences that correspond to an encoded concept C in a pre-trained model, we create annotations for a particular concept. We perform fine-grained concept probing by extracting feature vectors from the annotated data through a forward pass on the model of interest. Then, we train a binary classifier to predict the concept and use the probe accuracy as a qualitative measure of how well the model represents the concept. Formally, given a set of tokens W = {w_1, w_2, ..., w_N} ∈ C, we generate a sequence of latent representations z^l = {z^l_1, ..., z^l_N}, one for each word w_i, by doing a forward pass over its sentence s_i. We then train a binary classifier over the representations to predict the concept C, minimizing the cross-entropy loss L(θ) = −Σ_i log P_θ(c_i | z_i), where P_θ(c | z_i) is the probability that word w_i is assigned concept c. We learn the weights θ ∈ R^{D×L} using gradient descent. Here D is the dimensionality of the latent representations z_i and L is the size of the concept set, which is 2 for a binary classifier.
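A minimal sketch of such a binary probe: plain logistic regression trained by gradient descent on the cross-entropy loss. Hyperparameters and function names here are illustrative, not the paper's.

```python
import numpy as np

def train_probe(Z, y, lr=0.1, epochs=300):
    """Train a linear binary probe on representations Z (N x D) with
    0/1 concept labels y, minimizing cross-entropy via gradient descent."""
    n, d = Z.shape
    theta, bias = np.zeros(d), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(Z @ theta + bias)))  # P(c=1 | z_i)
        theta -= lr * Z.T @ (p - y) / n                # gradient of the loss
        bias -= lr * np.mean(p - y)
    return theta, bias

def probe_accuracy(Z, y, theta, bias):
    """Fraction of tokens whose concept label the probe predicts correctly."""
    return float(np.mean(((Z @ theta + bias) > 0) == (y == 1)))
```

The probe accuracy on held-out representations is then read as a measure of how well the model encodes the concept.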

Concept Neurons
An alternative area of research in interpreting NLP models involves conducting representation analysis at a more fine-grained level, specifically focusing on individual neurons. Our demonstration showcases how extensive annotations of latent concepts enhance the analysis of neurons towards more intricate concepts. We show this by using a neuron ranking method called Probeless (Antverg and Belinkov, 2022) over our concept representations. The method obtains neuron rankings using an accumulative strategy, where the score of a given neuron n towards a concept C is defined as R(n, C) = μ(C) − μ(Ĉ), where μ(C) is the average of all activations z(n, w), w ∈ C, and μ(Ĉ) is the average of activations over the random concept set. Note that the ranking for each neuron n is computed independently.
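The scoring above can be sketched in a few lines. This is our reading of the Probeless method: each neuron is scored by the gap between its mean activation on the concept set and on a random control set, and neurons are ranked by the magnitude of that gap.

```python
import numpy as np

def probeless_ranking(concept_acts, random_acts):
    """Rank neurons for a concept C. `concept_acts` and `random_acts` are
    (num_words x num_neurons) activation matrices for C and for a random
    control set C-hat, respectively."""
    mu_c = concept_acts.mean(axis=0)   # mu(C): per-neuron mean over C
    mu_r = random_acts.mean(axis=0)    # mu(C-hat): per-neuron mean over control
    scores = np.abs(mu_c - mu_r)       # each neuron scored independently
    return np.argsort(-scores)         # most concept-selective neurons first
```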

Experimental Setup
Latent Concept Data We used a subset of the WMT News 2018 dataset, containing 250K randomly chosen sentences (≈5M tokens). We set a word occurrence threshold of 10 and restricted each word type to a maximum of 10 occurrences. This selection was made to reduce the computational and memory requirements of clustering high-dimensional vectors. We preserved the original embedding space to avoid the information loss incurred by dimensionality reduction techniques such as PCA. Consequently, our final dataset consisted of 25,000 word types, each represented by 10 contexts.
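The frequency filtering described above can be sketched as follows; this is an illustrative reconstruction, and the function and variable names are ours.

```python
import random
from collections import defaultdict

def sample_word_contexts(tokenized_sents, min_freq=10, max_occ=10, seed=0):
    """Keep word types occurring at least `min_freq` times, and sample at
    most `max_occ` (sentence_id, position) contexts for each kept type."""
    occurrences = defaultdict(list)
    for sid, sent in enumerate(tokenized_sents):
        for pos, word in enumerate(sent):
            occurrences[word].append((sid, pos))
    rng = random.Random(seed)
    return {w: rng.sample(locs, min(max_occ, len(locs)))
            for w, locs in occurrences.items() if len(locs) >= min_freq}
```

The kept (word, context) pairs are then fed through the model to extract the contextualized vectors that get clustered.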

Concept Discovery
We apply agglomerative hierarchical clustering on contextualized feature vectors acquired through a forward pass on a pLM for the given data. The resulting representations in each layer are then clustered into 600 groups.

Concept Annotation We used ChatGPT, available through the Azure OpenAI service, to carry out the annotations. We used a temperature of 0 and a top_p value of 0.95. Setting the temperature to 0 controls the randomness in the output and produces deterministic responses.
Probing and Neuron Analysis For each annotated concept, we extract feature vectors using the relevant data. We then train linear classifiers with a categorical cross-entropy loss function, optimized using Adam (Kingma and Ba, 2014). Training used shuffled mini-batches of size 512 and was concluded after 10 epochs. We used a 60-20-20 train/dev/test split when training the classifiers. We use the same representations to obtain neuron rankings, carried out with the NeuroX toolkit (Dalvi et al., 2023a).

Results
To validate ChatGPT's effectiveness as an annotator, we conducted a human evaluation. Evaluators were shown a concept through a word cloud, along with sample sentences representing the concept and the corresponding GPT annotation. They were then asked the following questions:
• Q1: Is the label produced by ChatGPT Acceptable or Unacceptable? Unacceptable annotations include incorrect labels or those that ChatGPT was unable to annotate.
• Q2: If a label is Acceptable, is it Precise or Imprecise? While a label may be deemed acceptable, it may not accurately convey the relationship between the underlying words in the concept. This question aims to measure the precision of the label itself.
• Q3: Is the ChatGPT label Superior or Inferior to the human annotation? BCN labels provided by Dalvi et al. (2022) are used as human annotations for this question.
In the first half of Table 1, the results indicate that 90.7% of the ChatGPT labels were considered Acceptable. Within the acceptable labels, 75.1% were deemed Precise, while 24.9% were found to be Imprecise (indicated by Q2 in Table 1). We also computed Fleiss' Kappa (Fleiss et al., 2013) to measure agreement among the 3 annotators. For Q1, the inter-annotator agreement fell in the substantial range on the scale of Landis and Koch (1977). For Q2, however, the agreement was 0.34, indicating only a fair level of agreement among annotators. This was expected due to the complexity and subjectivity of the task in Q2, e.g., annotators' differing knowledge and perspectives on what makes a label precise or imprecise.
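Fleiss' kappa can be computed directly from an item-by-category count matrix; a compact sketch (our own implementation of the standard formula, not the paper's code):

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa. `counts[i, j]` is the number of raters who assigned
    item i to category j; every row must sum to the number of raters."""
    counts = np.asarray(counts, dtype=float)
    n_items = counts.shape[0]
    n_raters = counts[0].sum()
    # Per-item observed agreement, averaged over items.
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement from overall category proportions.
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_e = np.square(p_j).sum()
    return (p_bar - p_e) / (1 - p_e)
```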

ChatGPT Labels versus Human Annotations
Next, we compare the quality of ChatGPT labels to the human annotations in BERT Concept Net (BCN), a human-annotated collection of latent concepts learned within the representations of BERT. BCN, however, was annotated in the form Concept Type:Concept Sub Type (e.g., SEM:entertainment:sport:ice_hockey), unlike the GPT-based annotations, which are natural language descriptions (e.g., Terms related to ice hockey). Despite not being natural language descriptions, these reference annotations prove valuable for a comparative analysis between humans and ChatGPT. For Q3, we presented humans with a word cloud and three options to choose from: whether the LLM annotation is better than, equivalent to, or worse than the BCN annotation. We found that ChatGPT outperformed or matched the BCN annotations in 75.5% of cases, as shown in Table 2. The inter-annotator agreement for Q3 was 0.56, which is considered moderate.

Error Analysis
The annotators identified 58 concepts where the human-annotated BCN labels were deemed superior. We conducted an error analysis of these instances and discuss below the cases where GPT did not perform well.
Sensitive Content Models In 10 cases, the API calls triggered one of the content policy models and failed to provide a label. The content policy models aim to prevent the dissemination of harmful, abusive, or offensive content, including hate speech, misinformation, and illegal activities. Figure 3a shows an example of a sensitive concept that triggered these filters.
Linguistic Ontologies In 8 of the concepts, the human annotations (BCN) were better because the concepts were composed of words related through a lexical, morphological, or syntactic relationship. The default prompt we used to label concepts tends to find semantic similarity between the words, which did not exist in these concepts. For example, Figure 3b shows a concept composed of 3rd person singular present-tense verbs, but ChatGPT incorrectly labels it as Actions/Events in News Articles. Humans, however, are more robust and can fall back on various linguistic ontologies.
The BCN concepts are categorized into semantic, syntactic, morphological, and lexical groups (see Table 3). As observed, both humans and ChatGPT found semantic meaning in the concept in the majority of cases. However, humans were also able to identify other linguistic relations, such as lexical (e.g., grouped by a lexical property like abbreviations), morphological (e.g., grouped by the same part of speech), or syntactic (e.g., grouped by position in the sentence). Note, however, that prompts can be modified to capture a specific linguistic property. We encourage interested readers to see our experiments on this in Appendix A.2-A.3.

Insufficient Context
Sometimes contextual information is important to correctly label a concept. While the human annotators (of the BCN corpus) were provided with the sentences in which the underlying words appeared, we did not provide the same to ChatGPT, to keep the prompt cost-effective. However, providing context sentences in the prompt along with the concept to label resulted in improved labels for 11 of the remaining 40 error cases. Figure 3d shows one such example, where providing contextual information led ChatGPT to correctly label the concept as Cricket Scores as opposed to Numerical Data, the label it gives without seeing contextual information. However, providing context information did not consistently prove helpful. Figure 3c shows a concept where providing contextual information did not result in the accurate label, Rock Bands and Artists in the US, as identified by the humans.
Uninterpretable Concepts Conversely, we also annotated concepts that were considered uninterpretable or non-meaningful by the human annotators of the BCN corpus, and in 21 out of 26 cases, ChatGPT accurately assigned labels to these concepts. ChatGPT's proficiency in processing extensive textual data enables it to provide accurate labels for such concepts. Having established the capability of large language models like ChatGPT to provide rich semantic annotations, we now showcase how these annotations can facilitate extensive fine-grained analysis at scale.

Probing Classifiers
Probing classifiers are among the earlier techniques used for interpretability, aimed at examining the knowledge encapsulated in learned representations. However, their application is constrained by the availability of supervised annotations, which often focus on conventional linguistic knowledge and are subject to inherent limitations (Hewitt and Liang, 2019). We demonstrate that GPT-based annotation of the latent concepts learned within these models enables a direct application to fine-grained probing analysis. By annotating the latent space of five renowned pre-trained language models (pLMs), namely BERT, ALBERT, XLM-R, XLNet, and RoBERTa, we developed a comprehensive Transformers Concept Net. The net encompasses 39,000 labeled concepts, facilitating cross-architectural comparisons among the models. Table 4 showcases a subset of results comparing ALBERT and XLNet through probing classifiers.
We can see that the model learns concepts that may not directly align with pre-defined human ontologies. For example, it learns a concept based on Spanish Male Names or Football team names and stadiums. Identifying how fine-grained concepts are encoded within the latent space of a model enables applications beyond interpretation analysis. For example, it has a direct application in model editing (Meng et al., 2023), which first traces where the model stores a concept and then changes the relevant parameters to modify its behavior. Moreover, identifying concepts associated with gender (e.g., Female names and titles), religion (e.g., Islamic Terminology), or ethnicity (e.g., Nordic names) can aid in elucidating the biases present in these models.

Neuron Analysis
Neuron analysis examines individual neurons or groups of neurons within neural NLP models to gain insights into how the model represents linguistic knowledge. However, similar to general interpretability work, previous studies in neuron analysis are also constrained by human-in-the-loop methods (Karpathy et al., 2015; Kádár et al., 2017) or pre-defined linguistic knowledge (Lakretz et al., 2019; Dalvi et al., 2019; Hennigen et al., 2020). Consequently, the resulting neuron explanations are subject to the same limitations we address in this study.
Our work demonstrates that annotating the latent space enables neuron analysis of the intricate linguistic hierarchies learned within these models. For example, Dalvi et al. (2019) and Hennigen et al. (2020) only carried out analysis using very coarse morphological categories (e.g., adverbs, nouns) in parts-of-speech tags. We now showcase how our discovery and annotation of fine-grained latent concepts leads to a deeper neuron analysis of these models. In our analysis of a BERT-based part-of-speech tagging model, we discovered 17 fine-grained adverb concepts (in the final layer). It is evident that BERT learns a highly detailed semantic hierarchy, as it maintains separate concepts for adverbs of frequency (e.g., "rarely, sometimes") versus adverbs of manner (e.g., "quickly, softly"). We employed the Probeless method (Antverg and Belinkov, 2022) to search for neurons associated with specific kinds of adverbs. We also created a super adverb concept encompassing all types of adverbs, serving as the overarching and generic representation of this linguistic category, and obtained the neurons associated with it. We then compared the neuron ranking obtained from the super concept to the individual rankings from the sub concepts. Interestingly, our findings revealed that the top-ranking neurons responsible for learning the super concept are often distributed among the top neurons associated with the specialized concepts, as shown in Figure 4 for adverbial concepts. The results, presented in Table 5, include the number of discovered sub concepts in the column labeled # Sub Concepts, while the Alignment column indicates the percentage of overlap in the top 10 neurons between the super and sub concepts for each specific adverb concept. The average alignment across all sub concepts is indicated next to the super concept. This observation held consistently across various properties (e.g., Nouns, Adjectives, and Numbers), as shown in Table 5. For further details, please refer to Appendix C.
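The alignment score described above reduces to a simple top-k overlap between neuron rankings; a sketch (function and parameter names are ours):

```python
def topk_alignment(super_ranking, sub_rankings, k=10):
    """Percentage of the super concept's top-k neurons that also appear in
    each sub concept's top-k, plus the average across sub concepts."""
    top_super = set(super_ranking[:k])
    per_sub = [100.0 * len(top_super & set(r[:k])) / k for r in sub_rankings]
    return per_sub, sum(per_sub) / len(per_sub)
```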
Note that previously we could not identify neurons with such specific explanations, such as distinguishing neurons for numbers related to currency values from those for years of birth, or neurons differentiating between cricket- and hockey-related terms. Our large-scale concept annotation enables locating neurons that capture fine-grained aspects of a concept. This enables applications such as manipulating the network's behavior in relation to that concept. For instance, Bau et al. (2019) identified "tense" neurons within Neural Machine Translation (NMT) models and successfully changed the output from past to present tense by modifying the activations of these specific neurons. However, their study was restricted to the very few coarse concepts for which annotations were available.

Related Work
With the ever-evolving capabilities of LLMs, researchers are actively exploring innovative ways to harness their assistance. Prompt engineering, the process of crafting instructions to guide the behavior of and extract relevant knowledge from these oracles, has emerged as a new area of research (Lester et al., 2021; Liu et al., 2021; Kojima et al., 2023; Abdelali et al., 2023; Dalvi et al., 2023b). Recent work has established LLMs as highly proficient annotators. Ding et al. (2022) evaluated GPT-3's performance as a data annotator for text classification and named entity recognition tasks, employing three primary methodologies to assess its effectiveness. Wang et al. (2021) showed that GPT-3 as an annotator can reduce costs by 50-96% compared to human annotation on 9 NLP tasks. They also showed that models trained on GPT-3-labeled data outperformed the GPT-3 few-shot learner. Similarly, Gilardi et al. (2023) showed that ChatGPT achieves higher zero-shot accuracy than crowd-source workers on various annotation tasks, encompassing relevance, stance, topic, and frame detection. Our work differs from this previous work using GPT as an annotator.
We annotate the latent concepts encoded within the embedding space of pre-trained language models.We demonstrate how such a large scale annotation enriches representation analysis via application in probing classifiers and neuron analysis.

Conclusion
The scope of previous studies in interpreting neural language models is limited to general ontologies or small-scale manually labeled concepts. In our research, we showcase the effectiveness of Large Language Models, specifically ChatGPT, as a valuable tool for annotating the latent spaces of pre-trained language models. This large-scale annotation of latent concepts broadens the scope of interpretation from human-defined ontologies to all concepts learned within the model, and eliminates the human-in-the-loop effort for annotating these concepts. We release a comprehensive GPT-annotated Transformers Concept Net (TCN) consisting of 39,000 concepts, extracted from a wide range of transformer language models. TCN empowers researchers to carry out large-scale interpretation studies of these models. To demonstrate this, we employ two widely used techniques in the field of interpretability: probing classifiers and neuron analysis. This novel dimension of analysis, previously absent in earlier studies, sheds light on intricate aspects of these models. By showcasing the superiority, adaptability, and diverse applications of ChatGPT annotations, we lay the groundwork for a more comprehensive understanding of NLP models.

Limitations
We list below the limitations of our work:
• While it has been demonstrated that LLMs significantly reduce the cost of annotation, the computational requirements and response latency can still become a significant challenge when dealing with extensive or high-throughput annotation pipelines like ours. In some cases it is important to provide contextual information along with the concept to obtain an accurate annotation, causing the cost to go up. Nevertheless, this is a one-time cost for any specific model, and there is optimism that future LLMs will become more cost-effective to run.
• Existing LLMs are deployed with content policy filters aimed at preventing the dissemination of harmful, abusive, or offensive content. However, this prevents the models from effectively labeling concepts that reveal sensitive information, such as cultural and racial biases learned within the model being interpreted. For example, we were unable to extract a label for racial slurs in the hate speech detection task. This restricts our concept annotation approach to tasks that are not sensitive to the content policy.
• The information in the world is evolving, and LLMs will require continuous updates to reflect the accurate state of the world. This may pose a challenge for some problems (e.g., a news summarization task) where the model needs to reflect an updated state of the world.

Appendix

A.1 Prompt Design

Initially, we used a simple prompt to ask the model to provide labels for a list of words, keeping the system description unchanged:

Assistant is a large language model trained by OpenAI
Prompt Body: Give the following list of words a short label: ["word 1", "word 2", ..., "word N"]

The output format from this first prompt was unclear, as it included illustrations, which was not our intention. After multiple design iterations, we developed a prompt that returned the labels in the desired format. In the revised prompt, we modified the system description as follows:

Assistant is a large language model trained by OpenAI.
Instructions: When asked for labels, only the labels and nothing else should be returned.
We also modified the prompt body to:

Give a short and concise label that best describes the following list of words: ["word 1", "word 2", ..., "word N"]

Figure 5 shows some sample concepts learned in the last layer of BERT-base-cased along with their labels.

A.2 Prompts For Lexical Concepts
During the error analysis (Section 4.2), we discovered that GPT struggled to accurately label concepts composed of words sharing a lexical property, such as a common suffix. We were able to address this issue by curating the prompt to effectively label such concepts: we modified the prompt to identify concepts that contain common n-grams.
Give a short and concise label describing the common ngrams between the words of the given list. Note: Only one common ngram should be returned. If there is no common ngram reply with 'NA'

Using this improved prompt, we were able to correct 100% of the labeling errors in the concepts having lexical coherence. See Figure 7a for an example: with the default prompt, the concept was labeled as Superlative and ordinal adjectives, while with the modified prompt it was labeled as Hyphenated, cased & -based suffix.

A.3 Prompts for POS Concepts
Similarly, we were able to modify the prompt to correctly label concepts made from words sharing a common part of speech. Of the prompts we tested, the best performing one is below:

Give a short and concise label describing the common part of speech tag between the words of the given list. Note: The part of speech tag should be chosen from the Penn Treebank. If there's no common part of speech tag reply with 'NA'

In Figure 7b, we present an example of a concept labeled as Surnames with 'Mc' prefix. However, it is important to note that not all the names in this concept actually begin with the "Mc" prefix. A more appropriate label for this concept would be NNP: Proper Nouns or SEM: Irish Names. With the POS-based prompt, we are able to achieve the former.

A.4 Providing Context
Our analysis revealed that including contextual information is crucial for accurately labeling concepts in certain cases. As shown in Figure 8, concepts were incorrectly labeled as Numerical Data despite representing different entities. Incorporating context enables us to obtain more specific labels. However, we face limitations in the number of input tokens we can provide to the model, which impacts the quality of the labels. Using a context of 10 sentences, we were able to correct 9 of the 38 erroneous labels.

A.5 Other Details
Tokens Versus Types We observed that the quality of labels is influenced by word frequency in the given list. Using tokens instead of types leads to more meaningful labels. However, when the latent concept includes hate speech words, passing a token list results in failed requests due to content policy violations. In such cases, we opted to pass the list of types instead. Although this mitigates the issue to a certain extent, it does not completely resolve it.

Figure 2: Illustrative examples of concepts learned in BERT: word groups organized based on (a) Lexical, (b) Parts of Speech, and (c) Semantic property

Figure 3: Failed cases for ChatGPT labeling: (a) non-labeled concepts due to LLM content policy, (b) failing to identify the correct linguistic relation, (c) imprecise labeling, (d) imprecise labels despite providing context

Figure 5: Sample concepts learned in the last layer of BERT

Figure 7: Illustrating lexical and POS concepts: (a) a concept that exhibits multiple lexical properties, such as being hyphenated and cased; ChatGPT assigns a label based on the shared "-based" ngram found among most words in the cluster. (b) ChatGPT labeled this concept as NNP (proper noun).

Table 2: Annotation for Q3 with 3 choices: GPT is better, labels are equivalent, human annotation is better.

Table 5: Neuron analysis on super concepts extracted from the BERT-base-cased-POS model. The Alignment column shows the intersection between the top 10 neurons of the super concept and the sub concepts. For detailed results, see Appendix C (Table 11).

Table 6: Prompting ChatGPT to label a concept with keywords instead of one label.

Table 11: Neuron analysis results on super concepts extracted from the BERT-base-cased model. The Alignment column shows the intersection between the top 10 neurons of the super concept and the sub concepts.