Unraveling Feature Extraction Mechanisms in Neural Networks

The underlying mechanism of neural networks in capturing precise knowledge has been the subject of consistent research effort. In this work, we propose a theoretical approach based on Neural Tangent Kernels (NTKs) to investigate such mechanisms. Specifically, considering the infinite network width, we hypothesize that the learning dynamics of target models intuitively unravel the features they acquire from training data, deepening our insights into their internal mechanisms. We apply our approach to several fundamental models and reveal how these models leverage statistical features during gradient descent and how those features are integrated into final decisions. We also discover that the choice of activation function can affect feature extraction. For instance, the use of the ReLU activation function could introduce a feature bias, providing a plausible explanation for its replacement with alternative functions in recent pre-trained language models. Additionally, we find that while self-attention and CNN models may exhibit limitations in learning n-grams, multiplication-based models seem to excel in this area. We verify these theoretical findings through experiments and show that they can be applied to analyze language modeling tasks, which can be regarded as a special variant of classification. Our contributions offer insights into the roles and capacities of fundamental components within large language models, thereby aiding the broader understanding of these complex systems.


Introduction
Neural networks have become indispensable across a variety of natural language processing (NLP) tasks. There has been growing interest in understanding their successes and interpreting their characteristics. One line of work attempts to identify possible features captured by them for NLP tasks (Li et al., 2016; Linzen et al., 2016; Jacovi et al., 2018; Hewitt and Manning, 2019; Vulić et al., 2020). These works mainly develop empirical methods to verify hypotheses regarding the semantic and syntactic features encoded in the output. Such works may yield interesting findings, but the models still remain black boxes to us. Another line seeks to reveal the internal mechanisms of neural models using mathematical tools (Levy and Goldberg, 2014; Saxe et al., 2013; Arora et al., 2018; Bhojanapalli et al., 2020; Merrill et al., 2020; Dong et al., 2021; Tian et al., 2023), which can be more straightforward and insightful. However, few of them have specifically focused on the feature extraction of neural NLP models.
When applying neural models to downstream NLP tasks in practice, we often notice that some modules perform better than others on specific tasks, while some exhibit similar behaviors. We may wonder what mechanisms lie behind such differences and similarities between those modules. By acquiring deeper insights into the roles those modules play in a complex model with respect to feature extraction, we will be able to select or even design more suitable models for downstream tasks.
In this work, we propose a novel theoretical approach to understanding the mechanisms through which fundamental models (often used as modules in complex models) acquire features during gradient descent in text classification tasks. The evolution of model output can be described as learning dynamics involving NTKs (Jacot et al., 2018; Arora et al., 2019), which are typically used to study various properties of neural networks, including convergence and generalization. While these representations can be complex in practice, when the width of the network approaches infinity, they tend to converge to less complex representations and remain asymptotically constant (Jacot et al., 2018), allowing us to intuitively interpret the learning dynamics and identify the relevant features captured by the model.
We applied our approach to several fundamental models, including a multi-layer perceptron (MLP), a convolutional neural network (CNN), a linear recurrent neural network (L-RNN), a self-attention (SA) model (Vaswani et al., 2017), and a matrix-vector (MV) model (Mitchell and Lapata, 2009), and show that the MLP, CNN, and SA models may behave similarly in capturing token-label features, while the MV and L-RNN models extract different types of features. Our contributions include:
• We propose an approach to theoretically investigate feature extraction mechanisms for fundamental neural models.
• We identify significant factors such as the choice of activation and unveil the limitations of these models, e.g., both the CNN and SA models may not effectively capture meaningful n-gram information beyond individual tokens.
• Our experiments validate the theoretical findings and reveal their relevance to advanced architectures such as Transformers (Vaswani et al., 2017).
Our intention through this work is to provide new insights into the core components of complex models. By doing so, we aim to contribute to the understanding of the behaviors exhibited by state-of-the-art large language models and facilitate the development of enhanced model designs.

Related Work
Probing features for NLP models Probing linguistic features is an important topic for verifying the interpretability of neural NLP models. Li et al. (2016) employed a visualization approach to detect linguistic features, such as negation, captured by the hidden states of LSTMs. Linzen et al. (2016) examined the ability of LSTMs to capture syntactic knowledge using number agreement in English subject-verb dependencies. Jacovi et al. (2018) studied whether CNN models could capture n-gram features. Vulić et al. (2020) presented a systematic analysis to probe the knowledge that pre-trained language models implicitly capture. Chen et al. (2020) proposed an algorithm to detect hierarchical feature interactions for text classifiers. Empirically, such work reveals that neural NLP models can capture useful and interpretable features for downstream tasks. Our work seeks to explain how neural NLP models capture such features.

Analysis
We use learning dynamics to describe the updates of neural models during training, with the aim of identifying potentially useful properties. For ease of presentation and discussion, we focus on binary text classification.
Model Description Assume we have a training dataset denoted by D, consisting of m labeled instances. Let X and Y represent all the sentences and labels in the training dataset, respectively. x ∈ X is an instance consisting of a sequence of tokens, and y ∈ Y is the corresponding label. The vocabulary size is |V|. Consider a binary text classification model with y ∈ {−1, +1}. The model output at time t, denoted as s(t) ∈ R, is s(t) = f(x; θ_t), where θ_t (a vector) is the concatenation of all the parameters, which are functions of time t. We refer to the model output s(t) as the label score at time t. This score is used for classification decisions: positive if s(t) > 0 and negative otherwise.
Learning Dynamics The evolution of a label score can be described by learning dynamics, which may indicate interesting properties. Let f_t(X) ∈ R^m represent the concatenation of all the outputs of training instances at time t, and y ∈ Y is the desired label. Given a test input x′, the corresponding label score s′(t) follows the dynamics

ṡ′(t) = −Θ_t(x′, X) ∇_{f_t(X)} L,

where Θ_t(x′, X) is the NTK at time t and L is the empirical loss defined as

L = −Σ_{(x,y)∈D} log g(y s^{(x)}),

where g is the sigmoid function. For simplicity, we will omit the time stamp t in our subsequent notations. The dynamics ṡ′ will obey

ṡ′ = Σ_{(x,y)∈D} y g(−y s^{(x)}) Θ(x′, x),   (4)

where s^{(x)} is the label score for the training instance x. Obtaining closed-form solutions for the differential equation in Equation 4 is a challenge. We thereby consider an extreme scenario with infinite network width, as suggested by Lee et al. (2018).
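As a quick numerical sanity check (not from the paper), the identity driving Equation 4 — that the derivative of the per-instance log-sigmoid loss with respect to the label score is −y·g(−ys) — can be verified by finite differences:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Per-instance loss -log g(y*s); its derivative w.r.t. the label score s
# should equal -y * g(-y*s), the factor appearing in the dynamics.
def loss(s, y):
    return -np.log(sigmoid(y * s))

s, y, eps = 0.7, -1.0, 1e-6
numeric = (loss(s + eps, y) - loss(s - eps, y)) / (2 * eps)
analytic = -y * sigmoid(-y * s)
assert abs(numeric - analytic) < 1e-6
```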
Infinite-Width When the network width approaches infinity, the NTK will converge and stay constant during training (Jacot et al., 2018; Arora et al., 2019; Yang and Littwin, 2021). Therefore, the learning dynamics can be written as

ṡ′ = Σ_{(x,y)∈D} y g(−y s^{(x)}) Θ_∞(x′, x),   (5)

where Θ_∞(x′, x) refers to the converged NTK determined at initialization. This convergence allows us to simplify the representations of the learning dynamics and offers more intuitive insights for analyzing their evolution over time.
There can be certain interesting properties (regarding the trend of the label scores) harnessed by the interaction yΘ_∞(x′, x), where y controls the direction and Θ_∞(x′, x) may indicate the relationship between x′ and x. Certain hypotheses can be drawn from these properties. First, the converged NTK Θ_∞(x′, x) may intuitively represent the interaction between the test input x′ and the training instance x. This could extend to the interaction between the basic units (tokens or n-grams) from x′ and x, as the semantic meaning of an instance can be deconstructed into the combination of the meanings of its basic units (Mitchell and Lapata, 2008; Socher et al., 2012). Second, if Θ_∞(x′, x) depends on the similarity between x′ and x, a more deterministic trend can be predicted for a test input x′ that closely resembles the training instances of a specific type. For example, suppose Θ_∞(x′, x) exhibits a significantly large gain when x′ is similar to x for a particular y; the dynamics will then likely receive significant gains in the desired direction during training, enabling us to predict the trend of the label score.
We thereby propose the following approach to investigate a target model and verify the aforementioned hypotheses: 1) redefine the target model following the settings proposed by Jacot et al. (2018) and Yang and Littwin (2021), which guarantee the convergence of NTKs; 2) obtain the converged NTK Θ_∞(x′, x) and the learning dynamics under the infinite-width condition; 3) analyze the learning dynamics of basic units and reveal possible features.
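The empirical NTK at the heart of this approach is simply the inner product of parameter gradients, Θ(x′, x) = ∇_θ f(x′) · ∇_θ f(x). A minimal finite-width sketch (the toy two-layer network and its 1/√d scaling are illustrative assumptions, not the paper's exact models):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256
W = rng.normal(0.0, 0.1, (d, d))
v = rng.normal(0.0, 0.1, d)

def f(x):
    # toy two-layer network: s = v^T relu(W x) / sqrt(d)
    return v @ np.maximum(W @ x, 0.0) / np.sqrt(d)

def grad(x):
    # gradient of f w.r.t. all parameters (W and v), concatenated
    h = W @ x
    mask = (h > 0.0).astype(float)
    gW = np.outer(v * mask, x) / np.sqrt(d)
    gv = np.maximum(h, 0.0) / np.sqrt(d)
    return np.concatenate([gW.ravel(), gv])

x1, x2 = rng.normal(size=d), rng.normal(size=d)
ntk_12 = grad(x1) @ grad(x2)   # empirical NTK Θ(x1, x2)
ntk_11 = grad(x1) @ grad(x1)   # Θ(x, x) is a squared norm, hence >= 0
assert ntk_11 >= 0.0
assert abs(ntk_12 - grad(x2) @ grad(x1)) < 1e-9
```

At finite width this kernel changes during training; the infinite-width argument is precisely that it stays (asymptotically) fixed at its initialization value.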

Interpreting Fundamental Models
We investigate an MLP model, a CNN model, an SA model, an MV model, and an L-RNN model. Details and proofs for the lemmas and theorems can be found in Appendix A.
Notation Let e ∈ R^{|V|} be the one-hot vector for token e, l(x) be the instance length, W^e ∈ R^{d_in×|V|} be the weight of the embedding layer, and v ∈ R^{d_out} be the final-layer weight. W ∈ R^{d_out×d_in} is the weight of the hidden layer in the MLP model. W^c_k ∈ R^{d_out×d_in} is the kernel weight corresponding to the k-th token in the sliding window in the CNN model. For simplicity, we let d_out = d_in = d. Assume all the parameters are initialized with Gaussian distributions in our subsequent analysis, i.e., W_ij ∼ N(0, σ_w²), W^e_ij ∼ N(0, σ_e²), v_j ∼ N(0, σ_v²), and W^c_ij ∼ N(0, σ_w²), for the sake of NTK convergence.

MLP
Following Wiegreffe and Pinter (2019), given instance x, the output of the MLP is defined as

s = (1/√d) v^⊤ Σ_{j=1}^{l(x)} ϕ((1/√d) W W^e e_j).

The label score s will be used for making classification decisions. ϕ is the element-wise ReLU function, and e_j is the one-hot vector for token e_j. It is not straightforward to analyze s directly, but it can be viewed as the sum of token-level label scores. Since the basic units in this model are tokens, we instead focus on the label score of every single token and study how each contributes to the instance-level label score. When the test input x′ is simply a token e, we can obtain the corresponding NTK with the infinite network width.

Lemma 4.1. When d → ∞, the NTK between the token e and instance x in the MLP model converges to Θ_∞(e, x), a non-negative kernel with coefficient µ = σ_e² σ_w² / (2π) (the full expression is derived in Appendix A).

Note that, for two tokens e_j and e_k, their one-hot vectors satisfy e_j^⊤ e_k = 0 if e_j ≠ e_k and e_j^⊤ e_k = 1 if e_j = e_k. The dot-product Σ_{j=1}^{l(x)} e^⊤ e_j can therefore be interpreted as the frequency of e appearing in instance x.

Theorem 4.2. The learning dynamics of token e's label score obey

ṡ_e = µ Σ_{(x,y)∈D} y g(−y s^{(x)}) ω(e, x) + B,   (8)

where ω(e, x) = Σ_{j=1}^{l(x)} e^⊤ e_j depends on the training data and will not change over time, and B is a term shared by all tokens (Appendix A).
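The quantity ω(e, x) in Theorem 4.2 is just a token count expressed through one-hot dot products, which a tiny example makes concrete (the vocabulary here is a toy assumption):

```python
import numpy as np

# ω(e, x) = Σ_j e^T e_j — with one-hot vectors, the dot-products simply count
# how often token e occurs in instance x.
vocab = {"good": 0, "movie": 1, "bad": 2}

def onehot(tok):
    v = np.zeros(len(vocab))
    v[vocab[tok]] = 1.0
    return v

x = ["good", "movie", "good"]
omega = sum(onehot("good") @ onehot(t) for t in x)
assert omega == 2.0                                       # "good" appears twice
assert sum(onehot("bad") @ onehot(t) for t in x) == 0.0   # "bad" is absent
```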
The non-linearity of the sigmoid function g(−ys) makes it a challenge to obtain a closed-form solution for the dynamics.
However, we can predict trends for the label scores in special cases. Note that the polarity of the first term in Equation 8 depends on yω(e, x) in each training instance. For instance, consider a token that only appears in positive instances, i.e., ω(e, x) > 0 when y = +1 and ω(e, x) = 0 when y = −1. In this case, the first term remains positive and incrementally contributes to the label score s_e throughout the training process. The opposite trend occurs for tokens solely appearing in negative instances. If the impact of the second term is minimal, the label scores of these two types of tokens will be significantly positive or negative after sufficient updates. The final classification decisions are made based on the linear combination of the label scores of the constituent tokens. The second term in Equation 8 is unaffected by ω(e, x) and is shared by all tokens e at each update. It can be interpreted as an induced feature bias. In particular, when this term is sufficiently large, it may cause an imbalance between the tokens co-occurring with the positive label and those co-occurring with the negative label, rendering one type of token more influential than the other for classification.
Theorem 4.2 may explain how the MLP model leverages the statistical co-occurrence features between e and y, as shown in Figure 1, and integrates them into final classification decisions: tokens solely appearing in positive/negative instances will likely contribute in the direction of predicting a positive/negative label.

CNN
We consider the 1-dimensional CNN, with kernel size, stride size, and padding size set to K, 1, and K − 1, respectively. For each sliding window c_j comprising K consecutive tokens, the corresponding feature c_j ∈ R^d can be represented as

c_j = (1/√d) Σ_{k=1}^{K} W^c_k W^e e_{j+k−1},

where W^c_k is the kernel weight corresponding to the k-th token in the sliding window.
The label score of an instance is computed as

s = (1/√d) Σ_{j=−(K−1)}^{l(x)} v^⊤ ϕ(c_j),

where −(K − 1) denotes the position of the leftmost padding token. The first and last K − 1 padding tokens in an instance are represented by zero vectors. ϕ is the element-wise ReLU function.
For brevity, we will denote Σ_{j=−(K−1)}^{l(x)} by Σ_j. Let us focus on a single sliding window and study the learning dynamics of its label score.

Lemma 4.3. Consider a sliding window c consisting of tokens e_1, e_2, …, e_K. When d → ∞, the NTK between c and instance x converges to Θ_∞(c, x) (the full expression is given in Appendix A), where ω_c(c, c_j) denotes the number of shared tokens between c and c_j regardless of positions, and F and H are monotonically increasing, non-negative functions depending on σ_e² σ_w².

The first term in Θ_∞(c, x) captures the token similarity between sliding windows c and c_j regardless of token positions. In the second term, Σ_j H[ω_c(c, c_j)] e_k^⊤ e_{j+k−1} can be viewed as the weighted frequency of token e_k in instance x; when σ_v is sufficiently large, the converged NTK is mainly influenced by the sum of the weighted frequencies of the tokens in c appearing in x.
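The window construction and the position-independent shared-token count ω_c can be sketched as follows (the padding conventions here are illustrative):

```python
from collections import Counter

# Sliding windows of size K, stride 1, with K-1 padding tokens on each side,
# and the shared-token count ω_c(c1, c2), which ignores token positions.
K = 3
PAD = "<pad>"
x = ["not", "a", "good", "film"]
padded = [PAD] * (K - 1) + x + [PAD] * (K - 1)
windows = [padded[j:j + K] for j in range(len(padded) - K + 1)]

def omega_c(c1, c2):
    shared = Counter(c1) & Counter(c2)   # multiset intersection
    shared.pop(PAD, None)                # padding tokens are zero vectors
    return sum(shared.values())

assert len(windows) == len(x) + K - 1
assert omega_c(["a", "good", "film"], ["good", "film", PAD]) == 2
```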
Theorem 4.4. With a sufficiently large σ_v, the learning dynamics of window c's label score approximately obey a linear combination of the weighted dynamics of its constituent tokens (Equation 12; see Appendix A), where ω(e_k, x) = Σ_j H[ω_c(c, c_j)] e_k^⊤ e_{j+k−1}.

Theorem 4.4 indicates that with a sufficiently large σ_v, the learning dynamics for window c may mainly depend on the linear combination of the weighted learning dynamics of its constituent tokens. A similar analysis can be performed on the label score of the sliding window. This may not exactly encode n-grams, which are inherently sensitive to order and can extend beyond their constituent elements. Instead, for each window, it is more akin to the composition model based on vector addition described in the work of Mitchell and Lapata (2009). The second term in Equation 12 may not be zero even if c shares no tokens with x, suggesting there can be an induced feature bias similar to the one in the MLP model.
When c only shares tokens with either positive or negative instances, regardless of position, the corresponding label score will receive relatively large gains in one direction during updates. This means the CNN model also captures co-occurrence features between tokens and labels. Importantly, a single token can also be viewed as a sliding window padded with additional tokens, leading to conclusions about the trend of label scores that mirror those drawn from the MLP model.

SA
We employ a fundamental self-attention module, analogous to the component found in Transformers. The representation of the i-th output in the instance is computed as a weighted sum of token representations,

h_i = Σ_{j=1}^{l(x)} α_ij W^e e_j,

where α_ij is the weight produced by a softmax function,

α_ij = exp(a_ij) / Σ_{k=1}^{l(x)} exp(a_ik).

We define the attention score a_ij from position i to j over the token and positional embeddings, where P_i (P_j) is the positional embedding at position i (j) and is fixed during training. The instance label score is computed as

s = (1/√d) Σ_{i=1}^{l(x)} v^⊤ h_i = Σ_{i=1}^{l(x)} Σ_{j=1}^{l(x)} α_ij s_{e_j},

which can be viewed as a weighted sum of token-level label scores if we define such a score for each token e as s_e = (1/√d) v^⊤ W^e e. We consider the case where the test input is also simply a token e.

Lemma 4.5. When d → ∞, the NTK between the token e and the instance x converges to Θ_∞(e, x), which scales with ρ ω(e, x), where ω(e, x) = Σ_{j=1}^{l(x)} E(α_ij) e^⊤ e_j and ρ = σ_e² + σ_v².

Theorem 4.6 shows that the learning dynamics of token e's label score also depend on the weighted sum of the frequencies of e appearing in x. The learning dynamics of a single token's label score will likely resemble those in the MLP model, capturing the co-occurrence features between tokens and labels despite the weights. Compared with the MLP model as discussed in Theorem 4.2, this model may not experience an induced bias. This will be further explored in our experiments.
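The weighted-sum view of the SA label score can be sketched with hypothetical attention scores and token-level scores (all values below are made up for illustration):

```python
import numpy as np

def softmax(z):
    z = z - z.max()   # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

# Output position i mixes token-level label scores s_{e_j} with weights alpha_ij.
a_i = np.array([2.0, 0.5, -1.0])            # hypothetical attention scores a_ij
alpha_i = softmax(a_i)                      # attention weights, sum to 1
token_scores = np.array([1.2, -0.3, 0.8])   # hypothetical token-level scores s_e
s_i = alpha_i @ token_scores                # weighted sum for position i
assert abs(alpha_i.sum() - 1.0) < 1e-12
# a convex combination stays within the range of the token-level scores
assert token_scores.min() <= s_i <= token_scores.max()
```

This makes the limitation discussed later tangible: the instance score is a (weighted) linear combination of per-token scores, so it cannot express interactions that go beyond its constituents.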

MV
We consider the matrix-vector representation as applied in adjective-noun composition (Baroni and Zamparelli, 2010) and recursive neural networks (Socher et al., 2012). It models each word pair through matrix-vector multiplication. The label score of an instance is defined over the matrix-vector products M(e_j) W^e e_{j+1} of adjacent token pairs, where M(e_j) = diag(W W^e e_j) (diag converts a vector into a diagonal matrix) and j = 1, 2, …, l(x) − 1.

Lemma 4.7. Given a bigram consisting of two tokens e_a e_b, with the infinite network width the NTK will converge to

Θ_∞(e_a e_b, x) = ρ ω(e_a e_b, x),   (20)

where ρ = (σ_e² + 3σ_v²) σ_e² σ_w² and ω(e_a e_b, x) = Σ_j e_j^⊤ e_a e_{j+1}^⊤ e_b.

It is worth highlighting that the interaction e_j^⊤ e_a e_{j+1}^⊤ e_b differs from the interactions arising in the aforementioned models. When e_a ≡ e_j and e_b ≡ e_{j+1} (i.e., e_a e_b ≡ e_j e_{j+1}), the NTK gains a relatively large value, implying the ability to capture co-occurrence knowledge between bigrams and labels. Here, ω(e_a e_b, x) can be viewed as the frequency of bigram e_a e_b seen in instance x. Specifically, when a bigram co-occurs with a positive (negative) label, it will receive a positive (negative) gain during gradient descent.

Table 1: Co-occurrence features captured by target models. "k" in L-RNN refers to the distance from the last tokens. ρ (µ) refers to the non-negative coefficient determined at initialization for each model. e and x refer to a token and an instance, respectively. "-" means not applicable.
We provide the analysis of the L-RNN model in Appendix A. The features captured by different architectures are listed in Table 1.

Experiments
We conduct experiments to validate our analysis in the following aspects: a) verifying the features acquired by our models; b) exploring factors that may affect feature extraction; c) examining the limitations of those models.
Datasets We consider the following datasets. SST: instances with "positive" and "negative" labels extracted from the original Stanford Sentiment Treebank dataset (Socher et al., 2013); we also extract instances with sub-phrases (along with labels) under the name "SSTwsub". Agnews: the AG-news dataset, which consists of titles and description fields of news articles from 4 classes, "World", "Sports", "Business", and "Sci/Tech". IMDB: the binary IMDB dataset (Maas et al., 2011), which consists of movie reviews with relatively longer texts. The statistics are listed in Table 2.
Setup We randomly initialize all the parameters with Gaussian distributions. Unless specified otherwise, the variances of the parameters are set to 0.01. While our analysis is based on vanilla gradient descent, training models using SGD optimizers with a small learning rate can be challenging; therefore, Adagrad (Duchi et al., 2011) optimizers are used in practice. The network width d is set to 64. To verify the features learned by the models, we extract corresponding co-occurrence pairs for each model from the training data. Specifically, for the MLP, CNN, and SA models, we calculate the token-label frequencies from the training data. For example, if a token e co-occurs three times more frequently with the positive label (+) than with the negative label (−), i.e., freq(e, +)/freq(e, −) ≥ 3, we extract the (e, +) pair. For the MV and L-RNN models, we calculate bigram-label and token-label-position frequencies, respectively, and extract co-occurrence pairs in a similar way. For simplicity, tokens (bigrams) co-occurring with the positive/negative label will be referred to as positive/negative tokens (bigrams).

Figure 2: Evolution of the label scores for the extracted tokens from SST over epochs. "pos token" and "neg token" refer to "positive tokens" and "negative tokens" respectively.
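The extraction criterion can be sketched as follows (the data below is a toy stand-in; the threshold of 3 mirrors the freq(e, +)/freq(e, −) ≥ 3 rule described above):

```python
from collections import Counter

# Extract (token, label) pairs whose frequency ratio passes a threshold.
data = [(["good", "fun"], "+"), (["good", "plot"], "+"), (["good"], "+"),
        (["bad", "plot"], "-")]
freq = Counter((tok, y) for toks, y in data for tok in toks)

def extract(threshold=3):
    pairs = []
    for tok in {t for toks, _ in data for t in toks}:
        p, n = freq[(tok, "+")], freq[(tok, "-")]
        if (n == 0 and p >= threshold) or (n > 0 and p / n >= threshold):
            pairs.append((tok, "+"))
        if (p == 0 and n >= threshold) or (p > 0 and n / p >= threshold):
            pairs.append((tok, "-"))
    return sorted(pairs)

assert extract() == [("good", "+")]   # only "good" passes the ratio test
```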

Feature Extraction
We illustrate the label scores for the extracted co-occurrence pairs to examine the features predicted by our approach. It can be seen from Figures 2a, 2b, and 2c that the label scores for tokens in the extracted co-occurrence pairs evolve as expected over epochs for the MLP, CNN, and SA models. The label scores of tokens co-occurring predominantly with the positive label consistently receive positive gains during training, whereas those of tokens co-occurring predominantly with the negative label receive negative gains, thus playing opposite roles in final classification decisions. Similar patterns can be observed on IMDB in Appendix C. We also extract bigrams co-occurring predominantly with either the positive or negative label from SSTwsub and calculate their label scores using a trained MV model, which exhibits the capability of capturing the co-occurrence between bigrams and labels, as shown in Figure 3.
Our analysis of the binary classification tasks extends to the multi-class scenario on the four-class Agnews dataset. The label scores for tokens associated with a specific class are assigned relatively large values in the dimension corresponding to that class, as shown in Figures 4a, 4b, 4c, and 4d. These observations support our analysis of the feature extraction mechanisms within our target models.
In addition, we extend our experiments to language modeling tasks, which can be viewed as a variant of multi-class classification with the label space equal to the vocabulary size. Interestingly, we observe similar token-label patterns in Transformer-based language models incorporating self-attention modules despite their complexity, at both the word and character levels. In particular, we find that nanoGPT, a light-weight implementation of GPT, captures the co-occurrence features between context characters and target characters on the character-level Shakespeare dataset and reflects them in the label scores, as shown in Figure 5. Given a context character, the model's output is more likely to assign higher scores to target characters that predominantly co-occur with this context character in the training data, thereby making those target characters more likely to be predicted. This implies that the significance of a large dataset may be (partially) ascribed to rich co-occurrence information between tokens. Further details can be found in Appendix C.
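The context-character → target-character statistic in question can be illustrated on a toy string (the corpus and counts here are illustrative, not from the Shakespeare data):

```python
from collections import Counter, defaultdict

# Count context-character -> next-character co-occurrences, the statistic a
# character-level language model is argued to capture in its label scores.
text = "to be or not to be"
nxt = defaultdict(Counter)
for c, n in zip(text, text[1:]):
    nxt[c][n] += 1

# In this toy corpus, 'o' is the most frequent successor of 't',
# so a model tracking co-occurrence would score 'o' highest after 't'.
assert nxt["t"].most_common(1)[0][0] == "o"
```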
Induced Bias Our approach also indicates that factors such as activation and initial weight variances could affect feature extraction. We downscale the variances for the final-layer weight vectors at initialization and compare the learning curves of the extracted tokens' label scores from models with different activations. As can be seen from Figures 2d and 2e, a smaller initialization of the final-layer weight variance can lead to a large feature bias, rendering negative tokens less significant than positive ones in the MLP and CNN models. This may not be a desirable situation. We also test activation functions such as tanh, GeLU, and SiLU, which are alternatives to ReLU. Figures 2g and 2h show that these alternatives are more robust than ReLU in the MLP model. This also suggests that while non-linear activations may not significantly alter the nature of the learned features during training, they can affect the balance of the extracted features.
Figure 2f shows the SA model is also robust to the change in initialization. However, incorporating an MLP with ReLU activation after the SA model reintroduces the bias, as can be observed in Figure 2i, suggesting a possible reason why ReLU was replaced in models such as BERT (Devlin et al., 2019), GPT-3 (Brown et al., 2020), and LLaMA (Touvron et al., 2023), despite its presence in the original Transformer architecture (Vaswani et al., 2017).

Models' Limitations
We aim to examine whether the CNN and SA models have a limitation in encoding n-grams whose meanings extend beyond those of their constituent tokens. We choose negation phenomena as our testbed, where a negation token can (partially) reverse the meanings of both positive and negative phrases, an effect that is difficult to achieve by linear combination. We run experiments on the SSTwsub dataset with labeled sub-phrases, which contains rich negation phenomena, i.e., phrases and their negated forms obtained by prepending negation tokens such as not and never. We extract positive and negative adjectives and create their corresponding negation expressions by prepending the negation word not. Figure 6a shows that the SA model can capture negation phenomena for positive adjectives, but it does not perform well for negative adjectives, as shown in Figure 6b. Specifically, prepending a negation word to negative adjectives does not alleviate their negativity as expected but leads to the contrary. Based on our analysis, the polarity of a negation expression relies largely on the linear combination of the tokens' polarities in the SA model. As both the negation word not and negative adjectives are assigned negative scores, their linear combination will still be negative. This is undesirable but not surprising, as recent studies (Liu et al., 2021; Dong et al., 2021; Orvieto et al., 2023) have challenged the necessity of self-attention modules. Similar patterns can also be observed on extracted phrases with negation words, on the CNN model, and even on the Transformer model in Appendix C. Conversely, the MV model captures such negation for negative adjectives, as shown in Figure 6c, demonstrating that the multiplication mechanism may play a more effective role in composing semantic meanings.

Discussion
Our experimental results verify our theoretical analysis of the feature extraction mechanisms employed by fundamental models during training. These findings hold even with network widths as small as d = 64, a scenario in which the infinite-width hypothesis is not fully realized. This underscores the robustness and generalizability of our findings, a conclusion that aligns with the insights presented by Arora et al. (2019), who suggest that as network width expands, the NTK closely approximates the computation under infinite-width conditions while keeping the error within established bounds. In our study, we noted that both the CNN and SA models predominantly rely on the linear combination of token-label features. However, they exhibit limitations in effectively composing n-grams beyond tokens, a deficiency highlighted in negation cases.
This observation points towards a potential necessity for alternative models that are adept at handling tasks involving complex n-gram features. This observation aligns with studies by Bhattamishra et al.

Conclusions
We propose a theoretical approach to delve into the feature extraction mechanisms behind neural models. By focusing on the learning dynamics of neural models under extreme conditions, we can shed light on the useful features acquired from training data. We apply our approach to several fundamental models for text classification and explain how these models acquire features during gradient descent. Meanwhile, our approach allows us to reveal significant factors for feature extraction; for example, an inappropriate choice of activation function may induce a feature bias. Furthermore, we may also infer the limitations of a model based on the features it acquires, thereby aiding in the selection (or design) of an appropriate model for specific downstream tasks. Despite the infinite-width hypothesis, the predicted patterns remain observable at finite widths. Our future directions include analyzing more complex neural architectures.

Limitations
Despite the findings on the aforementioned fundamental models, applying our approach to analyze complex models like Transformers, which incorporate numerous layers, non-linear activation functions, and normalizations, presents challenges due to the increased complexity. These factors contribute to more intricate learning dynamics, making it less straightforward to gain comprehensive insights into a model's behavior. We would like to investigate and formulate them in future work.

A Learning Dynamics of Models with Infinite-width
A.1 MLP model The representation for the instance x is defined as

h(x) = Σ_{j=1}^{l(x)} ϕ((1/√d) W W^e e_j),

and the label score of x is computed as

s = (1/√d) v^⊤ h(x),

where l(x) is the instance length, W ∈ R^{d_out×d_in} is the weight of the hidden layer, W^e ∈ R^{d_in×|V|} is the weight of the embedding layer, and v ∈ R^{d_out} is the final-layer weight. For simplicity, we let d_out = d_in = d. e_j is the one-hot vector for token e_j.
The gradients of the parameters can be computed as

∇_v s = (1/√d) Σ_j ϕ(c_j),  ∇_W s = (1/d) Σ_j D_j v (W^e e_j)^⊤,  ∇_{W^e} s = (1/d) Σ_j W^⊤ D_j v e_j^⊤,

where c_j = (1/√d) W W^e e_j and D_j = diag(ϕ′(c_j)). Note that D_j is a diagonal matrix with elements being either 1s or 0s. Given a test input x′, the learning dynamics of the label score s′ will be

ṡ′ = Σ_{(x,y)∈D} y g(−y s^{(x)}) Θ(x′, x).   (26)

With the gradients, we can obtain the NTK Θ(x′, x) for this MLP model as follows,

Θ(x′, x) = Σ_{i=1}^{l(x′)} Σ_{j=1}^{l(x)} [ (1/d) ϕ(c_i)^⊤ ϕ(c_j) + (1/d²)(v^⊤ D_i D_j v)(e_i^⊤ (W^e)^⊤ W^e e_j) + (1/d²)(v^⊤ D_i W W^⊤ D_j v)(e_i^⊤ e_j) ],   (27)

where i indexes tokens in x′ and j indexes tokens in x. The label score of an instance can be viewed as the sum of the label scores of all the tokens, and the NTK can be viewed as the sum of the interactions between each token pair from the test input x′ and the training instance x.

A.1.1 NTKs under the Infinite-width
It is difficult to analyze such an NTK directly in practice, as the NTK varies over time. However, prior work has proved that the NTK converges and stays constant during training under the infinite-width condition. Let us therefore consider the infinite-width scenario (Lee et al., 2018) and obtain the NTK. We first give the NTK between two instances and then the NTK between an input token and an instance.
We will give the proof along with the proof of Lemma 4.1, where the test input is simply a token e.
Proof. We only need to compute the NTK as the network width grows to infinity, as the NTK's convergence during training has been proved in the work of Jacot et al. (2018); Yang and Littwin (2021).
For the first part in Equation 27, the dot-product between two activation outputs can be written as (1/d) Σ_r ϕ(c_ir) ϕ(c_jr), where r refers to the row index. As elements in W and W^e follow Gaussian distributions, when the network width d → ∞, c_ir and c_jr will also be Gaussian distributed respectively and they follow a Gaussian process (Lee et al., 2018). Based on the work of Cho and Saul (2009), the covariance K(ϕ(c_ir), ϕ(c_jr)) (regardless of r) will be calculated as

K(ϕ(c_ir), ϕ(c_jr)) = (σ_e² σ_w² / 2π)(sin α_ij + (π − α_ij) cos α_ij),

where α_ij = cos⁻¹(e_i^⊤ e_j). Then we arrive at (1/d) Σ_r ϕ(c_ir) ϕ(c_jr) → K(ϕ(c_ir), ϕ(c_jr)).

For the second part in Equation 27, let us look at e_i^⊤ (W^e)^⊤ W^e e_j first. Let M^e = (W^e)^⊤ W^e; the elements of M^e can be computed as M^e_ij = Σ_r W^e_ri W^e_rj, where W^e_ri, W^e_rj ∼ N(0, σ_e²). When d → ∞, (1/d) M^e_ij → σ_e² e_i^⊤ e_j. Next, let us look at v^⊤ D_i D_j v. If e_i^⊤ e_j = 0, which means D_i ≠ D_j, we will arrive at (1/d) v^⊤ D_i D_j v = (1/d) Σ_r D_irr D_jrr v_r² → σ_v²/4, where D_irr and D_jrr are diagonal elements in D_i and D_j respectively and v_r² is the square of the r-th element in v. If e_i^⊤ e_j ≠ 0, which means D_i = D_j, we will have (1/d) v^⊤ D_i D_j v → σ_v²/2. Similarly, we can get the third part in Equation 27. Plugging the above equations into Equation 27, we arrive at Lemma A.1.
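The Cho and Saul (2009) covariance used above can be checked by Monte Carlo for the distinct-token case α = π/2; in the sketch below a single Gaussian weight vector with σ = 1 stands in for the composed pre-activations, so only the functional form of the order-1 arc-cosine kernel is being verified:

```python
import numpy as np

# Monte Carlo check of the order-1 arc-cosine kernel:
# E[relu(w.u) * relu(w.v)] = (sigma^2 / 2pi) * |u||v| * (sin a + (pi - a) cos a),
# where a is the angle between u and v and w_i ~ N(0, sigma^2).
rng = np.random.default_rng(1)
sigma, n = 1.0, 1_000_000
u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])   # orthogonal, one-hot-like vectors: a = pi/2
w = rng.normal(0.0, sigma, (n, 2))
mc = np.mean(np.maximum(w @ u, 0.0) * np.maximum(w @ v, 0.0))
a = np.pi / 2
closed = sigma**2 / (2 * np.pi) * (np.sin(a) + (np.pi - a) * np.cos(a))
assert abs(mc - closed) < 5e-3
```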
NTK between a token and an instance. Next, we give the proof of Lemma 4.1 based on Lemma A.1.
Proof. Note that since $e_i$ and $e_j$ are one-hot vectors, their dot-product satisfies $e_i^\top e_j = 1$ when $e_i \equiv e_j$ and $0$ otherwise. Therefore, $\alpha_{ij} = \cos^{-1}(e_i^\top e_j)$ can only be $\frac{\pi}{2}$ or $0$, and the kernel can be simplified accordingly. This means the converged kernel $\Theta^\infty(x', x)$ remains non-negative during training, and the direction of the dynamics depends on the label $y$ in Equation 26.
Let us look at the token-label features learned in the dynamics. As previously mentioned, the instance label score can be viewed as the sum of the token label scores. Consider the scenario where the test input $x'$ is simply a token $e$; in the NTK $\Theta^\infty(e, x)$, the dot-product $\sum_{j=1}^{l^{(x)}} e^\top e_j$ can be interpreted as the frequency of $e$ appearing in instance $x$. We thereby arrive at Lemma 4.1.
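With one-hot embeddings, the quantity $\sum_j e^\top e_j$ in Lemma 4.1 is simply a term frequency. A minimal sketch (the tiny vocabulary is our own illustrative choice):

```python
import numpy as np

# Toy vocabulary, chosen only for illustration.
vocab = {"good": 0, "movie": 1, "plot": 2, "bad": 3}

def one_hot(token):
    v = np.zeros(len(vocab))
    v[vocab[token]] = 1.0
    return v

def omega(token, instance):
    # sum_j e^T e_j: with one-hot vectors, each dot-product is 1 iff the
    # tokens match, so the sum is the frequency of `token` in `instance`.
    e = one_hot(token)
    return float(sum(e @ one_hot(t) for t in instance))

x = ["good", "movie", "good", "plot"]
assert omega("good", x) == 2.0
assert omega("bad", x) == 0.0
```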

A.1.2 Features Encoded in Gradient Descent
With Lemma 4.1 and Equation 5, we can obtain Theorem 4.2. Under the infinite width, the dynamics of token $e$'s label score obey an ODE in which $\omega(e, x)$ is the frequency of token $e$ in instance $x$; $\omega(e, x)$ depends on the training data and does not change over time. We cannot give a closed-form solution for this ODE due to the nonlinearity of the sigmoid function $g(-ys)$. However, as $g(-ys^{(x)})$ is non-negative, the label scores exhibit certain interesting trends. Note that the first term $A$ in Equation 41 depends on this token's term frequencies $\omega(e, x)$ in each training instance; for example, if $e$ does not appear in an instance $x$, then $\omega(e, x) = 0$. The second term $B$ depends on the entire training set and is shared by all tokens $e$.
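Although the ODE in Theorem 4.2 has no closed form, it can be simulated numerically. Below is a toy Euler integration of the token-score dynamics in the stated form; the constants `rho_A`, `B`, the learning rate, and the two-instance dataset are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Two-instance toy dataset; rho_A, B, and lr are illustrative constants.
data = [(["good", "fun"], +1), (["bad", "dull"], -1)]
scores = {t: 0.0 for t in ["good", "fun", "bad", "dull"]}
rho_A, B, lr = 1.0, 0.0, 0.5

for _ in range(200):
    # Instance score = sum of its token label scores.
    inst = [sum(scores[t] for t in x) for x, _ in data]
    upd = {t: 0.0 for t in scores}
    for (x, y), s_x in zip(data, inst):
        g = sigmoid(-y * s_x)  # g(-y s^(x))
        for t in scores:
            upd[t] += y * g * (rho_A * x.count(t) + B)
    for t in scores:
        scores[t] += lr * upd[t] / len(data)  # rho/m folded into lr / m

# Tokens seen only in positive (negative) instances drift positive (negative).
assert scores["good"] > 0 and scores["bad"] < 0
```

Setting `B` to a large positive value in this sketch reproduces the bias effect discussed next: positive-token scores inflate while negative-token scores stall near zero.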

A.1.3 Bias Induced in Gradient Descent
Let us look at the term $B$ in Equation 41, which can be viewed as an induced feature bias shared by all tokens. It is affected by the variances and the instance lengths. Suppose the term $B$ is sufficiently large; in that case, the positive tokens and the negative tokens will be pushed in different directions by this bias during training. For example, if $B$ is large and positive, it contributes positively to the learning dynamics of positive tokens, making their label scores much larger than 0 after sufficient updates, whereas the negative tokens will have weakened learning dynamics and end up with label scores close to 0.
In Equation 41, both $\omega(e, x)$ and $l^{(x)}$ are determined by the training instances. Therefore, the factors $\rho$ and $\mu$ defined in Equation 39 can affect the influence of this induced bias. It can be inferred that a significantly large variance $\sigma_v$ can make $\rho$ much larger than $\mu$, thus reducing the influence of the bias.

A.2 CNN Model
We consider the 1-dimensional CNN, which is commonly used in NLP tasks. The kernel size, stride, and padding are set to $K$, 1, and $K - 1$, respectively.
For each sliding window $c_j$ consisting of $K$ consecutive tokens, the corresponding feature $c_j \in \mathbb{R}^{d_{out}}$ can be represented via the kernel weights, where $W^c_k \in \mathbb{R}^{d_{out} \times d_{in}}$ is the kernel weight corresponding to the $k$-th token in the sliding window, $W^e \in \mathbb{R}^{d_{in} \times V}$ is the embedding matrix, and $e$ is the one-hot vector for token $e$. Given an input $x$, the label score is computed over all sliding windows, where $-(K-1)$ denotes the position of the leftmost padding token. The first $K - 1$ and last $K - 1$ tokens in an instance are padding tokens represented by zero vectors. $\phi$ is the element-wise ReLU function.
For brevity, we denote $\sum_{j=-(K-1)}^{l^{(x)}}$ by $\sum_j$. The gradients can be computed accordingly. For a test input $x'$, the dynamics of its label score follow the usual NTK form.

A.2.1 NTK under the Infinite-width

It should be highlighted that, even when the network width approaches infinity, it may not be easy to describe the converged NTK with an explicit closed-form expression due to the integrals used in obtaining expectations. However, we will show that the converged NTK can be written as functions of the parameter variances and the similarity between sliding windows.
Lemma A.2. Assume we initialize the parameters following Gaussian distributions. When the network width approaches infinity, given two instances $x'$ and $x$, the NTK $\Theta(x', x)$ for the CNN model converges to a kernel in which $c_i$ and $c_j$ are sliding windows starting from the $i$-th token and the $j$-th token in instances $x'$ and $x$, respectively. The functions $\omega_c$ and $\omega_2$ are defined as stated, and the functions $F$ and $H$ are defined as stated, where $n$ ($0 \le n \le K$) is the number of tokens shared by two sliding windows and $w_i, w'_i \sim N(0, \sigma_e^2 \sigma_w^2)$. $\phi$ and $D$ are the ReLU function and the step function, respectively. Remark. It can be seen that the similarity between sliding windows influences the converged NTK. As the variances are constants, we can focus on $\omega_c$ and $\omega_2$, which can be viewed as similarity metrics for sliding windows: the former does not take positional information into account, while the latter does. In particular, $\omega_c(c_i, c_j) \ge \omega_2(c_i, c_j)$, and $\omega_2(c_i, c_j)$ becomes large when the two sliding windows share tokens in the same order.
Proof of Lemma A.2. We give the proofs for each part of the NTK shown in Equation 48. We first prove that the ReLU output multiplication (term A in Equation 48) can be written as a function of $F(n)$ under the infinite-width condition.
Proof. The ReLU output multiplication can be written elementwise, where $r$ refers to the $r$-th element. $W^c_k$ and $W^e$ follow Gaussian distributions and their elements are i.i.d. (independent and identically distributed) random variables. With infinite network width, given a token $e$, the elements $w$ of $W^c_k \frac{1}{\sqrt{d}} W^e e$ can also be viewed as i.i.d. and follow a Gaussian process (Lee et al., 2018); we obtain $w \sim N(0, \sigma_e^2 \sigma_w^2)$. Similarly, $c_{ir}$ and $c_{jr}$ follow Gaussian distributions.
When the network width approaches infinity, the multiplication converges to its expectation, and we arrive at a function of $n$, the number of shared tokens between sliding windows $c_i$ and $c_j$.
We next prove that term B in Equation 48 can be written as a function of $H(n)$ under the infinite-width condition.
Proof. With infinite network width, we can write the term elementwise, where $r$ is the row number and $D_{irr}$ and $D_{jrr}$ refer to the $r$-th diagonal elements of $D_i$ and $D_j$, respectively. Term B then obeys the stated form. Let $w_i \sim N(0, \sigma_e^2 \sigma_w^2)$. $D_{irr}$ ($D_{jrr}$) equals 1 when the corresponding $c_{ir} > 0$ ($c_{jr} > 0$), and 0 otherwise. We then obtain a function of $n$, the number of shared tokens between sliding windows $c_i$ and $c_j$.
For the third part (term C in Equation 48), when the network width approaches infinity, the term converges to its expectation; this term is positive if the two windows $c_i$ and $c_j$ share tokens, and 0 otherwise. Now, let us look at the $F$ and $H$ functions, which have interesting properties regarding the similarity between sliding windows.
Proposition A.1. Both the $F$ and $H$ functions in Lemma A.2 are monotonically increasing as the non-negative integer $n$ increases: given two non-negative integers $n$ and $n'$ with $n' > n$, the two functions obey $F(n') \ge F(n)$ and $H(n') \ge H(n)$. Remark. This indicates that the more similar two sliding windows are (i.e., the more tokens they share), the larger $F$ and $H$ will be.
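Proposition A.1 can be checked empirically for the $F$ function. The sketch below (our own illustration) draws two window pre-activations as sums of $K$ i.i.d. Gaussian weights, $n$ of which are shared, and estimates $E[\phi(c_i)\phi(c_j)]$ by Monte Carlo; the choices $K = 4$ and unit variance are assumptions for the demo.

```python
import numpy as np

def F_mc(n, K=4, sigma=1.0, trials=500_000, seed=0):
    # Monte Carlo estimate of E[relu(c_i) relu(c_j)], where c_i and c_j are
    # each sums of K i.i.d. N(0, sigma^2) weights, n of which are shared.
    rng = np.random.default_rng(seed)
    shared = rng.normal(0.0, sigma, (trials, n)).sum(axis=1)
    ci = shared + rng.normal(0.0, sigma, (trials, K - n)).sum(axis=1)
    cj = shared + rng.normal(0.0, sigma, (trials, K - n)).sum(axis=1)
    return float(np.mean(np.maximum(ci, 0) * np.maximum(cj, 0)))

vals = [F_mc(n) for n in range(5)]
# More shared tokens -> larger expectation (monotonicity of F).
assert all(vals[i] < vals[i + 1] for i in range(4))
```

At $n = K$ the two windows coincide and the estimate approaches $E[\phi(c)^2] = K\sigma^2/2$, the maximum of the kernel.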
The core idea leveraged in the proofs is the inequality $\mathrm{Var}(x) = E(x^2) - E(x)^2 \ge 0$, where $x$ is a random variable following a Gaussian distribution.

Monotonicity of F function
Proof. First, let us consider the scenario $n > 0$, i.e., the two sliding windows share tokens.
The expectation can be computed accordingly. The expectation for the multiplication of two sliding windows sharing no tokens can be written similarly, where $w_i$ and $w'_i$ are i.i.d. random variables and $n \ge 1$. This indicates that when two sliding windows share tokens, the expectation is larger than in the case where they share none.
We can prove that one more shared token between two sliding windows results in an increase in the expectation. Increasing the number of shared tokens by 1, the expectation can be rewritten and bounded, and we arrive at Equation 64, i.e., $F(n+1) \ge F(n)$. We can then proceed recursively for the case $F(n+l) \ge F(n)$ with $l > 1$. Therefore, the expectation is monotonically increasing in $n$. Now let us focus on a single sliding window $c$. The converged NTK between a sliding window $c$ (consisting of tokens $e_1, e_2, \ldots, e_K$) and an instance $x$ obeys a similar form. If $c$ shares no tokens with any of the sliding windows in $x$, the NTK $\Theta^\infty(c, x)$ reaches its minimum, namely $\Theta^\infty(c, x) = \sum_j F(\omega_c(0))$.
Otherwise, $\Theta^\infty(c, x)$ will be significantly large if $c$ bears similarity to the sliding windows of $x$. The dynamics of the label score of $c$ then obey $\dot{s}_c = \frac{\rho}{m} \sum_{(x,y)\in D} y\, g(-y s^{(x)})\, \Theta^\infty(c, x)$.
The sliding windows can be interpreted as n-grams. Suppose an n-gram represented by a sliding window $c$ bears similarity only to the sliding windows in positive (negative) instances. In that case, it receives large gains in one direction during training and will likely end up with a significant positive (negative) label score.
Let us examine what features will be learned for a single token e.We define a positional-relevance label score as follows, which reflects the label score of token e in the k-th position of a sliding window.
It should be noted that, based on our analysis, the kernel size $K$ in Equation 10 does not affect the monotonicity of the $F$ and $H$ functions. As long as the sliding window $c$ shares tokens with instance $x$, the NTK captures these features regardless of the kernel size $K$.

A.3 SA Model
Let us define an intermediate score $s_j$, corresponding to the label score of the $j$-th output. The gradients can be computed accordingly, where $l^{(x)}$ is the instance length and $\delta_{jk} = 1$ if $j \equiv k$ and 0 otherwise. $P_k$ is the positional embedding at position $k$.
We assume the parameters are initialized as $W^e_{ij} \sim N(0, \sigma_e^2)$ and $v_j \sim N(0, \sigma_v^2)$, and that the distribution of the attention weights is independent of the parameters $W^e$ and $v$. Consider the case where the test input is simply a token $e$. As the network width approaches infinity, the cross terms vanish and the NTK between the token $e$ and the instance $x$ converges to $\Theta^\infty(e, x)$, where $E(\alpha_{ij})$ is the expectation of $\alpha_{ij}$ when the elements of $W^e$ obey $W^e_{ij} \sim N(0, \sigma_e^2)$.

A.4 MV Model
The label score of an instance is defined as stated, where $M(e_j) = \mathrm{diag}(W W^e e_j)$ ($\mathrm{diag}$ converts a vector into a diagonal matrix) and $j = 1, 2, \ldots, l^{(x)} - 1$.
The proof sketch is as follows. The gradients can be computed as stated, where $\odot$ refers to element-wise multiplication.
Given a bigram $e_a e_b$, the NTK $\Theta(e_a e_b, x)$ can be computed accordingly, and it captures the co-occurrence between bigrams.
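The bigram co-occurrence statistic $\omega(e_a e_b, x)$ that this kernel exposes is just a positional bigram count. A minimal sketch (the example sentence is our own):

```python
def bigram_count(a, b, instance):
    # omega(e_a e_b, x): number of positions j where (x_j, x_{j+1}) == (a, b).
    return sum(
        1 for j in range(len(instance) - 1)
        if instance[j] == a and instance[j + 1] == b
    )

x = ["not", "good", "not", "good", "plot"]
assert bigram_count("not", "good", x) == 2
assert bigram_count("good", "not", x) == 1
```

Unlike the token frequency $\omega(e, x)$ of the MLP analysis, this statistic is order-sensitive: `("not", "good")` and `("good", "not")` are counted separately.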

A.5 L-RNN Model
We follow the work of Emami et al. (2021) and Gu et al. (2021) and focus on a linear RNN, whose hidden state is defined recursively, where $W^h \in \mathbb{R}^{d \times d}$, $W \in \mathbb{R}^{d \times d}$, and the initial hidden state is a zero vector. We can expand the hidden states across time steps. The label score of an instance is computed from the final hidden state, where $T$ is the final time step. Note that $T - j$ is the distance between the current token and the last token in an instance, and $s_j$ can be viewed as the label score for the token at distance $T - j$ from the last token. The gradients are calculated for $T > 1$; when $T = 1$, $\partial s / \partial W^h$ does not exist.
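The expansion of the linear RNN across time steps can be verified numerically: unrolling $h_t = W^h h_{t-1} + W W^e e_t$ from $h_0 = 0$ gives $h_T = \sum_j (W^h)^{T-1-j} W (W^e e_j)$. The sketch below checks the two forms agree; the dimensions, variances, and the precomputed embeddings `emb` (stand-ins for $W^e e_j$) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 5
Wh = rng.normal(0.0, 0.3, (d, d))   # recurrent weights W^h
W = rng.normal(0.0, 0.3, (d, d))    # input projection W
emb = rng.normal(0.0, 1.0, (T, d))  # stand-ins for W^e e_j, one per token

# Recursive form: h_t = W^h h_{t-1} + W (W^e e_t), with h_0 = 0.
h = np.zeros(d)
for t in range(T):
    h = Wh @ h + W @ emb[t]

# Unrolled form: h_T = sum_j (W^h)^{T-1-j} W (W^e e_j).
h_unrolled = sum(
    np.linalg.matrix_power(Wh, T - 1 - j) @ W @ emb[j] for j in range(T)
)

assert np.allclose(h, h_unrolled)
```

The unrolled form makes the positional dependence explicit: token $e_j$ contributes through the power $(W^h)^{T-1-j}$, i.e., through its distance from the last token.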
Proof. The product of $W^h$ and its transpose can be computed elementwise, where each element $w_{ij}$ of $W^h$ follows $w_{ij} \sim N(0, \sigma_h^2)$. Each element $w'$ of the product then satisfies the stated form, where $I \in \mathbb{R}^{d \times d}$ is an identity matrix. We can also obtain a similar result for integer powers $k > 0$, where the elements of the vectors $v_\alpha, v_\beta \in \mathbb{R}^d$ are Gaussian distributed with zero means. Based on this, we arrive at the claimed relation for $k' > k$ (both integers); a similar conclusion holds for $k' < k$.
Given instances $x'$ and $x$ with label scores $s'$ and $s$ respectively, we obtain the stated decomposition. Similarly, we can obtain the terms $\langle \frac{\partial s'}{\partial W^e}, \frac{\partial s}{\partial W^e} \rangle$, $\langle \frac{\partial s'}{\partial W}, \frac{\partial s}{\partial W} \rangle$, and $\langle \frac{\partial s'}{\partial W^h}, \frac{\partial s}{\partial W^h} \rangle$.
Let $k = T - j$; we can rewrite $s_j$ in Equation 77 as the label score for token $e$ at position $k$ from the last token. We thereby define an NTK $\Theta(e, k, x)$ to represent the interaction between token $e$ at position $k$ and instance $x$.
When the network width approaches infinity, the NTK $\Theta(e, k, x)$ converges to a deterministic kernel $\Theta^\infty(e, k, x)$, where $e$ is a token and $k$ is a non-negative integer. For consistency, we can also replace the instance length with $l^{(x)}$.

A.6 Multi-class Classification
Compared to the binary architecture in the main paper, the last linear layer is modified to project the hidden states into an $L$-dimensional vector ($L$ is the size of the label space), and the sigmoid layer is replaced by a softmax layer.
The label score of an instance is described as a vector $s(t)$ with dimension $L$; for brevity, we omit the notation $t$.
We can obtain the probability distribution over all labels, where $p \in \mathbb{R}^L$. The cross-entropy loss is used, computed as stated, where $p^{(x)}$ refers to the softmax output for instance $x$ and $y^{(x)}$ is the one-hot label vector for instance $x$.
The derivative of $L$ with respect to the output $s$ is computed as stated. Given a test input $x'$, the learning dynamics of its output under the infinite-width network can be described via the converged NTK $\Theta^\infty(x', x) \in \mathbb{R}^{L \times L}$ determined at initialization. We can carry out a similar analysis on these dynamics: $(y^{(x)} - p^{(x)})$ is positive only in the dimension where $y^{(x)}_z = 1$ ($z$ is the dimension index in $y$). Suppose $\Theta^\infty(x', x)$ works in a way that increases the influence of the corresponding dimension $z$ in $\dot{s}'$ when $x'$ is associated with label $z$; then the label score corresponding to that label receives a positive gain and grows large.
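The factor $(y - p)$ driving the multi-class dynamics follows from the standard identity $\partial L / \partial s = p - y$ for cross-entropy over softmax. The sketch below (our own check, with arbitrary logits) verifies it by finite differences and confirms the sign pattern used in the analysis.

```python
import numpy as np

def softmax(s):
    z = np.exp(s - s.max())
    return z / z.sum()

s = np.array([0.5, -1.0, 2.0])  # arbitrary logits ("label scores")
y = np.array([0.0, 0.0, 1.0])   # one-hot gold label
p = softmax(s)

# Finite-difference check that dL/ds = p - y for L = -sum_z y_z log p_z.
eps = 1e-6
grad_fd = np.zeros_like(s)
for i in range(len(s)):
    sp, sm = s.copy(), s.copy()
    sp[i] += eps
    sm[i] -= eps
    grad_fd[i] = (-(y @ np.log(softmax(sp))) + (y @ np.log(softmax(sm)))) / (2 * eps)

assert np.allclose(grad_fd, p - y, atol=1e-5)
# (y - p) is positive only in the gold dimension, negative elsewhere.
assert (y - p)[2] > 0 and (y - p)[0] < 0 and (y - p)[1] < 0
```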

B Influence of Activation Functions
$\phi = I$. In this case, the model is linear. When the network width approaches infinity, $\Theta(x', x)$ converges during training to a deterministic NTK $\Theta^\infty(x', x)$ (Equation 93). Given a token $e$, the NTK $\Theta^\infty(e, x)$ shows that the kernel is affected by the frequency with which $e$ is seen in instance $x$. This is similar to the MLP model with the ReLU activation function, but without the induced bias. $\phi = \tanh$. Suppose $\phi = \tanh$; then the activation-output product can be written as a sum over rows, where $r$ is the row index and $E[\tanh(c_{er}) \tanh(c_{jr})]$ is a constant regardless of $r$.
It can be inferred that $c_{er}$ follows a Gaussian distribution with zero mean. As $\tanh$ is an odd function, we arrive at $E[\tanh(c_{er})] = 0$.
Suppose $e^\top e_j = 0$, namely $e \neq e_j$; then $\tanh(c_{er})$ and $\tanh(c_{jr})$ are two independent random variables and $E[\tanh(c_{er}) \tanh(c_{jr})] = 0$. Similarly, when $e \neq e_j$, we obtain the stated result, where $D_{jrr}$ is the $r$-th diagonal element of $D_j$.
Given token $e$, the converged NTK can be computed accordingly. This indicates that the MLP model with the tanh activation does not have the induced bias.
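The two facts used above, that $\tanh$ of a zero-mean Gaussian has zero mean while ReLU does not, can be confirmed by a quick Monte Carlo sketch (our own illustration, unit variance assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
c_e = rng.normal(0.0, 1.0, 1_000_000)  # pre-activation for token e
c_j = rng.normal(0.0, 1.0, 1_000_000)  # independent pre-activation, e != e_j

# tanh is odd, so its mean under a zero-mean Gaussian vanishes, and the
# cross-term for independent inputs factorizes into a product of ~0 means.
assert abs(np.tanh(c_e).mean()) < 1e-2
assert abs((np.tanh(c_e) * np.tanh(c_j)).mean()) < 1e-2

# ReLU, by contrast, has a strictly positive mean (about 1/sqrt(2*pi)),
# which is the source of the induced bias for the MLP model.
assert np.maximum(c_e, 0).mean() > 0.3
```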

C More Experimental Results
The statistics of the language modeling datasets are listed in Table 4. The performances are listed in Table 5.
Extracted Tokens & Bigrams We automatically extracted tokens associated with specific labels, as shown in Table 6. The token not appears more often in negative instances, so it is assigned negative label scores. Therefore, adding the token not to a positive adjective weakens its positive polarity, while for a negative adjective, adding not strengthens its negative polarity. This also implies a limitation of such models: they rely largely on token-label features. We also extracted 65 phrases (fewer than 11 words; 17 positive and 48 negative) starting with the negation words not, never, and hardly from SSTwsub. We computed their label scores and the label scores of their sub-phrases constructed by removing the negation words. Figure 9 shows that the differences between the label scores of the subphrases and the phrases are all positive, indicating that the negation words play negative roles in a linear combination and do not reverse the polarity of negative subphrases.

C.3 Feature Extraction
SiLU Figure 10 shows that SiLU can also prevent an induced bias in the captured features.

Adam Optimizer
We conducted experiments with the Adam optimizer and observed patterns (shown in Figure 11) similar to those from the Adagrad optimizer in the main paper.
L-RNN Extracting sufficient tokens that appear in a specific position and in a specific category of instances is not easy in real-world datasets like SST. Instead, we created a synthetic dataset (1,000 positive and 1,000 negative instances) based on three types of tokens: one type is seen in positive instances at a fixed distance from the last token; another is seen in negative instances at a fixed distance from the last token; the remaining tokens appear randomly in both positive and negative instances at random positions. In this experiment, we set the fixed distance to 2 and used the Adagrad optimizer. Figure 12 shows that when k = 2, the positive and negative tokens are assigned significant label scores, while when k = 0 and k = 4, the label scores are less significant, supporting our analysis of the L-RNN model.
IMDB IMDB is a dataset with relatively long instances. Our findings can also be observed on the IMDB dataset (Figure 13). In particular, we observe a feature bias on IMDB (Figure 13c) when using the MLP model with ReLU, again supporting our analysis in the main paper that ReLU may cause a feature bias. However, we did not observe an obvious performance decline on the IMDB dataset.

Figure 1: Example of co-occurrence features between tokens and labels (self-attention model).
where $E(\alpha_{ij})$ is the expectation of $\alpha_{ij}$. Theorem 4.6. The learning dynamics of the label score of a token $e$ obey $\dot{s}_e = \frac{\rho}{m} \sum_{(x,y)\in D} g(-ys^{(x)})\, y\, \omega(e, x)$. Theorem 4.8. The dynamics of the label score of the test bigram $e_a e_b$ obey $\dot{s}_{ab} = \frac{\rho}{m} \sum_{(x,y)\in D} g(-ys^{(x)})\, y\, \omega(e_a e_b, x)$.

Figure 3: Distribution of the label scores for extracted bigrams from SSTwsub. "p" refers to positive and "n" refers to negative.

Figure 4: Label scores for extracted tokens from Agnews, a dataset with four classes. SA model. d = 64.

Figure 5: Distribution of the label scores for target characters majorly (blue) and rarely co-occurring with each extracted context character. nanoGPT. Shakespeare dataset.

Figure 6: Label scores for the extracted positive adjectives (pos/p adj) and negative adjectives (neg/n adj), as well as their negation expressions. SSTwsub. "[-]" refers to the negation operation.
We can compute the expectation $E[D_{irr} D_{jrr}]$ (outputs of the step function) similarly to the ReLU output multiplication.

$\int_{-\infty}^{\infty}$ refers to the integrals over all the variables involved.
Monotonicity of H function We can obtain the expectations of the other terms in the NTK similarly, as the activation function $\phi$ can be replaced with different activation functions with non-negative outputs. Proofs of Lemma 4.3 and Theorem 4.4 With Lemma A.2 and Proposition A.1, we can prove Lemma 4.3 by replacing the test input $x'$ with the sliding window $c$. Similarly, we can prove Theorem 4.4 with Lemma 4.3.

Figure 8: Label scores for the positive adjectives (pos adjectives) and negative adjectives (neg adjectives) as well as their negation expressions. "TR" refers to the Transformer model (one head, one layer).
Figure 9:

Figure 10: Label scores for extracted positive tokens and negative tokens from SST. MLP with SiLU.
Figure 11: Label scores for the extracted tokens from SST. Adam optimizer.

Figure 12: Label scores for the positive tokens (pos token) and negative tokens (neg token) at different positions for the L-RNN model. k refers to the distance from the last tokens. Synthetic dataset.

Table 2: Dataset statistics. "Train", "Valid", and "Test" refer to the training, validation, and test sets, respectively. "|V|" refers to the vocabulary size and "Len" refers to the average training instance length.
Table 3 suggests a performance decline for the MLP model with ReLU. Furthermore, we compare other activation functions. As previously, we can obtain that the elements of $e_i^\top (W^e)^\top W^\top$ and $W W^e e_j$ are Gaussian distributed.