Emergent Modularity in Pre-trained Transformers

This work examines the presence of modularity in pre-trained Transformers, a feature commonly found in human brains and thought to be vital for general intelligence. In analogy to human brains, we consider two main characteristics of modularity: (1) functional specialization of neurons: we evaluate whether each neuron is mainly specialized in a certain function, and find that the answer is yes; (2) function-based neuron grouping: we explore finding a structure that groups neurons into modules by function, such that each module works for its corresponding function. Given the enormous number of possible structures, we focus on Mixture-of-Experts as a promising candidate, which partitions neurons into experts and usually activates different experts for different inputs. Experimental results show that there are functional experts, in which the neurons specialized in a certain function are clustered. Moreover, perturbing the activations of functional experts significantly affects the corresponding function. Finally, we study how modularity emerges during pre-training and find that the modular structure is stabilized at an early stage, earlier than neuron stabilization. This suggests that Transformers first construct the modular structure and then learn fine-grained neuron functions. Our code and data are available at https://github.com/THUNLP/modularity-analysis.


Introduction
Recently, pre-trained Transformers have shown the potential to achieve general intelligence (Brown et al., 2021; Fei et al., 2022; Reed et al., 2022; OpenAI, 2023; Bubeck et al., 2023), which encourages researchers to explore the analogy between Transformers and human brains (Toneva and Wehbe, 2019; Caucheteux et al., 2021; Caucheteux and King, 2022; Goldstein et al., 2022). These works have shown that the behaviors of Transformers resemble those of human brains. Do the internal structures of Transformers also mirror those of human brains in order to achieve similar behaviors?
Neuroscientists have found that the structure of neuron organization in human brains follows a modular pattern (Bullmore and Sporns, 2009;Meunier et al., 2010), which has two main characteristics.
(1) Functional specialization of neurons: each neuron is mainly specialized in a certain function.
(2) Function-based neuron grouping: neurons with the same function are clustered in a local region, and each function relies on a specific region. In this work, we wonder whether Transformers also organize neurons in a modular way.
(Q1) Are neurons functionally specialized? The functional specialization of neurons is the basis of modularity. To answer this question, we propose a novel framework to analyze the functionality of neurons in Transformers. In this framework, we study three representative functions with a unified method: semantic function (Scarlini et al., 2019a; Suau et al., 2020), knowledge function (Jiang et al., 2020; Dai et al., 2022), and task function (Wang et al., 2022b). Experimental results show that after self-supervised learning on large-scale corpora, the neurons in pre-trained Transformers become much more specialized than those in randomly-initialized models. Moreover, in pre-trained Transformers, there are several groups of neurons, each of which excels in a specific function.
(Q2) Is there a modular structure of neurons? In analogy to human brains, a modular structure should group neurons with the same function together as modules, and each module should play a crucial role in a specific function. Since there are thousands of neurons in a Transformer, it is impractical to iterate over all possible structures of neurons. We consider the grouping of Mixture-of-Experts (MoE) (Jacobs et al., 1991; Fedus et al., 2022) as a promising candidate, which gracefully partitions neurons into experts and is widely used in Transformers. Moreover, most MoE models are sparsely activated, which is similar to human brains. Specifically, we study two types of MoE models, pre-partitioned MoE (pre-MoE) and post-partitioned MoE (post-MoE). Pre-MoE refers to model architectures that expand feedforward layers by MoE before pre-training to improve model capacity (Fedus et al., 2022). Post-MoE refers to models that are converted from vanilla Transformers into their MoE versions by partitioning feedforward layers into experts after pre-training (Zhang et al., 2022b), which provides a way to group neurons without changing parameters or the forward pass.
By studying the function distribution over experts, we find that both pre-MoE and post-MoE Transformers have a strong tendency to concentrate the neurons excelling in a certain function into a few experts. Moreover, perturbing the activations of 3% of the experts by function leads to performance close to random guessing, a more significant effect than perturbing the same number of individual neurons specialized in the function. Therefore, the MoE structure indeed reflects the modularity of pre-trained Transformers.
(Q3) How does modularity emerge during pre-training? By analyzing the pre-training process, we find that the functions of experts are stabilized to a large extent at the early stage (around 15% of the total training steps) for both pre-MoE and post-MoE Transformers, which is faster than neuron stabilization (around 75% of the total steps). Our findings provide evidence for a coarse-to-fine mechanism of pre-training, which first constructs the coarse modular structure and then learns fine-grained neuron functions.

Related Work
Interpreting Pre-trained Transformers. Although pre-trained Transformers have achieved great success in the field of NLP (Min et al., 2021; Bommasani et al., 2021), their inner working mechanism is still a black box. Researchers have explored interpreting pre-trained Transformers (Rogers et al., 2020) from different perspectives, such as hidden representations (Liu et al., 2019), attention matrices (Voita et al., 2019; Clark et al., 2019b), and output distributions (Petroni et al., 2019). Among them, studying the behavior of a single neuron is an important branch (Radford et al., 2017; Sajjad et al., 2022; Bills et al., 2023), which is most related to our work. In the neurons of feedforward layers, researchers have found various kinds of encoded information such as concepts, facts, and task abilities (Suau et al., 2020; Dai et al., 2022; Wang et al., 2022b). In this work, we study how these neurons are organized to form a modular structure, which is a new interpretation perspective.
Transformers with MoE. Sparse MoE is usually used to enlarge the model capacity of Transformers while keeping computational efficiency (Lepikhin et al., 2021; Lewis et al., 2021; Fedus et al., 2022). Specifically, for a given input, MoE conditionally selects a subset of experts to process the input and then combines the outputs of these experts to generate the final output. Beyond computational efficiency, MoE is also used to implement modular Transformers (Gururangan et al., 2022; Zhang et al., 2022a; Pfeiffer et al., 2022; Wang et al., 2022a). These works explicitly design extra constraints during pre-training to ensure the modularization of expert networks. However, it is still unclear whether modular experts can emerge naturally in Transformers during pre-training.

Functionality Evaluation
In this section, we first introduce the definition of neurons and how to evaluate the functionality of neurons and experts. Then, we briefly introduce the evaluation setups, including the pre-trained models.
Neurons in Transformers. The FFN is a two-layer MLP that computes FFN(x) = W^O σ(W^I x + b^I) + b^O, where W^I ∈ R^{d_ff × d} and W^O ∈ R^{d × d_ff} are the weight matrices, b^I and b^O are the bias vectors, d is the dimension of the input and output, d_ff is the dimension of the intermediate hidden states, and σ is the activation function. For simplicity, we discard the bias terms in the following part. For fine-grained analysis, we dissect the FFN into neurons and rewrite it as

FFN(x) = Σ_{i=1}^{d_ff} σ(W^I_{i,:} · x) W^O_{:,i},    (1)

where W^I_{i,:} and W^O_{:,i} are the i-th row and column of W^I and W^O, respectively. The FFN output is thus the sum of the outputs of all neurons. From this perspective, we define a neuron n_i as the pair of the row vector W^I_{i,:} and the column vector W^O_{:,i}; the activation of n_i is σ(W^I_{i,:} · x). The number of neurons in an FFN equals the intermediate hidden dimension d_ff. Sparse Mixture-of-Experts in Transformers is a variant of the FFN (Lepikhin et al., 2021), which significantly increases the model capacity by adding more parameters while keeping the computational cost affordable. In MoE layers, each expert is an FFN, and the output of the MoE layer is the weighted sum of the outputs of all experts. We can also define the neurons of MoE layers as in Equation 1. The neuron-based form of MoE is provided in Appendix A.1.
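To make the neuron-wise decomposition concrete, the following NumPy sketch (toy dimensions, random weights, ReLU standing in for σ; all names are illustrative and not from the paper's released code) checks that summing per-neuron contributions recovers the full FFN output:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff = 8, 32  # toy dimensions; e.g. T5-base uses d=768, d_ff=3072

W_I = rng.normal(size=(d_ff, d))  # input projection: row i belongs to neuron n_i
W_O = rng.normal(size=(d, d_ff))  # output projection: column i belongs to neuron n_i
relu = lambda z: np.maximum(z, 0.0)  # the activation function sigma

x = rng.normal(size=d)

# Standard two-layer FFN (bias terms dropped, as in the text).
ffn_out = W_O @ relu(W_I @ x)

# Neuron-wise form: neuron n_i is the pair (W_I[i, :], W_O[:, i]); its
# activation is relu(W_I[i, :] @ x), and its output is that scalar times
# W_O[:, i].  The FFN output is the sum over all d_ff neurons.
neuron_out = sum(relu(W_I[i] @ x) * W_O[:, i] for i in range(d_ff))

assert np.allclose(ffn_out, neuron_out)
```

The same decomposition applies inside each expert of an MoE layer, since each expert is itself an FFN of this form.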
Predictivity for Functions. To comprehensively study the functions of neurons and experts, we cover three typical functions. For each function, we construct various sub-functions, each of which is a fine-grained version of the function; there are 576 sub-functions in total. Here we introduce the three functions and their sub-functions.
Semantic Function. Semantic function refers to the ability to understand the meaning of input texts. In this work, we focus on how neurons capture the patterns of word senses. We use a large-scale dataset with word-sense annotations, OneSec (Scarlini et al., 2019b), to construct binary classification data for semantic sub-functions. In OneSec, each sentence has a keyword whose sense is annotated based on Wikipedia. We first filter out the keywords that have only one sense in the dataset and then randomly select 100 sentences for each sense. For each sense pair of a word, we construct a binary classification dataset by labeling the sentences with one sense as positive and the sentences with the other sense as negative. Finally, we have 529 semantic sub-functions, each of which is a binary classification problem that distinguishes two senses of a word. For the experiments on Switch Transformer in Section 6, we randomly sample 100 semantic sub-functions for evaluation.
Knowledge Function. Knowledge function refers to the ability to memorize factual knowledge. In this work, we focus on factual triples, which are used to construct knowledge graphs. We define a knowledge sub-function as a binary classification task that identifies whether a triple is correct. Specifically, we sample triples from Wikidata as positive instances and randomly replace their head or tail entities to construct negative instances. We group these instances according to their relations, and each relation has its corresponding knowledge sub-function. There are 39 knowledge sub-functions from T-REx (Elsahar et al., 2018; Elazar et al., 2021), and each sub-function has 400 instances.
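As a minimal sketch of this construction (a hypothetical three-triple knowledge base stands in for Wikidata; the entity names and the 50/50 head-vs-tail corruption choice are illustrative assumptions):

```python
import random

random.seed(0)

# Hypothetical mini knowledge base: (head, relation, tail) triples.
triples = [
    ("Paris", "capital_of", "France"),
    ("Berlin", "capital_of", "Germany"),
    ("Rome", "capital_of", "Italy"),
]
entities = sorted({h for h, _, _ in triples} | {t for _, _, t in triples})

def make_instances(triples, entities):
    """Positive = real triple (label 1); negative = the same triple with a
    corrupted head or tail entity (label 0), grouped per relation so that
    each relation yields one knowledge sub-function."""
    data = {}
    for h, r, t in triples:
        data.setdefault(r, []).append(((h, r, t), 1))
        if random.random() < 0.5:  # corrupt the head
            h_neg = random.choice([e for e in entities if e != h])
            data[r].append(((h_neg, r, t), 0))
        else:                      # corrupt the tail
            t_neg = random.choice([e for e in entities if e != t])
            data[r].append(((h, r, t_neg), 0))
    return data

data = make_instances(triples, entities)
assert len(data["capital_of"]) == 6  # 3 positives + 3 negatives
```

In the paper's setting each relation's sub-function would contain 400 such instances instead of six.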
Task Function. Task function refers to the ability to perform downstream tasks. Previous work has shown that training a small part of the parameters of pre-trained Transformers can achieve performance comparable to full-parameter fine-tuning (Lester et al., 2021; Hu et al., 2022), so Transformers are supposed to learn substantial task knowledge from pre-training. In this work, we use several classification datasets from GLUE (Wang et al., 2018), including SST-2 (Socher et al., 2013), QQP, MNLI (Williams et al., 2018), CoLA (Warstadt et al., 2018), MRPC (Dolan and Brockett, 2005), RTE (Dagan et al., 2006), and QNLI (Rajpurkar et al., 2016). There are 8 task sub-functions in total because MNLI is split into two binary classification tasks. To stimulate these sub-functions, we adopt the input templates provided by Raffel et al. (2020) to improve neuron predictivity. For each label class, we randomly sample 1000 instances for evaluation if the number of instances in the class is larger than 1000.
Admittedly, our function classification is coarse. It does not cover all functions learned by pre-trained Transformers, and there are interactions between functions, so some overlap is unavoidable. However, we focus on a unified framework and a concrete evaluation approach, which can be easily generalized to other ways of classifying functions; our contribution is thus independent of the specific classification. We adopt this classification simply because of its typicality.
To evaluate the ability of a neuron to capture the pattern of a sub-function, we compute the predictivity of the neuron activations for the sub-function. Following Suau et al. (2020) and Wang et al. (2022b), we focus on the sub-functions that can be formulated as binary classification problems.
We denote the dataset of a sub-function as D = {(s_i, y_i)}, where s_i is the input sequence and y_i ∈ {0, 1} is the label. Since the computation of FFNs is word-wise, we define the activation of the neuron n_j on a sequence s_i as a_ij = max_{x_k ∈ s_i} σ(W^I_{j,:} · x_k), where {x_1, x_2, ..., x_l} are the hidden states of s_i and l is the length of s_i. Then, we have the pairs of neuron activations and labels, A_j = {(a_ij, y_i)}. Based on A_j, we compute the average precision (AP) (Zhu, 2004) of the neuron activations as the predictivity of the neuron n_j. For each expert, we compute the average AP of all neurons in the expert as the predictivity of the expert for the sub-function. Please refer to Appendix A.2 for more details.
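A small NumPy sketch of this predictivity computation (hand-rolled average precision, assuming no tied scores across classes; the toy hidden states and the ReLU activation are illustrative assumptions):

```python
import numpy as np

def average_precision(scores, labels):
    """AP of `scores` for binary `labels` (1 = positive): the mean of
    precision@rank over the ranks at which positives appear."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    labels = np.asarray(labels)[order]
    hits = np.cumsum(labels)               # positives seen up to each rank
    ranks = np.arange(1, len(labels) + 1)
    return float(np.sum(labels * hits / ranks) / labels.sum())

def neuron_predictivity(hidden_seqs, w_in, labels, act=lambda z: np.maximum(z, 0.0)):
    """Predictivity of one neuron (row w_in of W^I) for a sub-function:
    a_ij is the max activation over the token hidden states of sequence s_i."""
    a = [float(act(h @ w_in).max()) for h in hidden_seqs]
    return average_precision(a, labels)

# Toy check: a neuron reading the first hidden dimension perfectly
# separates sequences whose tokens carry a large first component.
w = np.array([1.0, 0.0])
pos = [np.array([[5.0, 0.0], [0.1, 0.2]])] * 5   # label 1
neg = [np.array([[-1.0, 0.3], [0.0, 0.1]])] * 5  # label 0
assert neuron_predictivity(pos + neg, w, [1] * 5 + [0] * 5) == 1.0
```

An expert's predictivity would then be the mean of `neuron_predictivity` over its neurons.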
Evaluation Setups. We evaluate two kinds of MoE models, pre-partitioned MoE and post-partitioned MoE. (1) Pre-partitioned MoE expands feedforward layers by MoE before pre-training. Switch Transformer (Fedus et al., 2022) is a representative pre-MoE model. Its architecture is similar to T5 (Raffel et al., 2020) except that Switch Transformer replaces the FFNs in the even Transformer layers with MoE layers. (2) Post-partitioned MoE converts a pre-trained vanilla Transformer into an MoE model by partitioning its FFNs into experts. Since Zhang et al. (2022b) show that vanilla Transformers have implicit MoE structures by discovering the inner correlation among neurons, we use the same method (MoEfication) to convert vanilla Transformers. Note that we only adopt MoEfication to provide a neuron grouping. We do not change the architecture, parameters, or forward function (computing all neurons and then summing them up), making the MoEfied T5 identical to the original T5.
Due to the computational cost, we consider the Switch Transformer with 16 experts and use the same number of experts for the post-MoE models. Since the functionality evaluation does not involve decoding, we only compute the neuron and expert predictivities of the encoders. Besides, for Switch Transformer we focus on the neurons in MoE layers to facilitate the modular analysis in the following sections.

Functional Specialization of Neurons
In this section, we analyze the functional specialization of neurons. First, we compare the neuron predictivities of pre-trained models with those of randomly-initialized models at the layer level, which presents a general picture of neuron functions. Second, we study how functions are distributed over the neurons in each layer.
Neuron predictivities of different layers. At each layer, we compute the best predictivity of neurons for each sub-function and then average the best predictivities over all sub-functions of each function. For comparison, we also evaluate randomly-initialized models. We report the results in the left part of Figure 1.
From this figure, we have two observations. (1) The average best predictivities of pre-trained neurons are significantly higher than those of randomly-initialized neurons, indicating that the neurons have learned these functions during pre-training and that the neurons with top-ranked predictivities indeed excel in the corresponding sub-functions.
(2) The best predictivity of the task function increases with the layer number, while the best predictivities of the semantic and knowledge functions vary little across layers. This suggests that the task function may be more difficult than the semantic and knowledge functions, so the higher layers are more suitable for learning it. Note that we evaluate the frozen pre-trained models on the tasks without fine-tuning or prompt-tuning, and the pre-training data of Switch Transformer do not contain these tasks. Hence, we can exclude the possibility of optimization artifacts (Durrani et al., 2021), which could otherwise make the higher layers more related to the task function.
Distribution in each layer. We first identify the neurons with the top predictivity rankings for each sub-function as sub-functional neurons in each layer and then compute the overlap between the two sets of sub-functional neurons. Formally, assuming that we identify the top k neurons for each sub-function, the overlap score is defined as |N_1 ∩ N_2| / k, where N_1 and N_2 are the sets of neurons for the two considered sub-functions. If the overlap score is high, there is a group of neurons that are good at both sub-functions. In the experiments, we consider the neurons with the top 1% predictivities for each sub-function. Since there are hundreds of sub-functions, it is impractical to display all of them in a figure, so we compute the average overlap score between two functions to measure the distribution similarity between different functions. Note that we omit the self-overlap scores, which are always equal to 1. The results are reported in the right part of Figure 1.
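The overlap computation can be sketched as follows (assuming the score is the intersection size normalized by k; the synthetic predictivity vectors are illustrative):

```python
import numpy as np

def top_k(predictivities, k):
    """Indices of the k neurons with the highest predictivity."""
    return set(np.argsort(-np.asarray(predictivities))[:k].tolist())

def overlap_score(pred_a, pred_b, k):
    """|N1 & N2| / k for the top-k neuron sets of two sub-functions."""
    return len(top_k(pred_a, k) & top_k(pred_b, k)) / k

rng = np.random.default_rng(0)
base = rng.random(1000)  # per-neuron predictivities for one sub-function
# Two sub-functions of the same function: correlated predictivities
# tend to pick overlapping top-k sets ...
same = overlap_score(base, base + 0.01 * rng.random(1000), k=10)
# ... while an unrelated sub-function picks a nearly disjoint set.
diff = overlap_score(base, rng.random(1000), k=10)
assert same > diff
```

Averaging `overlap_score` over all sub-function pairs across two functions yields the function-level distribution similarity described above.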
From this figure, we observe that: (1) In the pre-trained models, the distribution similarity of sub-functions within the same function is significantly larger than that across different functions, which indicates that there are groups of neurons, each of which is good at a certain function. This is not mainly caused by the similarity that naturally exists between sub-functions of the same function, because the results of randomly-initialized models do not show such a phenomenon, as shown in Appendix A.3. Hence, we conclude that groups of functionally-specialized neurons emerge after pre-training.
(2) One neuron may be capable of multiple sub-functions even from different functions.For example, the average overlap score between the knowledge function and task function is also significantly higher than that of random models so there are some neurons good at both knowledge and task sub-functions.

Finding Modular Structure of Neurons
Since MoE is a promising candidate for the modular structure of Transformers, we analyze the MoE structure in this section. First, we verify whether the neurons specialized in a certain function are concentrated in some experts, which we call functional experts. Second, we perturb the activations of the experts corresponding to a certain function. Third, we train the model on different datasets, retaining only the functional experts by removing the non-functional ones. The last two experiments verify the importance of the experts for their corresponding functions.
Distribution on experts. If experts were not functionally specialized, the sub-functional neurons would be randomly distributed among experts. Hence, we conduct statistical hypothesis testing to evaluate whether the neuron distribution over experts is significantly different from random. Assume that there are N neurons in a layer, n_E neurons in each expert, k sub-functional neurons for each sub-function, and M sub-functions in a certain function. The null hypothesis is that the sub-functional neurons of each sub-function are independently and randomly distributed over experts, i.e., the number of sub-functional neurons in each expert follows a hypergeometric distribution with parameters N, k, and n_E. The sum of the numbers of sub-functional neurons over all sub-functions in an expert is denoted by r_i.³ The alternative hypothesis is that an expert has a larger r_i than expected by chance.

Table 1: Proportion of functional experts among all experts and their modularization degree. Higher modularization degrees mean that functional experts have more sub-functional neurons than expected under a uniform distribution.
In the experiments, we also treat the neurons with the highest 1% predictivities for each sub-function as its sub-functional neurons. For each function, we compute the p-value of the sum of the hypergeometric distributions for each expert and reject the null hypothesis if the p-value is less than 0.001. We also conduct the same experiment on random partitioning, where the neurons are randomly partitioned into expert-sized clusters, and on randomly-initialized counterparts. We regard the experts that reject the null hypothesis as functional experts and report the proportion of functional experts among all experts for each function. We also consider the modularization degree, defined as the relative ratio of functional neurons in an expert compared to a uniform distribution. The proportion of functional experts P and the modularization degree D are computed by

P = E_f / E,    D = (1 / E_f) Σ_{i ∈ functional experts} (r_i / n_E) / (Mk / N),

where E_f is the number of functional experts, E is the number of experts, r_i / n_E is the proportion of functional neurons in expert i, and Mk / N is the expected proportion of functional neurons under a uniform distribution. The overall degree is 0 if no functional expert exists; otherwise it is the average degree over all functional experts.
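The test can be sketched as follows (binomial upper tail computed from scratch; the toy counts below are made up, with N=3072 neurons split into 16 experts of n_E=192, k=31 top-1% neurons per sub-function, and M=8 sub-functions):

```python
import math

def binom_sf(x, n, p):
    """P[X > x] for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(x + 1, n + 1))

def functional_experts(counts, N, k, M, n_E, alpha=1e-3):
    """counts[i] = r_i, the number of sub-functional neurons (summed over
    the M sub-functions of one function) landing in expert i.  Under the
    null, r_i ~ Binomial(M*k, n_E/N): the binomial approximation to the
    sum of hypergeometrics, valid since k << N."""
    flags = [binom_sf(r - 1, M * k, n_E / N) < alpha for r in counts]  # P[X >= r_i]
    proportion = sum(flags) / len(counts)
    expected = M * k / N  # expected fraction of sub-functional neurons per expert
    degrees = [(r / n_E) / expected for r, f in zip(counts, flags) if f]
    degree = sum(degrees) / len(degrees) if degrees else 0.0
    return proportion, degree

# One concentrated expert (r = 68 vs. a null mean of M*k*n_E/N = 15.5).
prop, deg = functional_experts([68] + [12] * 15, N=3072, k=31, M=8, n_E=192)
assert abs(prop - 1 / 16) < 1e-9 and deg > 4
```

Here only the concentrated expert rejects the null, so the reported proportion is 1/16 and its modularization degree is its functional-neuron fraction relative to the uniform expectation.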
The results are shown in Table 1. From this table, we have three observations. (1) Since the sub-function distributions of pre-trained models are not independent, the proportion of functional

³We do not find a general form for the distribution of the sum of independent hypergeometric variables. Since k is significantly smaller than N, we approximate the hypergeometric distribution with a binomial distribution.
Figure 2: Perturbation performance of T5.For "Random", we randomly perturb neurons.For "SST-2", "MRPC", "CoLA", "QQP", we are guided by the neuron predictivities on each dataset and perturb the top-ranked neurons.For "Avg", we sum the predictivities on all datasets above and also perturb the top-ranked neurons.For "MoE", we consider the experts with top-ranked sums of the predictivities.The seen datasets are the four datasets above.The unseen datasets include five other datasets.
experts of pre-trained models with random partitioning is higher than that of randomly-initialized counterparts. (2) However, the proportion of functional experts of pre-trained models with MoE partitioning is still higher than that with random partitioning. Moreover, the modularization degree of the functional experts in MoE structures is significantly higher than that in random partitioning. This indicates that the experts of both pre-MoE and post-MoE are more likely to concentrate neurons excelling in a certain function. (3) We further compare the predictivities of the functional and non-functional experts and find that the functional experts have significantly higher predictivities than the non-functional experts for their corresponding functions. This indicates that our quantification of expert predictivities is consistent with the concept of functional experts. More details are in Appendix A.4.

Perturbation analysis. Furthermore, we conduct perturbation experiments, which are widely used to analyze both biological and artificial neural networks (Cowley et al., 2022; Wang et al., 2022b), to evaluate the causal effect of experts on model performance. Since T5 is a dense model and Switch Transformer is a sparse model, we use different perturbation methods for them. Specifically, the pre-MoE models select only one expert at each layer, so we choose to perturb the selection function. For the post-MoE models, which use all experts at each layer, we perturb the activation values of the target experts. Due to the difference in experiment setups, the perturbation results of the two models are not comparable.
For T5, we perturb all neuron activations of the target experts by adding random noise and evaluate the perturbed models on downstream tasks. We rank experts according to the sum of their predictivities over several downstream datasets and perturb the top-ranked experts. We regard the datasets used in computing this sum as seen datasets, including SST-2 (Socher et al., 2013), MRPC (Dolan and Brockett, 2005), CoLA (Warstadt et al., 2018), and QQP. To evaluate the generalization ability of the experts, we also perturb them and evaluate the perturbed models on unseen datasets, including MNLI (Williams et al., 2018), QNLI (Rajpurkar et al., 2016), CB (De Marneffe et al., 2019), MultiRC (Khashabi et al., 2018), and BoolQ (Clark et al., 2019a). We compute the average performance of the perturbed models on the seen and unseen datasets, respectively. For comparison, we also conduct neuron-level perturbation and keep the proportion of perturbed neurons equal to that of the expert-level perturbation. There are three kinds of neuron-level perturbations: (1) perturb the neurons with top-ranked predictivities for a certain dataset, (2) perturb the neurons with top-ranked sums of predictivities over the seen datasets, and (3) perturb neurons randomly. We perturb the neurons in the last four layers, where the task function is mainly located, as shown in Figure 1. For a fine-grained perturbation percentage, we partition the FFNs into 96 experts, which differs from the other experiments.
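The neuron-level perturbation can be sketched as follows (illustrative shapes and noise scale; a real experiment would inject the noise inside the FFN forward pass rather than into a cached activation matrix):

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_top_neurons(activations, predictivities, fraction, noise_scale=1.0):
    """Add Gaussian noise to the activations of the top-`fraction` of
    neurons ranked by (summed) predictivity.  Expert-level perturbation is
    analogous, using the neuron indices of whole experts instead."""
    d_ff = activations.shape[-1]
    idx = np.argsort(-np.asarray(predictivities))[: int(fraction * d_ff)]
    out = activations.copy()
    out[..., idx] += noise_scale * rng.normal(size=out[..., idx].shape)
    return out, idx

acts = rng.normal(size=(4, 32))  # (sequence length, d_ff), toy sizes
pred = rng.random(32)            # summed predictivity per neuron
out, idx = perturb_top_neurons(acts, pred, fraction=0.1)

untouched = np.setdiff1d(np.arange(32), idx)
assert np.allclose(out[:, untouched], acts[:, untouched])  # others intact
assert not np.allclose(out[:, idx], acts[:, idx])          # targets noised
```

Random-neuron perturbation corresponds to replacing the ranking by a random permutation of the neuron indices.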
We report the average accuracy of the perturbed models on the downstream tasks in Figure 2. From this figure, we have three observations. (1) The experts with high predictivities are very important for model performance. For example, perturbing 10% of the experts by function in the last four layers (around 3% of the total experts in the model) decreases the average accuracy by nearly 30% and makes the model perform like random guessing.
(2) For neuron-level perturbations, "Avg" perturbation achieves a larger performance drop than single-dataset perturbation, which is expected because intuitively it perturbs the neurons with high overall predictivities.
(3) Perturbing experts leads to a more significant performance drop than perturbing individual neurons on both seen and unseen datasets when the perturbation proportion is higher than 6%. This suggests that the neurons in an expert cooperate instead of working independently, so perturbing them disrupts this cooperation and leads to a significant performance drop.
We conclude that single neurons cannot perform a function well without the cooperation of their modules, despite their high individual predictivities.
For Switch Transformer, we perturb the selection probabilities of experts in the last four layers. We constrain the model to select from a subset of experts by setting the selection probabilities of the other experts to 0 and fine-tune the perturbed models on the downstream tasks with the template in Gao et al. (2021). There are two kinds of perturbations: (1) No Function: we only select the non-functional experts for a certain task, which is equivalent to setting the activations of the functional experts to 0. (2) Function: we only select the functional experts for a certain task. Since the number of functional experts is smaller than that of non-functional experts, we randomly select a subset of non-functional experts for No Function to make its size equal to the number of functional experts. Besides, we also report the performance of the original model without any perturbation. More details are in Appendix A.2. From Table 2, we observe that avoiding the functional experts leads to an overall performance drop. Furthermore, selecting only the functional experts even achieves higher performance than the original model.
In summary, we observe that specialized neurons tend to be concentrated in certain experts by function, and these functional experts play an important role when the model performs the related functions. Hence, it is reasonable to study how this modular structure emerges during pre-training.

Emergence of Modularity
In this section, we study the emergence of modularity during pre-training. To this end, we pre-train the base versions of T5 and Switch Transformer from scratch. We use the MoEfication partitioning of the last checkpoint as the MoE structure of T5. More details of pre-training are in Appendix A.2.
Emergence Patterns of Functional Experts. We first study how the proportion of functional experts and their modularization degree change during pre-training, i.e., we apply the analysis of Section 5 to each checkpoint. The results are shown in Figure 3. We have the following observations. (1) The proportion of functional experts and their modularization degree quickly reach a high point and then remain relatively stable until the end. This indicates that functional experts emerge at the early stage of pre-training. (2) The proportion of functional experts in Switch Transformer fluctuates significantly at about 20K steps, and its stabilization is slower than that of T5. This suggests that, surprisingly, the modular structure emerges with more difficulty in Switch Transformer than in T5. The reason may be that Switch Transformer omits the gradients of unselected experts, which makes its optimization harder than that of T5 (Du et al., 2022; Zoph et al., 2022).
Stabilization of Experts and Neurons. Even though the changing curves of the number and modularization degree of functional experts are clear from a global perspective, we still do not know how the predictivities of individual neurons and experts change.
There are two kinds of predictivity dynamics: changes in the absolute predictivities, and changes in the relative order of predictivities among all experts or neurons. While it is straightforward to study the absolute predictivities, it is difficult to establish a consistent analysis standard because different functions and layers have different scales. Hence, we focus on the relative order of predictivities. Intuitively, some experts or neurons may come to excel in a sub-function at a certain stage and maintain this relative dominance as pre-training continues. With this in mind, we study the stabilization of predictivity rankings.
To study the stabilization of predictivity rankings, we quantify the similarity between a layer of two model checkpoints w.r.t. a particular sub-function, either at the expert level or at the neuron level. Specifically, for a sub-function, we define this similarity as Spearman's rank correlation (Spearman, 1961) between the predictivities of the experts or neurons in the considered layer of the two checkpoints. In this way, we measure to what extent the predictivities of the two checkpoints are positively correlated. We measure the similarity between two adjacent (saved) checkpoints as the stabilization score, which reflects the trend toward stabilization: higher similarity indicates a slower pace of change and thus a higher degree of stabilization. For each function, we show the curve of the average stabilization score over all its sub-functions and all layers, at both the expert level and the neuron level. To facilitate our analysis, we also measure it for random partitioning.
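The stabilization score can be sketched as follows (Spearman's rank correlation implemented as Pearson correlation on ranks, assuming no tied predictivities; the checkpoint vectors are illustrative):

```python
import numpy as np

def spearman(a, b):
    """Spearman's rank correlation: Pearson correlation of the ranks
    (double-argsort ranking; assumes no ties)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

def stabilization_scores(checkpoint_preds):
    """checkpoint_preds[t] = vector of expert (or neuron) predictivities
    for one sub-function and layer at the t-th saved checkpoint; the
    score at step t is the similarity to the previous checkpoint."""
    return [spearman(checkpoint_preds[t - 1], checkpoint_preds[t])
            for t in range(1, len(checkpoint_preds))]

p = np.array([0.1, 0.5, 0.3, 0.9])
assert abs(spearman(p, p) - 1.0) < 1e-12   # identical rankings
assert abs(spearman(p, -p) + 1.0) < 1e-12  # fully reversed rankings
# Monotone shifts preserve ranks, so the scores stay at 1.
assert all(abs(s - 1.0) < 1e-12 for s in stabilization_scores([p, p + 1.0, 2 * p]))
```

Averaging these scores over sub-functions and layers yields the curves reported for each function.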
We report the results in Figure 4. From this figure, we have the following observations. (1) During pre-training, both experts and neurons are increasingly stabilized. (2) Experts are stabilized to a large extent at the early stage of pre-training: it takes around 15% of the total training steps for the expert predictivities to reach a stabilization score of 0.9. (3) Expert stabilization is notably faster than both neuron stabilization (around 75% of the total training steps) and the stabilization of random partitioning. In conclusion, we see strong evidence for a coarse-to-fine mechanism of pre-training: the Transformer first learns a modular structure, which becomes stable at the early stage, and then learns fine-grained neuron functions within it.

Discussion
Efficient Pre-training. Large-scale pre-training of Transformers is very expensive (Brown et al., 2021). It is promising to use MoE to reduce the computational cost by activating only a small part of the experts. Our findings demonstrate the emergent modularity of experts, which supports the reasonableness of the MoE structure. However, as shown in Section 6, the modular structure of Switch Transformer is more unstable than that of T5 at the beginning. This suggests that we should begin with a dense model and gradually make it sparse instead of directly training a sparse model from scratch, which has been explored in some preliminary works (Nie et al., 2021; Hazimeh et al., 2021).
Model Fusion. Considering that there are many models pre-trained on different corpora and even different modalities, researchers have started to explore how to fuse them to aggregate their knowledge. Compared with model ensembling, model fusion has the potential to be more efficient because it does not need to run all of the models. Much of the current research on model fusion focuses on weight averaging and achieves promising results (Li et al., 2022; Matena and Raffel, 2021). However, weight averaging requires the two models to have the same architecture, which is not always the case. In this work, we discover the modular structure of pre-trained Transformers, which may enable model fusion based on module combination, free of the architecture constraint.
Connection between Brains and Pre-trained Transformers. Building an artificial brain that corresponds to the human brain is an important neuroscience problem, e.g., the Blue Brain project (Markram, 2006). Currently, pre-trained Transformers show strong power in predicting brain signals (Toneva and Wehbe, 2019; Caucheteux et al., 2021), but more fine-grained connections between the two remain unclear. In analogy to brain regions, we present the modular structure of pre-trained Transformers. It will be interesting to explore the connection between brain regions and Transformer modules in the future.

Conclusion
In this paper, we study the modularity of pre-trained Transformers and validate the emergence of modularity. We also study the pre-training process to understand how modularity emerges and find a coarse-to-fine mechanism of pre-training. We expect our evaluation framework and findings to facilitate and inspire future research in this area.

Limitations
The major limitations of our work are as follows: (1) We show that the neuron structure of MoE reveals the presence of modularity in pre-trained Transformers. However, the MoE structure is not the only possible modular structure. To better understand the modular structure of Transformers, we need to explore more types of structures. For example, the number of neurons in each module could differ, and the modular structure could be hierarchical, where modules are grouped into larger modules. (2) We study three typical functions for language processing: the semantic function, the knowledge function, and the task function. Many other functions could be studied, such as the syntactic function, the discourse function, etc. Moreover, our categorization of functions may not be suitable for pre-trained Transformers because there are some overlaps between the studied functions. A new Transformer-based function categorization may be needed. (3) We transform T5 into its MoE version to study its modular structure, while not all dense pre-trained Transformers can be studied in this way because the adopted MoEfication technique (Zhang et al., 2022b) can only transform ReLU-based Transformers. Studying the modularity of other dense pre-trained Transformers, such as BERT, is also important for future research.

A.1 Neurons of MoE
In MoE layers, each expert is an FFN, and the output of the MoE layer is the weighted sum of the outputs of all experts, MoE(x) = Σ_{i=1}^{E} α_i FFN_i(x), where α_i is the weight of the i-th expert and FFN_i is the i-th expert. α_i is computed by a gating network. Note that the weights of unselected experts are zero, which makes the MoE layer sparse. We can also rewrite the MoE layer in a neuron-based form, MoE(x) = Σ_{i=1}^{E} Σ_j α_i σ(W^I_{i,j,:} · x) W^O_{i,:,j}, where W^I_{i,j,:} and W^O_{i,:,j} are the j-th row of W^I_i and the j-th column of W^O_i in the i-th expert, respectively. The gating weight α_i is non-negative and can be viewed as a scaling factor of W^O_{i,:,j}. Correspondingly, we define a neuron n_{i,j} as the pair of the row vector W^I_{i,j,:} and the column vector W^O_{i,:,j}, and the neuron activation of n_{i,j} is σ(W^I_{i,j,:} · x).
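The equivalence between the expert-level and the neuron-level forms can be checked numerically. Below is a sketch with random weights, assuming ReLU as σ; the dimensions and variable names are illustrative, not taken from any specific model.

```python
import numpy as np

rng = np.random.default_rng(0)
E, d, m = 4, 8, 16  # experts, model dimension, neurons per expert (illustrative)
W_I = rng.normal(size=(E, m, d))  # W^I_i: rows are the input vectors of neurons
W_O = rng.normal(size=(E, d, m))  # W^O_i: columns are the output vectors of neurons
alpha = np.array([0.7, 0.3, 0.0, 0.0])  # gating weights; unselected experts get 0
x = rng.normal(size=d)
relu = lambda z: np.maximum(z, 0.0)  # sigma

# Expert-level form: MoE(x) = sum_i alpha_i * FFN_i(x)
expert_out = sum(alpha[i] * W_O[i] @ relu(W_I[i] @ x) for i in range(E))

# Neuron-level form: each neuron n_{i,j} contributes
# alpha_i * sigma(W^I_{i,j,:} . x) * W^O_{i,:,j}
neuron_out = np.zeros(d)
for i in range(E):
    for j in range(m):
        neuron_out += alpha[i] * relu(W_I[i, j, :] @ x) * W_O[i, :, j]

# The two forms agree, so the MoE layer can be analyzed neuron by neuron.
assert np.allclose(expert_out, neuron_out)
```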

A.2 Experimental Details
Calculation of AP. AP is the weighted average of precision at different recall levels, which is a common metric for evaluating binary classification models. Since AP only captures positive correlation, we compute the APs of both the neuron activations and their opposite values, −a_i, and take the maximum as the final AP.
The final AP ranges from 0.5 to 1, where 0.5 means the neuron is useless for the sub-function and 1 means the neuron is perfect for the sub-function.
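A minimal sketch of this signed AP computation follows; the helper names are ours, and the real implementation may differ in details such as tie handling.

```python
import numpy as np

def average_precision(labels, scores):
    """AP: mean of the precision values at the rank of each positive,
    with candidates sorted by score in descending order."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    y = np.asarray(labels)[order]
    tp = np.cumsum(y)
    precision = tp / np.arange(1, len(y) + 1)
    return float((precision * y).sum() / y.sum())

def signed_ap(labels, activations):
    # A neuron can be *negatively* correlated with a sub-function, so we
    # score both a_i and -a_i and keep the better direction.
    a = np.asarray(activations, dtype=float)
    return max(average_precision(labels, a), average_precision(labels, -a))

# A neuron strongly negatively correlated with the sub-function
# still receives a high final AP.
labels = [1, 1, 0, 0]
acts   = [0.1, 0.2, 0.8, 0.9]
print(signed_ap(labels, acts))  # -> 1.0
```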
Randomly-initialized models. In Section 4, the evaluation of randomly-initialized models is conducted 3 times, and we report the average results.
Perturbation analysis on T5. To match the magnitude of T5's neuron activations, we set the variance of the Gaussian noise to 4. The perturbation analysis is conducted 5 times, and we report the average results.
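The perturbation itself can be sketched as adding zero-mean Gaussian noise to an expert's neuron activations; the function name is ours, and variance 4 is the value from the setup above.

```python
import numpy as np

def perturb_activations(activations, variance=4.0, seed=0):
    """Add zero-mean Gaussian noise to the activations of a functional
    expert. Variance 4 roughly matches the magnitude of T5's neuron
    activations (value taken from the setup above)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(activations, dtype=float)
    return a + rng.normal(0.0, np.sqrt(variance), size=a.shape)

# Repeated over 5 random seeds, as in the analysis above.
acts = np.zeros(100_000)
perturbed = [perturb_activations(acts, seed=s) for s in range(5)]
```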
Perturbation analysis on Switch Transformer. In this experiment, the predictivity, and thus the functional experts, are calculated based on the training set. Since Switch Transformer has not been trained on the considered datasets during pre-training, we train it on 128 randomly-picked training instances. We tune W^O in the last four layers.
Pre-training details. We pre-train on the web text corpus of Radford et al. (2019), which contains 40GB of web text. We use the same pre-training task as the official T5 and Switch Transformer, namely masked language modeling. The total number of training steps is 200K, and we save a checkpoint every 5K steps. We use the same hyper-parameters for pre-training to avoid their effect on our analysis, which is also common practice when comparing dense and sparse T5s (Zoph et al., 2022). The learning rate is 1e-4 and the batch size is 512. The max lengths of encoder inputs and decoder inputs are 512 and 256, respectively. We use 8 NVIDIA A100 GPUs for pre-training; the total pre-training time is around 3 days.
Experiments on random partitioning. In Section 5 and Section 6, we perform the hypothesis testing on random partitioning, and we also calculate Spearman's rank correlation between adjacent checkpoints on random partitioning. These random experiments are repeated 1000 times, and we report the average results.

A.3 Function Distribution in Each Layer
Following Section 4, we report the distribution similarity of the randomly-initialized models in Figure 5, which is significantly different from that of the pre-trained models.We consider the pre-trained models and their randomly-initialized counterparts.

A.4 Predictivity of Functional Experts
We quantify the predictivity of experts for sub-functions in Section 3 and define functional experts in Section 5, but the technical details of the quantification and the definition differ. Hence, we conduct an experiment to check their consistency. For a function, we calculate b_i as the average predictivity across all sub-functions for each expert e_i, and denote by f_i ∈ {0, 1} whether e_i is a functional expert. We then compute the AP based on B = {(b_i, f_i)}_{i=1}^{E}. A high AP indicates high consistency. We report the average AP across all layers in Table 5.
The average AP is quite high. Therefore, we are confident that the quantification of expert predictivity is consistent with the concept of functional experts.

A.5 Sub-Functional Experts
Similar to the concept of functional experts discussed in Section 4, we can also define sub-functional experts. Basically, for each sub-function, we conduct statistical hypothesis testing on its sub-functional neurons. We similarly calculate the proportion of sub-functional experts and their modularization degree, and report the average result within each function. The results for the Switch Transformer and T5 used in Section 4 are reported in Table 4. The changing curves for the Switch Transformer and T5 trained by us are reported in Figure 6.

A.6 Organization of Sub-Functions
We further study how the model organizes sub-functions into their functional experts 5 and how this organization changes during pre-training. From the perspective of a sub-function, the question is essentially how it shares functional experts with other sub-functions.
For a function, we list all of its sub-functions, denoted w_1, w_2, ..., w_M. The similarity scores between pairs of sub-functions form a matrix S, where S_{i,j} is the similarity score between w_i and w_j for all 1 ≤ i, j ≤ M.
5 Strictly speaking, we did not define functional experts for a sub-function. In this context, the concept of "functional experts" refers to the experts that have high predictivity for a sub-function.
For a given k, we denote O^(k)_{i,j} as the top-k expert overlap between w_i and w_j. When Spearman's rank correlation between S_{i,:} and O^(k)_{i,:} (denoted V^(k)_i) is high, it indicates that the sub-functions similar to w_i share more functional experts than sub-functions dissimilar to w_i do, and vice versa. Accordingly, we call V^(k)_i the clustering score.
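The clustering score V^(k)_i can be sketched as follows. This is a numpy-only illustration; the function names are ours, and ties in the overlap counts are ignored for simplicity.

```python
import numpy as np

def topk_overlap(pred_i, pred_j, k):
    """Number of experts shared by the top-k expert sets of two
    sub-functions, ranked by predictivity."""
    top_i = set(np.argsort(-np.asarray(pred_i))[:k])
    top_j = set(np.argsort(-np.asarray(pred_j))[:k])
    return len(top_i & top_j)

def clustering_score(S_row_i, O_row_i, i):
    """Spearman correlation between similarity-to-w_i (S_{i,:}) and top-k
    expert overlap with w_i (O^(k)_{i,:}), excluding the diagonal entry."""
    mask = np.arange(len(S_row_i)) != i
    a = np.asarray(S_row_i, dtype=float)[mask]
    b = np.asarray(O_row_i, dtype=float)[mask]
    ra = np.argsort(np.argsort(a))  # ranks (no tie handling)
    rb = np.argsort(np.argsort(b))
    return float(np.corrcoef(ra, rb)[0, 1])

# Sub-functions more similar to w_0 share more top-k experts with it,
# so the clustering score is 1.
S_row = np.array([1.0, 0.8, 0.5, 0.2])  # similarities to w_0
O_row = np.array([5, 4, 3, 1])          # top-k expert overlaps with w_0
print(clustering_score(S_row, O_row, 0))  # -> 1.0
```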
We do a case study on the semantic function of the Switch Transformer, which focuses on understanding word meanings. Since methods for quantifying word similarity are relatively mature, we take S as the word similarity matrix calculated by spaCy (Honnibal and Montani, 2017). Note that word-level features are the lowest level of semantic information, so the word similarity reflects the lowest level of similarity between two semantic sub-functions.
We report the curves of the clustering scores. (1) The low MoE layers organize word-level information into the same functional experts. (2) However, the final clustering score of the higher MoE layers is close to 0, indicating that high layers do not organize sub-functions based on word similarity. We conjecture that this is because high layers process high-level semantic information, which is not related to word similarity.
(3) We see three interesting curves for layers 3, 5, and 11. Their clustering scores reach a high point when the clustering score of layer 1 first reaches its peak, and then they continuously decrease to 0. This suggests that the tendency of high layers to become increasingly responsible for high-level features grows faster once the low-layer organization has been established than while that organization is still forming.

Figure 1: Left part: average predictivity of each function in each layer for the pre-trained models and their randomly-initialized counterparts of (a) Switch Transformer and (b) T5. Right part: average distribution similarity between different functions of different layers for the same pre-trained models. All of the similarities of the randomly-initialized models are around 0.01, including the diagonal elements, which is significantly lower than those of the pre-trained models.

Figure 3: Changing curves of the proportion of functional experts and their modularization degree. We also mark the value of the last checkpoint with random partitioning on the curve by points with different shapes.

Figure 4: Spearman's rank correlation between the functionality distributions of two adjacent checkpoints.
Figure 5: Distribution similarity between different sub-functions. We report the average similarity between functions. We consider the pre-trained models and their randomly-initialized counterparts.

Figure 6: Changing curves of the proportion of sub-functional experts and their modularization degree. The horizontal line is the value for random partitioning.

Figure 7:

Figure 8: Changing curves of the average clustering scores on random partitioning. We plot the curve for each MoE layer in Switch Transformer.

Table 2: Perturbation performance of Switch Transformer on GLUE tasks.

Table 4: Proportion of sub-functional experts identified by hypothesis testing and their modularization degree. The result is averaged within each function.

Table 5: Average AP calculated based on B, where b_i is the average predictivity across all sub-functions for expert e_i, and f_i indicates whether e_i is a functional expert.