One Wide Feedforward Is All You Need

The Transformer architecture has two main non-embedding components: Attention and the Feed Forward Network (FFN). Attention captures interdependencies between words regardless of their position, while the FFN non-linearly transforms each input token independently. In this work we explore the role of the FFN, and find that despite taking up a significant fraction of the model’s parameters, it is highly redundant. Concretely, we are able to substantially reduce the number of parameters with only a modest drop in accuracy by removing the FFN on the decoder layers and sharing a single FFN across the encoder. Finally we scale this architecture back to its original size by increasing the hidden dimension of the shared FFN, achieving substantial gains in both accuracy and latency with respect to the original Transformer Big.


Introduction
The Transformer architecture (Vaswani et al., 2017) has become the de facto paradigm in many Natural Language Processing (NLP) tasks, including Machine Translation (MT). Several studies have shown that Transformers exhibit impressive scaling-law properties (Gordon et al., 2021; Bansal et al., 2022; Ghorbani et al., 2022), wherein increasing the number of model parameters leads to further accuracy gains. In parallel with this architecture's impressive scaling of the number of parameters (Chowdhery et al., 2022), there is a growing trend towards reducing model footprints for real-world deployment, to satisfy practical constraints like latency requirements as well as memory and disk space limitations. In turn, researchers are actively exploring parameter sharing (Ge et al., 2022; Takase and Kiyono, 2023; Lou et al., 2022), reducing the dimensionality of Transformer components, and pruning components like attention heads (Voita et al., 2019; Michel et al., 2019).
Although the role of attention in learning pairwise dependencies between tokens is relatively well understood (Voita et al., 2019; Clark et al., 2019; Vig and Belinkov, 2019), the role of the Feed Forward Network (FFN) remains under-explored. Recently, Geva et al. (2021) established a connection between the FFN and attention by positing that the FFN corresponds to learnable key-value pairs, where the weights of the first layer of the FFN correspond to the keys and those of the second to the values. They find that the keys capture salient textual patterns at each layer, and that the classes of patterns tend to overlap between neighboring layers, indicating redundancy in the representation.
This observation motivates our work, where we revisit the conventional practice of allocating an individual FFN per layer. We investigate the effect of sharing and dropping the FFN across different layers of MT models. We conduct thorough experiments with different configurations of the Transformer, across different language pairs, including a low-resource language pair and a multilingual setup. In addition, we investigate the effect of the FFN in a decoder-only Transformer-based model. We find that a considerable level of redundancy exists between the encoder and decoder FFNs. As a result, we are able to eliminate the decoder FFN and share a single FFN across the encoder without significantly compromising the model's accuracy. This step not only leads to significant parameter savings but also opens up opportunities for further improvements. We also suggest using a wider FFN in the encoder while dropping the decoder's FFN, which results in a model with a similar size but improved accuracy and reduced latency.
Finally, we conduct a fine-grained analysis of the representational similarity between the original model, using one independent FFN per layer, and various models with shared FFNs. Our results reveal that both model accuracy and the internal representation of Transformer blocks remain stable when sharing the FFN.

Transformer
The Transformer architecture has two main components: attention and the FFN, which are connected via a residual connection (He et al., 2016) and layer normalization (Ba et al., 2016). In an encoder-decoder model, there are two types of attention: self-attention and cross-attention. Self-attention is used in both the encoder and the decoder, allowing the model to focus on relevant information within the same sequence. Cross-attention is exclusive to the decoder and allows it to attend to the encoder's output. Attention takes as input a set of queries, keys, and values, projected using four $\mathbb{R}^{d_{\text{model}} \times d_{\text{model}}}$ matrices (one each for the queries, keys, values, and final output), where $d_{\text{model}}$ is the model's hidden dimension. It then applies the SOFTMAX function to focus on the most relevant values.
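To make the projection step concrete, the following is a minimal single-head sketch in PyTorch (the multi-head variant splits $d_{\text{model}}$ across heads); the class and variable names are ours, for illustration only:

```python
import torch
import torch.nn as nn

class SingleHeadAttention(nn.Module):
    """Attention sketch with the four d_model x d_model projections
    (queries, keys, values, output) followed by a softmax over keys."""

    def __init__(self, d_model: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)
        self.scale = d_model ** -0.5

    def forward(self, queries, keys, values):
        # For self-attention, queries = keys = values = x;
        # for cross-attention, keys and values come from the encoder output.
        q, k, v = self.w_q(queries), self.w_k(keys), self.w_v(values)
        scores = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return self.w_o(scores @ v)
```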
The FFN is applied after attention on both the encoder and the decoder and consists of the following 2-layer linear transformation:

$$\text{FFN}(x) = \text{ReLU}(x W_1 + b_1)\, W_2 + b_2,$$

where a RELU non-linearity is applied to the transformation of the input sequence $x$. At each layer, the FFN is parameterized with two matrices, $W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$ and $W_2 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$, where $d_{\text{ff}}$ is the FFN dimension and is usually set to $4 \times d_{\text{model}}$ (Vaswani et al., 2017).
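In code, the position-wise FFN is only a few lines; a minimal PyTorch sketch with the Transformer Big dimensions used later in this paper:

```python
import torch
import torch.nn as nn

class FFN(nn.Module):
    """Position-wise FFN: FFN(x) = ReLU(x W1 + b1) W2 + b2,
    applied to each token independently."""

    def __init__(self, d_model: int = 1024, d_ff: int = 4096):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)   # W1: the "keys" of Geva et al. (2021)
        self.w2 = nn.Linear(d_ff, d_model)   # W2: the "values"

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(torch.relu(self.w1(x)))
```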
Recent work has drawn a significant link between attention and the FFN (Geva et al., 2021), wherein $W_1$ and $W_2$ assume roles akin to the keys and values of an unnormalized attention where the input $x$ acts as the query. Unlike regular attention, the FFN employs a RELU, which allows multiple keys to contribute significantly to the final output (Geva et al., 2021). Additionally, these keys correspond to an inventory of salient patterns learned from the training data. Geva et al. (2021) suggest that at the lower layers the FFN learns shallow syntactic patterns and progressively learns deeper semantic patterns at the higher layers. Moreover, the authors find that there is a substantial overlap between patterns captured by adjacent layers, indicating that there are redundancies in the FFNs and suggesting that a better allocation of these parameters might be beneficial for performance.

Sharing and Widening the FFN
The vanilla Transformer allocates one FFN for each layer of the encoder and decoder, i.e. $\text{FFN}^{\text{enc}}_i$ or $\text{FFN}^{\text{dec}}_i$, respectively. Excluding embedding parameters, these FFNs occupy around two thirds of the parameter budget, while attention occupies the remaining third. Earlier work found that constraining the parameterization of the decoder FFNs causes no degradation in accuracy (Ge et al., 2022). In this work, we share the parameters of the FFN across layers and/or across the encoder and decoder to minimize redundancy between FFNs.
Let $N_{\text{enc}}$, $N_{\text{dec}}$ be the numbers of encoder and decoder layers, respectively. We consider multiple configurations for parameter sharing, as follows (see the sketch after this list):

• One $\text{FFN}^{\text{enc}}_{\text{all}}$ for the whole encoder: $\text{FFN}^{\text{enc}}_i = \text{FFN}^{\text{enc}}_{\text{all}} \;\; \forall i \in [1, N_{\text{enc}}]$.

• One $\text{FFN}^{\text{dec}}_{\text{all}}$ for the whole decoder: $\text{FFN}^{\text{dec}}_i = \text{FFN}^{\text{dec}}_{\text{all}} \;\; \forall i \in [1, N_{\text{dec}}]$.

• One $\text{FFN}^{\text{enc+dec}}_{\text{all}}$ for both the encoder and the decoder: $\text{FFN}^{\text{enc}}_i = \text{FFN}^{\text{dec}}_j = \text{FFN}^{\text{enc+dec}}_{\text{all}} \;\; \forall i, j$.

Additionally, we explore modifying the dimension of the shared FFN, which we denote as $d_{\text{ff}'}$. Setting $d_{\text{ff}'} > d_{\text{ff}}$ widens the shared FFN while $d_{\text{ff}'} < d_{\text{ff}}$ narrows it. We also consider the extreme cases of setting $d_{\text{ff}'}$ to 0 or to $(N_{\text{enc}} + N_{\text{dec}}) \times d_{\text{ff}}$ (and beyond). Setting $d_{\text{ff}'} = 0$ is equivalent to dropping the FFN, while setting $d_{\text{ff}'} = (N_{\text{enc}} + N_{\text{dec}}) \times d_{\text{ff}}$ is akin to sharing the concatenation of all individual FFNs.
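As a sketch of what these configurations mean in practice (module and function names are ours, not the paper's implementation), tying one FFN across layers amounts to reusing the same module object, so every layer reads and updates a single weight set:

```python
import torch.nn as nn

def build_ffns(n_enc: int, n_dec: int, d_model: int, d_ff_shared: int,
               mode: str = "shared_enc_dec"):
    """Return the per-layer FFN lists for the sharing configurations above.
    Repeating one nn.Module instance ties its parameters across layers;
    d_ff_shared = 0 (dropping the FFN) would map to nn.Identity() instead."""
    def make():
        return nn.Sequential(nn.Linear(d_model, d_ff_shared), nn.ReLU(),
                             nn.Linear(d_ff_shared, d_model))
    if mode == "shared_enc":        # one FFN_all^enc; decoder FFNs independent
        shared = make()
        return [shared] * n_enc, [make() for _ in range(n_dec)]
    if mode == "shared_dec":        # encoder FFNs independent; one FFN_all^dec
        shared = make()
        return [make() for _ in range(n_enc)], [shared] * n_dec
    if mode == "shared_enc_dec":    # a single FFN for the whole model
        shared = make()
        return [shared] * n_enc, [shared] * n_dec
    raise ValueError(f"unknown mode: {mode}")
```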
Sharing the FFNs directly affects the number of parameters and, to a certain extent, latency. For instance, sharing $\text{FFN}^{\text{enc}}_{\text{all}}$ across the whole encoder reduces the number of parameters by $(N_{\text{enc}} - 1) \times 2 \times d_{\text{model}} \times d_{\text{ff}'}$. Dropping the FFN on the decoder, i.e., setting $d_{\text{ff}'} = 0$ for $\text{FFN}^{\text{dec}}_{\text{all}}$, reduces the parameters by $N_{\text{dec}} \times 2 \times d_{\text{model}} \times d_{\text{ff}}$ and reduces the amount of computation to be done. This is particularly important during inference: since the forward pass of the decoder is autoregressive, changing the decoder's FFN dimension has a higher latency impact than changing the encoder's.
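For concreteness, plugging in the Transformer Big dimensions used later ($N_{\text{enc}} = N_{\text{dec}} = 6$, $d_{\text{model}} = 1024$, $d_{\text{ff}} = 4096$) gives a rough sense of the savings; bias terms are ignored in this back-of-the-envelope sketch:

```python
n_enc = n_dec = 6
d_model, d_ff = 1024, 4096

# Sharing one FFN across the encoder removes (N_enc - 1) FFN copies:
enc_sharing_savings = (n_enc - 1) * 2 * d_model * d_ff
print(enc_sharing_savings)           # 41943040 (~42M parameters)

# Dropping the decoder FFN (d_ff' = 0) removes all N_dec copies
# and also removes their computation from the autoregressive loop:
dec_dropping_savings = n_dec * 2 * d_model * d_ff
print(dec_dropping_savings)          # 50331648 (~50M parameters)

# Widening the single shared encoder FFN to restore the original budget:
d_ff_wide = (n_enc + n_dec) * d_ff
print(d_ff_wide)                     # 49152, the One Wide FFN dimension
```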
Since different configurations have different impacts, we analyse the trade-off between model size, latency, and accuracy: (i) How many parameters can be shared/pruned with negligible (if any) accuracy degradation? (ii) Are the encoder and decoder FFNs affected similarly? (iii) Keeping the same model size, can the FFN parameters be allocated more efficiently?
We propose a novel configuration, which we call the One Wide FFN model, consisting of a single shared wide FFN on the encoder and no FFN on the decoder. To keep the number of parameters the same as in the baseline, we increase the shared FFN dimension accordingly: $\text{FFN}^{\text{enc}}_{\text{all}}$ with $d_{\text{ff}'} = (N_{\text{enc}} + N_{\text{dec}}) \times d_{\text{ff}}$. For completeness, we include similar experiments on the attention mechanism in Appendix B. These experiments show that, contrary to the FFN, individual layer-specific attention weights are more important and not as redundant, as sharing the attention leads to significant accuracy drops.

Representational Similarity
Besides investigating the impact on accuracy, we study the similarity between different models in terms of their internal representations and the semantic space they produce.
We use Linear Centered Kernel Alignment (CKA, Kornblith et al., 2019) to measure the similarity between the internal representations of different models. CKA uses inner products to estimate how similar the kernel matrices of two different representations are, and is based on the Hilbert-Schmidt Independence Criterion (HSIC, Gretton et al., 2005), a statistical measure of independence of two random variables. Linear CKA uses the dot product as a kernel and can be written as:

$$\text{CKA}(X, Y) = \frac{\lVert Y^\top X \rVert_F^2}{\lVert X^\top X \rVert_F \,\lVert Y^\top Y \rVert_F},$$

where $X \in \mathbb{R}^{n \times d_1}$ and $Y \in \mathbb{R}^{n \times d_2}$ are the column-centered activation matrices of the two models over the same $n$ inputs.

To measure the similarity between the semantic spaces of different models, we use Local Neighborhood Similarity (LNS, Boggust et al., 2022). Local neighborhood similarities have previously been used in analyzing semantic shifts in word embeddings (Hamilton et al., 2016). The premise of LNS is that two semantic spaces are similar if a sentence has similar neighbors in the two spaces. The LNS of a sentence $s$ between models 1 and 2 is defined as:

$$\text{LNS}(s) = \text{Sim}\big(k\text{-NN}_1(s),\; k\text{-NN}_2(s)\big),$$

where $k\text{-NN}_i(s)$ is the set of $k$ nearest neighbors of sentence $s$ for model $i$ and $\text{Sim}$ is the intersection-over-union (Jaccard similarity) of the two sets of neighbors. For each pair of components (attention and FFN) in models 1 and 2, we compute the LNS of all sentences in the evaluation dataset and take the mean LNS as our layer similarity measure. The smaller the value of $k$, the more local the neighborhoods we are comparing, and the more specific the retrieval task. We pick $k$ to be small enough to visually inspect sentence neighborhoods if necessary. In our analysis, we use cosine distance as the distance metric between activations and set $k$ to 5% of the dataset size (∼100 sentences).
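Both measures are straightforward to implement; below is a minimal NumPy sketch under our reading of the definitions above, where the rows of X and Y are per-sentence activations from the two models (function names are ours, not from the cited implementations):

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between activation matrices X (n x d1) and Y (n x d2),
    computed after centering the columns of each representation."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    numerator = np.linalg.norm(Y.T @ X, "fro") ** 2
    denominator = (np.linalg.norm(X.T @ X, "fro")
                   * np.linalg.norm(Y.T @ Y, "fro"))
    return numerator / denominator

def mean_lns(X: np.ndarray, Y: np.ndarray, k: int) -> float:
    """Mean LNS: Jaccard overlap of each sentence's k nearest neighbors
    (cosine distance) in the two representation spaces."""
    def knn_sets(Z):
        Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
        dist = 1.0 - Zn @ Zn.T
        np.fill_diagonal(dist, np.inf)          # exclude the sentence itself
        return np.argsort(dist, axis=1)[:, :k]
    nx, ny = knn_sets(X), knn_sets(Y)
    overlaps = [len(set(a) & set(b)) / len(set(a) | set(b))
                for a, b in zip(nx, ny)]
    return float(np.mean(overlaps))
```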

Experimental Setup
Data In our experiments, we show results on WMT22 English (EN) → German (DE) (296M pairs), which we obtained using the provided mt-data scripts, WMT16 EN → Romanian (RO) (610K pairs), and the multilingual setup of Pires et al. (2023), consisting of 10 languages: German, English, Spanish, French, Italian, Japanese, Korean, Portuguese, Swahili, and Chinese. In our analysis, we mostly focus on WMT22 EN→DE.
Following Schmidt et al. (2022), we use the WMT'16-provided scripts to normalize the RO side. For EN→RO, we keep diacritics, as they are needed to produce accurate translations; for more details, refer to Schmidt et al. (2022). For the multilingual experiments, we replicated the setup of Pires et al. (2023), which covers all details, including data preprocessing and dataset sizes.

Latency We report inference time in tokens/second (the higher, the better), averaged over 5 runs. For the multilingual models, we use the DE→EN test set. Our measurements were collected using a single NVIDIA V100 GPU on a single-threaded Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz with a batch size of 1, in order to realistically mimic the inference of a deployed model.

Metrics We report detokenized sacreBLEU throughout; for our main comparisons we additionally report CHRF and COMET, and for the multilingual experiments we report SPBLEU on Flores.
Tokenization For WMT22 EN→DE, we use SENTENCEPIECE (Kudo and Richardson, 2018), with a vocabulary size of 32K and a character coverage of 1.0, while for the multilingual experiments we use a vocabulary size of 250K and a character coverage of 0.9995. For WMT16 EN→RO we use byte-pair encoding (BPE, Sennrich et al., 2016) with 40,000 merge operations.

Model Architectures
We focus our analysis on the Transformer Big, where $N_{\text{enc}} = N_{\text{dec}} = 6$, $d_{\text{model}} = 1024$, $d_{\text{ff}} = 4096$, with 16 attention heads. We also report results on the Transformer Base ($N_{\text{enc}} = N_{\text{dec}} = 6$, $d_{\text{model}} = 512$, $d_{\text{ff}} = 2048$, and 8 attention heads), and a deep encoder shallow decoder (Kasai et al., 2021) Transformer Big with 12 encoder layers and 2 decoder layers. For our decoder-only experiments, the model is identical to the Transformer Big, except that all 12 layers are on the decoder. Our decoder-only model is similar to a Transformer-based language model, particularly the Prefix-LM (Raffel et al., 2020), where we apply a non-autoregressive mask on the source side and an autoregressive mask on the target.
Hyperparameters All experiments are implemented using FAIRSEQ (Ott et al., 2019). Our optimizer is ADAM (Kingma and Ba, 2015) with a learning rate of 0.0007. We train for 80k, 80k, and 150k steps on WMT22, WMT16, and the multilingual setup, respectively. We use 4000 warm-up steps and an inverse square root learning rate scheduler (Vaswani et al., 2017). We use a dropout rate of 0.1 for WMT22, 0.3 for WMT16, and 0 for the multilingual experiments due to the abundance of data, following Pires et al. (2023). All models are trained using fp16 (Ott et al., 2018).
Nomenclature In our experiments, we run a number of different configurations per model architecture that differ in the way the FFN is used, shared, or dropped, as well as in the size of the shared FFN ($d_{\text{ff}'}$). To facilitate our discussion, we introduce in Table 1 the nomenclature that will serve as reference for the rest of the text. Unless otherwise stated, the dimension of the shared $\text{FFN}^{*}_{\text{all}}$, i.e. $d_{\text{ff}'}$, is equal to the $d_{\text{ff}}$ of the original model.
For decoder-only models, only the SharedDec and NoDec configurations are defined. For conciseness, we drop the mention of FFN from the text when possible, i.e. SharedEnc instead of SharedEncFFN.

Representational Similarity We use the WMT22 EN→DE evaluation set for both the CKA and LNS analyses. We analyze encoder and decoder representations independently and present these metrics in a matrix heatmap plot showing pairwise similarity between layers. The diagonal of this matrix is the similarity of corresponding layers, i.e., layer $i$ in both architectures. In order to facilitate an "apples-to-apples" comparison across models, we extract decoder representations by force-decoding the (first) reference. We establish two crucial similarity scores: a benchmark on similarity for each of these metrics, where we train two additional models using the same architecture but with different random seeds; and a similarity lower bound, where we compare the baseline Transformer Big with a randomly initialized (i.e., untrained) model with the same architecture. We present these bounds in Appendix C.

Sharing FFNs
The results of various FFN sharing configurations are summarized in Table 2. While we focus on sharing one FFN for all layers within a module, we also compare with sharing multiple FFNs, following Takase and Kiyono (2023), in Appendix A. We find that sharing one FFN is as accurate as sharing multiple FFNs within a module, while being more parameter-efficient.

Dropping FFNs
Table 3 summarizes the performance of models with no FFNs. Besides BLEU and the number of parameters, we report the inference speed for each architecture. Dropping the FFN on the encoder (NoEnc) leads to a 0.9 BLEU point drop while reducing the parameter count by 22%, with minimal effect on inference speed. Dropping the FFN on the decoder (NoDec), on the other hand, causes a degradation of only 0.4 BLEU points while increasing the inference speed by 20%. The highest latency reduction is obtained by removing the FFNs on both the encoder and the decoder (NoEncNoDec), but it comes with a significantly larger degradation of over 2 BLEU points.

Combining sharing and dropping These results, together with those from Table 2, suggest that the encoder and decoder FFNs have different contributions: the decoder's are more redundant, corroborating previous work on FFN parameterization (Ge et al., 2022). With this in mind, we experiment with one shared FFN on the encoder while dropping it on the decoder, reported as SharedEncNoDec in Table 3. As shown, we observe a 41% reduction in the number of parameters and a 22% improvement in inference speed, at the cost of 1.0 BLEU point.

One Wide FFN Model
Previous sections describe models that share and/or drop FFNs, effectively reducing model size at some modest accuracy cost. In this section, we investigate whether we can regain the accuracy lost while preserving the parameter efficiency and the latency reduction. We focus on the SharedEncNoDec model as it provides a strong baseline with significant parameter savings and inference speedups.
We propose increasing the dimension of the shared FFN to match the number of parameters of the original (fully-parameterized) model, so as to avoid increasing the overhead of model storage. In particular, SharedEncNoDec saves around 41% of the parameters, as there is one single shared FFN in the encoder and none in the decoder. On the other hand, the Transformer Big has $(N_{\text{enc}} + N_{\text{dec}})$ FFNs. Thus, we match the size of the original model by setting the dimension of the shared FFN, $d_{\text{ff}'}$, to $(N_{\text{enc}} + N_{\text{dec}}) \times d_{\text{ff}}$.

Table 4 summarizes our results. It includes our proposed model, the One Wide FFN model ($d_{\text{ff}'} = 49{,}152$), as well as the baseline Transformer Big and the corresponding SharedEncNoDec ($d_{\text{ff}'} = 4{,}096$). It also includes a wide model with $d_{\text{ff}'} = 24{,}576$, which uses the same number of parameters as NoDec. This model achieves an accuracy on par with (or slightly above) the baseline Transformer Big with 20% fewer parameters and a significant inference speed-up.
Our proposed model with $d_{\text{ff}'} = 49{,}152$ goes beyond that, achieving a gain of 1.2 BLEU points over the vanilla SharedEncNoDec and 0.9 BLEU points over the Transformer Big. These gains remain consistent across CHRF and COMET. Furthermore, they come while maintaining a similar inference speed to the SharedEncNoDec model. For completeness, we include a wider model with $d_{\text{ff}'} = 98{,}304$. Despite the extra capacity, this model does not provide any additional accuracy gains, which we suspect is due to the lack of data to train a model this big.

Analyzing Internal Representations
We now report a post-hoc analysis of the internal representations of the models introduced in the preceding sections. Our objectives are twofold: 1) to ascertain whether the proposed models' internal representations exhibit a significant degree of similarity to those of the original base model; 2) to delve into the impact of the proposed methods on redundancy. We adopt the definition of redundancy of Dalvi et al. (2020), who visually inspect the similarity between adjacent modules within a model (high similarity entails high redundancy).


Similarity to Baseline
We ground the pairwise similarity metrics by normalizing them against a benchmark. As mentioned in Section 3, we establish the benchmark scores by training two additional Transformer Big models, but using different random seeds. These models achieve similar accuracy to the baseline model (see Appendix C.1 for more details). The benchmark score is the similarity between the baseline and these models. Because the benchmark is calculated by averaging similarity scores from different training runs of our baseline, individual runs can have a normalized score above 100%.
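Concretely, the normalization is just a ratio against the benchmark; a trivial sketch, with illustrative numbers:

```python
def normalized_score(raw_sim: float, benchmark_sim: float) -> float:
    """Normalize a model-vs-baseline similarity by the benchmark score
    (the mean similarity between reseeded runs of the baseline)."""
    return 100.0 * raw_sim / benchmark_sim

# An individual reseeded run can exceed the averaged benchmark,
# which is why normalized scores above 100% are possible:
print(normalized_score(0.97, 0.96))  # ~101.04
```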
Table 5 shows normalized similarity scores for several models. Under the Encoder columns we compare the encoder representations, and under the Decoder columns we compare decoder representations. Sharing FFNs leads to consistently lower (normalized) similarity scores than not sharing, both in terms of internal representations (CKA) and semantic spaces (LNS). As shown, although models that share FFNs have lower similarity scores compared to those that do not, the scores are still very close to 100%. Moreover, these decreases align with the drops in BLEU seen in Table 2, where the model with the lowest similarity score (SharedEncDec) is also the least accurate model. We observe a similar trend for dropping FFNs, where models that drop FFNs exhibit lower similarity scores than those that do not, but the scores are not too far from the benchmark scores.
For completeness, we report in the last row the similarity scores for the One Wide FFN model, which is more accurate than the base model. The internal representations generated by that model diverge from those of the base model. Interestingly, we observe a larger drop in LNS scores than in CKA scores, indicating that the shift occurs mostly in the semantic space, rather than the Euclidean space captured by CKA. For a detailed layer-wise similarity analysis that breaks out the aggregate analysis in Table 5, see Appendix C.2.

A Qualitative View of Redundancy
We now study the impact of our One Wide FFN model on the redundancy of the internal representations. In addition to adopting Dalvi et al. (2020)'s definition of redundancy, we also adopt their method of computing self-similarity, namely looking at how the representations change as they go through each module (self-attention, FFN, or cross-attention) of the model. In particular, we use CKA to compute the similarity between the outputs of different modules within the same model, as sketched below.
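Concretely, this amounts to filling a symmetric matrix with pairwise CKA scores between module outputs; a sketch reusing the linear_cka helper from our earlier snippet:

```python
import numpy as np

def self_similarity_matrix(module_outputs):
    """CKA between the outputs of every pair of modules (self-attention,
    cross-attention, FFN) within one model; each entry of module_outputs
    is an (n_sentences x d_model) activation matrix. The diagonal is 1."""
    m = len(module_outputs)
    sim = np.eye(m)
    for i in range(m):
        for j in range(i + 1, m):
            sim[i, j] = sim[j, i] = linear_cka(module_outputs[i],
                                               module_outputs[j])
    return sim
```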
In Figure 1a, we show the CKA self-similarity matrices for the encoders of the One Wide FFN model and the Transformer Big. We do the same for the decoders in Figure 1b. These matrices show how similar each module of the network is to every other module within that network. The diagonal of the matrix is the similarity between a module and itself and is always 1.
As shown, there is high similarity between adjacent modules of the Transformer Big, both on the encoder and the decoder, indicated by areas of darker red around the diagonal. The prevalence of high-similarity patterns among adjacent modules suggests a substantial degree of redundancy, where eliminating a module has a negligible impact on the final representations. On the other hand, we observe a distinct checkerboard pattern in the self-similarity matrices of the One Wide FFN model, where individual modules tend to exhibit lower similarity with their immediate neighbors than with their second neighbors (i.e., the neighbors of the neighbors). On the encoder, the checkerboard pattern emerges especially in the earlier modules, while on the decoder that pattern appears more consistently throughout the layers. This pattern indicates that our model is learning non-trivial transformations of the input, leading to decreased redundancy within the network.

Other Architectures and Languages
So far, all our experiments have focused on the Transformer Big and on WMT22 EN→DE. In this section, we apply what we learned to other architectures and language pairs. We run experiments on the low-resource language direction EN→RO and on a large-scale multilingual model.
For EN→DE, we apply our proposal to a Transformer Base model, a Deep Encoder Shallow Decoder model (Kasai et al., 2021), and a Decoder-Only model, as shown in Table 6. For the Transformer Base, we observe an accuracy gain of 0.5 BLEU (2.2 BLEU over the vanilla SharedEncNoDec model) and an inference speedup of around 25%. In the Deep Encoder Shallow Decoder model, we observe a more modest accuracy gain of 0.2 BLEU points (0.9 BLEU over the vanilla SharedEncNoDec model). However, the inference speedup from dropping the decoder FFNs is minimal (< 1%), which is expected because of the small depth of the decoder in this architecture.
Decoder-only models With the advent of Large Language Models (LLMs) like GPT (Brown et al., 2020) and PaLM (Chowdhery et al., 2022), a lot of effort has been put into decoder-only Transformer models. We train a decoder-only model on WMT22 EN→DE, as shown in Table 6. Due to the absence of an encoder, we are limited to applying a wide FFN on the decoder side. As in the other setups, we get an accuracy gain of +0.3 BLEU over the baseline decoder-only model (+1.7 BLEU over SharedDec), but the latency degrades by 12%. This is not surprising: due to the autoregressive nature of the decoder, increasing the size of its FFN has a bigger impact on speed.
Low-resource languages On EN→RO, the accuracy of the One Wide FFN model is only on par with the base model, even though it is higher than that of the vanilla SharedEncNoDec model. We hypothesize that, due to the low-resource condition, our proposed model already reaches saturation, as there are not that many salient textual patterns to be learned by the FFN.
Multilingual Finally, we observe a similar trend in the multilingual setup, where the One Wide FFN model is +1.2 SPBLEU points more accurate than the baseline Transformer Big and +2.5 SPBLEU points more accurate than the vanilla SharedEncNoDec; this gain is significant in 79 out of 90 directions and when all test sets are concatenated. Additionally, this large accuracy gain comes with an inference speed-up of around 18%, consistent with our previous results.

Related Work
Weight pruning and parameter sharing are well-known techniques to reduce a model's footprint.
Neuron pruning methods often focus on finding and pruning redundant neurons through correlation methods (Dalvi et al., 2020). Other work shows that Transformer components like multi-head attention can be pruned significantly, owing to model redundancy in the encoder or decoder, either by checking gradient saliency (Michel et al., 2019) or through a differentiable relaxation of $\ell_0$ regularization at training time (Voita et al., 2019).
For parameter sharing, the Universal Transformer (Dehghani et al., 2019) proposed a model where all layers are shared (i.e., in effect it reduced the model to a single shared layer). Takase and Kiyono (2023) propose finding an optimal configuration of shared layers in the encoder or decoder through different methods of sharing (in sequence, in cycle, or in reversed cycle, i.e. sharing starting from the top), always keeping a specified number of final layers. Similarly, Reid et al. (2021) propose an approach where just the middle layers are shared, while the bottom and top layers are independent, using a lower dimensionality for the embedding layer. Analogously, Ge et al. (2022) focus on minimizing the number of parameters and the number of calls to each parameter group in order to optimise on-device models. They achieve this by sharing the encoder and decoder in a similar way to both previous methods, particularly by sharing all layer parameters in a cycle, like Takase and Kiyono (2023).
Previous works also focus on reducing the dimensionality of certain parameters, mostly through low-rank factorization. Lan et al. (2020) decompose the embedding layer into a lower-rank embedding matrix and a projection to the actual hidden size, while also sharing all parameters across all layers. In addition to sharing parameters efficiently, Ge et al. (2022) propose a lightweight decomposition of the FFN where, instead of a single component, there are two projections with a smaller dimensionality than in vanilla Transformers. Our work is closest to Ge et al. (2022), but instead of factorizing we explore sharing and fully pruning the FFN. In contrast with previous works, we also explore increasing the encoder FFN size while dropping the decoder's completely.

Conclusion
In this work, we studied the importance of the FFN in Transformer models. We analyzed the impact of removing and/or sharing the FFN across layers and found that, due to this component's redundancy, model size can be substantially reduced with little impact on accuracy for Machine Translation. In particular, we found that sharing the FFN across all encoder layers and removing it from the decoder layers, while increasing the dimension ($d_{\text{ff}'}$) of the encoder FFN, leads to models that are more accurate and faster at inference.
Our findings are applicable across multiple settings, including decoder-only and multilingual models. In a low-resource setting the results are more modest, but our approach can still recover the baseline's performance with faster inference.
Finally, we conducted a thorough similarity analysis between the vanilla Transformer and our proposed architectures, and found that the latter's internal representations do not differ significantly from the former's, except in that they are less redundant.

Limitations
In this work, our focus was Machine Translation. Although we expect the results to generalize to other sequence-to-sequence tasks, further experiments are needed, which we leave for future work.

A Custom Sharing of Multiple FFNs
There is a combinatorial number of ways of sharing $M < N$ FFNs within a module of $N$ layers. Since exploring all of them is prohibitive, we investigate the following strategies from Takase and Kiyono (2023), sketched in code below:

• Sequence: assign one FFN to every $N/M$ consecutive layers, forming a block pattern.

• Cycle: stack $M$ FFNs and repeat the stack in identical order, forming a repetitive pattern.

• Cycle (Rev): stack $M$ FFNs in identical order, then repeat them in reverse order, forming a palindrome series.

Note that we assume that $N$ is an even number and divisible by $M$. Cycle (Rev) is only valid for $M = N/2$. The EdgeFormer (Ge et al., 2022) adopts Cycle with $M = 2$ for the encoder FFNs.
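A short sketch of the layer-to-FFN assignments these strategies produce (indices and names are ours):

```python
def assign_ffns(n_layers: int, m_ffns: int, strategy: str):
    """Map each of N layers to one of M shared FFN indices."""
    if strategy == "sequence":       # blocks of N/M consecutive layers
        block = n_layers // m_ffns
        return [i // block for i in range(n_layers)]
    if strategy == "cycle":          # repeat the stack of M FFNs
        return [i % m_ffns for i in range(n_layers)]
    if strategy == "cycle_rev":      # palindrome; requires M = N/2
        first = list(range(m_ffns))
        return first + first[::-1]
    raise ValueError(f"unknown strategy: {strategy}")

print(assign_ffns(6, 3, "sequence"))   # [0, 0, 1, 1, 2, 2]
print(assign_ffns(6, 3, "cycle"))      # [0, 1, 2, 0, 1, 2]
print(assign_ffns(6, 3, "cycle_rev"))  # [0, 1, 2, 2, 1, 0]
```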
Table 7 shows the results of these strategies applied to the encoder. As references, we copy the results of the Transformer Big and SharedEnc from Table 2. Not only is the accuracy of SharedEnc similar to that of Takase and Kiyono (2023)'s strategies, but it also uses fewer parameters and is easier to extend.

B Sharing or Dropping Attention
We report the results of sharing attention modules (either self, cross or both) across layers in Table 8.

C Details on Internal Representations Analysis
C.1 Raw Similarity Scores for Benchmarking

We establish a benchmark score for the expected similarity of our two metrics by comparing the baseline Transformer Big with identical models trained from different random seeds. Table 9 presents the raw similarity scores from which we compute the normalized scores presented in Table 5.
As shown, the similarity between differently-seeded runs of the baseline is high, and well above the lower bound obtained with an untrained model.

C.2 Layer-wise Analysis
In Table 5, we report the aggregated similarity scores across all layers of the Transformer encoder and decoder. Here, we report more fine-grained, layer-wise similarity scores, mostly to showcase the reliability of the aggregated scores. In Figure 2, we plot layer-wise LNS to study how similar the semantic information captured at each layer is to that of the baseline model at every layer. When LNS scores are high, the network is producing similar local neighborhoods for each sentence in our evaluation set. In particular, we are interested in comparing the benchmark LNS scores and those of SharedEncSharedDec at each layer. As shown, the layer-wise LNS scores of SharedEncSharedDec track the baseline scores at almost every layer, confirming the reliability of the aggregated score. We observe a similar pattern for all the models that we evaluate in this paper.

Figure 1: Self-similarity structure of encoder and decoder layers of the One Wide FFN model vs. the Transformer Big baseline.

Table 2: sacreBLEU results on WMT22 EN→DE for different FFN sharing configurations. |θ| is the number of parameters.

Table 3: sacreBLEU results on WMT22 EN→DE for different FFN dropping configurations.

Table 4: Accuracy of One Wide FFN for Transformer Big EN→DE on WMT22. † implies the system is statistically significantly different at p < 0.05.


Table 6: Accuracy of One Wide FFN for EN→DE with Transformer Base, Decoder-Only, and Deep Encoder Shallow Decoder on WMT22; for low-resource EN→RO with the Base version on WMT16; and for the multilingual setup with Transformer Big on Flores. † implies the system is statistically significantly different at p < 0.05.

Table 7: Accuracy of different FFN sharing strategies on WMT22 EN→DE.

Table 8: BLEU scores on WMT22 EN→DE when sharing the attention of both encoder and decoder (self and cross). Nomenclature follows Section 3, but with Self Attn and Cross Attn as the encoder/decoder's self-attention and cross-attention (decoder), respectively.

Table 9: Raw similarity of the representations (%) of corresponding layer-modules of different architectures vs. the Transformer Big for WMT22 EN→DE. For NoDec configurations, we compare the final output of the Transformer layer as a whole, as they have different sub-modules. The columns for shared and dropped FFNs are highlighted in gray and blue, respectively.