What’s Hidden in a One-layer Randomly Weighted Transformer?

We demonstrate that, hidden within one-layer randomly weighted neural networks, there exist subnetworks that can achieve impressive performance on machine translation tasks without ever modifying the weight initializations. To find subnetworks of one-layer randomly weighted neural networks, we apply different binary masks to the same weight matrix to generate different layers. Hidden within a one-layer randomly weighted Transformer, we find subnetworks that can achieve 29.45/17.29 BLEU on IWSLT14/WMT14. Using a fixed pre-trained embedding layer, the found subnetworks are smaller than, but can match 98%/92% (34.14/25.24 BLEU) of the performance of, a trained Transformer_small/base on IWSLT14/WMT14. Furthermore, we demonstrate the effectiveness of larger and deeper Transformers in this setting, as well as the impact of different initialization methods.


Introduction
Modern deep learning often trains millions or even billions of parameters (Devlin et al., 2018; Shoeybi et al., 2019; Raffel et al., 2019; Brown et al., 2020) to deliver good performance. Frankle and Carbin (2018) demonstrated that these over-parameterized networks contain sparse subnetworks that, when trained in isolation, can achieve performance similar to or better than the original model. Furthermore, recent studies revisit the initialization stage of finding these subnetworks in vision models. The mask that selects such a subnetwork out of the entire network, without any training of the weights, is referred to as a "Supermask." That is to say, subnetworks of a randomly weighted neural network (NN) can achieve competitive performance, which may act as a good "prior" (Gaier and Ha, 2019) and connects to the long history of leveraging random features (Gamba et al., 1961; Baum, 1988) and/or random kernel methods (Rahimi and Recht, 2008, 2009) in machine learning. Here, we examine the following question: how does a fully randomized natural language processing (NLP) model perform in the multi-layer setting, and particularly in the (so far under-explored) one-layer setting? We release the source code at https://github.com/sIncerass/one_layer_lottery_ticket.
In this work, we first validate that there exist subnetworks of standard randomly weighted Transformers (Reservoir Transformers; Shen et al., 2021) that can perform competitively with fully-weighted alternatives on machine translation and natural language understanding tasks. With 50% of the randomized weights remaining, we find subnetworks that reach 29.45/17.29 BLEU on IWSLT14/WMT14, respectively. We also investigate the special case of finding subnetworks in one-layer randomly weighted Transformers (see Fig. 1). To obtain these subnetworks, we repeatedly apply the same randomized Transformer layer several times with different Supermasks. The resulting subnetwork of a one-layer randomly weighted Transformer performs similarly to its multi-layer counterpart while having a 30% lower memory footprint. We also study the impact of different depths/widths of Transformers, along with the effectiveness of two initialization methods. Finally, using pre-trained embedding layers, we find that the subnetworks hidden in a one-layer randomly weighted Transformer_wide/wider are smaller than, but can match 98%/92% of the performance of, a trained Transformer_small/base on IWSLT14/WMT14. We hope our findings can offer new insights for understanding Transformers.

Related Work
Lottery Tickets Hypothesis. Frankle and Carbin (2018) found that NNs for computer vision contain subnetworks that can be effectively trained from scratch when reset to their initialization. Subsequent works demonstrated that so-called winning tickets can achieve good performance even without training, where the mask that selects the subnetwork at initialization is called a "Supermask." In NLP, previous works find that matching subnetworks exist early in training with Transformers (Yu et al., 2019), LSTMs (Renda et al., 2020), and fully pre-trained BERT (Prasanna et al., 2020) or Vision-and-Language models (Gan et al., 2021), but not at initialization.

Random Features. In the early days of neural networks, fixed random layers (Baum, 1988; Schmidt et al., 1992; Pao et al., 1994) were studied in reservoir computing (Maass et al., 2002; Jaeger, 2003; Lukoševičius and Jaeger, 2009), "random kitchen sink" kernel machines (Rahimi and Recht, 2008, 2009), and related approaches. Recently, random features have also been extensively explored for modern neural networks in deep reservoir computing networks (Scardapane and Wang, 2017; Gallicchio and Micheli, 2017; Shen et al., 2021), random kernel features (Peng et al., 2021; Choromanski et al., 2020), and applications in text classification (Conneau et al., 2017; Wieting and Kiela, 2019), summarization (Pilault et al., 2020), and probing (Voita and Titov, 2020).

Compressing Transformers. A wide range of neural network compression techniques have been applied to Transformers. These include pruning (Fan et al., 2019; Michel et al., 2019; Sanh et al., 2020; Yao et al., 2021), where parts of the model weights are dropped; parameter sharing (Lan et al., 2020; Dehghani et al., 2018; Bai et al., 2019), where the same parameters are used in different parts of a model; quantization, where the weights of the Transformer model are represented with fewer bits; and distillation (Sun et al., 2020; Jiao et al., 2020), where a compact student model is trained to mimic a larger teacher model. To find the proposed subnetworks at initialization, we develop our method in the spirit of parameter sharing and pruning.

Methodology
Finding a Supermask for a Randomly Weighted Transformer. In a general pruning framework, denote the weight matrix as W ∈ R^{d×d} (W could be a non-square matrix), the input as x ∈ R^d, and the network as f(x; W). A subnetwork is defined as f(x; W ⊙ M), where M ∈ {0, 1}^{d×d} is a binary matrix and ⊙ is the element-wise product. To find a subnetwork of a randomly weighted network, M is trained while W is kept at its random initialization. Following Ramanujan et al. (2020), denote S ∈ R^{d×d} as the importance score matrix associated with W, which is learnable during training. We keep the top-k percent of weights ranked by their importance scores in S to compute M, i.e., M = Top_k(S), where Top_k sets the entries of M corresponding to the top-k% scores in S to 1 and the remaining entries to 0. Note that Top_k is a non-differentiable function. To enable training of S, we use the straight-through gradient estimator (Bengio et al., 2013), in which Top_k is treated as the identity in backpropagation. During inference, we simply construct and store the binary Supermask M and the floating-point W, while S can be dropped.

One-layer randomly weighted Transformer. We use the Transformer architecture (see Vaswani et al. (2017) for more details). For a general randomly weighted Transformer model with Supermasks, there exist M_l and W_l for all layers l ∈ {1, ..., L}. Due to the layer-stacking structure of Transformers, all W_l have the same shape and use the same initialization method. This leads to an unexplored question: "What's hidden in a one-layer (instead of L-layer) randomly weighted Transformer?" Let us use a toy example to explain why there is no need for L redundant W_l. Assume that, for a randomly weighted matrix W_l, the probability that it contains a "good" subnetwork is p. Furthermore, assume that the events that different layers contain "good" subnetworks are independent. Then, for L different layers, the probability that all W_l contain "good" subnetworks is p^L. Meanwhile, since W_1 follows the same initialization method as W_l, the probability that W_1 contains a "good" subnetwork for the l-th layer is also p. Thus, the probability that W_1 can generate "good" subnetworks for all L layers is also p^L. In this paper, we therefore investigate the scenario where one randomized layer is applied L times repeatedly with L different Supermasks. As a result, this reduces the memory footprint, since all Supermasks can be stored in binary format.
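The sketch below illustrates this procedure in PyTorch. It is a minimal illustration of the method described above, not the released implementation; names such as TopKMask, MaskedLinear, keep_ratio, and the toy dimensions are our own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMask(torch.autograd.Function):
    """Binarize scores S by keeping the top-k% entries; straight-through backward."""

    @staticmethod
    def forward(ctx, scores, keep_ratio):
        flat = scores.flatten()
        k = max(1, int(keep_ratio * flat.numel()))
        # The k-th largest value is the (n - k + 1)-th smallest value.
        threshold = torch.kthvalue(flat, flat.numel() - k + 1).values
        return (scores >= threshold).to(scores.dtype)

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator (Bengio et al., 2013): treat Top_k as identity.
        return grad_output, None


class MaskedLinear(nn.Module):
    """Linear layer with a frozen random weight W and a learnable score matrix S."""

    def __init__(self, shared_weight, keep_ratio=0.5):
        super().__init__()
        self.weight = shared_weight        # frozen random W, shared across layers
        self.scores = nn.Parameter(0.01 * torch.randn_like(shared_weight))
        self.keep_ratio = keep_ratio

    def forward(self, x):
        mask = TopKMask.apply(self.scores, self.keep_ratio)   # Supermask M
        return F.linear(x, self.weight * mask)                # f(x; W ⊙ M)


# One random weight matrix reused by L "layers", each with its own Supermask.
d, L = 512, 6
shared_w = nn.Parameter(torch.empty(d, d), requires_grad=False)
nn.init.kaiming_uniform_(shared_w)

layers = nn.ModuleList([MaskedLinear(shared_w, keep_ratio=0.5) for _ in range(L)])

x = torch.randn(8, d)
for layer in layers:      # only the score matrices S receive gradients
    x = torch.relu(layer(x))
```

Only the score matrices S are updated by the optimizer; the single random weight matrix stays frozen and is reused by every layer, with only the binary masks differing between layers.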

Experiments
Model Architecture.
For model architectures, we experiment with Transformer_small and Transformer_base, following the same setting as in prior work: 6 encoder layers and 6 decoder layers on IWSLT14 and WMT14. We also vary the depth and width of the Transformer model on the machine translation tasks. On IWSLT14, we use 3 different random seeds and plot the mean performance ± one standard deviation. All the embedding layers (including the final output projection layer) are also randomized and pruned unless otherwise specified. Moreover, in all figures, the "fully-weighted model" denotes the standard full model (all weights remaining).

Machine Translation results. In Fig. 2, we present results for directly pruning a randomly weighted Transformer on the IWSLT14 and WMT14 tasks. Specifically, we vary the ratio of remaining parameters in the randomized model.
As can be seen, there is no significant performance difference between a one-layer random Transformer and a 6-layer standard random Transformer across different percentages of remaining weights on IWSLT14 and WMT14. We also observe that letting the percentage of remaining randomized weights approach 0 or 100 leads to the worst performance across settings. This is expected, since the outputs will be random when we keep 100% of the randomized weights, and the model will not perform well when only a small fraction of the weights remains unpruned (close to 0%). The best-performing subnetwork of a one-layer randomized Transformer retains 50% of the weights. This is connected to the search space of the employed method: when choosing σ% out of 100% of the randomized weights, the number of candidate masks over n weights is the binomial coefficient C(n, σn/100), which is maximized at σ = 50, i.e., σ = 50 yields the largest search space.

Effectiveness of Pre-trained Embedding layers. Embedding layers are critical since they can be viewed as the inputs of an NLP model, analogous to image pixels in vision. Plenty of prior studies have explored how to obtain pre-trained embeddings in an unsupervised way (Mikolov et al., 2013; Pennington et al., 2014). We experiment with the practical setting where we have access to encoder/decoder embedding layers pre-trained from the public checkpoints in fairseq (https://github.com/pytorch/fairseq/), and we present the results in Fig. 3. We observe a significant performance boost for the one-layer randomized Transformer across different remaining-weight ratios. The difference is much larger for the bigger WMT14 dataset (around +3.0 BLEU for WMT14 and +1.0 BLEU for IWSLT14). The best one-layer randomized Transformer reaches 89%/74% of the fully-weighted Transformer performance on IWSLT14/WMT14, respectively.

Effectiveness of Depth and Width. In Tab. 1, we report the parameter size, BLEU score, and memory size of different one-layer randomized Transformers with 50% remaining weights, where Trans_deep/deeper are the 12-encoder/decoder-layer variants of Trans_small/base, and Trans_wide/wider have 2x the hidden size of Trans_small/base. The results are gathered with pre-trained encoder/decoder embedding layers: we use the fairseq checkpoints of Trans_base/big on WMT14 and of Trans_small on IWSLT14 to obtain the pre-trained embedding layers for one-layer Trans_base/wider and one-layer Trans_small; for one-layer Trans_wide on IWSLT14, we pre-train a fully-weighted model and then dump its embedding layer; Trans_deep/deeper share the same embeddings as Trans_small/base. Either increasing the depth or enlarging the width can improve the performance of our one-layer random Transformer.
In particular, the deeper Transformer can already achieve 79%/90% of the fully-weighted baseline models on WMT14/IWSLT14, respectively. For the wider models, these numbers increase to 92%/98%. This is mainly due to the larger search space introduced by the larger weight matrix. Another important point is that even when we increase the depth or enlarge the width of the model, the total memory consumption is actually smaller than that of the standard baseline, since we only have one repeated layer and all the masks can be stored in 1-bit format. Furthermore, we explore the effect of different ratios of remaining parameters for different models on IWSLT14 in Fig. 4. As can be seen, the wider model always performs better than the standard one across all settings. However, for the deeper model, there is a sharp transition at 50%-60% remaining parameters. The reason is that, since our deeper model is twice as deep as the original, retaining more random parameters (>50%) significantly decreases the probability that a layer contains a "good" subnetwork. This leads the final probability to be p_smaller^{2L} (with p_smaller < p), which is much smaller than p^L (see Section 3).

Different Initialization. Weight initialization is one of the critical components for the success of random features (Wieting and Kiela, 2019; Shen et al., 2021). We experiment with the Kaiming uniform (Ramanujan et al., 2020) and Xavier uniform (Vaswani et al., 2017) initialization methods, and we scale the standard deviation by 1/σ when we retain a fraction σ of the randomized weights. As shown in Fig. 5, the performance of the one-layer randomized Transformer decreases when we switch to Xavier uniform. The degradation becomes larger when more randomized weights are retained in the network.
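As a concrete (assumed) reading of the scaling described above, the snippet below samples a frozen random weight with either initializer and rescales it so that its standard deviation grows by 1/σ when a fraction σ of the weights is kept; init_random_weight is an illustrative helper of our own, not the authors' code.

```python
import torch
import torch.nn as nn


def init_random_weight(shape, method="kaiming_uniform", keep_ratio=0.5):
    """Sample a frozen random weight and rescale its std by 1/keep_ratio."""
    w = torch.empty(shape)
    if method == "kaiming_uniform":
        nn.init.kaiming_uniform_(w)
    elif method == "xavier_uniform":
        nn.init.xavier_uniform_(w)
    else:
        raise ValueError(f"unknown init method: {method}")
    # Multiplying every entry by 1/keep_ratio scales the standard deviation by
    # 1/sigma, as described in the text (our reading: it compensates for the
    # weights that the Supermask will prune away).
    return w / keep_ratio
```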
QQP and MNLI results. On QQP and MNLI, we experiment with RoBERTa_small and RoBERTa_large, following the same training setup. We use the pre-trained embedding layers of RoBERTa_base/large. In Fig. 6 and 7, we show consistent results on QQP and MNLI, except that the best-performing one-layer randomly weighted RoBERTa is obtained when we retain 70% of the randomized weights; it reaches 79%/91% of the fully-weighted RoBERTa_base accuracy on QQP and MNLI, respectively. The performance approaches 84%/92% of the aforementioned fully-weighted model performance when using the larger hidden size of the one-layer randomly weighted RoBERTa_large.
Implementation Details. We evaluate on IWSLT14 de-en (Cettolo et al., 2015) and WMT14 en-de (Bojar et al., 2014) for machine translation, and on QQP (Iyer et al., 2017) and MultiNLI-matched (MNLI) (Williams et al., 2017) for natural language understanding. We use 8 Volta V100 GPUs for WMT and one V100 for IWSLT, QQP, and MNLI. The hyperparameters for training a one-layer randomized Transformer on IWSLT14 and WMT14 were set to the best-performing values used to train the fully-weighted Transformer. The QQP and MNLI experiments followed the same practice.

Conclusions
In this paper, we validate the existence of effective subnetworks in one-layer randomly weighted Transformers on machine translation tasks. Hidden within a one-layer randomly weighted Transformer_wide/wider with fixed pre-trained embedding layers, we find subnetworks that are smaller than, but can competitively match the performance of, a trained Transformer_small/base on IWSLT14/WMT14.