Convolutions and Self-Attention: Re-interpreting Relative Positions in Pre-trained Language Models

In this paper, we detail the relationship between convolutions and self-attention in natural language tasks. We show that relative position embeddings in self-attention layers are equivalent to recently-proposed dynamic lightweight convolutions, and we consider multiple new ways of integrating convolutions into Transformer self-attention. Specifically, we propose composite attention, which unites previous relative position encoding methods under a convolutional framework. We conduct experiments by training BERT with composite attention, finding that convolutions consistently improve performance on multiple downstream tasks while replacing absolute position embeddings. To inform future work, we present results comparing lightweight convolutions, dynamic convolutions, and depthwise-separable convolutions in language model pre-training, considering multiple injection points for convolutions in self-attention layers.


Introduction
In recent years, Transformer-based language models have brought dramatic improvements on a wide range of natural language tasks (Brown et al., 2020; Devlin et al., 2019). The central innovation of Transformer architectures is the self-attention mechanism (Vaswani et al., 2017), which has grown beyond NLP, extending into domains ranging from computer vision (Dosovitskiy et al., 2021) and speech recognition (Dong et al., 2018) to reinforcement learning (Parisotto et al., 2020; Touvron et al., 2020).
In computer vision, self-attention and convolutions have been combined to achieve competitive results for image classification (Bello et al., 2019). Similarly, researchers in NLP have begun integrating convolutions into self-attention for natural language tasks. Recent work has shown initial success adding convolutional modules to self-attention in pre-trained language models (Jiang et al., 2020), or even replacing self-attention entirely with dynamic convolutions (Wu et al., 2019). These successes defy theoretical proofs showing that multi-headed self-attention with relative position embeddings is strictly more expressive than convolution (Cordonnier et al., 2020). To identify why convolutions have been successful in NLP, we seek to isolate the differences between self-attention and convolution in the context of natural language.
In this work, we formalize the relationship between self-attention and convolution in Transformer encoders by generalizing relative position embeddings, and we identify the benefits of each approach for language model pre-training. We show that self-attention is a type of dynamic lightweight convolution, a data-dependent convolution that ties weights across input channels (Wu et al., 2019). Notably, previous methods of encoding relative positions (Shaw et al., 2018;Raffel et al., 2020) are direct implementations of lightweight convolutions. Under our framework, the benefits of convolution come from an ability to capture local position information in sentences.
Then, we propose composite attention, which applies a lightweight convolution that combines previous relative position embedding methods. We find that composite attention sufficiently captures the information provided by many other convolutions. To validate our framework, we train BERT models that integrate self-attention with multiple convolution types, evaluating our models on the GLUE benchmark (Wang et al., 2018). All of our convolutional variants outperform the default model, demonstrating the effectiveness of convolutions in enhancing self-attention for natural language tasks. Our empirical results provide evidence for future research integrating convolutions and self-attention for NLP.

Figure 1: Attention weights α_ij are analogous to convolution kernel weights β_{j−i}.

Self-attention and lightweight convolutions
First, we outline the relationship between self-attention and convolutions. Specifically, we show that a self-attention operation can be viewed as a dynamic lightweight convolution, a depthwise convolution that ties weights along channels (Wu et al., 2019). We then isolate the differences between self-attention and lightweight convolutions, highlighting the benefits of each approach in language models.

Self-attention
In a Transformer self-attention layer, inputs x_1, ..., x_n ∈ R^d are projected to corresponding queries, keys, and values by linear transformations W_Q, W_K, W_V ∈ R^{d×d_h} for each attention head, projecting into the head dimension size d_h. Output vectors y_1, ..., y_n ∈ R^d are linear combinations of values, concatenating all attention heads. Value weights (before the softmax) are determined by:

α_ij ∝ (x_i W_Q)(x_j W_K)^T / √d_h    (1)

Intuitively, α_ij represents the attention that token i pays to token j, incorporating the value x_j W_V into the resulting vector y_i. From the attention scores between various tokens i and j, an attention map of α_ij is produced (see Figure 1).
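As a concrete reference, the scoring and weighted-sum steps above can be sketched in pure Python. This is a minimal single-head sketch with illustrative weight matrices, not the paper's implementation; the output projection and multi-head concatenation are omitted.

```python
import math

def self_attention(xs, Wq, Wk, Wv):
    """Single-head self-attention over a list of d-dimensional token vectors.

    Wq, Wk, Wv are d x d_h weight matrices given as nested lists."""
    def matvec(W, x):
        # x (length d) times W (d x d_h) -> vector of length d_h
        return [sum(x[i] * W[i][j] for i in range(len(x)))
                for j in range(len(W[0]))]

    d_h = len(Wq[0])
    qs = [matvec(Wq, x) for x in xs]
    ks = [matvec(Wk, x) for x in xs]
    vs = [matvec(Wv, x) for x in xs]
    ys = []
    for q in qs:
        # scaled dot-product scores alpha_ij, then a softmax over positions j
        scores = [sum(qc * kc for qc, kc in zip(q, k)) / math.sqrt(d_h)
                  for k in ks]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        Z = sum(exps)
        alphas = [e / Z for e in exps]
        # each output is the attention-weighted sum of the value vectors
        ys.append([sum(a * v[c] for a, v in zip(alphas, vs))
                   for c in range(d_h)])
    return ys
```

With identical input tokens, every row of the attention map becomes uniform and each output equals the shared value vector, which makes the sketch easy to sanity-check.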

Lightweight convolutions
In contrast, a standard one-dimensional convolution slides a kernel of weights along the input sequence; each feature in each output representation y_i is a weighted sum of all features (called "channels") in the surrounding x_i. To save parameters, it is common to consider depthwise convolutions, where each channel c in y_i is a weighted sum only of the features in channel c for the surrounding x_i. Formally, each entry of y_i can be written as:

y_{i,c} = Σ_{j=i−k}^{i+k} β_{j−i,c} x_{j,c}    (2)

where k is the kernel size in each direction. Each scalar β_{j−i,c} represents the attention paid to relative position j−i for channel c. To further simplify depthwise convolutions for use in language models, Wu et al. (2019) propose lightweight convolutions, which tie the weights β_{j−i,c} along all channels c. As a result, the lightweight convolution contains only 2k+1 weights, one scalar β_{j−i} for each relative position considered. Then, each y_i is a linear combination of the surrounding x_j:

y_i = Σ_{j=i−k}^{i+k} β_{j−i} x_j    (3)

Importantly, we can then consider each β_{j−i} as an attention weight analogous to self-attention, representing the attention that token i pays to token j.
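A lightweight convolution is simple enough to sketch directly. The function below applies one scalar weight per relative position, tied across all channels, with edge positions simply truncated; names are illustrative, not from any released implementation.

```python
def lightweight_conv(xs, betas):
    """Lightweight convolution over a token sequence.

    betas has length 2k+1, indexed so that betas[k + (j - i)] is the single
    scalar weight for relative position j - i, shared ("tied") across all
    channels of the d-dimensional token vectors in xs."""
    k = (len(betas) - 1) // 2
    n, d = len(xs), len(xs[0])
    ys = []
    for i in range(n):
        y = [0.0] * d
        # sum over the window of surrounding tokens, truncated at the edges
        for j in range(max(0, i - k), min(n, i + k + 1)):
            w = betas[k + (j - i)]
            for c in range(d):
                y[c] += w * xs[j][c]
        ys.append(y)
    return ys
```

Setting the center weight to 1 and all others to 0 recovers the identity, which is a convenient check that the relative-position indexing is right.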
The lightweight convolution produces an attention map of β j−i as visualized in Figure 1. Finally, furthering the similarity between lightweight convolutions and self-attention, Wu et al. (2019) propose dynamic lightweight convolutions, which dynamically compute relative weights β j−i based on individual input tokens. In other words, each row in Figure 1 has relative weights determined dynamically based on the input token x i for that row. Because attentions for relative positions are no longer fixed across rows, the attention map in Figure 1 achieves similar flexibility to standard self-attention.

Self-attention vs. convolution
We have shown that both self-attention and lightweight convolution compute linear combinations of token representations, but we now isolate the differences between the two approaches. Perhaps most importantly, the two methods assign attention scores α ij and β j−i in fundamentally different ways.
Self-attention computes α ij based on the dot product between query i and key j, ignoring the relative position between i and j. In this way, self-attention layers model interactions exclusively between token representations. If the tokens are arbitrarily shuffled in a standard self-attention layer, the output for each token is unchanged. All position information is injected before the first self-attention layer in the form of absolute position embeddings.
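The shuffling claim can be checked numerically: with no position information, each token's output depends only on its own vector and the unordered set of other tokens, so permuting the inputs permutes the outputs identically. A toy position-free attention function (identity projections, for simplicity; purely illustrative) makes this concrete:

```python
import math, random

def attend(xs):
    """Position-free single-head self-attention with identity projections,
    a toy stand-in used only to illustrate the permutation property."""
    d = len(xs[0])
    ys = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        m = max(scores)
        e = [math.exp(s - m) for s in scores]
        Z = sum(e)
        ys.append([sum(ei / Z * xj[c] for ei, xj in zip(e, xs))
                   for c in range(d)])
    return ys

random.seed(0)
tokens = [[random.random() for _ in range(4)] for _ in range(5)]
perm = [3, 0, 4, 1, 2]
out = attend(tokens)
out_shuffled = attend([tokens[p] for p in perm])
# each token's output vector is unchanged by the shuffle
assert all(
    all(abs(a - b) < 1e-9 for a, b in zip(out[p], out_shuffled[i]))
    for i, p in enumerate(perm)
)
```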
In contrast, dynamic lightweight convolutions assign attention scores directly to relative positions. This allows convolutions to directly integrate relative position information without relying on absolute positions. Thus, convolutions could be better at capturing local information in sentences. However, convolutions alone are limited in their ability to model interactions between tokens because they lack the query-key mechanism central to standard self-attention. In future sections, we consider methods of integrating the two approaches.

Integrating lightweight convolutions
Previous work has sought to integrate local information into global self-attention. This can be achieved by restricting the range of self-attention to nearby tokens, or by incorporating relative position information into attention maps (Hofstätter et al., 2020; Raganato et al., 2020; Wei et al., 2021). Notably, Shaw et al. (2018) introduced relative position embeddings, which inspired similar embeddings in models such as Transformer-XL and XLNet (Dai et al., 2019; Yang et al., 2019). In this section, we show that several previous methods of encoding relative positions are direct implementations of lightweight convolutions.

Relative embeddings as lightweight convolutions
First, the simplest way to combine self-attention with lightweight convolution is to generate a standard attention map, then add the attention map generated by a lightweight convolution. Given a fixed lightweight convolution, this results in attention scores as follows:

α_ij ∝ (x_i W_Q)(x_j W_K)^T / √d_h + β_{j−i}    (4)

This is exactly the relative position term used in T5 (Raffel et al., 2020) and TUPE (Ke et al., 2021). We further consider a dynamic lightweight convolution, where the β_{j−i} weights are computed by passing the query through a linear feedforward layer W_C ∈ R^{d_h×(2k+1)} (Wu et al., 2019). Because W_C is linear, each weight β_{j−i} is equal to the dot product between the query and column j−i of W_C. We then obtain attention scores:

α_ij ∝ (x_i W_Q)(x_j W_K)^T / √d_h + (x_i W_Q) W_{C,j−i}^T

If we scale the dynamic lightweight convolution term according to the head dimension size, we obtain precisely the relative embeddings proposed in Shaw et al. (2018):

α_ij ∝ (x_i W_Q)(x_j W_K + W_{C,j−i})^T / √d_h    (5)

Under this interpretation, Shaw's relative embeddings are essentially identical to the dynamic lightweight convolutions used in Wu et al. (2019). In both formulations, relative position weights are computed as dot products between the query and a learned relative position embedding. Previous work has considered relative positions in language models independently from convolutions, but our derivations suggest that the underlying mechanisms may be the same.
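The equivalence rests on nothing more than the linearity of the dot product: adding a query-dependent convolution weight q · W_C[j−i] to the content score is the same as attending to a key shifted by a relative embedding. A quick numerical check with illustrative random vectors:

```python
import math, random

random.seed(1)
d_h = 8
dot = lambda u, v: sum(a * b for a, b in zip(u, v))
q = [random.gauss(0, 1) for _ in range(d_h)]   # query  x_i W_Q
k = [random.gauss(0, 1) for _ in range(d_h)]   # key    x_j W_K
a = [random.gauss(0, 1) for _ in range(d_h)]   # relative embedding W_C[j-i]

# dynamic lightweight convolution view: content score plus q . W_C[j-i]
conv_score = (dot(q, k) + dot(q, a)) / math.sqrt(d_h)
# Shaw et al. (2018) view: attend to a key shifted by the relative embedding
shaw_score = dot(q, [kc + ac for kc, ac in zip(k, a)]) / math.sqrt(d_h)

assert abs(conv_score - shaw_score) < 1e-9
```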

Composite attention and lightweight convolution experiments
To validate lightweight convolutions in combination with self-attention, we pre-trained and evaluated BERT-small models (see Appendix A.1 for pre-training details).

Evaluation. Models were evaluated on the GLUE benchmark, a suite of sentence classification tasks including natural language inference (NLI), grammaticality judgments, sentiment classification, and textual similarity (Wang et al., 2018). For each task, we ran ten fine-tuning runs and used the model with the best score on the development set. We report scores on the GLUE test set. Development scores and statistics for all experiments are reported in Appendix A.2.
Models. We trained two baseline models: a default BERT-small with standard absolute position embeddings, and a BERT-small with no position information whatsoever. Then, we trained models with fixed lightweight convolutions (Equation 4; Raffel et al. 2020), and dynamic lightweight convolutions that generated convolution weights based on each query (i.e. using relative embeddings, Equation 5; Shaw et al. 2018). Finally, we propose composite attention, which simply adds dynamic lightweight convolutions to fixed lightweight convolutions, resulting in attention scores α_ij as follows:

α_ij ∝ (x_i W_Q)(x_j W_K + W_{C,j−i})^T / √d_h + β_{j−i}    (6)

where the W_{C,j−i} term is the dynamic convolution (relative embeddings) and β_{j−i} is the fixed convolution. Intuitively, composite attention has the flexibility of dynamic lightweight convolutions, while still allowing models to incorporate relative positions directly through fixed lightweight convolutions. Alternatively, composite attention can be interpreted as adding a fixed bias term to relative position embeddings.
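A single composite attention logit (Equation 6) can be sketched as the sum of three terms: the content score, the dynamic query-dependent convolution term, and the fixed per-offset scalar. Variable names below are ours, not from the paper's code.

```python
import math

def composite_score(q, k, rel_embed, beta):
    """One composite attention logit for a token pair (i, j):
    the content term q . k, the dynamic convolution term q . W_C[j-i]
    (rel_embed), and the fixed lightweight convolution scalar
    beta = beta_{j-i}."""
    d_h = len(q)
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    return (dot(q, k) + dot(q, rel_embed)) / math.sqrt(d_h) + beta
```

With rel_embed and beta set to zero this reduces to the standard scaled dot-product score, matching the interpretation of composite attention as a fixed bias added on top of relative position embeddings.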
All of our experiments used a convolution kernel size of 17, or eight positions in each direction, a mid-range value that has been found to work well for both relative positions and convolutions in language models (Huang et al., 2020; Jiang et al., 2020; Shaw et al., 2018). As in Shaw et al. (2018), the relative embeddings W_{C,j−i} shared weights across heads. Unless stated otherwise, models used no absolute position embeddings.
For completeness, we also considered dynamic lightweight convolutions based on the key (as opposed to the query). In contrast to query-based lightweight convolutions, key-based convolutions allow each token to dictate which relative positions should pay attention to it, rather than dictating which relative positions it should pay attention to. Referring to the visualization in Figure 1, key-based dynamic convolutions correspond to columns instead of rows. These key-based dynamic lightweight convolutions are the same as the relative embeddings proposed in Huang et al. (2020), but they are now formulated as dynamic lightweight convolutions.

Lightweight convolution results
GLUE test set results are presented in Table 1.
Lightweight convolutions consistently improved performance. Notably, even the fixed lightweight convolution was sufficient to replace absolute position embeddings, outperforming the default BERT-small model. This indicates that even naïve sampling from nearby tokens can be beneficial to language model performance.
Dynamic convolutions provided further improvements. When the lightweight convolutions were generated dynamically based on token queries, the models outperformed the default model by even larger margins. This improvement over fixed lightweight convolutions suggests that different tokens find it useful to generate different lightweight convolutions, paying attention to different relative positions in a sentence.

Composite attention performed the best.
Combining fixed lightweight convolutions with dynamic lightweight convolutions proved an effective strategy for encoding relative positions. Although composite attention is simply a combination of the relative position embeddings of Shaw et al. (2018) and Raffel et al. (2020), it validates convolution as a viable method of encoding relative positions in self-attention.
Key-based dynamic convolutions provided no additional benefit. When we generated an additional lightweight convolution based on keys, the model performed worse than composite attention alone (GLUE 74.0 compared to 75.2). This result clarifies the findings of Huang et al. (2020), who reported only small improvements from query- and key-based relative position embeddings for a subset of the GLUE tasks.

Grammaticality judgments were particularly sensitive to position information. On the CoLA task (the Corpus of Linguistic Acceptability; Warstadt et al. 2019), there was a dramatic performance drop when absolute position embeddings were removed. However, when any type of lightweight convolution was added, performance improved even over the baseline established by absolute positions. The pronounced effect of local position information on the CoLA task supports the intuitive hypothesis that local dependencies are particularly important for grammaticality judgments. This result also suggests that convolutions could be beneficial for more local tasks (e.g. token-level tasks) along with sentence classification tasks.

Interpreting lightweight convolutions
To better understand how lightweight convolutions improve language models, we visualized the learned lightweight convolution kernel weights in Figure 2. Qualitatively, the kernels exhibited specific types of patterns:

• Paying particular attention to the previous or next token.
• Paying graded attention either to past or future tokens, dictated by how far the target token is from the present token.
These observations support the assumption that nearby tokens are relevant to the interpretation of the current token. They also align with the findings of Voita et al. (2019), who identified "positional" attention heads that focus primarily on the next or previous token. From this perspective, lightweight convolutions allow language models to explicitly represent nearby tokens' positions.
Interestingly, we also found that some kernels paid fairly uniform attention to all tokens, even decreasing attention to nearby and adjacent tokens. It is likely that these attention heads focused on more global information, relying on the query-key attention mechanism rather than the convolution.

BERT-base models
To thoroughly assess the impact of composite attention on pre-trained language models, we trained full-sized BERT models for 1M steps each, replicating our BERT-small experiments. Pre-training details are outlined in Appendix A.1.
Results are presented in Table 1. Differences between models decreased substantially for full-sized models, and the relative performances of different approaches varied across tasks. Our results suggest that relative position information is more useful for smaller or more data-limited models; extending the benefits of convolutions robustly from small models to larger models is an important direction for future research. That said, even in the larger models, composite attention slightly outperformed the other position embedding methods in overall GLUE score. Our results demonstrate that convolutions can perform at least on par with absolute position embeddings even in larger models.

Non-lightweight convolutions
The previous section found that lightweight convolutions consistently improved pre-trained language model performance. Next, we investigated whether the additional flexibility of non-lightweight convolutions could provide further benefits. Specifically, we considered convolutions that were fixed but non-lightweight. In other words, convolution weights were fixed regardless of the input query, but weights were not tied across channels, equivalent to a standard depthwise convolution. We only considered fixed depthwise convolutions because under existing frameworks, dynamic depthwise convolutions would introduce large numbers of parameters.
To implement depthwise convolutions, we added a convolution term identical to the fixed lightweight convolution in Equation 4, except that β_{j−i} was learned separately for each feature channel c:

α_ij,c ∝ (x_i W_Q)(x_j W_K)^T / √d_h + β_{j−i,c}    (7)

This is equivalent to adding a depthwise convolution of the token values to the standard self-attention output.
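The per-channel variant is a small change to the lightweight case: the shared content score is offset by a different fixed weight for each channel, yielding one logit per channel. A sketch with illustrative names:

```python
import math

def depthwise_attention_logits(q, k, betas_c):
    """Per-channel attention logits for one token pair and one relative
    position: the shared content score is offset by a separate fixed
    weight beta_{j-i,c} for every channel c (the entries of betas_c).
    A sketch of the depthwise variant, not the paper's code."""
    d_h = len(q)
    content = sum(a * b for a, b in zip(q, k)) / math.sqrt(d_h)
    return [content + b for b in betas_c]
```

Tying all the betas_c entries to a single scalar collapses this back to the lightweight convolution term of Equation 4.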

Non-lightweight convolution experiments
We ran experiments using the same setup as the lightweight convolution experiments in Section 3.2. To compare the effects of dynamic lightweight convolutions (e.g. composite attention) and non-lightweight (depthwise) convolutions, we trained models using each possible combination of the two convolutions. Results are presented in Table 2.
Depthwise convolutions were less effective than lightweight convolutions. As with lightweight convolutions, the depthwise convolutions effectively replaced absolute position embeddings, outperforming the default model. However, fixed depthwise convolutions performed worse than fixed lightweight convolutions on the majority of tasks. This indicates that flexibility across channels is not critical to the success of convolutions in language models.

Composite attention already provided the necessary flexibility. Composite attention outperformed the fixed depthwise convolutions; even when composite attention was combined with depthwise convolutions, there was no overall improvement over composite attention alone. This suggests that in the context of language, dynamic lightweight convolutions efficiently encode any local position information provided by depthwise convolutions.
Depthwise convolutions differentiated previous and next tokens. In previous sections, we found that lightweight convolution kernels often pay attention specifically to adjacent tokens. As can be seen in Figure 3, this result was even more pronounced in depthwise convolutions, with individual channels focusing on the previous or next token. Interestingly, other channels specifically directed attention away from adjacent tokens. This indicates that the relevant information about next and previous tokens can be compressed into a subset of the feature channels, freeing other channels to consider more distant or position-independent information.

Convolutional queries, keys, and values
Improvements over the non-convolutional baselines indicate that convolutions are beneficial to language model pre-training, serving as replacements for absolute position embeddings. Our previous experiments applied different types of convolutions to self-attention values. To take this result one step further, we replaced the linear query, key, and value projections themselves with convolutional layers. Intuitively, applying convolutions before self-attention induces even more mixing of token representations. If convolutions are built into every query, key, and value, then it becomes impossible for a token i to pay attention to a single token j without also incorporating information about tokens surrounding token j.
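For reference, a depthwise-separable convolution factors a full convolution into a per-channel (depthwise) step followed by a 1×1 (pointwise) channel-mixing step. Below is a minimal sketch of such a layer as it might replace a linear Q/K/V projection; the function and parameter names are illustrative.

```python
def depthwise_separable_conv(xs, depthwise, pointwise):
    """Depthwise-separable convolution over a token sequence.

    depthwise is a (2k+1) x d kernel applied per channel (no channel
    mixing); pointwise is a d x d_out matrix that then mixes channels,
    i.e. a 1x1 convolution."""
    k = (len(depthwise) - 1) // 2
    n, d = len(xs), len(xs[0])
    mid = []
    for i in range(n):
        row = [0.0] * d
        # depthwise step: each channel is filtered independently
        for j in range(max(0, i - k), min(n, i + k + 1)):
            for c in range(d):
                row[c] += depthwise[k + (j - i)][c] * xs[j][c]
        mid.append(row)
    d_out = len(pointwise[0])
    # pointwise step: mix channels with a matrix multiply
    return [[sum(row[c] * pointwise[c][o] for c in range(d))
             for o in range(d_out)] for row in mid]
```

Compared with a full convolution, the factorization keeps the parameter count near (2k+1)·d + d·d_out rather than (2k+1)·d·d_out, which is why it is a common drop-in for projection layers.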

Convolutional Q, K, V experiments
As in Sections 3.2 and 4.1, we ran experiments on BERT-small. We replaced the query, key, and value projections with depthwise-separable convolutions in half of the self-attention heads. This aligns with previous work in which only half of the output dimensions for each token were generated using convolutions (Jiang et al., 2020). Indeed, our initial explorations found that it was more effective to replace the linear projections in only half, not all, of the attention heads.
Then, we considered whether convolutions from previous experiments provided additional benefits over convolutional queries, keys, and values. To test this, we trained BERT-small models with composite attention (Equation 6), adding convolutional queries, keys, and values.

Convolutional Q, K, V results
Results are presented in Table 3. Similar to our previous convolution experiments, all convolutional replacements successfully outperformed the default model. These results strongly support the conclusion that convolutions are a viable method of encoding positional information for language tasks.
However, all convolutional replacements for queries, keys, and values slightly decreased the performance of models using composite attention. Convolutional values in particular were effective in models without composite attention, but they slightly decreased performance in models that already incorporated such lightweight convolutions. We conclude that although convolutions can benefit models by adding local position information, there is a limit to how much local mixing should be done. It is sufficient to apply convolutions to token values on top of self-attention; additional convolutional layers applied before the self-attention map enforce unnecessary mixing of token representations.

Discussion
Our results demonstrate that convolutions provide consistent benefits to pre-trained language models. Our proposed composite attention mechanism combines previous relative position embedding methods, showing that convolutions can effectively compensate for the lack of local position information in Transformer models.

Related work
Our work unites and builds upon previous work using convolutions and relative positions in Transformers. We adopted the relative embeddings from Shaw et al. (2018) and Huang et al. (2020), showing that these embeddings are equivalent to the dynamic lightweight convolutions in Wu et al. (2019). Combining these dynamic lightweight convolutions with fixed lightweight convolutions (equivalent to the relative position terms in Raffel et al. 2020), we studied relative embeddings under the framework of convolution integrated with selfattention. As far as we are aware, our work is the first to holistically compare relative positions, convolutions, and self-attention in language models.
Building upon dynamic lightweight convolutions, recent work has incorporated both depthwise-separable and dynamic lightweight convolutions in pre-trained language models. Jiang et al. (2020) proposed ConvBERT, which adds a convolutional module alongside the standard self-attention mechanism in BERT. ConvBERT's convolutional module consists of a depthwise-separable convolution combined with a query to generate a dynamic lightweight convolution. Under our integrated framework, this is analogous to the model which uses depthwise-separable convolutions for queries and keys, using composite attention as a query-based dynamic lightweight convolution (see Table 3). To make this comparison concrete, we trained a ConvBERT-small model using the same setup as our experiments. Indeed, the analogous model under our framework outperformed ConvBERT-small (GLUE score 74.5 compared to 70.3). Details for the ConvBERT comparison can be found in Appendix A.3.
Finally, recent work has proved theoretical relationships between self-attention and convolution. Cordonnier et al. (2020) showed that given enough self-attention heads, self-attention weights can express any convolution; in fact, they showed that self-attention layers often learn such convolutional structures when trained on vision tasks. However, this theoretical equivalence does not explain convolution-based improvements for Transformers in language tasks. To clarify the relationship between self-attention and convolution in language, our work characterizes self-attention as a type of dynamic lightweight convolution. By establishing a per-parameter equivalence between relative position embeddings and Wu's dynamic lightweight convolutions, we provide a concrete foundation for understanding how self-attention and convolution can be used together in practice.

Conclusion
In this work, we formalized the relationship between self-attention and convolution. We proposed composite attention, which combines self-attention with lightweight convolution, uniting previous approaches to relative positions. Our formulation and empirical results demonstrate that convolutions can improve self-attention by providing local position information in sentences, capable of replacing absolute position embeddings entirely.
Our findings provide a solid foundation from which to study convolutions and self-attention in language tasks. The spatially-oriented nature of convolutional neural networks translates directly into positional information in language. As vision and language researchers strive towards common deep learning architectures, it is important to recognize how architectures for vision tasks can be adapted to linguistic domains.

A.1 Pre-training details
Pre-training hyperparameters are listed in Table 4, and fine-tuning hyperparameters are listed in Table 5.
Hyperparameters are based on those used in Clark et al. (2020) and Devlin et al. (2019).

A.2 GLUE development results
Results for each model on the GLUE development set are reported in Table 6. We report averages over ten fine-tuning runs for each task, including standard errors of the mean. Each overall GLUE score was computed as the average of individual task scores; we computed GLUE score averages and standard errors over ten GLUE scores, corresponding to the ten fine-tuning runs. We note that development scores were generally higher than test scores due to differences between the test and training distributions (Wang et al., 2018). Because BERT-small models were only trained for 125,000 steps with batch size 128, the small models were trained on 16M sentence pairs.

A.3 Detailed ConvBERT comparison
ConvBERT adds a convolutional module alongside the standard self-attention mechanism in BERT (Jiang et al., 2020). ConvBERT uses half the number of standard self-attention heads, using convolutional modules for the other half. In each convolutional module, a depthwise-separable convolution is multiplied pointwise with the query in the corresponding self-attention head. This convolutional query is fed into a linear layer to generate a dynamic lightweight convolution.
Under our framework, the analogous model replaces half of the queries and keys with depthwise-separable convolutions and uses composite attention (a query-based dynamic lightweight convolution; see Table 3 in the full paper). In both models (ConvBERT and our own), half of the attention heads use a convolutional query. Additionally, in both models, the convolutional query is used to generate a dynamic lightweight convolution.
However, in our model, the dynamic lightweight convolution (in this case, composite attention) is used for all attention heads, not just the convolutional heads. Furthermore, our convolutional heads still use a self-attention mechanism along with the dynamic lightweight convolutions, by generating convolutional keys. In this way, our model adds convolutions to ConvBERT's self-attention heads, and adds self-attention to ConvBERT's convolutional heads.
Then, we investigated whether the separate self-attention and convolutional modules in ConvBERT provide any benefit over our integrated convolution and self-attention. We trained a ConvBERT-small model using the same pre-training setup as our BERT-small experiments, comparing performance to the analogous model under our framework. Results are shown in Table 7. Indeed, integrated convolutions and self-attention outperformed ConvBERT-small, using only 3% more parameters.

Table 6: GLUE development set scores for each model described in the main paper, reporting averages and standard errors of the mean over ten fine-tuning runs for each task. * denotes the default BERT model.