miCSE: Mutual Information Contrastive Learning for Low-shot Sentence Embeddings

This paper presents miCSE, a mutual information-based contrastive learning framework that significantly advances the state-of-the-art in few-shot sentence embedding. The proposed approach imposes alignment between the attention patterns of different views during contrastive learning. Learning sentence embeddings with miCSE entails enforcing structural consistency across augmented views for every sentence, making contrastive self-supervised learning more sample efficient. As a result, the proposed approach shows strong performance in the few-shot learning domain. While it achieves superior results compared to state-of-the-art methods on multiple benchmarks in few-shot learning, it is comparable in the full-shot scenario. This study opens up avenues for efficient self-supervised learning methods that are more robust than current contrastive methods for sentence embedding.


Introduction
Measuring sentence similarity has been a challenging endeavor due to the ambiguity and variability of linguistic expressions. The community's strong interest in the topic can be attributed to its applicability in numerous language processing applications, such as sentiment analysis, information retrieval, and semantic search (Pilehvar and Navigli, 2015; Iyyer et al., 2015). It has recently been shown that Transformer-based language models already perform surprisingly well (Reimers and Gurevych, 2019). At the same time, to unfold their full potential, language models pre-trained on large generic corpora require fine-tuning on the downstream task and corpora (Devlin et al., 2018; Pfeiffer et al., 2020; Mosbach et al., 2021). In terms of sentence embeddings, contrastive learning schemes have already been adopted successfully (van den Oord et al., 2018; Liu et al., 2021; Gao et al., 2021; Carlsson et al., 2021). The idea of contrastive learning is that positive and negative pairs are generated given a batch of samples. Whereas the positive pairs are obtained via augmentation, negative pairs are often created by random collation of sentences. Following the construction of pairs, contrastive learning forces the network to learn feature representations by pushing apart different samples (negative pairs) and pulling together similar ones (positive pairs). While some methods seek to optimize the selection of "hard" negatives for negative pair generation (Zhou et al., 2022), others investigate better augmentation techniques for positive pair creation. In this regard, many methods have been proposed to create augmentations that boost representation learning. A standard approach operates at the input data level (a.k.a. discrete augmentation) and comprises word-level operations such as swapping, insertion, deletion, and substitution (Xie et al., 2017; Coulombe, 2018; Wei and Zou, 2019).
In contrast, continuous augmentation operates at the representation level, comprising approaches like interpolation or "mixup" on the embedding space (Chen et al., 2020; Cheng et al., 2020; Guo et al., 2019). Most recently, augmentation was also proposed in a more continuous fashion, operating at the parameter level via simple techniques such as drop-out (Gao et al., 2021; Liu et al., 2021; Klein and Nabi, 2022) or random span masking (Liu et al., 2021). The intuition is that drop-out acts as minimal data augmentation, providing an expressive semantic variation. However, it will likely affect syntactic alignment across views. Since positive pairs are constructed from identical sentences using drop-out noise, we hypothesize that the syntactic dependency over the views should be preserved. Building on this idea, we maximize the syntactic dependence by enforcing distributional similarity over the attention values across the augmentation views. To this end, we maximize the mutual information (MI) on the attention tensors of the positive pairs. However, since attention tensors can be very high-dimensional, computing MI can quickly become a significant burden, if not intractable. Common approaches for MI estimation are non-parametric, e.g., relying on binning or non-parametric kernel density estimators (see Suzuki et al., 2008; Kwak and Choi, 2002; Kraskov et al., 2004). These estimators often do not scale well and cannot be computed efficiently on GPUs (Gao et al., 2014). In contrast, parametric modeling seems to be a more reasonable choice for the statistical modeling of Transformer attention distributions (e.g., Gaussian distribution (Bahuleyan et al., 2018), Dirichlet distribution (Deng et al., 2018), Weibull and Log-Normal distribution (Fan et al., 2020)). This paper proposes a simple parametric solution that alleviates the computational burden of MI computation and can be deployed efficiently.
Specifically, we adopt the Log-Normal distribution for modeling the attention. On the one hand, empirical evidence confirms this model to be a good fit. On the other hand, it facilitates the optimization objective to be defined in closed form.
In this case, mutual information can be provably reformulated as a function of correlation, allowing native GPU implementation. As discussed above, the proposed approach builds upon the contrastive learning paradigm known to suffer from model collapse. This issue becomes even more problematic when enforcing MI on the attention level, as it tightens the positive pairs via regularizing the attention. Therefore the selection of negative pairs becomes more critical in our setup. To this end, we utilize momentum contrastive learning to generate harder negatives (He et al., 2020). A "tighter" binding on positive pairs and repulsion on "harder" negative pairs empowers the proposed contrastive objective, yielding more powerful representations.
Combining ideas from momentum contrastive learning and attention regularization, we propose miCSE, a conceptually simple yet empirically powerful method for sentence embedding, with the goal of integrating semantic and syntactic information of a sentence in an information-theoretic and Transformer-specific manner. We conjecture the relation between attention maps and a form of syntax to be the main driver behind the success of our approach. To validate that, we performed a controlled empirical observation, which suggests a lack of syntax-related properties of the sentences in previous work (i.e., SimCSE (Gao et al., 2021)) compared to miCSE (see Fig. 1). We speculate that our proposed method injects syntactic information into the model as an inductive bias, facilitating representation learning with fewer samples. The adopted syntactic inductive biases provide a structural prior as an implicit form of supervision during training (Wilcox et al., 2020), which promotes few-shot learning capabilities in neural language models. To validate this, we introduce a low-shot setup for training sentence embeddings. In this benchmark, we fine-tune the language model with only a small number of training samples. Note that this is a very challenging setup. The inherent difficulty can be attributed to the need to mitigate the domain shift in the low-shot self-supervised learning scheme. We emphasize the importance of this task, as in many real-world applications only small datasets are available. Examples of such cases are NLP for low-resource languages or expert-produced texts (e.g., medical records by doctors), personalized LMs for social media analysis (e.g., personalized hate speech recognition on Twitter), etc. In the low-shot sentence embedding benchmark, our proposed method significantly improves over the state-of-the-art.
This is the first work that explores how to combine semantic and syntactic information through attention regularization, and empirically demonstrates this benefit for low-shot sentence embeddings.
Recently, VaSCL (Zhang et al., 2022), ConSERT (Yan et al., 2021) and PCL (Wu et al., 2022) proposed contrastive representation learning with diverse augmentation strategies for positive pairs. In contrast, we propose a principled approach for enforcing alignment in positive pairs in contrastive learning. Similar to us, ESimCSE and MoCoSE (Cao et al., 2022) proposed to exploit a momentum contrastive learning model with a negative sample queue for sentence embedding to boost the uniformity of the representations. However, unlike us, they neither enforce any further tightening objective on the positive pairs nor consider few-shot learning. Very recently, the authors of InforMin-CL (Chen et al., 2022) proposed information minimization-based contrastive learning. Specifically, they propose to minimize the information entropy between positive embeddings generated by drop-out augmentation. Our model differs from this work and from the methods in (Bachman et al., 2019; Yang et al., 2021; Zhang et al., 2020; Sordoni et al., 2021; Wu et al., 2020), which focus on using mutual information for self-supervised learning. A key difference compared to these methods is that they estimate MI directly on the representation space. In contrast, our method computes the MI on attention. Other related works include (Chuang et al., 2022; Liu et al., 2022).
The contributions of the proposed work are threefold: First, we propose to inject syntactic information into language models by adding an attention-level objective. Second, we introduce Attention Mutual Information (AMI), a simple and efficient objective for sample-efficient self-supervised contrastive learning. Third, we introduce low-shot learning for sentence embedding. We show that our method performs comparably to the state-of-the-art in the full-shot scenario and significantly better in few-shot learning.

Method
The proposed approach aims to exploit the syntactic structure of the sentences in a contrastive learning scheme. Compared to conventional contrastive learning, which solely operates at the level of semantic similarity in the embedding space, the proposed approach injects syntactic information into the model. This is achieved by regularizing the attention space of the model during training. We let $\mathcal{D}$ denote a dataset consisting of string sequences (sentences) from corpus $\mathcal{X}$ with $\mathcal{D} = \{x_1, x_2, ..., x_{|\mathcal{X}|}\}$, where we assume $x_i$ to be a tokenized sequence of length $n$ with $x_i \in \mathbb{N}^n$. For mapping the input data to the embedding space, we use a bi-encoder $f_\theta$ parametrized by $\theta$. Bi-encoders entail the computation of embeddings for similarity comparison, whereby each sentence in a pair is encoded separately. Hence, the instantiation of a bi-encoder on augmented input data induces multiple views. In the following, we let $v \in \{1, 2\}$ denote the index of the view, where each view corresponds to a different augmentation. Consequently, encoding a data batch $\mathcal{D}_b$ yields embedding matrices $E_v \in \mathbb{R}^{|\mathcal{D}_b| \times U}$, where $U$ denotes the dimensionality of the embeddings. When employing a Transformer, encoding the input data yields the embedding matrices and the associated attention tensors $W_v$. Learning representations with the proposed approach then entails the optimization of a joint loss:

$$\mathcal{L} = \mathcal{L}_C + \mathcal{L}_D \tag{1}$$

Here, $\mathcal{L}_C$ is responsible for the semantic alignment, corresponding to the standard InfoNCE (van den Oord et al., 2018) loss that seeks to pull positive pairs close together while pushing away negative pairs in the embedding space. In contrast, $\mathcal{L}_D$ is responsible for the syntactic alignment, operating on the attention space. However, in comparison to $\mathcal{L}_C$, $\mathcal{L}_D$ is employed only on the attention tensors of the positive pairs.

Embedding-level Momentum-Contrastive Learning (InfoNCE)
The InfoNCE loss seeks to pull positive pairs together in the embedding space while pushing negative pairs apart. Specifically, InfoNCE on embeddings pushes for the similarity of each sample and its corresponding augmented embedding. Negative pairs are constructed in two ways, reflected by the two terms in the denominator of Eq. 2. First, in-batch negative pairs are constructed by pairing each sentence with another random sentence (sharing no semantic similarity), pushing for dissimilarity. Second, additional negatives are obtained from a momentum encoder, known as MoCo (He et al., 2020; Cao et al., 2022). The momentum encoder is a replica of the encoder $f_\theta$ whose parameters are updated more slowly. Specifically, while the parameters of the encoder $f_\theta$ are updated via back-propagation, the parameters of the momentum encoder are updated using an exponential moving average of the former. The negative embeddings are produced from samples of previous batches, which are stored in a queue $Q$ and forward-passed through the momentum encoder. Then the InfoNCE (van den Oord et al., 2018) loss ($\mathcal{L}_C$) is defined as:

$$\mathcal{L}_C = -\sum_{i=1}^{|\mathcal{D}_b|} \log \frac{d(e_i, \overset{+}{e}_i)}{\sum_{\overset{+}{e}_j \in E_2} d(e_i, \overset{+}{e}_j) + \sum_{q_j \in Q} d(e_i, q_j)} \tag{2}$$

where $e_i \in E_1$ and $\overset{+}{e}_i \in E_2$ denote the embeddings of different augmentations of $x_i$. Furthermore, $d(x, y) = \exp(\mathrm{sim}(x, y)/\tau)$ with $\mathrm{sim}(\cdot)$ the cosine similarity, $q_j$ denotes the representations obtained from the momentum encoder, and $\tau \in \mathbb{R}$ is a temperature scalar.

Algorithm 1: Mutual information estimation
Input: batch $\mathcal{D}_b$, encoder $f_\theta$, multinomial sampler $p_{mult}$
Output: average mutual information
Compute the correlation coefficient $\rho$ on centered attentions
Return $-\frac{1}{2}\log(1 - \rho^2)$ (mutual information for tensor slice)
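The interplay of in-batch negatives, queued negatives, and the slowly updated momentum encoder can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names (`info_nce`, `ema_update`), the averaging over the batch, and the inclusion of the positive pair in the in-batch sum are assumptions made for illustration, and a practical version would operate on GPU tensors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def d(u, v, tau=0.05):
    """d(x, y) = exp(sim(x, y) / tau) with cosine similarity."""
    return math.exp(cosine(u, v) / tau)


def info_nce(view1, view2, queue, tau=0.05):
    """InfoNCE loss, averaged over the batch (the scaling is an assumption).

    view1[i] and view2[i] are embeddings of two augmentations of sentence i.
    The denominator has two terms: a sum over the second-view embeddings of
    the batch (positive plus in-batch negatives) and a sum over the queued
    embeddings produced by the momentum encoder.
    """
    total = 0.0
    for i, e in enumerate(view1):
        positive = d(e, view2[i], tau)
        in_batch = sum(d(e, other, tau) for other in view2)
        from_queue = sum(d(e, q, tau) for q in queue)
        total += -math.log(positive / (in_batch + from_queue))
    return total / len(view1)


def ema_update(momentum_params, online_params, factor=0.995):
    """Exponential moving average update of the momentum encoder parameters."""
    return [factor * m + (1.0 - factor) * p
            for m, p in zip(momentum_params, online_params)]
```

In this sketch, the loss is small when each embedding is closest to its own augmentation and grows when it is closer to another sentence in the batch or to a queued embedding.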

Attention-level Mutual Information (AMI)
Preliminaries and notations: We first briefly review the attention mechanism and explain the notation used in the rest of this section. A Transformer encoder consists of a stack of $L$ layers, with input data cascading up the layer stack. In its simplest form, each layer comprises a self-attention module and a feed-forward network. Passing sentences through the encoder stack entails simultaneous computation of attention weights. These attention weights indicate the relative importance of every token. To this end, key-value pairs are computed for each token of the input sequence within each self-attention module. This entails the computation of three different matrices: key matrix $K$, value matrix $V$, and query matrix $Q$. The values of the attention weights $W$ are obtained according to $W = \mathrm{softmax}(f(Q, K))$, where $f(\cdot)$ is a scaled dot-product. Output features are then obtained as $WV$. In order to attend to different sub-spaces (Vaswani et al., 2017) simultaneously, the attention mechanism is replicated $H$ times, which is referred to as multi-head attention. While training the encoder, the values of the self-attention tensors $W$ are subject to a stochastic process, with randomness arising from drop-out. Hence, the proposed approach seeks to optimize syntactic alignment by maximizing mutual information between the attention tensors $W_v$ of the augmentation views. To regularize the joint attention space, we propose a pipeline consisting of four steps: 1) Attention Tensor Slicing: Given that augmentation has different effects on the attention distribution depending on the depth (layer) and the position (head) in the Transformer stack, we propose to slice the attention tensor. Chunking the attention has multiple advantages. On the one hand, this allows for preserving the locality of distribution change. This is important as it can be empirically observed that distribution divergence between views decreases with increasing depth in the encoding stack.
On the other hand, restricting the space permits the use of a simple distributional model such as a bivariate distribution. For the sake of economy in notation, we restrict the following to the attention tensor of a single encoded sample. To this end, a slicing function $\pi : \mathbb{R}^{L \times H \times n \times n} \to \mathbb{R}^{R \times n \times n}$ cuts the attention tensor $W_i$ for sample $i$ into $R$ (indexed) elements: $\pi(W_i) = (w_i^1, \ldots, w_i^R)$ with $w_i^r \in \mathbb{R}^{n \times n}$. 2) Attention Sampling: Different sentences in the batch typically yield token sequences of different lengths. To accommodate the different lengths and facilitate efficient training, sequences are typically padded with [PAD] tokens for length equality. Although this allows for efficient batch encoding on GPU, attentions arising from [PAD] tokens have to be discarded when looking at statistical relationships. In order to accommodate the different lengths of tokenized sequences, we perform a sampling step for attention values within each grid cell $w_i^r$. To this end, we leverage a multinomial distribution $P_{mult}(p_1, .., p_{n^2})$, where $s$ corresponds to the number of non-padding tokens with $1 \le s \le n$. Specifically, we sample from the pool of $s^2$ attention values, each with a probability of $\frac{1}{s^2}$, with the remaining elements associated with probability 0. As a result, we obtain a set $J^r = \{j_1, ..., j_m\}$ consisting of $m$ indices of the attention tensors for each slice $r$. It should be noted that for the same slice $r$ across the views, the same index set $J^r$ is used for sampling.
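The sampling step can be sketched as follows in plain Python. This is an illustrative sketch under stated assumptions, not the authors' code: the helper names `sample_index_set` and `gather` are hypothetical, padding is assumed to sit at the end of the sequence (so the first $s$ rows and columns of a slice belong to real tokens), and sampling is done with replacement.

```python
import random


def sample_index_set(n, s, m, rng):
    """Draw m flat indices into an n x n attention slice.

    Only the s x s block of non-padding tokens is eligible: each of its
    s * s entries has probability 1 / s**2, all other entries probability 0.
    Assumes padding is appended at the end of the sequence.
    """
    valid = [row * n + col for row in range(s) for col in range(s)]
    return [rng.choice(valid) for _ in range(m)]


def gather(slice_2d, indices, n):
    """Read the attention values at the given flat indices of an n x n slice."""
    return [slice_2d[idx // n][idx % n] for idx in indices]
```

The same index set is then applied to the corresponding slice of both views, for example `gather(slice_view1, idx, n)` and `gather(slice_view2, idx, n)`, so that the sampled values form paired observations.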

3) Attention Mutual Information Estimation:
We propose using mutual information to measure the similarity of attention patterns for different views. Specifically, we follow (Fan et al., 2020) and adopt the Log-Normal distribution for modeling the attention distribution, which is prudent for several reasons. First, empirical observation confirms attention asymmetry. Second, leveraging a non-symmetric distribution accommodates the fact that the attention tensor $W$ decomposes into $K$ and $Q$, which enables attention to be non-symmetric. Third, adopting the Log-Normal model allows the optimization objective to be defined in closed form, which makes it easy to optimize, particularly on GPUs. Mutual information for two normally distributed tuple vectors $(z_1, z_2)$ can be written as a function of correlation (I.M. and A.M., 1957), given by:

$$I(z_1, z_2) = -\frac{1}{2}\log(1 - \rho^2)$$

where $\rho$ corresponds to the correlation coefficient computed from $z_1$ and $z_2$. Hence, we compute the mutual information for each slice $r$ and sample $x_i$ as $MI_i^r = I(\log(w_i^r), \log(\overset{+}{w}{}_i^r))$. The $\log(\cdot)$ function is applied to accommodate the transformation from a Log-Normal to a Normal random variable. For details on the implementation, see Alg. 1. 4) Mutual Information Aggregation: In order to compute the loss component for attention regularization, we need to aggregate the distributional similarities for the entire tensor. Aggregation is obtained by averaging the individual similarities obtained for each slice $r$ and each sample $x_i$ in the batch. With $\lambda \in \mathbb{R}$ some weighting scalar, the attention alignment loss term is:

$$\mathcal{L}_D = -\frac{\lambda}{|\mathcal{D}_b| \, R} \sum_{i=1}^{|\mathcal{D}_b|} \sum_{r=1}^{R} MI_i^r$$
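The estimation and aggregation steps can be illustrated with a short sketch in plain Python. It is a sketch under assumptions rather than the reference implementation: the names `gaussian_mi`, `attention_mi`, and `ami_loss` are hypothetical, the sampled attention values are assumed to be strictly positive (so that the logarithm is defined), the correlation is clipped by a small `eps` to keep the logarithm finite, and the sign convention (negative weighted mean, so that minimizing the loss maximizes MI) is an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient computed on centered values."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def gaussian_mi(rho, eps=1e-6):
    """Mutual information of a bivariate normal: -0.5 * log(1 - rho^2)."""
    rho = max(min(rho, 1.0 - eps), -1.0 + eps)  # keep the log finite
    return -0.5 * math.log(1.0 - rho * rho)


def attention_mi(w_view1, w_view2):
    """MI between paired attention samples of one slice under two views."""
    z1 = [math.log(w) for w in w_view1]  # Log-Normal -> Normal
    z2 = [math.log(w) for w in w_view2]
    return gaussian_mi(pearson(z1, z2))


def ami_loss(slices_view1, slices_view2, lam=2.5e-3):
    """Negative lambda-weighted mean MI over all (sample, slice) pairs."""
    values = [attention_mi(a, b) for a, b in zip(slices_view1, slices_view2)]
    return -lam * sum(values) / len(values)
```

Because only a correlation coefficient and a logarithm are required per slice, the same computation maps directly onto batched tensor operations.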

Experiments
In this section, we describe the experimental setting used for the evaluation, present our main results, and discuss different aspects of our method by providing several empirical analyses.

Model and Hyperparameters:
Training is started from a pre-trained Transformer LM. Specifically, we employ the Hugging Face (Wolf et al., 2020) implementation of BERT$_{base}$. For each approach evaluated, we follow the hyperparameters proposed by the respective authors. In the InfoNCE loss, we set τ = 0.05. In order to determine the hyperparameter λ, a coarse grid search over {1.0, 0.1, ..., 1.0e−5} was conducted to assess the magnitude. Upon determination, a fine grid search was conducted once with 10 steps. We set λ = 2.5e−3 for training on 100% of the data in a single epoch with a batch size of 50, a learning rate of 3.0e−5, and 250 warm-up steps. The number of optimization steps is largely kept constant across the different dataset sizes. For the training set of size 10^6 (= 100%), we train for 1 epoch; for the size of 10^5 (= 10%), we train for 10 epochs, etc. The momentum encoder is associated with a sample queue of size |Q| = 384. The momentum encoder parameters are updated with a factor of 0.995, except for the MLP pooling layer, which is kept identical to the online network. Additionally, we increase the drop-out for the momentum encoder network from the default rate (0.1) to 0.3.
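The relation between training set size, number of epochs, and number of optimization steps can be made explicit with a small arithmetic sketch. The helper names are hypothetical, and extending the stated pattern (1 epoch for 10^6 samples, 10 epochs for 10^5) to the smaller sets by exact inverse proportionality is an assumption, since the number of steps is only described as largely constant.

```python
def epochs_for(train_size, reference_size=10**6, reference_epochs=1):
    """Epochs such that train_size * epochs matches the reference budget."""
    return reference_epochs * reference_size // train_size


def optimization_steps(train_size, epochs, batch_size=50):
    """Number of parameter updates for a given set size and epoch count."""
    return train_size * epochs // batch_size
```

Under these assumptions, `optimization_steps(size, epochs_for(size))` evaluates to 20000 updates for each of the set sizes 10^6, 10^5, 10^4, 5·10^3 and 10^3 at a batch size of 50.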

2021), we train the model in an unsupervised fashion on sentences from Wikipedia. In order to train the model in a few-shot learning scenario, we create random sample sets of different sizes {10^6, 10^5, 10^4, 5.0 · 10^3, 10^3}. We repeated the training set creation for each size 5 times with different random seeds.
Mutual Information Estimation: Following the observations in (Voita et al., 2019), we restrict the computation of the mutual information to the upper part of the layer stack. Specifically, we select the layers between 8 and 12 (= last layer in BERT$_{base}$). To accommodate input sequences of varying lengths and make computation more efficient, we pool together pairs of adjacent heads (without overlap) while preserving the head separation. From each of the $(4 \times \frac{H}{2})$ chunks of pooled attentions, we randomly sample 150 joint-attention pairs for each embedding of the bi-encoder.

Experimental Results
Unsupervised Sentence Embedding: We compare miCSE to previous state-of-the-art sentence embedding methods on STS tasks. For comparisons, we favored comparable architectures (bi-encoder) that facilitate seamless integration of the proposed approach and methods of comparable backbone.
For semantic text similarity, we evaluated on 7 STS tasks: STS 2012-2016 (Agirre et al., 2012, 2013, 2014, 2015, 2016), STS Benchmark (Cer et al., 2017) and SICK-Relatedness (Marelli et al., 2014). These datasets come in sentence pairs with labels ranging from 0 to 5, indicating the semantic relatedness of the pairs. Specifically, we employ the SentEval toolkit (Conneau and Kiela, 2018) for evaluation. It should be noted that all our STS experiments are conducted in a fully unsupervised setup, not involving any STS training data. The benchmark measures the relatedness of two sentences based on the cosine similarity of their embeddings. The evaluation criterion is Spearman's rank correlation (ρ). For comparability, we follow the evaluation protocol of (Gao et al., 2021), employing Spearman's rank correlation and aggregation on all the topic subsets. Results for the sentence similarity experiment are presented in Tab. 1. As can be seen, the proposed approach is slightly lower in terms of average performance than state-of-the-art algorithms such as PCL. However, our proposed method has the most consistent performance across all benchmarks, with performance always in the top-3. A more in-depth analysis shows the best performance on the SICK-R benchmark, where it outperforms the second-best approach SCD by (+0.44) and PCL by (+0.87). We highlight the comparison to the closest method, SimCSE, where the proposed approach has an average gain of (+3.94). This improvement is due to the two additional components (i.e., AMI and MoCo) that we add to this baseline method.
Low-shot Sentence Embedding: In this experiment, the performance of several SOTA sentence embedding approaches is benchmarked in detail. Similar to Sec. 3.2, we evaluate on the 7 STS tasks, including STS Benchmark and SICK-Relatedness, with Spearman's rank correlation (ρ) as the evaluation metric. However, in contrast to the previous section, models are trained on different subsets of the data, namely {100%, 10%, 1%, 0.1%} of the Wikipedia dataset used in (Gao et al., 2021). Results for the low-shot sentence similarity experiment are presented in Fig. 2. As can be seen, the proposed approach gains from increasing the training set size and consistently outperforms all the baselines on all training subsets. Interestingly, our proposed method reaches the performance of SimCSE trained on the entire dataset with only 0.5% of the data. We believe this shows the impact of exploiting syntax information for data augmentation during training. It should be noted that the performance gain is most significant when the embedding is obtained from a single token rather than from token averaging. We attribute this to token averaging being, to a certain degree, equivalent to attention regularization. In the extremely low data regime, the proposed approach shows very strong performance, with gains of up to (+11) compared to SimCSE (see Fig. 3a). This suggests the potential resilience of our method to very small batch training.
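The evaluation criterion can be illustrated with a minimal sketch in plain Python that scores sentence pairs by cosine similarity and compares the resulting ranking with the gold relatedness labels. It does not reproduce the SentEval code: all function names are hypothetical, and ties are handled by average ranks, which is the usual convention but an assumption here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[pos]]):
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            result[order[k]] = mean_rank
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    return pearson(ranks(x), ranks(y))


def sts_score(embedding_pairs, gold_labels):
    """Spearman's rho between cosine similarities and gold relatedness."""
    predicted = [cosine(a, b) for a, b in embedding_pairs]
    return spearman(predicted, gold_labels)
```

Since only the ranking of the cosine similarities matters, the score is invariant to any monotonic rescaling of the similarity values.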

Experimental Analysis of components
Given that AMI acts as a regularizer on Transformer attention, we evaluate the applicability in conjunction with other contrastive learning methods. We evaluate the following approaches CT ( Additionally, it shows the most significant boost in performance in combination with SimCSE. In addition, we observe that the impact of AMI grows with declining training set size. In combination with SimCSE, AMI leads to a performance gain of up to (+5.91) at 0.1% of the data. We also observe that adding AMI to all the approaches significantly reduces the variance for all methods. This can probably be attributed to the regularization effect of the proposed AMI component. In addition, we conducted an ablation study to assess the effect of AMI and MoCo w.r.t. the baseline SimCSE. As can be seen in Fig. 3b, both AMI and MoCo improve the baseline at different data ratios. Again, AMI provides a particularly strong performance boost in the low-data regime. In contrast, the impact of MoCo diminishes with decreasing training set size. We emphasize that our approach gets the best of both worlds by integrating these two components. In practice, this can be directly exploited for different few-shot setups by adjusting the relative importance of AMI via tuning hyper-parameter λ.

Analysis on Syntax vs. Semantic
In light of the lack of a rigorous benchmark for analyzing syntax in sentence embedding, we performed two qualitative analyses, visualized in Fig. 1 and Fig. 4.
Observation (i): There is higher semantic and syntactic similarity between positive pairs than between negative pairs: Our contrastive learning approach assumes that positive pairs exhibit more syntactic similarity than negative pairs (i.e., syntactic inductive bias). To validate this hypothesis, we plot the semantic similarity against the syntactic similarity for both positive and negative pairs. Specifically, we analyzed the embeddings and attention values of models trained with SimCSE and with the proposed approach. Input to the models was randomly sampled sentences from Wikipedia. Interestingly enough, although training the proposed model involves maximization of MI over the attention w.r.t. positive pairs, we also observe the reflection of syntactic information in the negative pairs. As can be seen in Fig. 1, the negative pairs end up in the lower left corner, whereas the positive pairs are in the upper right corner.

Observation (ii):
Negative pairs with similar syntax show higher attention similarity than pairs with dissimilar syntax: For a more in-depth analysis, we further sub-divided the negative pairs into two groups: a) negative pairs with similar dependency trees, b) negative pairs with dissimilar dependency trees. For simplicity, we adopted a binary similarity scheme: "similar" implies an identical dependency tree, whereas "dissimilar" corresponds to a non-identical dependency tree. To highlight the inter-group syntax similarity, samples of each group were normalized w.r.t. the centroid of the opposite group. As can be seen in Fig. 4 (by the increased distance between the cluster centers), the proposed approach encodes a notion of syntactic similarity. Note that this margin appeared solely due to enforcing the AMI on attention for the positive pairs, leading to the emergence of a notion of "syntax" on negative pairs.
Discussion on the Syntax and Attention relation: The proposed approach aligns the attention patterns for drop-out augmented input pairs. We posit that conducting such a regularization enforces constraints w.r.t. the syntax tree of the sentence embeddings. This is motivated by recent findings in the literature, which suggest that the Transformer's attention captures the syntactic and grammatical relationships of sentences (Ravishankar et al., 2021; Clark et al., 2019; Raganato et al., 2018; Voita et al., 2019). Additionally, recent research explicitly targets the extraction of topologies from attention maps for diverse tasks on syntactic and grammatical structure (Kushnareva et al., 2021; Cherniavskii et al., 2022; Perez and Reinauer, 2022). Although no "one-to-one" mapping connects syntactic structures and attention patterns, the attention tensor, at the bare minimum, encodes a "holistic notion" of the grammatical structure of sentences.

Conclusion
We proposed a method to inject structural similarity into language models for self-supervised representation learning of sentence embeddings. The proposed approach integrates the inductive bias at the level of Transformer attention by enforcing mutual information on positive pairs obtained by drop-out augmentation. Leveraging attention regularization makes the proposed approach much more sample efficient. Consequently, it outperforms other methods by a significant margin in low-shot learning scenarios while performing comparably to state-of-the-art approaches in the full-shot scenario. Future work will investigate the extension of the proposed approach to discrete augmentation.

A Appendix
In the following sections, we add additional details omitted in the main paper due to space restrictions. First, we illustrate the cosine similarity distribution according to human judgment (ground-truth) in Sec. B. Next, in Sec. C, we visualize the 2D histogram of joint distributions between views. In Sec. D, we present detailed results of the few-shot performance of miCSE in contrastive and non-contrastive setup. Finally, the exact relation between mutual information and correlation is presented in Sec. E.

B Cosine-similarity Distribution
To directly show the strengths of our approach on STS tasks, we illustrate the distribution of embedding cosine similarities for STS-B pairs in combination with human ratings in Fig. 5. The STS dataset comes in sentence pairs together with labels ranging from 0 to 5, indicating the semantic relatedness of the pairs. Here, the x-axis is the similarity of the sentences according to human judgment (ground-truth), and the y-axis represents the cosine similarity between the embeddings of each pair.
Color coding corresponds to ground-truth similarity. Compared to the baseline model (SimCSE), miCSE better distinguishes sentence pairs with different levels of similarities, as can be seen from the stronger correlation between embedding distance and human rating. This property leads to better performance on STS tasks. In addition, we observe that miCSE generally shows a more scattered distribution while preserving a lower variance on semantically similar sentence pairs. This observation further validates that miCSE can potentially achieve a better alignment-uniformity balance.

C Visualization of Joint Distribution
To analyze the impact of the proposed approach compared to the baseline SimCSE at the attention level, we visualized the joint distribution of the attention values of the two views created by the bi-encoder. The joint distribution and mutual information are closely related. More specifically, given two random variables $X$ and $Y$, the associated mutual information can be expressed in terms of the joint distribution as:

$$I(X; Y) = \int\!\!\int p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \, dx \, dy$$

where $p(x, y)$ denotes the joint distribution and $p(x)$, $p(y)$ the marginals. Assuming random variables that are normally distributed, the joint distribution of the random variables is distinctly shaped depending on the correlation coefficient $\rho$. See Sec. E for details on the relationship between entropy and the correlation coefficient. In the extreme case of totally unrelated marginals ($\rho = 0$), the joint distribution assumes a circular shape with the lowest possible mutual information. On the other end of the spectrum, in the case of perfect correlation, the joint distribution assumes collinearity (45° diagonal), with mutual information assuming its maximal value. To avoid visual clutter, we sliced the attention tensor into 12 slices, pooling together every 3 adjacent heads and every 4 adjacent layers. Slicing the tensor at a higher resolution leads to visually very similar results. The axes of the joint distribution (2D histogram) correspond to the marginal distributions. As miCSE maximizes the mutual information, one can observe a reduction in the scatter of the joint distribution compared to SimCSE.

D Detailed Comparison with SimCSE
Our proposed method is built on top of contrastive learning. Thus it intrinsically relies on the existence of the negative pairs. To complement the performance comparison of contrastive learning in Fig. 3a, we designed an experiment to analyze the extent to which attention regularization alone (AMI) can compensate for the lack of negative pairs. To that end, we conducted training with positive pairs only. See Tab. 3 and Fig. 7 for results.
The integration of mutual attention information boosts the performance by up to (+15) across all training set sizes. It suggests the potential application of our proposed attention regularization for non-contrastive learning.

Figure: Joint distribution between two augmentation-induced views. Images depict 12 attention slices per method, obtained by slicing the attention tensor for the input sentence "the best thing you can do is to know your stuff." Depth in the layer stack increases from left to right, top to bottom. Shown for SimCSE and miCSE (best viewed in color).