Efficient Document Embeddings via Self-Contrastive Bregman Divergence Learning

Learning quality document embeddings is a fundamental problem in natural language processing (NLP), information retrieval (IR), recommendation systems, and search engines. Despite recent advances in the development of transformer-based models that produce sentence embeddings with self-contrastive learning, encoding long documents (thousands of words) remains challenging with respect to both efficiency and quality. We therefore train Longformer-based document encoders using a state-of-the-art unsupervised contrastive learning method (SimCSE). We further complement the baseline method -- a siamese neural network -- with additional convex neural networks based on functional Bregman divergence, aiming to enhance the quality of the output document representations. We show that, overall, the combination of a self-contrastive siamese network and our proposed neural Bregman network outperforms the baselines in two linear classification settings on three long document topic classification tasks from the legal and biomedical domains.


Introduction
The development of quality document encoders is of paramount importance for several NLP applications, such as long document classification tasks with biomedical (Johnson et al., 2016) or legal (Chalkidis et al., 2022b) documents, as well as information retrieval tasks (Chalkidis et al., 2021a; Rabelo et al., 2022; Nentidis et al., 2022). Despite the recent advances in the development of transformer-based sentence encoders (Reimers and Gurevych, 2019; Gao et al., 2021; Liu et al., 2021; Klein and Nabi, 2022a) via unsupervised contrastive learning, little is known about the potential of neural document-level encoders targeting long documents (thousands of words). The computational complexity of standard pre-trained Transformer-based language models (PLMs) (Vaswani et al., 2017; Devlin et al., 2019), given their quadratic self-attention operations, poses challenges in encoding long documents. To address this computational problem, researchers have introduced efficient sparse-attention networks, such as Longformer (Beltagy et al., 2020), BigBird (Zaheer et al., 2020), and Hierarchical Transformers (Chalkidis et al., 2022a). Nonetheless, fine-tuning such models on downstream tasks is computationally expensive; hence we need efficient document encoders that produce quality document representations usable for downstream tasks out-of-the-box, i.e., without fully (end-to-end) fine-tuning the pre-trained encoder, if at all.
Besides computational complexity, building good representation models for long documents is challenging due to document length. Long documents contain more information than shorter documents, making it more difficult to capture all the relevant information in a fixed-size representation. In addition, long documents may have sections covering different topics, which increases the complexity of encoding and often leads to collapsing representations (Jing et al., 2022). Moreover, long documents can be semantically incoherent, meaning that their content may not be logically related or may contain irrelevant information. For these reasons, it is challenging to create a quality representation that captures the most important information in the document.
To the best of our knowledge, we are the first to explore the application of self-contrastive learning to long documents (Table 1). The contributions of our work are threefold: (i) We train Longformer-based document encoders using a state-of-the-art self-contrastive learning method, SimCSE by Gao et al. (2021).
(ii) We further enhance the quality of the latent representations using convex neural networks based on functional Bregman divergence. The network is optimized with a self-contrastive loss combined with a divergence loss (Rezaei et al., 2021).
(iii) We perform extensive experiments to highlight the empirical benefits of learning representations with unsupervised contrastive learning and our proposed enhanced self-contrastive divergence loss. We compare our method with baselines on three long document topic classification tasks from the legal and biomedical domains.

Related Work Document Encoders
The need for quality document representations has always been an active topic of NLP research. Early work in statistical NLP represented documents as Bags of Words (BoW), and in this direction TF-IDF representations were the standard for a long time. In the early days of deep learning in NLP, models were developed to represent words with latent representations, such as Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). Within this line of work, the use of word embedding centroids as document embeddings and the Doc2Vec (Le and Mikolov, 2014) model were proposed. Given the computational cost of encoding documents with neural networks, follow-up work mainly targeted sentence/paragraph-level representations, such as Skip-Thoughts (Kiros et al., 2015), which relies on an RNN encoder. In the era of pre-trained Transformer-based language models, Reimers and Gurevych (2019) proposed the Sentence Transformers framework for developing quality dense sentence representations. Many works followed a similar direction relying on a self-supervised contrastive learning setup, with most ideas adopted from the computer vision literature (Chen et al., 2020; Bardes et al., 2022).
Self-Supervised Contrastive Learning in NLP: Several self-contrastive methods have been proposed for NLP applications. To name a few: MirrorRoBERTa (Liu et al., 2021), SCD (Klein and Nabi, 2022b), miCSE (Klein and Nabi, 2022a), DeCLUTR (Giorgi et al., 2021), and SimCSE (Gao et al., 2021) -- described in Section 3.2 -- all create augmented versions (views) of the original sentences using varying dropout and compare their similarity. The application of such methods has been limited to short sentences and relevant downstream tasks, e.g., sentence similarity, and these methods do not use any additional component to maximize diversity in the latent feature representations.

Base Model -Longformer
We experiment with Longformer (Beltagy et al., 2020), a well-known and relatively simple sparse-attention Transformer. Longformer uses two sets of attention: sliding-window attention and global attention. Instead of using the full attention mechanism, sliding-window attention gives higher importance to the local context. Given a fixed window size w, each token attends to w/2 tokens on each side. The required memory for this is O(n × w). Sliding-window attention is combined with global attention from/to the [CLS] token.
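The sliding-window pattern can be illustrated with a minimal mask builder (a sketch with names of our own choosing, not Longformer's internal implementation): each position attends to w/2 neighbors on each side, plus global attention from/to designated tokens such as [CLS].

```python
import torch

def sliding_window_mask(seq_len: int, window: int, global_idx=(0,)) -> torch.Tensor:
    """Boolean attention mask: True where attention is allowed.

    Each token attends to window // 2 tokens on each side (local attention);
    tokens in `global_idx` (e.g. [CLS] at position 0) attend to, and are
    attended by, every position.
    """
    idx = torch.arange(seq_len)
    # |i - j| <= window // 2  -> local sliding-window attention
    mask = (idx[None, :] - idx[:, None]).abs() <= window // 2
    for g in global_idx:
        mask[g, :] = True  # global token attends everywhere
        mask[:, g] = True  # every token attends to the global token
    return mask

mask = sliding_window_mask(seq_len=16, window=4)
# Allowed entries grow as O(n * w) rather than O(n^2) for full attention.
allowed = int(mask.sum())
```

With n = 16 and w = 4 the number of allowed attention pairs stays far below the n² = 256 of full attention, which is the source of Longformer's memory savings.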

Domain-Adapted Longformer:
As a baseline, we use Longformer DA models, which are Longformer models warm-started from domain-specific PLMs.
To do so, we clone the original positional embeddings 8× to encode sequences of up to 4,096 tokens. The rest of the parameters (word embeddings, transformer layers) can be directly transferred, with the exception of Longformer's global attention K, Q, V matrices, which we warm-start from the standard (local) attention matrices, following Beltagy et al. (2020). All parameters are updated during training.
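The warm-start can be sketched in a few lines of PyTorch (sizes here are illustrative assumptions -- 512 source positions and hidden size 768, cloned 8× to 4,096 positions -- and the variable names are ours):

```python
import torch

torch.manual_seed(0)

# Pre-trained BERT-style positional embeddings: (512 positions, hidden 768).
src_pos = torch.randn(512, 768)

# Clone the table 8x along the position axis -> (4096, 768), so positions
# 512..1023 reuse the pattern of 0..511, and so on.
ext_pos = src_pos.repeat(8, 1)

# Warm-start the global attention projection from the local one, as in
# Beltagy et al. (2020); shown here for a single query matrix.
local_q = torch.randn(768, 768)
global_q = local_q.clone()
```

All of these tensors remain trainable, matching the statement that every parameter is updated during training.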
For legal applications (Section 4.1), we warm-start our models from Legal-BERT (Chalkidis et al., 2020), a BERT model pre-trained on diverse English legal corpora, while for the biomedical application, we use BioBERT (Lee et al., 2020), a BERT model pre-trained on biomedical corpora.

Self-supervised Contrastive Learning
To use our Longformer DA for self-supervised contrastive learning, we adopt a siamese network architecture (left part of Figure 1). Assume a mini-batch D = {x_i}_{i=1}^N of N documents. As positive pairs (x_i, x_i^+), the method uses augmented (noised) versions of the input x_i. As negative pairs (x_i, x_i^-), all remaining N−1 documents in the mini-batch are used. The augmentations take place in the encoder block f_θ of the model, where θ is the parameterization of the encoder. We use the SimCSE (Gao et al., 2021) framework, in which the encoder f_θ is a pre-trained language model -- Longformer DA in our case -- and augmentation comes in the form of a varying token dropout (masking) rate (τ). The loss objective used in the unsupervised version of SimCSE is the multiple negatives ranking loss (ℓ_mnr):

$$\ell_{mnr} = -\log \frac{\exp(\mathrm{sim}(s_i, \tilde{s}_i))}{\sum_{j=1}^{N} \exp(\mathrm{sim}(s_i, \tilde{s}_j))} \quad (1)$$

where $\tilde{s}_i$ is the representation of the positive augmented version of input i in the mini-batch, and the $\tilde{s}_j$ (j ≠ i) are the negatives. Multiple negatives ranking loss takes a pair of representations $(s_i, \tilde{s}_i)$ and compares it with the negative samples in the mini-batch. In our experiments, we train such models, dubbed Longformer DA+SimCSE.
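The multiple negatives ranking loss can be sketched in PyTorch as cross-entropy over an in-batch cosine-similarity matrix with the diagonal as the target class (a minimal sketch; the tensor sizes are illustrative, and the temperature of 0.1 follows the value fixed in Appendix A):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def multiple_negatives_ranking_loss(s, s_tilde, temp=0.1):
    """s[i] and s_tilde[i] are two dropout-augmented views of document i.

    In-batch negatives: for row i, every s_tilde[j] with j != i is a negative.
    This is cross-entropy over the scaled cosine-similarity matrix, with the
    diagonal (the true positive pairs) as the target class.
    """
    s = F.normalize(s, dim=-1)
    s_tilde = F.normalize(s_tilde, dim=-1)
    sim = s @ s_tilde.t() / temp
    labels = torch.arange(s.size(0))
    return F.cross_entropy(sim, labels)

x = torch.randn(8, 32)
loss_pos = multiple_negatives_ranking_loss(x, x)                 # identical views
loss_rand = multiple_negatives_ranking_loss(x, torch.randn(8, 32))
```

As a sanity check, the loss for perfectly matching views is much lower than for random pairs, which is exactly the gradient signal that pulls augmented views of the same document together.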

Bregman Divergence Loss
We complement this method with an additional ensemble of subnetworks optimized via a functional Bregman divergence, aiming to further improve the output document latent representations. Specifically, the embeddings of the self-contrastive network are further passed to k independent subnetworks to promote diversity in the feature representations.
The s_i and s_j vectors from the contrastive framework are mapped to an ensemble of k independent neural networks that are optimized using the functional Bregman divergence.
Each sub-network produces a separate output (right part of Figure 1). The divergence is then computed between the outputs ŝ_a and ŝ_b, using the projections as input. We convert the divergence to a similarity using a Gaussian kernel, as done by Rezaei et al. (2021). The mini-batch has size N. For empirical distributions s_a(z_i) and s_b(z_j), where i and j are the respective indices for the two branches and z the projector representation, we compute the Bregman divergence loss. The final objective function is the combination of the contrastive loss and the Bregman divergence loss.

Datasets

SCOTUS. This is a single-label multi-class topic classification task: given a SCOTUS opinion, the model has to predict the relevant area among 14 issue areas (labels).
MIMIC (Johnson et al., 2016) contains approx. 50k discharge summaries from US hospitals. Each summary is annotated with one or more codes (labels) from the ICD-9 hierarchy, which has 8 levels in total. We use the 1st level of ICD-9, comprising 19 categories. This is a multi-label topic classification task: given a discharge summary, the model has to predict the relevant top-level ICD-9 codes (labels).

Experimental Settings
To gain insight into the quality of the learned representations out-of-the-box, we train classifiers using document embeddings as fixed (frozen) feature representations. We consider two linear classification settings: (i) linear evaluation with an MLP classification head on top of the document embeddings; (ii) linear evaluation with a linear classifier on top of the document embeddings.
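The frozen-feature protocol can be sketched as follows (all sizes are illustrative assumptions, not the paper's exact dimensions): the encoder's document embeddings are treated as fixed inputs, and only a small head is trained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Frozen document embeddings used as fixed features (e.g. from a
# Longformer DA + SimCSE encoder); sizes here are for illustration only.
emb = torch.randn(100, 768)        # 100 documents, 768-dim frozen embeddings
y = torch.randint(0, 14, (100,))   # e.g. 14 SCOTUS issue areas

# Setting (ii): a single linear classifier over the frozen embeddings.
# Setting (i) would swap in a small MLP head instead, e.g.
#   nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 14))
clf = nn.Linear(768, 14)
opt = torch.optim.AdamW(clf.parameters(), lr=1e-4)

for _ in range(5):                 # only the head is trained; the encoder stays frozen
    opt.zero_grad()
    loss = F.cross_entropy(clf(emb), y)
    loss.backward()
    opt.step()
```

Because only the head's parameters receive gradients, the trainable-parameter count stays tiny relative to the encoder, which is what drives the efficiency numbers reported in Table 2.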

Results and Discussion
In Table 2, we present results for all examined Longformer variants across the three datasets and two settings, using macro-F1 (m-F1) and micro-F1 (µ-F1) scores.
Classification performance: In the last line of Table 2, we present results for the baseline Longformer DA model fine-tuned end-to-end, which is a 'ceiling' for the expected performance compared to the two examined linear settings, where the document encoders are not updated. We observe that on the SCOTUS dataset, models trained with an MLP head come really close to the ceiling performance (approx. 1-4 p.p. less in µ-F1). The gap is smaller for both models trained with the self-contrastive objective (+SimCSE, +SimCSE+Bregman), especially the one with the additional Bregman divergence loss, where the performance decrease in µ-F1 is only 1 p.p.
In the other two datasets (ECtHR and MIMIC), the performance of the linear models is still approx. 10-15 p.p. behind the ceilings in µ-F1. In ECtHR, we find that self-contrastive learning improves performance in the first setting by 3 p.p. in µ-F1, while the additional Bregman divergence loss does not really improve performance. This is not the case in the second linear setting (second group in Table 2), where the baseline outperforms both models. Similarly, in MIMIC, we observe that self-contrastive learning improves performance in the first setting by 3 p.p. in µ-F1, but performance is comparable with linear classifiers. Overall, our enhanced self-contrastive method leads to the best results compared to its counterparts.
In Table 3, we also present results on SCOTUS in a few-shot setting using the SetFit (Tunstall et al., 2022) framework, where Bregman divergence loss improves performance compared to the baselines.
Given the overall results, we conclude that building subnetwork ensembles on top of the document embeddings can be a useful technique for encoding long documents and can help avoid the problem of collapsing representations, where the model is unable to capture all the relevant information in the input. Our approach has several advantages for long-document processing:
Efficiency considerations: In Table 2, we observe that in both linear settings, where fixed document representations are used, training time decreases 2-8× compared to end-to-end fine-tuning, while approx. 0.5% of the parameters are trainable across cases, which directly affects the compute budget. We provide further information on model sizes in Appendix B.
Avoidance of collapsing representations: When processing long documents, there is a risk that the representation will collapse (Jing et al., 2022), meaning that the model will not be able to capture all the relevant information in the input. By mapping the document embedding from the base encoder into smaller sub-networks, the risk of collapsing representations is reduced, as the divergence loss attempts to reduce redundancy in the feature representation by minimizing correlation. The results in Table 3, obtained in a low-resource setting, further highlight the advantage of training a Longformer with contrastive divergence learning.

Conclusions and Future Work
We proposed and examined self-supervised contrastive divergence learning for learning representations of long documents. Our proposed method is composed of a self-contrastive learning framework followed by an ensemble of neural networks optimized by a functional Bregman divergence. Our method showed improvements over the baselines on three long document topic classification tasks in the legal and biomedical domains, with more pronounced improvements in a few-shot learning setting. In future work, we would like to further investigate the impact of the Bregman divergence loss on more classification datasets and other NLP tasks, e.g., document retrieval.

Limitations
In this work, we focus on small and medium-sized models (up to 134M parameters), while recent work on Large Language Models (LLMs) targets models with billions of parameters (Brown et al., 2020; Chowdhery et al., 2022). It is unclear how well the performance improvements from the examined network architecture would translate to other model sizes or baseline architectures, e.g., GPT models.
Further on, it is unclear how these findings may translate to other application domains and datasets, or impact other NLP tasks, such as document retrieval/ranking.We will investigate these directions in future work.

A Hyper-parameter Optimization
Continued Pre-training: We define the search space based on previous studies, such as Rezaei et al. (2021) and Gao et al. (2021). For the contrastive Bregman divergence, we benchmark the first-stage hyper-parameters on the downstream task in order to tune them. We use mean pooling for all settings. The learning rate, the total optimization steps, the use of a batch-norm layer, the σ parameter, the number of sub-networks k, and the batch size are grid-searched. The temperature (0.1) and the input length (4,096) are fixed beforehand. The learning rate for these models was 3e-5. We run 50,000 optimization steps for each model.
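A grid search over such a space is a straightforward product of candidate values; the sketch below uses illustrative values (not the paper's exact grid) and a dummy scoring function standing in for training plus downstream evaluation.

```python
from itertools import product

# Illustrative search space: the candidate values below are assumptions,
# not the grid actually used in the paper.
grid = {
    "lr": [1e-5, 3e-5, 5e-5],
    "batch_size": [8, 16],
    "num_subnetworks": [4, 8],
    "sigma": [0.5, 1.0],
}

def dev_score(cfg):
    # Placeholder: in practice this would run continued pre-training with
    # cfg and return the downstream dev-set score.
    return -cfg["lr"]  # dummy objective for illustration only

# Enumerate every combination and keep the best-scoring configuration.
configs = [dict(zip(grid, vals)) for vals in product(*grid.values())]
best = max(configs, key=dev_score)
```

With the dummy objective above, the search simply prefers the smallest learning rate; a real run would replace `dev_score` with the actual train-and-evaluate loop.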

Training for classification tasks:
We used AdamW as the optimizer. Bayesian optimization is used to tune the hyper-parameters: learning rate, number of epochs, and batch size. We use mean pooling for all settings. Early stopping is set with a patience of 3. These parameters were fixed after some early experiments. We use a learning rate of 1e-4 and run ECtHR and SCOTUS for 20 and … epochs, respectively.

C Pooling methods
We evaluate mean, max, and [CLS] pooling. Results for end-to-end fine-tuning can be found in Table 6. Our results show that using mean pooling during continued pre-training in combination with max pooling for classification can further enhance performance, compared to using the same pooling method for both stages.
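The three pooling operators reduce a sequence of token embeddings to a single document vector; a minimal sketch (tensor sizes and function names are illustrative assumptions):

```python
import torch

torch.manual_seed(0)
hidden = torch.randn(2, 10, 768)   # (batch, tokens, hidden) token embeddings
attn_mask = torch.ones(2, 10)      # 1 = real token, 0 = padding

def mean_pool(h, mask):
    # Average over real tokens only, ignoring padding positions.
    mask = mask.unsqueeze(-1)
    return (h * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

def max_pool(h, mask):
    # Element-wise max over real tokens; padding is masked to -inf.
    h = h.masked_fill(mask.unsqueeze(-1) == 0, float("-inf"))
    return h.max(dim=1).values

def cls_pool(h, mask):
    # Representation of the first ([CLS]) token.
    return h[:, 0]

doc_mean = mean_pool(hidden, attn_mask)
doc_max = max_pool(hidden, attn_mask)
doc_cls = cls_pool(hidden, attn_mask)
```

All three produce one fixed-size vector per document, so they are interchangeable at the interface level; which one works best is the empirical question Table 6 addresses.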

D Neural network Architecture
Our model contains two linear layers with one activation layer and two batch normalization layers. We also compare against the model without batch normalization layers. The comparison is made on the SCOTUS dataset using end-to-end fine-tuning. One can see that removing batch normalization worsens performance (Table 7).
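A sketch of this architecture, with the ablation toggled by a flag (layer widths are assumptions for illustration, not the paper's exact sizes):

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Two linear layers, one activation, two batch-norm layers.

    Setting use_bn=False gives the ablation variant without batch
    normalization compared in Table 7.
    """
    def __init__(self, in_dim=768, hidden=512, out_dim=256, use_bn=True):
        super().__init__()
        layers = [nn.Linear(in_dim, hidden)]
        if use_bn:
            layers.append(nn.BatchNorm1d(hidden))
        layers.append(nn.ReLU())
        layers.append(nn.Linear(hidden, out_dim))
        if use_bn:
            layers.append(nn.BatchNorm1d(out_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

out = ProjectionHead()(torch.randn(4, 768))
out_no_bn = ProjectionHead(use_bn=False)(torch.randn(4, 768))
```

Both variants keep the same input/output interface, so the ablation isolates the effect of the normalization layers alone.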

Figure 1 :
Figure 1: Illustration of our proposed self-contrastive method combining SimCSE of Gao et al. (2021) (left part) with the additional Bregman divergence networks and objective of Rezaei et al. (2021) (right part).

The functional Bregman divergence is defined as:

$$G_\phi(s_a, s_b) = \phi(s_a) - \phi(s_b) - \int \left[ s_a(x) - s_b(x) \right] \delta\phi(s_b)(x)\, d\mu(x) \quad (2)$$

where $s_a$ and $s_b$ are vectors output by the self-contrastive network, and $\phi$ is a strictly convex function that can be described via a linear functional consisting of weights $w_k$ and biases $\epsilon_k$. The function $\phi(s_a)$ is approximated by:

$$\phi(s_a) = \sup_{(w, \epsilon_w) \in Q} \left[ \int s_a(x) w(x)\, dx + \epsilon_w \right] \quad (3)$$

We take the empirical distribution of the projection representation to compute $\hat{s}_a$ and $\hat{s}_b$. Specifically, we define:

$$\hat{s}_i = \operatorname*{argmax}_k \left[ \int s_i(x) w_k(x)\, dx + \epsilon_k \right] \quad \text{for } i \in \{a, b\}$$

Using the above specification and $\phi(s_a)$, we obtain the functional divergence term.
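For finite-dimensional vectors, approximating ϕ by a maximum over k affine functionals makes the divergence easy to compute: the gradient of ϕ at s_b is the weight of the affine piece selected by the argmax index for s_b, so the divergence reduces to ϕ(s_a) minus that supporting affine functional evaluated at s_a. The Gaussian-kernel conversion to a similarity follows Rezaei et al. (2021). This is a sketch with illustrative sizes and our own function names, not the authors' implementation.

```python
import torch

torch.manual_seed(0)
k, d = 8, 32              # k affine functionals (sub-networks), feature dim d
W = torch.randn(k, d)     # weights w_k
eps = torch.randn(k)      # biases epsilon_k

def phi(s):
    """Max-affine approximation of the convex function phi (cf. Eq. 3)."""
    return (s @ W.t() + eps).max(dim=-1).values

def bregman_divergence(s_a, s_b):
    """G_phi(s_a, s_b) for piecewise-linear phi (cf. Eq. 2): phi(s_a) minus
    the affine piece selected by the argmax index for s_b."""
    k_b = (s_b @ W.t() + eps).argmax(dim=-1)
    support = (s_a * W[k_b]).sum(dim=-1) + eps[k_b]
    return phi(s_a) - support

def gaussian_similarity(div, sigma=1.0):
    """Convert the (non-negative) divergence into a similarity in (0, 1]."""
    return torch.exp(-div / (2 * sigma ** 2))

s_a, s_b = torch.randn(4, d), torch.randn(4, d)
div = bregman_divergence(s_a, s_b)
sim = gaussian_similarity(div)
```

By construction the divergence is non-negative (the max dominates every affine piece) and vanishes when both inputs select the same supporting functional, e.g. when s_a = s_b.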

Table 2 :
Test results for all methods across all datasets. Best performance in bold; second-best underlined. We also report average training time and the percentage of trainable parameters.

Table 3 :
Test results for all Longformer variants for SCOTUS in the few-shot setting. Best performance in bold; second-best underlined.

Model                    µ-F1   m-F1
Longformer DA            54.9   48.1
 » + SimCSE              51.8   43.6
 » + SimCSE + Bregman    56.9   48.5
Table 5 shows the number of parameters for the different models. Converting the transformer to a Longformer adds 6M parameters for LegalBERT-small and 24M parameters for BioBERT-base. By working with LegalBERT-small and BioBERT-base, we cover both small and medium-sized models.

Table 5 :
Number of Parameters for the Longformer variants.

Table 6 :
Test results for various pooling operators with end-to-end tuning on SCOTUS for Longformer DA .

Table 7 :
F1 performance for the ablation model without batch-norm layers for end-to-end fine-tuning on SCOTUS.