Contrastive Document Representation Learning with Graph Attention Networks

Recent progress in pretrained Transformer-based language models has shown great success in learning contextual representations of text. However, due to the quadratic complexity of self-attention, most pretrained Transformer models can only handle relatively short text, and modeling very long documents remains a challenge. In this work, we propose to use a graph attention network on top of an available pretrained Transformer model to learn document embeddings. The graph attention network allows us to leverage the high-level semantic structure of the document. In addition, based on our graph document model, we design a simple contrastive learning strategy to pretrain our models on a large unlabeled corpus. Empirically, we demonstrate the effectiveness of our approaches on document classification and document retrieval tasks.


Introduction
Document representations that capture semantics are crucial to various document-level Natural Language Processing (NLP) tasks, including sentiment analysis (Medhat et al., 2014), text classification (Kowsari et al., 2019) and information retrieval (Lin et al., 2020). In recent years, an increasing volume of work has focused on learning a task-agnostic universal representation for long documents. While improved performance on downstream tasks has been achieved, there are two challenges to learning a high-quality document representation: (1) absence of document structure. Most works treat the document as a sequence of tokens without considering its high-level structure.
(2) data scarcity. Existing methods for document representation learning are significantly affected by the scarcity of document-level data.
Transformer-based pretrained language models are ubiquitously state-of-the-art across many NLP tasks. Transformer models such as BERT (Devlin et al., 2019) and its variants have shown great success in learning contextual representations of text. Representations from large language models can partially mitigate the data scarcity issue thanks to pretraining on large amounts of unlabeled data. However, those models mostly consider token-level information, and their pretraining tasks do not directly target long document representations. Another issue with directly applying Transformer-based models is the limit on input text length. Due to the quadratic complexity of self-attention, most pretrained Transformer models can only handle relatively short text. A wide spectrum of efficient, fast transformer models (collectively called "Xformers") have been proposed to tackle this problem; e.g., Longformer (Beltagy et al., 2020) and BigBird (Zaheer et al., 2020) use sparse attention to improve the computational and memory efficiency for long text sequences. Nevertheless, these models still focus on token-level interactions without considering the high-level semantic structure of the document.
Recently, there has been a resurgence of interest in Contrastive Learning (CL) due to its success in self-supervised representation learning in computer vision (He et al., 2020). Contrastive Learning offers a simple method to learn disentangled representations that encode invariance to small, local changes in the input data without using any labeled data. In the NLP domain, contrastive learning has been employed to learn sentence representations (Qu et al., 2020) under either self-supervised or supervised settings.
In this work, we propose a Graph Attention Network (GAT) based model that explicitly utilizes the high-level semantic structure of documents to learn document embeddings. We model the document not just as a sequence of text, but as a collection of passages or sentences. Specifically, the proposed model introduces a graph on top of the document passages (Fig. 1) to utilize multi-granularity information. First, passages are encoded using RoBERTa (Liu et al., 2019) to collect word-level knowledge. Then passages are connected to leverage the higher-level structured information. Finally, a graph attention network (Veličković et al., 2017) is applied to obtain the multi-granularity document representation. To better learn the document embedding, we propose a document-level contrastive learning strategy to pretrain our models. In our contrastive learning framework, we split the document into random sub-documents and train the model to maximize the agreement over the representations of sub-documents that come from the same document. This simple strategy allows us to pretrain our models on a large unlabeled corpus without any additional priors. As we will see, this simple pretraining task indeed helps the model on downstream tasks.
The contributions of this paper can be summarized as follows.
• We propose a graph document model with graph attention networks that can not only explicitly utilize the high-level structure of the document but also leverage pretrained Transformer encoders to obtain low-level contextual information.
• We propose a simple document-level contrastive learning strategy, which does not require any handcrafted transformations and is suitable for large-scale pretraining.
• We conduct empirical evaluations of our models and contrastive pretraining strategy. We show that our graph-roberta models achieve strong performance on both document classification and retrieval tasks. Specifically, we demonstrate that our contrastive pretraining helps the model learn a meaningful document representation even without fine-tuning, and improves both the training convergence speed and the final performance during end-to-end finetuning on downstream classification tasks. For document retrieval tasks, we demonstrate that our graph-roberta models have strong semantic matching performance, complementing typical lexical matching systems.

Methodology
In this section, we describe our main model and contrastive pretraining strategy.

Figure 1: An example of the Graph-Roberta architecture for the document representation.

Graph Document Architecture
In this work, we model a document as a graph over passages. Given a document D with passages {p_1, . . . , p_n}, we define an undirected graph G = (V, E), where V consists of n + 1 nodes (v_D, v_{p_1}, · · · , v_{p_n}) and the graph edges E are constructed based on the document structure. An example of a document graph is shown in Fig. 1. Once the document graph is defined, we can instantiate a neural network model based on the graph structure.
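As a concrete sketch (function and variable names are ours, not from the paper), the default fully-connected variant of this document graph can be built as a simple edge list, with node 0 standing in for the document node:

```python
def build_fully_connected_graph(n_passages):
    """Node 0 is the document node; nodes 1..n_passages are passage nodes.
    Returns all undirected edges of the fully-connected graph."""
    nodes = list(range(n_passages + 1))
    edges = [(i, j) for i in nodes for j in nodes if i < j]
    return nodes, edges

# A 3-passage document yields 4 nodes and C(4, 2) = 6 undirected edges.
nodes, edges = build_fully_connected_graph(3)
```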
Passage Node Initialization First, we use a state-of-the-art contextual language model to encode each passage, since each passage is relatively short. Specifically, given a passage p_i consisting of a sequence of words {w_{i,1}, w_{i,2}, · · · , w_{i,|p_i|}}, we use RoBERTa as the encoder for the passage node and project the [CLS] vector into a fixed embedding space as the initial passage node representation:

v_{p_i}^(0) = W φ(p_i),

where φ(p_i) is RoBERTa's [CLS] vector for passage p_i and W is a learned projection.
Document Node Initialization For the document node, we simply use the average of all the passage node embeddings as the initial representation.
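The document-node initialization can be sketched as follows; we assume the passage embeddings (RoBERTa's projected [CLS] vectors in the paper) are already computed, and simply average them:

```python
def init_document_node(passage_embs):
    """Average the passage node embeddings to initialize the document node."""
    dim = len(passage_embs[0])
    return [sum(e[d] for e in passage_embs) / len(passage_embs)
            for d in range(dim)]

doc_emb = init_document_node([[1.0, 2.0], [3.0, 4.0]])  # [2.0, 3.0]
```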
Graph Attention Layers Finally, we apply T Graph Attention Layers (GAL) 1 to aggregate the information from all nodes:

v_i^(t) = GAL( v_i^(t-1), { v_j^(t-1) : j ∈ N(i) } ),

where N(i) is the set of neighbour nodes of node i in the given graph structure. The step t counts from 1 to T, and the final document node representation is v_D^(T).
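The paper uses graph attention layers with learned parameters (via DGL, as noted later); the toy update below only illustrates the aggregation idea, scoring neighbours by dot product and mixing their embeddings with softmax weights (all names are ours):

```python
import math

def gat_update(embs, neighbours):
    """One attention-weighted aggregation step.
    embs: {node_id: vector}; neighbours: {node_id: [neighbour ids]}."""
    new_embs = {}
    for i, nbrs in neighbours.items():
        peers = [i] + nbrs  # include a self-loop
        scores = [sum(a * b for a, b in zip(embs[i], embs[j])) for j in peers]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]  # numerically stable softmax
        total = sum(weights)
        new_embs[i] = [sum(w / total * embs[j][d]
                           for w, j in zip(weights, peers))
                       for d in range(len(embs[i]))]
    return new_embs
```

Stacking T such updates and reading out the document node gives the final representation.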

Contrastive Pretraining for Document Representation Learning
We design a simple contrastive learning task to pretrain our graph document models. The main idea follows the contrastive learning framework of Chen et al. (2020), where the task is to learn an encoder function that maximizes the agreement between augmented views of the same image. Here we consider any portions of the same document to be different "views". The task is to maximize the agreement between different portions that come from the same document. Since our document is represented as a list of passages, a portion of a document is any subset of passages, which we call a sub-document. At training time, we randomly sample a mini-batch of documents. For each document D_i, we randomly split its passages into two subsets, D̃_i and D̂_i, such that D_i is the union of D̃_i and D̂_i. We treat (D̃_i, D̂_i) as a positive pair; any pair of sub-documents that come from different documents is a negative pair. The noise contrastive loss for a positive pair is then defined as

ℓ_i = −log [ exp(sim(v_{D̃_i}, v_{D̂_i})/τ) / Σ_{D'} exp(sim(v_{D̃_i}, v_{D'})/τ) ],

where v_{D̃_i} is the encoding of sub-document D̃_i produced by the proposed graph document model, sim(·, ·) is a similarity function, τ is a temperature, and the sum in the denominator runs over all other sub-documents D' in the mini-batch.
The final loss is computed across all the pairs.
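A minimal sketch of this objective (cosine similarity and the temperature value are standard choices we assume here, not specifics from the paper). reps holds the 2N sub-document embeddings, ordered so that positions 2k and 2k+1 are a positive pair:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def contrastive_loss(reps, tau=0.1):
    """NT-Xent-style loss over 2N sub-document embeddings."""
    n = len(reps)
    total = 0.0
    for i in range(n):
        pos = i + 1 if i % 2 == 0 else i - 1  # the paired sub-document
        denom = sum(math.exp(cosine(reps[i], reps[j]) / tau)
                    for j in range(n) if j != i)
        num = math.exp(cosine(reps[i], reps[pos]) / tau)
        total += -math.log(num / denom)
    return total / n
```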

Experiments
We experiment on two popular applications, text classification and document retrieval, to evaluate the proposed approach. The experimental results show that the graph-based document representation can capture long document information and that the contrastive learning strategy can utilize unlabeled data to further improve performance and training efficiency.
Model details Throughout the paper, we use the roberta-base model as our passage node encoder. On top of that, we add 2 graph attention layers with 2 heads and skip connections. Specifically, we utilize the Deep Graph Library 2 for the GAT implementation. The embedding size for passage and document nodes is 512. We refer to this model as the graph-roberta model.

Datasets
In this section, we describe all the datasets we use in this paper.
OpenWebText (Gokaslan et al., 2019) is an open-source recreation of the WebText corpus in Radford et al. (2019). The text was extracted from Reddit post URLs, which produces around 8M documents.
arXiv is a collection of 33,388 arXiv scientific papers from 11 categories. The average document length exceeds 5,000 words. We create a random train/dev/test split of 25,568/3,196/3,197.

Newsgroup (Lang, 1995) is a collection of newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. It contains 11,314 training and 7,532 test samples. We sample 10% of the training data for validation.
IMDB (Maas et al., 2011) is a dataset for binary sentiment classification. It contains 25,000 labeled movie reviews as the training set and another 25,000 movie reviews as the test set. We randomly sample 1,000 examples from the training set for validation.
Robust04 (Voorhees, 2005) is the news collection from the TREC 2004 Robust track. It is a document retrieval dataset consisting of 249 queries with relevance labels on a corpus of 528K documents.
MSMARCO DR (Bajaj et al., 2016) is a document ranking dataset with about 3.2M documents. It provides over 367K training queries and an official dev set of 5,193 queries. The TREC 2019 Deep Learning track (Craswell et al., 2020) also provides an additional test set of 43 queries.
WIKIR 3 (Frej et al., 2019) is an open-source toolkit for creating large-scale information retrieval datasets based on Wikipedia. In this work, we use the English Wikipedia dump from 2020/12/20 4 and follow the same settings as Frej et al. (2019), except that we preserve the punctuation and section information in the documents. We obtain two datasets, WIKIR62K and WIKIRS62K, both of which contain around 60k training queries, 1k dev queries and 1k test queries. The queries in WIKIR62K are built from titles and the ones in WIKIRS62K from first sentences. The processed corpus size is around 2.4M documents.
Since graph-roberta models take a document input as a graph over passages, we split each document into passages of around 100 words while respecting sentence boundaries. For WIKIR documents, we also respect section boundaries. Unless otherwise specified, we use the fully-connected graph structure by default.
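A rough sketch of this splitting step, assuming sentence segmentation has already been done: pack whole sentences greedily until the word budget is reached (names are ours).

```python
def split_into_passages(sentences, target_words=100):
    """Greedily pack whole sentences into passages of ~target_words words."""
    passages, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and count + n > target_words:
            passages.append(" ".join(current))  # flush the current passage
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        passages.append(" ".join(current))
    return passages
```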

Document Classification
In this section, we conduct an empirical evaluation of our models on document classification tasks. We consider 4 datasets: arXiv, Hyperpartisan, IMDB and Newsgroup. We compare our graph-roberta models with the baseline RoBERTa model, as well as Longformer (Beltagy et al., 2020) and BigBird (Zaheer et al., 2020), two state-of-the-art transformer models that handle long text input 5 . In our experiments, we only consider the base versions of these models.

Contrastive Pretraining
We pretrain our graph-roberta models on the OpenWebText dataset. During contrastive pretraining, for each document we keep up to 50 passages, randomly select half of them as one sub-document, and use the remaining half as the other sub-document. We train for 10 epochs with batch size 1,536, using the Adam (Kingma and Ba, 2014) optimizer with learning rate 5e-5 and warmup rate 0.1.
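The sub-document sampling can be sketched as follows (names are ours):

```python
import random

def make_views(passages, max_passages=50, seed=None):
    """Keep up to max_passages, then randomly split them into two halves
    that serve as the positive pair of sub-documents."""
    rng = random.Random(seed)
    kept = passages[:max_passages]
    idx = list(range(len(kept)))
    rng.shuffle(idx)
    half = len(idx) // 2
    view_a = [kept[i] for i in sorted(idx[:half])]
    view_b = [kept[i] for i in sorted(idx[half:])]
    return view_a, view_b
```

Sorting the sampled indices keeps passages in document order within each view; that ordering is our assumption, as the paper does not state whether it is preserved.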
Finetuning For graph-roberta models, we keep up to 50 passages per document during training and up to 100 passages per document at inference time. For the other models, we truncate the document text to the maximum sequence length they can handle: RoBERTa's maximum input length is 512, while Longformer's and BigBird's is 4,096. The detailed training configurations are shown in the appendix.
Clustering First, we evaluate the capability of our graph-roberta model as an off-the-shelf document encoder through document clustering. We take the document node representation from the pretrained graph-roberta model and the [CLS] embeddings from the other three models. We run k-means clustering on the training set and run inference on the test set. We compute the normalized mutual information (NMI) and Purity to evaluate the clustering quality. We report the results on the arXiv and Newsgroup datasets in Table 1. As we can see, our pretrained graph-roberta model outperforms the other three models by a large margin. This is expected, since the other three models are not pretrained on any document-level task. Fig. 3 & 4 showcase that the simple unsupervised contrastive learning strategy indeed helps the graph-roberta model learn meaningful document representations. End-to-end Classification To evaluate the full capability of the graph-roberta model, we also conduct end-to-end finetuning on the 4 datasets. In addition to the 4 pretrained models, we also report the performance of graph-roberta without contrastive learning. The results are shown in Table 2. First, we can see that the graph-roberta model outperforms all the other methods on 3 out of 4 datasets. The exception is the IMDB dataset, which has relatively short text. We also see that contrastive learning indeed helps improve the final performance. Fig. 5 shows the end-to-end training processes of graph-roberta models on the arXiv and Newsgroup datasets. It demonstrates that the contrastive learning task speeds up finetuning and helps learn a better model.
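For reference, the Purity score used in the clustering evaluation above can be computed as below: each cluster is credited with its majority gold label.

```python
from collections import Counter

def purity(cluster_ids, gold_labels):
    """Fraction of points whose cluster's majority label matches their own."""
    clusters = {}
    for c, g in zip(cluster_ids, gold_labels):
        clusters.setdefault(c, []).append(g)
    correct = sum(Counter(members).most_common(1)[0][1]
                  for members in clusters.values())
    return correct / len(gold_labels)

purity([0, 0, 1, 1], ["a", "a", "b", "a"])  # 0.75
```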

Document Retrieval
In this section, we extend our model to the embedding-based document retrieval task. In this case, we treat the query as a single-node graph, whose representation is given by the initial node representation. With that, we apply dot-product similarity to retrieve relevant documents. Our approach is essentially a representation-based model.
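The retrieval step itself then reduces to a dot-product scan over precomputed document embeddings, sketched below (in practice an approximate nearest-neighbour index would be used at corpus scale; names are ours):

```python
def retrieve(query_emb, doc_embs, top_k=10):
    """Rank documents by dot-product similarity with the query embedding."""
    scored = [(sum(q * d for q, d in zip(query_emb, emb)), doc_id)
              for doc_id, emb in doc_embs.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]
```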

Contrastive Pretraining
To better align with the retrieval tasks, during pretraining we sample one passage from each document as one sub-document and use the rest as the other sub-document, instead of an even random split, and we only compute the contrastive loss over the long sub-documents. In addition, 50% of the time we select the first passage of the document, and 50% of the time we sample a passage uniformly from the document. This is very similar to the Inverse Cloze Task (ICT) introduced in Lee et al. (2019), except that ICT randomly selects one sentence from a passage whereas we randomly select passages from the document. We pretrain the graph-roberta model on the OpenWebText data for 10 epochs with batch size 1,024, using the Adam (Kingma and Ba, 2014) optimizer with learning rate 5e-5 and warmup rate 0.1.
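This retrieval-oriented sampling scheme can be sketched as (names are ours):

```python
import random

def retrieval_views(passages, rng=random):
    """Pick one passage as the short 'query' view (the first passage half
    the time, a uniform sample otherwise); the rest form the long view."""
    i = 0 if rng.random() < 0.5 else rng.randrange(len(passages))
    return passages[i], passages[:i] + passages[i + 1:]
```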
Finetuning To finetune the model on ranking datasets, we use a training loss similar to that of contrastive pretraining, except that we use actual training query-document pairs. Besides in-batch negatives, we also sample additional negative candidates, either uniformly or from hard negatives such as the top BM25 retrieval pool for each training query.
First, we run experiments on the Robust04 dataset. We train and cross-validate on the given 5 folds. In each run, we finetune the model for 10 epochs with batch size 32, using the Adam optimizer with learning rate 5e-5 and warmup rate 0.1. We sample 8 additional random negatives uniformly for each training query. We compare our models with the BM25 baseline. The results are shown in Table 3. First, we can see that contrastive pretraining significantly improves the graph-roberta model's performance (e.g., P@20 improves by over 100%). Still, as a standalone retrieval model, graph-roberta underperforms BM25. We conjecture there are two reasons: (1) the Robust04 query set is too small to train such a complex neural representation model; (2) Robust04 queries are all short keyword queries, which favor lexical-matching methods such as BM25 over contextual transformer models. Nevertheless, when we combine the retrieval results of the graph-roberta model and BM25 through a weighted average of their scores (the weight is selected through cross-validation), we improve nDCG@20 by 2% in absolute value over BM25, which indicates that our model complements BM25 with semantic matching.

Next, we present experiments on the much larger MSMARCO document ranking dataset. We finetune graph-roberta models on the MSMARCO training set with batch size 128 for 10 epochs. For each training query, we also sample one hard negative in addition to the in-batch negatives. For the first 5 epochs, we randomly sample the negative from the top 100 BM25 retrieval results; for the last 5 epochs, we randomly sample it from the top 100 results retrieved by the 5-epoch checkpoint model. We also sample 100 queries from the official training set as our own validation set to monitor training. We report the retrieval performance (without reranking) in Table 4.
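The BM25 combination mentioned above amounts to a convex combination of the two systems' per-document scores. A sketch follows; the per-query min-max normalization is our assumption, as the paper only specifies a weighted average with a cross-validated weight.

```python
def fuse_scores(dense, bm25, weight=0.5):
    """Weighted average of per-document scores from two retrieval systems."""
    def norm(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero on constant scores
        return {k: (v - lo) / span for k, v in scores.items()}
    nd, nb = norm(dense), norm(bm25)
    return {doc: weight * nd.get(doc, 0.0) + (1 - weight) * nb.get(doc, 0.0)
            for doc in set(nd) | set(nb)}
```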
Similar to the Robust04 experiment, contrastive learning as a pretraining strategy again improves the graph-roberta model's performance. Note that the improvement on MSMARCO is not as significant as on Robust04; given that MSMARCO has a much larger training set, a smaller benefit from pretraining is expected. Compared with BM25, graph-roberta as a dense retrieval method achieves almost 9 points higher MRR@100 on the official dev set. We also list the performance of state-of-the-art neural retrieval methods. DE-Hybrid-E and ME-Hybrid-E (Luan et al., 2020) are two hybrid sparse-dense models that combine BM25 with BERT-encoded dense representations. Note that graph-roberta already outperforms the hybrid models on the official dev set, indicating that the representation learned by the graph-roberta model is very effective. Lastly, combining graph-roberta and BM25 retrieval results through a simple weighted average gives performance similar to the SOTA method ANCE 7 (Xiong et al., 2020). Furthermore, we believe that the training strategy introduced in ANCE can also be applied to graph-roberta model training; we leave this to future work.
To further demonstrate the effectiveness of graph-roberta models for the document retrieval task, we evaluate our models on the two large document retrieval datasets created via WIKIR (Frej et al., 2019), namely WIKIR62K and WIKIRS62K. In this experiment, we also consider different graph structures for modeling the Wikipedia documents. Besides the default fully-connected graph, we also consider the section structure of the documents. Specifically, all the passages within each section are mutually connected, and the document node and the first passage node of each section are connected with each other. We denote this graph as the section graph. We finetune the models on the training data for 5 epochs with batch size 128, using the Adam optimizer with learning rate 2e-5 and warmup rate 0.1. For each query, we also sample one hard negative from the top 100 BM25-retrieved candidates. The final retrieval benchmark is shown in Table 5.
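Under our reading of the section graph (passages form a clique within each section, and the document node plus each section's first passage form another clique), its edge set can be built as follows; the function name and exact connectivity are our interpretation of the description above:

```python
def build_section_graph(sections):
    """sections: list of lists of passage node ids; node 0 is the document.
    Returns the sorted undirected edge list of the section graph."""
    edges = set()
    for passages in sections:
        for a in passages:
            for b in passages:
                if a < b:
                    edges.add((a, b))  # within-section clique
    hubs = [0] + [s[0] for s in sections if s]
    for a in hubs:
        for b in hubs:
            if a < b:
                edges.add((a, b))  # document node + section-head clique
    return sorted(edges)
```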
In Table 5, we see that contrastive pretraining consistently improves the model performance on both title queries and first-sentence queries. BM25 performs much better on title queries than on first-sentence queries, as observed in Frej et al. (2019), since title queries are usually keyword queries. Our graph-roberta model outperforms MatchPyramid (Pang et al., 2016) and Conv-KNRM (Dai et al., 2018), and performs consistently on both title and first-sentence queries. We further combine the results of graph-roberta and BM25. Overall, the ensemble of BM25 and graph-roberta gives the best results.
We notice that for graph-roberta models, utilizing the section graph described earlier performs slightly better than the default fully-connected graph, although the difference is small. We conjecture that on this dataset the document representation does not rely much on interactions between passages. We look into the graph attention patterns of the two models (graph-roberta with the fully-connected graph and graph-roberta with the section graph). We compute the average attention weights of the last graph attention layer and observe that, in both models, the document node usually attends to similar passages. As an example, we plot the graph attention weights of both models. Fig. 6 shows the attention weights of the document node: for this example, both models attend to similar passages besides the document node. In Fig. 7, we observe that for graph-roberta with the fully-connected graph, all the other passage nodes have attention patterns similar to the document node's, while for graph-roberta with the section graph, the passage nodes can actually learn some nontrivial patterns, which we believe could be beneficial for more complex tasks.

Related work
Document Representation Learning One line of related work utilizes the successful pretrained Transformer models (Radford et al., 2018; Devlin et al., 2019) to obtain contextual text representations, which have been shown to be successful on sentences and short passages in textual similarity and passage retrieval tasks (Reimers and Gurevych, 2019; Minaee et al., 2021; Karpukhin et al., 2020; Liang et al., 2020). Efficient variants such as Longformer (Beltagy et al., 2020) and BigBird (Zaheer et al., 2020) handle long text sequences more efficiently. However, these pretrained models still focus on token-level interactions. Jiang et al. (2019) proposed a hierarchical attention model on top of recurrent neural networks to tackle text matching for long documents, which was later extended to transformer architectures. Both works focus only on the text matching task. Pappagari et al. (2019) proposed a hierarchical transformer model to encode long documents, applying a recurrent network or transformer layer on top of the original BERT model. In our work, we use a GAT network which can better leverage the existing document structure, and we design a simple and effective contrastive learning framework based on our graph model. Another line of related work uses graph neural networks for document modeling. Peng et al. (2018, 2019) proposed to use graph convolutional networks (GCN) to model a document as a graph of words, which allows the model to capture long-distance semantics. Yao et al. (2019) built a single graph for a whole corpus based on both word-to-word and document-to-word relations, which is learned by a GCN model.
Contrastive Learning Contrastive learning as a self-supervised pretraining method has been widely used in NLP models (Rethmeier and Augenstein, 2021). Token- and sentence-level contrastive learning tasks have been shown to be very useful for learning better contextual representations (Clark et al., 2020; Giorgi et al., 2020; Meng et al., 2021). There has also been work proposing data augmentations for contrastive learning. Fang et al. (2020) proposed to use back-translation to construct positive sentence pairs in their contrastive learning framework. Qu et al. (2020) proposed multiple sentence-level augmentation strategies for sentence contrastive learning. Most of these works still focus on either local token-level tasks or short sentence-level tasks. In our work, we directly target a document-level contrastive learning task. More recently, Luo et al. (2021) proposed to use multiple data augmentations such as synonym substitution and back-translation for unsupervised document representation learning. The difference in our work is that we have a much simpler framework that does not require such handcrafted transformations, and we demonstrate that our contrastive learning strategy as a pretraining task helps downstream tasks across various datasets.

Conclusions
In this work, we propose a simple graph attention network model to learn document embeddings. Our model can not only leverage recent advances in pretrained Transformer models as building blocks, but also explicitly utilize the high-level structure of documents. In addition, we propose a simple document-level contrastive learning strategy that does not require handcrafted transformations. With this strategy, we conduct large-scale contrastive pretraining on a large corpus. Empirically, we demonstrate that our methods achieve strong performance on both document classification and document retrieval tasks.

A Training details for document classification
We list the hyperparameters for finetuning the models on 4 document classification datasets in Table 6.