Augmenting BERT-style Models with Predictive Coding to Improve Discourse-level Representations

Current language models are usually trained using a self-supervised scheme, where the main focus is learning representations at the word or sentence level. However, there has been limited progress in generating useful discourse-level representations. In this work, we propose to use ideas from predictive coding theory to augment BERT-style language models with a mechanism that allows them to learn suitable discourse-level representations. As a result, our proposed approach is able to predict future sentences using explicit top-down connections that operate at the intermediate layers of the network. By experimenting with benchmarks designed to evaluate discourse-related knowledge using pre-trained sentence representations, we demonstrate that our approach improves performance in 6 out of 11 tasks by excelling in discourse relationship detection.


Introduction
Pre-trained language models are among the leading methods to learn useful representations for textual data. Several pre-training objectives have been proposed in recent years, such as causal language modeling (Radford et al., 2018, 2019), masked language modeling (Devlin et al., 2019), and permutation language modeling (Yang et al., 2019). However, these approaches do not produce suitable representations at the discourse level (Huber et al., 2020).
In this work, we propose to extend BERT-type models with recursive bottom-up and top-down computation based on PC theory. Specifically, we incorporate top-down connections that, according to PC, convey predictions from upper to lower layers, which are contrasted with bottom-up representations to generate an error signal that is used to guide the optimization of the model. Using this approach, we attempt to build feature representations that capture discourse-level relationships by continually predicting future sentences in a latent space. We evaluate our approach on DiscoEval (Chen et al., 2019) and SciDTB for discourse evaluation (Huber et al., 2020) to assess whether the embeddings produced by our model capture discourse properties of sentences without finetuning. Our model achieves competitive performance compared to baselines, especially in tasks that require to discover discourse-level relations.
Related Work

BERT for Sentence Representation
Pre-trained self-supervised language models have become popular in recent years. BERT (Devlin et al., 2019) adopts a transformer encoder trained with a masked language modeling (MLM) objective for word representation. It also proposes an additional loss, called next-sentence prediction (NSP), to train a model that understands sentence relationships. ALBERT (Lan et al., 2020), on the other hand, proposes a loss based primarily on coherence, called sentence-order prediction (SOP).
SBERT (Reimers and Gurevych, 2019) uses a siamese structure to obtain semantically meaningful sentence embeddings, focusing on textual similarity tasks. ConveRT (Henderson et al., 2020) uses a dual-encoder to improve sentence embeddings for response selection tasks. These models focus on obtaining better representations for specialized sentence-pair tasks, so they are not directly comparable with ours, which is intended to be general-purpose.
More recently, SLM (Lee et al., 2020) proposes a sentence unshuffling approach for a fine-grained understanding of the relations among sentences at the discourse level. CONPONO (Iter et al., 2020) considers a discourse-level objective to predict the surrounding sentences given an anchor text. This work is closely related to our approach; the key difference is that our model predicts future sentences sequentially using a top-down pathway. We consider CONPONO our main baseline.

Predictive Coding and Deep Learning
Recent work in computer vision takes inspiration from PC theory to build models for accurate (Han et al., 2018) and robust (Huang et al., 2020) image classification. PredNet (Lotter et al., 2017) proposes a network capable of predicting future frames in a video sequence by making local predictions at each level using top-down connections. CPC (Oord et al., 2018) is an unsupervised learning approach to extract useful representations by predicting text in a latent space. Our method takes inspiration from these models, considering top-down connections and predictive processing in a latent space.

Model Details
Our model consists of a BERT-style sentence encoder (ALBERT and BERT are used in this work) and a GRU model (Cho et al., 2014) that predicts the next sentences (see Figure 1). Our intuition is that, by giving the model the ability to predict future sentences through a top-down pathway, it will learn better relationships between sentences, thus improving the sentence-level representations of each layer for downstream tasks.
The input is a sequence s_1, s_2, ..., s_n of sentences extracted from a paragraph. We encode sentence s_t with an encoder g_enc that generates an output z_t^l at time step t and layer l (l ranges from 1 to L). Note that the vector z_t^l is obtained from the special token [CLS], which is commonly used as a sentence representation. Next, an autoregressive model g_ar produces a context vector c_t^l as a function of z_t^l, the context vector of the upper layer c_t^{l+1}, and that of the previous step c_{t-1}^l:

c_t^l = g_ar(z_t^l, c_t^{l+1}, c_{t-1}^l)    (1)
Then we introduce a predictive function f(.) to predict a future sentence. In other words, f(.) takes as input the context representation c_t^l from time step t and layer l, and predicts the latent representation z_{t+1}^l at time step t+1, i.e.:

ẑ_{t+1}^l = f(c_t^l)

In the spirit of Seq2Seq (Sutskever et al., 2014), representations are predicted sequentially, which differs from the CONPONO model, which predicts k future sentences with a unique context vector.

Figure 1: Example of predicting one future sentence. Given an input sentence s_t at time step t, the corresponding representation z_t^l is calculated at layer l. Then a context vector c_t^l is computed via a top-down pathway (left). Afterwards, a future sentence ẑ_{t+1}^l is predicted and compared to the actual representation z_{t+1} (right).
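The recursive bottom-up and top-down computation can be sketched as follows. This is a toy illustration only: we stand in simple tanh linear maps for the actual GRU g_ar and the predictive function f, and all shapes and weights are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
L, T, d = 4, 3, 8  # layers, time steps, hidden size (toy values)

# Illustrative stand-ins for the paper's components: g_ar is sketched as a
# tanh linear recurrence instead of a real GRU, and f(.) as a linear map.
W_z, W_up, W_prev, W_pred = (0.1 * rng.normal(size=(d, d)) for _ in range(4))

z = rng.normal(size=(T, L, d))   # [CLS] vectors z_t^l from the encoder
c = np.zeros((T, L, d))          # context vectors c_t^l
z_hat = np.zeros((T, L, d))      # predictions of z_{t+1}^l

for t in range(T):
    for l in reversed(range(L)):                  # top-down pass: layer L ... 1
        c_up = c[t, l + 1] if l + 1 < L else np.zeros(d)  # from upper layer
        c_prev = c[t - 1, l] if t > 0 else np.zeros(d)    # from previous step
        c[t, l] = np.tanh(z[t, l] @ W_z + c_up @ W_up + c_prev @ W_prev)
        z_hat[t, l] = c[t, l] @ W_pred            # f(c_t^l): predicted z_{t+1}^l
```

During training, each ẑ_{t+1}^l is contrasted with the actual encoder output z_{t+1}^l via the loss described next.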

Loss Function
We rely on the InfoNCE loss proposed for the CPC model (Oord et al., 2018). It constructs a classification task in which the goal is to identify one real sample among many noise samples. InfoNCE encourages the predicted representation ẑ to be close to the ground truth z.
In the forward pass, the ground-truth representation z and the predicted representation ẑ are computed at each layer of the model, so we denote the corresponding feature vectors as z_i^j and ẑ_i^j, where i denotes the temporal index and j the layer index. A dot product computes the similarity between the predicted and ground-truth pair. Then, we optimize a cross-entropy loss that distinguishes the positive pair out of all other negative pairs:

L_nsm = -Σ_{i,j} log [ exp(ẑ_i^j · z_i^j) / Σ_m exp(ẑ_i^j · z_m^j) ]

There is only one positive pair (ẑ_i^j, z_i^j) for a predicted sentence ẑ_i^j, namely the features at the same time step and the same layer. The remaining pairs (ẑ_i^j, z_m^j) with m ≠ i are negative pairs. In practice, we draw negative samples from the batch. This is a simple method, and a more sophisticated generation of negative samples could improve results. Our loss function, which we refer to as next-sentence modeling (L_nsm), is used in conjunction with the BERT masked language modeling loss (L_mlm). Accordingly, we train to minimize:

L = L_mlm + L_nsm
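The in-batch InfoNCE objective described above can be sketched in a few lines of numpy. The function name and batch layout are our own for illustration; this is not the paper's implementation.

```python
import numpy as np

def info_nce(z_hat, z):
    """In-batch InfoNCE: row i of z is the positive for row i of z_hat;
    every other row in the batch serves as a negative sample."""
    logits = z_hat @ z.T                              # dot-product similarities
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))               # positives on the diagonal

# When predictions align with their targets, the loss approaches its minimum;
# a mismatched batch yields a much higher loss.
aligned = 10.0 * np.eye(4)
assert info_nce(aligned, aligned) < info_nce(aligned, np.roll(aligned, 1, axis=0))
```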

Pre-training and Implementation Details
We extend the ALBERT and BERT models, obtaining PredALBERT and PredBERT as a result. As mentioned above, our models are fed with a set of contiguous sentences s_n that are processed one at a time. Note that the length of the conventional BERT input is 512 tokens. However, it is unlikely that a single sentence will have that many tokens, so we join 3 contiguous sentences to create a long sequence. Longer sequences are truncated, and shorter sequences are padded. We use an overlapping sentence between contiguous sentence groups. For instance, given a paragraph s_1, s_2, ..., s_9, the 1st sequence is s_1, s_2, s_3, the 2nd sequence is s_3, s_4, s_5, and so on. Our early experiments show that this setting improves the model's predictive ability on the validation set. We hypothesize that the model can predict up to 3 sentences ahead by using information from the overlapping sentences.
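The grouping scheme can be sketched as follows. The function name and parameterization are ours; the paper only fixes a group size of 3 with a one-sentence overlap.

```python
def make_sequences(sentences, group_size=3, overlap=1):
    """Group contiguous sentences so that consecutive groups share
    `overlap` sentences, e.g. s1..s9 -> (s1,s2,s3), (s3,s4,s5), ..."""
    stride = group_size - overlap
    return [sentences[i:i + group_size]
            for i in range(0, len(sentences) - group_size + 1, stride)]

paragraph = [f"s{i}" for i in range(1, 10)]   # s1 .. s9
groups = make_sequences(paragraph)
# -> [['s1','s2','s3'], ['s3','s4','s5'], ['s5','s6','s7'], ['s7','s8','s9']]
```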
We pre-train our models with the predictive mechanism set to predict the next 2 future sentences (k = 2). At time step 1, our model encodes sequence 1; this vector then feeds the top-down flow (GRU), generating a context representation at each layer that is used to predict sequence 2. Next, the model encodes sequence 2 and contrasts it with the prediction. This is repeated one more time to reach k = 2 predicted future sequences. For a fair comparison, we train on the BookCorpus (Zhu et al., 2015) and Wikipedia datasets, the same data used to train the BERT, ALBERT, and CONPONO models.
Note that the top-down connections are only used during pre-training. At evaluation time, we discard them and keep only the encoder, thus obtaining a model equivalent to BERT or ALBERT in terms of parameters. Table 1 shows the number of parameters in our models. We used the Huggingface library (Wolf et al., 2020) to implement our models. We initialize the encoder with BERT or ALBERT weights depending on the version; the autoregressive model is initialized with random weights. For efficiency, in both versions we use parameter sharing across layers in the autoregressive model. We trained the models for 1M steps with batch size 8, using the Adam optimizer with weight decay and a learning rate of 5e-5. For masked language modeling, we use dynamic masking, where the masking pattern is generated every time a sequence is fed to the model. Unlike BERT, we mask 10% of all tokens in each sequence at random.
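Dynamic masking amounts to resampling the mask positions on every pass over a sequence. A simplified sketch (real BERT-style masking also replaces some selected positions with random or unchanged tokens, which we omit here; the mask id 103 is that of bert-base-uncased):

```python
import random

MASK_ID = 103  # [MASK] token id in the bert-base-uncased vocabulary

def dynamic_mask(token_ids, mask_prob=0.10, rng=None):
    """Sample a fresh masking pattern each call, masking ~10% of tokens."""
    rng = rng or random.Random()
    return [MASK_ID if rng.random() < mask_prob else t for t in token_ids]
```

Because the pattern is redrawn per call, the model sees a different masked view of the same sequence on each epoch.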

Datasets
Our focus is to evaluate whether the discourse properties of sentences are captured by our model without finetuning. The DiscoEval (Chen et al., 2019) and SciDTB-DE (Huber et al., 2020) datasets include probing tasks designed for discourse evaluation, thus revealing what discourse-related knowledge our model captures effectively.
DiscoEval: A suite of tasks to evaluate discourse-related knowledge in sentence representations. It includes 7 tasks: Sentence position (SP), Binary sentence ordering (BSO), Discourse coherence (DC), Sentence section prediction (SSP), Penn Discourse Treebank (PDTB-E/I), and Rhetorical structure theory (RST). SP, BSO, DC, and SSP assess discourse coherence with binary classification, while PDTB and RST assess discourse relations between sentences through multi-class classification.
SciDTB-DE: A set of tasks designed to determine whether an encoder captures discourse properties from scientific texts. It considers 4 tasks: Swapped units detection (Swapped), Scrambled sentence detection (Scrambled), Relation detection (BinRel), and Relation semantics detection (SemRel). The Swapped and Scrambled tasks were designed for clause coherence verification, while BinRel and SemRel target discourse relationship detection.

Baselines: We compare our models with BERT Base and Large (Devlin et al., 2019). We also evaluate CONPONO (Iter et al., 2020), the model most closely related to our approach. Because these models have more parameters than PredBERT, we also include ALBERT (Lan et al., 2020) Base and Large, which are directly comparable to our model. For a fair and consistent comparison, we rerun all baseline evaluations. We use the pre-trained Huggingface models (Wolf et al., 2020) for BERT and ALBERT. For CONPONO, we use a version pre-trained to predict the 2 next surrounding sentences.
Evaluation: For DiscoEval, we use the original code provided by Chen et al. (2019). We observe that this configuration leads to CONPONO results that differ from those reported in the original paper. Following Huber et al. (2020), we use the SentEval (Conneau and Kiela, 2018) toolkit for SciDTB-DE evaluation. In both cases, the process involves loading a pre-trained model with frozen weights and training a logistic regression on top of the sentence embeddings. As the sentence embedding, we use the average of the sentence representations ([CLS]) from all the layers.

Table 2 shows the results of our models. We observe improvements in the discourse relation detection (PDTB, RST, BinRel, SemRel) and discourse coherence (DC) tasks compared to the best baseline (CONPONO). Across these tasks, PredALBERT-L outperforms it by ∼1.34 points on average. We also highlight that PredALBERT-B/L achieves competitive performance with fewer parameters than BERT and CONPONO. The decreased performance of our models in the SP, BSO, SSP, Swap, and Scram tasks is due to the fact that these tasks are closely related to the baselines' optimization objectives, which consist of sentence order prediction (ALBERT), topic prediction (BERT), or a combination of them (CONPONO). In contrast, our approach uses a next-sentence prediction task in a generative way that encourages the capture of discourse relationships, improving performance on the PDTB, RST, BinRel, and SemRel tasks.
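The probing embedding, i.e. the per-layer [CLS] vectors averaged across layers, amounts to a single mean over the layer axis. The array shapes below are illustrative (13 hidden states, as a 12-layer model plus the embedding layer would produce):

```python
import numpy as np

# Hypothetical per-layer [CLS] vectors for one sentence: (num_layers, hidden)
cls_per_layer = np.random.default_rng(0).normal(size=(13, 768))

# Sentence embedding used for probing: average over layers
sentence_embedding = cls_per_layer.mean(axis=0)
```

A logistic regression classifier is then trained on these frozen embeddings for each probing task.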

Ablation Study
To verify the influence of the PC mechanism on pre-training, we carry out ablation experiments. We use PredALBERT-B as the Default model, which includes top-down connections and recurrence from layer 12 down to layer 1. The ablations involve removing the top-down connections and the recurrence of certain layers. Table 3 shows performance across all tasks for each benchmark.
The first experiment uses the PC mechanism on Half the layers, i.e., the GRU and predictive layer are present from layer 12 down to layer 6. This variation exceeds the Default model by ∼0.03 in DiscoEval and ∼0.14 in SciDTB-DE. The second experiment uses the PC mechanism only on the Last layer of the transformer, meaning that the combination of the GRU and prediction layer is present only in layer 12. This reduces performance by ∼0.91 in DiscoEval and ∼2.41 in SciDTB-DE.
We also conducted an additional experiment in which we removed the top-down connections (w/o TDC) from the Default model. This is equivalent to modifying equation 1 to g_ar(z_t^l, c_{t-1}^l). We found that this ablation severely affects performance, degrading the Default model by ∼2.89 in DiscoEval and ∼4.43 in SciDTB-DE.
Our findings indicate that the top-down pathway is beneficial for improving the discourse representations of BERT-type models. However, it is not clear in which layers the PC mechanism is crucial. We hypothesize that this is related to the fact that BERT-style models encode syntactic and semantic features at different layers (Jawahar et al., 2019; Aspillaga et al., 2021), so a PC mechanism specialized for syntax or semantics would be desirable. We leave this study for future work.

What Does The Model Learn?
Because our model excels at detecting discourse relations, in this section we explore whether the resulting vectors actually represent the role of a sentence in its discourse context. To illustrate what PredBERT learns, we follow the methodology proposed by Lee et al. (2020): we use sentences labeled with discourse relations as queries to retrieve the 3 most similar sentences from an unlabeled corpus using cosine similarity. We obtained the queries from the MIT Discourse Relations Annotation Guide (https://bit.ly/3z45IG2) and the unlabeled sentences from the Gutenberg dataset (Lahiri, 2014). We compute the representations as described in Section 4.2. This process allows us to verify that similar vectors share the same or equivalent discourse relations.
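The retrieval step can be sketched as follows. The function name is ours; any encoder producing fixed-size sentence vectors would fit this scheme.

```python
import numpy as np

def top_k_neighbors(query_vec, corpus_vecs, k=3):
    """Indices of the k corpus vectors most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    return np.argsort(-(c @ q))[:k]   # descending cosine similarity
```

For each query below, the three sentences shown are those returned by this nearest-neighbor search over the Gutenberg sentence embeddings.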
Temporal relation: Query = He knows a tasty meal when he eats one.
1. The last five words took Tuppence's fancy mightily, especially after a meagre breakfast and a supper of buns the night before.
2. I know a disinterested man when I see him.
3. He had about ten pounds when I found him.

Sentence 1 has a succession relation due to the use of the word after. Sentence 3 shows a synchrony relation because it uses when, as the query does. Sentence 2 does not have a temporal relation.

Comparison relation: Query = IBM's stock price rose, but the overall market fell.
1. The stock markets of the world gambled upon its chances, and its bonds at one time were high in favor.
2. Tommy's heart beat faster, but his casual pleasantness did not waver.
3. I guess I was just a mite hasty, but I've been feeling bad about this money question.

Sentence 1 matches the words stock and market but does not contain a comparison relation. Sentences 2 and 3 include a counter-expectation relation similar to that of the query sentence, which uses the word but.
Contingency relation: Query = I refused to pay the cobbler the full $95 because he did poor work.
1. I did the labor of writing one address this year, and got thunder for my reward.
2. I don't believe in a law to prevent a man from getting rich; it would do more harm than good.
3. When I fixed a plan for an election in Arkansas I did it in ignorance that your convention was doing the same work.

All sentences contain semantically related words like pay/reward and poor/rich. Sentences 1 and 2 include a cause relation, explicit and implicit respectively, which relates to the query's pragmatic cause relation. Sentence 3 shows a temporal relation.

Conclusions
We introduced an approach based on PC theory that extends BERT-style models with recursive bottom-up and top-down computation along with a discourse-level representation objective. Our models achieve competitive results on discourse analysis tasks, excelling in relation detection.