A Neural Graph-based Local Coherence Model

Entity grids and entity graphs are two frameworks for modeling local coherence. These frameworks represent entity relations between sentences and then extract features from such representations to encode coherence. The benefits of convolutional neural models for extracting informative features from entity grids have been recently studied. In this work, we study the benefits of Relational Graph Convolutional Networks (RGCN) to encode entity graphs for measuring local coherence. We evaluate our neural graph-based model for two benchmark coherence evaluation tasks: sentence ordering (SO) and summary coherence rating (SCR). The results show that our neural graph-based model consistently outperforms the neural grid-based model for both tasks. Our model performs competitively with a strong baseline coherence model, while our model uses 50% fewer parameters. Our work defines a new, efficient, and effective baseline for local coherence modeling.

Motivated by Centering theory (Joshi and Weinstein, 1981), many approaches to local coherence modeling rely on entity relations between sentences. The entity grid (Barzilay and Lapata, 2005, 2008) and the entity graph (Guinaudeau and Strube, 2013) are two well-studied frameworks for representing entity relations in a text. Entity grid-based models use grids, while entity graph-based models use graphs, to capture entity relations between sentences. Several methods have been proposed to enrich these representations and to extract features from them to model local coherence. Recent work shows the effectiveness of convolutional neural networks (CNNs) for extracting features from entity grids to encode coherence (Tien Nguyen and Joty, 2017; Joty et al., 2018). Pre-trained transformer-based encoders can also capture relations between tokens in a text (Devlin et al., 2019). However, these encoders are potentially incapable of capturing long-distance relations (Martins et al., 2021), specifically when the text is longer than the maximum input length of these encoders.
In this work, we revisit graph-based coherence assessment by introducing a neural graph-based coherence model. To do so, we represent a text via a graph (Figure 1), since a graph can capture long-distance relations in a text. Such a graph contains two types of edges: (1) edges that capture entity-based relations between sentences, and (2) edges that capture the linear order of sentences in the text. To encode such graphs, we adapt Relational Graph Convolutional Networks (RGCNs) (Schlichtkrull et al., 2018). RGCNs encode nodes of a graph into vectors using the graph's connectivity structure and any feature information captured in the graph, such as edge types. We then apply a self-attention layer to these node vectors to capture to what extent each sentence of the text is crucial for estimating the coherence of the entire text. We finally use an output layer to transform the outputs of the self-attention layer into a score, which estimates the coherence degree of the text. Figure 2 depicts an overview of our model. We evaluate our model for two benchmark coherence evaluation tasks: sentence ordering (SO) and summary coherence rating (SCR).
[Figure 1: example text]
s1: LDI Corp., Cleveland, said it will offer $50 million in commercial paper backed by lease-rental receivables.
s2: The program matches funds raised from the sale of the commercial paper with small to medium-sized leases.
s3: LDI leases and sells data-processing, telecommunications and other high-tech equipment.
s4: LDI termed the paper 'non-resource financing', meaning that investors would be repaid from the lease receivables, rather than directly by LDI Corp.

Graph Representations
For a text given as a sequence of sentences $T = (s_1, \ldots, s_n)$, we construct a graph $G = (V, E, R)$ in which $V$ is the set of nodes, $E$ is the set of edges, and $R$ denotes the label set for edges (Figure 1). Each node $v_i \in V$ corresponds to a sentence $s_i$ in the text $T$. We connect the nodes in a graph by two types of edges: (1) edges with "adj" labels, which connect nodes associated with any two adjacent sentences in the text to capture their linear order; and (2) edges with "ent" labels, which capture entity relations between sentences. We add an entity edge between nodes $v_i$ and $v_j$ if sentence $s_i$ precedes sentence $s_j$ and these sentences contain co-referring entity mentions. Edge directions capture the order of sentences. We use boldface notation for variables that refer to vectors or matrices.
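The graph construction described above can be illustrated with a minimal Python sketch. It assumes that each sentence has already been reduced to a set of entity strings and that string match stands in for coreference; the function name `build_graph` and the hand-picked entity sets are hypothetical and serve only as an illustration.

```python
from itertools import combinations


def build_graph(sentence_entities):
    """Build labeled, directed edges for a text.

    sentence_entities: list of sets; element i holds the entity strings
    mentioned in sentence s_i (string match stands in for coreference).
    Returns a sorted list of (source, target, label) triples.
    """
    n = len(sentence_entities)
    edges = set()
    # "adj" edges link each sentence to its immediate successor.
    for i in range(n - 1):
        edges.add((i, i + 1, "adj"))
    # "ent" edges link s_i to any later s_j that shares an entity.
    for i, j in combinations(range(n), 2):
        if sentence_entities[i] & sentence_entities[j]:
            edges.add((i, j, "ent"))
    return sorted(edges)


# Hypothetical entity sets for a four-sentence text (0-based indices).
example = [
    {"ldi", "paper", "receivables"},  # s1
    {"program", "paper", "leases"},   # s2
    {"ldi", "equipment"},             # s3
    {"ldi", "paper", "receivables"},  # s4
]
edges = build_graph(example)
```

Because every edge points from an earlier sentence to a later one, the edge direction alone encodes sentence order, and a pair of adjacent sentences that share an entity receives both an "adj" and an "ent" edge.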

Neural Graph-based Model
Our model consists of three layers (Figure 2): an RGCN layer, a self-attention layer, and an output layer.

RGCN
As nodes in a graph represent sentences in a text, we first map sentences to vectors in an embedding space. Given a sentence $s = (t_1, \ldots, t_{|s|})$ with $|s|$ tokens, we first map each token $t$ to its corresponding embedding $\mathbf{t}$. We then apply a BiLSTM to the token embeddings to condition each token representation on the representations of its neighboring tokens in the sentence. We use a BiLSTM (instead of transformer-based encoders like BERT) because we aim to keep our model efficient in terms of the number of parameters. We concatenate the output vectors associated with the last tokens in the left-to-right ($\overrightarrow{H}$) and right-to-left ($\overleftarrow{H}$) directions to obtain a vector representation of the sentence. We adapt an RGCN layer to take these sentence vectors and enrich them with the graph structure of the text as well as edge types. In this layer, $\mathbf{W}_r \in \mathbb{R}^{d \times d}$ encodes the label $r \in R$ between nodes $v_j$ and $v_i$, and the set $N_r(v_i)$ contains the nodes connected to $v_i$ by edges with label $r$.
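As an illustration of how a relational graph convolution combines sentence vectors, the following NumPy sketch follows the general formulation of Schlichtkrull et al. (2018): each node sums relation-specific transformations of its neighbours' vectors. Several details are assumptions rather than the configuration used in this work: messages flow from the source to the target of each edge, each relation is normalized by the neighbourhood size, the activation is a ReLU, and no self-loop term is included. The function name `rgcn_layer` is hypothetical.

```python
import numpy as np


def rgcn_layer(H, edges, W):
    """One relational graph convolution over sentence vectors.

    H: (n, d) array of sentence vectors.
    edges: iterable of (source, target, label) triples.
    W: dict mapping each edge label r to a (d, d) matrix W_r.
    """
    n, d = H.shape
    out = np.zeros((n, d))
    for label, W_r in W.items():
        # N_r(v_i): nodes with an r-labeled edge into v_i (assumed direction).
        neighbours = {i: [] for i in range(n)}
        for src, tgt, lab in edges:
            if lab == label:
                neighbours[tgt].append(src)
        for i, nbrs in neighbours.items():
            if nbrs:
                # Mean of W_r h_j over the neighbourhood (assumed normalization).
                out[i] += (H[nbrs] @ W_r.T).mean(axis=0)
    return np.maximum(out, 0.0)  # ReLU (assumed activation)


rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))
W = {"adj": np.eye(4), "ent": np.eye(4)}  # identity weights for readability
demo_edges = [(0, 1, "adj"), (1, 2, "adj"), (0, 2, "ent")]
V = rgcn_layer(H, demo_edges, W)
```

Under these assumptions the first sentence has no incoming edges and therefore receives a zero vector; a self-loop term, as in the original RGCN formulation, would be one way to avoid this.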

Self-attention
We use a multi-head self-attention layer (Vaswani et al., 2017) to estimate to what extent each sentence contributes to the coherence representation of a text. Each attention head computes a representation $\mathbf{z}_i$ of node vector $\mathbf{v}_i$ as an attention-weighted combination of the node vectors, where $\mathbf{W}_a \in \mathbb{R}^{d \times d}$ is a learned parameter matrix. The attention weights $\alpha_{ij}$ are computed from the attention function $e_{ij}$, whose parameters are $\mathbf{W}_q, \mathbf{W}_k \in \mathbb{R}^{d \times d}$; $d_s$ is the dimension of the input vectors. The outputs of $K$ independent attention heads are concatenated and linearly transformed to obtain the final node representations.

Output layer
We then apply mean pooling to the output vectors of the attention layer to obtain a vector representing the coherence of the entire text. We map this vector to a score $c$ using the trainable parameters $\mathbf{w}_o \in \mathbb{R}^d$ and $b_o \in \mathbb{R}$ of the output layer. The output of the model, $c$, estimates the coherence degree of the entire text $T$.
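The attention and output layers can be sketched in NumPy as follows. The sketch assumes standard scaled dot-product attention (Vaswani et al., 2017) with a single head, that is, $e_{ij} = (\mathbf{W}_q \mathbf{v}_i)^\top (\mathbf{W}_k \mathbf{v}_j) / \sqrt{d_s}$ normalized by a softmax over $j$, and a plain linear output layer without a squashing function. These choices, and the function names `attention_head` and `coherence_score`, are assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_head(V, W_q, W_k, W_a):
    """Single attention head over node vectors V of shape (n, d)."""
    d_s = V.shape[1]
    # e[i, j] = (W_q v_i) . (W_k v_j) / sqrt(d_s)
    e = (V @ W_q.T) @ (V @ W_k.T).T / np.sqrt(d_s)
    alpha = softmax(e, axis=1)   # each row sums to 1
    Z = alpha @ (V @ W_a.T)      # z_i = sum_j alpha_ij W_a v_j
    return Z, alpha


def coherence_score(Z, w_o, b_o):
    """Mean-pool node representations and map them to a scalar score."""
    return float(Z.mean(axis=0) @ w_o + b_o)


rng = np.random.default_rng(0)
V = rng.normal(size=(4, 6))                       # 4 sentences, d = 6
W_q, W_k, W_a = (rng.normal(size=(6, 6)) for _ in range(3))
Z, alpha = attention_head(V, W_q, W_k, W_a)
c = coherence_score(Z, rng.normal(size=6), 0.0)
```

Row $i$ of `alpha` shows how strongly sentence $i$ attends to every sentence of the text, which is the quantity the attention layer uses to weigh each sentence's contribution.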

Training and Evaluation
We train our model in a ranking scenario (Joty et al., 2018). Given a text $T^+$ with a higher coherence degree than that of a text $T^-$, we update the parameters of our model with respect to the loss function $L(\Theta) = \max\{0, \tau - c^+ + c^-\}$, where $c^+$ and $c^-$ are the coherence degrees that our model estimates for $T^+$ and $T^-$, respectively, $\tau$ is the margin, and $\Theta$ denotes all trainable parameters in our model. During training, our model shares all the layers to obtain $c^+$ and $c^-$. Once the model is trained for a task, we use it to score any text independently during evaluation for that task.
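The behaviour of this margin-based ranking loss can be checked with a short Python sketch. The default margin of 5 is taken from the experimental settings; the function name `ranking_loss` and the example scores are hypothetical.

```python
def ranking_loss(c_pos, c_neg, tau=5.0):
    """Pairwise hinge loss: max(0, tau - c_pos + c_neg)."""
    return max(0.0, tau - c_pos + c_neg)


# The loss is zero once the more coherent text leads by at least the margin.
loss_satisfied = ranking_loss(7.0, 1.0)  # gap 6 >= 5
loss_too_close = ranking_loss(2.0, 1.0)  # gap 1 < 5
loss_inverted = ranking_loss(1.0, 2.0)   # wrong order
```

The loss therefore penalizes not only inverted rankings but also correct rankings whose score gap is smaller than the margin.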

Experiments
We evaluate our model for two benchmark tasks for coherence modeling: sentence ordering (SO) and summary coherence rating (SCR). In SO, a text is compared with random permutations of its sentences (Barzilay and Lapata, 2008). A coherence model should ideally rank a text higher than its permutations with respect to coherence. In SCR, we deal with ranking summary texts, where each summary text comes with a coherence rating assigned by human judges (Barzilay and Lapata, 2008). Given a pair of summary texts with different coherence ratings, a coherence model is expected to rank them properly with respect to their coherence ratings.

Datasets
For SO, sections 00-13 of WSJ are used for training and sections 14-24 for testing (Table 1). We randomly select 10% of texts from the training set for development purposes. We compare each of these texts with 20 permutations. For SCR, we use the dataset proposed by Barzilay and Lapata (2008) and used by prior work for coherence evaluation (Guinaudeau and Strube, 2013; Tien Nguyen and Joty, 2017). The dataset comprises texts from the DUC-2003 corpus, which contains English summaries produced by human experts and extractive summarization systems. Seven human annotators rated the coherence of the summaries on a seven-point scale without having seen the source texts. For any summary in this dataset, the average of the seven ratings is taken as its coherence rating. Each data point in this dataset is a pair consisting of two summaries of the same text, where one summary is rated higher than the other. The training set contains 144 pairs, among which 14 pairs are used for development. The test set contains 80 pairs.

Settings
We compare our model (Section 2) with the following coherence models: EntGraph (Guinaudeau and Strube, 2013), Neural EntGrid (Tien Nguyen and Joty, 2017), Lex. Neural EntGrid (Joty et al., 2018), and the model of Moon et al. (2019). For some of these models, we obtain the results on our machines; for the others, we report the results from their papers. We use word2vec (Mikolov et al., 2013) as word embeddings since we aim to compare with Lex. Neural EntGrid in identical settings. Additionally, it keeps the number of parameters in our model low. We leave the study of the impact of different embeddings on the performance of our model for future work. We construct our graphs using grids identical to those used by Neural EntGrid, where all nouns are taken as entity mentions and string matching is used to detect coreferent mentions. The batch size for training and evaluation is 5, $\tau$ is set to 5, and we train our model for up to 5 epochs. The sizes of the word vectors, the BiLSTM, and the RGCN layer are 300, 256, and 512, respectively. We optimize the parameters with Adam, using a learning rate of 0.0001 and L2 regularization. We use only one RGCN layer and one attention head. At each epoch, we evaluate the model on the validation set. We use the model with the best scores on the validation set for evaluation on the test set. We run all experiments on a V100 GPU, where each run of our model takes about 5 hours on average. We use accuracy as the evaluation metric, which corresponds to the number of correct rankings divided by the number of comparisons.

Results and Discussion
Table 2 shows the accuracy of the examined models for the SO and SCR tasks. Overall, our neural graph-based coherence model outperforms the examined baseline coherence models for both tasks. Our model performs substantially better than EntGraph. Similar to EntGraph, we use graphs to represent relations between sentences. However, EntGraph relies merely on entity-based relations to construct graphs and uses a heuristically defined feature (i.e., the average outdegree of nodes in a graph) to estimate the text coherence. Our model performs better because our graphs contain edges that capture the linear order of sentences as well as entity-based relations. Moreover, our model adapts RGCN to extract features for estimating coherence.
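The sentence ordering protocol and the accuracy metric (correct rankings divided by comparisons) can be sketched in Python as follows. The sketch assumes that permutations are sampled uniformly and that the identity permutation is discarded; whether the benchmark data also removes duplicate permutations is not specified here. The function names and the toy scoring function are hypothetical.

```python
import random


def make_so_pairs(sentences, num_permutations=20, seed=0):
    """Pair an original text with random permutations of its sentences."""
    rng = random.Random(seed)
    original = list(sentences)
    pairs = []
    attempts = 0
    # The attempt cap guards against texts too short to permute.
    while len(pairs) < num_permutations and attempts < 100 * num_permutations:
        attempts += 1
        shuffled = original[:]
        rng.shuffle(shuffled)
        if shuffled != original:  # skip the identity permutation
            pairs.append((original, shuffled))
    return pairs


def pairwise_accuracy(score, pairs):
    """Fraction of pairs whose first text gets the strictly higher score."""
    if not pairs:
        raise ValueError("no pairs to evaluate")
    correct = sum(1 for better, worse in pairs if score(better) > score(worse))
    return correct / len(pairs)


def toy_score(text):
    """Toy stand-in for a coherence model: counts in-order neighbours."""
    return sum(1 for a, b in zip(text, text[1:]) if a < b)


text = ["s1", "s2", "s3", "s4", "s5"]
pairs = make_so_pairs(text)
acc = pairwise_accuracy(toy_score, pairs)
```

In the experiments, the trained coherence model takes the place of `toy_score`, and each text of a pair is scored independently before the two scores are compared.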
Our model also outperforms the examined entity grid-based models. The Neural EntGrid and Lex. Neural EntGrid models represent entity relations in a text by entity grids and then apply CNNs to these grids to extract features for modeling the text coherence. In contrast, our model uses graphs to represent relations between sentences and applies RGCN to learn features from these graphs.
Our model slightly outperforms the model proposed by Moon et al. (2019). We note that the best results for their model (M&M) are 92.93 for SO and 83.8 for SCR, achieved with ELMo as word embeddings. We compare with their Word2Vec setting to study the influence of the models rather than that of the word embeddings. Note that the Neural EntGrid score for SCR is its best-performing result, where the model is first pretrained for SO and then fine-tuned on the training set of the SCR dataset. Our model outperforms the Neural EntGrid model even though it is trained for SCR from scratch, i.e., without pretraining. It is worth noting that the test split used for SCR is small (80 text pairs). The improvements achieved by our model translate into 10 and 6 more correct rankings (out of 80) than Neural EntGrid and the model of Moon et al. (2019), respectively. However, such improvements on the SCR dataset are important, as texts in this dataset are associated with human-provided coherence ratings.
Table 3 depicts the accuracy of our model when different edge sets are used to construct graphs. "Ours w/o ent." shows our model trained on graphs with only adjacency edges. "Ours w/o adj." shows our model trained on graphs with only entity edges. We observe that edges with "adj" labels are more predictive signals than entity-based edges for SO. This observation intuitively makes sense, as perturbations may change the order of only adjacent sentences. For SCR, entity-based relations are more predictive. Summary texts are supposed to express information about entities from source documents in a few sentences. Interestingly, by removing edges with "adj" labels, the performance of our model does not decrease for SCR. In sum, our model performs best for both tasks when both edge types are used to construct graphs.

Conclusions
We introduced a neural graph-based model for local coherence assessment. We construct a graph of relations among sentences in a text using entity-based and linear relations between sentences. We apply relational graph convolutional networks to such graphs to extract features encoding coherence. Our model outperforms its counterparts for sentence ordering and summary coherence rating. The high performance of current coherence models on tasks with synthetic data may not be representative of their real-life performance (Mohiuddin et al.). We therefore aim to further study the performance of our model on tasks with natural data.