CoCoLM: Complex Commonsense Enhanced Language Model with Discourse Relations

Large-scale pre-trained language models have demonstrated strong knowledge representation ability. However, recent studies suggest that even though these giant models contain rich simple commonsense knowledge (e.g., a bird can fly and a fish can swim), they often struggle with complex commonsense knowledge that involves multiple eventualities (verb-centric phrases, e.g., identifying the relationship between "Jim yells at Bob" and "Bob is upset"). To address this issue, we propose to help pre-trained language models better incorporate complex commonsense knowledge. Unlike direct fine-tuning approaches, we do not focus on a specific task and instead propose a general language model named CoCoLM. Through careful training over ASER, a large-scale eventuality knowledge graph, we successfully teach pre-trained language models (i.e., BERT and RoBERTa) rich multi-hop commonsense knowledge among eventualities. Experiments on multiple commonsense tasks that require the correct understanding of eventualities demonstrate the effectiveness of CoCoLM.

Figure 1: The overall framework of CoCoLM. On top of base pre-trained language models, complex commonsense knowledge from the eventuality sequences is injected by fine-tuning masked language models and auxiliary discourse tasks.
between "Jim yells at Bob" and relevant eventualities. An important reason is that current language models rely heavily on the token-level masked language model (MLM) loss, which can effectively represent and memorize token-level co-occurrence information but struggles to perceive multi-token concepts. For example, to correctly understand the relation between "Jim yells at Bob" and "Bob is upset", the model needs to perceive these two eventualities as independent semantic units, while current LMs often fail to do so. Even though BERT introduced an extra next sentence prediction task to capture more complex semantics, its effect is not very significant because, different from tokens, entities, or eventualities, the semantics of a sentence is often the combination of multiple semantic units rather than a single one [5].
To address this problem and equip LMs with complex and accurate human knowledge, [4] proposed to regularize the language model with key-value based entity representations from external knowledge graphs. While such an approach has proven effective at merging accurate structured knowledge into language models, it still has two limitations: (1) it is restricted to a fixed set of named entities and cannot easily be generalized to more complex semantic components (i.e., eventualities) because the space of eventualities is enormous; (2) the key-value structure can only encode one-hop knowledge, so the language models still struggle to understand complex knowledge that involves high-order inference. As a result, it can only be used to model entity-based factual knowledge rather than more complex commonsense knowledge.
In this paper, to effectively inject complex commonsense knowledge about eventualities into pre-trained language representation models, we propose a knowledge injection framework, CoCoLM. Through carefully controlled walks over ASER [6], a large-scale eventuality knowledge graph, we collect rich knowledge about the discourse relations among eventualities (e.g., "being hungry" can cause "eat food", and "being hungry" often happens at the same time as "being tired"). After that, we design a special masking strategy to help language representation models view each eventuality as a whole. By doing so, we successfully inject rich eventuality knowledge into pre-trained language representation models. As ASER is automatically extracted from unlabeled corpora, the proposed pipeline is cheap and scalable. Experiments on multiple eventuality-relevant natural language understanding tasks show that the proposed solution improves the performance of all four language models (i.e., BERT-base, BERT-large, RoBERTa-base, and RoBERTa-large). Extensive analyses are conducted to show the effect and contribution of each component of CoCoLM. The main contributions of this paper are as follows:
• We propose CoCoLM, a new contextualized language model enhanced by complex commonsense knowledge from high-order discourse relations. CoCoLM is trained to predict whole eventualities within sequences sampled from a large-scale eventuality KG.
• We introduce two auxiliary discourse tasks that complement the special eventuality masking strategy and help incorporate discourse-related knowledge into pre-trained language models.
• CoCoLM achieves stronger performance than the baseline LMs on multiple datasets that require the understanding of complex commonsense knowledge about eventualities.

Table 2: Examples of sampled eventuality sequences (bracketed connectives mark the discourse relations).

Type | Sequence
Temporal | The police said. [Succession] The father of boy be charged murder yesterday. [Co-occurrence] Eleven people were arrested yesterday in raid.
Causal | They speak. [Condition] They have a interest. [Reason] they come there.
Others | The police said. [Conjunction] He was taken to house. [Contrast] Officer Schwarz held him.

Methods
The overall framework of CoCoLM is presented in Figure 1. Given a pre-trained language model, we inject complex commonsense knowledge about eventualities into the pre-trained model. Specifically, we first generate eventuality sequences based on carefully controlled walks over existing eventuality knowledge graphs, and then use the sequences as the context to help LMs handle eventualities. Besides the original masked language model loss, we also introduce the eventuality-based masking loss function and several auxiliary tasks to assist the training. As this new language model training is not task specific, the resulting language model can be easily applied to any downstream tasks via another task-specific fine-tuning. Details about all steps are as follows.

Eventuality Sequence Generation
As aforementioned, we leverage ASER, which uses eventualities as nodes and discourse relations as edges, as the eventuality knowledge resource. ASER contains rich eventuality knowledge, such as "being hungry" and "being tired" often happen together, and people often "make a call" before they go. Interestingly, beyond single edges, higher-order connections in ASER can also reflect insightful eventuality knowledge. For example, "sleep" and "go" are unlikely to happen at the same time because "sleep" can be caused by "being tired" and there exists a contrast connection between "being tired" and "go". To include such higher-order knowledge in the model, we propose to take the whole graph into consideration rather than just the edges. At the same time, considering the large scale of ASER, it is infeasible to feed the whole graph into a graph model. Motivated by DeepWalk [7], we randomly sample paths to simulate the overall graph structure.
Given the initial knowledge graph G = (E, R), where E is the eventuality set and R is the relation set, we conduct weighted random walks based on the edge weights over G to sample eventuality paths. We denote each path as (E_0, r_0, E_1, r_1, ..., r_{l-1}, E_l), where E denotes an eventuality, r a discourse edge connecting two eventualities, and l the number of eventualities along the sequence. To convert a sampled path into a token list, we keep all words in each eventuality as a sentence and use a representative connective for each discourse relation to connect them. As ASER is automatically extracted from raw documents, it may contain noise. To minimize the influence of this noise and improve the quality of sampled paths, we require the selected paths to fulfill the following requirements:
1. To filter out rare eventualities, the frequency of the starting eventuality has to be larger than five.
2. Other than relations that have the transitive property (e.g., Precedence, Result), a selected path should not contain successive edges with repeated relations.
3. To make sampled sequences more informative, we manually increase the probability of sampling sub-sequence patterns like "E_i Condition E_j Reason E_k", since it has been shown that if-then rules [8] and if-then-because rules [9] are crucial for reasoning.
Implementation details and careful analysis are presented in Section 4, and several examples are shown in Table 2.
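The constrained path sampling described above can be sketched as follows. This is a minimal illustration: the toy graph, node frequencies, and function names are stand-ins for ASER statistics, not the authors' implementation, and the informativeness re-weighting of requirement 3 is omitted.

```python
import random

TRANSITIVE = {"Precedence", "Result"}  # relations allowed to repeat on successive edges

def sample_path(edges, node_freq, start, max_hops=3, min_start_freq=5):
    """edges: {eventuality: [(relation, weight, neighbor), ...]}.
    Returns an alternating [E0, r0, E1, ...] list, or None when the
    starting eventuality is too rare (requirement 1)."""
    if node_freq.get(start, 0) <= min_start_freq:
        return None
    path, prev_rel, node = [start], None, start
    for _ in range(max_hops):
        # requirement 2: drop candidates repeating a non-transitive relation
        cands = [(r, w, n) for r, w, n in edges.get(node, [])
                 if r in TRANSITIVE or r != prev_rel]
        if not cands:
            break
        weights = [w for _, w, _ in cands]
        r, _, n = random.choices(cands, weights=weights)[0]  # edge-weighted choice
        path += [r, n]
        prev_rel, node = r, n
    return path

edges = {
    "i be hungry": [("Result", 10, "i eat food"), ("Synchronous", 4, "i be tired")],
    "i eat food": [("Result", 3, "i be full")],
}
node_freq = {"i be hungry": 20, "i eat food": 8}
print(sample_path(edges, node_freq, "i be hungry"))
```

Each sampled path is then verbalized by joining the eventualities with a representative connective per relation (e.g., "so" for Result).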

Eventuality-Level Mask
The masking strategy plays a crucial role in the training of language representation models. Besides the random token-level masking strategy, many other masking strategies have been explored in previous literature, such as whole-word masking [1,10], named entity masking [11], and text span masking [12]. Similarly, to effectively help the model view each eventuality as an independent semantic unit, we propose the following two masking strategies:
(1) Whole Eventuality Masking: Similar to the whole-word or entity masking strategies, whole eventuality masking aims to reduce the prior biases of eventuality tokens. For example, given an eventuality sequence "I feel sleepy because I drink a cup of [MASK].", BERT would easily predict "coffee" or "tea" because of the prior knowledge of "cup of" inside the eventuality. Instead, masking the whole "I drink a cup of coffee" encourages the model to treat each eventuality as an independent semantic unit and to focus on the relations between eventualities. For each sampled sequence, we randomly mask at most one eventuality to fulfill the masking budget, which is typically 25% of the sequence token length.

Figure 2: Illustration of CoCoLM (complex commonsense pre-training stage). Given an eventuality sequence, it is either masked by the whole eventuality masking (in blue) or the discourse connective masking strategy (in pink). Besides the regular masked language model, the discourse relation labels are jointly predicted for masked connective tokens (on x_4, x_8, and x_12). Co-occurrence prediction (on x_1) is conducted for both masking strategies.
(2) Discourse Connective Masking: Besides masking the eventualities, to effectively encode the discourse information, we also tried masking the discourse connectives.
Examples of the two masking strategies are shown in Figure 2. It is worth mentioning that for each sequence, we randomly select only one type of masking strategy, to guarantee that enough information is kept in the remaining tokens for the prediction. The formal masking strategy is defined as follows. Given a tokenized sampled sequence X = (x_1, x_2, ..., x_n), after randomly masking several tokens, we pass it to a transformer encoder [13] and denote the resulting vector representations as x_1, x_2, ..., x_n. The training loss L_mlm can thus be defined as:

L_mlm = -∑_{i ∈ M} log P(x_i | X_{\M}),

where M is the set of masked tokens following the aforementioned masking strategies.
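As a concrete sketch of the two strategies (one chosen at random per sequence), assuming the token positions of eventualities and connectives are known from the path construction; the span indices, the 50/50 strategy choice, and the budget-matching heuristic are illustrative assumptions:

```python
import random

MASK = "[MASK]"

def mask_sequence(tokens, eventuality_spans, connective_idxs, budget=0.25):
    """Apply exactly one masking strategy to a tokenized sequence.
    eventuality_spans: (start, end) token-index pairs for each eventuality;
    connective_idxs: positions of discourse connective tokens."""
    out = list(tokens)
    if random.random() < 0.5:
        # (1) whole eventuality masking: mask one eventuality, here the one
        # whose length is closest to the ~25% masking budget
        target = budget * len(tokens)
        s, e = min(eventuality_spans, key=lambda se: abs((se[1] - se[0]) - target))
        out[s:e] = [MASK] * (e - s)
    else:
        # (2) discourse connective masking
        for i in connective_idxs:
            out[i] = MASK
    return out

tokens = "i feel sleepy because i drink a cup of coffee".split()
print(mask_sequence(tokens, eventuality_spans=[(0, 3), (4, 10)], connective_idxs=[3]))
```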

Auxiliary Tasks
A limitation of the MLM loss is that the prediction is over the entire vocabulary, and as a result the model cannot effectively learn the connection between eventualities and connective words. To remedy this and force the model to learn discourse knowledge, we add a classification layer after the last layer of the transformer encoder, which feeds the output vector x_i of each connective token x_i into a softmax over the set of discourse relation labels:

l̂_i = softmax(W x_i + b),  L_rel = -∑_{i ∈ M_R} log l̂_i[l_i],

where M_R is the index set of masked discourse connective tokens (e.g., because, and, so) in Figure 2, l̂_i is the predicted discourse relation distribution, l_i is the label provided by ASER, and W and b are trainable parameters.
Besides the aforementioned discourse relations, ASER also provides Co-Occurrence relations between eventualities, which indicate that two eventualities appear in the same sentence but without an explicit discourse relation between them. Even though the Co-Occurrence relations are less informative than discourse relations, we still think they reflect rich knowledge about eventualities. Motivated by this, we propose another auxiliary task to help the model learn such knowledge. Specifically, given an eventuality sequence S = (E_0, r_0, E_1, r_1, ..., r_{l-1}, E_l) and an eventuality E_c, we format the input 2 as "[CLS] S [SEP] E_c [SEP]". We set E_c to be a positive co-occurring eventuality 50% of the time and a negative one otherwise. On top of the [CLS] representation x_cls, we add another classification layer to predict whether the Co-Occurrence relation holds. The training objective L_occur for this binary classification is similar to L_rel:

l̂ = softmax(W' x_cls + b'),  L_occur = -∑ log l̂[l],

where l is the true co-occurrence label (positive or negative) for the sequence.
Merging all three losses together, we can then define the overall loss function L as:

L = L_mlm + L_rel + L_occur.
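The three objectives can be combined as sketched below with toy logits; the shapes, label values, and the unweighted sum are assumptions consistent with the description above, not the exact training code.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean negative log-likelihood over the given positions."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
vocab, n_rel = 100, 9                      # toy vocabulary / relation-label sizes
mlm_logits = rng.normal(size=(4, vocab))   # 4 masked token positions   -> L_mlm
rel_logits = rng.normal(size=(2, n_rel))   # 2 masked connectives       -> L_rel
occ_logits = rng.normal(size=(1, 2))       # [CLS] co-occurrence head   -> L_occur

loss = (cross_entropy(mlm_logits, np.array([5, 17, 3, 99]))
        + cross_entropy(rel_logits, np.array([1, 4]))
        + cross_entropy(occ_logits, np.array([1])))
print(float(loss))
```

In practice each term is computed by a separate head sharing the same transformer encoder, and the summed loss is backpropagated jointly.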

Implementation Details
In this work, we use the released ASER-core version 3, extracted from multi-domain, multi-source corpora, which contains over 27.6 million eventualities and 8.8 million relations. We follow the heuristic rules in Section 2.1 to sample eventuality sequences for pre-training. Overall we generated 4,901,511 eventuality sequences ranging from one to five hops, where a one-hop sequence corresponds to a direct (first-order) edge in ASER. We also discard edges with uninformative relation types such as Co-Occurrence (except those used for the auxiliary tasks) and down-sample eventuality nodes with extremely high frequency, such as "I see". The sequence distribution over different lengths is shown in Figure 3.
We select BERT-base, BERT-large [1], RoBERTa-base, and RoBERTa-large [2] as the base language models. All models are implemented with the Huggingface library [14]. For the continual pre-training phase, we use the Adam optimizer for 10 epochs with batch size 128, learning rate 1e-5, and weight decay 0.01. Considering the relatively longer span of masked eventualities, we enlarge the masking proportion from 15% to 25%.

Experiments
In this section, we conduct experiments on three widely used downstream tasks that require the correct understanding of complex commonsense knowledge about events:
1. ROCStories [16] is widely used for story comprehension tasks such as the Story Cloze Test. It contains 98,162 five-sentence coherent stories as the unlabeled training dataset, and 1,872 four-sentence story contexts along with two candidate ending sentences in the development and test datasets. We follow the dataset split for the story ending prediction task in [17].
2. MATRES is an event temporal relation extraction task, which requires models to predict the temporal order between event pairs in context.
3. COPA [19] is a binary-choice commonsense causal reasoning task, which requires models to predict which candidate hypothesis is the plausible effect/cause of the given premise. We follow the training/dev/test split in SuperGLUE [20].
Statistics and examples of the three selected datasets are presented in Table 3. We implement the experiments with Huggingface [14] and run them on eight Nvidia V100 32GB GPUs. For all experiments, we set the learning rate to 1e-5 and maximize the sequence length and batch size subject to GPU memory.
For the ROCStories task, as mentioned in [21], there is a strong bias in the human-created negative endings, such that a model can distinguish the positive and negative endings without seeing the first four events. Even though [21] tried to filter the annotations, the bias cannot be fully removed. As a result, to clearly show the effect of adding complex knowledge about events into the LM, besides the most widely used supervised setting, we also report performance in a debiased setting, where the model randomly selects endings from other stories as the negative endings during training. The debiased setting is indicated with "D". Following previous works, we report accuracy for the ROCStories, MATRES, and COPA tasks. For MATRES, we also report F1 by treating the task as general relation extraction and the label vague as no relation [22]. All models are trained until convergence on the training set, and the best model on the dev set is selected to be evaluated 4.
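The debiased ("D") training setting above can be sketched as follows; the data format and function name are illustrative, not the authors' preprocessing code.

```python
import random

def debias(stories, seed=0):
    """stories: list of (context, true_ending) pairs. For each story, draw
    the negative ending from a *different* story instead of using the
    human-written (biased) negative ending."""
    rng = random.Random(seed)
    triples = []
    for i, (ctx, pos) in enumerate(stories):
        j = rng.randrange(len(stories) - 1)
        if j >= i:            # skip the story's own index
            j += 1
        triples.append((ctx, pos, stories[j][1]))
    return triples

stories = [
    ("Jim yelled at Bob.", "Bob was upset."),
    ("Anna baked a cake.", "The kitchen smelled sweet."),
    ("Tom missed the bus.", "He walked to work."),
]
for ctx, pos, neg in debias(stories):
    print(ctx, "| +", pos, "| -", neg)
```

Because the random negatives carry none of the stylistic cues of human-written negatives, the model can no longer exploit ending-only bias.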

Experimental Results
The experimental results are presented in Table 4, from which we can see that CoCoLM consistently outperforms all the baselines on all three commonsense tasks, especially in the de-biased setting of ROCStories. Beyond that, we can make the following observations:
1. For the ROCStories dataset, compared with the original supervised setting, the de-biased setting is more challenging for all models, which helps verify our assumption that previous models benefit from the bias.
2. The improvement of our model is more significant on ROCStories and COPA than on MATRES, mainly because events in MATRES are associated with context, and many temporal relations have to be inferred from the local context rather than memorized by the language model. In comparison, both ROCStories and COPA provide no extra context, and thus require the pre-trained LM to possess the essential knowledge to solve the problems.
In the rest of this section, we conduct extensive experiments and case studies to demonstrate the contribution of different components. For efficiency, in all analysis experiments we use BERT-large as the base language model and ROCStories as the evaluation dataset.

Table 4: Evaluation results on all three commonsense tasks. The best performance is highlighted in bold. Note that our proposed CoCoLM with RoBERTa-large achieves the best scores on all the tasks.

Ablation Study
From the results in Table 5, we can see that all components contribute to the final success of our model, especially the Co-Occurrence relation. This result supports our earlier assumption that even though the Co-Occurrence relation has relatively weaker semantics compared with other discourse relations (e.g., Before and Cause), it can still help models better understand events due to its large scale.

Effect of Different Event Knowledge Resources
Besides ASER [6], another important event knowledge resource is ATOMIC [8], a crowdsourced commonsense knowledge graph that contains nine human-defined relations between daily events. As ATOMIC is a bipartite graph in which multi-hop paths cannot be found, we did not select it as the event knowledge resource for our final model. In this experiment, we report the performance of using ATOMIC as the event knowledge resource. As ATOMIC does not contain multi-hop knowledge, we also add ASER (single-hop), which only uses the single-hop knowledge in ASER, as another baseline.
The overall results are shown in Table 6, from which we can see that there is still a notable gap between ATOMIC and our final model. At the same time, ATOMIC outperforms the single-hop version of ASER. Based on these two observations, we draw the following conclusions: (1) compared with ASER, ATOMIC is cleaner because it is created with human annotations; (2) multi-hop knowledge is crucial for LMs to understand events. We leave combining ATOMIC and ASER to obtain more high-quality multi-hop event knowledge as future work.

Case Study
We present the case study in Table 7 and use a probing approach to further investigate the reason for our success. Similar to [3], we put a "[MASK]" token between two events and ask the model to predict the connective. Taking the case from the COPA dataset as an example, the connectives predicted by CoCoLM clearly show the effect relation between the two events, whereas predictions from the baseline models reveal weaker (temporal, conjunction) or wrong (contrast) relations. Similar observations can be drawn from the other two datasets. These observations show that, compared with the original LMs, CoCoLM manages to memorize richer event knowledge.
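The probing protocol can be sketched as follows. The connective-to-relation mapping is a small illustrative subset, and the stub callable stands in for a real fill-mask model (e.g., a wrapper around a Huggingface fill-mask pipeline; its interface here is an assumption).

```python
# Map a predicted connective to the discourse relation it typically signals
# (illustrative subset; not an exhaustive mapping).
CONNECTIVE_TO_RELATION = {
    "so": "Result", "because": "Reason", "if": "Condition",
    "then": "Succession", "and": "Conjunction", "but": "Contrast",
    "while": "Synchronous",
}

def probe(fill_mask, event_a, event_b, top_k=3):
    """fill_mask: callable(text) -> ranked candidate connectives for [MASK]."""
    text = f"{event_a} [MASK] {event_b}"
    preds = fill_mask(text)[:top_k]
    return [(p, CONNECTIVE_TO_RELATION.get(p, "unknown")) for p in preds]

# Stub model standing in for a real LM, for illustration only:
stub = lambda text: ["so", "and", "but"]
print(probe(stub, "The man broke his toe", "he dropped a hammer on his foot"))
# → [('so', 'Result'), ('and', 'Conjunction'), ('but', 'Contrast')]
```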

Related Work
Understanding Events. Representing and learning commonsense knowledge is important for deeply understanding the causality and correlation between events. Recently, various tasks requiring multi-dimensional event knowledge have been proposed, such as story ending prediction [16], event temporal ordering prediction [23], and event causal reasoning [19]. Prior studies have incorporated external commonsense knowledge from ConceptNet [24] and ATOMIC [8] for event representation [25], story generation [26], and event KG completion [27]. However, their event-level knowledge is sparse and incomplete due to human-annotated acquisition, which limits model capacity, especially when injected into pre-trained LMs. [6] builds a large-scale eventuality knowledge graph, ASER, by specifying eventuality relations mined from discourse connectives. It explicitly provides structural high-order discourse information between events, spanning temporal, causal, and co-occurrence relations, which has been proven transferable to human-defined commonsense [28]. In this work, we aim at making full use of the multi-dimensional high-order event knowledge in ASER to help pre-trained LMs understand events.
Injecting Knowledge into LMs. Though [3] shows that pre-trained LMs store factual relational knowledge without fine-tuning, LMs still cannot handle knowledge-intensive tasks such as open-domain question answering or commonsense reasoning. Previous works explore different ways to inject various kinds of knowledge into pre-trained LMs to improve downstream task performance. They mainly differ in knowledge resources, masking strategies, and training objectives. From the perspective of knowledge resources, entity-centric knowledge graphs are infused into LMs in the form of linked entities [29,30,31] or triplets [32,33,34]. Besides that, linguistic knowledge (e.g., synonym/hypernym relations [35], word-supersense knowledge [36], dependency parsing [34], and constituent parsing [37]) also plays a critical role in improving pre-trained LMs. Last but not least, domain-specific knowledge is also used to improve relevant tasks, such as mined sentiment words [38], event temporal patterns [39], and numerical reasoning data [40]. In this work, we aim at injecting complex commonsense into pre-trained LMs, with two significant differences from previous works: (1) we use events rather than tokens as the semantic unit, and propose an event-based masking strategy as well as two auxiliary tasks to help LMs understand events; (2) we are the first to leverage a random walk process on a large-scale knowledge graph to include multi-hop knowledge.

Conclusion
In this work, we aim at helping pre-trained language models understand complex commonsense knowledge about events. Specifically, we first conduct random walks over a large-scale eventuality-based knowledge graph to collect multi-hop event knowledge, and then inject the knowledge into pre-trained LMs with an event-based masking strategy as well as two auxiliary tasks. Experiments on three downstream tasks and extensive analysis demonstrate the effectiveness of the proposed model. As our approach is a general solution, we believe it can also be helpful for other tasks that require complex commonsense knowledge about events. Both the code and the pre-trained CoCoLM model are released.