Uncertain Local-to-Global Networks for Document-Level Event Factuality Identification

Event factuality indicates the degree of certainty about whether an event occurs in the real world. Existing studies mainly identify event factuality at the sentence level, which easily leads to conflicts between different mentions of the same event. To address this issue, we study document-level event factuality identification, which determines event factuality from the view of the entire document. For this task, we need to consider two important characteristics, Local Uncertainty and Global Structure, which can be utilized to improve performance. In this paper, we propose an Uncertain Local-to-Global Network (ULGN) to make use of these two characteristics. Specifically, we devise a Local Uncertainty Estimation module to model the uncertainty of local information, and an Uncertain Information Aggregation module that leverages the global structure to integrate the local information. Experimental results demonstrate the effectiveness of the proposed method, which outperforms the previous state-of-the-art model by 8.4% and 11.45% in F1 score on two widely used datasets.


Introduction
Event factuality refers to the degree of certainty about whether events actually occur in the real world. Generally, event factuality can be classified into five categories (Saurí, 2008): Certain Positive (certainly happening, denoted as CT+), Certain Negative (certainly not happening, CT-), Possible Positive (possibly happening, PS+), Possible Negative (possibly not happening, PS-) and Underspecified (the event's factuality cannot be identified, Uu). For example, in the sentence "An economist thinks that the tax rate probably increases soon", the event "increases" may happen. Therefore, an event factuality identification (EFI) model should

Event: United States reaches an agreement with Mexico
Text: According to Politico.com, the United States probably reaches (PS+) an agreement with Mexico on the new trade deal before December, 2017. [S1] However, Mexican Economy Minister Ildefonso Guajardo denied that they plan to reach (CT-) any agreement with the U.S. on the trade deal talks. [S2] A journalist agreed with the view, saying the two sides may reach (PS+) an agreement within hours. [S3] The government has not been informed that any agreement will be reached (CT-) yet, said another two Mexican officials. [S4] Some media speculate that they will possibly reach (PS+) an agreement. … [S8]
Document-level Event Factuality: CT-

Figure 1: An example document with both sentence- and document-level event factuality. The factuality at sentence level and document level may be different.
be able to predict that the factuality of the event is PS+. EFI is an important task in natural language processing (NLP) and is beneficial for a wide range of NLP applications, such as rumor detection (Qazvinian et al., 2011), sentiment analysis (Klenner and Clematide, 2016) and machine reading comprehension (Richardson et al., 2013).
Existing EFI studies mainly focus on sentence-level EFI, i.e., judging event factuality based on the individual sentence in which the event is located. In recent years, various neural models have been proposed for sentence-level EFI and achieve state-of-the-art performance (Rudinger et al., 2018; Qian et al., 2018; Veyseh et al., 2019). Despite these successful efforts, sentence-level EFI suffers from an inevitable restriction in practice: it easily leads to conflicts between different mentions of the same event. Take Figure 1 as an example: the "reach" event is mentioned multiple times in the document and has different factuality values in different sentences. The factuality of the event "reach" in S2 is PS+ according to the speculative word "may", while in S3 its factuality is CT- due to the negative word "denied". According to our statistics on the English and Chinese event factuality datasets (Qian et al., 2019), 25.7% (English) and 37.8% (Chinese) of instances exhibit sentence-level factuality conflicts for the same event, which is not negligible. Fortunately, event factuality can be uniquely determined from the perspective of a document, which naturally resolves sentence-level inconsistency. Therefore, it is necessary to move EFI forward from the sentence level to the document level.

However, identifying document-level event factuality is non-trivial. As shown in Figure 1, document-level and sentence-level factuality may be quite different. In this scenario, document-level event factuality cannot be deduced from each sentence-level factuality separately; it depends on the comprehensive semantic information of the entire document. To this end, we first learn the local information, and then integrate the local representations into a global representation for prediction.
In this process, we need to consider two important characteristics: Local Uncertainty and Global Structure, which can be leveraged to improve performance. In the following, we will introduce the two characteristics and give the reasons why they are critical for document-level EFI.
Local Uncertainty: As illustrated in Figure 1, different sentences (i.e., local information) describe different cognitive individuals' judgements towards the event factuality. However, the degree of uncertainty of these judgements is different. For example, as direct participants in the "reach" event, Mexican officials (in S4) can judge the event factuality with lower uncertainty (i.e., higher confidence) than other cognitive individuals (e.g., a journalist in S2). Apparently, the information of S4 is more important than that of S2 when predicting the document-level event factuality. It would be better if we could explicitly model the uncertainty of local information. Therefore, the first challenging problem is how to model the uncertainty of local information.
Global Structure: When integrating local information, utilizing the global structure (i.e., document structure) could yield a better global representation. The global structure is manifested in two aspects: positional structure and semantic structure. For positional structure, as shown in Figure 1, the content of the document is roughly organized in chronological order, which can reflect the evolution of events. For semantic structure, there are semantic correlations between pieces of local information. For instance, the content of S2 supports the view about the event factuality in S1, while the content of S3 denies it. There is no doubt that capturing the global structure enables a better understanding of documents. Thus, the second challenging problem is how to leverage the document structure for integrating local information.
In this paper, we propose a novel method termed the Uncertain Local-to-Global Network (ULGN) to address the aforementioned problems. Specifically, to model the uncertainty of local information, we propose a Local Uncertainty Estimation module. It represents the local information with a probability distribution rather than a deterministic feature vector. For ease of modeling, we adopt Gaussian distributions: the local information is parameterized by a mean and a variance. The former acts like the feature vector in a conventional model, whereas the latter measures the feature uncertainty; the higher the uncertainty of the local information, the larger its corresponding variance. To leverage the global structure for synthesizing local information, we devise an Uncertain Information Aggregation module. The module first constructs a global graph based on the document structure, and then employs an uncertain graph convolution layer to aggregate the local information, accounting for the uncertainty of local information via variance-based attention. Experimental results on two widely used datasets demonstrate that our method substantially outperforms previous state-of-the-art models.
Overall, the main contributions of this work can be summarized as follows:
• We propose a novel Uncertain Local-to-Global Network (ULGN) for document-level event factuality identification. To the best of our knowledge, we are the first to consider local uncertainty and global structure for this task.
• To model the uncertainty of local information, we propose a local uncertainty estimation module. To leverage the global structure for integrating local information, we devise an uncertain information aggregation module.
• Experimental results demonstrate the effectiveness of our method, which achieves 8.4% and 11.45% improvements of F1 score on two widely used datasets. The source code of this paper is available at https://github.com/CPF-NLPR/ULGN4DocEFI.

Methodology
We propose an uncertain local-to-global network (ULGN) for document-level EFI. Figure 2 schematically visualizes our approach, which consists of three major components: (1) Local Uncertainty Estimation (§2.1), which represents the local information by using a probability distribution; (2) Uncertain Information Aggregation (§2.2), which leverages the global structure to integrate the local information; (3) Reparameterization for Prediction (§2.3), which utilizes the reparameterization trick (Kingma and Welling, 2013) to obtain the global representation for the final prediction. We illustrate each component in detail below.

Local Uncertainty Estimation
We treat sentences and event mentions as the local information. Our local context encoder is based on the Transformer architecture (Vaswani et al., 2017). We adopt BERT (Devlin et al., 2019) to encode the local information, which has achieved state-of-the-art performance on the EFI task (Veyseh et al., 2019). The local context encoder takes each sentence of a document as input:

f_{s_i} = \mathrm{BERT}(S_i), \quad i = 1, 2, \dots, N_s,   (1)

where S_i denotes the i-th sentence and N_s is the number of sentences in the document. We use the [CLS] token representation of the last BERT layer as the sentence representation f_{s_i}. The representation of the event mention e_i (i = 1, 2, \dots, N_e, where N_e is the number of times the event is mentioned) is obtained by averaging the representations of its contained words, and is denoted as f_{e_i}.

After obtaining the feature vector of the local information, we need to estimate its uncertainty. To this end, we use a Gaussian distribution N(\mu_i, \sigma_i^2) to represent the local information, instead of a deterministic feature vector. The mean vector \mu_i and variance \sigma_i^2 are computed as:

\mu_i = W_\mu f_i, \qquad \sigma_i^2 = \exp(W_\sigma f_i),   (2)

where f_i denotes the original representation of the sentence or event mention (i.e., f_{s_i} or f_{e_i}), and W_\mu and W_\sigma are trainable parameters; the exponential keeps the variance positive.
In this way, each piece of local information is represented by a Gaussian distribution N(\mu_i, \sigma_i^2), which not only gives the contextualized representation of the local information (i.e., the mean vector), but also estimates its uncertainty (i.e., the variance).
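As a concrete sketch, the mapping from a deterministic feature vector to a diagonal Gaussian can be written in a few lines of plain Python. The weight names and the exponential used to keep the variance positive are illustrative assumptions, not the paper's released code:

```python
import math

def local_uncertainty(f, W_mu, W_sigma):
    """Map a feature vector f to a Gaussian N(mu, sigma^2).

    W_mu, W_sigma: weight matrices given as lists of rows (hypothetical
    toy parameters). The exp keeps each variance entry positive.
    """
    mu = [sum(w * x for w, x in zip(row, f)) for row in W_mu]
    sigma2 = [math.exp(sum(w * x for w, x in zip(row, f))) for row in W_sigma]
    return mu, sigma2

# Toy example: a 3-dim feature mapped to a 2-dim Gaussian.
f = [0.5, -1.0, 2.0]
W_mu = [[0.1, 0.2, 0.0], [0.0, -0.1, 0.3]]
W_sigma = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]]
mu, sigma2 = local_uncertainty(f, W_mu, W_sigma)
```

A row of zeros in `W_sigma` yields a unit variance (exp(0) = 1), i.e. a "neutral" uncertainty estimate for that dimension.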

Global Graph Construction
To leverage the global structure for integrating the local information, we first construct a Global Graph. As shown in Figure 2, the graph has three kinds of nodes: mention nodes, sentence nodes and a document node. The mention nodes and sentence nodes provide the local information for prediction, while the single document node aims to capture the information of the entire document. According to the document structure, we define the following five types of edges:
• Adjacent sentence edge: We connect each sentence node with its previous and next sentence nodes.
• Document-sentence edge: We connect the document node with all sentence nodes.
• Document-mention edge: All event mention nodes are connected to the document node.
• Sentence-mention edge: The mention node is connected to its corresponding sentence node.
• Mention coreference edge: Mentions referring to the same event are fully connected.
With the above connections, the positional structure can be modeled via the adjacent sentence edge. Besides, the document node could serve as a pivot to interact with other nodes and thus reduce the long distance among them in the document. Any two local nodes (i.e., mention nodes and sentence nodes) that are not directly connected can pass information to each other through the document node. Thus, the above connections can also model the semantic structure.
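The five edge types above can be sketched as an adjacency-matrix builder. The node ordering (sentences first, then mentions, then the document node) and the helper name are our own illustrative choices:

```python
def build_global_graph(n_sent, mention_sent):
    """Build the adjacency matrix of the global graph.

    Nodes: sentences 0..n_sent-1, then the mentions of one event,
    then a single document node (last index).
    mention_sent[k] = index of the sentence containing mention k.
    """
    n_ment = len(mention_sent)
    n = n_sent + n_ment + 1
    doc = n - 1
    A = [[0] * n for _ in range(n)]

    def connect(i, j):
        A[i][j] = A[j][i] = 1

    for s in range(n_sent - 1):          # adjacent sentence edges
        connect(s, s + 1)
    for s in range(n_sent):              # document-sentence edges
        connect(doc, s)
    for k, s in enumerate(mention_sent):
        m = n_sent + k
        connect(doc, m)                  # document-mention edge
        connect(m, s)                    # sentence-mention edge
    for k1 in range(n_ment):             # mention coreference edges
        for k2 in range(k1 + 1, n_ment):
            connect(n_sent + k1, n_sent + k2)
    return A

# 3 sentences; the event is mentioned in sentences 0 and 2.
A = build_global_graph(3, [0, 2])
```

In this toy graph the document node (index 5) is adjacent to every other node, which is exactly the 2-hop pivot property discussed above.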

Uncertain Graph Convolution Layer
After constructing the global graph, we aggregate the local information over it. For conventional graph convolutional networks (GCNs) (Kipf and Welling, 2017), the (l+1)-th convolution layer is defined as:

h_i^{(l+1)} = \rho\Big(\sum_{j \in ne(i) \cup \{i\}} \frac{\tilde A_{ij}}{\sqrt{\tilde D_{ii}\tilde D_{jj}}}\, h_j^{(l)} W^{(l)}\Big),   (3)

or in the equivalent matrix form:

H^{(l+1)} = \rho\big(\tilde D^{-1/2} \tilde A \tilde D^{-1/2} H^{(l)} W^{(l)}\big),   (4)

where \tilde A = A + I, A denotes the adjacency matrix of the global graph, and I is the identity matrix. \tilde D_{ii} = \sum_j \tilde A_{ij}, \rho is an activation function (e.g., ReLU), and ne(i) denotes the neighbors of node i.
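For reference, the conventional GCN update described above can be sketched with plain Python lists. This is a toy sketch for clarity, not an efficient implementation:

```python
import math

def gcn_layer(A, H, W):
    """One GCN layer: ReLU(D^-1/2 (A+I) D^-1/2 H W) (Kipf & Welling, 2017).

    A: adjacency matrix (lists of 0/1), H: node features, W: weight matrix.
    """
    n = len(A)
    # add self-loops: A_hat = A + I
    A_hat = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_hat]
    # symmetric normalization: A_hat[i][j] / sqrt(deg_i * deg_j)
    norm = [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    d_in, d_out = len(W), len(W[0])
    HW = [[sum(H[i][k] * W[k][j] for k in range(d_in)) for j in range(d_out)]
          for i in range(n)]
    # aggregate neighbors and apply ReLU
    return [[max(0.0, sum(norm[i][t] * HW[t][j] for t in range(n)))
             for j in range(d_out)] for i in range(n)]

# Two connected nodes with scalar features 1.0 and 3.0, identity weight.
out = gcn_layer([[0, 1], [1, 0]], [[1.0], [3.0]], [[1.0]])
```

With self-loops and symmetric normalization, both nodes end up with the average of the two features (2.0), illustrating the smoothing behavior of graph convolution.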
Since the local information is parameterized by a probability distribution, existing graph convolutions are no longer applicable. Inspired by prior work, we utilize an uncertain graph convolution layer (UGCL) to perform convolution operations between Gaussian distributions. Let h_i^{(l)} \sim N(\mu_i^{(l)}, \sigma_i^{2(l)}) denote the hidden representation of node i in the l-th layer, and use M^{(l)} = [\mu_1^{(l)}, \dots, \mu_{N_n}^{(l)}] and \Sigma^{(l)} = [\sigma_1^{2(l)}, \dots, \sigma_{N_n}^{2(l)}] to denote the matrices of means and variances of all nodes, respectively, where N_n is the number of nodes in the global graph (i.e., N_n = N_s + N_e + 1).
According to the additivity of the Gaussian distribution (LeCam, 1965), and assuming all hidden representations of nodes are independent, we can aggregate the neighbors of node i as follows:

\hat\mu_i^{(l)} = \sum_{j \in ne(i)} \mu_j^{(l)}, \qquad \hat\sigma_i^{2(l)} = \sum_{j \in ne(i)} \sigma_j^{2(l)}.   (5)

Due to the different importance of the local information, we propose a variance-based attention mechanism to assign different weights to neighbors. Intuitively, a smaller variance means that the node is more important. Specifically, we use a smooth exponential function to control the effect of variances on the weights:

\alpha_i^{(l)} = \exp(-\gamma \sigma_i^{2(l)}),   (6)

where \alpha_i^{(l)} are the attention weights of node i in the l-th layer and \gamma is a hyper-parameter. Considering the variance-based attention, Eq. (5) can be modified as follows:

\hat\mu_i^{(l)} = \sum_{j \in ne(i)} \mu_j^{(l)} \odot \alpha_j^{(l)}, \qquad \hat\sigma_i^{2(l)} = \sum_{j \in ne(i)} \sigma_j^{2(l)} \odot \alpha_j^{(l)} \odot \alpha_j^{(l)},   (7)

where \odot denotes the element-wise product. To better integrate the local information, the attention weights are applied to each dimension separately. Similar to Eq. (3), we need to apply learnable filters and non-linear activation functions to the aggregated representations. Since the hidden representations are Gaussian distributions, we directly impose layer-specific parameters and non-linear activation functions on the means and variances, respectively. Therefore, the uncertain graph convolution can be defined as follows:

\mu_i^{(l+1)} = \rho\Big(\sum_{j \in ne(i) \cup \{i\}} \frac{\tilde A_{ij}}{\sqrt{\tilde D_{ii}\tilde D_{jj}}}\, \big(\mu_j^{(l)} \odot \alpha_j^{(l)}\big) W_\mu^{(l)}\Big), \qquad \sigma_i^{2(l+1)} = \rho\Big(\sum_{j \in ne(i) \cup \{i\}} \frac{\tilde A_{ij}}{\tilde D_{ii}\tilde D_{jj}}\, \big(\sigma_j^{2(l)} \odot \alpha_j^{(l)} \odot \alpha_j^{(l)}\big) W_\sigma^{(l)}\Big),   (8)

or equivalently in the matrix form:

M^{(l+1)} = \rho\big(\tilde D^{-1/2} \tilde A \tilde D^{-1/2} (M^{(l)} \odot A^{(l)}) W_\mu^{(l)}\big), \qquad \Sigma^{(l+1)} = \rho\big(\tilde D^{-1} \tilde A \tilde D^{-1} (\Sigma^{(l)} \odot A^{(l)} \odot A^{(l)}) W_\sigma^{(l)}\big),   (9)

where A^{(l)} = \exp(-\gamma \Sigma^{(l)}). M^{(0)} and \Sigma^{(0)} are computed via Eq. (2).
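A minimal sketch of the variance-based attention and Gaussian neighbor aggregation, assuming diagonal covariances stored as lists. Scaling the means by α and the variances by α² follows Var(aX) = a²Var(X); the learnable filters and activation of the full layer are omitted:

```python
import math

def variance_attention(sigma2, gamma=1.0):
    """alpha = exp(-gamma * sigma^2): low variance -> high weight."""
    return [math.exp(-gamma * v) for v in sigma2]

def aggregate(neighbors, gamma=1.0):
    """Aggregate neighbor Gaussians (mu, sigma2) with variance-based attention.

    neighbors: list of (mu, sigma2) pairs, each a list per dimension.
    Sums of independent Gaussians: means add, variances add (after scaling).
    """
    dim = len(neighbors[0][0])
    mu_out = [0.0] * dim
    sigma2_out = [0.0] * dim
    for mu, sigma2 in neighbors:
        alpha = variance_attention(sigma2, gamma)
        for d in range(dim):
            mu_out[d] += alpha[d] * mu[d]
            sigma2_out[d] += alpha[d] ** 2 * sigma2[d]
    return mu_out, sigma2_out

# A certain neighbor (variance 0) gets full weight; a noisier one is damped.
neighbors = [([1.0], [0.0]), ([2.0], [math.log(2.0)])]
mu_out, sigma2_out = aggregate(neighbors)
```

Here the second neighbor has variance ln 2, so its weight is exp(−ln 2) = 0.5 and its contribution to the aggregated mean is halved.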

Reparameterization for Prediction
We use the representation of the document node for prediction. Since the representation of the document node is a Gaussian distribution, we first adopt a sampling process in the last graph layer:

z \sim N(\mu_d^{(L)}, \sigma_d^{2(L)}),   (10)

where N(\mu_d^{(L)}, \sigma_d^{2(L)}) denotes the representation of the document node in the last layer L. However, directly sampling z prevents gradients from propagating back to the preceding layers. Thus, we use the reparameterization trick (Kingma and Welling, 2013) to bypass the problem. Specifically, we first sample a random noise \epsilon from the standard Gaussian distribution, and then generate z as the equivalent sampled representation:

\epsilon \sim N(0, I), \qquad z = \mu_d^{(L)} + \epsilon \odot \sigma_d^{(L)}.   (11)

After obtaining z, we feed it into a softmax function for prediction:

\hat y = \mathrm{softmax}(W_o z + b_o),   (12)

where W_o and b_o are trainable parameters. For training, we use the cross-entropy loss to optimize the model parameters:

L_{pred} = -\sum_{i=1}^{N} y_i \log \hat y_i,   (13)

where N is the number of training instances and y_i is the label of the i-th instance.
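The sampling step can be sketched as follows. With the reparameterization trick the noise is drawn separately from a standard Gaussian, so the sample is a deterministic, differentiable function of μ and σ:

```python
import random

def reparameterize(mu, sigma2, rng=random):
    """z = mu + eps * sigma with eps ~ N(0, I) (Kingma & Welling, 2013).

    mu, sigma2: per-dimension mean and variance of a diagonal Gaussian.
    Because mu and sigma enter only through arithmetic, gradients can
    flow through them in an autodiff framework.
    """
    return [m + rng.gauss(0.0, 1.0) * (s2 ** 0.5) for m, s2 in zip(mu, sigma2)]

# With zero variance the sample collapses to the mean exactly.
z = reparameterize([1.0, 2.0], [0.0, 0.0])
```

In a real model this would be done on tensors (e.g., a framework's reparameterized sampling), but the arithmetic is the same.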
In addition, to ensure that the learned representations are indeed Gaussian distributions, we devise an explicit regularization loss to constrain the input representations of the first layer:

L_{reg} = \sum_i \sum_j KL\big(N(\mu_{ij}, \sigma_{ij}^2)\,\|\,N(0, I)\big),   (14)

where N(\mu_{ij}, \sigma_{ij}^2) is the initialized Gaussian distribution of the j-th node of the i-th instance, and KL(\cdot\|\cdot) is the KL-divergence between two distributions. Since deeper layers are naturally Gaussian distributions under the proposed UGCL, we only need to regularize M^{(0)} and \Sigma^{(0)}. We reach the final loss function by combining the above terms:

L = L_{pred} + \beta L_{reg},   (15)

where \beta is a hyper-parameter that controls the impact of L_{reg}.
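The KL term against a standard normal has a familiar closed form for diagonal Gaussians, sketched here:

```python
import math

def kl_to_standard_normal(mu, sigma2):
    """KL( N(mu, diag(sigma2)) || N(0, I) )
       = 0.5 * sum(sigma2 + mu^2 - 1 - log(sigma2)) over dimensions."""
    return 0.5 * sum(s2 + m * m - 1.0 - math.log(s2)
                     for m, s2 in zip(mu, sigma2))
```

The divergence is zero exactly when the distribution is already standard normal, and grows as the mean drifts from zero or the variance departs from one, which is what pulls the first-layer representations toward well-behaved Gaussians.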

Datasets and Evaluation Metrics
We evaluate our proposed method on two widely used datasets, the English and Chinese event factuality datasets (Qian et al., 2019). The number of English and Chinese documents is 1,730 and 4,650, respectively. PS- and Uu documents account for only 1.39% and 1.20% of the English and Chinese datasets, respectively. Therefore, following previous work (Qian et al., 2019), we mainly focus on the performance of CT+, CT- and PS+. For a fair comparison with previous work (Qian et al., 2019), we perform 10-fold cross-validation on both the English and Chinese corpora. In addition, we adopt the F1 score as the evaluation metric for each category of factuality value, and report macro-averaged and micro-averaged F1 scores for the overall performance over all categories.
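The evaluation metrics can be sketched as follows. Note that with exactly one predicted label per document, micro-averaged F1 reduces to accuracy; this is a sketch, not the official evaluation script:

```python
def f1_scores(gold, pred, labels):
    """Per-class, macro-, and micro-averaged F1 for single-label classification."""
    per_class = {}
    for c in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        denom = 2 * tp + fp + fn
        per_class[c] = 2 * tp / denom if denom else 0.0  # F1 = 2TP/(2TP+FP+FN)
    macro = sum(per_class.values()) / len(labels)        # unweighted class mean
    micro = sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)
    return per_class, macro, micro

per_class, macro, micro = f1_scores(
    ["CT+", "CT-", "CT+"], ["CT+", "CT+", "CT+"], ["CT+", "CT-"])
```

Macro-F1 weights the rare CT- class as heavily as CT+, which is why it is the stricter of the two averages on these imbalanced datasets.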

Parameter Settings
In our implementation, we use HuggingFace's Transformers library to implement the BERT-base model, which has 12 layers, 768 hidden units, and 12 attention heads. The learning rate is initialized to 2e-5 with linear decay. We use the AdamW algorithm (Loshchilov and Hutter, 2018) to optimize the model parameters. The batch size is set to 4 and 2 for the English and Chinese event factuality datasets, respectively. The number of uncertain graph convolution layers is set to 2, and the size of the hidden states of the uncertain graph convolution layer is 768.

Baselines
We compare the proposed approach ULGN with the following methods: (1) MaxEntVote (Qian et al., 2019), which first uses a maximum entropy model to identify sentence-level event factuality, and then votes, i.e., chooses the value committed by the most sentences as the document-level factuality value.
(2) BiLSTM-Att (Qian et al., 2019), which employs the bidirectional long short-term memory network (BiLSTM) to extract features, and uses the intra-sentence attention to capture the most important information in the sentence.
(3) Att-Adv (Qian et al., 2019), which leverages the intra-sentence and inter-sentence attention to learn the document representation, and utilizes adversarial training to improve the robustness.
(4) BERT Model, which utilizes BERT-base (Devlin et al., 2019) to encode the document, and uses the [CLS] representation for prediction.


Overall Results
Table 1 shows the results on the English and Chinese datasets. We note the following key observations throughout our experiments:
(1) Our method outperforms all the baselines by a large margin. For example, compared with the previous state-of-the-art model Att-Adv (Qian et al., 2019), our method achieves 11.45% improvements of macro-F1 score on the Chinese event factuality dataset. The significant performance gain of our method over the baselines demonstrates that the proposed ULGN is very effective for this task.
(2) Our method improves upon the BERT Model by 7.92% and 8.81% in terms of macro-F1 score on the English and Chinese event factuality datasets, respectively. We attribute the improvements to the fact that ULGN takes advantage of local uncertainty and global structure, thus achieving superior performance to the BERT Model.
(3) The BERT Model achieves performance comparable to complex state-of-the-art methods such as Att-Adv (Qian et al., 2019) on these two datasets, which indicates that BERT is able to extract useful text features for the task.

Ablation Study
To demonstrate the effectiveness of the local uncertainty estimation (LUE) and uncertain information aggregation (UIA), we conduct an ablation study as follows: 1) w/o VA, which removes the variance-based attention; 2) w/o LUE, which first uses BERT to encode the local information as a vector, and then employs vanilla GCNs to aggregate the local information; 3) w/o UIA, which first samples a representation (i.e., a vector) for each piece of local information, and then performs max-pooling over these sampled representations to get the global representation for prediction; 4) w/o LUE and UIA, which is the same as the BERT Model introduced in Section 3.3. We present the results of the ablation study in Table 2.

Table 2: Ablation study by removing the main components, where "w/o" indicates without. The performance is followed by the drop (↓) compared with the full method ULGN. "VA", "LUE" and "UIA" refer to "variance-based attention", "local uncertainty estimation" and "uncertain information aggregation", respectively.

From the results, we can observe that: (1) Effectiveness of Local Uncertainty Estimation. When we remove the LUE module from ULGN, the macro-F1 score drops by 4.35% on the English dataset, which proves that local uncertainty estimation is very effective for the task.
(2) Effectiveness of Uncertain Information Aggregation. Compared with the model with the UIA module removed, our method ULGN achieves 5.54% improvements in micro-F1 score on the Chinese dataset. Moreover, removing the VA module also degrades performance. This demonstrates that the uncertain information aggregation module is able to effectively integrate the local information.
(3) Effectiveness of Local Uncertainty Estimation and Uncertain Information Aggregation. When we remove both LUE and UIA, the performance drops significantly: the macro-F1 score drops from 93.09% to 84.28% on the Chinese dataset. This indicates that jointly utilizing local uncertainty estimation and uncertain information aggregation is highly effective.

Results on the Documents with Different Sentence-Level Event Factuality Values
We divide the documents into groups according to the number n of distinct sentence-level factuality values they contain. The results are shown in Table 3. From the table, we have two important observations: (1) Compared with the improvement over Att-Adv (Qian et al., 2019) when n=1, our method achieves larger improvements when n>1. For example, our method ULGN achieves a 3.54% improvement of macro-F1 score when n=1, but a 14.72% improvement when n>1 on the English dataset. This indicates that our method handles the problem of sentence-level event factuality inconsistency well.
(2) The micro-F1 and macro-F1 scores for n>1 are lower than those for n=1 for both Att-Adv and our approach ULGN, indicating that the factuality of documents with different types of sentence-level factuality values is more difficult to identify due to the interference among sentence-level values.


Comparison with Other Document Modeling Methods
We further compare ULGN with other document modeling methods: 1) Longformer (Beltagy et al., 2020), which extracts the global feature for prediction; 2) BERT-GCN and BERT-GAT, which first use BERT to encode the local information, and then employ GCN and GAT (Veličković et al., 2017), respectively, to integrate the local information. We present the experimental results in Table 4. From the results, we can clearly see that our method ULGN significantly outperforms these baselines. This indicates that when modeling the document for the document-level EFI task, we not only need to consider the uncertainty of local information, but also need to leverage the document structure for integrating local information.

Impact of the Number of Graph Layers
We evaluate the influence of the number of graph layers, as illustrated in Figure 3. From the figure, we can observe that: (1) Our method ULGN yields the best performance when the number of graph layers is 2. We attribute this to the fact that any two local nodes that are not directly connected can pass information to each other through the document node (i.e., within 2 hops).
(2) When the number of graph layers is too large, the micro-F1 and macro-F1 scores stop increasing or even decrease. We conjecture that increasing the number of randomly initialized parameters may not be beneficial for BERT fine-tuning.
Related Work

Event Factuality Identification
Most previous studies focus on sentence-level EFI. Despite these successful efforts, sentence-level event factuality easily leads to conflicts. To this end, Qian et al. (2019) propose the document-level EFI task. However, when modeling the document for this task, their method ignores the uncertainty of local information and the global structure.

Uncertainty Modeling
Uncertainty is a crucial but long-ignored issue in many NLP applications. Conventionally, the high-level representation of an input instance is modeled as a fixed-length feature vector, which can be regarded as a "point" in a low-dimensional space. However, such a point estimate is not sufficient to express uncertainty, as point-based methods assume that learned features are always correct (Gal and Ghahramani, 2016; Kendall and Gal, 2017). In recent years, Gaussian embeddings have received increasing attention in deep learning. For example, Vilnis and McCallum (2015) utilize Gaussian embeddings to represent words, where the covariance naturally measures the ambiguity of the words. He et al. (2015) leverage Gaussian distributions to represent entities and relations, aiming to model their uncertainty in knowledge graphs. In addition, Xiao and Wang (2019) quantify uncertainties in several NLP tasks, such as sentiment analysis, named entity recognition and language modeling.
To the best of our knowledge, we are the first to consider the uncertainty of local information for the document-level EFI task. Namely, we represent the local information as a probability distribution, rather than a deterministic feature vector.

Conclusion
In this paper, we propose a novel uncertain local-to-global network (ULGN) for document-level event factuality identification. To model the uncertainty of local information, we propose a local uncertainty estimation module that represents the local information with a probability distribution. To leverage the global structure, we devise an uncertain information aggregation module to integrate the local information. Experimental results on two widely used datasets indicate that our approach substantially outperforms previous state-of-the-art methods.