KALM: Knowledge-Aware Integration of Local, Document, and Global Contexts for Long Document Understanding

With the advent of pre-trained language models (LMs), increasing research effort has focused on infusing commonsense and domain-specific knowledge to prepare LMs for downstream tasks. These works attempt to leverage knowledge graphs, the de facto standard of symbolic knowledge representation, alongside pre-trained LMs. While existing approaches leverage external knowledge, it remains an open question how to jointly incorporate knowledge graphs at varying scopes, from the local (e.g., sentence) and document levels to global knowledge, to enable knowledge-rich and interpretable exchange across contexts. Incorporating varying contexts can especially benefit long document understanding tasks built on pre-trained LMs, which are typically bounded by the input sequence length. In light of these challenges, we propose KALM, a language model that jointly leverages knowledge in local, document-level, and global contexts for long document understanding. KALM first encodes long documents and knowledge graphs into the three knowledge-aware context representations. It then processes each context with context-specific layers, followed by a ContextFusion layer that facilitates knowledge exchange to derive an overarching document representation. Extensive experiments demonstrate that KALM achieves state-of-the-art performance on three long document understanding tasks across six datasets/settings. Further analyses reveal that the three knowledge-aware contexts are complementary and all contribute to model performance, while the importance and information exchange patterns of different contexts vary across tasks and datasets.


INTRODUCTION
Pre-trained language models (LMs) have become the dominant paradigm in NLP research, while knowledge graphs (KGs) are the de facto standard of symbolic knowledge representation. Recent advances in knowledge-aware NLP focus on combining the two paradigms (Wang et al., 2021b), infusing encyclopedic (Vrandečić & Krötzsch, 2014; Pellissier Tanon et al., 2020), commonsense (Speer et al., 2017), and domain-specific (Chang et al., 2020) knowledge into LMs. Knowledge-grounded models have achieved state-of-the-art performance on tasks including question answering (Sun et al., 2022), commonsense reasoning (Kim et al., 2022), and social text analysis. Prior approaches to infusing LMs with knowledge typically focus on three hitherto orthogonal directions: incorporating knowledge related to the local (e.g., sentence-level), document-level, or global context. Local context approaches argue that sentences mention entities, and that external knowledge about these entities, such as textual descriptions (Wang et al., 2021b) and metadata (Ostapenko et al., 2022), helps LMs realize they are more than just tokens. Document-level context approaches argue that entities central to a document's core ideas are mentioned repeatedly throughout the document, while related concepts might be discussed in different paragraphs. These methods attempt to leverage entities and knowledge across paragraphs with techniques such as document graphs. Global context approaches argue that unmentioned yet connecting entities help connect the dots for knowledge-based reasoning, and thus encode knowledge graph subgraphs with graph neural networks alongside the textual content (Yasunaga et al., 2021). However, despite their individual pros and cons, how to integrate the three document contexts in a knowledge-aware and interpretable way remains an open problem.

Figure 1: Overview of KALM, which encodes long documents and knowledge graphs into local, document, and global contexts while enabling interpretable information exchange across contexts.
Controlling for varying scopes of knowledge and context representations could benefit numerous language understanding tasks, especially those centered around long documents. Bounded by the inherent limitation of input sequence length, existing knowledge-aware LMs are mostly designed to handle short texts (Wang et al., 2021b). Processing long documents containing thousands of tokens (Beltagy et al., 2021), however, requires attending to varying document contexts, disambiguating long-distance co-referring entities and events, and more.
In light of these challenges, we propose KALM, a Knowledge-Aware Language Model for long document understanding. Specifically, KALM first derives three context- and knowledge-aware representations from the long input document and an external knowledge graph: the local context represented as raw text, the document-level context represented as a document graph, and the global context represented as a knowledge graph subgraph. KALM layers then encode each context with context-specific layers, followed by our novel ContextFusion layers that enable knowledge-rich and interpretable information exchange across the three knowledge-aware contexts. A unified document representation is then derived from the context-specific representations that have also interacted with the other contexts. An illustration of KALM is presented in Figure 1.
While KALM is a general method for long document understanding, we evaluate the model on three tasks across six datasets/settings that are particularly sensitive to broader contexts and external knowledge: political perspective detection, misinformation detection, and roll call vote prediction. Extensive experiments demonstrate that KALM outperforms pre-trained LMs, task-agnostic knowledge-aware baselines, and strong task-specific baselines on all six datasets. In ablation experiments, we further establish KALM's ability to enable information exchange, better handle long documents, and improve data efficiency. In addition, KALM and the proposed ContextFusion layers reveal and help interpret the roles and information exchange patterns of different contexts.

PROBLEM DEFINITION
Let d = {d_1, . . . , d_n} denote a natural language document with n paragraphs, where each paragraph d_i = {w_{i1}, . . . , w_{i n_i}} is a sequence of n_i tokens. Knowledge-aware long document understanding assumes access to an external knowledge graph KG = (E, R, A, ψ, ϕ), where E = {e_1, . . . , e_N} denotes the entity set, R = {r_1, . . . , r_M} denotes the relation set, A is the adjacency matrix in which a_{ij} = k indicates (e_i, r_k, e_j) ∈ KG, and ψ(·): E → str and ϕ(·): R → str map entities and relations to their textual descriptions.
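To make this notation concrete, the following minimal Python sketch shows one way the inputs of this formulation could be represented; all class and field names are illustrative and not part of KALM itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Illustrative containers mirroring the notation above: d = {d_1, ..., d_n}
# and KG = (E, R, A, psi, phi). Names are hypothetical.

@dataclass
class Document:
    paragraphs: List[List[str]]              # d_i = {w_i1, ..., w_in_i}, token lists
    label: Optional[int] = None              # one of the pre-defined document labels

@dataclass
class KnowledgeGraph:
    entities: List[str]                                   # E = {e_1, ..., e_N}
    relations: List[str]                                  # R = {r_1, ..., r_M}
    # sparse adjacency: adjacency[(i, j)] = k  <=>  (e_i, r_k, e_j) in KG
    adjacency: Dict[Tuple[int, int], int] = field(default_factory=dict)
    entity_desc: Dict[int, str] = field(default_factory=dict)    # psi: E -> str
    relation_desc: Dict[int, str] = field(default_factory=dict)  # phi: R -> str

    def triples(self):
        """Yield (head, relation, tail) triples encoded by the adjacency map."""
        for (i, j), k in self.adjacency.items():
            yield self.entities[i], self.relations[k], self.entities[j]
```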
Given pre-defined document labels, knowledge-aware natural language understanding aims to learn document representations and classify d into its corresponding label with the help of KG.

KNOWLEDGE-AWARE CONTEXTS
We hypothesize that a holistic representation of long documents should incorporate contexts and relevant knowledge at three levels: the local context (e.g., a sentence with descriptions of mentioned entities), the broader document context (e.g., a long document with cross-paragraph entity reference structure), and the global/external context represented as external knowledge (e.g., relevant knowledge base subgraphs). Each of the three contexts uses a different granularity of external knowledge, and existing works fall short of jointly integrating the three types of representations. To this end, KALM first introduces knowledge into each level of context in a different way.
Local context. Represented as the raw text of sentences and paragraphs, the local context models the smallest unit in long document understanding. Prior works attempted to add sentence metadata (e.g., tense, sentiment, topic), adopt sentence-level pre-training tasks based on KG triples (Wang et al., 2021b), or leverage knowledge graph embeddings alongside textual representations. While effective, these methods are ad hoc add-ons that are not fully compatible with existing pre-trained LMs. KALM instead directly concatenates the textual description ψ(e_i) of each entity e_i to a paragraph if e_i is mentioned in it. In this way, the original text is directly augmented with entity descriptions, informing the LM that entities such as "Kepler" are more than mere tokens and helping to combat the spurious correlations of pre-trained LMs (McMilin, 2022). For each augmented paragraph d̃_i, we adopt a pre-trained language model LM(·) and mean pooling to extract a paragraph representation t_i^(0) = MeanPool(LM(d̃_i)). We also add a fusion token at the beginning of the paragraph sequence for information exchange across contexts. After processing all n paragraphs, we obtain the local context representation T^(0) = [θ_rand, t_1^(0), . . . , t_n^(0)], where θ_rand denotes a randomly initialized vector for the fusion token in the local context and the superscript (0) indicates the 0-th layer.
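The snippet below is a minimal sketch of this local-context construction, assuming a RoBERTa encoder from the Hugging Face transformers library and a naive substring-based entity matcher in place of a real entity linker such as TagMe; variable names like theta_rand are illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Sketch of building T^(0): append descriptions of mentioned entities to each
# paragraph, encode with a pre-trained LM, mean-pool, and prepend a learnable
# fusion-token vector (theta_rand). The entity matcher is a naive placeholder.

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")
hidden = encoder.config.hidden_size
theta_rand = torch.nn.Parameter(torch.randn(hidden))  # fusion token, layer 0


def augment_paragraph(paragraph: str, entity_desc: dict) -> str:
    """Concatenate descriptions psi(e_i) of entities mentioned in the paragraph."""
    mentioned = [desc for name, desc in entity_desc.items() if name in paragraph]
    return paragraph + " " + " ".join(mentioned)


@torch.no_grad()
def local_context(paragraphs, entity_desc):
    reps = [theta_rand]
    for p in paragraphs:
        inputs = tokenizer(augment_paragraph(p, entity_desc),
                           truncation=True, return_tensors="pt")
        hidden_states = encoder(**inputs).last_hidden_state  # (1, seq, hidden)
        reps.append(hidden_states.mean(dim=1).squeeze(0))    # mean pooling
    return torch.stack(reps)  # T^(0): (n + 1, hidden)
```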
Document-level context. Represented as the structure of the full document, the document-level context is responsible for modeling cross-paragraph entities and knowledge at the document level. While existing works attempted to incorporate external knowledge in documents via document graphs, these approaches are heavyweight, require substantial preprocessing, and add a great computational burden. In light of these challenges, we propose knowledge coreference, a simple and effective mechanism for modeling text-knowledge interaction at the document level. Specifically, a document graph with n + 1 nodes is constructed, consisting of one fusion node and n paragraph nodes. If paragraphs i and j both mention entity e_k, nodes i and j in the document graph are connected with relation type k. In addition, the fusion node is connected to every paragraph node with a super-relation. As a result, we obtain the adjacency matrix A^g of the document graph. Paired with the knowledge-guided GNN introduced in Section 2.3, knowledge coreference enables information flow across paragraphs guided by external knowledge. Node features of the document graph are initialized with the knowledge-enriched paragraph representations, plus a randomly initialized vector for the fusion node.

Global context. Represented as external knowledge graphs, the global context is responsible for leveraging unseen entities and facilitating KG-based reasoning. Existing works mainly focused on extracting knowledge graph subgraphs (Yasunaga et al., 2021) and encoding them alongside the document content. Though many techniques have been proposed to extract and prune knowledge graph subgraphs, KALM employs a straightforward approach: for all entities mentioned in the long document, KALM merges their 2-hop neighborhoods to obtain a knowledge graph subgraph. A fusion entity is then introduced and connected to every other entity, resulting in a connected graph. In this way, KALM cuts back on the preprocessing required for modeling global knowledge and better preserves the information in the KG. Knowledge graph embedding methods (Bordes et al., 2013) are then adopted to initialize the node features of the KG subgraph, where KGE(·) denotes the knowledge graph embeddings trained on the original KG and |ρ(d)| denotes the number of entities mentioned in document d.
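As an illustration of the two graph constructions, the following sketch builds a knowledge-coreference document graph and a merged 2-hop KG subgraph with networkx; the fusion-node id, the super-relation label, and the helper names are assumptions rather than the original implementation.

```python
import networkx as nx

# Sketch of the two graph constructions above. `para_entities[i]` is the set of
# entity ids mentioned in paragraph i (e.g., from an entity linker), and `kg` is
# a networkx graph over the external knowledge graph.

FUSION = -1          # id of the fusion node / fusion entity (illustrative)
SUPER_RELATION = -1  # relation type connecting the fusion node to every paragraph


def document_graph(para_entities):
    """Knowledge coreference: connect paragraphs i and j with relation type k
    whenever they both mention entity e_k."""
    g = nx.MultiGraph()
    n = len(para_entities)
    g.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            for k in para_entities[i] & para_entities[j]:
                g.add_edge(i, j, relation=k)
    g.add_node(FUSION)
    for i in range(n):                      # fusion node <-> every paragraph node
        g.add_edge(FUSION, i, relation=SUPER_RELATION)
    return g


def global_subgraph(kg, para_entities):
    """Merge the 2-hop neighborhoods of all mentioned entities, then attach a
    fusion entity connected to every node so the subgraph is connected."""
    mentioned = set().union(*para_entities)
    nodes = set()
    for e in mentioned:
        if e in kg:
            nodes.update(nx.ego_graph(kg, e, radius=2).nodes)
    sub = kg.subgraph(nodes).copy()
    sub.add_node(FUSION)
    sub.add_edges_from((FUSION, v) for v in nodes)
    return sub
```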

KALM LAYERS
After obtaining the local, document-level, and global context representations of long documents, we employ KALM layers to learn document representations. Specifically, each KALM layer consists of three context-specific layers to process each context. A ContextFusion layer is then adopted to enable the knowledge-rich and interpretable information exchange across the three contexts.

CONTEXT-SPECIFIC LAYERS
Local context layer. The local context is represented as a sequence of vectors extracted from the knowledge-enriched text with the help of pre-trained LMs. We adopt transformer encoder layers (Vaswani et al., 2017) to encode the local context, obtaining T̃^(ℓ) = φ(TrmEnc(T^(ℓ−1))), where φ(·) denotes a non-linearity and t̃_0^(ℓ) denotes the transformed representation of the fusion token.
Document-level context layer. The document-level context is represented as a document graph based on knowledge coreference. To better exploit the entity-based relations in the document graph, we propose a knowledge-aware GNN architecture to enable knowledge-guided message passing, G̃^(ℓ) = φ(GNN(G^(ℓ−1))), where GNN(·) denotes the proposed knowledge-guided graph neural network and g̃_0^(ℓ) denotes the transformed representation of the fusion node. The knowledge-guided attention weight α_{i,j} follows an attention formulation in which a and Θ are learnable parameters, a^g_{ij} is the entry in the i-th row and j-th column of the adjacency matrix A^g of the document graph, and f(·) is a learnable linear layer. The term Θ f(KGE(a^g_{ij})) is responsible for enabling knowledge-guided message passing on the document graph.
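Since the full equations are omitted here, the following PyTorch sketch shows one plausible, GAT-style reading of the knowledge-guided attention, in which the usual pairwise attention logit is augmented with a term derived from Θ f(KGE(a^g_{ij})); the dimensions, the exact scoring function, and the placement of non-linearities are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# GAT-style sketch of knowledge-guided attention: the pairwise attention logit
# over node pairs is augmented with an edge term computed from the KG embedding
# of the entity shared by paragraphs i and j.

class KnowledgeGuidedGNNLayer(nn.Module):
    def __init__(self, dim, kge_dim):
        super().__init__()
        self.theta = nn.Linear(dim, dim, bias=False)        # Theta
        self.attn = nn.Linear(2 * dim, 1, bias=False)       # a
        self.f = nn.Linear(kge_dim, dim)                     # f(.)
        self.edge_score = nn.Linear(dim, 1, bias=False)      # projects edge term to a logit

    def forward(self, x, adj, edge_kge):
        # x: (N, dim) node features; adj: (N, N) 0/1 mask of the document graph;
        # edge_kge: (N, N, kge_dim) KG embedding of the entity shared by (i, j).
        h = self.theta(x)                                    # (N, dim)
        n = h.size(0)
        pair = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                          h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        logits = F.leaky_relu(self.attn(pair).squeeze(-1)
                              + self.edge_score(self.f(edge_kge)).squeeze(-1))
        logits = logits.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(logits, dim=-1)                # knowledge-guided weights
        return alpha @ h                                     # message passing
```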
Global context layer. The global context is represented as a relevant knowledge graph subgraph. We follow previous work and adopt GATs to encode the global context, obtaining transformed entity representations K̃^(ℓ), where k̃_0^(ℓ) denotes the transformed representation of the fusion entity.
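A minimal sketch of such a global context layer using the GATConv operator from PyTorch Geometric is shown below; the number of attention heads and the hidden dimension are illustrative.

```python
import torch
from torch_geometric.nn import GATConv

# Sketch of the global context layer: a standard graph attention network over
# the extracted KG subgraph (node features initialized from KG embeddings).

class GlobalContextLayer(torch.nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        # dim should be divisible by heads so the concatenated output stays at dim
        self.gat = GATConv(dim, dim // heads, heads=heads)

    def forward(self, k, edge_index):
        # k: (num_entities + 1, dim), including the fusion entity at index 0
        return torch.relu(self.gat(k, edge_index))
```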

CONTEXTFUSION LAYER
The local, document-level, and global contexts model external knowledge within sentences, across the document, and beyond the document, respectively. These contexts are closely connected, and a robust long document understanding method should reflect their interactions. Existing approaches mostly leverage only one or two of the contexts (Wang et al., 2021b), falling short of jointly leveraging the three knowledge-aware contexts, and they do not enable context-specific information to flow across contexts in a knowledge-rich and interpretable manner. We therefore propose the ContextFusion layer to tackle these challenges. We first take a local perspective and extract the representations of the fusion token, node, and entity in each context: t_L^(ℓ) = t̃_0^(ℓ), g_L^(ℓ) = g̃_0^(ℓ), and k_L^(ℓ) = k̃_0^(ℓ). We then take a global perspective and use the fusion token/node/entity as the query to conduct attentive pooling ap(·, ·) over all other tokens/nodes/entities in each context, yielding t_G^(ℓ), g_G^(ℓ), and k_G^(ℓ). In this way, the fusion token/node/entity in each context serves as an information exchange portal. We then use a transformer encoder layer TrmEnc(·) over these six representations to enable information exchange across the contexts. As a result, the transformed fusion representations incorporate information from the other contexts, and the output of the ℓ-th layer is assembled from the transformed context representations together with the updated fusion representations. Our proposed ContextFusion layer is interactive, since it enables information to flow across different document contexts instead of relying on direct concatenation or hierarchical processing. It is also interpretable, since the attention weights in TrmEnc(·) provide insights into the roles and importance of each document context, which we further explore in Section 3.5.
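The following sketch illustrates the ContextFusion mechanism described above: fusion representations as the local view, attentive pooling as the global view, and a transformer encoder layer over the six resulting vectors. The head count, feed-forward size, and the exact outputs returned are assumptions.

```python
import torch
import torch.nn as nn

# Sketch of ContextFusion: take the fusion slot of each context (local view),
# attentively pool the remaining slots with the fusion vector as query (global
# view), and run a transformer encoder layer over the six resulting vectors.

def attentive_pooling(query, keys):
    # query: (dim,), keys: (m, dim) -> weighted sum of keys
    weights = torch.softmax(keys @ query, dim=0)      # (m,)
    return weights @ keys                              # (dim,)


class ContextFusion(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.exchange = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=2 * dim, batch_first=True)

    def forward(self, T, G, K):
        # T, G, K: (n_t, dim), (n_g, dim), (n_k, dim); index 0 is the fusion slot.
        views = []
        for c in (T, G, K):
            views.append(c[0])                             # local view: fusion slot
            views.append(attentive_pooling(c[0], c[1:]))   # global view: ap(query, rest)
        fused = torch.stack(views).unsqueeze(0)            # (1, 6, dim): t_L, t_G, g_L, g_G, k_L, k_G
        fused = self.exchange(fused).squeeze(0)
        # Return the updated fusion representations, now carrying cross-context
        # information (which slots to propagate is an assumption of this sketch).
        return fused[0], fused[2], fused[4]
```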

LEARNING AND INFERENCE
After a total of P KALM layers, we obtain the final document representation from the fusion representations t_L^(P), g_L^(P), and k_L^(P). Given the document label a ∈ A, the label probability p(a|d) is computed with a softmax classification layer over the final document representation. We then optimize KALM with the cross-entropy loss. At inference time, the predicted label is argmax_a p(a|d).
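A minimal sketch of the classification head is given below; how the three fusion representations are combined (here, concatenation) is an assumption, since only the softmax classifier and the cross-entropy objective are stated above.

```python
import torch
import torch.nn as nn

# Sketch of the classification head: combine the final-layer fusion
# representations (concatenation is an assumption) and apply a softmax
# classifier trained with cross-entropy.

class KALMClassifier(nn.Module):
    def __init__(self, dim, num_labels):
        super().__init__()
        self.head = nn.Linear(3 * dim, num_labels)

    def forward(self, t_fusion, g_fusion, k_fusion, label=None):
        logits = self.head(torch.cat([t_fusion, g_fusion, k_fusion], dim=-1))
        prob = torch.softmax(logits, dim=-1)           # p(a | d)
        loss = None
        if label is not None:                          # cross-entropy during training
            loss = nn.functional.cross_entropy(logits.unsqueeze(0), label.view(1))
        return prob, loss                              # argmax of prob at inference
```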

EXPERIMENT SETTINGS
Tasks and Datasets. KALM is a general method for knowledge-aware long document understanding; we evaluate it on three tasks that especially benefit from external knowledge and broader context: political perspective detection, misinformation detection, and roll call vote prediction. Following previous works, we adopt SemEval (Kiesel et al., 2019) and Allsides (Li & Goldwasser, 2019) for political perspective detection, LUN (Rashkin et al., 2017) and SLN for misinformation detection, and the two datasets (random and time-based splits) proposed in prior work for roll call vote prediction. For external KGs, we follow existing works and adopt the KGs used in KGAP, CompareNet (Hu et al., 2021), and ConceptNet (Speer et al., 2017) for the three tasks, respectively.
Baseline methods. We compare KALM with three types of baselines for a holistic evaluation: pre-trained LMs, task-agnostic knowledge-aware methods, and task-specific models. For pre-trained LMs, we evaluate RoBERTa, ELECTRA, DeBERTa (He et al., 2020), BART, and Longformer on the three tasks. For task-agnostic baselines, we evaluate KGAP, GreaseLM, and GreaseLM+ on the three tasks. Task-specific models are introduced in the following sections. For pre-trained LMs, task-agnostic methods, and KALM, we run each method five times and report the average performance and standard deviation. For task-specific models, we compare with the originally reported results, since we follow the exact same experiment settings and data splits.

TASK 1: POLITICAL PERSPECTIVE DETECTION
The task-specific baselines for political perspective detection are HLSTM (Yang et al., 2016), MAN (Li & Goldwasser, 2021), and KCD. We present the results in Table 1, with / indicating that the result is not reported in the original paper. The results demonstrate that KALM outperforms all task-specific baselines, pre-trained LMs, and task-agnostic methods. In addition, KALM's performance gain is greater on the smaller SemEval dataset, suggesting that by jointly leveraging the three document contexts and introducing knowledge into each of them, KALM has improved data efficiency, which is further explored in Section C.1 in the appendix.

TASK 2: MISINFORMATION DETECTION
We present the results in Table 2, which demonstrates that KALM outperforms task-specific and task-agnostic baselines as well as pre-trained LMs. In addition, most knowledge-aware methods (KGAP, GreaseLM+, and KALM) outperform the rest of the baselines, indicating that misinformation detection especially benefits from external knowledge graphs.

TASK 3: ROLL CALL VOTE PREDICTION
The task-specific baselines for roll call vote prediction are ideal point (Gerrish & Blei, 2011), ideal vector, Vote, and PAR. We present the results in Table 3, which demonstrates that KALM outperforms all baseline methods. In addition, the large performance gain of GreaseLM+ over GreaseLM demonstrates the importance of the document-level context, which is further explored in Section 3.5.

CONTEXT EXCHANGE STUDY
By jointly modeling three document contexts and employing the ContextFusion layer, KALM facilitates interpretable exchange across the three document contexts. We conduct an ablation study to examine whether the contexts and the ContextFusion layer are essential to the KALM architecture. Specifically, we remove the three contexts one at a time and change the ContextFusion layer into MInt, concatenation, and sum. We present the results in Table 4, which demonstrates that all three knowledge-aware contexts contribute to model performance, while the proposed ContextFusion layer outperforms other strategies of aggregating information across contexts.

Figure 2: Interpreting the roles of the three contexts with attention maps in the ContextFusion layer. t_L, t_G, g_L, g_G, k_L, k_G denote the context representations in equations (9) and (10), so that the first two columns indicate how the local context attends to information in other contexts, the next two columns correspond to the document-level context, and the last two columns to the global context.
In addition to boosting model performance, the ContextFusion layer enables the interpretation of how different document contexts contribute to document understanding. We calculate the average of the absolute values of the multi-head attention weights in the TrmEnc(·) layer of ContextFusion and illustrate the results in Figure 2. The figure shows that the contributions and information exchange patterns of the three contexts vary across datasets and KALM layers. Specifically, local and global contexts are important for the LUN dataset, document-level and global contexts are important for roll call vote prediction, and the SLN dataset leverages the three contexts equally. For political perspective detection, however, the importance of the three contexts varies with the depth of the KALM layers. This is especially salient on SemEval, where KALM first takes a view of the whole document, then draws from both local and document-level contexts, and closes by leveraging global knowledge to derive an overall document representation. In summary, the ContextFusion layer in KALM successfully identifies the relative importance and information exchange patterns of the three contexts, providing insights into how KALM arrives at its predictions and which context information should be the focus of future research. We further demonstrate that the role and importance of each context changes as training progresses in Section C.2 in the appendix.

LONG DOCUMENT STUDY
KALM complements the scarce literature on knowledge-aware long document understanding. Beyond simply containing more input tokens, understanding long documents often requires more knowledge references and more knowledge-based reasoning. To examine whether KALM indeed improves on longer documents with more external knowledge, we plot the performance of KALM and competitive baselines with respect to document length and knowledge intensity in Figure 3. While baseline methods are prone to errors when documents are long and knowledge-rich, KALM alleviates this issue and performs better in the top-right corner of the plot.

RELATED WORK
Knowledge graphs are playing an increasingly important role in language models and NLP research.
Local context approaches focus on external knowledge in individual sentences to enable fine-grained knowledge inclusion. A straightforward way is to encode KG entities with KG embeddings (Bordes et al., 2013; Lin et al., 2015; Cucala et al., 2021; Sun et al., 2018) and infuse the embeddings with language representations (Kang et al., 2022). Later approaches focus on augmenting pre-trained LMs with KGs by introducing knowledge-aware training tasks and LM architectures (Wang et al., 2021b;a; Sridhar & Yang, 2022; Moiseev et al., 2022; Kaur et al., 2022; Hu et al., 2022; Arora et al., 2022; de Jong et al., 2021; Meng et al., 2021). However, local context approaches fall short of leveraging inter-sentence and inter-entity knowledge, resulting in models that cannot grasp the full picture of text-knowledge interactions.
Document-level context approaches take a view of the whole document, jointly considering external knowledge across multiple sentences and paragraphs. The predominant way of achieving document-level knowledge understanding is through the construction of "document graphs", where the textual context, external KGs, and other sources of information are condensed into graphs, often heterogeneous information networks. Graph neural networks are then employed to learn graph representations that fuse both textual information and external KGs. However, document-level context approaches fall short of preserving the original KG structure, resulting in models with reduced knowledge reasoning abilities.
Global context approaches focus on the external KG, extracting relevant KG subgraphs based on the entity mentions of the long document. Pruned with certain mechanisms (Yasunaga et al., 2021) or not (Qiu et al., 2019), these KG subgraphs are encoded with GNNs, and the resulting representations are fused with LMs, ranging from simple concatenation to deeper interactions. However, global context approaches leverage external KGs in a stand-alone manner, falling short of enabling the dynamic integration of textual content and external KGs.
While existing approaches successfully introduce external KGs to LMs, long document understanding poses new challenges to knowledge-aware NLP. Long documents have greater knowledge intensity: more entities are mentioned, more relations are leveraged, and more reasoning is required to fully understand the nuances, while existing approaches are mostly designed for knowledge-sparse scenarios. In addition, long documents exhibit the phenomenon of knowledge coreference, where central ideas and entities are reiterated throughout the document and co-exist in different levels of document contexts. In light of these challenges, we propose KALM to jointly leverage the local, document-level, and global contexts of long documents for knowledge incorporation.

CONCLUSION
In this paper, we propose KALM, a knowledge-aware long document understanding approach that introduces external knowledge into three levels of document contexts and enables interactive and interpretable exchange across them. Extensive experiments demonstrate that KALM achieves state-of-the-art performance on three tasks across six datasets. Our analysis shows that KALM provides insights into the roles and patterns of individual contexts, improves the handling of long documents with greater knowledge intensity, and has better data efficiency than existing works.

A LIMITATIONS
Our proposed KALM has two minor limitations:

• KALM relies on existing knowledge graphs to facilitate knowledge-aware long document understanding. While knowledge graphs are effective and prevalent tools for modeling real-world symbolic knowledge, they are often sparse and hardly exhaustive (Tan et al., 2022; Pujara et al., 2017). In addition, external knowledge is not limited to knowledge graphs but also exists in textual, visual, and other symbolic forms. We leave it to future work to jointly leverage multiple forms and sources of external knowledge in document understanding.

• KALM leverages TagMe (Ferragina & Scaiella, 2011) to identify entity mentions and build the three knowledge-aware contexts. While TagMe and other entity identification tools are effective, they are not 100% accurate, resulting in potentially omitted entities and external knowledge. In addition, running TagMe on hundreds of thousands of long documents is time- and resource-consuming even when processed in parallel. We leave it to future work to leverage knowledge graphs for long document understanding without explicitly relying on entity linking tools.

B ETHICS STATEMENT
Pre-trained LMs and knowledge graphs are known to encode social biases (Mehrabi et al., 2021; Du et al., 2022; Keidar et al., 2021). As a result, KALM might leverage the biased and unethical correlations in LMs and KGs to arrive at conclusions. We encourage KALM users to audit its output before using it beyond the standard benchmarks. We leave it to future work to leverage knowledge graphs in pre-trained LMs with a focus on fairness and equity.
C ADDITIONAL EXPERIMENTS

C.1 DATA EFFICIENCY
Existing works argue that introducing knowledge graphs to NLP tasks could improve data efficiency and help alleviate the need for extensive training data. By introducing knowledge into all three document contexts and enabling knowledge-rich information exchange across contexts, KALM might be in a better position to tackle this issue. To examine whether KALM indeed improves data efficiency, we compare the performance of KALM with competitive baselines when trained on partial training sets and illustrate the results in Figure 4. While performance does not change greatly with 30% to 100% of the training data, baseline methods suffer significant performance drops when only 10% to 20% of the data are available. In contrast, KALM maintains steady performance with as little as 10% of the training data.

C.2 CONTEXT EXCHANGE STUDY (CONT.)
In Section 3.5, we conducted an ablation study of the three knowledge-aware contexts and explored how the ContextFusion layer enables the interpretation of context contributions and information exchange patterns, showing that the three contexts play different roles depending on the dataset and the KALM layer. Here, we further explore whether the roles and information exchange patterns of the contexts change as training progresses. Figure 5 illustrates the results with respect to training epochs: the attention matrices start out dense and become sparse, indicating that the roles of the different contexts develop gradually over the course of training.

C.3 LONG DOCUMENT STUDY (CONT.)
We present error analysis with respect to document length and knowledge intensity on more baseline methods, including language models (RoBERTa, BART, Longformer), knowledge-aware LMs (KGAP, GreaseLM, GreaseLM+), and our proposed KALM in Figure 6. Our conclusion still holds across these additional baselines.

For task-specific baselines, we directly use the results reported in previous works since we follow the same experiment settings and the comparison is thus fair. For pre-trained LMs and task-agnostic baselines, we run each method five times with different random seeds and report the average performance as well as the standard deviation. Figure 4 is an exception, where we only run each method once due to computing constraints.

D.6 MORE EXPERIMENT DETAILS
We provide further details about several experiments that warrant additional explanation.
• Table 3: We implement pre-trained LMs and task-agnostic baselines for roll call vote prediction by using them to learn representations of legislation texts, concatenating these with the legislator representations learned with PAR, and adopting softmax layers for classification.

• Table 4: We remove each context by only applying ContextFusion layers to the other two context representations. We follow the original implementation of MInt. We implement concat and sum by using the concatenation and summation of the three context representations as the overall document representation.

• Figure 2: The multi-head attention in the ContextFusion layer provides a 6 × 6 attention weight matrix indicating how information flows across different contexts. The six rows (columns) stand for the local view of the local context, the global view of the local context, the local view of the document-level context, the global view of the document-level context, the local view of the global context, and the global view of the global context, which are described in detail in Section 2.3.2. The values in each square are the average of the absolute values of the attention weights across all data samples in the validation set, computed as sketched below.
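The sketch below illustrates how such a 6 × 6 map could be aggregated, assuming access to the per-head attention tensors of the TrmEnc(·) layer for each validation example; all names are illustrative.

```python
import torch

# Sketch of aggregating the Figure 2 maps, assuming each validation example
# yields a (heads, 6, 6) attention tensor from the TrmEnc(.) layer of
# ContextFusion.

def average_attention_map(attention_per_example):
    """attention_per_example: iterable of (heads, 6, 6) tensors.
    Returns the mean of absolute attention weights over heads and examples."""
    maps = [a.abs().mean(dim=0) for a in attention_per_example]  # average over heads
    return torch.stack(maps).mean(dim=0)                         # average over examples
```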

D.7 COMPUTATIONAL RESOURCES DETAILS
We used a GPU cluster with 16 NVIDIA A40 GPUs, 1,988 GB of memory, and 104 CPU cores for the experiments. Running KALM with the best parameters takes approximately 1.5, 16, 3, 4, 1, and 1 hour(s) for the six datasets (SemEval, Allsides, SLN, LUN, random, and time-based), respectively.

Our experiments build on publicly available tools and resources, including NLTK (Bird et al., 2009) and the three adopted knowledge graphs (Speer et al., 2017). We commit to making our code and data publicly available upon acceptance to facilitate reproduction and further research.