Dynamic Graph Transformer for Implicit Tag Recognition

Textual information extraction is a typical research topic in the NLP community. Several NLP tasks, such as named entity recognition and relation extraction between entities, have been well studied in previous work. However, few studies pay attention to implicit information. For example, a financial news article mentioning "Apple Inc." may also be related to Samsung, even though Samsung is not explicitly mentioned in the article. This work presents a novel dynamic graph transformer that distills textual information and entity relations on the fly. Experimental results confirm the effectiveness of our approach to implicit tag recognition.


Introduction
Documents on the web deliver a great deal of up-to-date information to people worldwide. To keep track of real-world information automatically, textual information extraction is a fundamental issue for NLP researchers. Many downstream tasks such as fake news detection (Wu et al., 2019) and stock movement prediction (Peng and Jiang, 2016) can benefit from the extracted information. However, most previous works (Hu et al., 2018; Xu and Cohen, 2018) focus on the explicit information in articles and do not consider implicit information. For example, two companies, Sprint and T-Mobile, are explicitly mentioned in the news article in Figure 1. However, the stock price of a non-mentioned company, SoftBank, may be influenced since SoftBank owns shares of Sprint. Although this kind of inference is intuitive for professional analysts, few previous works take such implicit information into consideration. In this paper, we aim to make machines aware of this kind of implicit information.
Transformer-based (Vaswani et al., 2017) neural networks achieve state-of-the-art performance in many NLP tasks (Devlin et al., 2018; Malmi et al., 2019). To model the relationships between entities, the graph neural network (GNN) is a well-known architecture for representing knowledge and additional information (Fu et al., 2019; You et al., 2020). Furthermore, models blending these two architectures have shown their effectiveness (Lu et al., 2020). In this paper, we propose the dynamic graph transformer (DGT), a novel blend of Transformer and GNN. In previous work, the weights of the GNN are pre-determined in the training stage and are not affected by the given input. Our DGT adjusts the weights depending on the input on the fly. In this way, the representation of the graph information becomes more flexible and more specific to the input.
A strategy for pre-training on in-domain data is further proposed. Experimental results show our approach is effective in the task of extracting the implicit information from news articles. The contributions of this work are summarized as follows.
• We point out an important issue of information extraction for implicit entities.
• We propose a novel model that dynamically incorporates textual information and graph information. Our approach outperforms recent works in implicit tag recognition.
• Our pre-training task, masked entity prediction, is helpful for predicting implicit entities. The pre-trained model can also be applied to other information extraction tasks.

Related Work
Extracting and using the information in articles is one of the focuses of the NLP community. Some works (Hu et al., 2018; Ding et al., 2019) adopt the extracted information for stock market prediction. Others (Baker et al., 2016; Min and Zhao, 2019) use the information in news articles to construct socio-economic indicators. Most previous works focus only on the explicit information in articles, so the implicit entities in the articles may be overlooked. In this paper, we aim to extract the non-mentioned but related entities from a document.
Recently, GNNs have become popular for modeling relationships among multiple entities. Kipf and Welling (2016) use a convolutional neural network to learn node representations by aggregating the features of neighboring nodes. Veličković et al. (2017) employ the attention mechanism to improve the GNN architecture. Recent studies (Berg et al., 2017; Monti et al., 2017; Ying et al., 2018) also show the effectiveness of graph neural networks in various tasks. Inspired by these works, we present a new blend of the graph attention network (GAT) (Veličković et al., 2017), Transformer (Vaswani et al., 2017), and BERT (Devlin et al., 2018) for extracting the entities implicitly related to a given article.

Task Setting
The task is formulated as follows. Given an article, a model aims to predict a list of entities that are not explicitly mentioned in but are related to the given article. Let a corpus be D = {(d_1, y_1), (d_2, y_2), ..., (d_{|D|}, y_{|D|})}, where d_k and y_k denote the k-th article and its implicit entity list, respectively. The k-th article is represented by a word sequence, i.e., d_k = (w_1^k, w_2^k, ..., w_{|d_k|}^k). Let the candidate entity list be C = {c_1, c_2, ..., c_{|C|}}, where c_i denotes the i-th entity, and y_k ∈ {0, 1}^{|C|}, where y_{k,i} = 1 if the entity c_i is associated with the given article d_k but not directly mentioned within its content, and y_{k,i} = 0 otherwise.
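To make the label definition concrete, the sketch below builds the target vector y_k for one article. The company names and the `build_target` helper are hypothetical illustrations, not part of the dataset.

```python
# Sketch of the task formulation: each article d_k is paired with a
# binary vector y_k over the candidate entity list C, where y_k[i] = 1
# iff entity c_i is related to d_k but not mentioned in its text.

def build_target(candidates, related, mentioned):
    """Return the implicit-entity label vector y for one article."""
    return [1 if c in related and c not in mentioned else 0
            for c in candidates]

candidates = ["Sprint", "T-Mobile", "SoftBank"]
related = {"Sprint", "T-Mobile", "SoftBank"}   # labels from journalists
mentioned = {"Sprint", "T-Mobile"}             # found in the article text

y = build_target(candidates, related, mentioned)
# Only SoftBank is related but not mentioned, so only its position is 1.
```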

Graph Attentional Layer
We build a co-occurrence matrix M to represent the association graph of the entities in C. The number of times two entities c_i and c_j appear together in the training corpus is defined as follows:

M_{i,j} = Σ_{k=1}^{|D|} [c_i appears in d_k] · [c_j appears in d_k],

where [·] is the Iverson bracket. M can be viewed as an adjacency matrix representation of an association graph. M_{i,j} is the value of the edge between nodes i and j, which represents the degree of association between the entities c_i and c_j. The graph attentional layer is a component of the graph attention network (GAT) (Veličković et al., 2017). In order to suit the architecture of Transformers, we modify part of GAT as follows. Firstly, the h-th score matrix S^h ∈ R^{n×n} is defined as follows:

S^h = (X W_Q^h)(X W_K^h)^T / √d,
where X ∈ R^{n×d_input} denotes the input matrix, and W_Q^h ∈ R^{d_input×d} and W_K^h ∈ R^{d_input×d} are learnable matrices.
Secondly, the h-th multi-head attention matrix A^h ∈ R^{n×n} is defined as follows:

A^h_{i,j} = exp(S^h_{i,j}) / Σ_{j′ ∈ N[i]} exp(S^h_{i,j′}) if j ∈ N[i], and A^h_{i,j} = 0 otherwise,
where N[i] = {j : M_{i,j} > 0} represents the closed neighborhood set of node i. Lastly, we concatenate the results of all attention heads. The output of the graph attentional layer is computed as follows:

O = [A^1 X W_V^1 ; A^2 X W_V^2 ; … ; A^H X W_V^H] W_O,
where W_V^h ∈ R^{d_input×d} and W_O ∈ R^{d_input×d_input} are learnable matrices, H is the number of attention heads, and d_input = d × H.
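The modified layer above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the scaled dot-product scores, the softmax restricted to the closed neighborhood N[i] (assumed here to include node i itself so every row has a valid score), and the concatenation of heads follow the definitions in this section, while the function name and the dense-matrix formulation are our own choices.

```python
import numpy as np

def graph_attention(X, M, W_Q, W_K, W_V, W_O, H):
    """One graph attentional layer, following the definitions above.

    X: (n, d_input) input matrix; M: (n, n) co-occurrence matrix.
    W_Q, W_K, W_V: lists of H per-head (d_input, d) matrices;
    W_O: (d_input, d_input) output matrix, with d_input = d * H.
    """
    n = X.shape[0]
    # Closed neighborhood N[i] = {j : M[i, j] > 0}; assumed to also
    # contain i itself so each softmax row is well defined.
    mask = (M > 0) | np.eye(n, dtype=bool)
    heads = []
    for h in range(H):
        d = W_Q[h].shape[1]
        S = (X @ W_Q[h]) @ (X @ W_K[h]).T / np.sqrt(d)  # score matrix S^h
        S = np.where(mask, S, -np.inf)                  # mask non-neighbors
        A = np.exp(S - S.max(axis=1, keepdims=True))
        A = A / A.sum(axis=1, keepdims=True)            # row-wise softmax
        heads.append(A @ (X @ W_V[h]))                  # A^h X W_V^h
    # Concatenate all heads and project back to d_input dimensions.
    return np.concatenate(heads, axis=1) @ W_O
```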

Dynamic Graph Transformer
We first present the Static Graph Transformer (SGT), which incorporates Transformer, BERT, and GAT. Then, we extend SGT into our final model, the Dynamic Graph Transformer (DGT), by considering graph information dynamically.
Static Graph Transformer
Figure 2a shows the architecture of SGT, which integrates graph and text information. The motivation behind SGT is to treat the Transformer encoder as a variation of GNN by replacing self-attention with a graph attentional layer. The Transformer encoder takes the entity representation e_{c_j} as input, which is one of the row vectors in BERT's word embedding matrix E ∈ R^{|V|×d_input}, where V denotes the vocabulary of BERT. We consider the outputs of the Transformer encoder as the entity embeddings n_1, n_2, ..., n_{|C|}. In SGT, the process of generating the entity embedding n_j is static because it is unaffected by the content of the input article. The Transformer decoder takes the contextual word embeddings as inputs. The contextual word embeddings are the last hidden state vectors of BERT, which takes the word embeddings e_{w_i} of the article as input. Finally, we follow the settings of Devlin et al. (2018) and use the first output embedding of the Transformer decoder for predicting the implicit entity list.

Dynamic Graph Transformer
Figure 2b shows the architecture of DGT. From another perspective, since BERT is a kind of Transformer encoder, we move the learning of the entity embedding n_j to the Transformer decoder. We treat the last hidden state vectors of the Transformer decoder as the entity embeddings n_1, n_2, ..., n_{|C|}. In this way, DGT is able to update the entity embeddings on the fly because the source-target attention mechanism utilizes the outputs of BERT, making the model more tailored to the input article. In our task, each entity embedding is mapped to a scalar, which indicates the probability that the entity is related to the article but not mentioned in it.

Masked Entity Prediction
Devlin et al. (2018) show that pre-training with masked language modeling and next sentence prediction is effective. Chu et al. (2020) indicate that pre-training with the value process prediction task is useful for generating correct numeric values in news headlines. In this work, we propose a new pre-training task, masked entity prediction, to enrich the semantic information of entity names. Building on the original BERT vocabulary, we add a list of the entity names in C and aim to learn their representations. We adopt bert-base-chinese as the initial model and continue training it on our training data with two sub-tasks simultaneously. The first sub-task is a new masked language modeling task: unlike the masked language modeling performed for the original BERT, we mask not only tokens selected by the method of Devlin et al. (2018) but also all entity mentions in the document. The second sub-task is to predict all masked entities at the position of the [CLS] token. In this way, we obtain a pre-trained model tailored to our target corpus.
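A sketch of the masking step of the proposed pre-training task is shown below. The function name, the 15% masking rate (the BERT convention), and the token-level treatment of entity mentions are assumptions for illustration; the paper's actual implementation may differ.

```python
import random

MASK = "[MASK]"

def mask_for_entity_prediction(tokens, entity_vocab, mlm_prob=0.15, seed=0):
    """Masking sketch for masked entity prediction.

    In addition to BERT-style random masking of ordinary tokens, every
    entity mention is always masked. Returns the masked token sequence
    and the list of masked entities, which are to be predicted at the
    [CLS] position in the second sub-task.
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if tok in entity_vocab:          # entity mentions: always mask
            targets.append(tok)
            masked.append(MASK)
        elif rng.random() < mlm_prob:    # ordinary tokens: random mask
            masked.append(MASK)
        else:
            masked.append(tok)
    return masked, targets
```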

Dataset Description
The dataset consists of 27,716 news articles collected from MoneyDJ, a financial news vendor in Taiwan. Each news article is published with labels of the related entities, which are annotated by professional journalists. The candidate entity list contains 735 company names, i.e., |C| = 735. We split the dataset into a training set and a test set by time. The training set contains

Baseline Models
We adopt two kinds of models as our baselines, including the models for tag recommendation and the models for classification. The models ABC (Gong and Zhang, 2016), TAB-LSTM (Li et al., 2016), and ITAG (Tang et al., 2019) are considered as the baselines for tag recommendation. For the classification task, we adopt BERT (Devlin et al., 2018) and VGCN-BERT (Lu et al., 2020) for comparison.

Experimental Results
We adopt binary cross-entropy as the loss function and the Adam optimizer (Kingma and Ba, 2014) for training. The learning rate and the batch size are 3e-5 and 8, respectively. Table 1 shows that the proposed models outperform the baseline models in both micro-F1 and macro-F1 scores.
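The objective can be sketched as a per-entity binary cross-entropy; the NumPy helper below is an illustrative stand-in for the framework's built-in loss, not the training code used in the experiments.

```python
import numpy as np

def bce_multilabel(logits, targets):
    """Binary cross-entropy averaged over a batch and the |C| candidates.

    Each candidate entity is an independent binary prediction, matching
    the multi-label setup; logits and targets have shape (batch, |C|).
    """
    p = 1.0 / (1.0 + np.exp(-logits))        # sigmoid per entity
    eps = 1e-12                              # numerical safety for log
    return float(np.mean(-(targets * np.log(p + eps)
                           + (1 - targets) * np.log(1 - p + eps))))
```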
In Table 2, we further show the performance degradation when the original BERT language model is used instead of the language model trained with the proposed pre-training task. The results confirm that the proposed pre-training task is useful for implicit relation learning.

Attention Mechanism
Table 3a shows that the model focuses on the companies and the products that appear in the news article, indicating that the proposed model infers non-mentioned companies via the mentioned companies and products. In Table 3b, we find that even when no company name is mentioned in the news article, DGT can still infer non-mentioned companies from the mentioned products alone, correctly identifying the related entities through raw materials such as wheat, corn, and soybeans.

Entity Embeddings
We use the last hidden state vectors of DGT as the corresponding entity embeddings and use t-SNE (van der Maaten and Hinton, 2008) to visualize them. As shown in Figure 3, although we do not directly provide information about an entity's related industry or products during training, the model still captures these relationships from the corpus.
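The visualization step can be sketched with scikit-learn's t-SNE; the random vectors below are merely hypothetical stand-ins for DGT's last hidden state vectors.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical stand-ins for DGT's last hidden state vectors: one
# 16-dimensional embedding per candidate entity (|C| = 10 here).
rng = np.random.default_rng(0)
entity_embeddings = rng.normal(size=(10, 16))

# Project to 2-D for plotting; perplexity must stay below the number
# of points, so a small value is used for this toy example.
points = TSNE(n_components=2, perplexity=3.0, init="random",
              random_state=0).fit_transform(entity_embeddings)
# `points` has shape (10, 2); each row can be scattered and labeled
# with its entity name.
```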

Conclusion and Future Work
This paper presents a novel dynamic graph transformer model and a pre-training task for extracting the implicit entities in articles. Experimental results show the usefulness of the proposed methods.
We also discuss what kinds of features our model captures.
In our previous work (Liou et al., 2021), we apply the proposed task to accelerate the journalists' working process and show that the extracted entities can be useful for downstream tasks such as news aggregation and stock movement prediction. In the future, we plan to apply the proposed approach to datasets with both graphical knowledge and textual content.