Combinatory Grammar Tells Underlying Relevance among Entities



Introduction
Given two entities in a sentence, relation extraction (RE) extracts the relation between them and thus serves as an important task in natural language processing (NLP). Recent neural approaches for RE (Zeng et al., 2014; Zhang and Wang, 2015; Xu et al., 2015; dos Santos et al., 2015; Zhang et al., 2015; Wang et al., 2016; Zhou et al., 2016; Zhang et al., 2017) with powerful encoders (e.g., Transformers) have shown outstanding performance on benchmark datasets, because the encoders are superior in capturing contextual information and thus obtain a deep understanding of the running text.

Figure 1: An example sentence with the CCG supertags of all words, where the supertag "(S\NP)/NP" of "produces" provides important cues to predict the relation between the two given named entities "food factory" and "ice cream" (highlighted in red).
To further improve model performance, extra knowledge resources, especially syntactic information, have been widely used for RE and demonstrated to be effective, because they provide structural information that is helpful for text understanding (Miwa and Bansal, 2016; Zhang et al., 2018; Sun et al., 2020; Chen et al., 2020). Specifically, existing approaches mainly focus on dependencies among words while paying limited attention to other types of syntactic structure, such as combinatory categorial grammar (CCG). As an important part of this lexicalized grammatical formalism, CCG supertags provide the lexical category of the associated words, which offers both syntactic and semantic knowledge for text understanding and is thus potentially beneficial for RE. Figure 1 shows a typical example. Herein, the supertag of "produces" (which is "(S\NP)/NP") indicates that the predicate requires two nominal arguments, and the supertags of the two given entities (which are highlighted in red) suggest that they could serve as good candidates. Therefore, the supertags suggest that "produces" contributes more to extracting the relation between the two entities and thus guide a model to make a correct prediction.
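To make the example concrete, the following minimal Python sketch (ours, purely for illustration and not part of the proposed model) shows how the category "(S\NP)/NP" of a transitive verb first consumes an NP on its right and then an NP on its left via CCG's forward and backward application rules, yielding a complete sentence S.

```python
# Illustration (not from the paper) of how the CCG category "(S\NP)/NP"
# of a transitive verb combines with its two nominal arguments.

def forward_apply(left_cat: str, right_cat: str):
    """Forward application: X/Y applied to Y on its right yields X."""
    if left_cat.endswith("/" + right_cat):
        return left_cat[: -len("/" + right_cat)]
    return None

def backward_apply(left_cat: str, right_cat: str):
    """Backward application: Y followed by X\\Y yields X."""
    if right_cat.endswith("\\" + left_cat):
        return right_cat[: -len("\\" + left_cat)]
    return None

verb = "(S\\NP)/NP"                       # supertag of "produces"
vp = forward_apply(verb, "NP")            # consume the object NP -> "(S\NP)"
s = backward_apply("NP", vp.strip("()"))  # consume the subject NP -> "S"
print(vp, s)                              # (S\NP) S
```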
In this paper, we propose to leverage CCG supertags to detect the relation between entities. In doing so, we use an existing CCG supertagger to annotate the supertags of the input text and then run a multi-task learning process to learn from human-annotated RE labels and auto-annotated supertags, where an attention mechanism is performed over all input words to distinguish the important ones for RE, with the attention weights guided by the supertag decoding process. Therefore, our model is able to learn CCG information through supertag decoding rather than using the supertags as input features, which allows our approach to run efficiently in inference. Experimental results on two English benchmark datasets for RE, i.e., ACE2005EN and SemEval 2010 Task 8, demonstrate the effectiveness of our approach, where our approach outperforms strong baselines and achieves state-of-the-art performance on both datasets.

Preliminaries
RE is conventionally regarded as a text classification task over a given input sentence (denoted as $\mathcal{X} = x_1, \cdots, x_n$) and two entities in it (denoted as $E_1$ and $E_2$), which can be formalized as

$$\widehat{y} = \arg\max_{y \in \mathcal{T}} p(y \mid \mathcal{X}, E_1, E_2)$$

where $p$ computes the probability of the relation label $y \in \mathcal{T}$ ($\mathcal{T}$ is the label set) and $\widehat{y}$ is the model prediction. In doing so, special tokens, i.e., "<e1>" and "</e1>" for $E_1$ and "<e2>" and "</e2>" for $E_2$, are firstly inserted around the entities to mark their positions. Next, the sentence (with the special entity markers) is fed into an encoder, where the obtained hidden vector for the $i$-th word $x_i$ is denoted as $\mathbf{h}_i$. Third, the hidden vectors of the words belonging to a particular entity (i.e., $E_j$, $j = 1, 2$) are extracted and fed to a multi-layer perceptron (MLP) for further encoding, where the resulting vectors are passed through a max pooling layer to obtain the entity representation $\mathbf{o}_j$:

$$\mathbf{o}_j = \text{MaxPooling}\left(\left\{\text{MLP}(\mathbf{h}_i) : x_i \in E_j\right\}\right)$$

Then, we concatenate the entity representations, $\mathbf{o} = \mathbf{o}_1 \oplus \mathbf{o}_2$, and feed the resulting $\mathbf{o}$ into a softmax classifier to predict the relation $\widehat{y}$.
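For clarity, the backbone process can be sketched in PyTorch as follows. This is a minimal illustration of the standard procedure described above; the encoder interface, dimensions, and mask-based entity extraction are our assumptions, not the paper's actual code.

```python
import torch
import torch.nn as nn

class BackboneRE(nn.Module):
    """Minimal sketch of the standard RE backbone described above."""

    def __init__(self, encoder, hidden_dim=1024, num_relations=10):
        super().__init__()
        self.encoder = encoder                          # e.g., BERT-large
        self.entity_mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        # The two pooled entity representations are concatenated for classification.
        self.classifier = nn.Linear(2 * hidden_dim, num_relations)

    def forward(self, input_ids, attention_mask, e1_mask, e2_mask):
        # h: (batch, seq_len, hidden) encoder hidden states; the input already
        # contains the <e1>/<e2> marker tokens around the two entities.
        h = self.encoder(input_ids, attention_mask=attention_mask)[0]
        e = self.entity_mlp(h)                          # further encode every word
        o = []
        for mask in (e1_mask, e2_mask):                 # (batch, seq_len) 0/1 masks
            masked = e.masked_fill(mask.unsqueeze(-1) == 0, float("-inf"))
            o.append(masked.max(dim=1).values)          # max pooling over entity words
        return self.classifier(torch.cat(o, dim=-1))    # logits; softmax in the loss
```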

The Proposed Approach
To leverage the information carried by CCG supertags, one straightforward approach is to use an off-the-shelf CCG supertagger to annotate the supertags of each input word and then use them as extra word-level features by concatenating them with the input words before sending them to the text encoder. However, such an approach requires CCG supertagging as a pre-processing step in inference, which is not efficient, especially when the data to be processed is relatively large. Considering that multi-task learning serves as an effective approach to learn from different tasks and does not require the labels of different tasks as extra input, we propose to learn the CCG information through a multi-task learning process and then use the CCG information to guide RE through an attention mechanism over all input words. The overall architecture of our model is illustrated in Figure 2, where the backbone model for RE follows the standard process illustrated on the left, and the CCG supertag decoding process, as well as the attention mechanism, is illustrated on the top right. For CCG supertag decoding, we firstly take the hidden vector $\mathbf{h}_i$ of the word $x_i$ obtained from the encoder and pass it through an MLP:

$$\mathbf{h}^s_i = \text{MLP}(\mathbf{h}_i)$$

where the obtained $\mathbf{h}^s_i$ is mapped to the CCG supertag output space by a trainable matrix $\mathbf{W}^s$ and then a softmax classifier is applied to predict the supertag annotated by an existing supertagger:

$$\widehat{y}^s_i = \arg\max \text{softmax}(\mathbf{W}^s \cdot \mathbf{h}^s_i)$$

Simultaneously, $\mathbf{h}^s_i$, as well as the entity representation $\mathbf{o}_j$ obtained from the backbone model, is fed into an attention module to enhance the RE prediction process. Specifically, we use two trainable matrices $\mathbf{W}^k$ and $\mathbf{W}^v$ to map $\mathbf{h}^s_i$ to the key vector $\mathbf{k}_i = \mathbf{W}^k \cdot \mathbf{h}^s_i$ and the value vector $\mathbf{v}_i = \mathbf{W}^v \cdot \mathbf{h}^s_i$, respectively.
Then, for entity $E_j$, we compute the attention weight $p_{j,i}$ assigned to the value $\mathbf{v}_i$ through

$$p_{j,i} = \frac{\exp(\mathbf{o}_j \cdot \mathbf{k}_i)}{\sum_{i'=1}^{n} \exp(\mathbf{o}_j \cdot \mathbf{k}_{i'})}$$

Afterwards, we apply $p_{j,i}$ to the value vector $\mathbf{v}_i$ and obtain the weighted sum vector $\mathbf{a}_j$ via

$$\mathbf{a}_j = \sum_{i=1}^{n} p_{j,i} \cdot \mathbf{v}_i$$

Finally, we concatenate $\mathbf{a}_j$ with the entity representation $\mathbf{o}_j$ to obtain the enhanced entity representation $\mathbf{h}^{E_j} = \mathbf{o}_j \oplus \mathbf{a}_j$. Once the enhanced representations of the two entities are computed, we concatenate them and feed the resulting vector to the softmax classifier, following the standard RE decoding process.
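For concreteness, the supertag decoding head and the CCG guided attention can be sketched in PyTorch as follows. This is our illustrative reconstruction under the equations above, not the authors' released implementation; the layer sizes, the supertag vocabulary size, and the dot-product scoring between $\mathbf{o}_j$ and the keys are assumptions.

```python
import torch
import torch.nn as nn

class CCGGuidedAttention(nn.Module):
    """Sketch of the supertag decoding head and the CCG guided attention."""

    def __init__(self, hidden_dim=1024, num_supertags=400):  # vocab size is a placeholder
        super().__init__()
        self.supertag_mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        self.W_s = nn.Linear(hidden_dim, num_supertags)  # supertag classifier
        self.W_k = nn.Linear(hidden_dim, hidden_dim)     # key projection
        self.W_v = nn.Linear(hidden_dim, hidden_dim)     # value projection

    def forward(self, h, o_j):
        # h: (batch, seq_len, hidden) encoder outputs
        # o_j: (batch, hidden) representation of entity E_j
        h_s = self.supertag_mlp(h)                       # h^s_i = MLP(h_i)
        supertag_logits = self.W_s(h_s)                  # for the supertagging loss
        k, v = self.W_k(h_s), self.W_v(h_s)              # keys and values
        scores = torch.einsum("bd,bnd->bn", o_j, k)      # o_j . k_i for every word
        p = torch.softmax(scores, dim=-1)                # attention weights p_{j,i}
        a_j = torch.einsum("bn,bnd->bd", p, v)           # weighted sum a_j
        h_Ej = torch.cat([o_j, a_j], dim=-1)             # enhanced entity representation
        return h_Ej, supertag_logits
```

In training, supertag_logits would receive a cross-entropy loss against the auto-annotated supertags, alongside the RE loss computed over the enhanced entity representations.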
In training, the model is jointly optimized on RE and CCG supertagging (e.g., by summing the cross-entropy losses of the two tasks), which allows our model to learn CCG information and use it to enhance the entity representations through the attention mechanism, with the attention weights assigned to different input words guided by the learnt CCG information.

Experiment Settings

We run experiments on two English benchmark datasets for RE, namely ACE2005EN (ACE05) and SemEval 2010 Task 8 (SemEval) (Hendrickx et al., 2010), where we use the official training and test split (SemEval does not have an official development set). Table 1 reports the statistics of the datasets. We try two graph-based models as our baselines for comparison, namely graph convolutional networks (GCN) (Kipf and Welling, 2016) and graph attention networks (GAT) (Veličković et al., 2017). We use the dependency trees obtained through Stanford CoreNLP Toolkits (Manning et al., 2014) to build the word graph and use the graph as additional input to the GCN and GAT models.
We use the CCG supertagger proposed by Tian et al. (2020b) to annotate the CCG supertags for multi-task learning. For the encoder, considering that a high-quality text representation plays an important role in achieving good model performance in downstream NLP tasks (Song and Shi, 2018; Han et al., 2018; Devlin et al., 2019; Radford et al., 2019; Tian et al., 2020a; Lewis et al., 2020; Diao et al., 2020; Raffel et al., 2020; Diao et al., 2021; Song et al., 2021), we try the large version of BERT (Devlin et al., 2019) (which achieves state-of-the-art performance in many NLP tasks) with the default settings (i.e., 24 layers of multi-head attentions with 1024-dimensional hidden vectors). For evaluation, we follow previous studies to use the standard micro-F1 scores for ACE05 and the macro-averaged F1 scores for SemEval. In our experiments, we try different combinations of hyper-parameters (which are illustrated in Table 2, with the best ones highlighted in boldface) and tune them on the dev set, then evaluate on the test set with the model that achieves the highest F1 score on the dev set.
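For reference, the two evaluation settings correspond to the standard micro- and macro-averaged F1; a minimal sketch with scikit-learn (our illustration, not the official scorers) is shown below. Note that the official SemEval scorer computes macro-F1 over the relation classes excluding "Other", which can be emulated with the labels argument of f1_score.

```python
from sklearn.metrics import f1_score

# Toy gold/predicted relation labels (placeholders, not real dataset labels).
gold = ["location", "part-whole", "location", "other"]
pred = ["location", "part-whole", "other", "other"]

micro_f1 = f1_score(gold, pred, average="micro")  # evaluation setting for ACE05
macro_f1 = f1_score(gold, pred, average="macro")  # evaluation setting for SemEval
print(micro_f1, macro_f1)
```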

Results
Table 3 shows the average F1 scores of different models (including the vanilla BERT-large baseline, the GCN and GAT baselines, and our approach) on the development and test sets of ACE05 and SemEval, where the size of different models (in terms of the number of parameters) and the inference speed (in terms of the number of processed sentences per second) are also reported for reference. There are several observations. First, our model works well with the BERT-large pre-trained language model, where consistent improvements are observed over the vanilla BERT baselines on both datasets, although the BERT baselines have already achieved outstanding performance. Second, it is promising to observe that our model outperforms the standard GCN and GAT that leverage dependencies on both datasets, which further confirms the effectiveness of our approach. We attribute this observation to the superiority of CCG supertags, which carry both syntactic and semantic information of the running text and are thus able to provide a deeper analysis of the text to guide the relation prediction process. Third, it is observed that our model performs more efficiently compared with GCN and GAT, because the CCG information is learnt through training in our approach and no supertags are required as input in inference, whereas GCN and GAT require the input to be parsed before they can predict the relation.
We further compare our approach with recent previous studies and report the results in Table 4. It is promising to observe that our approach outperforms previous studies (including the ones with powerful encoders and syntactic information) and achieves state-of-the-art performance on both datasets, which further confirms the effectiveness of our approach.

Case Study
To illustrate how CCG information guides the relation extraction process through the attention mechanism, in Figure 3 we visualize the average attention weights assigned to different words in an example sentence (the entities are highlighted in red and their gold standard relation is "location") by word background color, where higher weights correspond to deeper colors. The CCG supertags of the words are shown below the attached words. It is worth noting that the supertags are given for better illustration; they are not used as input in inference. In this case, our model is able to distinguish that "in" tends to be the head of a prepositional phrase (PP) that is attached to a predicate, based on the learnt CCG supertag information, and that its argument noun phrase is exactly one of the given entities. Therefore, our model assigns the highest weight to "in", which strongly suggests a "location" relation, and thus results in the correct relation prediction.
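A Figure 3-style visualization can be produced directly from the model's attention weights; the following matplotlib sketch (ours, purely illustrative, with made-up words and weights) shades each word's background in proportion to its weight.

```python
import matplotlib.pyplot as plt

def visualize_attention(words, weights):
    """Render words with background color proportional to attention weight,
    similar in spirit to Figure 3 (illustrative, not the paper's script)."""
    fig, ax = plt.subplots(figsize=(len(words), 1))
    ax.set_xlim(-0.5, len(words) - 0.5)
    ax.axis("off")
    for idx, (w, p) in enumerate(zip(words, weights)):
        # Deeper red background for a higher attention weight.
        ax.text(idx, 0.5, w, ha="center", va="center",
                bbox=dict(facecolor=(1.0, 1.0 - p, 1.0 - p), edgecolor="none"))
    plt.show()

# Example with made-up weights: "in" receives the highest weight.
visualize_attention(["The", "factory", "in", "Paris", "..."],
                    [0.05, 0.10, 0.60, 0.20, 0.05])
```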

Conclusion
In this paper, we propose a neural approach for improving RE through a CCG guided attention mechanism, where our model learns the CCG information through a multi-task learning process that predicts RE relations and CCG supertags simultaneously, and uses the learnt CCG information to compute the attention weights assigned to different words. In doing so, our approach is able to learn the CCG information through CCG supertag decoding rather than using it as additional input features, which allows our model to run efficiently in inference. Experimental results on two English benchmark datasets for RE (i.e., ACE05 and SemEval) demonstrate the effectiveness of our approach, where state-of-the-art performance is obtained on both datasets.

Figure 2: The overall architecture of the proposed approach for RE with CCG supertag guided attentions as the enhancement. The entities are highlighted in red.

Figure 3: Visualizations of the weights assigned to different words for an example input sentence, where the supertags associated with them are illustrated at the bottom. Darker background colors refer to higher weights.

Table 1: The statistics of the two English benchmark datasets used in our experiments for relation extraction, where the numbers of sentences, tokens, and instances (i.e., entity pairs) are reported.

Table 2: The hyper-parameters tested in tuning our models. The best ones used in our final experiments are highlighted in boldface.

Table 4: The comparison of F1 scores between previous studies and our best model with BERT-large on the test sets of ACE05 and SemEval. Previous studies that leverage syntactic information (e.g., the dependency tree of the input sentence) are marked by "†".