Multi-facet Universal Schema

Universal schema (USchema) assumes that two sentence patterns that share the same entity pairs are similar to each other. This assumption is widely adopted for solving various types of relation extraction (RE) tasks. Nevertheless, each sentence pattern could contain multiple facets, and not every facet is similar to all the facets of another sentence pattern co-occurring with the same entity pair. To address the violation of the USchema assumption, we propose multi-facet universal schema, which uses a neural model to represent each sentence pattern as multiple facet embeddings and encourages one of these facet embeddings to be close to that of another sentence pattern if they co-occur with the same entity pair. In our experiments, we demonstrate that multi-facet embeddings significantly outperform their single-facet embedding counterpart, compositional universal schema (CUSchema) (Verga et al., 2016), in distantly supervised relation extraction tasks. Moreover, we can also use multiple embeddings to detect the entailment relation between two sentence patterns when no manual label is available.


Introduction
Relation extraction (RE) is a crucial step in automatic knowledge base construction (AKBC). A major challenge of RE is that the frequency of relations in the real world follows a long-tail distribution, but collecting sufficient human annotations for every relation is infeasible (Han et al., 2020).
Distant supervision was proposed to alleviate this issue (Mintz et al., 2009). Distant supervision assumes that a sentence pattern expresses a relation if the sentence pattern co-occurs with an entity pair and the entity pair has the relation. For example, we assume the sentence pattern "$ARG1, the partner of fellow $ARG2" is likely to express the spouse relation if we observe a text clip "... Angelina Jolie, the partner of fellow Brad Pitt ..." in our training corpus and a knowledge base tells us that Angelina Jolie and Brad Pitt have the spouse relation. Accordingly, we can infer that another entity pair is likely to have the spouse relation if we observe the text ", the partner of fellow" between them in a new corpus.

Figure 1: Comparison between the multi-facet and compositional universal schema. In our training loss, we encourage one of the facet embeddings from a sentence pattern to be similar to the embedding of its co-occurring entity pair.
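To make the distant supervision assumption concrete, the labeling step can be sketched as follows; the toy corpus, KB entries, and function name are illustrative, not the paper's actual pipeline:

```python
# A sentence pattern is paired with a KB relation whenever the entity
# pair it co-occurs with holds that relation in the KB (toy data).
kb = {
    ("Angelina Jolie", "Brad Pitt"): {"per:spouse"},
    ("Kristen Bell", "Dax Shepard"): {"per:spouse"},
}

corpus = [
    ("Angelina Jolie", "$ARG1, the partner of fellow $ARG2", "Brad Pitt"),
    ("Kristen Bell", "$ARG1, the wife of fellow $ARG2", "Dax Shepard"),
]

def distant_labels(corpus, kb):
    """Assign every KB relation of an entity pair to the pattern it co-occurs with."""
    labels = []
    for head, pattern, tail in corpus:
        for rel in kb.get((head, tail), ()):
            labels.append((pattern, rel))
    return labels

labels = distant_labels(corpus, kb)
```

Any pattern observed between an entity pair with an unknown relation would simply produce no training label under this scheme.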

* indicates equal contribution
Universal schema (Riedel et al., 2013) extends this assumption by treating every sentence pattern as a relation, which means we assume that sentence patterns or relations in a knowledge base are similar if they co-occur with the same entity pair. For example, we assume "$ARG1, the partner of fellow $ARG2" and "$ARG1, the wife of fellow $ARG2" are similar if they both co-occur with (Kristen Bell, Dax Shepard). Consequently, we can infer that "$ARG1, the wife of fellow $ARG2" also implies spouse relation as "$ARG1, the partner of fellow $ARG2" even if the knowledge base does not record the spouse relation between Kristen Bell and Dax Shepard.
Compositional universal schema (Verga et al., 2016) realizes the idea by using an LSTM (Hochreiter and Schmidhuber, 1997) to encode each sentence pattern into an embedding and encouraging the embedding to be similar to the embedding of the co-occurring entity pair. As in the lower part of Figure 1, the model makes the embeddings of two sentence patterns similar if they co-occur with the same entity pair. Baldini Soares et al. (2019) rely on a similar assumption and achieve state-of-the-art results on supervised RE tasks by replacing the LSTM with a large pre-trained language model. The variants of universal schema have many different applications, including multilingual RE, knowledge base construction (Toutanova et al., 2015; Verga et al., 2017), question answering (Das et al., 2017), document-level RE (Verga et al., 2018), N-ary RE (Akimoto et al., 2019), open information extraction (Zhang et al., 2019), and unsupervised relation discovery (Percha and Altman, 2018).
Nevertheless, one sentence pattern could contain multiple facets, and each facet could imply a different relation. In Figure 1, "$ARG1, the partner of fellow $ARG2" could imply that the entity pair has the spouse relation, the co-worker relation, or both. "$ARG1 moved in with $ARG2" could imply the spouse relation, the parent relation, etc. If we squeeze the facets of a sentence pattern into a single embedding, the embedding is more likely to be affected by the irrelevant facets of other patterns co-occurring with the same entity pair (e.g., "$ARG1 moved in with $ARG2" might incorrectly imply the co-worker relation).
Another limitation is that a single-embedding representation can only provide a symmetric similarity measurement between two sentence patterns. Thus, an open research challenge is to predict the entailment direction between two sentence patterns based only on their co-occurring entity pair information.
To overcome the challenges, we propose multi-facet universal schema, where we assume that two sentence patterns share a similar facet if they co-occur with the same entity pair. As in Figure 1, we use a neural encoder and decoder to predict multiple facet embeddings of each sentence pattern and encourage one of the facet embeddings to be similar to the entity pair embedding. As a result, the facets that are irrelevant to the relation between the entity pairs are less likely to affect the embeddings of entity pairs and other related sentence patterns. For example, the parent facet of "$ARG1 moved in with $ARG2" could be excluded when updating the embeddings of (Angelina Jolie, Brad Pitt).
In our experiments, we first compare the multi-facet embeddings with the single-facet embedding in distantly supervised RE tasks. The results demonstrate that multiple facet embeddings significantly improve the similarity measurement between the sentence patterns and knowledge base relations. Besides RE, we also apply multi-facet embeddings to unsupervised entailment detection tasks. In a newly collected dataset, we show that multi-facet universal schema significantly outperforms the other unsupervised baselines.

Methods
Our method is illustrated in Figure 2. In Section 2.1, we first provide our problem setup: We are given a knowledge base (KB) and a text corpus during training. Our goal is to extract relations by measuring the similarity between KB relations and an (unseen) sentence pattern or to detect entailment between two sentence patterns. In Section 2.2, we introduce our neural model, which predicts multi-facet embeddings of each sentence pattern. Next, in Section 2.3, we describe our objective function, which encourages the embeddings of co-occurring entity pairs to be close to the embeddings of their closest pattern facets. Finally, in Section 2.4, we explain that multi-facet embeddings could be viewed as the cluster centers of possibly co-occurring entity pairs, and in Section 2.5, we provide our scoring functions for distantly supervised RE and unsupervised entailment tasks.

Background and Problem Setup
Our RE problem setup is the same as compositional universal schema (Verga et al., 2016). First, we run named entity recognition (NER) and entity linking on a raw corpus. After identifying the entity pairs in each sentence, we prepare a co-occurrence matrix as in Figure 2. Similarly, we represent the KB relations between entity pairs as a co-occurrence matrix and merge the matrices from the KB and the training corpus. The merged matrix has y_{i,j} = 1 if the ith sentence pattern or KB relation co-occurs with the jth entity pair and y_{i,j} = 0 otherwise.
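A minimal sketch of how the merged binary matrix might be assembled (toy data; the real pipeline first runs NER and entity linking to find the co-occurrences):

```python
# Rows are sentence patterns plus KB relations; columns are entity pairs.
pattern_pairs = {
    "$ARG1, the partner of fellow $ARG2": {("Kristen Bell", "Dax Shepard")},
    "$ARG1 moved in with $ARG2": {("Angelina Jolie", "Brad Pitt")},
}
kb_rel_pairs = {
    "per:spouse": {("Angelina Jolie", "Brad Pitt"), ("Kristen Bell", "Dax Shepard")},
}

merged = {**pattern_pairs, **kb_rel_pairs}   # merge corpus and KB matrices
rows = list(merged)
cols = sorted({p for pairs in merged.values() for p in pairs})

# y[i][j] = 1 iff the ith row co-occurs with the jth entity pair
y = [[int(col in merged[row]) for col in cols] for row in rows]
```

In practice this matrix is extremely sparse, which is the motivation for predicting the facet embeddings with a neural model rather than factorizing the matrix directly.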
During testing, we use NER to extract an entity pair and the sentence pattern, which might not have been seen in the training corpus. Next, we extract relations by computing the similarity between the sentence pattern embeddings and the embeddings of the applicable KB relations. Besides RE, we also detect the entailment between two sentence patterns by comparing their embeddings.

Figure 2: An illustration of the proposed method. The training signal comes from the co-occurrence matrices of the KB and training text corpus on the right. On the lower left, we visualize our neural encoder, which captures the compositional meaning of tokens in the sentence pattern, and our neural decoder, which models the dependency among multiple facet embeddings. When a sentence pattern co-occurs with an entity pair, the training loss minimizes the distance between the entity pair embedding and the closest facet embedding of the sentence pattern (e.g., 0.2 between s_{i,2} and e_1). Trainable parameters in our model are highlighted using red borders. On the upper left, we visualize the embedding space to establish the connection between our method and clustering.

Neural Encoder and Decoder
We use a neural model to predict K facet embeddings of each sentence pattern. The goal is similar to Chang et al. (2021), who predict a fixed number of embeddings of a sentence, so we adopt their neural model as shown in Figure 2. For the ith sentence pattern S_i, we append an <eos> token to its end and use a 3-layer Transformer (Vaswani et al., 2017) encoder T^E to model the compositional meaning of the input word sequence:

u_{i,1}, ..., u_{i,|S_i|}, u_{i,<eos>} = T^E(S_i <eos>),

where u_{i,l} is an embedding contextualized by the encoder. In the experiment, we also replace the Transformer with a bidirectional LSTM (bi-LSTM) to show that the improvement of multi-facet embeddings is independent of the encoder choice.
The embedding u_{i,<eos>} represents the whole sentence pattern; we use K different linear layers L^d_k to transform the embedding into the inputs of our decoder:

c_{i,k} = L^d_k(u_{i,<eos>}), k = 1, ..., K.

The facets in a sentence pattern often have some dependency. For example, the patterns that express the partnership between two people might also express the collaboration relation between two companies. To leverage the dependency, we use another 3-layer Transformer as our decoder T^D. Besides the self-attention, we allow the hidden states in the decoder to query the contextualized word embeddings u_{i,l} from the encoder (Vaswani et al., 2017) and output the embeddings corresponding to the different facets:

d_{i,1}, ..., d_{i,K} = T^D(c_{i,1}, ..., c_{i,K}; u_{i,1}, ..., u_{i,<eos>}).

Notice that we do not use autoregressive decoding as in Vaswani et al. (2017), so our decoder could also be viewed as another encoder with attention to the output of the encoder T^E. Finally, to convert the hidden state size to the entity embedding size, we let the outputs of the decoder go through another linear layer L^o to get the facet embeddings (i.e., sentence pattern embeddings):

s_{i,k} = L^o(d_{i,k}).
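As a rough, non-authoritative sketch of this data flow, a single NumPy cross-attention step can stand in for the 3-layer Transformers T^E and T^D; all weights below are random placeholders, and the shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
H, E, K, L = 8, 4, 3, 5   # hidden size, entity-emb size, #facets, pattern length

# Stand-in for the encoder T^E: contextualized token embeddings u_{i,l}
# for a pattern of L tokens (random here; a real model computes these).
u = rng.normal(size=(L, H))
u_eos = u[-1]                                   # embedding of the appended <eos>

# K linear layers L^d_k turn u_<eos> into the K decoder inputs.
Ld = rng.normal(size=(K, H, H))
queries = np.stack([W.T @ u_eos for W in Ld])   # (K, H)

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

# One cross-attention step standing in for the decoder T^D: each facet
# query attends over the encoder states u_{i,l}.
attn = softmax(queries @ u.T / np.sqrt(H))      # (K, L)
d = attn @ u                                    # (K, H) facet hidden states

# Output layer L^o maps the hidden size to the entity-embedding size.
Lo = rng.normal(size=(H, E))
s = d @ Lo                                      # (K, E) facet embeddings
```

The key structural point is that the K facet embeddings are produced jointly, so the decoder can model the dependency among facets.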

Objective Function
When measuring the distance between the jth entity pair and the ith sentence pattern, we compute the Euclidean distance between the entity pair embedding ẽ_j and its closest facet embedding of the ith sentence pattern. The distance is defined as

D({s_{i,k}}_{k=1}^K, ẽ_j) = min_k min_{0 ≤ η_k ≤ 1} ||η_k s_{i,k} - ẽ_j||^2,    (1)

where the entity pair embedding is normalized (i.e., ||ẽ_j|| = 1). During testing, we ignore the magnitude of facet embeddings, so we use η_k to eliminate the magnitude of the facet embeddings s_{i,k} during training. We do not allow negative η_k, to prevent the gradient flow from pushing s_{i,k} toward the inverse direction of ẽ_j, and we ensure η_k ≤ 1 to avoid the neural model outputting s_{i,k} with a very small magnitude.

As in Figure 2, we minimize the distance D({s_{i,k}}_{k=1}^K, ẽ_j) in our loss function when the ith sentence pattern co-occurs with the jth entity pair (i.e., y_{i,j} = 1). For negative samples (i.e., y_{i,j} = 0), we maximize the distance instead. That is, the major term of our loss function is defined as

L_D = Σ_{(i,j)∈R} r_{i,j} (2 y_{i,j} - 1) D({s_{i,k}}_{k=1}^K, ẽ_j),

and the other regularization term Ω in the loss function is described in the appendix. R is a set that includes all positive and negative samples. Positive samples are the pairs (i,j) such that y_{i,j} = 1, and the negative samples are constructed by pairing a randomly selected sentence pattern with the jth entity pair. To balance the influence of popular entity pairs (i.e., entity pairs that co-occur with many sentence patterns) and rare entity pairs on our model, we set the weight of each pair such that r_{i,j} ∝ 1 / Σ_{i'} y_{i',j} and (1/|R|) Σ_{(i,j)∈R} r_{i,j} = 1.

We generate the embeddings for KB relations in a similar way. We use a single token to represent the relation and append an <eos> (e.g., per:spouse <eos>) to form the input of our neural model. The KB relations usually co-occur with more entity pairs, so we set the number of facet embeddings for KB relations, K_rel, to be larger than the number of facet embeddings for sentence patterns, K.
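The inner minimization over the scale η_k can be solved in closed form for a fixed facet: the unconstrained optimum of ||η s - ẽ||^2 is η* = sᵀẽ / ||s||², clipped to [0, 1]. A small NumPy sketch of the distance (the closed form is our derivation; the text only states the constraints on η_k):

```python
import numpy as np

def facet_distance(S, e):
    """D({s_k}, e~): min over facets of ||eta_k * s_k - e~||^2, where
    eta_k in [0, 1] is solved in closed form: eta* = clip(s.e / ||s||^2, 0, 1)."""
    e = e / np.linalg.norm(e)           # the entity pair embedding is normalized
    best = np.inf
    for s in S:
        eta = np.clip(s @ e / (s @ s), 0.0, 1.0)
        best = min(best, float(np.sum((eta * s - e) ** 2)))
    return best

S = np.array([[2.0, 0.0],               # facet aligned with e (up to scale)
              [0.0, 5.0]])              # facet orthogonal to e
e = np.array([1.0, 0.0])
d = facet_distance(S, e)                # the aligned facet gives distance 0
```

Because only the closest facet enters the distance, gradients from a positive pair update one facet embedding and leave the others free to model different relations.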

Connection to Clustering
If a sentence pattern contains multiple facets that describe different relations between entity pairs, the pattern often co-occurs with different kinds of entity pairs. For example, "$ARG1 's partner $ARG2" in Figure 2 could express the collaboration relationship between two companies or the partnership between two people, so the sentence pattern could co-occur with two companies such as (Google, Facebook) and two people such as (Bob Bryan, Mike Bryan).
Different kinds of entity pairs often have very different embeddings, so we could discover the facets of sentence patterns by clustering the embeddings of entity pairs. Here, a facet refers to a mode of the embedding distribution of the entity pairs that could possibly co-occur with the sentence pattern. A facet could be represented by multiple facet embeddings and each facet embedding corresponds to a cluster center of the entity pair embeddings. Hence, although the number of facet embeddings K is fixed for all the sentence patterns, our model can capture the facets of the sentence patterns well when the number of facets is less than K.
In equation 1, we choose the closest facet embedding of the sentence pattern for each co-occurring entity pair embedding and minimize their distance. For example, s_{i,2} and the embedding of (Bob Bryan, Mike Bryan) are pulled closer in Figure 2. Minimizing equation 1 by passing the gradient through the scaled facet embedding η_k s_{i,k} is the same as minimizing a K-means loss, so the loss term induced by positive sample pairs encourages each s_{i,k} to become the cluster center of its nearby co-occurring entity pair embeddings. The details of our training algorithm can be found in the appendix.
The co-occurrence matrices in RE tasks are usually extremely sparse, and most of the sentence patterns only co-occur with a few entity pairs, which makes it difficult to derive multiple high-quality embeddings by clustering the co-occurring entity pair embeddings as in multi-sense word embedding methods such as Neelakantan et al. (2014). The proposed method solves this sparsity challenge by predicting the cluster centers using a neural model. For instance, even if "$ARG1 's partner $ARG2" does not co-occur with many entity pairs, its embeddings are encouraged to be close to the embeddings of entity pairs that co-occur with other similar patterns (e.g., "$ARG1 and her partner $ARG2").
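The clustering view can be illustrated with one step of Lloyd's K-means algorithm, treating facet embeddings as centers and entity pair embeddings as points (toy data; the actual model predicts the centers with the neural encoder-decoder rather than running K-means explicitly):

```python
import numpy as np

def kmeans_step(facets, pairs):
    """One Lloyd step: each entity-pair embedding pulls its closest facet
    embedding toward it, so facets converge to cluster centers."""
    dists = ((pairs[:, None, :] - facets[None, :, :]) ** 2).sum(-1)  # (n, K)
    assign = dists.argmin(1)                 # closest facet per entity pair
    new = facets.copy()
    for k in range(len(facets)):
        members = pairs[assign == k]
        if len(members):
            new[k] = members.mean(0)         # center moves to its cluster mean
    return new, assign

# Two well-separated groups of entity-pair embeddings, e.g., (people, people)
# vs. (company, company) pairs.
pairs = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
facets = np.array([[1.0, 1.0], [9.0, 9.0]])
facets, assign = kmeans_step(facets, pairs)
```

This mirrors how, under the min-distance loss, a pattern with two kinds of co-occurring entity pairs ends up with two distinct facet embeddings.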

Scoring Functions
In compositional universal schema, the similarity between the ith and jth sentence patterns is measured by the symmetric cosine similarity s̄_{i,1}^T s̄_{j,1}, where s̄_{i,1} = s_{i,1} / ||s_{i,1}||. When using multiple embeddings to represent a sentence pattern, we can compute the asymmetric similarity as

Asym({s_{i,k}}, {s_{j,m}}) = (1/K) Σ_m max_k s̄_{i,k}^T s̄_{j,m}.

In the example of Figure 3, a red square s̄_{i,k} is close to all the blue points, which leads to a high Asym({s_{i,k}}, {s_{j,m}}).
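A sketch of this asymmetric score; the exact form of Asym is our reconstruction from the surrounding description (each facet of the jth pattern is matched to its closest facet of the ith pattern, then averaged):

```python
import numpy as np

def normalize(M):
    return M / np.linalg.norm(M, axis=1, keepdims=True)

def asym(Si, Sj):
    """Asym({s_i,k}, {s_j,m}): for every facet of pattern j, take the cosine
    similarity to its closest facet of pattern i, then average."""
    sims = normalize(Si) @ normalize(Sj).T          # (K_i, K_j) cosine matrix
    return float(sims.max(axis=0).mean())           # best match per j-facet

Si = np.array([[1.0, 0.0], [0.0, 1.0]])  # pattern covering two directions
Sj = np.array([[1.0, 0.0], [1.0, 0.0]])  # pattern concentrated on one direction
# Si's facets cover all of Sj's facets, but not vice versa.
```

Note the asymmetry: asym(Si, Sj) is high when every facet of the jth pattern is near some facet of the ith pattern, which need not hold in the other direction.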
Between two sentence patterns with an entailment relation, we empirically find that the embeddings of the premise (the more specific pattern) often include some facet embeddings that are far away from all the embeddings of its hypothesis (the more general pattern). Relying on this tendency, we can detect the direction of the entailment relation. For example, the ith sentence pattern (red squares) in Figure 3 is more likely to be the premise if the ith and jth (blue circles) sentence patterns have an entailment relation.
We suspect the reason is that more specific patterns could contain more words that are similar to the words of other patterns expressing different relations. For example, "$ARG1 , the wife of fellow $ARG2" has a facet embedding for the spouse relation and another facet embedding for the co-worker relation because the pattern has high word overlap with "$ARG1 , the wife of $ARG2" and "$ARG1 and her fellow $ARG2". Another possible reason is that the articles in our corpus tend to use more specific patterns to express the relation between a pair of entities (Shwartz et al., 2017).
When performing RE, we compute the symmetric similarity between the ith sentence pattern and the jth KB relation by combining the asymmetric similarities in both directions:

Sim({s_{i,k}}, {s_{j,m}}) = (Asym({s_{i,k}}, {s_{j,m}}) + Asym({s_{j,m}}, {s_{i,k}})) / 2.    (4)

Experiments
We primarily compare our method with compositional universal schema (CUSchema) (Verga et al., 2016) because CUSchema is one of the state-of-the-art RE methods in the small-model regime (without using large pre-trained language models) (Chang et al., 2016; Chaganty et al., 2017). [1] In Section 3.1, we visualize and analyze the facet embeddings. Next, we use distantly supervised RE tasks to evaluate our symmetric similarity measurement in Section 3.2 and detect entailment between sentence patterns to evaluate our asymmetric similarity measurement in Section 3.3.

Embedding Visualization
We visualize the embeddings of sentence patterns and a KB relation from the single embedding model and the multi-facet embedding model that perform the best in the RE tasks (i.e., Ours (Single-Trans) and Ours (Trans) in Table 1). We project the facet embeddings to a 2-dimensional space using multidimensional scaling (MDS) (Borg and Groenen, 2005) and visualize the embeddings of one KB relation and three related sentence patterns in Figure 4. The three sentence patterns are selected from our validation set, so the model is not aware of the entity pairs that actually co-occur with the patterns during training. For each facet embedding, we show two among five of its closest entity pairs to visualize the meaning of the embedding space. [2]

[1] We have not yet applied the multi-facet embedding approach to models that rely on a large pretrained language model (LM) (Baldini Soares et al., 2019) due to computational and evaluation considerations. Computationally speaking, training state-of-the-art models requires intensive GPU resources. Besides, a smaller model size might be desired when we need to construct a knowledge base from a large corpus in real time. Moreover, there is no existing pretrained LM in some domains (Zhang et al., 2019), and training an LM in a new domain from scratch requires even more GPU resources. In terms of the evaluation consideration, our method is an improvement over CUSchema, so we want to compare it with CUSchema fairly. Furthermore, evaluating entailment between two full sentences is more difficult than between sentence patterns, and we are not aware of an LM-based model that only considers the text between the entity pairs.

[2] Notice that our training signal is sparse and noisy and the projection does not necessarily preserve the original distances, so entity pairs with similar relations might be relatively far away from each other.

In the single embedding model, the embedding of org:city_of_headquarters is close to the embedding of (school, location), while "$ARG1 headoffice in $ARG2" is close to (company, location) and "$ARG1 headquarter in $ARG2".
In the multi-facet embedding model, some embeddings of org:city_of_headquarters are closer to (school, location) and others are closer to (company, location). In addition to these entity pairs, "$ARG1 headoffice in $ARG2" and "$ARG1 headquarter in $ARG2" also co-occur with (people, location) and (people/organization, year). Using the visualization of multi-facet embeddings, we can understand which facets of org:city_of_headquarters are similar or dissimilar to "$ARG1 headoffice in $ARG2", which cannot be done if all facets are averaged into a single embedding as in traditional models.
The facet embeddings of "$ARG1 is now at $ARG2" are close to (people, organization) where the organization could be school, sports team, and company. Using multiple embeddings could avoid enforcing the closeness of these entity pairs with different relations. The results also indicate that our model can output reasonable cluster centers despite learning from the sparse and noisy training data. Finally, we can see that if a sentence pattern has fewer facets than K, our model learns to output some very similar facet embeddings, which makes the performance less sensitive to the setting of K.

Relation Extraction
We follow the same training data and testing protocol as compositional universal schema (CUSchema) (Verga et al., 2016) to highlight the benefit of predicting multiple facet embeddings, and the relation extraction step in the TAC KBP slot-filling tasks is used to compare the different models.
Setup: The training data for our RE models are prepared by distant supervision without requiring any manually labeled data. The relations in Freebase (Bollacker et al., 2008) are mapped to TAC relations (e.g., org:city_of_headquarters), and the NER tagger and entity linker are run on a raw text corpus. Then, the training data are cleaned using previously proposed noise-removal methods.
During testing, we are given a query containing the head entity and a query TAC relation in the slot-filling tasks, and the goal is to extract the tail entity from the candidate sentences. The NER tagger and query expansion are used to gather the candidate sentence patterns, and we compute the similarity scores from different models between the candidate sentence patterns and the query relation. Finally, we compare the extracted tail entity with the ground truth using exact string matching and report the precision, recall, and F1 scores.
Following Verga et al. (2016), we use TAC 2012 as our validation set to determine the threshold score for each TAC relation. Each model's hyperparameters are tuned separately using the validation set (TAC 2012) to ensure a fair comparison. We compare the following methods:

Ours (Trans): Our method that measures the similarity between the sentence pattern {s_{i,k}} and TAC relation {s_{j,m}} using Sim({s_{i,k}}, {s_{j,m}}) in equation 4. Trans is an abbreviation of the Transformer encoder. We set K = 5 and K_rel = 11 based on the validation set.

Ours (LSTM): The same as Ours (Trans) except that we use a bi-LSTM as our encoder instead.

Ours (Single-*): Our methods that use a single facet embedding to represent each sentence pattern or KB relation. When setting K = K_rel = 1, our decoder becomes interleaving feedforward layers and cross-attention layers attending to the output embeddings of the encoder.

CUSchema (LSTM): Compositional universal schema (Verga et al., 2016). The method is similar to Ours (Single-LSTM) but uses a different loss function, neural architecture (no decoder), and hyperparameter search procedure.

USchema: Universal schema (Riedel et al., 2013) estimates every sentence pattern embedding by factorizing the co-occurrence matrices (i.e., replacing the bi-LSTM in CUSchema with a look-up table).

USchema + *: Verga et al. (2016) show that taking the maximal similarity between the USchema and CUSchema models improves the F1. We also apply the same merging procedure to our model.
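Per-relation threshold selection on the validation set can be sketched as follows; the candidate names, scores, and helper functions are illustrative, not the evaluation scripts actually used:

```python
def f1(preds, golds):
    """F1 of a predicted candidate set against the gold set."""
    tp = len(preds & golds)
    if not preds or not golds or not tp:
        return 0.0
    p, r = tp / len(preds), tp / len(golds)
    return 2 * p * r / (p + r)

def tune_threshold(scored, golds):
    """Pick the score threshold for one TAC relation that maximizes F1
    on validation data. scored: {candidate: score}; golds: correct set."""
    best_t, best_f1 = 0.0, -1.0
    for t in sorted(set(scored.values())):
        preds = {c for c, s in scored.items() if s >= t}
        f = f1(preds, golds)
        if f > best_f1:
            best_t, best_f1 = t, f
    return best_t

scored = {"a": 0.9, "b": 0.6, "c": 0.2}
threshold = tune_threshold(scored, {"a", "b"})
```

At test time, a candidate tail entity would be emitted only when its similarity score exceeds the threshold tuned for that relation.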
Results: In Table 1, the proposed method Ours (Trans) significantly outperforms CUSchema (LSTM) both before and after combining with universal schema. As far as we know, our proposed multi-facet embedding is the first method that outperforms compositional universal schema using the same training signal in the distantly supervised RE benchmark they proposed.
Although the recall of USchema is low because it does not exploit the similarity between patterns (e.g., "$ARG1 happily married $ARG2" is similar to "$ARG1 married $ARG2"), USchema has a high precision because it also will not be misled by spurious similarity (e.g., "$ARG1, and his wife $ARG2" expresses the spouse relation but "$ARG1, his wife, and $ARG2" does not). Thus, combining USchema and Ours (Trans) leads to the best performance. Ours (Trans) and Ours (LSTM) perform similarly. Furthermore, Ours (LSTM) performs much better than Ours (Single-LSTM), which demonstrates the effectiveness of using multiple embeddings. Notice that multiple facet embeddings improve the performance even after the training data have been cleaned. This indicates that our method is complementary to existing noise removal methods.

Entailment Detection
Entailment is a common and fundamental relation between two sentence patterns; some examples can be seen in Table 2. Unsupervised hypernym detection (i.e., entailment at the word level) has been extensively studied (Shwartz et al., 2017), but we are not aware of any previous work on unsupervised entailment detection at the sentence level, nor any existing entailment dataset between sentence patterns. Thus, we create one.
Dataset Creation: We use WordNet (Miller, 1998) to discover entailment candidates among sentence pattern pairs and manually label the candidates. For each sentence pattern in the training data of Verga et al. (2016), we replace one word at a time with its hypernym based on the WordNet hierarchy. The two sentence patterns before and after the replacement form an entailment candidate. We label 1,500 pairs involving the most popular sentence patterns, i.e., those that co-occur with the highest number of unique entity pairs. Each candidate can be labeled as entailment, paraphrase, or other. Finally, around 20% of the candidates are randomly chosen to form the validation set, and the rest are in the test set. More details of the dataset creation process can be seen in the appendix.

In this dataset, only 22% and 10% of the candidates are labeled as entailment and paraphrase, respectively. This suggests that the entailment relation between two sentence patterns is hard to infer from only the hypernym relation (i.e., the entailment relation at the word level) in WordNet.
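The candidate generation step can be sketched as below; a toy hypernym map stands in for the WordNet hierarchy, and the helper names are ours:

```python
# Toy stand-in for WordNet hypernyms (word -> one hypernym).
hypernyms = {"wife": "partner", "CEO": "leader"}

def entailment_candidates(pattern):
    """Replace one word at a time with its hypernym; each replacement yields
    a (specific pattern, general pattern) candidate pair."""
    words = pattern.split()
    out = []
    for i, w in enumerate(words):
        if w in hypernyms:
            out.append(" ".join(words[:i] + [hypernyms[w]] + words[i + 1:]))
    return [(pattern, c) for c in out]

cands = entailment_candidates("$ARG1 , the wife of fellow $ARG2")
```

Only a fraction of such word-level candidates turn out to be true sentence-level entailments, which is exactly what the manual labeling measures.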
Setup: We evaluate entailment detection using the typical setup and metrics of hypernym detection (Shwartz et al., 2017). Negative examples include the candidates labeled as paraphrase and other. We compare the average precision of different methods (i.e., the AUC of the precision-recall curve) (Hastie et al., 2009). In addition, we predict the direction of the entailment relation in each pair (i.e., which pattern is the premise) and report the accuracy. Many hypotheses share the same hypernym, such as leader in Table 2, so we also report the macro accuracy of direction detection averaged across every hypernym in the hypotheses.
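Average precision over a ranked candidate list can be computed as follows (a standard formulation, not code from the paper):

```python
import numpy as np

def average_precision(scores, labels):
    """AP (area under the precision-recall curve) from a ranked list:
    the mean of precision@k over the ranks k of the positive examples."""
    order = np.argsort(-np.asarray(scores))      # rank by descending score
    labels = np.asarray(labels)[order]
    hits, precisions = 0, []
    for k, y in enumerate(labels, start=1):
        if y:
            hits += 1
            precisions.append(hits / k)
    if not precisions:
        return 0.0
    return float(np.mean(precisions))

# Positives ranked 1st and 3rd: AP = (1/1 + 2/3) / 2 = 5/6
ap = average_precision([0.9, 0.5, 0.4], [1, 0, 1])
```

A random scorer on a dataset with 22% positives would hover near an AP of 0.22, which is the reference point for the random baseline.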
The task is challenging because all the candidates have a word-level entailment relation if their compositional meaning is ignored. Furthermore, we cannot infer the entailment direction based on the tendency that longer sentence patterns tend to be more specific because most of the candidate pairs in this dataset have the same length.
As described in Section 2.5, our models detect the direction by computing

Ours Diff = Asym({s_{i,k}}, {s_{j,m}}) - Asym({s_{j,m}}, {s_{i,k}}),

predicting S_i to be the premise if Ours Diff > 0. In entailment classification, we compare the results with the cosine similarity from Ours (Single-Trans) and CUSchema. We also test the frequency difference, which is a strong baseline in hypernym direction detection (Chang et al., 2018):

Freq Diff = Freq(S_j) - Freq(S_i),

where Freq(S_i) is the number of unique entity pairs co-occurring with the ith sentence pattern. The baseline predicts S_i to be the premise if Freq Diff > 0 because more general sentence patterns should co-occur with more entity pairs. As a reference, we also report the performance of random scores.
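A sketch of both direction scores; Ours Diff is our reconstruction from the description in Section 2.5, while Freq Diff follows the formula above:

```python
import numpy as np

def normalize(M):
    return M / np.linalg.norm(M, axis=1, keepdims=True)

def asym(Si, Sj):
    """Average, over the facets of pattern j, of the cosine similarity to the
    closest facet of pattern i."""
    sims = normalize(Si) @ normalize(Sj).T
    return float(sims.max(axis=0).mean())

def ours_diff(Si, Sj):
    """Predict S_i as the premise when its facets cover the facets of S_j
    better than the reverse (positive score -> S_i is the premise)."""
    return asym(Si, Sj) - asym(Sj, Si)

def freq_diff(freq_i, freq_j):
    """Baseline: Freq Diff = Freq(S_j) - Freq(S_i); the hypothesis should
    co-occur with more unique entity pairs."""
    return freq_j - freq_i

Si = np.array([[1.0, 0.0], [0.0, 1.0]])   # specific pattern, two facets
Sj = np.array([[1.0, 0.0]])               # general pattern, one facet
```

Both scores are thresholded at zero to predict which side of the candidate pair is the premise.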

Results:
The quantitative and qualitative comparisons are presented in Table 3 and Table 2, respectively. Our model that uses multi-facet embeddings significantly outperforms the other baselines. We hypothesize that a major reason is that sentence patterns with an entailment relation are often similar in some but not all of their facets, and our asymmetric similarity measurement is better at capturing this facet overlap.

Related Work
Relation extraction (RE) is widely studied. Han et al. (2020) summarize the trend of recent studies and point out that one of the major challenges is the cost of collecting labels. Distant supervision (Mintz et al., 2009) and its follow-up work enable us to collect a large amount of training data at a low cost, but the violation of its assumptions often introduces substantial noise into the supervision signal. Our goal is to alleviate the noise issue by representing every sentence pattern using multiple embeddings.
Other noise reduction methods have also been proposed. For instance, we can adopt multi-instance learning techniques (Yao et al., 2010; Surdeanu et al., 2012; Amin et al., 2020), a global topic model (Alfonseca et al., 2012), or both. We can also reduce the noise by counting the number of shared entity pairs between a sentence pattern and a KB relation (Takamatsu et al., 2012; Su et al., 2018). Nevertheless, these studies focus on mitigating the noise caused by assuming similarity between the sentence patterns and KB relations that co-occur with the same entity pairs, while our method can also reduce the noise from two sentence patterns sharing the same entity pair. Besides, our method is complementary to popular noise reduction methods because our improvement is shown on training data that have already been cleaned.
Our method is conceptually related to some studies for lexical semantics. For example, word sense induction or unsupervised hypernymy detection can be addressed by clustering the co-occurring words (Neelakantan et al., 2014;Athiwaratkun and Wilson, 2017;Chang et al., 2018). However, the clustering-based methods do not apply to RE because the co-occurring matrix for RE is much sparser (see Section 2.4 for more details).
Finally, our work is inspired by Chang et al. (2021), but they focus on improving sentence representations rather than RE. We encourage the facet embeddings to become the centers of K-means clustering instead of the NNSC (non-negative sparse coding) clustering used in Chang et al. (2021), due to its simplicity, efficiency, and better RE performance. Moreover, we discover that an additional regularization described in the appendix is crucial to overcome the sparsity challenge in RE.

Conclusion
In this work, we address the limitation of representing each sentence pattern using only a single embedding, and our approach improves the distantly supervised RE performance of compositional universal schema.
Relying on only a very sparse co-occurrence matrix between the sentence patterns and entity pairs, we show that it is possible to predict reasonable cluster centers of entity pair embeddings and to predict the entailment relation between two sentence patterns without any labels.