Unified Interpretation of Softmax Cross-Entropy and Negative Sampling: With Case Study for Knowledge Graph Embedding

In knowledge graph embedding, the theoretical relationship between the softmax cross-entropy and negative sampling loss functions has not been investigated. This makes it difficult to fairly compare the results of the two different loss functions. We attempted to solve this problem by using the Bregman divergence to provide a unified interpretation of the softmax cross-entropy and negative sampling loss functions. Under this interpretation, we can derive theoretical findings for fair comparison. Experimental results on the FB15k-237 and WN18RR datasets show that the theoretical findings are valid in practical settings.


Introduction
Negative Sampling (NS) (Mikolov et al., 2013) is an approximation of softmax cross-entropy (SCE). Owing to its computational efficiency, NS is now a fundamental loss function for various Natural Language Processing (NLP) tasks such as word embedding (Mikolov et al., 2013), language modeling (Melamud et al., 2017), contextualized embedding (Clark et al., 2020b,a), and knowledge graph embedding (KGE) (Trouillon et al., 2016). In particular, recent KGE models commonly use NS for training. Considering these current usages of NS, we investigated the characteristics of NS from both theoretical and empirical aspects, mainly focusing on KGE.
First, we introduce the KGE task. A knowledge graph is a graph that describes the relationships between entities. It is an indispensable resource for knowledge-intensive NLP applications such as dialogue (Moon et al., 2019) and question-answering (Lukovnikov et al., 2017) systems. However, creating a knowledge graph requires considering a large number of entity combinations and their relationships, making it difficult to construct a complete graph manually. Therefore, predicting missing links between entities is an important task.
Currently, missing relational links between entities are predicted using a scoring method based on KGE (Bordes et al., 2011). With this method, a score for each link is computed on vector space representations of embedded entities and relations. We can train these representations through various loss functions. The SCE (Kadlec et al., 2017) and NS (Trouillon et al., 2016) loss functions are commonly used for this purpose.
Several studies (Ruffinelli et al., 2020; Ali et al., 2020) have shown that link-prediction performance can be significantly improved by choosing the appropriate combination of loss functions and scoring methods. However, the relationship between the SCE and NS loss functions has not been investigated in KGE. Without a basis for understanding the relationships among different loss functions, it is difficult to make a fair comparison between the SCE and NS results.
We attempted to solve this problem by using the Bregman divergence (Bregman, 1967) to provide a unified interpretation of the SCE and NS loss functions. Under this interpretation, we can understand the relationships between SCE and NS in terms of the model's predicted distribution at the optimal solution, which we call the objective distribution. By deriving the objective distribution for a loss function, we can analyze, from a unified viewpoint, different loss functions whose objective distributions are identical under certain conditions.
We summarize our theoretical findings, which are not restricted to KGE, as follows:
• The objective distribution of NS with uniform noise (NS w/ Uni) is equivalent to that of SCE.
• NS with frequency-based noise (NS w/ Freq), as used in word2vec (which uses the unigram distribution as the frequency-based noise), has a smoothing effect on the objective distribution.
• SCE has a property wherein it more strongly fits a model to the training data than NS.
To check the validity of the theoretical findings in practical settings, we conducted experiments on the FB15k-237 (Toutanova and Chen, 2015) and WN18RR (Dettmers et al., 2018) datasets. The experimental results indicate that
• The relationship between SCE and SCE w/ LS is also similar to that between NS and SANS in practical settings.
• NS is prone to underfitting because it weakly fits a model to the training data compared with SCE.
• SCE causes underfitting of KGE models when their score function has a bound.
• Both SANS and SCE w/ LS perform well as pre-training methods.
The structure of this paper is as follows: Sec. 2 introduces SCE and the Bregman divergence; Sec. 3 induces the objective distributions for NS; Sec. 4 analyzes the relationships between the SCE and NS loss functions; Sec. 5 summarizes and discusses our theoretical findings; Sec. 6 empirically investigates the validity of the theoretical findings in practical settings; Sec. 7 explains the differences between this paper and related work; and Sec. 8 summarizes our contributions. Our code will be available at https://github.com/kamigaito/acl2021kge

Softmax Cross Entropy and Bregman Divergence

SCE in KGE
We denote a link representing a relationship r_k between entities e_i and e_j in a knowledge graph as (e_i, r_k, e_j). In predicting links from given queries (e_i, r_k, ?) and (?, r_k, e_j), the model must predict the entity corresponding to each ? in the queries. We denote such a query as x and the entity to be predicted as y. By using the softmax function, the probability p_θ(y|x) that y is predicted from x with the model parameter θ, given a score function f_θ(x, y), is expressed as follows:

  p_θ(y|x) = exp(f_θ(x, y)) / Σ_{y'∈Y} exp(f_θ(x, y')),   (1)

where Y is the set of all predictable entities. We further denote the pair of an input x and its label y as (x, y). Let D = {(x_1, y_1), ..., (x_|D|, y_|D|)} be the observed data, which obey a distribution p_d(x, y).
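The following is a minimal sketch (not the paper's code) of how Eq. (1) turns scores into p_θ(y|x) for a single query; the DistMult-style score function and the toy embeddings are illustrative assumptions, not the settings used in our experiments.

```python
import numpy as np

# Toy example of Eq. (1): score all candidate entities for a query x = (e_i, r_k, ?)
# and normalize the scores with the softmax function to obtain p_theta(y | x).
rng = np.random.default_rng(0)
dim, num_entities = 8, 100
entity_emb = rng.normal(size=(num_entities, dim))   # illustrative entity embeddings
relation_emb = rng.normal(size=dim)                 # illustrative relation embedding

head = entity_emb[3]                                 # e_i of the query (e_i, r_k, ?)
scores = (head * relation_emb) @ entity_emb.T        # DistMult-style f_theta(x, y) for every candidate y
scores -= scores.max()                               # stabilize the exponentials
p_theta = np.exp(scores) / np.exp(scores).sum()      # Eq. (1)

gold = 7                                             # index of the answer entity y
sce_loss = -np.log(p_theta[gold])                    # SCE loss for this (x, y) pair
print(p_theta.sum(), sce_loss)                       # 1.0 and a non-negative loss value
```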

Bregman Divergence
Next, we introduce the Bregman divergence. Let Ψ(z) be a differentiable function; the Bregman divergence between two distributions f and g is defined as follows:

  d_{Ψ(z)}(f, g) = Ψ(f) − Ψ(g) − ∇Ψ(g) · (f − g).   (2)

We can express various divergences by changing Ψ(z).
To take into account the divergence on the entire observed data, we consider the expectation of d_{Ψ(z)}(f, g):

  B_{Ψ(z)}(f, g) = (1/|D|) Σ_{(x,y)∈D} d_{Ψ(z)}(f, g).   (3)

To investigate the relationship between a loss function and the learned distribution of a model at an optimal solution of the loss function, we need to focus on the minimization of B_{Ψ(z)}. Gutmann and Hirayama (2011) showed that B_{Ψ(z)}(f, g) = 0 means that f equals g almost everywhere when Ψ(z) is a differentiable, strictly convex function in its domain. Note that all Ψ(z) in this paper satisfy this condition. Accordingly, by fixing f, minimization of B_{Ψ(z)}(f, g) with respect to g is equivalent to minimization of

  B_Ψ(f, g) = −(1/|D|) Σ_{(x,y)∈D} { Ψ(g) + ∇Ψ(g) · (f − g) },   (4)

which drops the term Ψ(f) that is constant with respect to g. We use B_Ψ(f, g) to reveal the learned distribution of a model at the optimal solutions of the SCE and NS loss functions.
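As a quick numerical sketch of Eq. (2) (with toy distributions that are purely illustrative), the following code checks that choosing Ψ(z) = Σ_i z_i log z_i makes the Bregman divergence coincide with the KL divergence for normalized f and g:

```python
import numpy as np

# d_Psi(f, g) = Psi(f) - Psi(g) - grad(Psi)(g) . (f - g), cf. Eq. (2).
def d_bregman(f, g, psi, grad_psi):
    return psi(f) - psi(g) - grad_psi(g) @ (f - g)

psi = lambda z: np.sum(z * np.log(z))        # Psi(z) of the SCE loss
grad_psi = lambda z: 1.0 + np.log(z)

f = np.array([0.6, 0.3, 0.1])                # toy "data" distribution
g = np.array([0.2, 0.5, 0.3])                # toy "model" distribution
kl = np.sum(f * np.log(f / g))
print(np.isclose(d_bregman(f, g, psi, grad_psi), kl))   # True for normalized f and g
```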

Derivation of SCE
To prepare for the later explanations, we first derive the SCE loss function from Eq. (3). We denote a probability for a label y as p(y), the vector of all labels as y, the vector of probabilities for y as p(y), and the dimension size of z as len(z). In Eq. (3), by setting f as p_d(y|x) and g as p_θ(y|x) with Ψ(z) = Σ_{i=1}^{len(z)} z_i log z_i (Banerjee et al., 2005), we can derive the SCE loss function as follows:

  ℓ_SCE(θ) = −(1/|D|) Σ_{(x,y)∈D} log p_θ(y|x).   (5)

This derivation indicates that p_θ(y|x) converges to the observed distribution p_d(y|x) through minimizing B_{Ψ(z)}(p_d(y|x), p_θ(y|x)) in the SCE loss function. We call the distribution of p_θ(y|x) when B_{Ψ(z)} equals zero an objective distribution.
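For completeness, the step behind Eq. (5) can be sketched as follows, assuming that f = p_d(y|x) and g = p_θ(y|x) are normalized over y so that Σ_y (f_y − g_y) = 0:

```latex
% With Psi(z) = \sum_i z_i \log z_i, the Bregman divergence of Eq. (2) becomes
\begin{align*}
d_{\Psi(z)}(f, g)
  &= \sum_{y} f_y \log f_y - \sum_{y} g_y \log g_y - \sum_{y} (1 + \log g_y)(f_y - g_y) \\
  &= \sum_{y} f_y \log \frac{f_y}{g_y}
   = \mathrm{KL}\bigl(p_d(\cdot\,|\,x) \,\|\, p_\theta(\cdot\,|\,x)\bigr),
\end{align*}
% whose minimization with respect to \theta is equivalent to minimizing the cross entropy
% -\sum_{y} p_d(y|x) \log p_\theta(y|x), i.e., the SCE loss of Eq. (5) in expectation.
```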

Objective Distribution for Negative Sampling Loss
We begin by providing a definition of NS and its relationship to the Bregman divergence, following the induction of noise contrastive estimation (NCE) from the Bregman divergence that was established by Gutmann and Hirayama (2011). We denote by p_n(y|x) a known non-zero noise distribution for y given x. Given ν noise samples from p_n(y|x) for each (x, y) ∈ D, NS estimates the model parameter θ for a distribution G(y|x; θ) = exp(−f_θ(x, y)). By assigning to each (x, y) a binary class label C, where C = 1 if (x, y) is drawn from the observed data D following a distribution p_d(x, y) and C = 0 if (x, y) is drawn from the noise distribution p_n(y|x), we can model the posterior probabilities for the classes as follows:

  P(C = 1|y, x; θ) = 1 / (1 + G(y|x; θ)),   P(C = 0|y, x; θ) = G(y|x; θ) / (1 + G(y|x; θ)).

The objective function ℓ_NS(θ) of NS is defined as follows:

  ℓ_NS(θ) = −(1/|D|) Σ_{(x,y)∈D} [ log P(C = 1|y, x; θ) + Σ_{i=1}^{ν} E_{y_i∼p_n} log P(C = 0|y_i, x; θ) ].   (6)

By using the Bregman divergence, we can induce the following propositions for ℓ_NS(θ).
Proposition 1. ℓ_NS(θ) can be induced from Eq. (3) by setting Ψ(z) as:

  Ψ(z) = Σ_{i=1}^{len(z)} [ z_i log z_i − (z_i + ν p_n(y_i|x)) log(z_i + ν p_n(y_i|x)) + ν p_n(y_i|x) log(ν p_n(y_i|x)) ].   (7)

Proposition 2. When the Bregman divergence B_{Ψ(z)} that induces ℓ_NS(θ) equals 0, the following equation is satisfied:

  exp(f_θ(x, y)) = p_d(y|x) / (ν p_n(y|x)).   (8)

Proposition 3. The objective distribution of p_θ(y|x) for ℓ_NS(θ) is

  p_θ(y|x) = (p_d(y|x) / p_n(y|x)) / Σ_{y'∈Y} (p_d(y'|x) / p_n(y'|x)).   (9)

Proof. We give the proofs of Props. 1, 2, and 3 in Appendix A of the supplemental material.
We can also investigate the validity of Props. 1, 2, and 3 by comparing them with the previously reported result. For this purpose, we prove the following proposition: Proposition 4. When Eq. (8) satisfies ν = 1 and p_n(y|x) = p_d(y), f_θ(x, y) equals point-wise mutual information (PMI).
Proof. This is described in Appendix B of the supplemental material.
This observation is consistent with that of Levy and Goldberg (2014). The differences between their representation and ours are as follows: (1) our noise distribution is general in the sense that its definition is not restricted to a unigram distribution; (2) we mainly discuss p_θ(y|x), not f_θ(x, y); and (3) we can compare NS- and SCE-based loss functions through the Bregman divergence.
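Propositions 2 and 3 can also be checked numerically. The following sketch (with toy distributions and plain gradient descent; none of it is the paper's code) minimizes the expected NS objective for a single query and compares the resulting softmax distribution with Eq. (9):

```python
import numpy as np

rng = np.random.default_rng(0)
num_labels, nu = 5, 3                        # |Y| and the number of negative samples
p_d = rng.dirichlet(np.ones(num_labels))     # toy data distribution p_d(y|x)
p_n = rng.dirichlet(np.ones(num_labels))     # toy noise distribution p_n(y|x)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

f = np.zeros(num_labels)                     # scores f_theta(x, y)
for _ in range(50000):                       # gradient descent on the expected NS loss of Eq. (6)
    grad = -p_d * sigmoid(-f) + nu * p_n * sigmoid(f)
    f -= 0.1 * grad

print(np.allclose(np.exp(f), p_d / (nu * p_n), atol=1e-3))    # Eq. (8)
softmax = np.exp(f) / np.exp(f).sum()
objective = (p_d / p_n) / (p_d / p_n).sum()                   # Eq. (9); note that nu cancels out
print(np.allclose(softmax, objective, atol=1e-3))             # True
# With uniform noise p_n(y|x) = 1/|Y|, Eq. (9) reduces to p_d(y|x), the objective distribution of SCE.
```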

Various Noise Distributions
Different from the objective distribution of SCE, Eq. (9) is affected by the type of noise distribution p_n(y|x). To investigate the actual objective distribution for ℓ_NS(θ), we need to consider separate cases for each type of noise distribution. In this subsection, we further analyze Eq. (9) for each separate case.

NS with Uniform Noise
First, we investigated the case of a uniform distribution because it is one of the most common noise distributions for ℓ_NS(θ) in the KGE task. From Eq. (9), we can induce the following property.

Property 1. When p_n(y|x) is a uniform distribution, the objective distribution of p_θ(y|x) for ℓ_NS(θ) is p_d(y|x), which is identical to the objective distribution of SCE.

Proof. This is described in Appendix C of the supplemental material.

Dyer (2014) indicated that NS is equal to NCE when ν = |Y| and p_n(y|x) is uniform. However, as we showed, the value of ν does not affect the objective distribution because Eq. (9) is independent of ν.

NS with Frequency-based Noise
In the original setting of NS (Mikolov et al., 2013), the authors chose as p_n(y|x) a unigram distribution of y, which is independent of x. Such a frequency-based distribution is calculated from frequencies in a corpus and is independent of the model parameter θ. In this case, different from the case of a uniform distribution, p_n(y|x) remains on the right side of Eq. (9), so p_θ(y|x) decreases when p_n(y|x) increases. Thus, we can interpret frequency-based noise as a type of smoothing for p_d(y|x). The smoothing of NS w/ Freq decreases the importance of high-frequency labels in the training data for learning more general vector representations, which can be used for various tasks as pre-trained vectors. Since we can expect pre-trained vectors to work as a prior (Erhan et al., 2010) that prevents models from overfitting, we tried using NS w/ Freq for pre-training KGE models in our experiments.
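The smoothing effect can be seen with a small toy example (the distributions below are made up for illustration): dividing p_d(y|x) by a unigram noise distribution in Eq. (9) down-weights frequent labels and boosts rare ones.

```python
import numpy as np

p_d = np.array([0.70, 0.20, 0.06, 0.04])   # toy conditional distribution p_d(y|x) for one query
p_n = np.array([0.50, 0.30, 0.15, 0.05])   # toy unigram noise: frequent labels have large p_n(y)

objective = (p_d / p_n) / (p_d / p_n).sum()   # objective distribution of Eq. (9)
print(p_d)          # [0.7  0.2  0.06 0.04]
print(objective)    # approx. [0.43 0.20 0.12 0.24]: the head label is damped, the tail label boosted
```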

Self-Adversarial NS
Sun et al. (2019) recently proposed SANS, which uses p_θ(y|x) for generating negative samples. By replacing p_n(y|x) with p_θ̄(y|x), the objective distribution when using SANS is as follows:

  p_θ(y|x) = (p_d(y|x) / p_θ̄(y|x)) / Σ_{y'∈Y} (p_d(y'|x) / p_θ̄(y'|x)),   (10)

where θ̄ is the parameter set updated in the previous iteration. Because both the left and right sides of Eq. (10) include p_θ(y|x), we cannot obtain an analytical solution of p_θ(y|x) from this equation. However, we can consider special cases of p_θ̄(y|x) to gain an understanding of Eq. (10). At the beginning of training, p_θ(y|x) follows a discrete uniform distribution u{1, |Y|} because θ is randomly initialized. In this situation, when we set p_θ̄(y|x) in Eq. (10) to the discrete uniform distribution u{1, |Y|}, we obtain

  p_θ(y|x) = p_d(y|x).   (11)

Next, when we set p_θ̄(y|x) in Eq. (10) as p_d(y|x), we obtain

  p_θ(y|x) = u{1, |Y|}.   (12)

In actual mini-batch training, θ is iteratively updated for every batch of data. Because p_θ(y|x) converges to u{1, |Y|} when p_θ̄(y|x) is close to p_d(y|x) and p_θ(y|x) converges to p_d(y|x) when p_θ̄(y|x) is close to u{1, |Y|}, we can approximately regard the objective distribution of SANS as a mixture of p_d and u{1, |Y|}. Thus, we can represent the objective distribution of p_θ(y|x) as

  p_θ(y|x) = λ p_d(y|x) + (1 − λ) u{1, |Y|},   (13)

where λ is a hyper-parameter that determines whether p_θ(y|x) is close to p_d(y|x) or u{1, |Y|}. Assuming that p_θ(y|x) starts from u{1, |Y|}, λ should start from 0 and gradually increase through training. Note that λ corresponds to a temperature α for p_θ̄(y|x) in SANS, defined as

  p_θ̄(y|x) = exp(α f_θ̄(x, y)) / Σ_{y'∈Y} exp(α f_θ̄(x, y')),   (14)

where α also adjusts p_θ̄(y|x) to be close to p_d(y|x) or u{1, |Y|}.
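As a concrete reference point, the following is a sketch of the self-adversarial weighting of Sun et al. (2019) with toy numbers; the margin γ, the scores, and the function name are illustrative assumptions, and in the actual method the weights given by Eq. (14) are treated as constants (no gradient flows through them).

```python
import numpy as np

def sans_loss(pos_score, neg_scores, alpha=1.0, gamma=9.0):
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    weights = np.exp(alpha * neg_scores)
    weights = weights / weights.sum()          # self-adversarial weights, cf. Eq. (14)
    positive_term = -np.log(sigmoid(gamma + pos_score))
    negative_term = -(weights * np.log(sigmoid(-neg_scores - gamma))).sum()
    return positive_term + negative_term

neg = np.array([-12.0, -8.0, -10.5, -3.0])     # scores of sampled negatives (e.g., minus a distance)
print(sans_loss(pos_score=-2.0, neg_scores=neg))
```

Harder negatives (those with higher scores under the current model) receive larger weights, which is how the noise distribution, starting near uniform under random initialization, can move toward p_d(y|x) as training proceeds.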

Corresponding SCE form to NS with Frequency-based Noise
We induce a corresponding cross-entropy loss from the objective distribution for NS with frequency-based noise. We set

  T_{x,y} = p_n(y|x) Σ_{y'∈Y} p_d(y'|x) / p_n(y'|x),

so that the objective distribution in Eq. (9) can be written as q(y|x) = p_d(y|x) / T_{x,y}. Under these conditions, following the induction from Eq. (5), we can reformulate B_{Ψ(z)}(q(y|x), p_θ(y|x)) as follows:

  B_{Ψ(z)}(q(y|x), p_θ(y|x)) = −(1/|D|) Σ_{(x,y)∈D} (1 / T_{x,y}) log p_θ(y|x) + C,   (15)

where C is a constant that does not depend on θ. Except that T_{x,y} is conditioned on x and not normalized for y, we can interpret this loss function as SCE with backward correction (SCE w/ BC) (Patrini et al., 2017). Taking into account that backward correction can act as a smoothing method for predicted labels (Lukasik et al., 2020), this relationship supports the theoretical finding that NS can apply smoothing to the objective distribution.
Because the frequency-based noise is used in word2vec as unigram noise, we specifically consider the case in which p_n(y|x) is set to unigram noise. In this case, we can set p_n(y|x) = p_d(y). Since relation tuples do not appear twice in a knowledge graph, we can assume that p_d(x, y) is uniform. Accordingly, we can change T^{-1}_{x,y} to (#x / #y) · C', where C' is a constant value, and we can reformulate Eq. (15) as follows:

  −(1/|D|) Σ_{(x,y)∈D} (#x / #y) log p_θ(y|x),   (16)

where #x and #y respectively represent the frequencies of x and y in the training data. We use Eq. (16) to pre-train models for SCE-based loss functions.
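A minimal sketch of the per-example weighting in Eq. (15) is given below; the toy distributions and the current model prediction are purely illustrative, and T_{x,y} is computed as defined above.

```python
import numpy as np

p_d = np.array([0.70, 0.20, 0.06, 0.04])        # toy p_d(y|x) for a single query x
p_n = np.array([0.50, 0.30, 0.15, 0.05])        # toy unigram-style noise p_n(y|x) = p_d(y)
p_theta = np.array([0.40, 0.30, 0.20, 0.10])    # toy current model prediction p_theta(y|x)

T = p_n * (p_d / p_n).sum()                      # T_{x,y} for every label y of this query
# Expectation over y ~ p_d(y|x) of the backward-corrected term in Eq. (15);
# minimizing this over p_theta recovers the objective distribution of Eq. (9).
weighted_sce = -(p_d * (1.0 / T) * np.log(p_theta)).sum()
print(T, weighted_sce)
```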

Corresponding SCE form to SANS
We induce a corresponding cross-entropy loss from the objective distribution for SANS by setting q(y|x) to the mixture distribution in Eq. (13). Under these conditions, on the basis of the induction from Eq. (4) to Eq. (5), we can reformulate B_{Ψ(z)}(q(y|x), p_θ(y|x)) as follows:

  B_{Ψ(z)}(q(y|x), p_θ(y|x)) = (1/|D|) Σ_{(x,y)∈D} [ −λ log p_θ(y|x) − (1 − λ) Σ_{y'∈Y} (1/|Y|) log p_θ(y'|x) ] + C,   (17)

where C is a constant that does not depend on θ. The equation in the brackets of Eq. (17) is the cross-entropy loss whose objective distribution corresponds to that of SANS. This loss function is similar in form to SCE with label smoothing (SCE w/ LS) (Szegedy et al., 2016). This relationship also accords with the theoretical finding that NS can apply smoothing to the objective distribution.
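The bracketed loss can be sketched in a few lines (the logits and λ below are illustrative; λ plays the role of the weight on p_d(y|x) in the mixture of Eq. (13)):

```python
import numpy as np

def sce_with_ls(logits, gold, lam):
    # log-softmax of the scores, cf. Eq. (1)
    log_p = logits - logits.max() - np.log(np.exp(logits - logits.max()).sum())
    target = np.full(len(logits), (1.0 - lam) / len(logits))   # (1 - lambda) spread uniformly over Y
    target[gold] += lam                                        # lambda on the observed label
    return -(target * log_p).sum()

logits = np.array([2.0, 0.5, -1.0, 0.0])
print(sce_with_ls(logits, gold=0, lam=0.7))
```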

Understanding Loss Functions for Fair Comparisons
We summarize the theoretical findings from Sections 2, 3, and 4 in Table 1. To compare the results from the theoretical findings, we need to understand the differences in their objective distributions and divergences.

Objective Distributions
The objective distributions for NS w/ Uni and SCE are equivalent. We can also see that the objective distribution for SANS is quite similar to that for SCE w/ LS. These theoretical findings will be important for making a fair comparison between scoring methods trained with the NS and SCE loss functions. When a dataset contains low-frequency entities, SANS and SCE w/ LS can improve the link-prediction performance through their smoothing effect, even if there is no performance improvement from the scoring method itself. For comparing the SCE and NS loss functions fairly, therefore, it is necessary to use the vanilla SCE against NS w/ Uni and use SCE w/ LS against SANS.
However, we still have room to discuss the relationship between SANS and SCE w/ LS because λ in SANS increases from zero during training, whereas λ in SCE w/ LS is fixed. To introduce the behavior of λ in SANS to SCE w/ LS, we tried a simple approach in our experiments that trains KGE models via SCE w/ LS using pre-trained embeddings from SCE as initial parameters. Though this approach is not exactly equivalent to SANS, we expected it to work similarly to increasing λ from zero in training.
We also discuss the relationship between NS w/ Freq and SCE w/ BC. While NS w/ Freq is often used for learning word embeddings, neither NS w/ Freq nor SCE w/ BC has been explored in KGE. We investigated whether these loss functions are effective for pre-training KGE models. (As a preliminary experiment, we also trained KGE models directly with NS w/ Freq and SCE w/ BC; however, these methods did not improve the link-prediction performance because frequency-based noise changes the data distribution drastically.) Because SANS and SCE w/ LS are similar methods to NS w/ Freq and SCE w/ BC in terms of smoothing, in our experiments we also compared NS w/ Freq with SANS and SCE w/ BC with SCE w/ LS as pre-training methods.

Divergences
Comparing Ψ(z) for the NS and SCE losses is as important as focusing on their objective distributions. Ψ(z) determines the distance between the model-predicted and data distributions in the loss and thus has an important role in determining the behavior of the model. Figure 1 shows the distance in Eq. (3) between a probability p and the probability 0.5 for each Ψ in Table 1. As we can see from the example, d_{Ψ(z)}(0.5, p) of the SCE loss has a larger distance than that of the NS loss. In fact, Painsky and Wornell (2020) proved that, for binary labels, the Bregman divergence induced by any choice of Ψ(z) is upper-bounded, up to a multiplicative constant, by that of the logarithmic loss, which corresponds to the SCE loss. This means that the SCE loss imposes a larger penalty on the same predicted value than the NS loss when the value of the learning target is the same between the two losses.
However, this does not guarantee that the distance of SCE is always larger than that of NS, because the values of the learning target of the two losses are not always the same. To take into account a property that holds in general, we also focus on the convexity of the loss functions. For each training instance, the first-order and second-order derivatives of these loss functions indicate that SCE is convex but NS is not in their domains. Since this property is independent of the objective distribution, we can consider that SCE fits a model to the training data more strongly in general. Because of these features, SCE can be prone to overfitting.
Whether overfitting is a problem depends on how large the difference between the training and test data is. To measure this difference in a KG dataset, we calculated the Kullback-Leibler (KL) divergence for p(y|x) between the training and test data of commonly used KG datasets. To compute p(y|x), we first calculated p(e_i|r_k, e_j) = p(e_i|r_k) + p(e_i|e_j) on the basis of frequencies in the data and then calculated p(e_j|r_k, e_i) in the same manner. We treated both p(e_i|r_k, e_j) and p(e_j|r_k, e_i) as p(y|x). We denote p(y|x) in the training data as P and in the test data as Q. With these notations, we calculated D_KL(P||Q) as the KL divergence for p(y|x) between the test and training data. Figure 2 shows the results. There is a large difference in the KL divergence between FB15k-237 and WN18RR. We investigated how this difference affects the SCE and NS loss functions for learning KGE models. In a practical setting, the loss function's divergence is not the only factor affecting the fit to the training data; model selection also affects the fitting. However, understanding a model's behavior is difficult due to the complicated relationships among model parameters. For this reason, we experimentally investigated which combinations of models and loss functions are suitable for link prediction.
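The following sketch illustrates the D_KL(P||Q) measurement on a handful of hypothetical toy triples; the normalization of p(e_i|r_k) + p(e_i|e_j) over candidates and the small additive smoothing for unseen answers are assumptions made only to keep the toy computation well defined.

```python
import numpy as np
from collections import Counter

train = [("A", "r1", "B"), ("A", "r1", "C"), ("D", "r1", "B"), ("A", "r2", "C")]
test = [("A", "r1", "B"), ("D", "r2", "C")]
heads = sorted({h for h, _, _ in train + test})            # candidate answers for (?, r_k, e_j)

def head_dist(triples, r, t):
    c_r = Counter(h for h, r2, _ in triples if r2 == r)     # counts for p(e_i | r_k)
    c_t = Counter(h for h, _, t2 in triples if t2 == t)     # counts for p(e_i | e_j)
    p = np.array([c_r[h] / max(sum(c_r.values()), 1)
                  + c_t[h] / max(sum(c_t.values()), 1) + 1e-6 for h in heads])
    return p / p.sum()                                      # normalized p(e_i | r_k, e_j)

kl = 0.0
for _, r, t in test:                                        # queries (?, r_k, e_j) from the test split
    P, Q = head_dist(train, r, t), head_dist(test, r, t)
    kl += np.sum(P * np.log(P / Q)) / len(test)
print(kl)
```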

Experiments and Discussion
We conducted experiments to investigate the validity of what we explained in Section 5 through a comparison of the NS and SCE losses.

Experimental Settings
We evaluated RESCAL, ComplEx, DistMult, TransE, RotatE, and TuckER on the FB15k-237 and WN18RR datasets in terms of Mean Reciprocal Rank (MRR), Hits@1, Hits@3, and Hits@10. We used LibKGE (Broscheit et al., 2020) as the implementation. So that each model can handle queries in both directions, we also trained a model for the reverse direction that shares the entity embeddings with the model for the forward direction.
To determine the hyperparameters of these models, for RESCAL, ComplEx, DistMult, and TransE with SCE and SCE w/ LS, we used the settings that achieved the highest performance in a previous study (Ruffinelli et al., 2020) for each loss function, and for TuckER and RotatE, we used the settings from their original papers. For TransE with NS and SANS, we used the settings of Sun et al. (2019). When applying SANS, we set α to the LibKGE initial value of 1.0 for all models except TransE and RotatE; for TransE and RotatE, we followed the settings of their original paper, in which SANS was used. When applying SCE w/ LS, we set λ to the LibKGE initial value of 0.3, except for TransE and RotatE. Because the SANS settings for TransE and RotatE were tuned in their original paper, for a fair comparison we selected λ for TransE and RotatE from {0.3, 0.1, 0.01} using the development data. Appendix D in the supplemental material details the experimental settings. Table 2 shows the results for each combination of loss and model. In the following subsections, we discuss whether our findings hold in a practical setting on the basis of these results.

Objective Distributions
In terms of the objective distribution, when SCE w/ LS improves performance, SANS also improves performance in many cases. This accords with our finding that SCE w/ LS and SANS have similar effects. For TransE and RotatE, this relationship does not hold, but as we will see later, this is probably because TransE with SCE and RotatE with SCE did not fit the training data. If SCE does not fit the training data, the effect of SCE w/ LS, which amounts to smoothing, is also suppressed.

Divergences
Next, let us focus on the distance induced by the loss functions. A comparison of the results on WN18RR and FB15k-237 shows no performance degradation of SCE compared with NS. This indicates that the difference between the training and test data in WN18RR is not so large as to cause overfitting problems for SCE.
In terms of the combination of models and loss functions, the results of NS are worse than those of SCE for TuckER, RESCAL, ComplEx, and DistMult. Because these four models have no constraint that prevents them from fitting the training data, we consider the lower scores to be caused by underfitting. This conjecture is based on the fact that, in terms of divergence and convexity, the NS loss fits model-predicted distributions to training-data distributions more weakly than the SCE loss.
In contrast, this performance gap between NS and SCE does not appear, or is even reversed, for TransE and RotatE. Since SCE has a normalization term, it is difficult for a model to assign probabilities close to 1 when its score function is bounded. This feature prevents TransE and RotatE from completely fitting the training data. Therefore, we can assume that NS can be a useful loss function when the score function is bounded.

Effectiveness of Pre-training Methods
We also explored pre-training for learning KGE models. We selected the methods in Table 2 that achieved the best MRR for each NS-based loss and each SCE-based loss on each dataset. In accordance with the success of word2vec, we chose unigram noise for both NS w/ Freq and SCE w/ BC. Table 3 shows the results. Contrary to our expectations, SCE w/ BC does not work well as a pre-training method. Because the unigram noise of SCE w/ BC can drastically change the original data distribution, SCE w/ BC should be effective when the difference between the training and test data is large. However, since this difference is not large in the KG datasets, as discussed in the previous subsection, we believe the unigram noise is unsuitable for these datasets.
Compared with SCE w/ BC, both SCE w/ LS and SANS are effective for pre-training. This is because the hyperparameters of SCE w/ LS and SANS are adjusted for KG datasets.
When using vanilla SCE as a pre-training method, there is little improvement in prediction performance compared with the other methods. This result suggests that increasing λ during training is not crucial for improving task performance.
For RotatE, there is no improvement from pre-training. Because RotatE has strict constraints on its relation representation, we believe these constraints may degrade the effectiveness of pre-training.

Related Work
Several studies (Ruffinelli et al., 2020; Ali et al., 2020) have investigated the best combinations of scoring methods, loss functions, and their hyperparameters on KG datasets. These studies differ from ours in that they focused on empirically searching for good combinations rather than theoretical investigation. As a theoretical study, Levy and Goldberg (2014) showed that NS is equivalent to factorizing a PMI matrix when a unigram distribution is selected as the noise distribution. Dyer (2014) investigated the difference between NCE (Gutmann and Hyvärinen, 2010) and NS. Gutmann and Hirayama (2011) revealed that NCE is derivable from the Bregman divergence; our derivation for NS is inspired by their work. Meister et al. (2020) proposed a framework to jointly interpret label smoothing and the confidence penalty (Pereyra et al., 2017) by investigating their divergences. Yang et al. (2020) theoretically showed that a noise distribution close to the true distribution behind the training data is suitable for training KGE models with NS. They also proposed a variant of SANS on the basis of their investigation.
Different from these studies, we investigated the distributions at optimal solutions of SCE and NS loss functions while considering several types of noise distribution in NS.

Conclusion
We revealed the relationships between SCE and NS loss functions in KGE. Through theoretical analysis, we showed that SCE and NS w/ Uni are equivalent in objective distribution, which is the predicted distribution of a model at an optimal solution, and that SCE w/ LS and SANS have similar objective distributions. We also showed that SCE more strongly fits a model to the training data than NS due to the divergence and convexity of SCE.
The experimental results indicate that the difference in divergence between the two losses did not cause problems because the differences between the training and test data of these datasets are not large. The results also indicate that SCE works well with highly flexible scoring methods whose scores have no bound, while NS works well with RotatE, whose score function is bounded. Moreover, they indicate that SCE w/ LS and SANS work better as pre-training methods than NS w/ Freq, which is commonly used for learning word embeddings.