Locality Preserving Sentence Encoding

Although research on word embeddings has made great progress in recent years, many tasks in natural language processing operate on the sentence level. Thus, it is essential to learn sentence embeddings. Recently, Sentence BERT (SBERT) was proposed to learn embeddings on the sentence level; it uses the inner product (or cosine similarity) to compute semantic similarity between sentences. However, this measurement cannot well describe the semantic structures among sentences. The reason is that sentences may lie on a manifold in the ambient space rather than being distributed in a Euclidean space, so cosine similarity cannot approximate distances on the manifold. To tackle this problem, we propose a novel sentence embedding method called Sentence BERT with Locality Preserving (SBERT-LP), which discovers the sentence submanifold in a high-dimensional space and yields a compact sentence representation subspace by locally preserving the geometric structures of sentences. We compare the SBERT-LP with several existing sentence embedding approaches from three perspectives: sentence similarity, sentence classification, and sentence clustering. Experimental results and case studies demonstrate that our method encodes sentences better in the sense of semantic structures.


Introduction
Word embeddings aim to learn semantically meaningful word representations based on the distributional hypothesis (Mikolov et al., 2013). Both context-free (Pennington et al., 2014) and contextual (Peters et al., 2018) word embeddings have made great progress on various downstream tasks: Text Classification (Aggarwal and Zhai, 2012), Dialogue Systems (Chen et al., 2017), and Text Clustering (Allahyari et al., 2017). However, in the real world, most Natural Language Processing tasks are on the sentence level. Hence, recent studies (Lin et al., 2017; Wang and Kuo, 2020) encode sentences into a dense vector space, which is described as the sentence space. These sentence embedding approaches generally fall into two categories. One is based on supervised learning, including InferSent (Conneau et al., 2017), the Universal Sentence Encoder (Cer et al., 2018), and SBERT (Reimers and Gurevych, 2019). The other is based on unsupervised learning, such as SkipThought vectors (Kiros et al., 2015), FastSent (Hill et al., 2016), and the Transformer-based Sequential Denoising Auto-Encoder (TSDAE) (Wang et al., 2021). The unsupervised way overcomes, to some extent, the limitation of labeled data in different domains and the cost of data annotation. Both represent a sentence as a point in the sentence space, where similar sentences are close.
There are two important problems in text processing: how to represent texts and how to evaluate their semantic similarity (He et al., 2004). Recently, various strategies have been taken to represent a sentence. For example, SBERT (Reimers and Gurevych, 2019) learns semantic sentence representations with a Siamese network on top of BERT. Additionally, some variants have been proposed, such as SBERT-WK (Wang and Kuo, 2020) and BERT-flow (Li et al., 2020). The sentence space of SBERT is associated with a Euclidean structure, and cosine similarity is employed to measure semantic similarity. However, previous studies have demonstrated that human-generated text data are probably sampled from a submanifold of the ambient Euclidean space (Cai et al., 2005). As a result, sentence representations yielded by SBERT may lie on a manifold, which can be either linear or non-linear. The semantic similarity between sentences is then the length of the shortest path, which may be a curve, on the manifold. Hence, using cosine similarity to approximate the length of a curve is inaccurate.
To obtain correct semantic structures of the sentence space, one straightforward way is to calculate the geodesic distance (Varini et al., 2006), which is the length of the shortest path between two points on the possibly curved manifold (Ghojogh et al., 2020). However, because it requires traversing from one point to another on the manifold, the geodesic distance is hard to approximate. Therefore, we aim to find an optimal Euclidean subspace of the sentence manifold, in which cosine similarity effectively measures semantic relations between sentences. To implement this, we borrow the idea of Locally Linear Embedding (LLE) (Roweis and Saul, 2000), an effective way to develop low-dimensional representations when data arise from sampling a probability distribution on a manifold (Cai et al., 2005). We then propose Sentence BERT with Locality Preserving (SBERT-LP), which combines the locality property with BERT. Our method highlights the local geometric structures of sentences. To be specific, the SBERT-LP first discovers the intrinsic manifold structure of the original sentence space. A new Euclidean sentence subspace is then learned from the submanifold by preserving the local geometric information of sentences, where the local geometric structure is defined by each sentence and its neighboring sentences. Preserving locality avoids losing too much useful sentence information during the projection. Finally, cosine similarity between sentences is consistent with their semantic similarity. Our contributions are threefold: (1) We analyze theoretically, from the perspective of the manifold hypothesis, why BERT-induced sentence embeddings show poor performance when retrieving semantically similar sentences.
(2) We propose the SBERT-LP, which obtains better representations in the sense of semantic structure by using locality preserving. Sentences related to the same semantics remain close to each other in the new Euclidean subspace. Our model is unsupervised and requires no fine-tuning.
(3) We conduct experiments on three tasks. Experimental results and case studies demonstrate that the SBERT-LP is superior to other existing sentence embedding methods on various tasks.

Related work
Existing sentence embedding approaches are divided into two categories: non-parametric sentence embeddings and parametric sentence embeddings (Wang and Kuo, 2020).
The non-parametric way is to derive sentence embeddings from pre-trained word embeddings (Mikolov et al., 2013; Pennington et al., 2014) via linear aggregation. For example, SIF (Arora et al., 2017) uses smooth inverse frequency to weight each word in a sentence and removes some special directions with PCA. Besides, uSIF (Ethayarajh, 2018) builds upon the random walk model by setting the probability of word generation inversely related to the angular distance between the word and sentence embeddings. Although the non-parametric methods have proved efficient, neglecting word order and sentence structure degrades their performance.
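As an illustration, the SIF weighting scheme described above can be sketched with NumPy (a toy example under our own assumptions, not the authors' released code; the function name `sif_embeddings` and the tiny vocabulary are ours, and every sentence is assumed to contain at least one in-vocabulary word):

```python
import numpy as np

def sif_embeddings(sentences, word_vecs, word_freq, a=1e-3):
    """SIF-style sentence embeddings: a weighted average of word vectors,
    followed by removal of the first principal component."""
    total = sum(word_freq.values())
    emb = []
    for sent in sentences:
        words = [w for w in sent.split() if w in word_vecs]
        # weight each word by a / (a + p(w)), where p(w) is its unigram probability
        vecs = np.array([word_vecs[w] * a / (a + word_freq[w] / total) for w in words])
        emb.append(vecs.mean(axis=0))
    emb = np.array(emb)
    # remove the common direction shared by all sentences (first right singular vector)
    u = np.linalg.svd(emb, full_matrices=False)[2][0]
    return emb - np.outer(emb @ u, u)
```

After the projection step, every sentence vector is orthogonal to the removed common direction, which is the effect SIF relies on.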
In order to incorporate richer sentence information, parametric sentence embeddings were proposed. For example, SkipThought (Kiros et al., 2015) borrows the idea of skip-gram from word2vec: it encodes sentences with the aim of predicting adjacent sentences. With the success of BERT (Devlin et al., 2019) on various NLP tasks (Sun et al., 2019; Clinchant et al., 2019), some BERT-based sentence embedding methods have been proposed recently. In addition to SBERT, SBERT-WK encodes sentences with QR factorization, re-weighting each word in a sentence. Furthermore, BERT-flow (Li et al., 2020) leverages normalizing flows to transform the BERT sentence space into a standard Gaussian latent space that is isotropic. It concludes that the inner product may not accurately represent semantic similarity in the sentence space because of the non-smooth semantic structure. In contrast, the SBERT-LP analyzes and solves the cosine metric problem of the SBERT sentence space on a manifold. Our work is inspired by investigations of local geometry in the word space (Hasan and Curry, 2017; Yonghe et al., 2019), which solve semantic problems in the word space. Since word and sentence embeddings share the same kind of high-dimensional space, problems with word embeddings may also exist in sentence embeddings (Li et al., 2020). To the best of our knowledge, this paper is the first to solve the semantic metric problem in the sentence space by incorporating a locality preserving ability.

Methodology
In this section, we first give a brief introduction to SBERT. Then, we show how to effectively preserve the locality of sentences to solve the problem of semantic similarity metrics.

Figure 1: The architecture of SBERT-LP: (a) obtaining the high-dimensional sentence space from pre-trained SBERT; (b) constructing the kNN graph on the sentence submanifold; (c) calculating the optimal reconstruction weights of each sentence on the submanifold; (d) encoding sentences into a new Euclidean subspace, which has better semantic structures.

Sentence BERT
Sentence BERT (SBERT) is an efficient way to produce semantically meaningful sentence embeddings. It integrates a Siamese network with a pre-trained BERT language model. The pre-trained SBERT sentence embeddings are trained on the SNLI and Multi-Genre NLI datasets, and cosine similarity is used to obtain semantic similarity between sentences. More details are provided in (Reimers and Gurevych, 2019).

Sentence BERT with Locality Preserving
To solve the semantic metric problem in the sentence space, we develop the SBERT-LP to encode sentences. Specifically, our method first constructs an adjacency graph, which captures the local geometric structure of the original sentence space. Then, a new Euclidean subspace for sentence representation is learned by leveraging Locally Linear Embedding. The new subspace allows cosine similarity to measure semantic similarity correctly.

Problem Definition
Given a set of sentences S = {s_1, s_2, ..., s_m}, we first use SBERT to obtain high-dimensional representations of S, denoted as D = {d_1, d_2, ..., d_m} with d_i \in \mathbb{R}^n. The problem is how to find a lower-dimensional embedding y_i of each d_i such that the inner product y_i^\top y_j represents the correct semantic relationship between d_i and d_j.

Locality Preserving Embedding
Learning sentence embeddings via preserving the locality of each sample proceeds in the following four steps.
Step 1: Obtaining the original sentence space from pre-trained sentence embeddings
Given a set of sentences S = {s_1, s_2, ..., s_m}, where m is the total number of sentences, we make use of SBERT to project the sentences into a high-dimensional sentence space:

d_i = \mathrm{SBERT}(s_i), \quad (1)

where d_i \in \mathbb{R}^n and n is the dimensionality of the sentence space. In this paper, we use the BERT-base and BERT-large pre-trained models, for which n is 768 and 1024, respectively.
Step 2: Constructing a k-Nearest Neighbors graph of sentences
We denote the sentence representations obtained by SBERT as D = {d_1, d_2, ..., d_m}. For all sentences on the submanifold, we construct a k-Nearest Neighbors (kNN) graph. Specifically, we first calculate the pairwise Euclidean distances between sentences, and then select the top-k nearest sentences as the neighbors of each sentence. Let d_{ij} \in \mathbb{R}^n denote the j-th neighbor of the i-th sentence vector d_i, and let the matrix D_i := [d_{i1}, ..., d_{ik}] \in \mathbb{R}^{n \times k} represent the k neighbors of d_i.
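Step 2 can be sketched in a few lines of NumPy (a minimal illustration, not the authors' implementation; the function name `knn_neighbors` is ours):

```python
import numpy as np

def knn_neighbors(D, k):
    """For each sentence vector d_i (a row of D), return the indices of its
    k nearest neighbors under Euclidean distance, excluding itself."""
    # squared pairwise distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (D ** 2).sum(axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (D @ D.T)
    np.fill_diagonal(dist2, np.inf)  # a sentence is not its own neighbor
    return np.argsort(dist2, axis=1)[:, :k]
```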
Step 3: Reconstructing sentences via local geometric structures on the manifold
The third step is to find the optimal reconstruction weights of every sentence based on the kNN graph. We formulate the linear reconstruction in the sentence space as:

\min_{W} \sum_{i=1}^{m} \Big\| d_i - \sum_{j=1}^{k} w_{ij} d_{ij} \Big\|_2^2, \quad (2)

subject to \sum_{j=1}^{k} w_{ij} = 1, \forall i \in \{1, ..., m\}, where W := [w_1, ..., w_m] \in \mathbb{R}^{k \times m} is the reconstruction weight matrix and w_i := [w_{i1}, ..., w_{ik}]^\top \in \mathbb{R}^k denotes the reconstruction weights of the i-th sentence.
Then, the objective function can be restated as:

\min_{W} \sum_{i=1}^{m} \| d_i - D_i w_i \|_2^2. \quad (3)

From the weight constraint \mathbf{1}^\top w_i = 1, we can imply that d_i = d_i \mathbf{1}^\top w_i, so the objective can be further simplified as:

\min_{W} \sum_{i=1}^{m} \| (d_i \mathbf{1}^\top - D_i) w_i \|_2^2. \quad (4)

Eventually, we rewrite the objective function (2) as:

\min_{W} \sum_{i=1}^{m} w_i^\top G_i w_i, \quad \text{s.t. } \mathbf{1}^\top w_i = 1, \quad (5)

where G_i := (d_i \mathbf{1}^\top - D_i)^\top (d_i \mathbf{1}^\top - D_i). To find the optimal W, we first define the Lagrangian for Eq. (5) as:

\mathcal{L} = \sum_{i=1}^{m} w_i^\top G_i w_i - \sum_{i=1}^{m} \lambda_i (\mathbf{1}^\top w_i - 1). \quad (6)

Then, we set the derivatives of the Lagrangian to zero:

\frac{\partial \mathcal{L}}{\partial w_i} = 2 G_i w_i - \lambda_i \mathbf{1} = 0 \;\Rightarrow\; w_i = \frac{\lambda_i}{2} G_i^{-1} \mathbf{1}, \qquad \frac{\partial \mathcal{L}}{\partial \lambda_i} = \mathbf{1}^\top w_i - 1 = 0. \quad (7)

We combine the two derivative results in Eq. (7):

\frac{\lambda_i}{2} \mathbf{1}^\top G_i^{-1} \mathbf{1} = 1 \;\Rightarrow\; \frac{\lambda_i}{2} = \frac{1}{\mathbf{1}^\top G_i^{-1} \mathbf{1}}. \quad (8)

Making use of Eqs. (7) and (8), we then have:

w_i = \frac{G_i^{-1} \mathbf{1}}{\mathbf{1}^\top G_i^{-1} \mathbf{1}}. \quad (9)

Finally, we obtain the optimal reconstruction weights W. Each sentence and its neighbors reflect the local geometric structure of the sentence manifold, and the optimal weights indicate in what proportion information should be passed from the neighbors.
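The closed-form solution for the reconstruction weights, w_i = G_i^{-1} 1 / (1^T G_i^{-1} 1), can be sketched in NumPy as follows (a minimal illustration; the small ridge term added to G_i is a standard LLE stabilization trick, not part of the derivation above, and `reconstruction_weights` is our name):

```python
import numpy as np

def reconstruction_weights(D, nbr_idx, reg=1e-3):
    """Closed-form LLE reconstruction weights: for each d_i, solve
    min ||d_i - sum_j w_ij d_ij||^2 subject to sum_j w_ij = 1."""
    m, k = nbr_idx.shape
    W = np.zeros((m, k))
    ones = np.ones(k)
    for i in range(m):
        Z = D[nbr_idx[i]] - D[i]          # neighbors centred on d_i
        G = Z @ Z.T + reg * np.eye(k)     # local Gram matrix, regularized for stability
        w = np.linalg.solve(G, ones)      # proportional to G^{-1} 1
        W[i] = w / w.sum()                # enforce the sum-to-one constraint
    return W
```

For a point lying exactly at the midpoint of its two neighbors, the recovered weights are [0.5, 0.5], as the geometry dictates.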
Step 4: Finding the optimal Euclidean sentence subspace
The SBERT-LP aims to maintain the locality (the optimal weights) of the sentence manifold within the Euclidean sentence subspace. Thus, in this step, we encode sentences into the subspace using the locality of the sentence manifold. We formulate the optimization problem of this embedding as:

\min_{Y} \sum_{i=1}^{m} \Big\| y_i - \sum_{j=1}^{m} \tilde{w}_{ij} y_j \Big\|_2^2, \quad (10)

subject to \frac{1}{m} \sum_{i=1}^{m} y_i y_i^\top = I and \sum_{i=1}^{m} y_i = 0, where I is the identity matrix, y_i \in \mathbb{R}^p is the i-th embedded sentence, and p is the dimensionality of the Euclidean sentence embeddings. We denote the set of embedded sentences as a matrix Y = [y_1, y_2, ..., y_m]^\top \in \mathbb{R}^{m \times p}. \tilde{w}_{ij} is the weight between two sentences: if the j-th sentence is a neighbor of the i-th sentence, \tilde{w}_{ij} is set to the corresponding weight w_{ij} obtained in the third step; otherwise it equals zero. The weight \tilde{w}_{ij} can be formulated as:

\tilde{w}_{ij} = \begin{cases} w_{ij}, & s_j \in \mathrm{kNN}(s_i), \\ 0, & \text{otherwise}. \end{cases} \quad (11)

We then define the weight vector for the i-th sentence as \tilde{w}_i = [\tilde{w}_{i1}, \tilde{w}_{i2}, ..., \tilde{w}_{im}]^\top, and a one-hot vector \mathbf{1}_i = [0, ..., 1, ..., 0]^\top whose i-th element is one while the others are zero. The objective function can then be rewritten as:

\min_{Y} \sum_{i=1}^{m} \| Y^\top \mathbf{1}_i - Y^\top \tilde{w}_i \|_2^2. \quad (12)

The formula is then simplified into matrix form:

\min_{Y} \| Y - \tilde{W} Y \|_F^2, \quad (13)

where \tilde{W} = [\tilde{w}_1, \tilde{w}_2, ..., \tilde{w}_m]^\top \in \mathbb{R}^{m \times m} and \|\cdot\|_F denotes the Frobenius norm. We further simplify Eq. (13) as:

\min_{Y} \mathrm{tr}(Y^\top M Y), \quad (14)

where \mathrm{tr}(\cdot) is the trace of a matrix and M = (I - \tilde{W})^\top (I - \tilde{W}) \in \mathbb{R}^{m \times m}. Then, the objective function in Eq. (10) is formulated as:

\min_{Y} \mathrm{tr}(Y^\top M Y), \quad \text{s.t. } \frac{1}{m} Y^\top Y = I, \; \sum_{i=1}^{m} y_i = 0. \quad (15)

If we ignore the second constraint, the Lagrangian \mathcal{L} for Eq. (15) is:

\mathcal{L} = \mathrm{tr}(Y^\top M Y) - \mathrm{tr}\Big( \Lambda^\top \Big( \frac{1}{m} Y^\top Y - I \Big) \Big), \quad (16)

where \Lambda \in \mathbb{R}^{p \times p} is a diagonal matrix containing the Lagrange multipliers. Then, we set the derivative of \mathcal{L} to zero:

\frac{\partial \mathcal{L}}{\partial Y} = 2 M Y - \frac{2}{m} Y \Lambda = 0 \;\Rightarrow\; M Y = Y \Big( \frac{1}{m} \Lambda \Big). \quad (17)

Thus, the columns of Y are eigenvectors of M, namely those with the smallest eigenvalues after discarding the trivial constant eigenvector, and Y represents the target sentence embeddings.
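Step 4 can be sketched in NumPy as follows (a minimal illustration under our own simplifications: we keep the p eigenvectors of M with the smallest non-trivial eigenvalues, and return unit-norm eigenvector columns rather than rescaling by sqrt(m) as the variance constraint would strictly require; `embed` is our name):

```python
import numpy as np

def embed(D, nbr_idx, W_local, p):
    """Embed sentences into a p-dimensional Euclidean subspace via the
    eigenvectors of M = (I - W)^T (I - W) with the smallest eigenvalues,
    skipping the trivial eigenvector with eigenvalue ~0."""
    m = D.shape[0]
    W = np.zeros((m, m))
    for i in range(m):
        W[i, nbr_idx[i]] = W_local[i]      # scatter local weights into the m x m matrix
    M = (np.eye(m) - W).T @ (np.eye(m) - W)
    vals, vecs = np.linalg.eigh(M)         # eigenvalues in ascending order
    return vecs[:, 1:p + 1]                # drop the first (trivial) eigenvector
```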

Experiments
In this section, we perform experiments on three tasks to demonstrate the effectiveness of the SBERT-LP. We first introduce the experimental settings for the datasets and hyper-parameters. Then we compare the SBERT-LP with several state-of-the-art sentence encoding methods. Finally, we analyze the effect of different parameters on the SBERT-LP, and use some cases from the STS datasets to illustrate the effectiveness of our model on semantic metric recovery. Sentence embeddings aim to cluster semantically similar sentences; therefore, we mainly focus on the performance of different models on the STS task and take the results of the other two tasks as references.

Experimental Settings and Datasets
To verify that SBERT-LP is able to learn better sentence representations in the sense of semantics, we set up three downstream tasks: Semantic Textual Similarity, Text Classification, and Text Clustering. We obtain high-dimensional sentence embeddings from two pre-trained models without fine-tuning: SBERT-base and SBERT-large. Fifteen datasets are leveraged across the three tasks: (1) For the Semantic Textual Similarity task, we use seven standard semantic textual similarity datasets: the STS tasks 2012-2016 (Agirre et al., 2012, 2013, 2014, 2015, 2016), the STS benchmark (Cer et al., 2017), and the SICK-Relatedness dataset (Marelli et al., 2014). Sentence pairs in these datasets are labeled with semantic similarity scores between 0 and 5.
(2) For the Text Classification task, we evaluate on seven SentEval datasets.
(3) For the Text Clustering task, we make use of the 20 Newsgroups dataset for evaluation.

Task Description
We evaluate the model for STS without leveraging any STS specific training data. We directly evaluate sentence embedding methods on the test data and compute the cosine similarity between sentences as the similarity score. The metric is Spearman's correlation, which is the same as (Reimers and Gurevych, 2019).
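The evaluation protocol above can be sketched as follows (a minimal NumPy illustration; this simple rank-transform Spearman implementation assumes no tied scores, unlike library versions that average tied ranks):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two sentence vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks
    (assumes no ties; double argsort turns scores into ranks 0..n-1)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / (np.linalg.norm(ra) * np.linalg.norm(rb)))
```

In the STS evaluation, `cosine` is applied to each embedded sentence pair, and `spearman` correlates the predicted similarities with the gold similarity labels.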

Results and Analysis
Table 1 reports the Spearman correlation results. The proposed SBERT-LP markedly outperforms the other competing methods in terms of this metric. Specifically, the SBERT-LP improves performance significantly compared with SBERT. This confirms that the SBERT-LP does a better job than SBERT of capturing semantic similarity between sentences, by preserving the local geometric structure of each sentence lying on the submanifold embedded in the ambient space. Besides, the SBERT-LP yields better results than BERT-flow, a strong sentence embedding baseline, on five datasets. It is reasonable to conclude that the manifold distribution hypothesis is more effective for sentence representations in the sense of semantic structures than the Gaussian latent space.

Task Description
SBERT leverages Logistic Regression (LR) as the classifier on the text classification task. However, the parameters of the LR classifier may influence the experimental results. Hence, we make use of the non-parametric k-nearest neighbor (kNN) algorithm as the classifier. The distance metric of kNN is the Euclidean distance, and k is set to 3 empirically. Accuracy is used to evaluate the classification performance of the models.
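The kNN classification protocol can be sketched as follows (a minimal NumPy illustration with k=3 as in the paper; `knn_predict` is our name for it):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_test, k=3):
    """Non-parametric kNN classifier: each test embedding takes the majority
    label of its k nearest training embeddings under Euclidean distance."""
    preds = []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1)     # distances to all training points
        nearest = np.argsort(d)[:k]                  # indices of the k closest
        preds.append(Counter(y_train[i] for i in nearest).most_common(1)[0][0])
    return preds
```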

Results and Analysis
The accuracy comparison results on the seven SentEval datasets are depicted in table 2. Even though transfer learning is not the purpose of the SBERT-LP, it outperforms other state-of-the-art sentence embedding methods on three datasets. We can observe from these results that the SBERT-LP performs better than SBERT. Therefore, we attribute the improvement achieved by the SBERT-LP over SBERT and its variants to the locality preserving property brought by LLE. However, the result of the SBERT-LP on the TREC dataset is not satisfactory. The reason is that the USE is trained on question-answer tasks, which are of the same type as the TREC dataset (Reimers and Gurevych, 2019).

Task Description
We make use of K-means (MacQueen et al., 1967), which is based on a distance metric, for clustering. Four indicators are employed to evaluate the performance: Mutual Information (MI), Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), and Purity.
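Two of the clustering indicators, Purity and NMI, can be computed from scratch as follows (a minimal pure-Python sketch; library implementations may differ in the normalization used for NMI, where we normalize by the geometric mean of the two entropies):

```python
import math
from collections import Counter

def purity(labels_true, labels_pred):
    """Purity: assign each cluster its majority class and report the
    fraction of points covered by those majority classes."""
    clusters = {}
    for t, p in zip(labels_true, labels_pred):
        clusters.setdefault(p, []).append(t)
    return sum(Counter(m).most_common(1)[0][1] for m in clusters.values()) / len(labels_true)

def nmi(labels_true, labels_pred):
    """Normalized mutual information: I(T;P) / sqrt(H(T) * H(P))."""
    n = len(labels_true)
    ct, cp = Counter(labels_true), Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    mi = sum(c / n * math.log(c * n / (ct[t] * cp[p])) for (t, p), c in joint.items())
    ht = -sum(c / n * math.log(c / n) for c in ct.values())
    hp = -sum(c / n * math.log(c / n) for c in cp.values())
    return mi / math.sqrt(ht * hp)
```

Both scores reach 1.0 for a clustering that exactly matches the class structure, regardless of how the cluster ids are permuted.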

Results and Analysis
As shown in table 3, the SBERT-LP significantly outperforms SBERT. This provides empirical evidence that the better semantic relationships among sentences recovered by the SBERT-LP also encode the clustering structure better. Similar sentences are closer in the sentence space given by the SBERT-LP, while dissimilar sentences are further apart. However, we also find that the Universal Sentence Encoder (USE) achieves the best results in terms of all metrics. The reason is that the USE has more intra-class consistency than other sentence embedding methods.

Parameters Analysis
Having shown the superiority of the SBERT-LP, in this section, we compare the performance in different neighborhood numbers and the performance in different dimensionalities.

Selection of the number of neighbors
Our method is based on LLE, so the selection of the number of neighbors is important for constructing the local geometric structure on the sentence manifold. Several algorithms, such as Residual Variance and Procrustes Statistics, have been proposed to find the optimal number of neighbors (Ghojogh et al., 2020). However, we experimentally find that the number of neighbors obtained by these methods is not optimal. Therefore, grid search is employed to find the optimal number of neighbors. Figures 2 and 3 show the relationship between the number of neighbors and the performance on different downstream tasks.

Dimensionality of the Euclidean embeddings
The dimensionality of the original sentence space is usually 768 or 1024. Although high-dimensional sentence representations contain a wealth of semantic information, only part of that information benefits downstream tasks. Besides, overwhelmingly complex sentence feature sets slow down classification or regression models and make finding global optima difficult. The SBERT-LP alleviates this problem to a large extent. Specifically, it maps sentences into a lower-dimensional space, which reduces the number of learnable parameters for downstream tasks. We experimentally observe that there is no consistent rule for selecting the dimensionality. To be specific, the optimal dimensionality of the target space often varies greatly from task to task. For example, for Sentiment Analysis, classification results are optimal when the dimensionality is in the range of 16-64, while the optimal range for the STS task is 128-300. This may be because the universal sentence embeddings obtained by the SBERT-LP contain much less sentiment-related information than semantic information. More details are reported in figure 2.

Qualitative Analysis
To verify that the SBERT-LP can make cosine similarity a valid measure of semantic similarity, we select some cases for illustration, shown in table 4. The two pairs of sentences and their labels are selected from the STS13 dataset. The labels indicate that the semantic distance between the two sentences of sentence pair 0 should be smaller than that of sentence pair 1. However, from the result of Sim_1, we can observe that the relationship between the two pairs of sentences is reversed by SBERT. This phenomenon shows that using cosine similarity to capture the semantic structures of SBERT is invalid. The result of Sim_2 then shows that the SBERT-LP well solves the semantic similarity problem existing in SBERT. To be specific, the SBERT-LP takes advantage of the locality preserving property to transform the sentence manifold in the ambient space into Euclidean sentence embeddings while keeping the semantic relationships between sentences unchanged.

Table 4: Sentence pairs and their similarity scores given by cosine similarity. Sim_0 is the manual label; Sim_1 is given by SBERT; Sim_2 is given by the SBERT-LP.

Conclusion
In this paper, we propose the SBERT-LP, a simple yet effective sentence encoding method. To solve the metric problem in the sentence space, the method exploits the idea of locality preserving to recover cosine similarity as a valid semantic metric. It not only captures the sentence submanifold but also rebuilds a Euclidean sentence subspace. Experimental results on three tasks demonstrate that the SBERT-LP learns better sentence representations in the sense of semantic structures.