Unsupervised Sentence-embeddings by Manifold Approximation and Projection

The concept of unsupervised universal sentence encoders has gained traction recently, wherein pre-trained models generate effective task-agnostic fixed-dimensional representations for phrases, sentences and paragraphs. Such methods are of varying complexity, from simple weighted-averages of word vectors to complex language-models based on bidirectional transformers. In this work we propose a novel technique to generate sentence-embeddings in an unsupervised fashion by projecting the sentences onto a fixed-dimensional manifold with the objective of preserving local neighbourhoods in the original space. To delineate such neighbourhoods we experiment with several set-distance metrics, including the recently proposed Word Mover’s distance, while the fixed-dimensional projection is achieved by employing a scalable and efficient manifold approximation method rooted in topological data analysis. We test our approach, which we term EMAP or Embeddings by Manifold Approximation and Projection, on six publicly available text-classification datasets of varying size and complexity. Empirical results show that our method consistently performs similar to or better than several alternative state-of-the-art approaches.


On sentence-embeddings
Dense vector representations of words, or word-embeddings, form the backbone of most modern NLP applications and can be constructed using context-free (Bengio et al., 2003; Mikolov et al., 2013; Pennington et al., 2014) or contextualized methods (Peters et al., 2018; Devlin et al., 2019).
Given that practical systems often benefit from having representations for sentences and documents in addition to word-embeddings (Palangi et al., 2016; Yan et al., 2016), a simple trick is to use the weighted average over some or all of the embeddings of words in a sentence or document. Although sentence-embeddings constructed this way often lose information because word order is disregarded during averaging, they have been found to be surprisingly performant (Aldarmaki and Diab, 2018).
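As a minimal sketch of the averaging trick described above, the following uses made-up 2-dimensional vectors in place of real word-embeddings. The function name `average_embedding` and the toy data are illustrative assumptions, not part of any cited method.

```python
def average_embedding(word_vectors, weights=None):
    """Return the (optionally weighted) dimension-wise mean of word vectors."""
    if not word_vectors:
        raise ValueError("sentence has no word vectors")
    if weights is None:
        # Uniform weights give the plain arithmetic mean.
        weights = [1.0] * len(word_vectors)
    total = sum(weights)
    dim = len(word_vectors[0])
    return [
        sum(w * vec[k] for w, vec in zip(weights, word_vectors)) / total
        for k in range(dim)
    ]


# Toy embeddings (hypothetical values): the same three words in two orders.
sentence_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sentence_b = [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]

embedding_a = average_embedding(sentence_a)
embedding_b = average_embedding(sentence_b)
```

Both sentences map to the same vector, which illustrates the loss of word-order information mentioned above.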
More sophisticated methods focus on jointly learning the embeddings of sentences and words using models similar to Word2Vec (Le and Mikolov, 2014; Chen, 2017), using encoder-decoder approaches that reconstruct the surrounding sentences of an encoded passage (Kiros et al., 2015), or training bi-directional LSTM models on large external datasets (Conneau et al., 2017). Meaningful sentence-embeddings have also been constructed by fine-tuning pre-trained bidirectional transformers (Devlin et al., 2019) using a Siamese architecture (Reimers and Gurevych, 2019).
In parallel to the approaches mentioned above, a stream of methods has emerged recently which exploit the inherent geometric properties of the structure of sentences, by treating them as sets or sequences of word-embeddings. For example, Arora et al. (2017) propose the construction of sentence-embeddings based on weighted word-embedding averages with the removal of the dominant singular vector, while Rücklé et al. (2018) produce sentence-embeddings by concatenating several power-means of word-embeddings corresponding to a sentence. Very recently, spectral decomposition techniques were used to create sentence-embeddings, which produced state-of-the-art results when used in concatenation with averaging (Kayal and Tsatsaronis, 2019; Almarwani et al., 2019).
Our work is most related to that of Wu et al. (2018), who use Random Features (Rahimi and Recht, 2008) to learn document embeddings which preserve the properties of an explicitly-defined kernel based on the Word Mover's Distance (Kusner et al., 2015). Where Wu et al. predefine the nature of the kernel, our proposed approach can learn the similarity-preserving manifold for a given set-distance metric, offering increased flexibility.

Motivation and contributions
A simple way to form sentence-embeddings is to compute the dimension-wise arithmetic mean of the embeddings of the words in a particular sentence. Even though this approach incurs information loss by disregarding the fact that sentences are sequences (or, at the very least, sets) of word vectors, it works well in practice. This already provides an indication that there is more information in the sentences to be exploited. Kusner et al. (2015) aim to use more of the information available in a sentence by representing sentences as a weighted point cloud of embedded words. Rooted in transportation theory, their Word Mover's distance (WMD) is the minimum cumulative distance that the embedded words of a sentence need to travel to reach the embedded words of another sentence. The approach achieves state-of-the-art results for sentence classification when combined with a k-NN classifier (Cover and Hart, 1967). Since their work, other distance metrics have been suggested (Singh et al., 2019; Wang et al., 2019), also motivated by how transportation problems are solved.
Considering that sentences are sets of word vectors, a large variety of methods exist in the literature that can be used to calculate the distance between two sets, in addition to the ones based on transport theory. Thus, as a first contribution, we compare alternative metrics to measure distances between sentences. The metrics we suggest, namely the Hausdorff distance and the Energy distance, are intuitive to explain and reasonably fast to calculate. The choice of these particular distances is motivated by their differing origins and their general usefulness in their respective application domains.
Once calculated, these distances can be used in conjunction with k-nearest neighbours for classification tasks, and k-means for clustering tasks. However, these learning algorithms are rather simplistic, and state-of-the-art machine learning algorithms require a fixed-length feature representation as input. Moreover, having fixed-length representations for sentences (sentence-embeddings) also provides a large degree of flexibility for downstream tasks, as compared to having only relative distances between them. With this as motivation, the second contribution of this work is to produce sentence-embeddings that approximately preserve the topological properties of the original sentence space. We propose to do so using an efficient, scalable manifold-learning algorithm from topological data analysis, termed UMAP (McInnes et al., 2018). Empirical results show that this process yields sentence-embeddings that deliver near state-of-the-art classification performance with a simple classifier.

Calculating distances
In this work, we experiment with three different distance measures to determine the distance between sentences. The first measure (Energy distance) is motivated by a useful linkage criterion from hierarchical clustering (Rokach and Maimon, 2005), while the second one (Hausdorff distance) is an important metric from algebraic topology that has been successfully used in document indexing (Tsatsaronis et al., 2012). The final metric (Word Mover's distance) is a recent extension of an existing distance measure between distributions, that is particularly suited for use with word-embeddings (Kusner et al., 2015).
Prior to defining the distances that have been used in this work, we first proceed to outline the notations that we will be using to describe them.

Notations
Let $W \in \mathbb{R}^{N \times d}$ denote a word-embedding matrix, such that the vocabulary corresponding to it consists of $N$ words, and each word in it, $w_i \in \mathbb{R}^d$, is $d$-dimensional. This word-embedding matrix and its constituent words may come from pre-trained representations such as Word2Vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014), in which case $d = 300$.
Let $S$ be a set of sentences and $s, s'$ be two sentences from this set. Each such sentence can be viewed as a set of word-embeddings, $\{w\} \in s$. Additionally, let the length of a sentence, $s$, be denoted as $|s|$, and the cardinality of the set, $S$, be denoted by $|S|$.
Let $e(w_i, w_j)$ denote the distance between two word-embeddings, $w_i, w_j$. In the context of this paper, this distance is Euclidean:

$$e(w_i, w_j) = \lVert w_i - w_j \rVert_2 \tag{1}$$

Finally, $D(s, s')$ denotes the distance between two sentences.

Energy distance
Energy distance is a statistical distance between probability distributions, based on the inter and intra-distribution variance, that satisfies all the criteria of being a metric (Székely and Rizzo, 2013).
Using the notations defined earlier, we write it as:

$$D(s, s') = \frac{2}{|s||s'|} \sum_{w_i \in s} \sum_{w_j \in s'} e(w_i, w_j) - \frac{1}{|s|^2} \sum_{w_i \in s} \sum_{w_k \in s} e(w_i, w_k) - \frac{1}{|s'|^2} \sum_{w_j \in s'} \sum_{w_l \in s'} e(w_j, w_l) \tag{2}$$

The original conception of the energy distance was inspired by the gravitational potential energy of celestial objects. Looking closely at Equation 2, it can be quickly observed that it has two parts: the first term resembles the attraction or repulsion between two objects (or in our case, sentences), while the second and the third terms indicate the self-coherence of the respective objects. As shown by Székely and Rizzo (2013), energy distance is scale equivariant, which would make it sensitive to contextual changes in sentences, and therefore make it useful in NLP applications.
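The three-term structure of the energy distance (one cross term between the two sentences, minus one self-coherence term per sentence) can be sketched in a few lines. This is a pure-Python illustration on made-up 2-dimensional vectors, assuming the common form with twice the mean cross-distance; it is not the dcor-based implementation used in the experiments, and the helper names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two word vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_pairwise(xs, ys):
    """Mean Euclidean distance over all pairs (x, y) with x in xs, y in ys."""
    return sum(euclidean(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def energy_distance(s, t):
    """Cross term between s and t minus the two within-sentence terms."""
    return 2.0 * mean_pairwise(s, t) - mean_pairwise(s, s) - mean_pairwise(t, t)


# Toy sentences (hypothetical values): two words each, in 2 dimensions.
s = [[0.0, 0.0], [1.0, 0.0]]
t = [[0.0, 1.0], [1.0, 1.0]]
d_st = energy_distance(s, t)  # equals sqrt(2) for this toy pair
```

For identical sentences the cross term and the self-coherence terms cancel, so the distance is zero.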

Hausdorff distance
Given two subsets of a metric space, the Hausdorff distance is the maximum distance of the points in one subset to the nearest point in the other. Significant work has gone into making it fast to calculate (Atallah, 1983) so that it can be applied to real-world problems, such as shape-matching in computer vision (Dubuisson and Jain, 1994).
To calculate it, the distance between each point from one set and the closest point from the other set is determined first. Then, the Hausdorff distance is calculated as the maximal point-wise distance. Considering sentences $\{s, s'\}$ as subsets of word-embedding space, $\mathbb{R}^{d \times N}$, the directed Hausdorff distance can be given as:

$$h(s, s') = \max_{w_i \in s} \min_{w_j \in s'} e(w_i, w_j) \tag{3}$$

such that the symmetric Hausdorff distance is:

$$D(s, s') = \max\{h(s, s'), h(s', s)\} \tag{4}$$
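The two-stage computation described above (nearest point first, then the maximum over points, then symmetrization) can be sketched as follows. The vectors are made up for illustration and the pure-Python helpers are stand-ins, not the SciPy routine used in the experiments.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two word vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def directed_hausdorff(s, t):
    """Largest distance from a word in s to its nearest word in t."""
    return max(min(euclidean(w, v) for v in t) for w in s)


def hausdorff(s, t):
    """Symmetric Hausdorff distance: the larger of the two directed values."""
    return max(directed_hausdorff(s, t), directed_hausdorff(t, s))


# Toy sentences (hypothetical values) chosen so the two directions differ.
s = [[0.0, 0.0], [1.0, 0.0]]
t = [[0.0, 0.0], [4.0, 0.0]]

forward = directed_hausdorff(s, t)   # 1.0: the word at (1, 0) is 1 away from (0, 0)
backward = directed_hausdorff(t, s)  # 3.0: the word at (4, 0) is 3 away from (1, 0)
symmetric = hausdorff(s, t)          # 3.0
```

The example shows why the directed form is not a metric on its own: a single outlying word in one sentence dominates only one direction.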

Word Mover's distance
In addition to the representation of a sentence as a set of word-embeddings, a sentence $s$ can also be represented as an $N$-dimensional normalized term-frequency vector, where $n^s_i$ is the number of times word $w_i$ occurs in sentence $s$ normalized by the total number of words in $s$:

$$n^s_i = \frac{c^s_i}{\sum_{k=1}^{N} c^s_k} \tag{5}$$

where $c^s_i$ is the number of times word $w_i$ appears in sentence $s$.
The goal of the Word Mover's distance (WMD) (Kusner et al., 2015) is to construct a sentence similarity metric based on the distances between the individual words within each sentence, given by Equation 1. In order to calculate the distance between two sentences, WMD introduces a transport matrix, $T \in \mathbb{R}^{N \times N}$, such that each element in it, $T_{ij}$, denotes how much of $n^s_i$ should be transported to $n^{s'}_j$. Then, the WMD between two sentences is given as the solution of the following minimization problem:

$$D(s, s') = \min_{T \geq 0} \sum_{i=1}^{N} \sum_{j=1}^{N} T_{ij}\, e(w_i, w_j) \quad \text{subject to} \quad \sum_{j=1}^{N} T_{ij} = n^s_i \;\; \forall i, \qquad \sum_{i=1}^{N} T_{ij} = n^{s'}_j \;\; \forall j \tag{6}$$

Thus, WMD between two sentences is defined as the minimum distance required to transport the words from one sentence to another.

[...] points, or those like Stochastic Neighbour Embedding (Hinton and Roweis, 2003; van der Maaten and Hinton, 2008) that preserve the conditional probabilities of points being neighbours. However, existing manifold-learning algorithms suffer from two shortcomings: they are computationally expensive and are often restricted in the number of output dimensions. In our work we use a method termed Uniform Manifold Approximation and Projection (UMAP) (McInnes et al., 2018), which is scalable and has no computational restrictions on the output embedding dimension.
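Returning briefly to the Word Mover's distance: the transport problem can be illustrated without a linear-programming solver in one special case. The sketch below assumes that both sentences contain the same number of distinct words, each occurring once, so that every word carries weight 1/n. Under that assumption an optimal transport plan can be taken to be a one-to-one matching of words, which is found here by brute force. This is only feasible for very short sentences and is not the implementation used in the experiments; the function name `wmd_uniform` and the toy vectors are hypothetical.

```python
import math
from itertools import permutations


def euclidean(u, v):
    """Euclidean distance between two word vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def wmd_uniform(s, t):
    """WMD for equal-length sentences whose words all have weight 1/n.

    Tries every one-to-one matching of words (factorial cost), so it is
    meant purely as an illustration on tiny inputs.
    """
    if len(s) != len(t):
        raise ValueError("this sketch only handles equal-length sentences")
    n = len(s)
    best_total = min(
        sum(euclidean(s[i], t[p[i]]) for i in range(n))
        for p in permutations(range(n))
    )
    # Each matched pair moves a mass of 1/n.
    return best_total / n


# Toy sentences (hypothetical values).
s = [[0.0, 0.0], [1.0, 0.0]]
t = [[1.0, 1.0], [0.0, 1.0]]
d_st = wmd_uniform(s, t)  # matching (0,0)->(0,1) and (1,0)->(1,1) costs 1.0
```

In the general case, with unequal lengths or repeated words, the full minimization over the transport matrix is needed and a dedicated optimal-transport or linear-programming solver would be the natural choice.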
The building block of UMAP is a particular type of simplicial complex, known as the Vietoris-Rips complex. Recalling that a k-simplex is a k-dimensional polytope which is the convex hull of its k + 1 vertices, and a simplicial complex is a set of simplices of various orders, the Vietoris-Rips simplicial complex is a collection of 0- and 1-simplices. In essence, this is a means of building a simple neighbourhood graph by connecting the original data points.

Figure 1: On the left is the original sentence-space, approximated by the nearest neighbours graph formed by the Vietoris-Rips complex. Instead of points and edges, our simplicial complex has sets of points and edges between them, formed by one of the distance metrics mentioned in Section 2.1. In this example, four sentences, denoted by S1 through S4, form two simplices, with S4 being a 0-simplex. The sentences are denoted by colored ellipses, while the high-dimensional embedding of each word in a sentence is depicted by a point having the same color as the parent sentence ellipse. The UMAP algorithm is then employed to find a similarity-preserving Euclidean embedding-space, shown on the right, by minimizing the cross-entropy between the two representations.
A key difference, in this work, to the original formulation is that an individual data sample (i.e., the vertex of a simplex) is not a d-dimensional point but a set of d-dimensional words that make up a sentence. By using any of the distance metrics defined in Section 2.1, it is possible to construct the simplicial complex that UMAP needs in order to build the topological representation of the original sentence space. An illustration can be found in Figure 1.
As per the formulation laid out for UMAP, the similarity between sentences $s$ and $s'$ is defined as:

$$v_{s'|s} = \exp\left(-\frac{D(s, s') - \rho_s}{\sigma_s}\right) \tag{7}$$

where $\sigma_s$ is a normalisation factor selected based on an empirical heuristic (see Algorithm 3 in the work of McInnes et al. (2018)), $D(s, s')$ is the distance between two sentences as outlined by Equation 2, 4 or 6, and $\rho_s$ is the distance of $s$ from its nearest neighbour. It is worth mentioning that, for scalability, $v_{s'|s}$ is calculated only for a predefined set of approximate nearest neighbours, the size of which is a user-defined input parameter to the UMAP algorithm, using the efficient nearest-neighbour descent algorithm (Dong et al., 2011).

The similarity depicted in Equation 7 is asymmetric, and symmetrization is carried out by a fuzzy set union using the probabilistic t-conorm:

$$v_{ss'} = v_{s'|s} + v_{s|s'} - v_{s'|s}\, v_{s|s'} \tag{8}$$

As UMAP builds a Vietoris-Rips complex governed by Equation 7, it can take advantage of the nerve theorem (Borsuk, 1948), which makes this construction homotopy equivalent to the original topological space. In our case, this implies that we can build a simple nearest neighbours graph from a given corpus of sentences, which has certain guarantees of approximating the original topological space, as defined by the aforementioned distance metrics.

The next step is to define a similar nearest neighbours graph in a fixed low-dimensional Euclidean space. Let $s_E, s'_E \in \mathbb{R}^{d_E}$ be the corresponding $d_E$-dimensional sentence-embeddings. Then the low-dimensional similarities are given by:

$$w_{ss'} = \left(1 + a \lVert s_E - s'_E \rVert^{2b}\right)^{-1} \tag{9}$$

where $\lVert s_E - s'_E \rVert$ is the Euclidean distance between the $d_E$-dimensional embeddings, and $a, b$ are input parameters, set to 1.929 and 0.791, respectively, as per the original implementation.

The final step of the process is to optimize the low-dimensional representation to have as close a fuzzy topological representation as possible to the original space. UMAP proceeds to do so by minimizing the cross-entropy between the two representations:

$$C = \sum_{s \neq s'} \left[ v_{ss'} \log\frac{v_{ss'}}{w_{ss'}} + (1 - v_{ss'}) \log\frac{1 - v_{ss'}}{1 - w_{ss'}} \right] \tag{10}$$

which is usually done via stochastic gradient descent.
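The quantities involved in this construction can be sketched as small functions. The sketch assumes the standard UMAP forms: an exponential kernel on the distance beyond the nearest neighbour, the probabilistic t-conorm p + q - pq for symmetrization, a heavy-tailed kernel 1 / (1 + a * dist^(2b)) in the embedding space, and a fuzzy-set cross-entropy as the objective. The normalisation factor sigma is taken as a given input rather than estimated, the function names are hypothetical, and the snippet is not the optimisation loop of the actual UMAP implementation.

```python
import math


def directional_similarity(distance, rho, sigma):
    """High-dimensional similarity of a neighbour at `distance`.

    `rho` is the distance to the nearest neighbour and `sigma` a
    normalisation factor, assumed to be supplied by the caller.
    """
    return math.exp(-(distance - rho) / sigma)


def fuzzy_union(p, q):
    """Probabilistic t-conorm used to symmetrize two directional values."""
    return p + q - p * q


def low_dim_similarity(x, y, a=1.929, b=0.791):
    """Similarity between two embedded points, with a and b as in the text."""
    dist = math.sqrt(sum((u - v) ** 2 for u, v in zip(x, y)))
    return 1.0 / (1.0 + a * dist ** (2.0 * b))


def fuzzy_cross_entropy(high, low, eps=1e-12):
    """Cross-entropy between paired high- and low-dimensional similarities."""
    total = 0.0
    for v, w in zip(high, low):
        # Clip to avoid log(0); eps is an illustrative choice.
        v = min(max(v, eps), 1.0 - eps)
        w = min(max(w, eps), 1.0 - eps)
        total += v * math.log(v / w) + (1.0 - v) * math.log((1.0 - v) / (1.0 - w))
    return total


# Toy values (hypothetical): two directional similarities for one sentence pair.
v_sym = fuzzy_union(0.5, 0.5)                   # 0.75
w_near = low_dim_similarity([0.0, 0.0], [0.0, 0.0])  # 1.0 for coincident points
loss_match = fuzzy_cross_entropy([v_sym], [v_sym])   # 0.0 when both sides agree
loss_mismatch = fuzzy_cross_entropy([0.9], [0.1])    # positive when they disagree
```

The objective is zero exactly when the low-dimensional similarities reproduce the high-dimensional ones, which is what gradient descent on the embedded points tries to approach.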
A summary of the proposed process used to produce sentence-embeddings is provided in Algorithm 1, and pictorially presented in Figure 1.

Datasets
Six public datasets (available at https://drive.google.com/open?id=1sGgAo2SBoYKhQQK_kilUp8KSToCI55jl) have been used to empirically validate the method proposed in this paper. These datasets are of varying sizes, tasks and complexities, and have been used widely in existing literature, thereby making comparisons and reporting possible. Information about the datasets can be found in Table 1.

Resources
Pre-trained word-embedding corpus: We use the pre-trained set of word-embeddings provided by Mikolov et al. (2013).
Software implementations: We use a variety of software packages and custom-written programs to perform our experiments, the starting point being the calculation of sentence-wise distances. We calculate the Hausdorff distance using a directed implementation provided in the SciPy Python library, whereas the energy distance is calculated using dcor. Lastly, the word mover's distance is calculated using the implementation provided by Kusner et al. (2015). In order to produce the symmetric distance matrix for a dataset, we employ a custom parallel implementation which distributes the calculations over all available logical cores in a machine.
To calculate the sentence-embeddings, the implementation of UMAP provided by McInnes et al. (2018) is used. Finally, the classification is done via linear kernel support vector machines from the scikit-learn library (Pedregosa et al., 2011).
All of the code and datasets needed to rerun the experiments have been packaged and released.
Compute infrastructure: All experiments were run on an m4.2xlarge machine on AWS EC2, which has 8 virtual CPUs and 32GB of RAM.

Competing methods
In order to check the usefulness of our proposed approach, we benchmark its performance in two different ways. The first, and most obvious, approach is to consider the performance of the k-NN classifier as a baseline. This is motivated by the state-of-the-art k-NN based classification accuracy reported by Kusner et al. for the word mover's distance. Thus, our embeddings need to match or surpass the performance of a k-NN based approach, in order to be considered for practical use. The second approach is to compare the classification accuracies of several state-of-the-art embedding-generation algorithms on our chosen datasets. These are:
dct (Almarwani et al., 2019): embeddings are generated by employing discrete cosine transform on a set of word vectors.
eigensent (Kayal and Tsatsaronis, 2019): sentence representations produced via higher-order dynamic mode decomposition (Le Clainche and Vega, 2017) on a sequence of word vectors.
wmovers (Wu et al., 2018): a competing method which can learn sentence representations from the word mover's distance based on kernel learning, termed in the original work as word mover's embeddings.
p-means (Rücklé et al., 2018): produces sentence-embeddings by concatenating several power-means of word-embeddings corresponding to a sentence.
doc2vec (Le and Mikolov, 2014): embeddings produced by jointly learning the representations of sentences, together with words, as a part of the word2vec procedure.
s-bert (Reimers and Gurevych, 2019): embeddings produced by fine-tuning a pre-trained BERT model using a Siamese architecture to classify two sentences as being similar or different.
Note that the results for wmovers and doc2vec are taken from Table 3 of Wu et al.'s work (2018), while all the other algorithms are explicitly tested.
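For reference, the k-NN baseline needs nothing more than the sentence-to-sentence distances. A minimal sketch of majority-vote prediction from one row of a precomputed distance matrix is shown below; the value of k, the labels and the distances are made up for illustration, and ties are resolved in favour of the label seen first among the nearest neighbours, which is an arbitrary choice.

```python
from collections import Counter


def knn_predict(distances_to_train, train_labels, k=3):
    """Predict a label by majority vote among the k nearest training sentences.

    `distances_to_train[i]` is the distance from the test sentence to the
    i-th training sentence, computed with any of the set-distance metrics.
    """
    order = sorted(range(len(train_labels)), key=lambda i: distances_to_train[i])
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy data (hypothetical values): four training sentences, one test sentence.
labels = ["pos", "pos", "neg", "neg"]
row = [0.1, 0.2, 0.9, 1.0]
prediction = knn_predict(row, labels, k=3)
```

Because the classifier only ever sees relative distances, it cannot be swapped for a model that expects fixed-length feature vectors, which is the limitation the embedding step is meant to remove.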

Setup
Extensive experiments are performed to provide a holistic overview of our neighbourhood-preserving embedding algorithm, for various sets of input parameters. The steps involved are as follows:
1. Choose a dataset (one of the six mentioned in Section 3.1).
2. For every word in every sentence in the train and test splits of the dataset, retrieve the corresponding word-embedding from the pre-trained embedding corpus (as stated in Section 3.2).
3. Calculate symmetric distance matrices corresponding to each of the chosen distance metrics, for all of the sets of word-embeddings from the train and test splits.
4. Apply the UMAP algorithm on the distance matrices to generate embeddings for all sentences in the train and the test splits.
5. Calculate embeddings for the competing methods outlined in Section 4.1. Embeddings are generated for various hyperparameter combinations for EMAP as well as all the compared approaches, as listed in Table 2.
6. Train a classifier on the produced embeddings to perform the dataset-specific task. In this work, we train a simple linear-kernel support vector machine (Cortes and Vapnik, 1995) for every competing method and every dataset tested. The classifier is trained on the train-split of a dataset and evaluated on the test-split. The only parameter tuned for the SVM is the L2 regularization strength, varied between 0.001 and 100. The overall test accuracy has been reported as a measure of performance.

Results and Discussion
The results of all our experiments are compiled in Tables 3 and 4. All statistical tests reported are z-tests, where we compute the right-tailed p-value and call a result significantly different if p < 0.1.
Performance of the distance metrics: From Table 3 it can be observed that the word mover's distance consistently performs better than the others experimented with in this paper. WMD calculates the total effort of aligning two sentences, which seems to capture more useful information compared to the Hausdorff metric's worst-case effort of alignment. As for the energy distance, it calculates pairwise potentials amongst words within and between sentences, and may suffer if there are shared commonly-occurring words in both the sentences. However, given that energy and Hausdorff distances are reasonably fast to calculate and perform respectably well, they might be worth using in applications with a large number of long sentences.

Table 2: Hyperparameter values tested. For EMAP, n neighbours refers to the size of local neighborhood used for manifold approximation, embedding dim is the fixed dimensionality of the generated sentence-embeddings, min dist is the minimum distance apart that points are allowed to be in the low dimensional representation, spread determines the scale at which embedded points will be spread out, n iters is the number of iterations that the UMAP algorithm is allowed to run, and finally, distance is one of the metrics proposed in Section 2.1. For the spectral decomposition based algorithms, dct and eigensent, components represents the number of components to keep in the resulting decomposition, while time lag corresponds to the window-length in the dynamic mode decomposition process. For p-means, powers represents the different powers which are used to generate the concatenated embeddings.

Table 3: Comparison versus kNN. Results shown here compare the classification accuracies of k-nearest neighbour to our proposed approach for various distance metrics. For every distance, bold indicates better accuracy, while * indicates that the winning accuracy was statistically significant with respect to the compared value (i.e., EMAP vs kNN for a given distance metric). It can be observed that our method almost always outperforms k-nearest neighbour-based classification.

Table 4: Comparison versus competing methods. We compare EMAP based on word mover's distance to various state-of-the-art approaches. The best and second-best classification accuracies are highlighted in bold and italics. We perform statistical significance tests of our method (wmd-EMAP) against all other methods, for a given dataset, and denote the outcomes by ∨ when the compared method is worse and ∧ when our method is worse, while the absence of a symbol indicates insignificant differences. In terms of absolute accuracy, we observe that our method achieves state-of-the-art results in 2 out of 6 datasets.
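The exact form of the z-test is not spelled out here. One plausible reading, sketched below purely as an assumption, is a pooled two-proportion z-test on the test accuracies of two classifiers, with a right-tailed p-value compared against 0.1. The accuracies and test-set sizes in the example are made up and do not correspond to any reported result.

```python
import math
from statistics import NormalDist


def right_tailed_z_test(acc_a, acc_b, n_a, n_b):
    """Pooled two-proportion z-test for 'accuracy A is higher than accuracy B'.

    Returns the z statistic and the right-tailed p-value.
    """
    pooled = (acc_a * n_a + acc_b * n_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (acc_a - acc_b) / std_err
    p_value = 1.0 - NormalDist().cdf(z)
    return z, p_value


# Hypothetical accuracies on a test split of 1000 sentences each.
z, p_value = right_tailed_z_test(0.80, 0.75, 1000, 1000)
is_significant = p_value < 0.1
```

With equal accuracies the statistic is zero and the right-tailed p-value is 0.5, so no difference would be declared.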
Comparison versus kNN: EMAP almost always outperforms k-nearest neighbours based classification, for all the tested distance metrics. For WMD, the relative improvement in accuracy ranges from 0.5% to 14%. This illustrates the effectiveness of the proposed manifold-learning method.
Table 5: Examples of best-matching sentences. From the amazon reviews dataset using wmd-EMAP.

Query sentence: I have spent thousands of dollar's On Meyers cookware everthing from KitchenAid Anolon Prestige Faberware & Circulan just to name a few Though Meyers does manufacture very high quality pots & pans and I would recommend them to anyone it's just sad that if you have any problem with them under warranty you have to go throught the chain of command that never gets you anywhere even if you want to speak with upper management about the rudeness of the customer service department Their customer service department employees are always very rude and snotty and they act like they are doing you a favor to even talk to you about their products
Best match sentence: When I opened the box I noticed corrosion on the lid When I contacted Rival customer service via email they told me I had to purchase a new lid I called and spoke with a customer service representative and they told me that a lid was not covered under warranty When I explained that I just opened it and it was defective they told me to just return the product that there was nothing that they were going to do After being treated this way I will NOT be purchasing any more Rival products if they don't stand behind their product VERY VERY poor customer service
Cosine similarity: 0.997

Query sentence: This movie will bring up your racial prejudices in ways that most movies just elude to It demonstrates how connected we all are as people and how seperated we are by only one thing our viewpoints The acting is superb and you get one cameo appearance after another which is a treat Of course the soundtrack is terrific The ending is intense to witness one situation after another coming to an unfortunate finish
Best match sentence: I waited years for this movie to be released in the United States As far as I was concerned it wasn't about the acting as much as it was about the feeling the actors wanted to portray in which they profoundly accomplished I would recommend this movie to anyone who can reach that one step deeper into the minds of creativity and passion and appreciate the struggles of rising above and beyond the pain of broken dreams
Comparison versus state-of-the-art methods: Consulting Table 4, it seems that wmovers, p-means and s-bert form the strongest baselines as compared to our method, wmd-EMAP (EMAP with word mover's distance). Considering the statistical significance of the differences in performance between wmd-EMAP and the others, it can be seen that it is almost always equivalent to or better than the other state-of-the-art approaches. In terms of absolute accuracy, it wins in 3 out of 6 evaluations, where it has the highest classification accuracy, and comes out second-best for the others. Compared to its closest competitor, the word mover's embedding algorithm, the performance of wmd-EMAP ranges from on par (or slightly better, by 0.8% in the case of the classic dataset) to slightly worse (3% relative p.p., in the case of the twitter dataset). Interestingly, both of the distance-based embedding approaches, wmd-EMAP and wmovers, are found to perform better than the Siamese-BERT based approach, s-bert.
Thus, the overall conclusion from our empirical studies is that EMAP performs favourably as compared to various state-of-the-art approaches.
Examples of similar sentences with EMAP: We provide motivating examples of similar sentences from the amazon dataset, as deemed by our approach, in Table 5. As can be seen, our method performs quite well in matching complex sentences with varying topics and sentiments to their closest pairs. The first example pair has the theme of a customer who is unhappy about poor customer service in the context of cookware warranty, while the second one is about positive reviews of deeply-moving movies. The third example, about book reviews, is particularly interesting: in the query sentence, a reviewer is talking about how she disliked the first Stephen King work which she was exposed to, but subsequently liked all the next ones, while in the matched sentence the reviewer talks about a similar sentiment change towards the works of another author, Steve Berry. Thus in the last example, the similarity between sentences is the change of sentiment, from negative to positive, towards the works of particular authors.
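Retrieving a best match of this kind amounts to a nearest-neighbour search under cosine similarity in the embedding space. The sketch below assumes the sentence-embeddings are already available as plain lists of floats; the three 2-dimensional vectors are made up, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_match(query_index, embeddings):
    """Return (index, similarity) of the sentence most similar to the query."""
    candidates = [
        (cosine_similarity(embeddings[query_index], emb), j)
        for j, emb in enumerate(embeddings)
        if j != query_index  # a sentence is never matched with itself
    ]
    similarity, index = max(candidates)
    return index, similarity


# Toy sentence-embeddings (hypothetical values).
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
match_index, match_similarity = best_match(0, embeddings)
```

A linear scan like this is adequate for small corpora; for larger ones an approximate nearest-neighbour index would be the usual substitute.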

Conclusions
In this work, we propose a novel mechanism to construct unsupervised sentence-embeddings by preserving properties of local neighbourhoods in the original space, as delineated by set-distance metrics. This method, which we term EMAP, or Embeddings by Manifold Approximation and Projection, leverages a method from topological data analysis and can be used as a framework with any distance metric that can discriminate between sets, three of which we test in this paper. Using both quantitative empirical studies, where we compare with state-of-the-art approaches, and qualitative probing, where we retrieve similar sentences based on our generated embeddings, we show that the performance of our proposed approach is on par with, or exceeds, that of in-use methods. This work demonstrates the successful application of topological data analysis in sentence embedding creation, and we leave the design of better distance metrics and manifold approximation algorithms, particularly targeted towards NLP, for future research.