EmbedTextNet: Dimension Reduction with Weighted Reconstruction and Correlation Losses for Efficient Text Embedding



Introduction
Significant advances in computational resources, along with the availability of huge amounts of data, have rendered large language models (LLMs) ubiquitous in various fields such as Natural Language Understanding (NLU) and vision-language multimodal models Du et al. (2022). Even with their relative success in diverse applications, the practical deployment of such models remains challenging, with one of the main concerns being the text embedding size (Ling et al., 2016). For example, loading word embedding matrices with 2.5 million tokens and 300 dimensions consumes 6 GB of memory on a 64-bit system Raunak et al. (2019). This may breach memory or latency budgets in certain highly constrained practical settings, such as NLU on mobile devices or KNN search over a large index.
Text embeddings are generally generated from unlabeled text corpora and applied in several Natural Language Processing (NLP) and information retrieval applications. Embeddings are often precomputed and used in downstream applications. Many post-processing algorithms have been considered over the years to reduce the dimensionality of the resulting vector space. One approach to reducing embedding dimensionality is re-training the language model with the desired dimensionality; this, however, can be costly and can reduce model performance. Another approach is to use unsupervised algorithms directly on the pre-computed text embeddings, without requiring any access to labels. Several works exist in the literature, including Principal Component Analysis (PCA) Jolliffe and Cadima (2016), Algo Raunak et al. (2019), and Uniform Manifold Approximation and Projection (UMAP) McInnes et al. (2018). While most unsupervised approaches have made noticeable progress, they still suffer from several disadvantages according to our findings. First, they do not perform consistently across different applications. For example, Algo works well compared to PCA for Dimensionality Reduction (DR) of word embeddings, but the roles are switched in the case of multilingual sentence (m-sentence) embeddings. Second, the absolute performance of the baseline methods is significantly lower when extreme reduction is employed (e.g., down to 16 dimensions).
In this work, we propose a framework that efficiently reduces the dimensionality of text embeddings. We show that the proposed EmbedTextNet, which uses a Variational AutoEncoder (VAE) with modified loss functions, can be used as an add-on module that improves the quality of the text embedding. We also demonstrate that EmbedTextNet can reduce the size of a text embedding considerably while preserving its quality. Our extensive experiments on three downstream applications show that our method surpasses the original embeddings and the state-of-the-art (SOTA).

Related Work
Model compression and DR of embeddings are receiving more attention nowadays because of their ability to speed up the inference process and reduce the model storage space. To compress a model, one can use pruning approaches that aim to repeatedly eliminate the redundant parameters in a model Molchanov et al. (2016). Quantization is another common approach that works by limiting the number of bits used to represent the model weights Hubara et al. (2017). Similarly, the authors in Li et al. (2017) compressed the input and output embedding layers with a parameter-sharing method. Another approach to model compression is knowledge distillation, where the objective is to make a small model mimic a large pre-trained model by matching the ground-truth and soft labels from the large model Hinton et al. (2015). Recently, researchers considered a block-wise low-rank approximation Chen et al. (2018), which employs statistical features of words to form matrix approximations for the embedding and softmax layers. Its extension was shown in Lee et al. (2021), which is based on word weighting and the k-means clustering method. The authors in Hrinchuk et al. (2019) considered parameterizing the embedding layers by tensor-train decomposition to compress a model with a negligible drop in results. All the aforementioned methods have the limitation of either requiring re-training of the model to accommodate the changes made to the network or compromising the accuracy of the learning model. Further, some of these techniques have not been studied with pre-trained LLMs, which limits their scalability.
Unsupervised methods for DR of embeddings have gained attention because of their label-free approach that transforms a high-dimensional embedding into a reduced one. The most common starting point is to utilize linear transformation approaches such as PCA or Independent Component Analysis (ICA) Müller et al. (2018); Hyvärinen (2013). These methods aim to project the original high-dimensional embedding into a lower-dimensional one by maximizing either the variance or the mutual information. Another route is to employ nonlinear DR methods, including Kernel PCA (KPCA) Schölkopf et al. (1998), t-distributed Stochastic Neighbor Embedding (t-SNE) Van der Maaten and Hinton (2008), and UMAP McInnes et al. (2018). KPCA employs a kernel function (e.g., Gaussian) to map the original embeddings into a nonlinear space and performs PCA on the projected embeddings. t-SNE performs a nonlinear reduction that captures the local structure of the high-dimensional embeddings while simultaneously maintaining global structure. UMAP embeds data points in a nonlinear fuzzy topological representation via neighbor graphs and then learns a low-dimensional representation while preserving maximum information of this space.
The use of pre-trained LLMs is the backbone of several NLP applications such as text retrieval, question answering, and semantic similarity between pairs. Large pre-trained models such as BERT Devlin et al. (2019), RoBERTa Liu et al. (2019), and XLNet Yang et al. (2019) have been shown to provide SOTA performance. However, when we need a smaller embedding size for resource-constrained devices and fast inference in downstream tasks like text retrieval, it becomes necessary to retrain and fine-tune pre-trained LLMs to attain the same level of performance as the original model. Using DR, we can circumvent this process while also saving the space that would otherwise be reserved for storing retrieval models (i.e., the index). One simple approach to reducing the embedding size of these large models is to re-train the model while changing the last hidden layer size. This is inefficient since the re-training process will still, obviously, require enormous resources. A more natural and convenient mechanism is to work directly on the pre-trained large embedding, which eliminates the need for re-training Mu et al. (2017). Recently, the authors in Raunak et al. (2019) combined PCA with a post-processing step Mu et al. (2017) to obtain effective DR on word embeddings. However, existing methods fail to work in extreme cases such as a 30-times reduction of the original size.
Therefore, in this work, we propose an algorithm that works directly on pre-computed embeddings to reduce their size even in extreme cases. This is validated on three different applications, namely text similarity, language modelling, and text retrieval, to highlight the effectiveness of the proposed approach in different aspects such as similarity score, inference time, and model size.

DR for Text Embedding
Ensuring fast inference (either between devices or in user-device interaction) is critical, especially for applications related to NLU such as voice assistants. DR has practical applications in two distinct tasks: (1) direct application using pre-computed embeddings like GloVe, and (2) a retrieval task with a built-in index, where we use pre-computed embeddings to generate an index for searching. These embeddings can be either contextual or non-contextual, depending on the specific application. To this end, we first introduce EmbedTextNet with a modified reconstruction loss and an additional correlation loss. We then describe the three most common DR methods used as baselines.

EmbedTextNet
We propose a VAE model, named EmbedTextNet, with an added correlation penalty, L_Corr, that quantifies the strength of the linear relation between the original embedding and the reconstructed one. The correlation penalty is added to the original losses of the VAE model to enhance the DR of pre-computed embeddings. We mainly consider the latent space of the VAE for DR of embeddings. The overall model and loss functions of EmbedTextNet are shown in Figure 1 and detailed in the Appendix.
The input to EmbedTextNet is the pre-trained embedding. With that in mind, the general loss function of a VAE consists of two parts: a reconstruction loss and a regularizer. The first term, L_Rec, measures the distance between the original and reconstructed pre-computed embeddings. The second term is the Kullback-Leibler (KL) divergence loss, L_KL, which acts as a regularizer, normalizing and smoothing the latent space to follow a prior distribution (i.e., a Gaussian distribution). Originally, the KL divergence drives the VAE as a generative model, but there is also merit in using it for DR according to the ablation study in Table 7. To balance the two terms, the work in Higgins et al. (2016) proposed β-VAE, which places greater emphasis on L_KL in order to enforce a latent distribution comparable to the prior distribution. This weighting contributes to the production of high-quality synthetic data, indicating that β-VAE is primarily intended for data generation purposes. However, we find β-VAE less effective for DR of text embeddings, where significant performance degradation is observed (see Table 8). Instead, we place more weight on the reconstruction loss to emphasize the similarity between the reconstructed data and the original data. This weighting ensures that the latent space contains the critical features of the original data, making EmbedTextNet beneficial for dimensionality reduction.
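As a minimal sketch of this weighting idea (assuming a standard Gaussian-prior VAE with an MSE reconstruction term; the function and variable names are illustrative, not taken from the released implementation):

```python
import torch
import torch.nn.functional as F

def weighted_vae_loss(x, x_rec, mu, logvar, W):
    # Reconstruction term: MSE between original and reconstructed embeddings.
    l_rec = F.mse_loss(x_rec, x, reduction="mean")
    # KL divergence between the approximate posterior N(mu, sigma^2) and N(0, I).
    l_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # Unlike beta-VAE, the extra weight W is placed on the reconstruction term.
    return W * l_rec + l_kl, l_rec, l_kl
```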
In Figure 2, we show L_Rec and L_KL for the reduced word embedding while training the VAE. The original VAE losses in (a) are small, especially the KL loss, which almost vanishes and causes the reconstruction term to converge prematurely. This makes the training of EmbedTextNet insufficient, resulting in a faulty reconstruction. In addition, the VAE model is not trained properly with the original reconstruction loss, since it is almost invariant during the training process. In contrast, the weighted reconstruction loss in (b) yields an adequate amount of loss with convergence at the end. This allows the model to effectively find smaller embeddings by enforcing a large penalty for incorrect decisions. From the empirical results on text similarity in Table 7, we observe an improvement after weighting the reconstruction loss. More experiments are found in the Appendix.
The reconstruction loss uses the proximity of two embeddings to gauge their similarity. In NLP, directional metrics (e.g., cosine similarity) are also used to gauge the similarity between embeddings. Our idea stemmed from the desire to incorporate this additional directional information between the original and reconstructed data to devise a more effective DR approach. To do this, we additionally consider the Pearson correlation coefficient, L_Corr, as an added penalty to the VAE losses, measuring the strength and direction of the linear relationship between the original and reconstructed embeddings. Specifically, it enables the model to capitalize further on the training set once the reconstruction and KL losses have converged, leading to better overall reconstruction performance. From the empirical results in Tables 8 and 9, we observe an improvement after including the correlation loss (i.e., EmbedTextNet) compared to other measurements (e.g., cosine similarity).
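A differentiable version of this penalty can be sketched as follows (a simplified per-batch Pearson term; the exact normalization used in EmbedTextNet may differ):

```python
import torch

def correlation_loss(original, reconstructed, eps=1e-8):
    # Pearson correlation between the (mean-centered) original and reconstructed
    # embeddings; the penalty 1 - r is zero when they are perfectly correlated.
    o = original.flatten() - original.mean()
    r = reconstructed.flatten() - reconstructed.mean()
    corr = (o * r).sum() / (o.norm() * r.norm() + eps)
    return 1.0 - corr
```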
A detailed description of the EmbedTextNet DR procedure is presented in Algorithm 1. We first generate embeddings from the train and test sets of the language models without using any layer information from them. During training, the latent and reconstructed embeddings of the train set are used to compute the loss functions of EmbedTextNet. In the first epoch, we set the weight value (W) for the reconstruction loss (L_Rec) from L_Rec and the KL divergence (L_KL) using a tuning value γ. This helps to compensate for the randomness introduced by the initialization of the encoder and decoder parameters. The value of γ is determined by the first L_Rec value: we use a high value if it is small and vice versa. This value is crucial for making EmbedTextNet useful in all applications, as each one has a different distribution of embeddings. Equation 1 shows the details of each loss in Algorithm 1.
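The equation itself did not survive extraction; based on the definitions given in the following paragraph, and assuming an MSE reconstruction term with a standard Gaussian prior, a plausible reconstruction of Equation 1 is:

\[
\begin{aligned}
\mathcal{L}_{All} &= W \cdot \mathcal{L}_{Rec} + \mathcal{L}_{KL} + \mathcal{L}_{Corr},\\
\mathcal{L}_{Rec} &= \mathbb{E}_{q(z \mid O)}\!\left[\, \lVert O - R_{\theta}(z) \rVert_2^2 \,\right],\\
\mathcal{L}_{KL} &= D_{KL}\!\left( q(z \mid O) \,\Vert\, \mathcal{N}(0, I) \right),\\
\mathcal{L}_{Corr} &= -\,\frac{\sum_i (O_i - \bar{O})(R_i - \bar{R})}{\sqrt{\sum_i (O_i - \bar{O})^2}\,\sqrt{\sum_i (R_i - \bar{R})^2}},
\end{aligned}
\]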
where θ denotes the weights of EmbedTextNet and the L_Rec objective is to find θ that maximizes the likelihood of the original embedding O given the latent representation z; q(z | O) is the distribution of the latent representation given the original embedding; O_i and R_i are the individual dimensions of the original and reconstructed embeddings, respectively; and Ō and R̄ are the average values of the original and reconstructed embeddings, respectively.

Baselines
We compared EmbedTextNet to three conventional and recent dimensionality reduction methods: PCA, Algo, and UMAP. PCA is a classic technique for reducing vector dimensionality by projecting onto a subspace with minimal information loss Jolliffe and Cadima (2016). Algo aims to eliminate the common mean vector and a few top dominating directions from the text embeddings to make them more distinguishable Raunak et al. (2019). UMAP is a manifold learning approach based on Riemannian geometry and algebraic topology that is similar to t-SNE in visualization quality while preserving more of the global structure and improving runtime McInnes et al. (2018).
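For reference, these baselines operate directly on a matrix of pre-computed embeddings via their standard implementations; a sketch with scikit-learn and umap-learn (the hyperparameters shown are illustrative defaults, not the tuned values from this paper):

```python
import numpy as np
from sklearn.decomposition import PCA
import umap

# Placeholder matrix of pre-computed embeddings (e.g., GloVe rows).
X_train = np.random.rand(1000, 300).astype("float32")

# PCA: project onto the top-k principal directions.
pca = PCA(n_components=50).fit(X_train)
X_pca = pca.transform(X_train)

# UMAP: neighbor-graph based nonlinear reduction.
reducer = umap.UMAP(n_components=8).fit(X_train)
X_umap = reducer.transform(X_train)
```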

Results and Discussion
We showcase the usefulness of EmbedTextNet for DR on three different applications: text similarity, for measuring how well the reduced embedding mimics the original one; language modelling, for assessing how well the original data can be reproduced from the reduced embedding; and text retrieval, for evaluating the effectiveness of DR in terms of frugality and performance.
We also conduct an ablation study on the hyperparameters and structure of EmbedTextNet. Finally, the computational cost of EmbedTextNet is detailed in Appendix A. In general, during training, EmbedTextNet takes longer than PCA and Algo, and a comparable time to UMAP. Yet, at inference, it has a similar inference time with superior performance. This emphasizes the capabilities and potential of EmbedTextNet for DR.

Datasets
We evaluated the proposed DR approach on various datasets for each application.
Word Similarity: We consider diverse databases of word pairs with similarity scores rated manually by humans from Faruqui and Dyer (2014). We evaluate the similarity between word embeddings using cosine similarity and measure the correlation between the cosine-similarity scores and the human ranking using Spearman's Rank Correlation (SRC, ρ × 100).
Sentence Similarity: We use several datasets for similarity evaluation between sentence pairs: the Semantic Textual Similarity (STS) tasks from 2012-2016 Agirre et al. (2012, 2013, 2014, 2015, 2016), the STS Benchmark (STS-B) Cer et al. (2017), and the SICK-Relatedness (SICK-R) dataset Marelli et al. (2014), evaluated using SentEval.
M-Sentence Similarity: To measure multilingual similarity, we use the recent extension of the multilingual STS 2017 database Cer et al. (2017); Reimers and Gurevych (2020). It contains sentence pairs with semantic similarity scores in different languages: AR-AR, EN-EN, and ES-ES for monolingual pairs and AR-EN, ES-EN, EN-DE, EN-TR, FR-EN, IT-EN, and NL-EN for cross-lingual pairs. For both sentence and m-sentence evaluations, we compute the SRC between the cosine similarity of sentence pairs and the ground-truth semantic similarity.
Language Modelling: We use the WikiText-2 dataset Merity et al. (2016), collected from verified good and featured articles on Wikipedia.
Text Retrieval: Lastly, we consider NYTimes Aumüller et al. (2018); NYT, a benchmark dataset for Approximate Nearest Neighbor (ANN) search consisting of bags of words generated from NYTimes news articles.

Experimental Setting
In all downstream tasks, we used the training set to train the DR models and the testing set solely to evaluate their generalization performance. We trained each model three times with different random seeds to account for random initialization and averaged the results. For the text similarity task, we mainly used SRC (ρ × 100) to measure the strength and direction of the relationship between two ranked variables. For language modelling, we used perplexity, and for text retrieval, we measured the index size, the time for searching queries, and the recall at top 10 (Recall@10). To ensure fairness, we used the same models and datasets for all baselines and DR methods. We also report the standard deviation (STD) with each performance measurement as a statistical analysis.
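As a small illustration of this evaluation protocol (placeholder arrays; SciPy's spearmanr computes the SRC):

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(a, b):
    # Row-wise cosine similarity between two matrices of paired embeddings.
    return np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

# emb_a, emb_b: reduced embeddings of the paired texts; gold: human similarity scores.
emb_a, emb_b = np.random.rand(100, 16), np.random.rand(100, 16)
gold = np.random.rand(100)

rho, _ = spearmanr(cosine(emb_a, emb_b), gold)
print(f"SRC (rho x 100): {100 * rho:.1f}")
```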
Text Similarity Application: To generate word embeddings, we considered the pre-trained GloVe Pennington et al. (2014) trained on the Wikipedia 2014, Gigaword 5, and Twitter corpora. We employed 80% of the GloVe dictionary for training the DR models in order to investigate their generalizability to unseen data (i.e., the remaining 20% of the dictionary). For task evaluation, we utilized the entire dictionary to extract the reduced embeddings for each word, ensuring all words were included in the task. To produce sentence embeddings, we employed the pre-trained Sentence-BERT (SBERT) Reimers and Gurevych (2019) trained on Wikipedia and NLI datasets. There are two scenarios for measuring sentence similarity with DR. (1) No dedicated training set is provided in STS 2012-2016 Agirre et al. (2012, 2013, 2014, 2015, 2016). Here, we employed a cross-dataset evaluation protocol; for example, we used STS 2013-2016 as the train set when measuring the result on STS 2012. (2) Dedicated training and testing sets are provided in STS-B Cer et al. (2017) and SICK-R Marelli et al. (2014). In this case, we used the provided train set for training the DR models and evaluated them on the provided test set. To generate m-sentence embeddings, we used the pre-trained multilingual Sentence-BERT (mSBERT) Reimers and Gurevych (2020) trained on parallel data for 50+ languages Tiedemann (2012). For the train set, we used the provided monolingual pairs of AR-AR and ES-ES, and the translations of ES-ES into EN, DE, TR, FR, IT, and NL using Google Translate, since we do not have monolingual pairs for these languages. The provided EN-EN dataset was removed from the train set since most cross-lingual datasets were generated by translating one sentence of EN-EN Reimers and Gurevych (2020). For the test set, all the cross-lingual pairs are considered.
Language Modelling Application: Using the codebase in the PyTorch examples, we employed a 2-layer LSTM model with 100 hidden units and 50% dropout after the encoder, and used the dedicated train set in WikiText-2 for training the model. We used the embeddings of the train set from the encoder to train the DR approaches and then applied them to the embeddings of the test set from the encoder to obtain the reduced ones. Finally, we inverse-transformed the reduced embeddings using the DR model and provided the result to the decoder for performance evaluation. Compared to the other tasks, in this task we focus on measuring how well the original data can be reproduced from the reduced representation.
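The reduce-then-restore step can be illustrated with PCA, which exposes an explicit inverse transform (a sketch with placeholder hidden states; EmbedTextNet's decoder plays the analogous role of mapping the latent code back to the original dimension):

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder hidden states from the LSTM encoder on the train / test splits.
H_train = np.random.rand(5000, 100).astype("float32")
H_test = np.random.rand(1000, 100).astype("float32")

dr = PCA(n_components=16).fit(H_train)                   # fit DR on train embeddings
H_test_reduced = dr.transform(H_test)                    # reduce the test embeddings
H_test_restored = dr.inverse_transform(H_test_reduced)   # map back to 100D
# H_test_restored is then fed to the LSTM decoder to measure perplexity.
```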
Text Retrieval Application: Here, we employed the Hierarchical Navigable Small Worlds (HNSW) index Malkov and Yashunin (2016) from the Faiss library Johnson et al. (2019). The hyperparameters used for the HNSW index are 12 for the number of neighbors in the graph, 120 for the depth of exploration when building the index, and 16 for the depth of exploration during search. To build the index, we used the train set of the NYTimes dataset and used the test set as the queries for searching. The train and test sets of this dataset are also used for training and testing the DR approaches. To this end, we used the reduced embeddings for building the HNSW index and searching the queries in order to assess the effectiveness of DR for saving resources while preserving performance.
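With these hyperparameters, building and querying the HNSW index in Faiss looks roughly as follows (array contents are placeholders):

```python
import numpy as np
import faiss

d = 64                                             # reduced embedding dimension
xb = np.random.rand(10000, d).astype("float32")    # reduced train-set embeddings
xq = np.random.rand(100, d).astype("float32")      # reduced query embeddings

index = faiss.IndexHNSWFlat(d, 12)                 # 12 neighbors per node in the graph
index.hnsw.efConstruction = 120                    # exploration depth while building
index.hnsw.efSearch = 16                           # exploration depth while searching
index.add(xb)
distances, ids = index.search(xq, 10)              # top-10 neighbors for Recall@10
```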

Word Similarity with DR
For word embeddings, we simulate two different cases of DR on GloVe: (1) GloVe-300D → 50D and (2) GloVe-50D → 10D (the left and right sides represent the original and reduced embedding dimensions (D) after DR). The overall performance of EmbedTextNet and the baseline models is found in Table 1. For (1), we notice that, in general, PCA and UMAP perform poorly compared to Algo after reducing the embedding dimension. In most cases, EmbedTextNet outperforms the other reduction approaches and shows better results even compared to GloVe-50D (i.e., re-training the model with the reduced dimension size). This demonstrates the ability of EmbedTextNet to effectively decrease the dimension of word embeddings without the need for re-training, which becomes more evident as the reduced dimension gets smaller. For (2), the performance gap between EmbedTextNet and the other DR methods widens on most databases. In addition, EmbedTextNet is the only model that shows superiority over GloVe-25D. In this task, we need the vocabulary from the pre-trained word embeddings during inference, and thus EmbedTextNet can be used to reduce memory and inference time with similar or better performance. This is shown in Table 2, where notably EmbedTextNet-100D outperforms GloVe-200D. Thus, we confirm the usefulness of EmbedTextNet for the word similarity task in terms of both resources and performance.

Sentence Similarity with DR
We tested two cases of dimensionality reduction on sentence embeddings: (1) reducing SBERT-1024D to 128D and (2) reducing SBERT-1024D to 16D. The overall performance can be found in Table 3. PCA mostly performed better than the other baseline methods in both cases, while EmbedTextNet had the best overall performance. This is especially evident in the larger reduction (2). EmbedTextNet also performed better than using the average of word embeddings (i.e., Avg. GloVe-300D and BERT-768D) despite its lower-dimensional embeddings. These results show that EmbedTextNet effectively generates lower-dimensional word and sentence embeddings while maintaining good performance.

Multilingual Sentence Similarity with DR
We tested the proposed method in a multilingual setting by considering two use cases: (1) reducing mSBERT-768D to 64D and (2) reducing mSBERT-768D to 16D. The results can be found in Table 4.
In the first case (1), PCA showed a small decrease in performance compared to Algo and UMAP, while EmbedTextNet had the best performance overall, except for the case of AR-EN. In the extreme reduction case (2), EmbedTextNet had the best performance in all languages. This highlights the effectiveness of EmbedTextNet in reducing the size of embeddings, especially when a large reduction is needed. Therefore, EmbedTextNet can be useful for reducing the dimension of embeddings in multilingual settings.

Language Modelling with DR

Unlike other model compression works, DR cannot directly reduce the size of the model parameters for language modelling since we do not access any parameter or layer information. Instead, we can reduce the dimension of the embeddings and then reconstruct them to the original dimension via the inverse transform of the DR approach. In this task, we measure how close the perplexity is when starting from the reduced-dimension vectors; the results are reported in Table 5. In all reduced cases, PCA always shows better perplexity than the Algo and UMAP approaches, while EmbedTextNet outperforms all of them. Therefore, the reconstructed data from EmbedTextNet is of high quality and follows the original data very well.

Text Retrieval with DR
For text retrieval, we need space for storing the embeddings (i.e., the index size) and search time for the queries, both of which can be reduced using DR approaches.
In Table 6, we report the results on the text retrieval task. First of all, we can see that PCA and EmbedTextNet in the 64-dimension case improve the recall values over the original embeddings, since they generate compact, informative embeddings that help build an index with distinguishable keys. It is clear that we can decrease the index size and the search time for each query significantly after applying DR, with EmbedTextNet showing the best results in terms of Recall@10. In real-world applications, only the encoder of EmbedTextNet is needed to reduce the query dimension, and it requires at most 2 MB of space, making it a beneficial approach for text retrieval tasks to achieve efficiency with similar or better performance.

Ablation Study
Since EmbedTextNet uses a modified reconstruction loss with a distance-based penalty, we investigated different distance metrics, such as Huber, Hinge, Mean Absolute Error (MAE), and MSE, to verify the effectiveness of the selected one for text embedding applications. The investigation is shown in Table 16 in the Appendix, which confirms that MSE is the most suitable choice for dimensionality reduction. We also tested different values of γ to find the optimal one for each application. The results are detailed in Tables 13-15 in the Appendix. We assigned a larger γ when L_Rec < 0.1, as this means that the reconstructed output is already very similar to the original one. This also indicates that the latent space of EmbedTextNet follows the prior distribution well (i.e., a small KL divergence), and a larger γ is needed to prevent convergence before sufficient learning has taken place in EmbedTextNet. The best values (i.e., γ = 10 if L_Rec < 0.1, else 0.3) were used throughout the evaluation process for the different applications.
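Written out, the reported schedule is simply (values as stated above):

```python
def gamma_for(first_rec_loss: float) -> float:
    # Larger gamma when the first-epoch reconstruction loss is already small,
    # so that training does not converge before the latent space is learned.
    return 10.0 if first_rec_loss < 0.1 else 0.3
```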
In Table 7, we compare the use of the KL divergence and the modified reconstruction loss. As we can see, using the modified reconstruction loss with an AE outperforms the naive AE, and including the KL divergence (i.e., VAE) shows further improvements. In Table 8, we compare the selected EmbedTextNet with other approaches on multilingual sentence similarity. We can see that placing more emphasis on the KL divergence (i.e., β-VAE) is not suitable for DR applications. Also, placing more emphasis on the reconstruction loss alone or using cosine similarity (i.e., Tables 8 and 9) is not helpful, unlike the correlation loss (i.e., EmbedTextNet, N = 5). This confirms that the correlation loss has its own advantage in achieving better DR. Lastly, we confirm that the correlation loss should be included only after a few epochs (i.e., N = 5) to achieve better performance. More details are included in the Appendix.

Limitations
There are three foreseeable limitations of the developed EmbedTextNet. First, while we have performed a thorough evaluation of EmbedTextNet on various downstream tasks, it is still a general-purpose approach and its effectiveness on specific tasks or in specific domains may vary. Thus, further research is needed to fully understand its capabilities and limitations in different contexts. Second, as mentioned, EmbedTextNet is most suitable for scenarios where the embedding is saved during inference, such as text retrieval or similarity measurement, where the fixed embedding is saved together with a vocabulary (e.g., GloVe). However, it may not be as effective in scenarios where the embedding needs to be decoded back to its original form, such as text generation.
Third, the effectiveness of EmbedTextNet is evident for large embedding dimensions, and it may decrease when working with a small embedding dimension, even though it was still better than other SOTA methods in our experiments (e.g., GloVe-50D → 10D in Table 1). This limitation is due to the fact that EmbedTextNet is based on a VAE architecture, which is known to perform better on high-dimensional data. Therefore, it is advisable to compare the performance of EmbedTextNet with other SOTA methods and choose the appropriate one according to the intended usage.

A.1 Computational Cost

To train the baselines, we employed an Intel(R) Xeon(R) CPU @ 2.20 GHz, and to train EmbedTextNet, we used a Tesla P100-PCIE for the text similarity tasks and a Tesla T4 for the language modelling and text retrieval tasks. All of them used 25 GB of RAM, and we trained three times with different seeds to obtain the averaged results. Training EmbedTextNet for the text similarity applications took up to 54 minutes for GloVe, 94 minutes for all cross-validations in STS12-16, 12 minutes for STS-B, 12 minutes for SICK-R, and 8 minutes for the 8 multilingual languages. Training EmbedTextNet for the other applications required up to 53 minutes for language modelling and 61 minutes for text retrieval. The time consumption for measuring text similarity in all applications is small enough to be negligible, while building the language modelling model (i.e., 2 layers of LSTM) and the Faiss index took up to 39 minutes and 9 minutes, respectively.

A.2 Details of EmbedTextNet Structure
In Table 10, we describe the structure of the encoder and decoder in EmbedTextNet. Furthermore, we used the RMSprop optimizer (Hinton) with lr = 2e-3 while training our models.

A.3 Hyperparameters and Size Estimation in Dimensionality Reductions
In Table 11, we cover all the hyperparameters in dimensionality reductions which are based on the

B Loss Functions in EmbedTextNet
In Figure 3, we show the loss graphs of EmbedTextNet on the sentence similarity task. All the losses converge well even though the weight is applied to the reconstruction loss and the correlation loss is added to the VAE model.

C Databases
In Table 12, we show a summary of each dataset together with its downstream task.

D Detailed Results about Ablation Study
Tables 13-18 showcase the detailed experimental results of the ablation study for each database reported in the paper.

E Additional Results for Text Similarity
Tables 19 and 20 cover the additional experimental results for text similarity tasks.

Figure 2: Loss functions before (a) and after (b) weighting the Mean Square Error (MSE)-based reconstruction loss. The graphs are based on DR of a word embedding (i.e., GloVe-300D → 50D). Increasing the number of epochs does not affect the convergence of the losses.

Figure 3: Loss functions in EmbedTextNet. The graphs are based on dimensionality reduction of a sentence embedding (i.e., SBERT-1024D → 128D). Even if we increase the number of epochs, the losses converge as shown in this figure.
Algorithm 1: EmbedTextNet Algorithm
Data: Embeddings from the train set X_train and test set X_test, reduced dimension M, correlation loss threshold N, max epochs K
Result: Reduced embeddings for the train set Y_train and test set Y_test
Parameters: Training epoch E, reconstructed train set X_re-train and test set X_re-test, EmbedTextNet loss functions L_All, reconstruction loss L_Rec, KL divergence L_KL, correlation loss L_Corr, weight for the reconstruction loss W, tuning value γ
1. Initialize the Encoder and Decoder in EmbedTextNet.
2. Train the Encoder and Decoder: while
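A hypothetical training loop matching the outline above (the exact rule for deriving W from L_Rec, L_KL, and γ is not recoverable from the extracted text, so the marked line is only one plausible choice; the encoder is assumed to return a mean and log-variance):

```python
import torch
import torch.nn.functional as F

def train_embedtextnet(encoder, decoder, X_train, N=5, K=200, lr=2e-3):
    opt = torch.optim.RMSprop(list(encoder.parameters()) + list(decoder.parameters()), lr=lr)
    W = 1.0
    for epoch in range(K):
        mu, logvar = encoder(X_train)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        X_rec = decoder(z)

        l_rec = F.mse_loss(X_rec, X_train)
        l_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        if epoch == 0:
            gamma = 10.0 if l_rec.item() < 0.1 else 0.3
            # Assumption: W scales the reconstruction term relative to the KL term.
            W = gamma * (l_rec.item() + l_kl.item()) / max(l_rec.item(), 1e-8)

        loss = W * l_rec + l_kl
        if epoch >= N:                                             # add correlation loss late
            o = X_train.flatten() - X_train.mean()
            r = X_rec.flatten() - X_rec.mean()
            loss = loss + (1.0 - (o * r).sum() / (o.norm() * r.norm() + 1e-8))

        opt.zero_grad()
        loss.backward()
        opt.step()
    return encoder, decoder
```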

Table 5: Perplexity results for language modelling. UMAP is only considered for the smallest case since its authors McInnes et al. (2018) did not recommend reducing to more than 8 dimensions because of low performance.

Table 6: Retrieval task on the NYTimes benchmark, where searching is done using an Intel(R) Xeon(R) CPU @ 2.20 GHz. The encoder size of EmbedTextNet is up to 2 MB (0.2M parameters).

Table 7: The effect of the KL divergence (i.e., VAE) and the modified reconstruction loss (i.e., including W in Algorithm 1) for GloVe-300D → 150D, in ρ × 100.

Table 10: EmbedTextNet structure, consisting of (a) the Encoder and (b) the Decoder. BN, LeakyReLU, and Average (Avg) Pooling are applied after each operation. Here, 'word' refers to the text similarity and retrieval settings with GloVe and Faiss, respectively.

Table 12: Details of the datasets used. Here, the train and test sets are divided according to their usage in this paper.

Table 20: Detailed performance for sentence similarity (ρ × 100 (STD)) where the embedding is reduced from SBERT-1024D → 128D and SBERT-1024D → 16D. Bold indicates the best result among the dimensionality reduction methods. The measurements for Avg. GloVe and BERT are obtained using SentEval.