Jina Embeddings: A Novel Set of High-Performance Sentence Embedding Models

Jina Embeddings constitutes a set of high-performance sentence embedding models adept at translating textual inputs into numerical representations, capturing the semantics of the text. These models excel in applications like dense retrieval and semantic textual similarity. This paper details the development of Jina Embeddings, starting with the creation of high-quality pairwise and triplet datasets. It underlines the crucial role of data cleaning in dataset preparation, offers in-depth insights into the model training process, and concludes with a comprehensive performance evaluation using the Massive Text Embedding Benchmark (MTEB). Furthermore, to increase the model's awareness of grammatical negation, we construct a novel training and evaluation dataset of negated and non-negated statements, which we make publicly available to the community.


Introduction
Sentence embedding models are an effective instrument for encoding the semantic nuances of words, phrases, and larger textual units into a continuous vector space. They encapsulate the complexities of contexts and lexical and grammatical interrelationships within a text, facilitating downstream tasks like information retrieval, semantic similarity evaluation, and text classification.
Despite the potential of these models, questions remain about the effectiveness of different data preprocessing strategies, the optimal loss function for training sentence embedding models, and the impact on performance of increasing the number of model parameters. This paper addresses these challenges.
We have developed a novel dataset specifically to train our sentence embedding models. Furthermore, we design a dataset specifically to sensitize our models to distinguish negations of statements from confirming statements. This paper also presents JINA EMBEDDINGS, a set of high-performance sentence embedding models trained on these datasets. The JINA EMBEDDINGS set is expected to comprise five distinct models, ranging in size from 35 million to 6 billion parameters. Three of those models are already trained and published. The JINA EMBEDDINGS models employ contrastive training on the T5 architecture [Raffel et al., 2020]. We opt to use the T5 model as our base due to its pre-training on a mixed set of downstream tasks. We argue that this choice can enhance our ability to accurately gauge the effectiveness of our training strategy.
Our large-scale contrastive fine-tuning approach surpasses zero-shot T5 and delivers a performance level on par with other leading T5-based sentence embedding models such as Sentence-T5 [Ni et al., 2022a] and GTR [Ni et al., 2022b]. Consequently, this work demonstrates that high-quality sentence embeddings can be achieved with the judicious use of resources and innovative training methodologies.

Dataset Preparation
In order to develop models that excel across a wide range of tasks, we collate a comprehensive set of both public and custom datasets. These datasets target various retrieval objectives, such as e-commerce search, duplicate detection, web retrieval, article retrieval for question-answering, and text classification. Consolidating these datasets into a unified format facilitates concurrent model training for all tasks.
Definition of Format: Given the lack of non-relevance information in many of the datasets, we reformat each training item into pairs, designated as (q, p) ∈ D_pairs. Each pair includes a query string q and an associated target string p. To leverage explicit non-relevance judgments, we create an auxiliary set of triplets (q, p, n) ∈ D_triplets, which pair a query string q with a match p (positive) and a non-matching string n (negative).

Data Extraction:
The methods used to extract pairs and triplets are specific to each source dataset. For example, given a question-answer dataset, we use questions as query strings and answers as target strings. Retrieval datasets often contain queries that can serve as query strings, as well as relevant and non-relevant annotated documents that can operate as matching and non-matching strings.
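As a sketch of such extraction rules, the following hypothetical functions illustrate the mapping from source records to pairs and triplets (the field names `question`, `answer`, `relevant`, and `nonrelevant` are illustrative, not those of any particular dataset):

```python
def qa_to_pairs(records):
    """Turn question-answer records into (query, target) training pairs."""
    return [(r["question"], r["answer"]) for r in records]

def retrieval_to_triplets(records):
    """Turn annotated retrieval records into (query, positive, negative) triplets."""
    triplets = []
    for r in records:
        for pos in r["relevant"]:
            for neg in r["nonrelevant"]:
                triplets.append((r["query"], pos, neg))
    return triplets

pairs = qa_to_pairs([{"question": "What is an embedding?",
                      "answer": "A vector representation of text."}])
triplets = retrieval_to_triplets([{"query": "capital of France",
                                   "relevant": ["Paris is the capital of France."],
                                   "nonrelevant": ["Berlin is in Germany."]}])
```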
Training Steps: Our training process is a two-step approach. Initially, we train on pairs and then fine-tune the model using the triplets, as detailed in Section 3.3.

Pairwise Data Preparation
The substantial size and inconsistent quality of many large datasets necessitate a rigorous filtering pipeline. We apply the following steps to filter training data: De-Duplication: Duplicated entries within training data can negatively impact model performance [Hernandez et al., 2022] and potentially lead to overfitting. Consequently, we remove duplicate entries from our dataset. Considering the dataset's volume, we employ hash functions to identify and eliminate text pairs that map to duplicate hash values. We normalize whitespace and capitalization before checking for duplicates. Empty pairs and pairs with identical elements are also removed.
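A minimal sketch of this hash-based de-duplication; the specific normalization and hash choices here are illustrative assumptions, not the paper's exact implementation:

```python
import hashlib

def normalize(text):
    """Collapse whitespace and lowercase, so trivial variants hash identically."""
    return " ".join(text.split()).lower()

def deduplicate(pairs):
    """Drop duplicate, empty, and identical-element (q, p) pairs via hashing."""
    seen, kept = set(), []
    for q, p in pairs:
        nq, np_ = normalize(q), normalize(p)
        if not nq or not np_ or nq == np_:
            continue  # empty pair or pair with identical elements
        h = hashlib.sha256((nq + "\x1f" + np_).encode()).hexdigest()
        if h not in seen:  # first occurrence of this normalized pair wins
            seen.add(h)
            kept.append((q, p))
    return kept

kept = deduplicate([("A b", "c"), ("a  B", "C"), ("x", "x"), ("", "y")])
```

Hashing normalized pairs keeps memory proportional to the number of unique items rather than their text length, which matters at billion-pair scale.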
Language Filtering: Since we design our embedding models for English, we use the fasttext-language-identification model (https://huggingface.co/facebook/fasttext-language-identification), based on the fastText text classification method [Joulin et al., 2017], to remove non-English training items from the dataset.
Consistency Filtering: Consistency filtering means excluding training pairs with low semantic similarity. Previous studies suggest that eliminating low-similarity pairs using an auxiliary, albeit less precise, model boosts performance [Dai et al., 2023, Wang et al., 2022]. We employ the all-MiniLM-L6-v2 model (https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) for consistency filtering in this manner: We generate embeddings for 1M pairs (q_i, p_i) randomly sampled from D_pairs. For every pair (q, p) ∈ D_pairs in the dataset, we verify whether p is among the top two passages most similar to q, based on the cosine similarity of their embeddings compared to all passages p_i, i = 1, ..., 1M.
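The top-2 check can be sketched as follows; a stand-in `embed` function over toy vectors takes the place of the actual all-MiniLM-L6-v2 model, so this is a sketch of the logic only:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def consistency_filter(pairs, embed, sample_targets, top_k=2):
    """Keep (q, p) only if p ranks among the top_k passages most similar to q,
    compared against a reference sample of target passages."""
    sample_embs = [embed(t) for t in sample_targets]
    kept = []
    for q, p in pairs:
        q_emb = embed(q)
        s_qp = cosine(q_emb, embed(p))
        # count how many sampled passages are strictly more similar to q than p is
        rank = sum(1 for e in sample_embs if cosine(q_emb, e) > s_qp)
        if rank < top_k:
            kept.append((q, p))
    return kept

# Toy vectors: "good" nearly parallel to the query, "bad" orthogonal to it.
vecs = {"q": (1.0, 0.0), "good": (0.9, 0.1), "bad": (0.0, 1.0),
        "m1": (0.7, 0.3), "m2": (0.6, 0.4)}
kept = consistency_filter([("q", "good"), ("q", "bad")],
                          lambda t: vecs[t], ["good", "m1", "m2"])
```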
The application of these preprocessing steps reduces the size of the dataset from over 1.5 billion mixed-quality pairs to 385 million high-quality pairs. This reduction permits us to train our model with significantly less data than typical embedding models without sacrificing embedding quality (for instance, models like all-MiniLM-L6-v2 and all-mpnet-base-v2 are trained on nearly 1.2 billion pairs, whereas other T5-based models such as sentence-t5-base or sentence-t5-large are trained on 2.2 billion pairs).

Triplet Data Preparation
For the triplet dataset, we forego de-duplication and language filtering, as we assume the quality of these datasets already meets our requirements. However, we validate the relevance of the "positive" item with respect to the "query" for each triplet in a manner similar to consistency filtering. Instead of contrasting the embedding cosine similarity s(q, p) against a sample set, we compare it solely with the similarity s(q, n) derived from the same triplet (q, p, n) ∈ D_triplets. This is accomplished using a cross-encoder model, which evaluates each pair directly without generating embedding representations. More specifically, we leverage the ms-marco-MiniLM-L-6-v2 model (https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2) to verify whether the difference in retrieval scores determined by the model exceeds a threshold, r(q, p) − r(q, n) > κ, with threshold κ = 0.2, and eliminate all other triplets. This methodology draws inspiration from the de-noising strategy proposed in [Qu et al., 2021].
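The margin check itself is straightforward; the sketch below uses a stand-in `score` function in place of the actual ms-marco-MiniLM-L-6-v2 cross-encoder, so only the filtering logic is shown:

```python
def filter_triplets(triplets, score, kappa=0.2):
    """Keep (q, p, n) only if the cross-encoder scores the positive above the
    negative by more than the margin kappa: r(q, p) - r(q, n) > kappa."""
    return [(q, p, n) for q, p, n in triplets
            if score(q, p) - score(q, n) > kappa]

# Toy scores standing in for cross-encoder relevance judgments.
scores = {("q", "p1"): 0.9, ("q", "n1"): 0.5,   # margin 0.4 -> keep
          ("q", "p2"): 0.6, ("q", "n2"): 0.5}   # margin 0.1 -> drop
result = filter_triplets([("q", "p1", "n1"), ("q", "p2", "n2")],
                         lambda q, d: scores[(q, d)])
```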

Negation Data Preparation
We observe that many embedding models struggle to accurately embed negations. For instance, consider the three sentences: "A couple walks hand in hand down a street.", "A couple is walking together.", and "A couple is not walking together." The first two should be embedded close together, while the second and third, contradictory in meaning, should be positioned further apart. However, the all-MiniLM-L6-v2 model assigns a cosine similarity of 0.7 to the first two sentences, while attributing a similarity of 0.86 to the second and third. We decide to address this problem by creating our own negation dataset. This dataset, based on positive pairs from the SNLI dataset and negatives created with GPT-3.5, comprises triplets (anchor, entailment, negative) akin to the example given above, where (anchor, entailment) form a positive pair and the "negative" contradicts both the "anchor" and "entailment", while remaining syntactically very similar to the "entailment". This dataset forms a subset of our aforementioned triplet dataset, with training details provided in Section 3.3.
Our model evaluation on the negation dataset, which includes a comparative analysis with other popular open-source models, is presented in Section 4.3.

Data Composition
Our dataset of text pairs amounts to a total of 1.6 billion pairs before filtering, which is subsequently reduced to a robust 385 million high-quality pairs after rigorous filtering.
In comparison, our dataset of triplets initially comprises a total of 1.13 million entries before filtering, streamlined to 927,000 triplets after filtering.
The composition of our datasets after filtering is illustrated in Figure 1a for the text pairs, and in Figure 2 for the triplets. Together, these form the final dataset for the training of the JINA EMBEDDINGS models.

Training
Training takes place in two distinct phases. The first phase centers on training the model using the voluminous quantity of text pairs, consolidating the semantics of an entire text phrase into a single representative embedding. The second phase uses the relatively small triplet dataset, comprising an anchor, an entailment, and a hard negative, teaching the model to differentiate between similar and dissimilar text phrases.

Training on Pairwise Data
Each model within the JINA EMBEDDINGS set is based on, and trained using, the zero-shot T5 models of corresponding size, as detailed in [Raffel et al., 2020]. The zero-shot T5 models are composed of encoder-decoder pairs. However, Ni et al. [2022a] have demonstrated that it is more effective to calculate text embeddings using only the encoder component of the T5 models, as opposed to deploying both encoder and decoder. Consequently, the JINA EMBEDDINGS models use only the encoders of their respective T5 models.
During tokenization, JINA EMBEDDINGS models use SentencePiece [Kudo and Richardson, 2018] to segment input text and encode it into WordPiece tokens [Kudo, 2018]. Following the encoder model, a mean pooling layer is implemented to generate fixed-length representations from the token embeddings.
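Mean pooling averages token embeddings while ignoring padding, as in this minimal sketch (function and argument names are illustrative):

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over the positions where attention_mask == 1,
    producing one fixed-length vector regardless of sequence length."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for emb, m in zip(token_embeddings, attention_mask):
        if m:  # skip padding positions
            count += 1
            for j in range(dim):
                total[j] += emb[j]
    return [t / count for t in total]

# Third token is padding and does not contribute to the sentence embedding.
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
```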
For the training process involving pairs, we employ InfoNCE [van den Oord et al., 2018], a contrastive loss function. For a batch B ∈ D^k of k text pairs, the loss for a pair (q, p) ∈ B is calculated by comparing the cosine similarity s(q, p) between a given query q and its target p with the similarity of q to all other targets in the batch:

    L_NCE := E_{(q,p)~B} [ −ln( exp(s(q, p)/τ) / Σ_{i=1..k} exp(s(q, p_i)/τ) ) ]    (1)

We found that calculating the loss in both directions results in greater improvements during training. Accordingly, the loss is defined as:

    L_pairs := L_NCE + L̄_NCE,  where
    L̄_NCE := E_{(q,p)~B} [ −ln( exp(s(p, q)/τ) / Σ_{i=1..k} exp(s(p, q_i)/τ) ) ]

Intuitively, L̄_NCE matches the target string to all query strings instead. The constant τ denotes a temperature parameter, which we set to τ = 0.05. This method of calculating the loss is based on a similar method in [Neelakantan et al., 2022].
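The bidirectional loss can be sketched in plain Python, operating on raw vectors for clarity (a real implementation would compute this batched on GPU with automatic differentiation):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def info_nce(queries, targets, tau=0.05):
    """InfoNCE over a batch: each query's own target is contrasted
    against every other target in the batch."""
    k = len(queries)
    loss = 0.0
    for i in range(k):
        sims = [cosine(queries[i], targets[j]) / tau for j in range(k)]
        log_denom = math.log(sum(math.exp(s) for s in sims))
        loss += -(sims[i] - log_denom)  # -ln(softmax of the correct target)
    return loss / k

def bidirectional_loss(queries, targets, tau=0.05):
    """L_pairs = L_NCE + reversed L_NCE (targets matched against queries)."""
    return info_nce(queries, targets, tau) + info_nce(targets, queries, tau)

# Perfectly aligned batch -> near-zero loss; swapped targets -> large loss.
matched = bidirectional_loss([(1.0, 0.0), (0.0, 1.0)],
                             [(1.0, 0.0), (0.0, 1.0)])
swapped = bidirectional_loss([(1.0, 0.0), (0.0, 1.0)],
                             [(0.0, 1.0), (1.0, 0.0)])
```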

Data Sampling in Pairwise Training
Rather than sequentially training on individual datasets, we opt for a parallel approach, training on all datasets concurrently. We postulate that this parallel training promotes enhanced model generalization across diverse tasks. Despite this, each training batch is exclusively composed of data from a single dataset. This ensures that loss calculations, performed across the entire batch, do not conflate data from different tasks.
Our dataloader operates by initially selecting a dataset, followed by sampling the requisite number of data points from it to constitute a batch for the worker (refer to Section 4). Prior to training, the pairs within the datasets are thoroughly shuffled.
Sampling a dataset D_i follows a probability distribution ρ across all datasets, contingent upon the dataset's size |D_i| and a scaling factor s_i:

    ρ(D_i) = s_i |D_i| / Σ_j s_j |D_j|
Given the disparity in dataset sizes, it is critical to sample more frequently from larger datasets to prevent overfitting on the smaller ones. Furthermore, we manipulate the sampling rates of datasets using scaling factors to prioritize training on high-quality datasets and achieve balance among text domains. In scenarios where datasets with higher sampling rates deplete their items before the completion of a training epoch, the dataset is reset, enabling the model to cycle through its items anew. This ensures that high-sampling-rate datasets contribute multiple times within a single training epoch.
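Assuming, as described above, that the weight of dataset D_i is proportional to s_i·|D_i|, the sampling distribution can be sketched as:

```python
def sampling_distribution(sizes, scales):
    """rho(D_i) proportional to s_i * |D_i|, normalized over all datasets."""
    weights = [s * n for s, n in zip(scales, sizes)]
    total = sum(weights)
    return [w / total for w in weights]

# A scaling factor can offset a size imbalance: a small high-quality dataset
# (100 items, scale 3.0) is sampled as often as a larger one (300 items, scale 1.0).
probs = sampling_distribution([100, 300], [3.0, 1.0])
```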
Figure 1b displays the proportion of each dataset used based on their sampling rates.Following the creation of this adjusted distribution, the frequency of sampling from larger datasets significantly diminishes, resulting in only 180 million pairs actually being used during training.

Training on Triplet Data
Following the completion of pairwise training, the model progresses to the next phase, which involves training on the triplet datasets. This phase uses a different loss function, leveraging negatives for improved model performance.
We experimented with various triplet loss functions and found that the best results are achieved through a combination of multiple commonly used triplet loss functions. Specifically, we use the extended version of the InfoNCE loss L^triplets_NCE+, given by (2), which employs additional negatives [Reimers, 2023], the reverse InfoNCE loss L̄^triplets_NCE from the initial training phase, as given by (3), and the triplet margin loss function L^triplets_3, as presented in (4) [Chechik et al., 2010]:

    L^triplets_NCE+ := E_{(q,p,n)~B} [ −ln( exp(s(q, p)/τ) / Σ_{i=1..k} ( exp(s(q, p_i)/τ) + exp(s(q, n_i)/τ) ) ) ]    (2)

    L̄^triplets_NCE := E_{(q,p,n)~B} [ −ln( exp(s(p, q)/τ) / Σ_{i=1..k} exp(s(p, q_i)/τ) ) ]    (3)

    L^triplets_3 := E_{(q,p,n)~B} [ ReLU( s(q, n) − s(q, p) + ε ) ]    (4)
The triplet loss L^triplets_3 determines the difference between the cosine similarity of the query and target, s(q, p), and of the query and negative match, s(q, n). Furthermore, it establishes a minimal margin ε = 0.05 between these two values. If the negative is more similar to the query than the target is, or the margin is violated, L^triplets_3 returns a positive value. Otherwise, it yields 0, which is achieved through the application of the ReLU activation function. For the temperature parameter, we opted for a value of τ = 0.05.
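The per-triplet margin term can be sketched as:

```python
def triplet_margin_loss(s_qp, s_qn, eps=0.05):
    """ReLU(s(q, n) - s(q, p) + eps): zero when the positive beats the
    negative by at least the margin eps, positive otherwise."""
    return max(0.0, s_qn - s_qp + eps)

clear_win = triplet_margin_loss(0.9, 0.2)   # positive well ahead -> no loss
wrong_way = triplet_margin_loss(0.5, 0.6)   # negative more similar -> penalized
too_close = triplet_margin_loss(0.5, 0.48)  # margin of 0.02 < eps -> penalized
```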

Evaluation
We conduct a comprehensive evaluation to compare our models against other state-of-the-art models (Section 4.1), investigate the impact of our filtering pipeline (Section 4.2), and evaluate the models' sensitivity to negation of statements (Section 4.3). Section 6 provides details about the training.
To provide comprehensive results on the performance of models on various downstream tasks applicable to embeddings, we rely on the MTEB benchmark framework introduced by Muennighoff et al. [2023]. This also comprises all the retrieval tasks included in the BEIR [Thakur et al., 2021] benchmark. We also publish the code for executing it on our models on the Hugging Face pages of our models. For evaluating models on the negation dataset, we use our own separate evaluation tool.

Performance Against State-of-the-Art Models
To gauge the performance of the JINA EMBEDDINGS set in relation to other similarly sized open-source and closed-source models, we select representative models from five distinct size categories, as depicted in Table 1. Additionally, we include the sentence-t5 and gtr-t5 xl and xxl models, which are based on T5 models with 3 billion and 11 billion parameters, respectively. This inclusion allows investigating the performance variation with models of such massive scales. Table 6 presents the scores for MTEB's sentence similarity tasks, wherein the models within the JINA EMBEDDINGS set outshine their similarly sized counterparts across numerous tasks. Notably, the jina-large-v1 model consistently delivers comparable, if not superior, results to models in the billion-parameter scale. jina-base-v1 and jina-small-v1 also exhibit competitive performances with models of analogous sizes, exceeding their peers on the BIOSSES task. This highlights the benefits of training with highly diverse data sources.
jina-base-v1 consistently demonstrates performances similar to or better than gtr-t5-base, which was trained specifically for retrieval tasks [Ni et al., 2022b]. However, it seldom matches the scores of sentence-t5-base, which was trained on sentence similarity tasks [Ni et al., 2022a].
The evaluation of model performances on retrieval tasks, presented in Table 8, reflects a similar relationship among gtr-t5, sentence-t5, and JINA EMBEDDINGS. Here, gtr-t5 models, which have been specially trained on retrieval tasks, consistently score the highest for their respective sizes. JINA EMBEDDINGS models follow closely behind, whereas sentence-t5 models trail significantly. The JINA EMBEDDINGS set's capability to maintain competitive scores across these tasks underscores the advantage of multi-task training.
As illustrated in Table 7, jina-large-v1 also achieves exceedingly high scores on reranking tasks, often outperforming larger models.Similarly, jina-base-v1 surpasses gtr-t5-large and sentence-t5-large on several reranking tasks, which could once again be attributed to the specific training tasks of sentence-t5 and gtr-t5.

Impact of Filtering Steps
We evaluate the effectiveness of our dataset preprocessing pipeline by performing an ablation study. In this study, we fine-tune our smallest model on the Reddit dataset, with various preprocessing steps individually applied. The corresponding results are presented in Table 3.
The ablation study's results underscore the value of both language and consistency filtering as crucial preprocessing steps. Their combined application results in the highest performance across the majority of benchmarks.
Specifically for the Reddit dataset, we observe a significant performance boost with the application of consistency filtering, while language filtering only marginally enhances the performance. We can account for this disparity by noting that the language filter removes only 17.4% of the Reddit data, while consistency filtering screens out 84.3%. Reddit samples are primarily in English, but many are positive pairs with very low similarity, making consistency filtering more effective than language filtering.
The effectiveness of these preprocessing steps, however, does exhibit variability across different datasets.

Effectiveness of Negation Data
To determine the effectiveness of our models on negation data, we evaluate them against the test split of our negation dataset, comparing the results with other open-source models. We measure performance with respect to two metrics: one measures the percentage of samples where the model positions the anchor and entailment closer than the anchor and negative (an easy task, as the anchor and negative are syntactically dissimilar); the other measures the percentage of samples where the model positions the anchor and entailment closer than the entailment and negative (a hard task, as the entailment and negative are syntactically more similar than the anchor and entailment). The former is denoted by EasyNegation, the latter by HardNegation. The outcomes of these evaluations are displayed in Table 4. We assess our models both before and after fine-tuning on the triplet data, denoted as <model>-pairwise and <model>-all, respectively.
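The two metrics can be sketched as follows; `sims` holds precomputed cosine similarities per triplet, and the function name is illustrative:

```python
def negation_accuracy(sims):
    """sims: list of (s_anchor_entail, s_anchor_neg, s_entail_neg) per triplet.
    EasyNegation: anchor-entailment closer than anchor-negative.
    HardNegation: anchor-entailment closer than entailment-negative."""
    easy = sum(1 for ae, an, en in sims if ae > an) / len(sims)
    hard = sum(1 for ae, an, en in sims if ae > en) / len(sims)
    return easy, hard

# First triplet passes the easy check but fails the hard one (0.9 < 0.95),
# mirroring the failure mode described for negations.
metrics = negation_accuracy([(0.9, 0.3, 0.95), (0.8, 0.2, 0.5)])
```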
From the results, we observe that across all model sizes, fine-tuning on triplet data (which includes our negation training dataset) dramatically enhances performance, particularly on the HardNegation task. Our models are on par with other state-of-the-art open-source models in terms of performance, while achieving this with only a fraction of the training data required by their counterparts.

Related Work
The field of embedding models has seen significant advances over the years, with the development of various models featuring diverse architectures and training pipelines. For instance, Sentence-BERT [Reimers and Gurevych, 2019] uses BERT to generate sentence embeddings. Similarly, Sentence-T5 [Ni et al., 2022a], based on the encoder architecture of T5, demonstrates superior performance over Sentence-BERT on numerous benchmarks. The study underscores the effectiveness of encoders for sentence embeddings, contrasting with another approach that explores the use of decoders [Muennighoff, 2022].
Knowledge distillation [Hinton et al., 2015] offers an alternative approach to model training. In this setup, a larger, pre-trained model acts as a mentor, instructing a smaller model during training. This methodology can be seamlessly integrated with a contrastive loss function, presenting an avenue for future investigation.
Embedding models can also be characterized based on their functionality.For instance, while some models are designed to solely embed queries, others are trained to embed queries along with specific instructions, generating task-dependent embeddings [Su et al., 2023].An example of this using a T5-based model is the large dual encoder [Ni et al., 2022b], which is fine-tuned for retrieval tasks and computes a retrieval score directly.
Recent studies [Neelakantan et al., 2022, Wang et al., 2022] emphasize the benefits of contrastive pre-training coupled with fine-tuning on hard negatives. Both approaches have achieved state-of-the-art results on multiple benchmarks, with [Wang et al., 2022] also employing consistency filtering as part of their preprocessing pipeline.

Training Details
For training, we employ A100 GPUs and leverage the DeepSpeed stage 2 distributed training strategy [Rajbhandari et al., 2020] for effective multi-device management. We use the AdamW optimizer, coupled with a learning rate scheduler that adjusts the learning rate during the initial stages of training. The hyperparameters used across all three models throughout the training process are listed in Table 5.

Conclusion
This paper introduces the JINA EMBEDDINGS set of embedding models, demonstrating that competitive performance on various tasks can be achieved while substantially reducing the amount of training data, when compared to other models with comparable backbones. Through an extensive evaluation on the MTEB benchmark, we show that employing judicious data filtering techniques can lead to enhanced performance in comparison to training with a larger, yet lower-quality dataset. These findings significantly shift the paradigm, indicating that training large language models for embedding tasks can be conducted with less data than previously assumed, leading to potential savings in training time and resources.

However, we acknowledge the limitations of the current methodologies and the performance of the JINA EMBEDDINGS set. During the training on pairs, the sampling rate selection was based on a heuristic approach. Given the vast size of the search space for these sampling rates, we leaned on our intuition and dataset familiarity to prioritize higher-value datasets over their lower-value counterparts. This subjective approach, however, points to the need for more objective methods for future advancements.
Additionally, the JINA EMBEDDINGS set fell short on some tasks. For instance, calculating sentence similarity on our negation dataset (as described in Section 4.3) did not meet our expectations (see Table 4), nor did the models achieve competitive scores for classification and clustering tasks on the MTEB benchmark. These performance shortcomings suggest a possible deficit in the representation of these types of tasks in our training data, necessitating further investigation.
Looking ahead, we aim to refine our training processes to deliver models with improved performance and greater sequence length.Our future endeavors also include generating bilingual training data and training an embedding model capable of understanding and translating between two languages, thereby expanding the utility and versatility of the JINA EMBEDDINGS set.

Figure 1: The composition of 385 million pairwise data

Table 1: Model sizes and output dimensions

Table 3: Evaluation of Data-Preparation Effectiveness on the Reddit Dataset. Retrieval evaluated on nDCG@10, Sentence Similarity on Spearman.

Table 4: Evaluating a Range of Models on the Negation Dataset: A Benchmark Analysis of JINA EMBEDDINGS Trained on Both Pairwise-Only and Combined Pairwise and Triplet Data. The negation dataset is available at https://huggingface.co/datasets/jinaai/negation-dataset

Table 5: Hyperparameters