C-STS: Conditional Semantic Textual Similarity

Semantic textual similarity (STS) has been a cornerstone task in NLP that measures the degree of similarity between a pair of sentences, with applications in information retrieval, question answering, and embedding methods. However, it is an inherently ambiguous task, with the sentence similarity depending on the specific aspect of interest. We resolve this ambiguity by proposing a novel task called conditional STS (C-STS), which measures similarity conditioned on an aspect elucidated in natural language (hereon, condition). As an example, the similarity between the sentences "The NBA player shoots a three-pointer." and "A man throws a tennis ball into the air to serve." is higher for the condition "The motion of the ball." (both upward) and lower for "The size of the ball." (one large and one small). C-STS's advantages are two-fold: (1) it reduces the subjectivity and ambiguity of STS, and (2) it enables fine-grained similarity evaluation using diverse conditions. C-STS contains almost 20,000 instances from diverse domains, and we evaluate several state-of-the-art models to demonstrate that even the most performant fine-tuning and in-context learning models (GPT-4, Flan, SimCSE) find it challenging, with Spearman correlation scores of < 50. We encourage the community to evaluate their models on C-STS to provide a more holistic view of semantic similarity and natural language understanding.


Introduction
Over the years, natural language processing (NLP) has progressed through the co-evolution of model design (e.g. architectures, training methods) and evaluation methods for language tasks (Wang et al., 2018, 2019; Hendrycks et al., 2021). A common task used to evaluate NLP models has been Semantic Textual Similarity (STS) (Agirre et al., 2012), which evaluates the models' ability to predict the semantic similarity between two sentences.

Figure 1: C-STS: Two sentences ("A windsurfer skims the water with his outstretched hand." and "The surfer is riding a wave with a mountain in the background.") are judged by their similarities based on free-form natural language conditions. The two sentences are more similar when judged by the condition 'The base of the object' as both windsurfing and surfing use a similar board, but are dissimilar when judged by the condition 'The way the object is propelled' because one is propelled by waves and the other by wind. Providing conditions reduces ambiguity of the sentence similarity task, and allows evaluation of a grounded and multi-faceted notion of sentence similarity.

Several diverse STS datasets are popularly used, with prior work expanding the STS task to multiple domains and languages (Agirre et al., 2013, 2014, 2015, 2016; Cer et al., 2017; Abdalla et al., 2021). STS tasks have been a component of the popular GLUE natural language understanding benchmark (Wang et al., 2018) and are a key evaluation tool for sentence-representation learning specifically (Conneau et al., 2017; Cer et al., 2018; Reimers and Gurevych, 2019; Gao et al., 2021, inter alia).
Despite its popularity, STS may be inherently ill-defined. The general semantic similarity of two sentences can be highly subjective and vary wildly depending on which aspects one decides to focus on. As observed in several studies, ambiguity in similarity judgements of word or sentence pairs can be reduced with the help of context for both humans (De Deyne et al., 2016a,b) and models (Veit et al., 2016; Ye et al., 2022a; Lopez-Gazpio et al., 2017; Camburu et al., 2018).
Considering the importance of STS tasks for evaluating sentence representations, we propose a new task called Conditional STS (C-STS), illustrated in Figure 1, which seeks to disambiguate the similarity of two sentences by measuring similarity within the context of a condition sentence.
C-STS uses free-form natural language conditions, enabling us to evaluate and probe natural language understanding for myriad fine-grained aspects. Figure 1 illustrates two conditions ("The base of the object" and "The way the object is propelled") which probe language models' conception of similarity for different aspects concerning water sports and physical reasoning. Since conditions themselves are unconstrained sentences, they allow us to evaluate a precise, grounded, and multifaceted notion of sentence similarity.
To comprehensively test models on C-STS, we create the C-STS-2023 dataset, which includes 18,908 instances containing sentence pairs, a condition, and a scalar similarity judgement on the Likert scale (Likert, 1932). We find that even state-of-the-art sentence encoders and large language models perform poorly on our task. Although SimCSE (Gao et al., 2021) and GPT-4 (OpenAI, 2023a) are among the best-performing systems, their relatively poor Spearman correlations of 47.5 and 43.6, respectively, point to significant room for improvement (for comparison, SimCSE achieves a Spearman correlation of 88.09 on the STS-B validation split).
We believe that C-STS provides a testbed for potentially novel modeling settings and applications. Toward this end, we propose and evaluate a unique encoding setting (a tri-encoder) and objectives (a quadruplet contrastive loss with hard negatives) that take advantage of C-STS's three-sentence inputs and paired high- and low-similarity instances.
Our qualitative analysis shows that models find C-STS challenging when tested on different aspects of the same sentence pair rather than testing an unconditional and ambiguous notion of similarity. We hope that future work evaluates on C-STS in addition to STS tasks to comprehensively benchmark semantic similarity in language models.

Methodology
The C-STS task requires sentence pairs, conditions which probe different aspects of similarity, and the similarity label for a given sentence pair and condition. This section describes the technical details involved in creating the dataset.

Background: Semantic textual similarity
Semantic textual similarity (STS) (Agirre et al., 2012, 2013, 2014, 2015, 2016; Cer et al., 2017) is a task which requires machines to make similarity judgements between a pair of sentences ({s_1, s_2}). STS measures the unconditional semantic similarity between sentences because the annotator making the similarity assessment must infer which aspect(s) of the sentences are being referred to. Formally, consider conditions c_i ∈ C that refer to disjoint aspects of the sentences; then the similarity of the two sentences may be represented as:

sim(s_1, s_2) = Σ_i w_i · sim_{c_i}(s_1, s_2)

Here, w_i is the weight assigned by the annotator to the condition c_i, and sim_{c_i}(s_1, s_2) is the similarity of the sentences with respect to that condition. These weights are latent to the task, and each annotator applies their own interpretation of them when marginalizing over conditions, thus introducing ambiguity into the task. C-STS seeks to disambiguate the STS task by measuring similarity conditioned on a single aspect specified in natural language.
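The decomposition above can be made concrete with a small sketch. The condition names, per-condition scores, and annotator weights below are illustrative assumptions, not values from the dataset; the point is only that unconditional STS mixes latent weights while C-STS fixes a single condition.

```python
# Sketch: unconditional similarity as a weighted mean of latent
# per-condition similarities. Weights are annotator-specific, so two
# annotators can disagree on STS while agreeing on every condition.

def unconditional_sim(cond_sims, weights):
    """sim(s1, s2) = sum_i w_i * sim_{c_i}(s1, s2), with weights summing to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * s for w, s in zip(weights, cond_sims))

# Hypothetical per-condition similarities for one sentence pair:
sims = {"color of the object": 5.0, "size of the object": 1.0}

# Two annotators weighting the conditions differently disagree on STS...
a1 = unconditional_sim(list(sims.values()), [0.8, 0.2])  # 4.2
a2 = unconditional_sim(list(sims.values()), [0.2, 0.8])  # 1.8

# ...but C-STS removes the latent weights by fixing one condition:
c_sts = sims["size of the object"]  # 1.0 for both annotators
```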

Conditional semantic textual similarity
C-STS is a task comprised of quadruplets containing two sentences (a sentence pair), a natural language condition, and a similarity assessment ({s_1, s_2, c, y}). Crucially, we do not place any strict constraints on c, allowing it to be any relevant phrase. This allows us to probe potentially any possible aspect of similarity that may be considered between sentences.

Sentence Data Collection
The first stage of making the C-STS dataset is to acquire the sentence pairs that will later be used in eliciting conditioning statements from annotators.
We source sentence pairs {s 1 , s 2 } for our dataset from image-captioning datasets through a two-step process: (1) generate candidate text pairs through dense retrieval from the corresponding image representations and (2) filter out candidates that are irrelevant or ineffectual for our purposes.
Image Retrieval  Image-captioning datasets provide a compelling data source because image pair similarity and caption (text) pair similarity encode different semantics (Parekh et al., 2021). Image representations thus serve as an informative latent variable which can represent their captions in ways that are not captured by text retrievers.
Since current sentence representation models overlook aspects of conditional similarity, we utilize both the image and text to retrieve sentence pairs which form the foundation of our dataset.
We aim to derive sentence pairs from an image-caption dataset D to aid in creating conditioning statements. To do this, we first generate a store of image pairs, P_I. Each pair, denoted (I_i, I_j), is such that I_j is amongst the top-k most similar images to I_i, determined by the cosine distance of their respective image representations obtained via an image encoder E_I(·). After establishing P_I, we convert it into a sentence pair store (P_S) by replacing each image in a pair with its corresponding caption. When each image I_i ∈ D is associated with a set of sentences {s}_i, we take all sentence pairs from the Cartesian product {s}_i × {s}_j for each image pair (I_i, I_j) ∈ P_I.
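The retrieval step above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the image embeddings are assumed precomputed (the paper uses CLIP-ViT as E_I), and the function name `build_sentence_pairs` is our own.

```python
import numpy as np
from itertools import product

def build_sentence_pairs(image_embs, captions, k=2):
    """Sketch of the retrieval step: pair each image I_i with its top-k
    most similar images by cosine similarity of the embeddings E_I(I),
    then expand each image pair into the Cartesian product of captions.

    image_embs: (N, d) array of image representations.
    captions:   list of N caption lists, one per image."""
    X = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = X @ X.T
    np.fill_diagonal(sims, -np.inf)          # exclude self-pairs
    pair_store = []
    for i in range(len(X)):
        for j in np.argsort(-sims[i])[:k]:   # top-k neighbours of I_i
            pair_store.extend(product(captions[i], captions[int(j)]))
    return pair_store
```

For large collections the brute-force similarity matrix would be replaced by an approximate nearest-neighbour index, but the pairing logic is the same.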
Candidate Filtering After acquiring initial sentence pairs through image retrieval, we perform additional filtering to eliminate sentence pairs which are ill-suited for our task.
Specifically, we aim to include only pairs of sentences for which the unconditional similarity is somewhat ambiguous, as this incentivizes models to rely on the condition when reasoning about the conditional similarity.
To this end, we avoid high-similarity sentence pairs by filtering out those with a high bag-of-words intersection over union, and avoid low-similarity pairs by choosing sentences with moderate or high cosine similarity of their SimCSE embeddings (Gao et al., 2021). See Appendix A.2 for a full description of all filtering criteria used.
Dataset sources  For the construction of sentence pair candidates, we use two image-caption datasets: the train split from the 2014 MS COCO dataset (Lin et al., 2014) containing ∼ 83,000 images, and Flickr30K (Young et al., 2014) containing ∼ 31,000 images. Each dataset is processed separately and we do not intermix them during the retrieval stage. We use CLIP-ViT (Radford et al., 2021) to encode images and include the specific filtering criteria in Table 6 of Appendix A.2.

Annotation Methodology
For each sentence pair in the store (P_S), we wish to collect conditions and semantic similarity annotations for each sentence pair and condition triplet, {s_1, s_2, c}. As c is a free-form natural language sentence, the annotator is provided with a high level of control over which aspect to condition on. Human annotations are acquired through Mechanical Turk in a 3-stage process.
Stage 1: Choosing a high-quality worker pool  In the first stage, we design a qualification test to select workers who excel at our task. Specifically, we test two skills: (1) the quality of conditions they write for a given sentence pair and (2) semantic similarity judgements for a triplet {s_1, s_2, c}. We choose a pool of 271 workers who perform well on both tasks and restrict subsequent stages to include only workers from this pool. See Appendices C.1 and C.2 for an example of these tests.
Stage 2: Condition annotation  After sourcing sentence pairs {s_1, s_2} using the strategy discussed in Section 2.2.1, we instruct workers to annotate each pair with one condition such that the similarity in its context is high (C-High) and one such that the similarity in its context is low (C-Low). Example:

s_1: A large green ball was bouncing on the street
s_2: I bought a small green avocado
C-High: The color of the object
C-Low: The size of the object

We do not place any constraints on the conditions other than that they should be semantically unambiguous phrases and relevant to the sentence pair (Appendix C.1).
Stage 3: Condition verification and similarity assessment  The outputs of the previous stage are triplets {s_1, s_2, c} with a binary similarity assessment (high or low). In this stage, we ask new annotators to assign a similarity on a Likert scale (Likert, 1932) (as an integer between 1 and 5), as is common with semantic textual similarity tasks (Agirre et al., 2012). In addition to assigning a similarity, we also use this stage to verify whether the conditions from the previous stage are pertinent to the sentence pairs, filtering out potentially low-quality examples. At the end of this stage, we have {s_1, s_2, c, y} quadruplets which have passed a layer of human verification (Appendix C.2).

Dataset Analysis
Dataset statistics  To ensure high-quality, faithful, and diverse annotations, we collect a total of 20,000 instances and perform quality assurance (Section 5.3), resulting in a total of 18,908 instances as part of the C-STS-2023 dataset. Following standard practice, we create train, validation, and test splits in a 60 : 15 : 25 ratio. We present the distribution of similarity scores, which are discrete values in [1, 5], in Figure 4. We also measure the inter-annotator agreement on a random sample of 100 examples with three independent annotations and find the Fleiss' kappa score (Fleiss, 1971) to be 0.61, which implies substantial inter-annotator agreement. The average lengths of sentences and conditions are 12.6 and 5.3 words, respectively.
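The agreement statistic used above can be computed as a generic sketch of Fleiss' kappa (this is the textbook formula, not the authors' code):

```python
def fleiss_kappa(table):
    """Fleiss' kappa. `table[i][j]` is the number of raters who assigned
    item i to category j; every item must have the same rater count n."""
    N = len(table)
    n = sum(table[0])                       # raters per item
    # Mean per-item agreement P_bar over items
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in table) / N
    # Marginal category proportions p_j and chance agreement P_e
    p = [sum(row[j] for row in table) / (N * n)
         for j in range(len(table[0]))]
    P_e = sum(pj * pj for pj in p)
    return (P_bar - P_e) / (1 - P_e)

# Perfect agreement (three raters, two categories) gives kappa = 1.0
assert fleiss_kappa([[3, 0], [0, 3]]) == 1.0
```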
Qualitative analysis  C-STS allows us to evaluate the generally fuzzy notion of sentence similarity with greater fidelity. We illustrate this in Table 1, where precise and discriminative conditions allow a targeted, fine-grained, and grounded definition of sentence similarity. The following is a representative instance where the conditions tease out nuanced and hidden similarities and differences between two lexically similar sentences on surfing: consider s_1: "A windsurfer skims the water. . ." and s_2: "The surfer is riding a wave. . .". While the sentences are significantly dissimilar based on the condition "the way the object is propelled", as they describe windsurfing and surfing respectively (the former uses a sail whereas the latter depends on the wave), they are very similar in the context of the condition "the base of the object", as both windsurfing and surfing use a similar board.
Our diverse set of conditions provides broad support over the distribution of conditions and enables a holistic and multi-faceted evaluation of sentence similarity. For example, the conditions for the sentences on tennis in Table 1 test similarity both on the sport being played (which requires understanding lexical and knowledge artifacts) as well as the number of people (which requires reasoning and commonsense capabilities).

Baselines
We evaluate our dataset on several baselines, which can be categorized into (1) fine-tuning baselines, which are pre-trained models finetuned on the C-STS training split, and (2) large language model (LLM) baselines, which are evaluated using instructions and in-context examples.

Fine-tuning baselines
We evaluate three sentence encoder models: RoBERTa (Liu et al., 2019), supervised SimCSE (Gao et al., 2021), and unsupervised DiffCSE (Chuang et al., 2022). SimCSE and DiffCSE represent state-of-the-art sentence encoder models which are particularly strong on STS tasks. For both SimCSE and DiffCSE, we use the RoBERTa pre-trained varieties.
Encoding configurations  Encoder-only Transformer models, such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), initially performed regression finetuning for STS tasks by simply concatenating the sentences and encoding them together before generating a prediction; let us call this type of architecture a cross-encoder. Recent approaches instead opt to encode sentences separately and compare their similarity using a distance metric, such as the cosine distance (Reimers and Gurevych, 2019); we will call this a bi-encoder.

Table 1: Four examples from the C-STS validation set. Under different conditions, the same sentence pair can be separated into high similarity and low similarity. Scale from 1 (dissimilar) to 5 (similar).

Sentence 1: An older man holding a glass of wine while standing between two beautiful ladies.
Sentence 2: A group of people gather around a table with bottles and glasses of wine.
Conditions: The people's demeanor: 5 | The number of bottles: 1

Sentence 1: Various items are spread out on the floor, like a bag has been emptied.
Sentence 2: A woman with a bag and its contents placed out before her on a bed.
Conditions: The arrangement of objects: 4 | The surface the objects are on: 1

Sentence 1: A windsurfer skims the water with his outstretched hand.
Sentence 2: The surfer is riding a wave with a mountain in the background.
Conditions: The base of the object: 5 | The way the object is propelled: 1

Sentence 1: Female tennis player jumping off the ground and swinging racket in front of an audience
Sentence 2: A young lady dressed in white playing tennis while the ball girl retrieves a tennis ball behind her.
Conditions: The sport being played: 5 | The number of people: 1
While DiffCSE and SimCSE were designed with the bi-encoder setting in mind, we observe that they work well in the cross-encoder setting as well.
For our baselines, we evaluate each model in both settings. For the cross-encoder configuration, we encode the triplet containing the sentences and the condition ({s_1, s_2, c}), and the output is a scalar similarity score f_θ(s_1; s_2; c). For the bi-encoder configuration (Reimers and Gurevych, 2019), the sentences of a pair are encoded independently along with the condition using a Siamese network, and their cosine similarity is computed as cos(f_θ(s_1; c), f_θ(s_2; c)).
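The two configurations can be sketched abstractly as follows. The `encode` and `head` callables are stand-ins for the pre-trained encoder and regression head, and the separator token is an assumption; the paper's exact input formatting may differ.

```python
import numpy as np

def cos(u, v):
    """Cosine similarity of two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def cross_encoder_score(encode, head, s1, s2, c):
    """Cross-encoder: encode the concatenated triplet {s1, s2, c}
    jointly and regress a scalar similarity f_theta(s1; s2; c)."""
    return head(encode(f"{s1} </s> {s2} </s> {c}"))

def bi_encoder_score(encode, s1, s2, c):
    """Bi-encoder: encode each sentence together with the condition
    using shared (Siamese) weights, then compare the two vectors:
    cos(f_theta(s1; c), f_theta(s2; c))."""
    return cos(encode(f"{s1} </s> {c}"), encode(f"{s2} </s> {c}"))
```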
In addition to the bi- and cross-encoder models, we propose tri-encoder models which encode each sentence and condition separately. This conceptually resembles late-interaction contextualized retrieval approaches, such as Humeau et al. (2020) or Khattab and Zaharia (2020), but our approach is specific to C-STS. For this, we first encode all sentences of the triplet separately with encoder f_θ(·) as s_i = f_θ(s_i), where s_i ∈ R^d. We then perform an additional transformation h : R^{2d} → R^d that operates on the condition and each of the sentences. We finally compute the conditional similarity using the cosine similarity as cos(h(c; s_1), h(c; s_2)). We experiment with two functions for h: an MLP and the Hadamard product.
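The tri-encoder scoring can be sketched for the Hadamard variant of h (again with `encode` as a stand-in for f_θ; the MLP variant would replace the elementwise product with a small learned network on the concatenated vectors):

```python
import numpy as np

def tri_encoder_score(encode, s1, s2, c):
    """Tri-encoder sketch: encode s1, s2, and c independently, compose
    each sentence with the condition via the Hadamard variant of h,
    h(c; s) = c * s, then compare the composed vectors with cosine
    similarity: cos(h(c; s1), h(c; s2))."""
    e1, e2, ec = encode(s1), encode(s2), encode(c)
    h1, h2 = ec * e1, ec * e2
    return float(h1 @ h2 / (np.linalg.norm(h1) * np.linalg.norm(h2)))
```

Because the condition is encoded on its own, sentence encodings can be cached and recombined with many conditions cheaply, which is the late-interaction appeal of this setup.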
Objectives  In addition to the standard MSE loss for regression, we use a quadruplet contrastive margin loss, which we denote Quad. Since each sentence pair in C-STS comes with two conditions (one with higher similarity and one with lower similarity), we represent the conditional encoding of each sentence in the higher-similarity pair as p_1 and p_2, and the conditional encoding of each sentence in the lower-similarity pair as n_1 and n_2. The Quad loss is then defined as:

L_Quad = max(0, λ − cos(p_1, p_2) + cos(n_1, n_2))

where λ is a margin hyperparameter.
We train all of our tasks for regression using, alternatively, mean squared error (MSE), Quad, and a linear combination of the quadruplet loss and MSE (Quad + MSE). Since we require a separate conditional encoding for each sentence, the Quad and Quad + MSE objectives apply only to the bi-encoder and tri-encoder configurations.
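A minimal sketch of a quadruplet margin loss of this form, which enforces cos(p_1, p_2) ≥ cos(n_1, n_2) + λ, follows. This is our reading of the objective; the margin value, the combination weight `alpha`, and the assumption that labels are rescaled to [0, 1] for the MSE term are illustrative, not the paper's settings.

```python
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def quad_loss(p1, p2, n1, n2, margin=0.5):
    """Quadruplet margin loss: the conditional encodings of the
    high-similarity pair (p1, p2) should be at least `margin` more
    similar than those of the low-similarity pair (n1, n2)."""
    return max(0.0, margin - cos(p1, p2) + cos(n1, n2))

def quad_mse_loss(p1, p2, n1, n2, y_high, y_low, alpha=1.0, margin=0.5):
    """Linear combination Quad + MSE, treating cosine similarity as the
    prediction (labels assumed rescaled to [0, 1] for this sketch)."""
    mse = (cos(p1, p2) - y_high) ** 2 + (cos(n1, n2) - y_low) ** 2
    return quad_loss(p1, p2, n1, n2, margin) + alpha * mse
```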
Hyperparameters  We evaluate the baselines on the test split for C-STS. We perform a hyperparameter sweep to select the best-performing configuration and test using models trained with 3 random seeds, with further details in Appendix A.3. As a comparison for our training setting, we perform a similar hyperparameter sweep for the STS-B (Cer et al., 2017) dataset, with the validation split results and best hyperparameters shown in Table 9, showing that our finetuned baselines achieve very strong performance on traditional STS tasks.
When evaluating zero- or few-shot capabilities, each model input is composed of up to three parts: an instruction (task definition), k in-context examples, and the query. Models are evaluated with 0, 2, or 4 examples and with three different instruction prompts: no instruction; a short instruction, which provides only a high-level description of the task; and a long instruction, shown in Figure 6, which resembles the annotation guidelines and is similar to the instructions used for the STS-B classification task in Wang et al. (2022).
For few-shot evaluation, we additionally always group a sentence pair's two conditional similarity examples together, so models will always see contrasting pairs in the examples, but won't see a paired example for the query. We provide examples of the formats used for the input and output for more settings in Appendix B. As we did for the finetuned models, we also evaluate these models on the STS-B validation split, shown in Table 12, with instruction-finetuned models and ChatGPT achieving strong performance.
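The prompt-assembly scheme described above can be sketched as follows. The field names and layout are plausible assumptions, not the exact templates from Appendix B; the key property shown is that each example sentence pair contributes both of its contrasting conditions, while the query appears alone.

```python
def build_prompt(instruction, example_pairs, query):
    """Sketch of few-shot prompt assembly for C-STS. `example_pairs`
    holds (s1, s2, c_high, y_high, c_low, y_low) tuples so that each
    sentence pair's two contrasting conditions stay grouped together;
    the query is a (s1, s2, c) triple with no paired condition."""
    parts = [instruction] if instruction else []
    for s1, s2, ch, yh, cl, yl in example_pairs:
        parts.append(f"Sentence 1: {s1}\nSentence 2: {s2}\n"
                     f"Condition: {ch}\nSimilarity: {yh}")
        parts.append(f"Sentence 1: {s1}\nSentence 2: {s2}\n"
                     f"Condition: {cl}\nSimilarity: {yl}")
    qs1, qs2, qc = query
    parts.append(f"Sentence 1: {qs1}\nSentence 2: {qs2}\n"
                 f"Condition: {qc}\nSimilarity:")
    return "\n\n".join(parts)
```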

Evaluating sentence encoders on C-STS
Zero-shot bi-encoder performance  As an initial comparison, we evaluate bi-encoder models without finetuning on both C-STS and STS-B. As shown in Table 2, strong performance on STS-B does not translate to good performance on C-STS, suggesting that these models fail entirely to incorporate the provided conditioning statement. These results suggest that current approaches to training sentence encoders may be too specialized to existing evaluation tasks, such as STS-B.

Fine-tuning baselines  We finetune our sentence encoder baselines on C-STS and show the test performance in Table 3. Again, the best models are SimCSE and DiffCSE in the bi-encoding setting. This suggests that the sentence representations learned in their contrastive learning phase substantially facilitate learning for C-STS, but all models still struggle, with Spearman correlations below 50.
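Evaluation throughout reduces to the Spearman rank correlation between model predictions and gold labels (reported on a 0-100 scale). A self-contained sketch via the rank-difference formula, which assumes no ties (in practice a library routine such as scipy.stats.spearmanr handles the general case):

```python
def spearman(preds, labels):
    """Spearman correlation: 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)),
    where d_i is the difference in ranks. Assumes no tied values."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0] * len(xs)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rp, rl = ranks(preds), ranks(labels)
    n = len(preds)
    d2 = sum((a - b) ** 2 for a, b in zip(rp, rl))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Perfectly ranked predictions score 1.0; one swapped pair lowers it.
assert spearman([1, 2, 3], [2, 4, 6]) == 1.0
```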
Performance on C-STS varies significantly depending on the encoding configuration, with the bi-encoder setting proving to be the most effective, especially for the SimCSE and DiffCSE models. Performance of the tri-encoder model, introduced in Section 4.1, was generally poor, with all models performing well below their bi-encoding and cross-encoding counterparts.

Evaluating pre-trained LLMs
We show the performance of generative models evaluated on C-STS in various prompting settings in Table 4, with additional results for smaller Flan-T5 models in Table 11 in the Appendix. Notably, the state-of-the-art language model, GPT-4, performs substantially better than all competing models and systems (UL2, Flan-T5, ChatGPT-3.5) and is competitive with a finetuned SimCSE_LARGE model, the best-performing sentence encoder. For example, in most settings, GPT-4 outperforms ChatGPT-3.5 and Flan models by over 10 points. This suggests existing large language model benchmarks may correlate with C-STS, as GPT-4 has been shown to be the most proficient model in a wide variety of evaluation settings (OpenAI, 2023b).
Between suites of models of different sizes (viz. Flan-T5, Tk-Instruct), we observe a strong correlation between model scale and performance. We also find that providing instructions improves performance substantially for C-STS, and that this performance is robust to different instruction lengths and numbers of in-context examples.

Analysis
Scaling laws for C-STS  We evaluate the effect of the quantity of C-STS data on sentence-embedding methods for SimCSE_LARGE (Figure 3). We notice that for all three encoding strategies, performance monotonically increases as we increase the size of the training dataset. For example, for the SimCSE bi-encoder, the Spearman correlation increases from 30 when using a train set of 1,000 examples to 45 for 7,000 examples.
The performance of the models, especially the bi-encoder, increases almost linearly as we increase the amount of data. This quantitatively reinforces the quality of the dataset, but also makes the point that, rather than relying on more data, we require better modeling strategies.
Qualitative analysis  We present predictions from different models in Table 5 to illustrate systematic pitfalls. For instance, Flan-T5 makes incorrect predictions even for straightforward instances and falsely predicts that both sentences talk about the same dish, even though the sentences clearly talk about sandwiches and pizza respectively. Additionally, ChatGPT-3.5 incorrectly predicts that the two sentences are completely dissimilar when talking about the types of plants, even though both sentences mention flowering plants. Note that our annotation, unlike ChatGPT-3.5, captures the nuance that the first sentence talks about both shrubbery and flowers, while the second sentence talks only about flowers, and therefore assigns a conservative similarity score of 3. The most proficient model on C-STS, GPT-4, is much better at capturing these nuances and accurately predicts, for instance, that the height of the giraffe's head (refer to the fourth example) is high in one sentence and low in another. GPT-4 is far from perfect though, and we outline a negative prediction (refer to the third example), where the model does not predict that the two sentences talk about the same game, even though they are very clearly about "Football". More broadly, C-STS provides a lens into a model's ability to understand and reason over specific parts of each sentence and is well-suited to revealing systematic modeling issues.

Table 5: Model predictions (Pred) and gold labels (Label) on C-STS examples.

Model: Flan-T5-Base
Sentence 1: A man taking a bite out of a sandwich at a table with someone else.
Sentence 2: A man sitting with a pizza in his hand in front of pizza on the table.
Condition: Type of dish.
Output: Pred: 4.5, Label: 1.0

Model: GPT-3.5
Sentence 1: A wooden bench surrounded by shrubbery and flowers on the side of a house.
Sentence 2: A scene displays a vast array of flower pots in front of a decorated building.
Condition: The type of plants.

Model: GPT-4
Sentence 1: Football player jumping to catch the ball with an empty stand behind him.
Sentence 2: A football player preparing a football for a field goal kick, while his teammates and coach watch him.
Condition: The game being played.
Output: Pred: 3.0, Label: 5.0

Model: GPT-4
Sentence 1: A giraffe reaches up his head on a ledge high up on a rock.
Sentence 2: A giraffe in a zoo bending over the fence towards where impalas are grazing.
Condition: The height of the giraffe's head.
Output: Pred: 1.0, Label: 1.0

Related Work
Historical perspectives on semantic similarity  Measuring semantic similarity is a long-standing problem spanning cognitive science (Miller and Charles, 1991) and psychology (Tversky, 1977), where early attempts were made to quantify subjective similarity judgements with information-theoretic concepts. More recently, interest in semantic similarity has grown in the context of machine learning, with work in computer vision recognizing that the notion of similarity between images varies with conditions (Veit et al., 2016) and can therefore be ambiguous (Ye et al., 2022b).
Textual similarity tasks  Capturing textual similarity is also considered a fundamental problem in natural language processing. Works such as Agirre et al. (2012, 2016) define the semantic textual similarity (STS) task, which is widely used in common benchmarks such as GLUE (Wang et al., 2018). Extensions to the STS setting have been proposed, such as broadening the task with multilinguality (Cer et al., 2017) or incorporating relatedness (Abdalla et al., 2021). However, the loose definition of similarity has not been explicitly acknowledged as an issue. In contrast, our work tackles the ambiguity problem by collecting conditions, thereby reducing subjectivity. To alleviate ambiguity, explanations play an important role in identifying the differences between two sentences, either in their syntactic structure (Lopez-Gazpio et al., 2017) or in natural language (Camburu et al., 2018), but the post-hoc nature of explanations prevents them from being used prior to the similarity judgement, rendering them a supplemental component as opposed to a paradigm change in the task setup. Beyond STS, works that leverage conditioning to enhance sentence representations obtain improved performance for retrieval (Asai et al., 2023) and embedding quality (He et al., 2015; Su et al., 2023; Jiang et al., 2022), which corroborates the observation that conditioning as a form of disambiguation benefits similarity measures.

Conclusion
In this work, we propose conditional semantic textual similarity (C-STS), a novel semantic similarity assessment task that resolves the inherent ambiguity in STS. Given the importance of STS in sentence representation evaluation, we believe that C-STS is a timely and necessary addition to the language model evaluation landscape. Rather than testing unconditional semantic similarity, the diversity of conditions in our dataset allows fine-grained evaluation. The same sentence pairs can be tested on a variety of different aspects represented by conditions, with similarities often varying significantly. C-STS poses a challenging hurdle to both encoder-only and state-of-the-art generative language models, which struggle to capture the high-dimensional manifold of similarity.
We believe that a combination of improved modeling and fine-tuning strategies is required to push the boundaries on C-STS, and we hope that C-STS can enable innovative future work in language understanding and representation learning.

Limitations
We propose the novel task of conditional semantic textual similarity (C-STS). Given that this is a new task, we collect a dataset of over 19,000 instances, but one limitation is that this size could be increased to ensure that sentence-embedding-style models have additional data for fine-tuning. Further, we use two different sources to collect our sentence pairs, and future studies, motivated by STS follow-ups, can collect data from other sources.

References

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670-680, Copenhagen, Denmark. Association for Computational Linguistics.
Simon De Deyne, Daniel J Navarro, Amy Perfors, and Gert Storms. 2016a. Structure at every scale: A semantic network account of the similarities between unrelated concepts. Journal of Experimental Psychology: General, 145(9):1228.
Simon De Deyne, Amy Perfors, and Daniel J Navarro. 2016b. Predicting human similarity judgments with distributional models: The value of word associations. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1861-1870.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.

A Appendix
The distribution of similarities is equitably spread out over the Likert scale, as depicted in Figure 4.

A.2 Sentence Pair Generation Details
Here we include some further details about sourcing sentence pairs from image-caption datasets.
As discussed in Section 2, we use a variety of metrics to quantitatively characterize the sentence pairs, and then to filter with the goal of removing pairs with excessively high or low unconditional similarity. The general criteria we consider are defined as follows:

• IOU - This is computed by taking the intersection over union of the bag of words for each sentence, after stopword removal. It represents the lexical similarity and overlap of a sentence pair.
• d text -The cosine distance of the pair's Sim-CSE embeddings.We chose SimCSE due to its ubiquity and effectiveness.
• ratio -This is the ratio of the shorter sentence's word count to the longer sentence's word count in a given pair.
• length -This is the character length of the shortest sentence in a pair.
Using these criteria, we filter the sentence pairs based on thresholds (exact values shown in Table 6), rejecting a pair if it violates any criterion. These thresholds were selected primarily by manual inspection of samples near their margins. Criteria such as ratio and length are used primarily to facilitate comparison: sentences with very different lengths are more difficult to compare, as are sentences that are very short or contain few details.
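The filtering criteria above can be sketched in Python. The stopword list and threshold values below are illustrative placeholders (the actual thresholds appear in Table 6), and the SimCSE distance d text is omitted since it requires model embeddings:

```python
import string

# Illustrative stopword list; a real pipeline would use a standard one (e.g. NLTK's).
STOPWORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "with", "to"}

def content_words(sentence):
    """Lowercase, strip punctuation, and drop stopwords."""
    words = sentence.lower().translate(
        str.maketrans("", "", string.punctuation)).split()
    return {w for w in words if w not in STOPWORDS}

def iou(s1, s2):
    """Bag-of-words intersection over union after stopword removal."""
    a, b = content_words(s1), content_words(s2)
    return len(a & b) / len(a | b) if a | b else 0.0

def length_ratio(s1, s2):
    """Shorter sentence's word count over the longer sentence's word count."""
    n1, n2 = len(s1.split()), len(s2.split())
    return min(n1, n2) / max(n1, n2)

def min_length(s1, s2):
    """Character length of the shorter sentence in the pair."""
    return min(len(s1), len(s2))

def keep_pair(s1, s2, iou_range=(0.1, 0.6), min_ratio=0.5, min_chars=30):
    """Reject a pair that violates any criterion (threshold values are placeholders)."""
    return (iou_range[0] <= iou(s1, s2) <= iou_range[1]
            and length_ratio(s1, s2) >= min_ratio
            and min_length(s1, s2) >= min_chars)
```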

A.3 Evaluation Details
Implementation Details All models, with the exception of the ChatGPT systems, are trained and/or evaluated in PyTorch using the Huggingface Transformers library (Wolf et al., 2019) and pre-trained weights repository. We use the STS-B dataset as distributed on https://huggingface.co/docs/datasets as part of the GLUE (Wang et al., 2018) evaluation benchmark.
Finetuned Baselines For evaluation of the finetuned baselines on C-STS, we perform a hyperparameter sweep to select the best training settings for each model and encoding method before evaluating on the test split of C-STS. We show the hyperparameter values used in the sweep in Table 7, and the final hyperparameter values chosen in Table 8. We evaluate 3 random seeds using the best validation configuration to evaluate on the test data, with final results reported in Table 3. We additionally perform an extensive evaluation of our models on STS-B. We perform a comparable validation sweep as shown in Table 7, reporting the best performing hyperparameters and their performance in Table 9.
Lastly, we perform a data ablation, training a RoBERTa BASE model on only the condition or only the sentence pair. The model trained to predict similarity from the condition alone recovers non-trivial performance, but falls well behind the full-input baseline.

Generative Baselines
We report further details of the generative baseline results on the validation sets of C-STS and STS-B.
For comparison to the validation performance of other models, we include the validation performance for C-STS in Table 11, which largely mirrors performance on the test set. We notice, as expected, that models frequently output non-numerical responses in settings with no instructions telling them to do otherwise, or no in-context examples to follow.
On STS-B validation, models generally perform much better than on C-STS, with some models performing comparably to finetuned models. Since STS-B is included as a task in Natural Instructions v2 (Wang et al., 2022), it is likely recognizable to Flan-T5 models, which count Natural Instructions v2 in their training data. Likewise, STS-B comprises long-existing and popular datasets, which plausibly appear in the corpora used to train ChatGPT models.

Processing Prompting Baseline Generations
For parsing prompting-model generations, we allow a maximum of 20 generation tokens. The output is stripped of non-numeric characters and errant punctuation before being cast to a float. For example, the response "The Answer is 2.0." is processed as 2.0 and counts as a valid prediction. If the cast fails, we mark the answer invalid and replace the prediction with a number y ∼ U[1, 5].
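The parsing procedure above can be sketched as follows; extracting the first number via a regular expression is one way to implement the stripping-and-casting step, not necessarily the exact implementation used:

```python
import random
import re

def parse_prediction(text, low=1.0, high=5.0, rng=random.Random(0)):
    """Extract a numeric score from a model generation.

    Returns (score, is_valid). If no number can be recovered, the
    prediction is replaced by a draw from U[low, high], per the paper's
    invalid-answer fallback.
    """
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        return rng.uniform(low, high), False  # invalid: random fallback
    return float(match.group()), True

# e.g. parse_prediction("The Answer is 2.0.") recovers 2.0 as a valid prediction.
```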

B Prompt Examples
All prompts for the prompting baselines may consist of instructions, examples, and a query, though we include evaluations with no instructions and no examples in our results. Figure 5 shows a prompt example with short instructions and K = 2, and Figure 6 shows an example with long instructions in the zero-shot setup.
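The instructions-examples-query composition can be sketched as below; the field layout mirrors the zero-shot example in Figure 6, but the exact separators and formatting are an assumption:

```python
def build_prompt(query, examples=(), instructions=None):
    """Assemble a prompt from optional instructions, K in-context examples,
    and the final query (format approximated from Figure 6)."""
    parts = []
    if instructions:
        parts.append(instructions)
    # Each in-context example carries its gold similarity score.
    for s1, s2, cond, score in examples:
        parts.append(f"Input: Sentence 1: {s1} Sentence 2: {s2} "
                     f"Condition: {cond} Output: {score}")
    # The query ends with an empty "Output:" slot for the model to fill.
    s1, s2, cond = query
    parts.append(f"Input: Sentence 1: {s1} Sentence 2: {s2} "
                 f"Condition: {cond} Output:")
    return "\n\n".join(parts)
```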

Instructions
On a scale between 1 and 5, how similar are the following two sentences with respect to the condition provided? Respond only with a score between 1 and 5.

C.1 Condition Annotation
We provide the complete condition annotation guidelines used for Mechanical Turk data collection in Figure 7.

C.2 Condition Verification
We provide the complete verification guidelines used for Mechanical Turk data collection in Figure 8.

Instructions
Definition: Evaluate the similarity between the two sentences, with respect to the condition. Assign the pair a score between 1 and 5 as follows:
1: The two sentences are completely dissimilar with respect to the condition.
2: The two sentences are dissimilar, but are on a similar topic with respect to the condition.
3: The two sentences are roughly equivalent, but some important information differs or is missing with respect to the condition.
4: The two sentences are mostly equivalent, but some unimportant details differ with respect to the condition.
5: The two sentences are completely equivalent with respect to the condition.

Query
Input: Sentence 1: Elderly man sitting on a blue couch reading a paper. Sentence 2: Older man riding public transportation while reading a newspaper. Condition: The location of the man. Output:

Figure 6: The full text input for the zero-shot evaluation with large language models, using 'long' instructions. Emphasis and section titles added for clarity.
b. S2: A green avocado in the basket.
c. C-High: The color is green.
Instead, the same condition can correctly be written as: "The color of the fruit".
4. Avoid conditions which explicitly use words like "sentences". For example, instead of saying "the color in the sentence", just say "The color".
5. Avoid vague conditions which do not help narrow down a specific aspect of the sentence. For example, avoid conditions which simply say "The activity", which does not help narrow down the aspect. Instead use more informative words like "the sport" or "the hobby" as much as possible.
6. Whenever possible, try to write conditions which refer to abstract similarity. Consider the following sentences:
a. Two women are celebrating a goal.
b. A couple is eating a tasty meal.
A condition which is more abstract is preferred:
c. Abstract condition: The sentiment of the people.
Although a more literal condition is valid, it is less preferred:
d. Literal condition: The number of people.

Examples
We provide good and bad examples of conditions for sentence pairs, along with the reasoning.

Good examples
All the following conditions are valid because they follow our guidelines.

Sentence 1: The moon looked incredible!
Sentence 2: The car was completely covered in snow.
Condition: The color of the object.
Similarity: High
Explanation: The color is white in both cases. This is a good condition because it references the color of the object without explicitly mentioning it.

Sentence 1: A group of people wearing helmets and riding on bikes.
Sentence 2: A group of bikers are gathered together and taking pictures.
Condition: The speed of the cyclists.
Similarity: Low
Explanation: The group of cyclists is moving in the first sentence whereas they are not in the second. Hence their speeds are dissimilar.

Sentence 1: Three people are holding a ladder while another climbs it.
Sentence 2: Three people are listening to music in a car.
Condition: The number of people.
Similarity: Low
Explanation: There are four people in the first sentence but only three in the second.

Bad examples
All the following conditions are invalid because they ignore one or more of our guidelines.

Sentence 1: Egyptians appeased gods with offerings and prayers.
Sentence 2: People in this era put faith in specific gods to protect their lives.
Condition: The culture involved.
Reason for invalidity: The culture in the second sentence cannot be inferred and is missing information.

Sentence 1: An adult elephant is playing in the river.
Sentence 2: A boulder is rolling down the hill.
Condition: The size of the object is large.
Reason for invalidity: It violates guideline 3. The condition should have been "The size of the object", without explicitly referring to it being "large".

Sentence 1: A guitarist is playing on a bench.
Sentence 2: A man in a green hat is playing the guitar on the road.
Condition: The instrument in the sentence.
Reason for invalidity: It violates guideline 4. The condition would be good if "in the sentence" were removed so that it is just "The instrument".

Sentence 1: A middle-aged man is helping construct a grass hut.
Sentence 2: Three men work on a roof.
Condition: The activity.
Reason for invalidity: This condition is too vague and does not reference a specific aspect. A better condition would be: "The type of construction".

Sentence 1: A man on top of a partially completed roof laying down more shingles.
Sentence 2: A man in a hard hat and safety gear stands in a construction site.
Condition: The number of people.
Reason for invalidity: While this condition is valid, it violates guideline 6, which says that an abstract condition should be considered wherever possible. A better condition would have been "The occupation of the man", which is "construction worker" in both cases.

Figure 2: Illustrating the data collection process for C-STS-2023. (Left) We first show the sentence pair collection procedure (§2.2.1). Step A: An image-caption pair is sampled (red) from the dataset and then fed into the image encoder to get the image embedding. The image embedding is compared against all other image embeddings in the dataset (blue) to find the top-k similar images. The original caption is then paired with the corresponding captions of the top-k similar images to generate sentence pairs. Step B: The sentence pairs are filtered based on textual similarity. (Right) We illustrate the condition annotation/verification procedure (§2.2.2). Once the sentence pairs have been collected, they are sent to qualified Mechanical Turkers to annotate and verify conditions.
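The top-k retrieval in Step A can be sketched with plain cosine similarity over precomputed image embeddings; this is a minimal illustration, not the exact pipeline:

```python
import numpy as np

def top_k_similar(query_emb, all_embs, k):
    """Indices of the k rows of `all_embs` most cosine-similar to `query_emb`.

    When the query image itself is present in `all_embs`, its index is
    returned first; the caller drops it before pairing captions.
    """
    q = query_emb / np.linalg.norm(query_emb)
    m = all_embs / np.linalg.norm(all_embs, axis=1, keepdims=True)
    sims = m @ q                    # cosine similarity to every image
    return np.argsort(-sims)[:k]    # top-k indices, most similar first
```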

Figure 3: Model (SimCSE LARGE) performance scaling as the dataset size increases. Across encoder types, Spearman correlation increases as the dataset scales.

Figure 4: The train split distribution of similarity judgements on a Likert scale of [1-5].

Figure 5: We show the full input for the 2-shot setting with short instructions.

Table 2: Zero-shot bi-encoder model evaluation results on C-STS and STS-B validation data. These results verify that strong performance on STS tasks does not translate to C-STS, suggesting substantial room for improvement for fine-grained sentence embedding models.

Table 3: We report fine-tuned model test split results in Spearman and Pearson correlations for three models (RoBERTa, DiffCSE, and SimCSE) in different encoding settings.

Table 4: Few-shot Spearman correlation on the test split.

Table 5: Examples of model predictions evaluated on C-STS in the in-context setting (K = 2 with no instructions). We choose examples with different levels of accuracy, showcasing different failure cases of model behavior.

Table 6: The list of filter criteria and values used for each dataset. Sentence pairs that violate any criterion are discarded.

Table 11: Validation performance for prompting baselines on C-STS.