Imagination is All You Need! Curved Contrastive Learning for Abstract Sequence Modeling Utilized on Long Short-Term Dialogue Planning

Inspired by the curvature of space-time (Einstein, 1921), we introduce Curved Contrastive Learning (CCL), a novel representation learning technique for learning the relative turn distance between utterance pairs in multi-turn dialogues. The resulting bi-encoder models can guide transformers, as a response ranking model, towards a goal in a zero-shot fashion by projecting the goal utterance and the corresponding reply candidates into a latent space. Here, the cosine similarity indicates the distance/reachability of a candidate utterance toward the corresponding goal. Furthermore, we explore how these forward-entailing language representations can be utilized for assessing the likelihood of sequences by their entailment strength, i.e., through the cosine similarity of their individual members (encoded separately), as an emergent property in the curved space. These non-local properties allow us to imagine the likelihood of future patterns in dialogues, specifically by ordering/identifying future goal utterances that are multiple turns away, given a dialogue context. As part of our analysis, we investigate characteristics that make conversations (un)plannable and find strong evidence of planning capability over multiple turns (in 61.56% over 3 turns) in conversations from the DailyDialog (Li et al., 2017) dataset. Finally, we show how we achieve higher efficiency in sequence modeling tasks compared to previous work thanks to our relativistic approach, where only the last utterance needs to be encoded and computed during inference.


Introduction
Large-scale transformers are becoming more and more popular in dialogue systems (Zhang et al., 2019; Peng et al., 2022). Though these models are very effective in generating human-like responses in a given context, based on their learning objective of minimizing perplexity, they tend to have trouble generating engaging dialogues (Gao et al., 2020). Meister et al. (2022) have shown that human conversations usually do not sample the most likely words the way transformers do. We argue that one reason for this is that natural conversations can always be considered goal-oriented (even chit-chat), and we motivate this claim with literature from psychology, which has shown that "Conversation is a goal-directed process" (Myllyniemi, 1986), as humans shift conversation topics based on the social connection/audience and use conversation to shape social relations (Dunbar et al., 1997).
The psychological literature also elaborates on how humans are able to plan and simulate dialogues by utilizing inner speech as part of verbal working memory (Grandchamp et al., 2019).
"Key to most of such models is that inner speech is posited as part of a speech production system involving predictive simulations or "forward models" of linguistic representations" (Alderson-Day and Fernyhough, 2015). Keeping this in mind, we investigated dialogues under the aspect of "forward"-entailing language representations by projecting them into the latent space of a simple semantic sentence transformer (Reimers and Gurevych, 2019). We pick a fixed position in the DailyDialog (Li et al., 2017) dataset as a goal utterance and measure the cosine similarity of every other utterance within the dialogue to this goal. Our preliminary analysis revealed, as shown in figure 1, that the similarity of previous utterances to the goal utterance increases as they get closer to it. However, fluctuations between the speaker at the goal turn (who says the goal utterance later on) and their dialogue partner can be observed. As the blue and red highlighted turns show, the goal-turn speaker has a greater similarity to the goal utterance than the dialogue partner. We filtered all samples causing these fluctuations and find that these transitive entailing properties are essential for guiding the conversation toward the given goal, regardless of whether the person had the intent to reach it. We demonstrate in this paper how we can build upon this phenomenon to learn the relative distance between utterance pairs, in particular by mixing the training objective of Natural Language Inference (NLI) for the semantic embedding space with a distance-proportional and direction-aware (through two special tokens, [BEFORE] and [AFTER]) cosine-similarity-based loss on utterance pairs.
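The preliminary analysis above can be sketched as follows. In practice the embeddings would come from a sentence transformer such as SBERT; here, toy 2-d vectors stand in so the logic is self-contained.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def similarities_to_goal(utterance_embs, goal_emb):
    """Similarity of every utterance in a dialogue to a fixed goal utterance."""
    return [cosine(u, goal_emb) for u in utterance_embs]

# Toy dialogue whose embeddings drift toward the goal direction turn by turn,
# mirroring the rising similarity curve observed in figure 1.
dialogue_embs = [(1.0, 0.0), (0.8, 0.4), (0.5, 0.7), (0.2, 0.9)]
goal_emb = (0.0, 1.0)
sims = similarities_to_goal(dialogue_embs, goal_emb)
# sims increase monotonically as the turns approach the goal
```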
(1) Short-term planning: CCL allows us to imagine the likelihood of a candidate utterance leading to a given goal utterance by projecting both together into one latent space (the imaginary space). The cosine similarity indicates the distance/reachability of a candidate utterance towards the corresponding goal, as illustrated in the transformer guidance example in figure 2. Thanks to the transitive property, we can select utterances greedily at each turn.
(2) Next utterance selection: The embeddings can be utilized for sequence modeling by using only the cosine similarity between the separately encoded sequence members. This is evaluated by the ranking performance on the human-vs-random utterances task given a dialogue context.
(3) Long-term planning: Since these embeddings do not require entire sequences for sequence modeling, we can assess the likelihood of following patterns (of multiple goal utterances that are multiple turns apart) by using the entailment strength between these and the context in the curved space. We evaluate this approach on the ordering/identification of future goal utterances.
Furthermore, we investigate two research questions:
• Do chit-chat conversations have planning capability? (RQ1)
• What characteristics make dialogue planning possible? (RQ2)
The paper is structured as follows: in §2 we discuss the related work, followed by §3, where we present the methodology, baselines, and the basic components of the advanced architectures. §4 presents the short-term planning approaches, followed by the next utterance selection in §5 and the long-term planning approaches for ordering goals in §6. We wrap up the paper with the experiments & discussion in §7, followed by the conclusion in §8.

Related Work
Our work builds upon two major concepts, dialogue planning and entailment. Related publications from these fields are discussed below.

Dialogue Planning
While previously introduced planning techniques used several abstraction approaches (Teixeira and Dragoni, 2022), none of them exploited the characteristics of curved conversation embedding latent spaces. We argue that generating a complete dialogue path is unnecessary, as we can simply choose the utterance in the transformer's search space that gets us closest to the goal at every turn. Ramakrishnan et al. (2022) proposed a similar idea on the word level by applying constrained decoding to dialogue response generation to increase the likelihood of a target word, not only in the current utterance but also in future utterances. Furthermore, DialogRPT (Gao et al., 2020) has been introduced as a dialogue response ranking model for depth, width, and upvote prediction of utterance candidates. We utilize DialogRPT as a baseline for our next utterance selection experiments based on the dialogue history.

Entailment
Entailment-based approaches have a long history in NLP and have been utilized for many tasks framed as zero-shot classification, like relation extraction (Obamuyide and Vlachos, 2018) or zero-shot text classification (Yin et al., 2019). The idea of entailment graphs and making use of transitivity has been previously explored by Kotlerman et al. (2015) and Chen et al. (2022). Textual entailment has also been applied to dialogue systems as an evaluation technique (Dziri et al., 2019) or for improving response quality through backward reasoning (Li et al., 2021). Contrastive learning with positional information has previously been applied to image segmentation (Zeng et al., 2021), while You et al. (2020) utilized contrastive learning with augmentations for graph neural networks (GNNs). Natural Language Inference (NLI)-based transformers have been increasingly used for semantic textual similarity (STS) since the introduction of Sentence Transformers, thanks to bi-encoders (Reimers and Gurevych, 2019) that can compare sentence pairs with cosine similarity and therefore reduce computation time by a factor of roughly 234,000*. This trend has especially been supported by GPU search (Johnson et al., 2017). These sentence transformers have successfully been applied to learn utterance representations for retrieving utterance replies in dialogue systems (Liu et al., 2021), as has ConvRT (Henderson et al., 2020), which we use as a baseline. However, none of them utilize the curved property of conversations, which we argue, as motivated in §1, is essential for forward representations.

Methods
In this section, we formally define the research questions (problem definition), our baselines for the evaluation, and the core of Imaginary Embeddings based on which advanced architectures are built in the following sections.
* According to Reimers and Gurevych (2019), finding the most similar pair in a set of 10,000 sentences would require about 50 million inference computations with BERT, which would, according to them, take around 65 hours, while SBERT with pre-computed encodings only takes about 5 seconds.
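The 50-million figure in the footnote follows directly from counting unordered sentence pairs:

```python
# A cross-encoder must score every unordered pair of n sentences once,
# i.e. n * (n - 1) / 2 comparisons.
n = 10_000
pairs = n * (n - 1) // 2  # 49,995,000, i.e. roughly 50 million inferences
```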

Problem Definition Planning
As part of this paper, we investigate two planning problems: short- and long-term planning. Short-term planning aims at guiding the conversation from the current position towards a given goal utterance g (which we define as a semantic utterance) over multiple turns. Long-term planning, on the other hand, targets the ordering/scheduling of a set of goals G (utterances that are multiple turns apart) within a conversation.

Long-Short Term Planning Evaluation
As part of this paper, we introduce a new evaluation technique, Long-Short Term Planning Evaluation (LSTPE). LSTPE is split into short-term as well as long-term planning.

Short-Term Planning Evaluation
As part of the short-term planning evaluation, we evaluate the guidance capability of imaginary embeddings towards a given goal utterance. For this purpose, we split every dialogue d ∈ C in a given corpus into d[:h_l], which represents the history of utterances (or context) with a fixed length h_l; d[h_l], the "correct" following utterance; and d[h_l + g_d], the goal utterance with a goal distance g_d. We then let a dialogue transformer generate 100 candidate utterances given the context d[:h_l] for every dialogue d ∈ C, which we project together with the goal utterance into the imaginary embedding space. Following this, we compare the ranking score of the original utterance to the artificially generated utterances. As metrics, we report the Hits@K ratio (in %) and the average rank.
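The two metrics can be sketched as follows; candidate scoring itself is assumed to happen upstream.

```python
def rank_of_true(candidate_scores, true_score):
    """1-based rank of the true utterance among the candidates,
    sorted by descending score."""
    return 1 + sum(1 for s in candidate_scores if s > true_score)

def hits_at_k(ranks, k):
    """Percentage of evaluation samples whose true utterance ranks in the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)

ranks = [1, 3, 7, 12, 40]           # toy ranks over five dialogues
top5 = hits_at_k(ranks, 5)          # 40.0
avg_rank = sum(ranks) / len(ranks)  # 12.6
```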

Long-Term Planning Evaluation
Similar to the short-term planning, we take a corpus of dialogue data d ∈ C and split it at fixed positions x into the dialogue history and three goal utterances (|G| = 3), given a dialogue history of length h_l and a distance g_d ≥ 2 between the goals. We define the first goal in distance as x − h_l from the perspective of the dialogue history. The three resulting goal utterances yield 6 possible order permutations. Since 4 of them are partially ordered, we evaluate the ranking against the partially ordered permutations and against the reverse of the true order separately. In both cases, we present the Hits@K ratio (in %) as well as the average total rank. While this technique is simple and does not require any supervision, some samples are, due to the random selection, indistinguishable without any context; e.g., an utterance like "oh, okay" could appear at any position. Since all models are evaluated on the same data set, this is not an issue; however, an accuracy of 100% is realistically not attainable.
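The split into partially ordered and reverse permutations follows directly from the 3! = 6 possible orderings:

```python
from itertools import permutations

true_order = ("g1", "g2", "g3")
all_orders = list(permutations(true_order))   # 6 permutations in total
reverse_order = tuple(reversed(true_order))
partially_ordered = [o for o in all_orders
                     if o not in (true_order, reverse_order)]
# The 4 partially ordered permutations are ranked separately from the reverse order.
```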

Next Utterance Selection Evaluation
Furthermore, we test the embeddings' capability of telling potential replies from random utterances given a dialogue context by comparing them to DialogRPT (Gao et al., 2020), ConvRT (Henderson et al., 2020), and BM25 (Robertson and Zaragoza, 2009) on a ranking task. The data set is built up in a similar way as for short-term planning.

Imaginary Embeddings with Curved Contrastive Learning
We introduce a novel self-supervised learning technique to map sequences into a conversational space. To generate the desired curved properties, we train a bi-encoder sentence transformer on two training objectives.
The first objective builds upon the AllNLI dataset (a combination of SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2017)) with a simple softmax loss. To learn the conversational space, two special tokens, [BEFORE] and [AFTER], are introduced. The model is (pre-)trained with a cosine-similarity loss on DailyDialog (Li et al., 2017) by sliding through the conversational data with a fixed window length l = 5. Notably, we combine consecutive utterances of the same speaker. Based on this fixed length, the training data is constructed for a given window as positive pairs

∀i ∈ {1, .., l}: ([BEFORE] ⊕ u_0, [AFTER] ⊕ u_i, s_i)

where ⊕ denotes concatenation, u are the utterances in the observed window, u′ is a set of random utterances, and s_i the target cosine similarity score, which decreases with the turn distance i. As this construction shows, the target cosine similarity for a positive sample pair is proportional to their positional distance in the dialogue (see illustration in figure 3). This lets us learn semantic properties between utterance pairs. Three hard negatives are introduced: the first ensures the directional property by swapping the [BEFORE] and [AFTER] tokens; the following two are selected from the special dataset of random utterances u′. Figure 3 unveils the widespread utility of imaginary embeddings. As shown, we can simply pick the best candidate utterance for reaching a given goal by imagining the closeness of the candidate utterance to the goal in the curved space, without requiring the real representations between the utterance pairs.
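The pair construction for one window can be sketched as below. The linear decay of the target similarity is an assumption made for illustration only; the point is that closer pairs receive higher targets, swapped tokens and random utterances receive targets of zero.

```python
def ccl_window_examples(window, random_utterances):
    """Build (sentence_a, sentence_b, target_cosine) training triples for one
    sliding window of utterances; window[0] is the anchor utterance."""
    max_dist = len(window) - 1
    examples = []
    for i in range(1, max_dist + 1):
        # assumed schedule: target decays linearly with turn distance
        target = (max_dist - i + 1) / max_dist
        # positive pair with directional tokens
        examples.append(("[BEFORE] " + window[0], "[AFTER] " + window[i], target))
        # hard negative 1: swapped direction tokens
        examples.append(("[AFTER] " + window[0], "[BEFORE] " + window[i], 0.0))
    # hard negatives 2 & 3: random utterances from a special dataset
    for r in random_utterances[:2]:
        examples.append(("[BEFORE] " + window[0], "[AFTER] " + r, 0.0))
    return examples

window = ["u0", "u1", "u2", "u3", "u4"]  # one window of l = 5 utterances
examples = ccl_window_examples(window, ["r1", "r2"])
positives = [e for e in examples if e[2] > 0]
```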
Similar to an object in our universe that always moves in a straight line but is curved by space-time (Einstein, 1921), we can follow a line to our goal utterance by greedily selecting the best utterance on a turn-to-turn basis. We illustrate this transitive property with the light-red in-between nodes in figure 3.
Thanks to the relative time dimension between utterance pairs and the resulting non-locality, we are able to encode all sequence members (utterances) independently into one latent space and accumulate the likelihood of a sequence by comparing them only with cosine similarity, in particular by imagining the closeness between every context utterance (encoded with [B]) and the future utterance (encoded with [A]): imagination is all you need! Not only can we assess the likelihood of sequences, which we explore in the next utterance selection (§5), but we can also utilize these self-organizing properties for mapping sequential representations that are multiple turns apart to the conversational surface. We explore this as the ordering of goals in long-term planning (§6).

Adding Speaker Tokens
Furthermore, we can modify imaginary embeddings with additional speaker tokens. Given a multi-turn dialogue with two participants, the tokens [O] and [E] are added to the [BEFORE] utterance at the encoding step (for odd and even distances to the target utterance [AFTER], respectively). Accordingly, the learning objective (see equation 5) for the curved property is slightly modified by adding hard negatives for false speaker matches (see appendix D).

Short Term Planning Approach (Transformer Guidance)
As described in section 3.2.1, we utilize imaginary embeddings as a re-ranking model. Respectively, we let a task-specific dialogue transformer generate 100 candidate utterances given the context d[:h_l] of a fixed length h_l for every sample dialogue d ∈ C. To get a diverse distribution of utterances, we choose nucleus sampling with p = 0.8 and a temperature of t = 0.8. The utterances generated by the transformer are then projected into the imaginary embedding space, and their similarity to the goal d[h_l + g_d] is measured. Following this, we check the rank of the true utterance from the test set leading to the goal utterance. The average rank and the distribution of ranks within the dialogue are evaluated with respect to different history lengths h_l and different goal distances g_d.
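The re-ranking step reduces to sorting the 100 candidates by their imagined closeness to the goal; toy vectors stand in here for the encoded utterances.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank_candidates(candidate_embs, goal_emb):
    """Indices of candidate utterances sorted by goal similarity, best first."""
    return sorted(range(len(candidate_embs)),
                  key=lambda i: cosine(candidate_embs[i], goal_emb),
                  reverse=True)

candidates = [(1.0, 0.0), (0.6, 0.8), (0.0, 1.0)]  # toy candidate embeddings
goal = (0.0, 1.0)
ranking = rank_candidates(candidates, goal)  # candidate 2 is closest to the goal
```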

Next Utterance Selection with Curving
Motivated by the curved property, the most suitable next utterance u_f ∈ U_F for a dialogue history his should, on average, be closest to the individual utterances of the sequence. We can assess a relative likelihood between all future utterances by measuring the entailment strength P_E (i.e., imagining the closeness) of every u_f to the history of utterances based on the cosine similarity:

P_E(u_f | his) = (1/|his|) Σ_{u ∈ his} cos_sim(f([B] ⊕ u), f([A] ⊕ u_f))

where f is the bi-encoder. In the ranking evaluation, we sort the results of P_E(u_f | his) for all u_f ∈ U_F to determine the rank of the true utterance. Notably, we can observe the entailment strength (or activation) of individual utterances to a future one, which enables many other applications. During inference, while the dialogue partner is still speaking, we can precompute the entire context (apart from the new incoming utterance). Furthermore, we can utilize the curved context for greedily selecting the next goal max_{g∈G} P_E(g | his) in our long-term planning experiments. We refer to this as greedy curving.
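A sketch of the entailment strength and greedy curving; utterances are assumed to be pre-encoded with the [B] and [A] prefixes, and toy vectors stand in for those embeddings.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def entailment_strength(history_b_embs, future_a_emb):
    """P_E(u_f | his): mean cosine similarity between every [B]-encoded
    context utterance and the [A]-encoded candidate."""
    return sum(cosine(h, future_a_emb) for h in history_b_embs) / len(history_b_embs)

def greedy_curving(history_b_embs, goal_a_embs):
    """Greedily pick the goal with the strongest entailment to the context."""
    return max(range(len(goal_a_embs)),
               key=lambda i: entailment_strength(history_b_embs, goal_a_embs[i]))

history = [(0.9, 0.1), (0.7, 0.3)]  # [B]-encoded context utterances (toy)
goals = [(0.0, 1.0), (1.0, 0.0)]    # [A]-encoded goal candidates (toy)
best_goal = greedy_curving(history, goals)  # goal 1 aligns with the context
```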
Long-Term Planning Approaches

In this section, we describe how Imaginary Embeddings can be used to order goals (a set of utterances) within dialogues for long-term planning. The models are evaluated with LSTPE, a given set of goals G with |G| = 3, and an equal distance between each goal. Imaginary Embeddings are perfectly suited for this task, as they can be concatenated into cosine-similarity chains by using the [B] (before) and [A] (after) tokens, as illustrated in figure 4. We mathematically define it as:

Imaginary Embedding Chains
s(o) = Σ_{i=1}^{n−1} cos_sim(f([B] ⊕ g_i), f([A] ⊕ g_{i+1}))

where we choose the order of goals o ∈ O by the highest similarity score s, max_{o∈O} s(o) (strongest entailment strength), for a given sequence o = <g_1, ..., g_n> of goals g_i ∈ G. While this chain can be arbitrarily long and, thanks to GPU tensor computations, calculated rather quickly, the complexity of a brute-force computation remains high at O(n!).
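The brute-force ordering via embedding chains can be sketched as follows; each goal is assumed to have one [B]- and one [A]-encoded embedding, with toy vectors standing in.

```python
import math
from itertools import permutations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def chain_score(order, b_embs, a_embs):
    """Sum of cosine([B] g_i, [A] g_{i+1}) along the chained order."""
    return sum(cosine(b_embs[i], a_embs[j]) for i, j in zip(order, order[1:]))

def best_order(b_embs, a_embs):
    """O(n!) brute force over all goal permutations."""
    return max(permutations(range(len(b_embs))),
               key=lambda o: chain_score(o, b_embs, a_embs))

# Toy goal embeddings constructed so that goal 0 chains into 1, and 1 into 2.
b_embs = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
a_embs = [(0.5, 0.5, 0.5), (1, 0, 0), (0, 1, 0)]
order = best_order(b_embs, a_embs)  # recovers the true order (0, 1, 2)
```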

Imaginary Embedding Chains with History Curving
Finally, we combine the concepts of Imaginary Embedding Chains and Curving by generating for every order o = [g_1, g_2, g_3] a score

score(o) = s(o) + P_E(g_1 | his) − P_E(g_2 | his) − P_E(g_3 | his)

where s(o) is the chain score of the given order based on equation 3 and P_E(g_i | his) is the history-curving score of the corresponding goal. We motivate the addition of g_1 and the subtraction of g_3 (as well as g_2) by the presumption that, with respect to the curved property, g_1 should be closest to the history while g_3 should be furthest away. Note that, unlike the simple Imaginary Embedding Chains (IEC), IEC + curving requires some dialogue context and is therefore not suitable for dialogue planning without context.
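The combined score is a thin wrapper around the chain score s(o) and the curving score P_E; here both are passed in as callables, with toy scores standing in.

```python
def iec_with_curving(order, chain_score, p_e):
    """Combined score for an order (g1, g2, g3): the chain score plus the
    curving score of the first goal, minus those of the second and third,
    reflecting that g1 should sit closest to the history."""
    g1, g2, g3 = order
    return chain_score(order) + p_e(g1) - p_e(g2) - p_e(g3)

# Toy curving scores: goal "a" entails strongly from the history, "c" weakly.
pe = {"a": 0.9, "b": 0.5, "c": 0.1}.get
score = iec_with_curving(("a", "b", "c"), lambda o: 1.0, pe)  # 1.0 + 0.9 - 0.5 - 0.1
```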

Experiments
Our experiments are conducted on two dialogue corpora, DailyDialog (Li et al., 2017) and the Microsoft Dialogue Challenge (MDC) corpus (Li et al., 2018). We experiment with two transformer architectures, BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019), to generate Imaginary Embeddings. In the short-term planning (transformer guidance) setting, we let our Imaginary Embeddings guide DialoGPT (Zhang et al., 2019) for DailyDialog and GODEL (Peng et al., 2022) for the MDC corpus. For the next utterance selection, we use pre-trained checkpoints of DialogRPT (Gao et al., 2020) and ConvRT (Henderson et al., 2020) as baselines. Furthermore, we add BM25 (Robertson and Zaragoza, 2009) as well as an ablation with the two special tokens (before and after) but without the curved learning objective, which we explore in appendix G.

Evaluation Data sets
The evaluation data sets DailyDialog and MDC are constructed analogously. We construct the datasets for STP based on history length and goal in distance, and for LTP based on history length, goal in distance, and goal distance, as illustrated in figure 4. Since MDC, with an average of 6.51 turns, is even shorter than DailyDialog with 7.84, we are limited in the long-term planning to a shorter context as well as a shorter goal in distance.

Evaluation & Discussion
In the following sections, we investigate how well these embeddings perform on our introduced LSTPE (§3.2) and on the next utterance selection task. In the main paper, we focus on our empirical findings and, for space reasons, present the results of the experiments in aggregated form. We provide a detailed analysis in the appendix, where we explore examples as well as demonstrate the curved property of dialogues in these embeddings. This is illustrated as vector chains in figure 7 and by the average similarity of different distances and directions within dialogues (appendix B).

[Table 1: short-term planning results by goal in distance (Hits@5/10/25/50 in % and average rank) for Imaginary Embeddings with and without speaker tokens; candidates from DialoGPT Large / GODEL Large (p=0.8, t=0.8).]

Short-Term Planning
As the aggregated short-term planning results in table 1 show, we split the results by odd distances (unveiling utterances of the dialogue partner) and even distances (which would be uttered by the transformer). Both have at least 20% of the true candidate utterances in the top 5 (Hits@5) of 100 ranks, 50% in the top 25 (Hits@25), and a worst-case average rank of 32.56. We observe that speaker-token-based imaginary embeddings on odd distances can even achieve 63% in the top 5 (Hits@5), with a best average rank of 14.01. This is to be expected, as odd-distance utterances will be uttered by our dialogue partner, whom we can greatly influence through our preceding utterance. Interestingly, we find that it is significantly easier to plan 3 turns ahead than 2 turns. This is portrayed in the detailed analysis based on the history length, goal distance, and first goal distance (goal in distance) in table 3 (appendix). Our analysis unveils that the DailyDialog models have an advantage in selecting the true candidate utterance through their more diverse utterance distribution. Furthermore, they perform more consistently across different history lengths and goal distances. MDC, on the other hand, performs better overall but has a higher variance in its performance (across samples with different history lengths and goal distances). We conclude that the score distribution in the ranking process is either strongly peaked (mostly in data sets with lots of request intents) or flattened (especially on data with mostly inform intents). We explore this in detail in appendix E. This flattened score distribution is to be expected, as in many cases of providing information, the actual information has little impact on future turns in a structured task-oriented setting (e.g., replying with how many people will attend a reservation).

Next Utterance Selection based on Curved History
The sequence modeling capability is evaluated based on the normalized average rank (of the true following utterance compared to all other utterances at the same position of the corresponding corpus). We find that the DailyDialog corpus clearly outperforms MDC across all variations. As we demonstrate in figure 5, DailyDialog performs best with an average rank in the top 10% over all history lengths (the entire history projected into the curved space with speaker tokens). For sequences longer than 2 turns, it even outperforms all our baselines: DialogRPT (human vs. random) by at least 2.8% and ConvRT by 0.5%. Overall, we find that DialogRPT has trouble with increasing sequence lengths as input and that keeping only the last two utterances performs best for it. Notably, thanks to our relativistic approach, we can reduce the computation costs of the dialogue context compared to DialogRPT and also ConvRT, which we explore in more detail in appendix C.1. While our experiments on MDC for the next utterance selection show weak results, in summary, MDC shows the same fluctuations between primarily inform and request intents. While the ranking approaches based on only the last utterance are superior most of the time, we observe that on odd turns (where we have a lot of request intents) the entire history usually performs better relative to even distances. Conversely, we notice that approaches based on only the last utterance are especially good on turns where we see more informing intents (replying to the request). We explore this further in appendix C.2.

Long Term Planning Evaluation
The short turn length of the two corpora becomes especially troublesome in the long-term planning evaluation. Here, we are limited to short context/history lengths as well as short goal distances and (first) goal in distances. Across all models and datasets, we observe a solid average rank of 1.87 (between 1 and 2 for all approaches) on identifying the correct order of 3 goal utterances among their 6 possible orders, as table 2 unveils. Note that Greedy Curving only has to predict the immediate next goal (1 of 3), while the other LTP approaches model the entire order (1 of 6). While our MDC embeddings had particular trouble with utterance selection in width (selecting an utterance from the same dialogue depth, §7.4.2), we find that MDC shows a stronger performance on greedy goal selection (Greedy Curving (GC)) with classic embeddings, thanks to the solidified sequential structure of task-oriented dialogues. This advantage lets MDC outperform DailyDialog on all other approaches as well. When speaker tokens come into play, however, MDC drops while DailyDialog improves in performance compared to classic imaginary embeddings. Imaginary Embedding Chains (IEC) and chains with curved context (IEC & CU) show similar performance in aggregated form. However, when the context is close (i.e., the first goal is not far away), IECs with a curved context prevail. This changes with increasing goal distance or first goal in distance, as highlighted in table 4 of the appendix; here, IECs without context keep an advantage. Similarly, we observe a drop in performance over longer distances for Greedy Curving. In terms of the MDC planning capability, the performance drop-off between the two most common intents, request and inform, is similar, although not as severe as in short-term planning or the next utterance selection.

Conclusion
In this paper, we introduced Curved Contrastive Learning, a novel technique for generating forward-entailing language embeddings. We demonstrated that these can be utilized on various sequence modeling tasks by using only the cosine similarity between the separately encoded sequence members in the curved space. In particular, for the next utterance selection, by imagining the closeness of every context utterance to candidate utterances in the curved space (where DailyDialog's true utterances are consistently in the top 10%), we outperform our pre-trained baselines DialogRPT and ConvRT on sequences longer than 2 turns while reducing encoding costs. Furthermore, we have shown their pattern recognition ability on the ordering/identification of future representations (with an average rank of 1.87/6), even at longer distances and far apart. We also demonstrated that these embeddings can be applied to guiding dialogue transformers to approach a goal over multiple turns, in particular by imagining the closeness of candidate utterances towards the goal through the transitive properties of the curved space. Following up on our claim that even chit-chat can be considered goal-oriented (RQ1), we find strong evidence of planning capability in chit-chat conversations over multiple turns, e.g., 48.83% / 61.56% (within the top 5 / top 10 utterances in the re-ranking) at 3 turns ahead. Our RQ2 can be answered by the fact that we observe significant differences in the plannability of different intents: our empirical analysis shows that request intents are significantly easier to plan than informing intents. While our focus in this paper was mainly on the introduction of Imaginary Embeddings and their utilization for dialogue planning, we leave much space for further evaluation, analysis, and applications of the curved properties of our universe‡ embeddings to future work.

[Table 2: long-term planning results (Hits@1-4 in % and average rank) for the partially ordered and reverse-order rankings.]

Limitations
One of our limitations is that the data for short-term and long-term planning is split at fixed positions, which on the one hand shows the overall planning capability on different datasets without bias, but on the other hand mixes the planning ability of the datasets with the overall performance of the embeddings. We have demonstrated in section E.2 that this can lead in many cases to unplannable examples. While this means that our embeddings should overall perform better than our results suggest, in the future we should either create a human-filtered dataset where planning is always possible or create a human benchmark as a further baseline. Furthermore, in short-term planning (transformer guidance) we rely on the utterance distributions generated by transformers, where we have to balance semantic diversity against the likelihood of utterances. We control these with temperature and nucleus sampling (top p) and found the best trade-off with a temperature of 0.8 and a top p of 0.8. Nonetheless, this can still lead to utterances that might lead to the goal but that humans would not consider very likely given the context, as we explore in E.2. Furthermore, in the next utterance selection, we utilize the publicly available checkpoints which have been evaluated in the original paper (Gao et al., 2020) on DailyDialog, but both were seemingly not trained on an MDC-like task-oriented corpus. Since we find that the next utterance selection based on the curved property of the context in a task-oriented setting like MDC is almost always worse than just taking the last utterance, we have not expanded experiments in this domain.

‡ In tribute to our fellow researchers in the field of physics for their inspiring work on the curvature of space-time.

Ethics
Like other language models, our model is prone to bias from the training data sets (Schramowski et al., 2022; Mehrabi et al., 2019). This is something to keep in mind when fine-tuning the model for domain adaptation. Since the models are for guidance only, we do not see any direct threats related to language generation. Still, if an individual intentionally wants to harm others and trains a language model to generate harmful utterances, our model could be employed to support this process. In contrast, however, we argue that these embeddings have great potential, through their transitive properties, to foresee and deflect harmful utterances from afar. Considering the risk that language models pose to humans (Weidinger et al., 2021), these embeddings could be utilized as a filter on top of generative language models, e.g., removing utterances that would increase the probability of leading to an utterance from a large set of harmful utterances.
Our proposed model has a relatively small model size and shows higher efficiency during training & inference compared to DialogRPT and ConvRT; therefore, we see great potential for reducing the carbon footprint in utterance retrieval tasks, in accordance with recent efforts in NLP (Strubell et al., 2019; Patterson et al., 2021).

A Attribution
This work stems from the mandatory master's internship of Justus-Jonas Erker at the German Research Center for Artificial Intelligence supervised by Stefan Schaffer and Gerasimos Spanakis.

B Imaginary Embedding extended analysis
We analyze the Imaginary Embeddings based on their average similarity at different distances of utterance pairs within dialogues, as well as their direction, as shown in figure 6. While the model's average similarity is far from the training objective, the scores show a favorable decay with distance for positive examples, as well as a relatively low similarity for utterance pairs in the false direction. Furthermore, we have illustrated the curved property of the embeddings as vector chains in figure 7.

C Next Utterance Selection Extended Analysis
For the next utterance selection, we provide an extended description of our speed comparison as well as of the MDC results.

C.1 Computation Comparison
Since the bi-encoder architectures are significantly more efficient than DialogRPT, we compare ConveRT and Imaginary Embeddings in more detail.
Considering the encoding of utterances for some
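The relativistic efficiency argument can be made concrete with a small sketch (names and structure are ours, not the released code): because candidates are scored relative to a fixed goal, the [AFTER]-encoded goal embedding is computed once and cached, and at each turn only the new candidate utterances need to be encoded with the [BEFORE] token, rather than re-processing the entire dialogue history.

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

class GoalGuidedRanker:
    """Caches the [AFTER]-encoded goal; ranking a turn's candidates only
    requires encoding those candidates with the [BEFORE] token."""

    def __init__(self, goal_vec):
        self.goal_vec = goal_vec  # encoded once, reused at every turn

    def rank(self, candidate_vecs):
        # Return candidate indices, highest cosine similarity to the goal first.
        return sorted(range(len(candidate_vecs)),
                      key=lambda i: cosine(candidate_vecs[i], self.goal_vec),
                      reverse=True)

# Toy vectors standing in for bi-encoder outputs.
ranker = GoalGuidedRanker([1.0, 0.0])
order = ranker.rank([[0.0, 1.0], [1.0, 0.0], [0.7, 0.7]])
# order == [1, 2, 0]
```

In contrast, a context-conditioned ranker such as DialogRPT must re-run the model over the concatenated history for every candidate at every turn, which is where the efficiency gap arises.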

C.2 MDC Results
We demonstrate the results of the MDC next utterance selection in figure 8, where we observe, as described in the main paper, the symmetry between inform and request intents, either profiting from only the last utterance or from the entire history.

D Speaker Token Learning Objective

∀i ∈ {1, .., l}, where u denotes the utterances in the observed window, u′ a set of random utterances, and s the cosine similarity score. For the random utterance matching we assign an equal probability p to every possible combination.

E Extended Short-Term Planning Evaluation
As part of the extended short-term planning evaluation, we investigate extended results based on the history length, the goal distance, and the first goal distance (goal in distance) in table 3 and demonstrate examples.

E.1 Detailed Short-Term Planning Evaluation
Table 3 unveils that additional speaker tokens show improvement on the MDC test corpus across all tested categories. While classic embeddings show a similar performance on MDC across all even distances, we observe two spikes on odd distances at positions (h_l, g_d) = (3, 1) and (5, 1), with 51.17% / 45.80% in the top 5 respectively. At these positions, we monitor a 33% increase in the average standard deviation of the distribution of guidance scores, i.e. the model is much more decisive in its ranking. We analyzed the intents at these positions and find a two-fold increase in request intents and a 38% decrease in inform intents relative to the data set's average. While the speaker token-based embeddings show that we can overcome this gap for odd distances, we still find that the two lowest performers, (4, 1) and (4, 3) with "only" 53.03% and 51.45% in the top 5, all have a minimum of 80% of inform intents. Since the two corpora use separate latent spaces, we do not compare them on a simple standard deviation. Instead, we take the sum of average standard deviations as a baseline and divide it by the sum of the standard deviations (for each data set) of the standard deviations (for each transformer utterance distribution) to measure the variation in performance over the different testing parameters history length, goal distance, and (first) goal in distance. With a 35% higher score, DailyDialog shows less variance across the different test parameters. Nonetheless, we find that DailyDialog has a 12% higher semantic variance than MDC across all utterances in the transformer-generated distributions, measured by their average semantic similarity with a simple semantic sentence transformer.
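One reading of the variation measure described above can be sketched as code. This is our interpretation of the prose, not the authors' implementation: per dataset, the per-distribution standard deviations of guidance scores are summarized by their mean (contributing to a shared baseline) and by their spread (a std-of-stds); each dataset's score is the baseline divided by its spread, so a higher score means less variance across the testing parameters.

```python
from statistics import mean, stdev

def variation_scores(stds_per_dataset):
    """stds_per_dataset: {dataset name: list of standard deviations of
    guidance scores, one per transformer utterance distribution}.
    The shared baseline is the sum of per-dataset average stds; each
    dataset's score divides the baseline by its std-of-stds, so a higher
    score means less variation across the testing parameters."""
    baseline = sum(mean(stds) for stds in stds_per_dataset.values())
    return {name: baseline / stdev(stds)
            for name, stds in stds_per_dataset.items()}

# Toy numbers, purely illustrative.
scores = variation_scores({"A": [1.0, 2.0, 3.0], "B": [1.0, 3.0, 5.0]})
# baseline = mean([1,2,3]) + mean([1,3,5]) = 5.0
# stdev A = 1.0, stdev B = 2.0  ->  {"A": 5.0, "B": 2.5}
```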

E.2 Examples of Short-Term Planning
While we describe the construction of our evaluation datasets, we still want to highlight some of the strengths and weaknesses of our introduced embeddings. In the example on the left of figure 9, we can see that, without knowing what the other person is going to say, the model can sometimes move toward the goal too greedily. In the example on the right, we see that the model can also understand more complex relations: the only way to reach a conversation state where someone would utter "look behind you. They are coming this way" would be in the manner of playing catch, and the model ranks this candidate in first position. A lot of the weaker ranking results are due to the fixed split of the data, as demonstrated in figure 10. We observe in the first example (left) that the model tries to unveil the utterance "You're right" either by trying to get the other person into an argument (rank 1), hoping the person would then agree with their own opinion 3 turns later, or by trying to unveil the utterance right away (rank 2). In the example in the middle, we see the drawback of purely relying on the transformer's context-aware utterance generation, as the selected utterance of "pint of wine" might be closer to fruits than beer but at the same time is not a valid answer. This can also be observed in the last example (right).

F Extended Long-Term Planning Evaluation

We present our detailed long-term planning results in table 4, as well as examples in the following subsection.

F.1 Long-Term Planning Examples
As for short-term planning, we demonstrate examples to present the weaknesses as well as the strengths of the embeddings. In figure 11 we show two easy examples, where we can follow the conversation well without knowing the replies of the other dialogue partner. This changes especially in figure 12, where in the left example it is very difficult even for us to order the corresponding utterances. While one could argue that emergency calls tend to start with the location of the incident, the utterance "I haven't checked yet" makes ordering the utterances without any further context very difficult. This can also be observed in the right example of figure 12; however, one could argue that, based on the context to which both IEC+CU and GC have access, the predicted order (of these two) makes more sense than the original reply order. Nonetheless, both examples show that some of these orders are debatable.

G Ablation Study
As an ablation study, we compare two variations of a simple contrastive objective to our introduced curved contrastive objective. The first variation has the exact same setup as our approach, with the same mixed learning objective of NLI, a dialogue window of l = 5, and the same hard negatives (including ones for the directional property), but without the "curved" similarity scores between [BEFORE] and [AFTER] tokens; in other words, with simple labels of 0 (not before and after each other within 5 turns) or 1 (before and after each other with a distance between 1-5 turns). Since this does not take any distance into account, we have a second ablation variant that takes only direct utterance pairs (so a window size of 2) with the corresponding two labels and otherwise the same setup. Like our embeddings, we train the two variations on the BERT and RoBERTa architectures respectively. In contrast to our embeddings, we find that both ablation variants reach their optimum for our three tasks after only 1-2 epochs. In the following sections, we present the performance of the ablation studies compared to our approach; note that we refer to the ablation with a window size of l = 5 as ab5 and the one with l = 2 as ab2.
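The difference between the ablation targets and the curved objective can be sketched as label functions over turn distance. The linear decay shape for the curved target below is purely illustrative; the paper's exact schedule is not reproduced here.

```python
def binary_label(distance, window=5):
    """ab5-style target: 1 if the pair occurs before/after each other
    within the window, else 0 -- no distance information."""
    return 1.0 if 1 <= distance <= window else 0.0

def pairwise_label(distance):
    """ab2-style target: only direct utterance pairs count as positive."""
    return 1.0 if distance == 1 else 0.0

def curved_label(distance, window=5):
    """CCL-style target (illustrative linear decay): closer pairs get a
    higher similarity score, encoding the relative turn distance."""
    if 1 <= distance <= window:
        return 1.0 - (distance - 1) / window
    return 0.0

# distance:       1     2     3     4     5     6
# binary_label:   1.0   1.0   1.0   1.0   1.0   0.0
# curved_label:   1.0   0.8   0.6   0.4   0.2   0.0
```

Only the curved target lets the cosine similarity between separately encoded utterances act as a distance estimate, which is what the planning and ordering evaluations exploit.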

G.1 Ablation Study LTP
As shown in table 5, the ablation study with a dialogue window of l = 5 shows stronger performance in ordering utterances than its counterpart with l = 2. Thanks to the solidified structure of the task-oriented corpus, this ablation comes relatively close to the performance of our imaginary embeddings. For Greedy Curving (GC) in particular, it can detect the next goal even slightly better than our embeddings without speaker tokens. However, when the solidified structure of dialogue disappears (on the chit-chat dataset DailyDialog), our models show much stronger performance than their ablation counterparts.

G.2 Ablation Study STP
While the ablation study with the dialogue window of l = 5 shows solid performance in ordering utterances, it has severe trouble understanding the pathways between utterances, as can be seen in table 6, especially on the MDC dataset for close members within their own group (observation window). Here we observe that performance increases over longer distances, which goes hand in hand with the better greedy curving performance. Overall, the ablation study with a dialogue window of l = 2 shows, through its learning objective, a better understanding of its close neighbors than l = 5. While once again the ablation studies do not come close to our embeddings on the DailyDialog corpus, on the MDC corpus l = 2 can outperform our embeddings on direct neighbors (distance 1) while being significantly worse over longer distances. Since it only learned the properties between two speakers, it has notable trouble mapping utterances from the same speaker, as can be seen at even distances on the MDC corpus.

G.3 Ablation Study Next Utterance Selection
We compare both ablation studies to our embeddings in figure 13 on DailyDialog with the same variations as the Imaginary Embeddings, using either the entire context or only the last utterance. Both ablation studies perform best on the variation closest to their training target, in other words, ab5 on the entire context and ab2 on only the last utterance. Given the Greedy Curving evaluation (table 5), one might expect a stronger performance from ab5 rather than ab2. However, we find the exact opposite in the next utterance selection task, as we consider candidate utterances in width rather than in depth. Compared to the other baselines, the strongest ablation study is still 1.5% worse than the pre-trained DialogRPT, 3.69% worse than ConveRT, and 4.3% worse than our best imaginary embeddings. On MDC (figure 14), we observe, as described in §3.3, that considering only the last utterance shows the strongest results. Expectedly, the training objective of ablation l = 2 to only match direct pairs comes in handy, outperforming all other approaches.

Figure 1: Entailment property of sentence transformer-based embeddings within conversations on DailyDialog

Figure 2: DialoGPT guidance example with Imaginary Embeddings with before [B] and after [A] tokens; [B] & [B] and [A] & [A] pairs, as well as the curvature as a relative time dimension between utterance pairs in the space between [B] & [A] representations.

Figure 3: Curved property of Imaginary Embeddings. Grey/black nodes represent history utterances, orange nodes are utterance candidates, and dark orange is the best candidate as it is closest to the goal utterance (red). From the perspective of the best candidate encoded as [A], the scores towards the history illustrate the training objective, as the history utterances are encoded with [B] tokens.

Figure 4: Long-term planning dataset construction variables (history length, goal distances, (first) goal in distance) demonstrated. Furthermore, the concept of Imaginary Embedding Chains (IEC) is illustrated with its puzzle-like properties via the corresponding goal utterance colors.

Figure 5: Normalized average rank of next utterance selection based on dialogue history on DailyDialog. Demonstrated are different Curving variants (only the last utterance or the entire history), classic as well as speaker token-based embeddings. As baselines, we utilize the pre-trained DialogRPT (human vs random utterance task), the pre-trained ConveRT, as well as BM25.

Figure 6: Average Imaginary Embedding similarity to correct and false direction utterances based on turn distance on the DailyDialog test corpus

Figure 7: t-SNE visualization of the first 4 utterances of the first 100 dialogues of the DailyDialog test corpus in the curved embedding space. From dark green to light green (u1 → u2 → u3), nodes as well as edges encoded with the [BEFORE] token, to u4 encoded with the [AFTER] token in light red.

Figure 8: Normalized average rank of next utterance selection based on dialogue history on MDC. Demonstrated are different Curving variants (only the last utterance or the entire history), classic as well as speaker token-based embeddings. As baselines, we utilize the pre-trained DialogRPT (human vs random utterance task), the pre-trained ConveRT, as well as BM25.

Figure 9: Good ranking examples on the DailyDialog test corpus with a history length of 2 and a goal distance of 3. The goal in red, the context in grey, the true utterance in green, and the transformer-generated utterance in blue.

Figure 12: Bad ranking examples on the DailyDialog test corpus with a history length of 2, a goal distance of 2, and goal in distance of 3

Figure 13: Normalized average rank of next utterance selection based on dialogue history on DailyDialog. Demonstrated are different Curving variants (only the last utterance or the entire history), classic as well as speaker token-based embeddings. As baselines, we utilize the two ablation study variants, each with the two variations of the entire context or only the last utterance.

Figure 14: Normalized average rank of next utterance selection based on dialogue history on MDC. Demonstrated are different Curving variants (only the last utterance or the entire history), classic as well as speaker token-based embeddings. As baselines, we utilize the two ablation study variants, each with the two variations of the entire context or only the last utterance.

Table 1: Aggregated short-term planning evaluation for odd distances (unveiling utterances of the dialogue partner) and even distances (which would be uttered by the transformer itself).

Table 3: Detailed short-term planning evaluation with n (number of evaluation samples)

Table 4: Detailed long-term planning evaluation with n (number of evaluation samples)

Figure 11: Good ranking examples on the DailyDialog test corpus with a history length of 2, a goal distance of 2, and goal in distance of 3