OTTers: One-turn Topic Transitions for Open-Domain Dialogue

Mixed initiative in open-domain dialogue requires a system to pro-actively introduce new topics. The one-turn topic transition task explores how a system connects two topics in a cooperative and coherent manner. The goal of the task is to generate a “bridging” utterance connecting the new topic to the topic of the previous conversation turn. We are especially interested in commonsense explanations of how a new topic relates to what has been mentioned before. We first collect a new dataset of human one-turn topic transitions, which we call OTTers. We then explore different strategies used by humans when asked to complete such a task, and notice that the use of a bridging utterance to connect the two topics is the approach used the most. We finally show how existing state-of-the-art text generation models can be adapted to this task and examine the performance of these baselines on different splits of the OTTers data.


Introduction
For a conversation to be truly engaging, we typically assume that both participants take initiative, e.g. by introducing a new topic. We call this a mixed-initiative dialogue. Open-domain systems trained on vast amounts of data (Jiang et al., 2020; Zhang et al., 2020; Gao et al., 2018; Li et al., 2017, 2016; Vinyals and Le, 2015), however, are often purely responsive, make abrupt transitions, or fail to take initiative (see examples in Table 1). In this paper, we consider the case where the system pro-actively introduces a new topic in a conversation by providing a commonsense link of how this new topic relates to what was mentioned previously (see Fig. 1). We call this transition strategy "bridging". Humans deploy a range of strategies in addition to bridging, including disengagement, discourse markers or silence (Riou, 2015). We hypothesise that introducing a new topic by making a connection with the previous dialogue turn can be perceived as a less abrupt transition.

1 https://github.com/karinseve/OTTers

Figure 1: Example topic transitions from OTTers. User A introduces a topic with a short sentence (main concept in bold). Then User B responds with an (optionally multi-sentence) "bridging" transition before introducing the new topic (the main concepts for the transition and target topic are denoted with underline and bold, respectively). Each example is accompanied by an entity path, comprising Knowledge Graph entities (denoted with teletype) instantiating the main concepts of the dialogue turn.

Example 1
User A, Source Topic: I spend a lot of time outside.
User B, Transition: I like the outdoors as well, especially gardening. It destresses me.
User B, Target Topic: I enjoy relaxing and getting flowers.
Entity Path: outside - garden - flower

Example 2
User A, Source Topic: I like seafood a lot.
User B, Transition: Since you like seafood, is Swedish fish a candy that you might enjoy?
User B, Target Topic: I have no self control when it comes to candy.
Entity Path: seafood - Swedish fish - candy

Example 3
User A, Source Topic: I think I am getting engaged soon.
User B, Transition: I have two children from a previous marriage.
User B, Target Topic: My children are my life.
Entity Path: engagement - marriage - child
More specifically, we investigate bridging transitions between two user utterances in the form of one or more sentences that contain at least one main linking concept. These inherently can allow for better grounding to external resources such as entities in large Knowledge Graphs (KG) (e.g., Wikidata), or named entities mentioned in documents (e.g., Wikipedia, or news articles), ultimately leading to more controlled and interpretable outputs.
To this end, we crowdsource a corpus of human-written topic transitions focused on these "bridging" strategies, where humans introduce a "missing link" concept, given a source and target topic in the form of two short user utterances (Fig. 1). By grounding the topics on a KG using automatically recognised entities associated with each topic, we can then identify "commonsense" connections which are similar to these missing links.
By modelling such topic transitions in the form of Cause-Effect relationships in a KG, we can then perform abductive inference on commonsense knowledge for which we provide a language generation baseline. In particular, we fine-tune a multihop reasoning model (Ji et al., 2020) which was trained on a similar task called Abductive NLG (αNLG) to generate an explanatory hypothesis given two observations. We find that combining a reasoning module over a KG (ConceptNet) with a language model achieves the best performance on our "topic transition" task for both the predicted entity path as well as the generated utterance. In addition, we show that existing multi-topic dialogue datasets, such as PersonaChat (Zhang et al., 2018) and TopicalChat (Gopalakrishnan et al., 2019), cannot be easily adapted to this task, due to the different nature of the tasks they were designed for.
Our contributions are as follows:
• We propose a new Natural Language Generation task based on one-turn topic transitions for open-domain dialogue based on a "bridging" strategy, which promotes grounding on KG entities.
• We collect a crowdsourced dataset, OTTers, and present a rigorous analysis in terms of transition strategies, linguistic properties and entity linking to a KG.
• We show that our KG-grounded dataset can effectively leverage the reasoning component of an existing Transformer-based model (Ji et al., 2020) to generate better output compared to a vanilla GPT-2 (Radford et al., 2019) decoder, both in in-domain and out-of-domain data splits.

Related Work
Topic Transitions in the Linguistic Literature.
There is no common definition for the term topic (Goutsos, 1997; Purver et al., 2011); however, there are a number of definitions which are helpful for our purposes. Goutsos (1997) divides a "topic" into two main components: 1) what constitutes a topic (the "what") and 2) how participants perceive and manage a topic (the "how"). An early work from Brown and Yule (1983) declares that "topics should be described as the most frequently used, unexplained term in the analysis of discourse". In general, "discourse topics" can be explained as what a portion of the interaction is about, therefore the "aboutness" (Berthoud and Mondada, 1995; Porhiel, 2005). More specifically, Chafe (1994) defines the notion of topic as "the totality of information that is semiactive at one time".
Prior work has shown that the introduction of a new topic usually co-occurs with cues such as wrapping things up about the current topic (Maynard, 1980), preceding silence, or the use of discourse markers (Riou, 2015). Also, backchannel signals, e.g., yeah, right, you know, indicate that both agents are involved in the interaction and show consent for the topic development (James, 1995). Beyond these overt cues, James (1995) and Geluykens (1993) describe semantic topic transitions: "each topic has a tendency to lead to the next; to provide the opening for another" (James, 1995), and topics are typically "co-constructed", requiring each speaker to contribute to the conversation for further progression and development (Geluykens, 1993). The identification of topic transition is indeed not an easy task. It is not only about linguistic cues such as discourse markers and prosodic cues, as sometimes a topic switch can be identified with the introduction of a new entity (James, 1995). Additionally, in a conversation topics are created and introduced by participants themselves in real time, making topics participant- and interaction-specific (Mondada, 2001, 2003). Moreover, "the entities in focus at a given point in the discourse will be that partially-ordered subset of activated entities which are likely to be continued as topics of subsequent utterances" (Gundel et al., 1993). These cooperative elements emphasise the importance of mixed-initiative topic management for open-domain dialogue systems.
Current Multi-topic Open-domain Systems.
Previous work in open-domain dialogue systems has largely avoided explicitly modelling topic transitions and instead focused on grounding system behaviour in a "persona" (a set of statements about hobbies, demographics, or preferences) (Zhang et al., 2018; Li et al., 2016) or on conditioning conversations on knowledge sources such as newspaper articles, fun facts or Wikipedia articles (Gopalakrishnan et al., 2019; Dinan et al., 2019) to generate engaging responses while avoiding generic replies, improving coherence, and raising new and interesting topics. These approaches often lead to poor topic transitions, as illustrated in Table 1. The PersonaChat example shows neither initiative nor common sense while transitioning to a new topic; it only displays passive acknowledgement from User B, whereas the TopicalChat example presents a very abrupt topic shift by User B. Our dataset is the first corpus focused specifically on one-turn topic transitions; however, there are several human-to-human dialogue corpora wherein participants discuss assigned topics. Two prominent such corpora are TopicalChat (Gopalakrishnan et al., 2019) and PersonaChat (Zhang et al., 2018). In TopicalChat both participants used source documents from Wikipedia to discuss a shared topic. The dialogues in this corpus tend to flow less naturally than those in PersonaChat, with participants generally focusing on expressing the main facts, often by copying and pasting from their source documents rather than having a natural conversation. Therefore we focus on PersonaChat as a point of comparison.
PersonaChat dialogues consist of chit-chat conversations based on a set of "persona traits" assigned to each participant. Because participants seek to express their persona to each other, the conversations require mentioning various topics (i.e. their persona traits) in a natural way. Indeed, Zhang et al. (2018, Sec. 3.3) adjusted their design to encourage users to engage with each other's topics and not simply state their own topics as quickly as possible to end the dialogue. PersonaChat does not contain annotations for the topic of each turn and participants had the freedom to mention their topics (i.e. persona traits) in any order.
We use PersonaChat in two different ways: 1) using their persona traits as starting and goal topics for our own data collection, and 2) as a point of comparison for our dataset.
Commonsense-Aware Neural Text Generation. Large language models still struggle in cases where reasoning over underlying commonsense knowledge is required during generation, including dialogue generation (Zhou et al., 2018), story ending generation (Guan et al., 2019), and topic-to-essay generation (Yang et al., 2019). Recently, Guan et al. (2019) and Bhagavatula et al. (2020) attempted to integrate external commonsense knowledge into generative pretrained language models, which we will also attempt in Section 4 using the Abductive NLG (αNLG) dataset (Bhagavatula et al., 2020). Our setup is similar in spirit to αNLG, which is a conditional generation task for explanations given observations in natural language. In particular, the model has to generate an explanatory hypothesis given two observations: the cause (e.g. The Smith family went on a cruise for their summer vacation) and the consequence (e.g. From then on, the Smiths went to the beach each summer instead). Here, a possible explanation might be: The Smith family got seasick on the cruise. The αNLG dataset contains 20k pairs of observations and 200k explanatory hypotheses, which we will later use for fine-tuning our models (see Section 4).

Task Design and Data Collection
Task Description. We assume there are topics $t_a$ and $t_b$ for utterances $u_a$ and $u_b$ (with $u_\cdot = t_\cdot$ for this paper). The goal of the task is to generate a one-turn transition utterance $u_t$ to serve as a smooth link between $t_a$ and $t_b$ so that its concatenation with utterance $u_b$ is a sensible response to $u_a$. A bridging transition occurs when one or more of the entities $e_t \in \mathbf{e}_t$ mentioned in $u_t$ lies on a path in the knowledge graph between entities $e_a \in \mathbf{e}_a$ and $e_b \in \mathbf{e}_b$ mentioned in $u_a$ and $u_b$, respectively.
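The bridging condition can be illustrated with a minimal sketch. The toy graph, the function names `hop_distances` and `is_bridging`, and the restriction to shortest paths are all assumptions made for illustration; the task definition only requires that a transition entity lies on some path between a source and a target entity.

```python
from collections import deque

def hop_distances(graph, start):
    """Breadth-first search: hop distance from `start` to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def is_bridging(graph, source_entities, transition_entities, target_entities):
    """True if some transition entity lies on a shortest path between a
    source entity and a target entity (endpoints excluded).
    Assumes `graph` is undirected, i.e. every edge is stored in both directions."""
    for e_a in source_entities:
        from_a = hop_distances(graph, e_a)
        for e_b in target_entities:
            if e_b not in from_a:
                continue  # no path at all between this source/target pair
            from_b = hop_distances(graph, e_b)
            for e_t in transition_entities:
                if e_t in (e_a, e_b):
                    continue
                if (e_t in from_a and e_t in from_b
                        and from_a[e_t] + from_b[e_t] == from_a[e_b]):
                    return True
    return False

# Toy undirected graph with hypothetical edges, loosely mirroring Fig. 1.
TOY_KG = {
    "outside": {"garden", "weather"},
    "garden": {"outside", "flower"},
    "flower": {"garden"},
    "weather": {"outside"},
}
```

With this toy graph, a transition mentioning "garden" bridges "outside" and "flower", whereas one mentioning only "weather" does not.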
Knowledge Graph Construction. We use PersonaChat persona traits as the starting point for our data collection. In order to model commonsense connections, we built a knowledge graph (KG) using the entities found in each persona trait through the Yahoo Entity Linker (Blanco et al., 2015; Pappu et al., 2017). Each entity is linked to its corresponding Wikidata identifier, and a SPARQL query retrieves the entity's super-classes and sub-classes, which are added to the KG. Furthermore, the KG has been augmented by retrieving the commonsense connections for each entity from ConceptNet (Speer et al., 2017) and by parsing entity mentions in Wikipedia abstracts.
To select which traits to use for the data collection, we first selected all pairs of entities connected by k hops (1 < k < 20) in the KG. Then, we recovered the entity mentions in the persona traits and saved every pair (nearly 30k) as potential pairs for our data collection.
Data Collection. We crowdsourced the data collection for OTTers on Amazon Mechanical Turk (AMT). Each user was provided with two topics A, B from the PersonaChat persona traits, along with instructions explaining the task. The instructions ask the user to imagine they are having a conversation where the first topic A from the pair represents the last turn of the other person, and the second topic B contains the final topic the user wants to talk about. The user then has to write a short utterance to transition to the new topic B in the least abrupt way possible. Additionally, in order to encourage crowd-workers to ground their utterances in actual topics, we asked them to report the "topics" mentioned in their sentence (see Figure 2).
For each topic pair in the study we collected three different transition utterances to provide more insight into the different strategies users adopt when transitioning to a new topic.

Corpus Properties
Basic Statistics. Table 2 provides summary statistics describing OTTers. Our corpus consists of 4,316 utterances for 1,421 unique topic pairs, with an average utterance length of 1.3 sentences and 16.4 words. The KG path statistics for OTTers are based on all of the paths found by the Yahoo Entity Linker between the 1,421 unique topic pairs in the corpus, a total of just over 12k paths.
KG coverage. We calculated the distance between each pair of topics in the knowledge graph described in Sec. 3.1 to facilitate analyses of the role of topic distance in transition strategy and transition quality. To extract entities from the utterances in our corpus, we extended the tagger built into the Yahoo Entity Linker with the spaCy Named Entity Recognizer to include all nouns and adjectives as potential entities. Using these extracted entities we analyse the overlap between entities mentioned in the given topics A, B and those mentioned in the crowdsourced transition utterances. The Jaccard distance between these two sets is 1 for nearly a quarter of the topic pairs and utterances in our dataset, with a mean of 0.842, meaning that the overlap between entities mentioned in the utterances and entities mentioned in the topics is fairly low. This indicates that users transition from Topic A to Topic B mentioning new unseen entities, following a "path" that can be grounded on a knowledge graph.
In contrast, the overlap between the entities in the KG path between the topics and the entities mentioned in the transition utterances is higher: both the mean and the mode Jaccard distances drop to below 0.8, suggesting that crowdworkers make connections similar to the ones we can find in our knowledge graph a substantial portion of the time. This suggests that our KG-grounded approach can find plausible entities to be mentioned to bridge between topics, similar to the commonsense connections made by humans shifting between topics.
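For reference, the Jaccard distance used in this comparison can be computed as follows. This is a minimal sketch; the function name and the convention of returning 0 for two empty sets are assumptions, and the entity sets in the usage note are made up for illustration.

```python
def jaccard_distance(entities_a, entities_b):
    """Jaccard distance between two entity collections:
    1 - |A intersect B| / |A union B|. A distance of 1 means no shared entity."""
    a, b = set(entities_a), set(entities_b)
    union = a | b
    if not union:
        return 0.0  # assumed convention: two empty sets are identical
    return 1.0 - len(a & b) / len(union)
```

For example, topic entities {"outside", "flower"} and utterance entities {"garden"} share nothing, so the distance is 1, whereas a KG path {"outside", "garden", "flower"} and utterance entities {"garden", "flower"} give a distance of 1/3.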

Transition Strategies in OTTers
To examine the strategies humans applied while completing the OTTers task, we adapted the categories of Riou (2015) for a manual analysis of our data. Riou (2015) distinguishes between disjunctive and stepwise transitions between topics. Disjunctive transitions make no attempt to relate the new topic to the previous topic, switching abruptly to the new topic without acknowledging the previous topic, whereas stepwise transitions are akin to the previously described transition strategies.
We distinguish between bridging and acknowledge & continue strategies: in the former, the speaker aims to produce an utterance which connects the previous and new topics directly; in the latter, the speaker acknowledges the previous topic before introducing their own topic, without explicitly relating the two to one another. In addition to these categories, we also annotated utterances as off-task (e.g. replying to or continuing the first topic without any attention paid to the second topic) or off-topic when the utterance had nothing to do with either of the two topics (e.g. random greetings or generic questions).
Two of the authors annotated 10 utterances from 10 different users, resulting in 200 total annotations. The initial inter-annotator agreement was 71%, classified as substantial (Krippendorff's α = 0.34), after which the annotators collaborated to reach a consensus annotation for each of the examples that presented a disagreement. Table 5 contains a prototypical example for each of the annotated classes.
More than 80% of the data contains some form of transition to the second topic, with 79% containing a bridging utterance, 5% applying an acknowledge & continue strategy, and only 2% using the disjunctive transition strategy. 12% of the data is connected to one or more of the topics in some way but does not serve as a transition, and 2% of the data is completely off-topic. This analysis suggests that our corpus indeed represents the kind of knowledge-based transitions we are interested in.
KG distance and discourse markers. We hypothesize that speakers are less likely to use explicit topic management strategies (e.g. topic wrap-ups, discourse markers) when topics are more closely related to each other, e.g. as measured by graph distance in a large knowledge graph. This would be in line with findings about the use of explicit discourse markers versus leaving discourse relations implicit: Torabi Asr and Demberg (2012, 2013) found that explicit markers are more likely to be omitted when the discourse relation is highly predictable based on the content of the arguments.
Based on Riou (2015) we examined the frequency of discourse markers in utterances to test our hypothesis, examining both general conversational discourse markers and those associated with specific discourse relations. For conversational discourse markers we use the Cambridge Dictionary, which provides a list of spoken and written markers, including "well", "you know", etc., while for markers signalling particular discourse relations we use the list from the Penn Discourse Treebank (Webber et al., 2019; Prasad et al., 2008, PDTB); these include markers like "because" indicating a causal relationship or "in addition" for an additive relationship. We find a small but significant correlation (≈ 0.04) between the use of conversational discourse markers and KG distance, and no significant correlations between the use of PDTB3 discourse markers or the turn length and KG distance. This suggests that users are somewhat more likely to use conversational discourse markers as the distance between topics in the knowledge graph increases, in line with our hypothesis.

Table 5: Prototypical example for each of the annotated classes (A: Topic A, T: transition utterance, B: Topic B).

Acknowledge and continue
A: i like to eat the same thing as ninja turtles.
T: I love pizza. I eat it while I skateboard.
B: i enjoy riding around on a plank with wheels.

Bridging: Missing Link
A: i prefer things to be authentic.
T: I think children are the truest form of authenticity because they say things unfiltered.
B: i am not a fan of children.

Disjunctive
A: i like american made cars.
T: I like liver cooked in butter - just throwing that in!
B: i avoid eating broccoli.

Off-Task
A: i prefer things to be authentic.
T: my bro just made some authentic thai chicken.
B: i am not a fan of children.

Off-Topic
A: i learnt to drive.
T: I had a rough night sleeping in my new bed last night.
B: i like making a salmon entree.

Validating the Corpus
We evaluate whether the transition strategies in OTTers are less abrupt than those found in PersonaChat by constructing a comparable subset of PersonaChat and performing a human evaluation.
Comparable Corpus Construction. We first extract a subset of PersonaChat where two consecutive turns contain different topics, in other words, turns where one speaker changed the topic from what the previous speaker had just said. Since PersonaChat turns do not incorporate topic annotations, we use a heuristic based on BERTScore to assign a topic to each turn. Given topics $\mathbf{t}$ and turns $\mathbf{u}$ for a dialogue in PersonaChat, we calculate the BERTScore similarity between each $u \in \mathbf{u}$ and each $t \in \mathbf{t}$. For each turn $u$ we then assign $\hat{t} = \arg\max_{t} \mathrm{BERTScore}(u, t)$ if and only if $\mathrm{BERTScore}(u, \hat{t}) - \mathrm{BERTScore}(u, t') > d$, where $t'$ is the topic achieving the second highest BERTScore relative to $u$, and $d$ is a threshold to ensure that we only assign a topic to a turn if it is a substantially better fit than the other topics. While this means that not every turn is assigned a topic, this is necessary to ensure that we do not assign topics to, e.g., greetings like 'hi, how are you?'.
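The thresholded assignment can be sketched as below. This assumes the condition is that the best score must exceed the runner-up by more than d; the function names are hypothetical, and `token_overlap` is only a toy stand-in for BERTScore so that the example runs without external dependencies. The threshold value in the usage note is likewise illustrative, not the one used for the corpus.

```python
def assign_topic(turn, topics, score, d):
    """Return the best-scoring topic for `turn`, or None when its margin over
    the second-best topic does not exceed the threshold `d`."""
    ranked = sorted(topics, key=lambda t: score(turn, t), reverse=True)
    if not ranked:
        return None
    if len(ranked) == 1:
        return ranked[0]  # assumed behaviour: no runner-up to compare against
    best, second = ranked[0], ranked[1]
    if score(turn, best) - score(turn, second) > d:
        return best
    return None

def token_overlap(a, b):
    """Toy similarity (NOT BERTScore): Jaccard overlap of lower-cased tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0
```

With topics "i have two dogs" and "i love pizza" and d = 0.1, the turn "i walk my two dogs daily" is assigned the first topic, while the greeting "hi how are you" scores equally on both and is left unassigned.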
This way of assigning topics yields a subset consisting of 22,010 utterances which have a different topic from the preceding utterance. Most of these topic-pairs (20,491) are only expressed through one utterance in the dataset, while 1,188 are expressed by two utterances, 248 by three, and 83 by more than 3 utterances. Moreover, there are 445 topic-pairs which also occur in our corpus.
Crowdsourced Validation. Using the comparable sub-corpus of PersonaChat, we asked crowdworkers to vote which of two potential transition utterances was "less abrupt" (Fig. 3) for 49 topic pairs occurring in both datasets. We collected 3 votes for each pair of utterances and only counted instances where at least 2 of the 3 workers agreed on the same choice.
The results confirm that OTTers has less abrupt transitions: the utterances in OTTers were judged as less abrupt in 44/49 cases, with the comparable PersonaChat utterance judged less abrupt in one case, and both utterances rated "bad" in another. Only 3 cases did not present a majority class.
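The vote aggregation described above amounts to a simple majority rule, sketched here. The label strings and the function name are hypothetical; only the rule that a choice counts when at least 2 of 3 workers agree comes from the text.

```python
from collections import Counter

def majority_label(votes, min_agreement=2):
    """Return the label chosen by at least `min_agreement` workers, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agreement else None
```

Three votes split across three different options yield None, which corresponds to the cases reported as having no majority class.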

Experiments
Having confirmed the quality of our corpus, we now adapt two existing text generation models as baselines for this task. We also explore different train-dev-test splits and conduct an error analysis.

Baselines
The first baseline we consider is a vanilla GPT-2 language model (Radford et al., 2019) fine-tuned on OTTers (vGPT2). Next, we test the recent MultiGen model (Ji et al., 2020) on this task, which extends GPT-2 with multi-hop reasoning on commonsense knowledge graphs. In particular, this model combines the vocabulary distribution generated by GPT-2 with a concept distribution in order to produce knowledge-grounded responses. The concept distribution is given by reasoning performed on the commonsense knowledge graph ConceptNet, using the context modeled through GPT-2.
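One plausible reading of "combines" is a gated mixture of the two distributions, sketched below. The function name, the dictionary representation and the scalar gate are assumptions for illustration; the exact formulation in Ji et al. (2020) may differ.

```python
def mix_distributions(vocab_probs, concept_probs, gate):
    """Gated mixture of a word distribution and a concept distribution.

    `gate` in [0, 1] is the probability mass given to the concept
    distribution; the remainder goes to the language-model vocabulary.
    """
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    mixed = {}
    for token, p in vocab_probs.items():
        mixed[token] = mixed.get(token, 0.0) + (1.0 - gate) * p
    for token, p in concept_probs.items():
        mixed[token] = mixed.get(token, 0.0) + gate * p
    return mixed
```

If both inputs are proper probability distributions, the mixture also sums to one, and a gate close to 0 means the output is dominated by the language model rather than by KG concepts.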

Train-Dev-Test Splits
The first split is an out-of-domain split (ood), which ensures that none of the topics in the test-set are present in any of the topic-pairs in the train-set. For the second split, this restriction is relaxed to create an in-domain split (id), allowing one of the topics in each pair in the test-set to appear in the train-set, although with a different second topic.
The ood split resembles a zero-shot scenario, where the model has to generate a shift between two topics it has never been fine-tuned on. Hence, we expect results to be lower than the ones from id. The number of unique and total topic pairs for each split is illustrated in Table 6.
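The out-of-domain constraint can be made concrete with a short sketch. The splitting procedure, the function names and the `test_fraction` parameter are assumptions; the text only states the property that the resulting split must satisfy.

```python
import random

def topics_of(pairs):
    """Set of all topics occurring in a list of (topic_a, topic_b) pairs."""
    return {topic for pair in pairs for topic in pair}

def is_out_of_domain(train_pairs, test_pairs):
    """True if no topic in any test pair occurs in any train pair."""
    return topics_of(train_pairs).isdisjoint(topics_of(test_pairs))

def out_of_domain_split(pairs, test_fraction=0.2, seed=0):
    """Hold out a random subset of topics. Pairs made only of held-out topics
    go to test, pairs with no held-out topic go to train, and mixed pairs are
    discarded so that the two sides share no topic."""
    topics = sorted(topics_of(pairs))
    random.Random(seed).shuffle(topics)
    n_held_out = max(1, int(len(topics) * test_fraction))
    held_out = set(topics[:n_held_out])
    train = [p for p in pairs if p[0] not in held_out and p[1] not in held_out]
    test = [p for p in pairs if p[0] in held_out and p[1] in held_out]
    return train, test
```

Discarding mixed pairs reduces the amount of usable data; an in-domain split would instead keep them, requiring only that the exact test pair is absent from the training set.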

Evaluation
We evaluate two aspects of the transition task: 1) whether the model can find a sensible path through intermediate topics and 2) whether the model can generate a natural utterance which mentions such intermediate topics.
To evaluate the former, we assess the entities mentioned in the transition utterance to determine how well they bridge the gap between Topic A and Topic B. We use hits@k ratio as an automatic approximation, which measures the number of relevant entities correctly predicted by the model, out of the k most important entities identified in the target references. This metric shows how well the models ground the concepts introduced in the two dialogue turns and how the reasoning compares to the human standard presented in OTTers.
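As a rough sketch of this metric, the function below computes the fraction of the top-k reference entities that also appear among the predicted entities. It assumes the reference entities are already ordered by importance; how that ranking is obtained is not specified here, and the function name is hypothetical.

```python
def hits_at_k(predicted_entities, reference_entities, k):
    """Fraction of the k highest-ranked reference entities that were predicted.

    `reference_entities` is assumed to be sorted by decreasing importance.
    """
    top_k = list(reference_entities)[:k]
    if not top_k:
        return 0.0
    predicted = set(predicted_entities)
    return sum(1 for entity in top_k if entity in predicted) / len(top_k)
```

Note that a model which simply repeats an entity from Topic A can still score on this measure whenever that entity is among the top-k reference entities.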
For (2) we adopt the same automated metrics used for evaluating MultiGen on the αNLG dataset for comparability: ROUGE-L (Lin, 2004), METEOR (Banerjee and Lavie, 2005), and CIDEr (Vedantam et al., 2015). However, we report the full BLEU score (Papineni et al., 2002), which accounts for the overlap across 1- to 4-grams, instead of just 4-grams (BLEU-4). As word-overlap based metrics have been widely criticised due to their lack of correlation with human judgements (Novikova et al., 2017; Reiter, 2018), we also provide an example-based error analysis in Section 4.4.

Results
For each aforementioned split we evaluated three different models to compare performance: the pretrained vGPT2 fine-tuned on each split for OTTers, the MultiGen model fine-tuned only on αNLG, and the same model additionally fine-tuned on OTTers (called αNLGft).
Overview of Results. Table 7 shows the results of these experiments. vGPT2 performs poorly on the one-turn transition task, regardless of the train-dev-test split, which we attribute to the small size of OTTers: with only a few thousand utterances, vGPT2 is unable to learn the task. We notice, however, that the system tends to repeat the main entity in Topic A, therefore scoring surprisingly well on the hits@k metric, despite the fact that the utterances themselves are of low quality (see Table 8).
The reasoning component added by MultiGen leads to substantial improvements in most of the evaluation metrics but not hits@k (αNLG in the table). Therefore, the improvements in text quality metrics appear to be due primarily to the similarity between the structure of the abductive NLG task and that of our task, and to the increased amount of data for fine-tuning (≈ 688k tokens) compared to fine-tuning vGPT2 on our ≈ 71k tokens alone.
Further fine-tuning MultiGen on OTTers leads to substantial improvements on all metrics for both in-domain and out-of-domain splits. The performance improvement is considerable, especially given the relatively small size of the training set (693 unique topic pairs on in-domain, see Table 6), further justifying the compatibility between the original task MultiGen was trained on and OTTers. The results in Table 7 nonetheless indicate there is still room for improvement. We hypothesise that METEOR scores are higher than BLEU scores because they also consider paraphrases.
These results confirm that our newly introduced one-turn topic transition task needs a reliable language model combined with an advanced reasoning component.

Detailed Discussion and Model Limitations.
We further analyse the results to understand model limitations. First, we observe that MultiGen's hits@k ratio is quite low, especially when compared to vGPT2. This is surprising considering vGPT2's generated sentences are mostly very short and repetitive, and the predicted concepts mostly match the ones contained in the 'Topic A' sentence. One possible explanation is that MultiGen's reasoning module uses a gate loss, which determines whether to select a concept from the provided knowledge graph or a word from the GPT-2 dictionary. We observed that most of the time the model uses a word from the GPT-2 dictionary rather than selecting a concept from the knowledge graph.
Moreover, we observe that only 65% of the concepts found in the target sentences are actually nodes in MultiGen's subgraphs. One possible explanation is that MultiGen's reasoning model has a limited input capacity of up to 100 nodes that are at most 2 hops away in order to prune the very large knowledge graph from ConceptNet. The English vocabulary from ConceptNet contains approximately 1,500,000 nodes, which makes the process of determining the concept distributions very computationally expensive and time inefficient. The pruning strategy adopted by Ji et al. (2020) overcomes these problems but cannot be applied to the OTTers task, as the selection of the concepts is just as important as the output sentence being fluent. Contrary to our expectations, expanding the size of the knowledge graphs from 100 nodes to 200 and 300 did not improve the hits@k ratio, most likely because the concepts added to the graphs are either not relevant or misleading for the model. This suggests that improving concept selection is a promising future direction to improve the performance of the reasoning module, leading to overall better topic transitions.
Error Analysis. In addition, we perform an example-based error analysis to further understand the strengths and weaknesses of the individual models. Table 8 shows representative system outputs for each of the models on the in-domain data split. First, we observe that vGPT2 often generates very simple sentences (e.g., 'family.', in Ex. 4), repeated non-content bearing tokens (e.g., 'I love it.', in Ex. 2), or incoherent and often not specific enough output to form a successful bridging transition (e.g., 'a lot of cooking.', in Ex. 3, is not a well-formed sentence, and only loosely connected to Topic A about 'agricultural experience'), contributing to low BLEU scores. However, this also reinforces the idea that the hits@k scores are artificially inflated simply due to vGPT2 choosing to include one of the entities from the first topic.
The outputs from MultiGen tested on OTTers show a better performance than vGPT2, given that the topic selection for the model is grounded on ConceptNet. However, since the Abductive NLG task is different from the 'Topic Transition' task addressed in OTTers, there is a discrepancy in the use of the language. The model often outputs coherent sentences that use generic commonsense facts which may not be related to Topic B (e.g., 'I decided to give birth to a baby', in Ex. 1).
The texts generated from MultiGen fine-tuned on OTTers, on the other hand, introduce interesting connections between Topic A and Topic B (e.g., 'I like to make babies laugh when I'm not working.', in Ex. 1) and leverage commonsense (e.g., 'I like the look of Italian cars', in Ex. 2, where 'the look' creates a connection with 'being in good shape' from Topic B).

Discussion & Conclusion
Ethical Considerations. We recognise that any mixed-initiative dialogue system carries risks related to dual-use: in addition to helpful systems which serve to help users explore a new topic or discover more about the world, a system which can effectively change the topic of conversation could also be used to manipulate user behaviour. For example, bridging strategies for topic transitions could be used by virtual assistants to encourage users to make a purchase or to express their opinion or preference regarding sensitive subjects.

Conclusion.
We have defined a new NLG task exploring one-turn topic transitions for mixed initiative in open-domain systems. Our OTTers corpus provides training data for modelling topic transitions based on 'missing link' topics which connect the previous conversation subject to a new topic. Baseline models based on state-of-the-art approaches to text generation illustrate possible approaches to the task and show that there is room for improvement. In particular, we show that commonsense knowledge grounding is necessary for this task, outperforming fine-tuned large language models. In future work, we will explore model architectures specifically designed for topic transitions, as well as fine-tuning strategies to deal with small datasets. We also plan to evaluate the impact of bridging transitions on user (dis)engagement in an open-domain dialogue system.

A OTTers Data Collection
In order to avoid collecting noisy or out-of-task data, we established some worker requirements for turkers participating in our data collection. Workers needed to:
• be Masters (label assigned by Mechanical Turk to workers who achieve excellence across a variety of tasks),
• have a number of HITs approved greater than 500,
• have a HIT approval rate (%) greater than 80,
• be located in an English speaking country, namely Australia, Canada, New Zealand, United Kingdom, and United States.
A Worker received a reward of $0.3 for each completed assignment. The reward was calculated based on an estimate of the time it would take a Worker to read the instructions and complete the task. The time for completing the task was estimated at 1.5 minutes, and the reward was calculated according to a $12 hourly payment. Each task was assigned to 3 unique workers. Figure 4 shows the instructions that Workers were presented with after opening the OTTers data collection task. The instructions explain that the context is a conversation with a newly-met person. After writing the sentence for transitioning from the current topic to the 'final' one, workers are asked to list the topics they covered for the transition. Additionally, we provided an example:
Current sentence: 'I have a love of reptiles.'
Final sentence: 'I want to travel to NYC.'
Topic shifting sentence: 'I know there is a cool snake species in the New York zoo. This is why I want to travel to NYC.'
Covered topics:
• Reptiles