Opening up Minds with Argumentative Dialogues

Recent research on argumentative dialogues has focused on persuading people to take some action, changing their stance on the topic of discussion, or winning debates. In this work, we focus on argumentative dialogues that aim to open up (rather than change) people's minds, to help them become more understanding of views that are unfamiliar or in opposition to their own convictions. To this end, we present a dataset of 183 argumentative dialogues about 3 controversial topics: veganism, Brexit and COVID-19 vaccination. The dialogues were collected using the Wizard of Oz approach, where wizards leverage a knowledge-base of arguments to converse with participants. Open-mindedness is measured before and after engaging in the dialogue using a questionnaire from the psychology literature, and success of the dialogue is measured as the change in the participant's stance towards those who hold opinions different to their own. We evaluate two dialogue models: a Wikipedia-based and an argument-based model. We show that while both models perform closely in terms of opening up minds, the argument-based model is significantly better on other dialogue properties, such as engagement and clarity.


Introduction
Developing dialogue agents that are able to argue about different topics has been the focus of much recent research. Typically, these agents engage in conversations with people with the aim of changing their opinions on a topic or winning debates. Accordingly, the success of argumentative dialogue agents has been measured by their ability to convince people to take an action such as donating to a charity (Wang et al., 2019; Shi et al., 2020), change their position on the subject of discussion (Tan et al., 2016; Prakken et al., 2020), or attract more votes from the audience listening to their debates (Zhang et al., 2016; Slonim et al., 2021). Other work has studied argumentation with the aim of reaching agreement (Vecchi et al., 2021; De Kock and Vlachos, 2021). Nonetheless, none of the previous works has studied dialogues in terms of their ability to stimulate open-minded thinking and help participants learn about views that are unfamiliar or in opposition to their own and become more tolerant towards people who hold these views.

P: I feel like pushing your kids to not eat meat or dairy products is a bit too much, other than that I really don't see a problem if it's someone's choice to not eat meat or dairy products.
W: The same goes for animal products, where they should not be forced onto children until they're old enough to decide for themselves.
P: If a child chooses to eat meat in a vegan household they should absolutely be allowed to and have that privilege.
W: That is fair. However, children are often not educated on the source of their food and distracted from their natural empathy for animals. The artificial divide between animals which are okay to eat and those that are not begins with parenting.

Figure 1: A dialogue excerpt from our dataset about veganism between a participant (P) and a Wizard (W).
Open-minded thinking has been motivated by many psychological studies. Haran et al. (2013) showed that it correlates with information acquisition. Carpenter et al. (2018) demonstrated its importance for responsible behaviour on social media platforms. More recently, Stanley et al. (2020) suggested that individuals' negative views about their ideological opponents could partly be attributed to their lack of exposure to good arguments for these views. Motivated by research on open-minded thinking, we propose to use argumentative dialogues to expose participants to different opinions about polarising topics with the aim of opening up their minds and increasing their tolerance towards views opposing their own. We collected 183 dialogues about three controversial topics (veganism, Brexit and COVID-19 vaccination), using the Wizard of Oz (WoZ) approach (Fraser and Gilbert, 1991; Bernsen et al., 2012). The wizards utilised arguments sourced from publicly available debate platforms to chat with participants. Figure 1 shows an example from the dataset.
In order to evaluate open-mindedness, we follow the approach of Stanley et al. (2020), and ask dialogue participants whether they believe people who hold views opposite to theirs have good reasons for their convictions. Stanley et al. (2020) argued that people who believe their ideological opponents have good reasons for their position are more likely to believe these opponents have good morals and intellectual capabilities. Therefore, we also ask participants about the intellectual capabilities and morality of people who hold views opposite to theirs. We refer to these questions as the opening up minds (OUM) questions and detail them in Table 1. We ask these questions before and after the dialogue and measure the change in the answers. Additionally, we ask participants to rate their experience (e.g., in terms of engagement, persuasiveness, frustration, etc.) and find no strong correlation between that and whether they have become more open-minded. These findings further highlight the distinction between dialogues aiming at opening up minds versus persuasiveness or engagement. To our knowledge, our dataset is the first dialogue corpus that aims at fostering open-minded thinking.¹ Finally, we evaluate two dialogue models: a Wikipedia-based and an argument-based model, where the latter is fine-tuned on our dataset. Our results show that while both models perform closely in terms of opening up minds, the argument-based one is significantly better in other chat experience measures such as engagement and clarity.

Related Work
Several studies on argumentative dialogues have focused on persuasion. Tan et al. (2016) analysed the interactions on ChangeMyView (CMV) forums in order to understand the features that lead to persuasion. They described the original posters on CMV as "open-minded" if they changed their original view. In contrast, in our study an "open-minded" participant becomes more accepting of the opposite view, without necessarily changing theirs. Wang et al. (2019) collected dialogues in which one participant tries to convince the other to make a donation. They studied different persuasion strategies that lead to dialogue success, which is measured by whether the participant actually made a donation. Following their work, Shi et al. (2020) investigated the effect of chatbot identities on convincing people to make donations. Other work has focused on argumentative dialogues for debating, such as Oxford-style debates (Zhang et al., 2016) and IBM's Project Debater (Slonim et al., 2021). The goal of the participants (humans or dialogue agents) in these debates is to win by convincing an audience with their arguments.
Recently, knowledge-based dialogue agents have attracted much attention as a way to have more engaging dialogues and avoid knowledge hallucination, a typical issue in end-to-end chat models. Numerous knowledge-bases have been utilised, such as IMDB movie reviews (Moghe et al., 2018) or Wikipedia (Zhou et al., 2018; Dinan et al., 2019). For instance, Dinan et al. (2019) used the WoZ approach to collect dialogues where wizards use sentences from Wikipedia to write their responses. These Wikipedia-based datasets have later been utilised to build knowledgeable dialogue agents (Li et al., 2019; Lian et al., 2019; Zhao et al., 2020b,a; Shuster et al., 2021a). Nonetheless, using arguments as a knowledge-base for dialogue agents has received less attention, with the exception of, for example, Prakken et al. (2020), who developed a chatbot to persuade participants to accept that university fees should remain the same by selecting arguments from an argument graph using cosine similarity.

Wizard of Oz Data Collection
We collect 183 dialogues using the WoZ approach, where a person (a wizard) plays the role of an agent and discusses a given topic with another person (a participant). Statistics of the collected dialogues are shown in Table 2. In the remainder of this section, we discuss the dialogue collection process.
The wizards We recruited 5 postgraduate students from one of the authors' university's student job shop (a pool of students looking for research assistant work) to act as wizards. Each wizard is instructed to discuss a given topic for 15-20 minutes with a participant to help them understand the other perspective on the topic being discussed, rather than change their minds. More concretely, wizards are asked to use the argument that best fits the conversation. To assist them, an argument base about the topic of discussion (see later in this section) is made available to them. Each argument is annotated with a pro or con stance relative to the topic. After a participant's turn, TF-IDF scores are calculated between the participant's last utterance and each argument in the argument base,² and the 50 arguments with the highest scores are presented to the wizard to help them respond. Wizards are encouraged to edit the arguments they select to make them flow more naturally with the conversation, or to write their own responses from scratch if they cannot find a good argument to use or want to ask questions. In order to further facilitate their task, wizards are also given a list of hedges and acknowledgments to use in their responses to make the conversation more natural and polite (e.g., "I see what you mean, but...", "It could be argued...", etc.), which previous research has found to be conducive to better conversations (Yeomans et al., 2020; De Kock and Vlachos, 2021).
The WoZ interface also allows the wizards to use keywords to search the whole argument base of the topic, and to filter arguments by stance (pro/con).
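To make the suggestion step concrete, the following is a minimal sketch of the TF-IDF ranking described above, assuming scikit-learn's TfidfVectorizer with cosine similarity; the paper does not specify the exact TF-IDF implementation, so this is an illustrative approximation rather than the authors' code.

```python
# Sketch of the wizard-side argument suggestion step (assumed implementation).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def suggest_arguments(last_utterance, argument_base, k=50):
    """Rank the argument base by TF-IDF similarity to the utterance."""
    vectorizer = TfidfVectorizer(stop_words="english")
    # Fit on the argument base so term statistics reflect the arguments.
    arg_vectors = vectorizer.fit_transform(argument_base)
    utt_vector = vectorizer.transform([last_utterance])
    scores = cosine_similarity(utt_vector, arg_vectors)[0]
    top = scores.argsort()[::-1][:k]          # k highest-scoring arguments
    return [(argument_base[i], scores[i]) for i in top]
```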
The participants We recruited participants from Prolific.³ All participants are fluent in English and have a Prolific acceptance rate of over 95%. Participants are asked to discuss the topic freely with the wizards, writing arguments and posing questions as they wish. Before the conversation, participants indicate their stance on the topic of discussion by answering whether: they are vegans (if the topic is veganism), they took at least one shot of the vaccine (if the topic is vaccination), or they voted leave or remain (if the topic is Brexit). According to their stance, they are asked about the people who hold the opposite stance; in particular, they indicate how much they disagree/agree with the OUM questions in Table 1 on a 7-point Likert scale. They give their ratings before and after the dialogue. Furthermore, participants are asked after the conversation about their chat experience, rating on a 7-point Likert scale how much the chat was: enjoyable, engaging, natural, clear, persuasive, confusing, frustrating, too complicated and boring (each measure is rated separately). They are also given the option to provide any other feedback about the conversation. We include the instructions given to participants in Appendix A.

Argument base
The arguments presented to the wizards are extracted from the online platform Kialo.⁴ Arguments in Kialo are organised as a tree where the top node represents the main claim (the topic in our case). Each argument node in the tree is annotated with a pro or con stance based on whether it is for or against its parent argument node. In our WoZ platform, the arguments are labelled with their stances (pro or con) relative to the topic. As the nodes in Kialo are annotated with stances relative to their parent claim rather than the main claim/topic, we use a heuristic approach to calculate the stances relative to the topic. Specifically, we trace the argument tree from the topic node down to each child argument node and modify the stance of each child with the following assumptions (see the code sketch below):⁵
• If an argument is pro the main topic, all its pro children will be pro the topic and all its con children will be con the topic.
• If an argument is con the main topic, all its pro children will be con the topic and all its con children will be pro the topic.
As vaccination had the lowest representation of arguments in Kialo, we augment the vaccination argument base with 479 additional arguments written by participants who took part in a study examining anti-vaccination attitudes (Brand and Stafford, 2022) and 108 arguments sourced from a study examining the use of chatbots for changing vaccine attitudes (Altay et al., 2021; Brand and Stafford, 2022).
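The following is a minimal sketch of this stance-propagation heuristic, assuming the Kialo tree is available as a mapping from each node to its parent and locally annotated stance; the function and variable names are our own, and only the sign-flipping rule comes from the description above.

```python
def stance_to_topic(node, tree):
    """Return the stance of `node` relative to the root topic.

    `tree` maps a node id to (parent_id, local_stance), where local_stance
    is the node's annotated stance relative to its parent ('pro' or 'con').
    By the two rules above, the stance flips once for every 'con' edge on
    the path up to the topic node, which itself has no entry in `tree`.
    """
    flips = 0
    while node in tree:
        parent, local_stance = tree[node]
        if local_stance == "con":
            flips += 1
        node = parent
    return "pro" if flips % 2 == 0 else "con"

# Example: a con child of a con argument ends up pro the topic.
tree = {"a": ("topic", "con"), "b": ("a", "con")}
assert stance_to_topic("a", tree) == "con"
assert stance_to_topic("b", tree) == "pro"
```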
Wizard actions We find that wizards use arguments from the argument base in ≈ 66% of their responses. In Table 3, we detail statistics of the different actions taken by the wizards when they select an argument from the argument base. The table reveals that the wizards prefer to edit these arguments to fit the dialogue better (74.86% of the arguments were edited). Furthermore, they often use the search bar and the stance filter instead of just selecting from the top arguments suggested by TF-IDF; they select an argument from the top 10 suggestions only 21.15% of the time. Finally, we notice that the wizards' use of pro and con arguments is balanced.

Dialogue Models
In this section, we describe the dialogue models for the task of opening up minds.

Wiki-bot
We evaluate the Retrieval-Augmented Generation (RAG)-Sequence model (Lewis et al., 2020b) pre-trained on the Wizard of Wikipedia dataset (Dinan et al., 2019). RAG-Sequence uses Wikipedia as a knowledge-base: a Dense Passage Retriever (Karpukhin et al., 2020, DPR) retrieves Wikipedia passages that are relevant to the dialogue history, and BART (Lewis et al., 2020a) then generates a dialogue response conditioned on the retrieved passages and the dialogue history. We use the model pre-trained by Shuster et al. (2021b).⁶ Their approach uses beam search for decoding; however, we noticed that it suffers from repetition and therefore used nucleus sampling to remedy this.
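For reference, here is a minimal, generic sketch of nucleus (top-p) sampling for a single decoding step; this illustrates the standard technique, not the authors' exact decoding code, and the default top_p value is an assumption.

```python
import torch

def nucleus_sample(logits, top_p=0.9):
    """Sample one token id from the smallest set of tokens whose
    cumulative probability exceeds top_p."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Number of tokens needed before the cumulative mass exceeds top_p.
    cutoff = int((cumulative < top_p).sum().item()) + 1
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()  # renormalise
    choice = torch.multinomial(kept, num_samples=1)
    return sorted_idx[choice].item()
```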

Argu-Bot
We fine-tune the previously described wiki-bot on the OUM dataset (Section 3). We split the dataset into 123 dialogues for training, 15 for validation and 45 for testing. Training is stopped when the validation perplexity doesn't improve for 5 epochs. In order to accommodate the nature of the dataset, we applied some adaptations to retrieval, training and generation as follows.

For retrieval:
• Following the wizards' experiments, we use Kialo arguments, instead of Wikipedia, as the knowledge-base for the retrieval model.
• We use BM25 instead of DPR for retrieval, as initial experiments showed that DPR is well suited for Wikipedia but not for argument retrieval.⁷
• We assume that the arguments used by the wizards in the training data are good arguments and accordingly increase their scores by 1 if they are retrieved by BM25.
• We make use of the search terms the wizards used to find arguments (Section 3) and compile a list of "important terms". We increase the scores of retrieved arguments by 1 if they include any of these terms.

For training, the model learns when to ground its response in a retrieved argument and when to rely on its own generation, similar to work in abstractive summarisation (See et al., 2017). At any turn $t$ in the dialogue, the model learns a generation probability $p^{gen}_t \in [0, 1]$ conditioned on the participant's last utterance $\mathbf{h}_t$:

$$p^{gen}_t = \sigma\big(f(\mathbf{h}_t)\big),$$

where $f$ is a learned scoring function, $\sigma$ is the sigmoid function, and $p^{gen}_t$ is optimized to be 0 if the wizard used an argument to generate the response and 1 otherwise. During inference, the probability of generating a response sequence $y$ is calculated by:

$$p(y) = p^{gen}_t \, p(y \mid x) + \big(1 - p^{gen}_t\big) \, p(y \mid x, z),$$

where $x$ is the dialogue history and $z$ is the retrieved argument. Finally, for generation, we re-rank the candidate responses generated by nucleus sampling w.r.t. their similarity to the retrieved argument and dissimilarity to the previously generated utterances (to avoid repetition). In order to achieve this, we compute the BLEU score between each candidate response and the retrieved argument, and the negative of the BLEU score between each candidate and the previously generated utterances, then re-rank the candidates using the average of these two scores.
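As an illustration of the re-ranking step, here is a minimal sketch assuming NLTK's sentence_bleu; the exact BLEU implementation and tokenisation are not specified in the paper, so both are assumptions here.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def rerank_candidates(candidates, argument, previous_utterances):
    """Sort candidates by (BLEU vs. argument - BLEU vs. history) / 2,
    rewarding similarity to the retrieved argument and penalising
    repetition of previously generated utterances."""
    smooth = SmoothingFunction().method1  # avoid zero scores on short texts

    def score(candidate):
        hyp = candidate.split()
        sim = sentence_bleu([argument.split()], hyp, smoothing_function=smooth)
        if previous_utterances:
            rep = sentence_bleu([u.split() for u in previous_utterances],
                                hyp, smoothing_function=smooth)
        else:
            rep = 0.0
        return (sim - rep) / 2.0  # average of similarity and -repetition

    return sorted(candidates, key=score, reverse=True)
```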

Control-bot
We use a control condition in our experiment to verify whether participants change their ratings for the OUM questions due to discussing the topic, or for other reasons such as a demand effect (i.e., they think they are required to change their ratings positively). To this end, we evaluate a 'chitchat' chatbot and instruct the participants to chat about their holidays/weekends. We use the same format of before and after questions as in the wizards study about the 3 topics (veganism, Brexit and vaccination). For instance, in an experiment about veganism, a vegan participant is first asked about their views about non-vegans, then they talk with the chatbot about their holidays, and after the chat they are asked again about their views about non-vegans. We use a Poly-encoder model trained on the ConvAI2 dialogues (Humeau et al., 2020), and we refer to this chatbot as control-bot.

Evaluation
We evaluate the models described in Section 4 using the same setup as in Section 3, but replacing the wizards with one of the models and limiting the chat time to 10-15 instead of 15-20 minutes, as the models are much faster than the wizards. We collect 150 dialogues for each of the argu-bot and wiki-bot (60 for veganism, 45 for Brexit and 45 for vaccination) and 50 dialogues for the control-bot (20 for veganism, 15 for Brexit and 15 for vaccination). In the remainder of this section, we present an analysis of open-mindedness and chat experience for the wizards and the dialogue models.

Opening-up Minds
As discussed in Section 3, we ask the participants a set of OUM questions before and after the dialogue in order to evaluate the impact of the dialogue on changing their attitude towards those holding opinions different to theirs. After excluding the dialogues where the participants did not respond to the questions after the dialogue, the number of dialogues with OUM question annotations is 120 for the wizards, 150 for argu-bot, 150 for wiki-bot and 50 for control-bot. For each dialogue, we calculate three OUM scores corresponding to the three question categories defined in Table 1. Each OUM score is calculated as the difference between the ratings before and after the dialogue. As the morality and intellectual capabilities categories contain three questions each, the score for each of these categories is the average of the changes in its sub-questions. We note that due to the different phrasing of the OUM questions, an increase in the rating for the "good reasons" question denotes a positive change, whereas a decrease in the ratings for the "intellectual capabilities" and "morality" questions denotes a positive change. We categorise the dialogues according to their OUM scores into 3 classes: zero change, where the score = 0; +oum change, where the score > 0; and -oum change, where the score < 0. We show in Table 4 the percentage of dialogues in each class and the average OUM score per class. We also report the overall score as the average of the OUM scores of all the dialogues for each OUM question category. This overall score summarises a model's success with a single number.⁸
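A minimal sketch of this scoring scheme, assuming the per-question Likert ratings are available as lists; category names follow Table 1, and the sign convention follows the text above.

```python
NEGATIVELY_PHRASED = {"intellectual capabilities", "morality"}

def oum_score(before, after, category):
    """OUM score for one dialogue and one question category.

    `before` and `after` are lists of 7-point Likert ratings, one per
    sub-question in the category ("good reasons" has a single question).
    """
    # Average change across the category's sub-questions.
    delta = sum(a - b for b, a in zip(before, after)) / len(before)
    # For negatively phrased categories, a decrease is a positive change.
    return -delta if category in NEGATIVELY_PHRASED else delta

def oum_class(score):
    """Map a score to the zero / +oum / -oum classes used in Table 4."""
    if score == 0:
        return "zero change"
    return "+oum change" if score > 0 else "-oum change"
```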

Wizards
The results in Table 4 demonstrate the success of the wizards' dialogues in opening up participants' minds, particularly in the good reasons category (35.8% of the dialogues resulted in a positive OUM change). We find that although most participants show zero change for each question category, which is expected given the relative brevity of the dialogues, the number of participants with a positive change in their attitude (+oum) is substantially larger than the number with a negative change (-oum). Even when the percentage of dialogues with negative scores is relatively high (e.g., 18.3% in the morality category), the average magnitude of the negative scores is smaller than that of the positive ones (e.g., −0.73 vs 1.05 in the morality category), and all the categories have a positive overall score. Additionally, we find that the percentage of dialogues with zero change is higher for the control-bot than for the wizards in all question categories, which demonstrates the effect of conversing with wizards in comparison to the control condition. Furthermore, the wizards are consistently better than all the models in all question categories in terms of overall score, with a statistically significant difference over the control-bot in the good reasons category. In general, we notice that participants tend to become more open-minded about the good reasons people might have for their stances (with an overall score of 0.35), which reflects the nature of the argumentative dialogues and the wizards' success in finding good arguments that stimulate open-minded thinking.
On the other hand, the difference between the wizards and the control-bot is less obvious for the morality and intellectual capabilities questions. To investigate this, we take a closer look at the OUM ratings before the dialogue. We find that 39.4% of the participants strongly agree/agree that their opponents have good reasons for their convictions, while 44.5% and 56.7% strongly disagree/disagree that their opponents have low intellectual capabilities or morality, respectively. When we look at the most open-minded ratings, we find that only 14.7% of the participants strongly agree that their opponents have good reasons for their position, while 24.1% and 30.5% strongly disagree that their opponents have low intellectual capabilities or morality, respectively. This shows that for the intellectual capabilities and morality categories, particularly the latter, participants come to the dialogue with a more open mind than in the good reasons category: while they might not completely agree with the reasons their opponents have, they are less harsh in their judgement of the morality of these opponents. Therefore, the dialogues have more room to improve the rating of the reasons for the opposite view. The results for morality and intellectual capabilities also suggest that there is room for the development of novel measures which provide further insight into the mechanisms behind changes in open-mindedness.
We further investigate the correlation of features of the wizard dialogues with the success of these dialogues in opening up minds with respect to the good reasons question. For this purpose, we calculate Spearman's rank correlation coefficient (ρ) between the OUM scores for the good reasons question and the following dialogue features:
• Length-related features: dialogue length computed as the total number of turns in the dialogue, proportion of wizard turns, and proportion of participant turns.
• Proportion of questions asked by the wizard to the total number of sentences in their turns. We use the Stanford CoreNLP parser (Manning et al., 2014) for question identification.
• Proportion of utterances containing arguments selected from the argument base to the total number of wizard turns.
• Proportion of edited arguments w.r.t. all the arguments selected and used by the wizards.
• Ratio between pro and con arguments used by the wizard.
• Frequency of politeness markers (Danescu-Niculescu-Mizil et al., 2013) utilised by the wizard, such as greetings, hedging and subjunctives. We use ConvoKit (Chang et al., 2020) to identify politeness markers and normalise each marker by the number of sentences written by the wizard.

Table 5: Average ratings for chat experience on a 7-point Likert scale. For the metrics in the top rows higher is better, while for those in the bottom rows lower is better. Statistical significance is calculated using Welch's t-test between argu-bot and wiki-bot, where *** p < 0.001, ** p < 0.01 and * p < 0.05.

Table 6: Spearman's correlation (ρ) between OUM scores for the good reasons question and ratings for chat experience in the wizards' dialogues.

Our analysis reveals very weak to negligible correlations between the OUM scores for the good reasons question and any of these features (see Appendix B for the full table of correlations). The features with the strongest correlations are two of the politeness features: the use of positive lexicon (ρ = 0.18) and the use of subjunctives (ρ = −0.18; e.g., "Would you agree that eating meat is not inherently bad?"). While using positive words fosters a positive attitude towards the participant (e.g., by acknowledging their "good" points), it is not clear why there is a negative correlation between subjunctives and OUM scores.
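The correlation computation itself is standard; here is a minimal sketch assuming scipy and that each feature has been collected into a list aligned with the per-dialogue OUM scores (the feature names in the usage comment are illustrative).

```python
from scipy.stats import spearmanr

def feature_correlations(oum_scores, features):
    """Spearman's rho between OUM scores and each dialogue feature.

    `features` maps a feature name to a list of per-dialogue values,
    aligned with `oum_scores`.
    """
    results = {}
    for name, values in features.items():
        rho, p_value = spearmanr(oum_scores, values)
        results[name] = (rho, p_value)
    return results

# Hypothetical usage with two of the features described above:
# corr = feature_correlations(scores, {"num_turns": turns,
#                                      "positive_lexicon": pos_lex})
```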

Models
Table 4 shows that both the wiki-bot and the argu-bot have a higher overall score than the control-bot on the good reasons question. This further demonstrates the ability of the two models to positively change people's attitudes regarding the reasons their opponents have. Nevertheless, this change diminishes in the morality and intellectual capabilities categories, which aligns with our findings from the comparison between the wizards and the control-bot. Additionally, although the efficacy of using arguments was demonstrated by the wizards' performance, which surpassed all the chat models (with a good reasons overall score of 0.35), the model that leverages arguments (argu-bot) performs on par with the one that uses Wikipedia (wiki-bot), with a good reasons overall score of ≈ 0.22. We conjecture that the impact of argu-bot could be improved by: (1) training a retriever model on the OUM dataset to learn how to retrieve arguments similarly to the wizards; (2) investigating different loss functions, as we find that the model relies heavily on the arguments instead of also asking questions like the wizards do. In Table 7, we give an example of vaccination dialogues that had a positive OUM score for good reasons. The example shows that argu-bot is better able to engage in the conversation and talk about COVID-19 vaccines with more knowledge, whereas wiki-bot says it does not know about COVID-19 and deviates to other vaccines and diseases. Nonetheless, this wiki-bot dialogue had a positive change in the good reasons category, which shows the potential of dialogue agents for opening up minds.

Chat Experience
Wiki-bot
M: Hello, What do you think about COVID-19 vaccination?
P: I think, they are good for us.
M: I don't know about COVID-19, but I do know that the WHO recommends getting vaccinated against typhoid and shingles (common cold).
P: yes, that's true
M: The Typhoid vaccine is actually derived from the negative causative agents in the bacterium Streptococcus aureus.
P: I have to say, that i didn't heard about it.

Argu-bot
M: Hello, What do you think about COVID-19 vaccination?
P: I think it's important for people to get immunized against COVID-19
M: What is your opinion of some Covid-19 vaccines will contain adjuvants. Adjuvants increase the immune response, and therefore reduce the amount of antigen that must be used in each dose of vaccine. The adjuvents therefore help to ensure that a large quantity of doses can be produced quickly.
P: Sounds like they help with speeding up the vaccination process, so they must be good
M: Those are good points, although some people feel the speed in which the vaccine was made has made them nervous about taking it. They don't believe that a vaccine made this quickly would not be safe and they think corners must have been cut to make the vaccines work quickly.
P: Well, if they do not take the vaccine, they risk their health and their close ones' health. Even if the vaccine is not to be fully trusted, what could those people lose that's worse than their life and their family's life?

Table 7: Excerpts from dialogues with the argu-bot and the wiki-bot about COVID-19 vaccination, where 'M:' and 'P:' mark the model and the participant turns respectively. Both dialogues achieved a positive change in the good reasons category.

After the dialogue, participants are asked to rate their chat experience on a scale from 1 to 7 in terms of how much it was: enjoyable, engaging, natural, confusing, frustrating, clear, persuasive, too complicated and boring. For the dialogue models we add two more metrics: consistent and knowledgeable. We present the average chat experience ratings in Table 5.

Wizards
Table 5 shows that the wizards surpass all the models in terms of chat experience. In Figure 2, we plot the distribution of chat experience ratings. It is clear from the figure that participants mostly strongly agree/agree with the positive experiences (e.g., enjoyable) and mostly strongly disagree/disagree with the negative ones (e.g., frustrating), which is another sign of the wizards' success.
We further investigate the correlation between chat experience ratings and the OUM scores of wizard dialogues for the good reasons question (Section 5.1). Based on the results in Table 6, we can see that there is no strong correlation between the scores and the different experiences; there is a very weak negative correlation with some of the bad experiences (e.g., ρ = −0.19 for too complicated) and a very weak positive correlation with some of the good experiences (e.g., ρ = 0.16 for persuasive). These results show that participants can still enjoy the conversation and have a positive experience even if they did not change their position. The weak correlation between OUM scores and persuasiveness further demonstrates the difference between persuading someone and opening up their minds about different opinions, which motivates building dialogue systems that foster open-mindedness. Participants are also given the option to write any other feedback about the conversation. We find that all the feedback to the wizards was positive and included sentences like: "The bot is much more nice than the average human who asks these kind of questions",¹¹ "It has opened my eyes to the possibilities of vegan lifestyle and their benefits" and "This study was very enjoyable and fun. I learnt a lot from it.".

Models
Table 5 reveals that argu-bot surpasses wiki-bot in all chat experience metrics and is significantly better on 8 of them. High performance on chat experience is important for building real-life dialogue models that aim to open up minds: while in our experiments participants were explicitly asked to stay at least 10 minutes in the chatroom (otherwise their submission was rejected), in real life this restriction does not apply, so participants need to find the chatbot enjoyable and engaging in order to continue chatting with it. We also find that the feedback argu-bot received is more positive than that for wiki-bot, and included sentences like: "Interesting and made me think about my choices :)" or "I really liked this one. Chatbots are a clever way of engaging into a topic.". However, participants were more critical than with the wizards and added comments like: "One time bot answered two times exactly the same answer. It should be improved, however the overall impression of it is fine :)" and "I think the responses of the chatbot didn't answered my questions, they were missing the point.". Feedback about wiki-bot included: "The chatbot was changing the topic and not relating to my sentences".¹²

Conclusion
We presented a dataset of argumentative dialogues for opening up minds and showed its success in positively changing participants' attitudes regarding the reasons people have for their opposing views. However, this impact was lower with regard to the morality and intellectual capabilities measures, which warrants further study of these measures. We evaluated two dialogue models, a Wikipedia-based and an argument-based one, and showed that while they both perform closely in terms of opening up minds, the argument-based model is more successful in providing a good chat experience.

Limitations
• It would be useful to train a neural retriever model for argu-bot to learn to select arguments like the wizards (instead of using BM25), but this requires collecting more wizard dialogues.
• Collecting more dialogues with wizards is an expensive process, as it requires training more wizards and paying both wizards and participants.

• Our study involves measuring individuals on how open-minded they are with respect to a position they are opposed to. While we rely on recent research in psychology for this, we acknowledge that such measurements are difficult and more research is needed in this direction.
• We only studied the effects of the dialogues on the participants immediately after they were held, but did not check whether the effect was long-term or short-lived.

Figure 4: Ratings for chat experience for the wiki-bot. The y-axis corresponds to the proportion of the dialogues, the x-axis corresponds to chat experiences, and the different colors refer to the ratings on the 7-point Likert scale, where 1=strongly disagree and 7=strongly agree.

Figure 2: The graph depicts the ratings of wizards' dialogues in terms of chat experience. The y-axis corresponds to the proportion of the dialogues, the x-axis corresponds to aspects of chat experience, and the different colors refer to the ratings on the 7-point Likert scale, where 1=strongly disagree and 7=strongly agree.

Figure 3: Ratings for chat experience for the argu-bot. The y-axis corresponds to the proportion of the dialogues, the x-axis corresponds to chat experiences, and the different colors refer to the ratings on the 7-point Likert scale, where 1=strongly disagree and 7=strongly agree.

Table 2: Statistics of the opening up minds (OUM) dataset.

Table 3: Percentage of the different (non-mutually exclusive) argument selection actions by the wizards.

Table 4: The percentage of dialogues that have zero, positive or negative OUM scores in the three OUM categories. 'Overall' refers to the average of the dialogues' OUM scores for the respective category. The numbers in brackets indicate the average OUM score. * indicates significance over control-bot using Welch's t-test with p < 0.05.