Detecting Bot-Generated Text by Characterizing Linguistic Accommodation in Human-Bot Interactions

The democratization of language generation models benefits many domains, from answering health-related questions to enhancing education through AI-driven tutoring services. However, it also makes it easier to generate human-like text at scale for nefarious activities, from spreading misinformation to targeting specific groups with hate speech. Thus, it is essential to understand how people interact with bots and to develop methods to detect bot-generated text. This paper shows that bot-generated text detection methods are more robust across datasets and models when they use information about how people respond to bots rather than the bots' text directly. We also analyze linguistic alignment, providing insight into the differences between human-human and human-bot conversations.


Introduction
Bots are useful in a wide variety of application areas, including business (Kaczorowska-Spychalska, 2019), education (Kerlyl et al., 2006), and health (Yadav et al., 2019; Liednikova et al., 2020). For instance, Yadav et al. (2019) studied the use of chatbots as a drop-in first point of contact for women in India seeking breastfeeding information. Similarly, researchers have studied bots that answer COVID-19-related questions (Oniani and Wang, 2020) and screen individuals for risks of contracting the virus (Martin et al., 2020). Overall, the wide availability of software packages, tools, and pre-trained models has democratized the creation of bots.
Even with the increasing interest in bots for social good (e.g., COVID-related chatbots), there is still concern regarding their abuse to spread misinformation, discriminate against targeted groups, deceive users, and commit fraud (Daniel et al., 2019). Given the potential good and harm bots can create, it is essential to study how bots should act and how people actually interact with such bots in specific applications.
Automatically identifying bots online is well studied (Garcia-Silva et al., 2019; Herzig et al., 2019; Kosmajac and Keselj, 2019; Ippolito et al., 2020; Jawahar et al., 2020). Bot detection methods rely on two forms of information (Orabi et al., 2020): behavior and content. Behavior relates to measuring how often bots post, the times posts are created, and conversational network structures (Beskow and Carley, 2018). Content involves using the bot's text directly. Focusing on text suffers from generalization issues, making it challenging to detect bots that discuss different topics or operate in different domains. Likewise, behavioral approaches assume that bots behave differently than people at a superficial level, such as posting more often than humans. Even in the network analysis of conversation structures (Beskow and Carley, 2018), many of the human-bot interactions on social media are not human-like interactions. For instance, it is understandable that humans will interact with a bot that converts pounds to kilograms differently than with other humans. Given bots' current use-cases, ranging from counselors to healthcare information providers, it is vital to understand how humans and bots interact beyond trivial applications.
Before looking at how humans interact with bots, it is essential to understand how humans interact with each other. This paper focuses on Communication Accommodation Theory (CAT) and general language use to analyze interactions. CAT is used to study language use in various domains to understand human behavior (Giles et al., 1973; Tausczik and Pennebaker, 2010). More specifically, we study linguistic accommodation, where speakers come to talk more (or less) similarly as they interact, which has been analyzed in mental health support on Reddit, showing a positive link with informational and emotional support (Sharma and De Choudhury, 2018). The alignment of two people in terms of linguistic style has positively predicted successful outcomes of negotiations (Taylor and Thomas, 2008), and doctors are recommended to accommodate perspective to improve patient care (Kline and Ceropski, 1984; Wood, 2019). The way people use and accommodate pronouns can indicate power: high-status individuals use "I" less and "you/we" words more than low-status individuals (Kacewicz et al., 2014).
Linguistic accommodation in human-bot interactions has been studied in system design, showcasing a strong link between user experience and the language style of the system (Chaves et al., 2019; Chaves, 2020; Thomas et al., 2020). Recent research has also shown a link between language style in chatbots and user engagement for e-commerce (Elsholz et al., 2019). Our focus is to better understand real, long, open-domain human-bot interactions. Our study can also influence how bots are detected on social media, and a better understanding of how human-human interactions differ from human-bot interactions is essential for bot development. For instance, when should bots accommodate towards the user? Can we tell how well a system performs for a specific task based on whether the user accommodates the bot? If human-bot conversations are linguistically different than human-human interactions, research from human-human studies may not generalize to human-bot interactions.
Toward addressing the potential societal impacts of open-domain bots, this paper addresses the following research questions (RQs) in the context of bot detection:

RQ1. How do humans and bots align in human-bot interactions?
RQ2. How does the alignment in human-bot interactions compare to the alignment in human-human interactions?
RQ3. Are differences in alignment between human-human and human-bot interactions similar across domains and language generation learning methods?
Overall, we analyze whether human users' language changes stylistically with bots as compared to human-human conversations. Furthermore, we show that small differences in language style provide robust information, compared to the bots' language patterns, to accurately detect bots.

Related Work
Bot Detection. Detecting bots in the wild is a widely studied problem (Garcia-Silva et al., 2019; Herzig et al., 2019; Kosmajac and Keselj, 2019; Ippolito et al., 2020; Jawahar et al., 2020). Bot detection methods rely on two forms of information (Orabi et al., 2020): content and behavior. Garcia-Silva et al. (2019) studied how pretrained language models perform for the task of bot detection. Kosmajac and Keselj (2019) developed "language-independent" stylistic features that measure language diversity to detect bots. Knauth (2019) empirically explored content and behavioral features for bot detection. Beskow and Carley (2018) analyzed human-bot interactions by developing methods to detect bots using network analysis of conversation structures. Much of the prior work has focused on detecting bots "in the wild." Thus, many of the interactions between bots and humans are superficial (e.g., receiving movie quotes from a bot). This paper differs from prior work in two ways. First, rather than using content from the bot or general behavioral information (e.g., post frequency), we try to detect bots by analyzing how humans respond to them (compared to human-human interactions). Second, rather than exploring bots in the wild, we explore two types of datasets: one where researchers instruct participants to interact with bots and humans in the same way, and another where researchers instruct participants to converse with a bot.
Analyzing Human-Human Conversations. Linguistic alignment is known to interact with a wide array of social factors. For instance, the level of alignment of people's linguistic styles in a conversation has been claimed to be affected by their relative social power (Gnisci, 2005; Xu et al., 2018; Danescu-Niculescu-Mizil et al., 2011). Cooperative decision-making tasks are positively related to the participants' linguistic convergence (Fusaroli et al., 2012; Kacewicz et al., 2014). Recently, Sharma and De Choudhury (2018) analyzed mental health support forums on Reddit, showing that linguistic accommodation is positively linked with informational and emotional support. Similarly, Taylor and Thomas (2008) analyzed negotiation outcomes and linguistic alignment, showing that alignment positively predicted successful negotiation outcomes. There is also support for linguistic alignment by doctors to enhance patient care by improving trust and adherence to a treatment plan (Kline and Ceropski, 1984; Wood, 2019). Alignment is a powerful tool that can be used to improve understanding, trust, and potentially patient outcomes. Therefore, understanding when and how people should align with each other is an important area of research. Furthermore, it is essential to draw a line between engaging with a culture and language style and appropriating it, thereby potentially causing harm rather than building rapport with the partners in a conversation.
With the goal of understanding human-human interactions, there has also been progress in linguistic alignment measurement techniques (Niederhoffer and Pennebaker, 2002; Danescu-Niculescu-Mizil et al., 2011; Jones et al., 2014; Wang et al., 2014; Doyle and Frank, 2016; Shin and Doyle, 2018). Danescu-Niculescu-Mizil et al. (2011) presented an easy-to-compute expression that measures the increase in the conditional probability of a word given that a conversational partner has used it. An issue with the method proposed by Danescu-Niculescu-Mizil et al. (2011) is that it assumes messages between two people have similar length. To overcome this limitation, Doyle and Frank (2016) introduced the Word-Based Hierarchical Alignment Model (WHAM), a hierarchical graphical model whose parameters are learned using Bayesian inference. Another method, the Simplified Word-Based Alignment Model (SWAM), was recently proposed by Shin and Doyle (2018). SWAM attempts to compare alignment between different groups when the alignment scores are assumed to differ substantially. Unfortunately, SWAM only estimates group-level alignment, not conversation-level alignment between two specific users/bots. Linguistic alignment has also been shown to be predictive of specific tasks. For instance, Niven and Kao (2019) use alignment features to predict discourse acts. In this paper, we use the method proposed by Danescu-Niculescu-Mizil et al. (2011) to estimate linguistic alignment.

Analyzing Human-Bot Conversations.
Research studying the interaction between bots and humans has been explored from a wide array of perspectives. For example, systems that use emotionally expressive interjections ("wow", "ahem") in their text-to-speech responses can significantly improve the user experience (Cohn et al., 2019). Given the popularity of bots in application areas from business (Kaczorowska-Spychalska, 2019) to healthcare (Pieraccini et al., 2009), it is also important to understand how language generation style and alignment impact their intended use. There has been recent interest in analyzing accommodation and similar concepts in human-bot interactions. For instance, Ahn et al. (2020) show that humans will match code-switching patterns introduced by a chat system. Moreover, users tend to have positive reactions toward systems that code-switch. Ma and Lalor (2020) measure lexical entrainment between a specific Reddit bot and users. They find that the sentiment of the bot has a positive effect on the sentiment of the human's response. Furthermore, Ma and Lalor (2020) show that human responses tend to overlap with the bot's original post. Compared to prior work studying accommodation-related aspects of human-bot interactions, this paper differs in two ways. First, beyond just analyzing linguistic alignment, we ground our study of human-bot interactions in the real-world task of bot detection. Second, we analyze cross-domain generalization of accommodation patterns in human-bot interactions, where cross-domain includes types of bots (e.g., retrieval and transformer-based models), data collection procedures (e.g., sampling bias), and conversation topics.

Datasets
An overview of the task we introduce in this paper is shown in Figure 1. We formulate a bot detection task between two entities, either a human and a bot or a human and a human. We assume that one entity is always human, and the other entity is unknown, either a human or a bot. Formally, D = (u_1, ..., u_N) represents a sequence of utterances, where u_i^h represents the i-th utterance in a conversation and indicates that it was made by a human h. Likewise, u_i^o represents an unknown entity's utterance (human or bot). N is the total number of utterances in the conversation. Our goal is to develop a classifier f(D) that maps D to a class in the set T = {human-human, human-bot}, where human-bot means a bot is part of the conversation. We formulate two bot detection datasets for this task from three known bot datasets: ConvAI2, WOCHAT, and DailyDialog. The basic statistics of each dataset are shown in Table 1. WOCHAT and DailyDialog are used to form a Control dataset. We describe each dataset below.

ConvAI2. We use four datasets from the Second Conversational Intelligence Challenge (ConvAI2). The motivation behind the ConvAI2 challenge was to develop new approaches toward open-domain chatbots (Zhang et al., 2018; Dinan et al., 2019). Overall, we use two types of ConvAI2 datasets: training datasets containing human-human interactions and evaluation-phase datasets containing human-bot interactions. First, we use the PERSONA-CHAT dataset, a collection of human-human interactions in which researchers instructed Mechanical Turk workers to converse with one another while assuming specific personas (i.e., profile descriptions). The ConvAI2 competition used automatic (e.g., perplexity) and human evaluation procedures. The human evaluation procedures involve either paid workers or volunteers who interact with models built using PERSONA-CHAT, after which the volunteers/workers are asked to rate their interactions.
Moreover, the human evaluator is instructed to interact with the bots in the same way as the Mechanical Turk users who participated in creating the PERSONA-CHAT dataset.
We use three "evaluation-phase" datasets: TOLOKERS, VOLUNTEERS, and INTERMEDIATE. The TOLOKERS dataset used solicited workers to chat with the models, similar to PERSONA-CHAT's Mechanical Turk setup. Specifically, TOLOKERS consists of data collected during the DeepHack.Chat hackathon via paid workers using the Yandex.Toloka service. The INTERMEDIATE dataset consists of more dialogues with the bots from DeepHack.Chat, but the interactions come from volunteers. Finally, the VOLUNTEERS dataset was collected during the final "wild evaluation" round of the ConvAI2 competition. Human-bot interactions were collected from volunteers through the Facebook Messenger and Telegram APIs.
Control Dataset. One of the research questions we explore in this paper is related to out-of-domain performance. Specifically, can we detect bots based on human responses when data were collected under different settings and where the conversation topics differ? To address this out-of-domain research question, we use two additional datasets collected in the Workshop on Chatbots and Conversational Agent Technologies (WOCHAT) (Kong-Vega et al., 2019). As part of a shared task, the workshop makes several bots available, has participants contribute new bots, and has participants interact with the bots while providing utterance-level feedback regarding their performance. We use two human-bot interaction datasets released by the organizers: the IRIS and TickTock datasets. Both IRIS (Banchs and Li, 2012) and TickTock (Yu et al., 2015) are retrieval-based bot variants that were "trained" on different datasets. Unlike the ConvAI2 task, humans are not instructed to converse with the bots as if they are another human. Thus, humans will sometimes ask things such as, "Are you a Robot?". This provides a unique aspect that increases cross-data differences.
Unfortunately, human-human conversations were not released from the WOCHAT shared task; the focus was annotating human-bot interactions. Thus, we need to augment WOCHAT with real human-human conversations. Our study uses the DailyDialog dataset as our source of human-human interactions (Li et al., 2017). Unlike the ConvAI2-related data, DailyDialog consists of conversations between English learners practicing everyday English dialogue.
Data Processing. We split the data into three groups: Unpaid (U), Paid (P), and Control (C). The U dataset consists of the PERSONA-CHAT, INTERMEDIATE, and VOLUNTEERS ConvAI2 datasets. The P dataset consists of PERSONA-CHAT and TOLOKERS; the name Paid comes from the fact that the TOLOKERS were paid as part of a crowdsourcing task. We split this group from the others in case payment affects conversation behavior. The C dataset consists of IRIS, TickTock, and DailyDialog. Each dataset is divided into 70%, 10%, and 20% training, validation, and test splits, respectively. It is important to note that the same PERSONA-CHAT training, validation, and test examples are used in both the U and P datasets.
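The per-dataset splits above can be sketched as follows. This is a minimal illustration: the paper does not specify its shuffling or seeding procedure, so the random shuffle here is an assumption.

```python
import random

def split_conversations(conversations, seed=0):
    """Split a list of conversations into 70%/10%/20%
    train/validation/test portions, mirroring the per-dataset
    splits described above (shuffling is an illustrative choice)."""
    rng = random.Random(seed)
    shuffled = conversations[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(0.7 * n)
    n_val = int(0.1 * n)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```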

Method
To detect human-bot conversations, we explore two types of features below: Content and Stylistic features.

Content Features
We define content information as features describing "what" humans and bots say in their interactions. Specifically, we describe two sets of content features: bag-of-words and embeddings.
Bag-of-words. As a simple baseline, we use TF-IDF-weighted unigrams from a dialog to detect whether a bot is part of the conversation. We explore three settings for the TF-IDF features: human-only, unknown-only, and human-unknown.
The unknown user in the conversation can be either a bot or a human.
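The three settings and the bag-of-words baseline can be sketched as below. This is a minimal sketch assuming dialogues are stored as (speaker, utterance) pairs with speakers labeled 'human' and 'unknown'; the data format, hyperparameters, and preprocessing are illustrative assumptions, not the paper's grid-searched configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Which speakers' utterances each feature setting uses.
SETTINGS = {"human-only": {"human"},
            "unknown-only": {"unknown"},
            "human-unknown": {"human", "unknown"}}

def dialog_document(dialog, setting):
    """One TF-IDF 'document' per dialogue: the concatenation of the
    utterances kept under the given setting."""
    keep = SETTINGS[setting]
    return " ".join(u for s, u in dialog if s in keep)

def train_bow_detector(dialogs, labels, setting="human-only"):
    """TF-IDF-weighted unigrams plus logistic regression."""
    docs = [dialog_document(d, setting) for d in dialogs]
    vectorizer = TfidfVectorizer(ngram_range=(1, 1))
    X = vectorizer.fit_transform(docs)
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    return vectorizer, clf
```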

Stylistic Features
The stylistic features encode "how" humans and bots speak in their interactions. We use two sets of stylistic features: Linguistic Inquiry and Word Count (LIWC) and linguistic accommodation.

LIWC. We experiment with the psychologically validated word categories (e.g., positive emotion, cognitive, and social processes) in LIWC as features. These lexicons might reveal more about a writer's thought processes, emotional states, and intentions. For the LIWC features, we use the same word categories described in the Linguistic Accommodation section below to train our LIWC-based classifier. Specifically, we use LIWC to process each utterance. Next, each utterance's LIWC scores are averaged together to form a dialogue-specific feature vector. We experiment with three LIWC feature variants: human-only, unknown-only, and human-unknown. Each setting averages the LIWC scores across a different set of utterances, similar to the bag-of-words and embedding features.
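The per-utterance averaging described above can be sketched as follows. Because LIWC itself is a proprietary tool, the `liwc_scores` callable here is a stand-in assumption: it is expected to return a {category: score} dict for an utterance.

```python
def dialogue_liwc_features(dialog, liwc_scores, categories,
                           setting="human-unknown"):
    """Average per-utterance LIWC category scores into one
    dialogue-level feature vector. `setting` picks whose utterances
    are averaged, mirroring the human-only / unknown-only /
    human-unknown variants."""
    keep = {"human-only": {"human"},
            "unknown-only": {"unknown"},
            "human-unknown": {"human", "unknown"}}[setting]
    rows = [liwc_scores(u) for s, u in dialog if s in keep]
    if not rows:
        return [0.0] * len(categories)
    return [sum(r.get(c, 0.0) for r in rows) / len(rows)
            for c in categories]
```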
Linguistic Accommodation. We use the method proposed by Danescu-Niculescu-Mizil et al. (2011) to analyze linguistic alignment. It includes two primary measures at both the group and conversation levels: baseline word usage and alignment. The group-level baseline word use corresponds to the rate at which a person uses a given word category when it has not been used previously in a conversation. The group-level alignment score reflects the proportionate increase, compared to the baseline score, in the likelihood of the word being used when it has been used previously in the conversation. Similarly, the conversation-level metrics measure accommodation and usage between a human and an Unknown (bot or human) user in each dialogue D_i. We measure the probability of seeing a word category w_c (e.g., a LIWC category) given that w_c appeared in the previous utterance, P(w_c ∈ u_i^o | w_c ∈ u_{i-1}^h), and the baseline probability of seeing w_c in the conversation, P(w_c ∈ u_i^o). Next, the accommodation score acc(c) is calculated by taking the difference between both probabilities, acc(c) = P(w_c ∈ u_i^o | w_c ∈ u_{i-1}^h) − P(w_c ∈ u_i^o). These empirical probabilities are at the conversation level (i.e., measuring how two specific users align). To obtain the group-level estimates, we simply average the accommodation scores acc(c) over all conversations D_i for each word category. We use the following 17 LIWC categories (Pennebaker et al., 2015): i, you, we, they, social, cogproc, posemo, negemo, article, prep, certain, conj, discrep, negate, pronoun, quant, and tentat. We experiment with two main settings as described for the previous methods: Human-Only and Unknown-Only. The Human-Only setting consists of just using the known human's alignment scores in each conversation, and vice-versa for the Unknown-Only setting.
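The accommodation computation above can be sketched as follows, for the Unknown-Only direction. Conversations are assumed to be lists of (speaker, tokens) pairs; smoothing and filtering choices are illustrative assumptions rather than the paper's exact implementation.

```python
def accommodation(conversations, category_words):
    """Group-level accommodation for one word category, following the
    Danescu-Niculescu-Mizil et al. (2011) measure described above:
    for each conversation, estimate
        P(w_c in u_i^o | w_c in u_{i-1}^h)  -  P(w_c in u_i^o)
    for the Unknown speaker, then average over conversations."""
    vocab = set(category_words)
    scores = []
    for conv in conversations:
        triggered, trig_total = 0, 0  # category used after human used it
        used, total = 0, 0            # baseline usage by Unknown speaker
        prev_speaker, prev_has = None, False
        for speaker, tokens in conv:
            has = any(t in vocab for t in tokens)
            if speaker == "unknown":
                total += 1
                used += has
                if prev_speaker == "human" and prev_has:
                    trig_total += 1
                    triggered += has
            prev_speaker, prev_has = speaker, has
        if trig_total and total:
            scores.append(triggered / trig_total - used / total)
    return sum(scores) / len(scores) if scores else 0.0
```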
Model Training Details. For the content feature sets, we train a Logistic Regression classifier from the Scikit-Learn package (Pedregosa et al., 2011). Using the validation split for each dataset, we grid-search over the C-values {.0001, .001, .01, .1, 1., 10.}, the logistic regression class weight parameters {None, balanced}, and normalization procedures {standardize, unit normalize, None}. For the stylistic features, we train a Random Forest classifier from the Scikit-Learn package, again grid-searching over its hyperparameters using the validation split. The accommodation probabilities are calculated both from the perspective of the Unknown user in the conversation and from the human's perspective.
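The validation-split grid search for the content classifier can be sketched as below. Macro F1 as the selection metric is an assumption (the paper reports Macro F1 but does not state the selection criterion explicitly).

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer, StandardScaler

def grid_search_lr(X_train, y_train, X_val, y_val):
    """Grid-search the C values, class weights, and normalization
    options described above, keeping the model with the best
    validation Macro F1."""
    best, best_f1 = None, -1.0
    for params in ParameterGrid({
            "C": [1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0],
            "class_weight": [None, "balanced"],
            "norm": ["standardize", "unit", None]}):
        steps = []
        if params["norm"] == "standardize":
            steps.append(("scale", StandardScaler()))
        elif params["norm"] == "unit":
            steps.append(("scale", Normalizer()))
        steps.append(("clf", LogisticRegression(
            C=params["C"], class_weight=params["class_weight"],
            max_iter=1000)))
        model = Pipeline(steps).fit(X_train, y_train)
        f1 = f1_score(y_val, model.predict(X_val), average="macro")
        if f1 > best_f1:
            best, best_f1 = model, f1
    return best, best_f1
```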

Results
In this section, we report two sets of results. First, in Subsection 5.1, we present the performance of the bot detection models described in Section 4. Beyond the methods described in Section 4, we also compare three baselines: Most Frequent, Most Infrequent, and Stratified. The Most Frequent baseline predicts the most frequent class for every example (i.e., human-human), the Most Infrequent baseline predicts the most infrequent class (i.e., human-bot), and the Stratified baseline makes random predictions proportional to each class's frequency. Second, in Subsection 5.2, we present a fine-grained analysis of the linguistic accommodation results on the ConvAI2 datasets. For all results, we report the Macro F1 (the average F1 for the human-human and human-bot conversation classes).
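The Most Frequent baseline and the Macro F1 metric used throughout can be sketched as follows (a minimal sketch; label names match the two classes above).

```python
from collections import Counter

def f1(y_true, y_pred, positive):
    """Binary F1 treating `positive` as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def macro_f1(y_true, y_pred, classes=("human-human", "human-bot")):
    """Average of the per-class F1 scores, as reported in all results."""
    return sum(f1(y_true, y_pred, c) for c in classes) / len(classes)

def most_frequent_baseline(y_train, y_test):
    """Predict the majority training class for every test example."""
    label = Counter(y_train).most_common(1)[0][0]
    return [label] * len(y_test)
```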

Bot Detection Experiments
The source dataset results are reported in Table 2.
Overall, we find that content features are the most predictive for bot detection when training and testing on splits from the same dataset. Furthermore, we find that the BERT-based models outperform the Bag-of-Words models on average. For instance, the Human Bag-of-Words model on dataset C obtains an F1 of .939, while the Human BERT model obtains an F1 of .987. Likewise, for the stylistic features, we find that simply using LIWC works better than using the accommodation features alone for dataset C. However, combining the Accommodation and LIWC features from both the Human and the Bot is better than using either feature set individually, with an F1 of .729.
The cross-dataset bot detection results are presented in Table 3. Specifically, the scores are from experiments where we train on a source dataset and evaluate each model on a target dataset's test split (i.e., source → target). Overall, we make three major findings. First, for both Content and Stylistic features, we find that analyzing the known human's language in each conversation is more informative than analyzing the bot's content. For instance, the average (AVG) score for human bag-of-words is .541, while the bot bag-of-words model's AVG result is more than 4 points lower (.499). We have similar findings between human LIWC (.502) and bot LIWC (.478). We find that combining both bot and human LIWC improves the AVG performance over the human-only LIWC model, with an F1 of .528. Second, while the BERT-based model performs better when applied to data from the same dataset in Table 2 (e.g., U → U) in many settings, when the test dataset changes substantially (e.g., U → C), the generalization performance of BERT drops compared to using bag-of-words. This result is potentially caused by overfitting to random source-specific characteristics. Third, we find that Accommodation features outperform all other individual feature sets, with an AVG bot-detection Macro F1 of .584, with the exception of using both the human's and bot's accommodation features, which achieves a Macro F1 of .637. Interestingly, human accommodation information (.584) is more predictive than the bot's text (.462). Intuitively, the types of responses generated by a bot can differ substantially depending on the bot's training data (e.g., ConvAI2 vs. DailyDialog) and model (e.g., transformer vs. retrieval-based models). Hence, the human's responses are more consistent with regard to alignment. The best combination overall is the human's LIWC features with the human's accommodation features, achieving a Macro F1 of .660.
Yet, with the bot (Unknown) feature combination, the performance drops from .768 with Unknown LIWC to .674 after combining accommodation features.
In Table 4, we report the most informative accommodation features. The most informative feature is pronoun accommodation; see Section 6 for more details. Other informative features include the human's use of social, quant (quantifier), and tentative words.

Accommodation Analysis
In this section, we analyze the group-level accommodation scores. Overall, some indicators are robust across datasets, whereas others vary. For instance, for the general pronoun category, we find that humans generally accommodate less toward bots than bots accommodate toward them. This pronoun accommodation pattern is robust across both datasets. Similarly, for both datasets, humans accommodate more often than bots for social words. Yet, for negative emotion words (negemo), we find that bots accommodate less than humans in the TickTock dataset, yet more in the INTERMEDIATE dataset. In contrast, in the ConvAI2 and DailyDialog datasets, human accommodation is the same on average when two humans participate in a conversation together (not shown in the figures).

Discussion
Why is humans' accommodation such a powerful predictor that they are in a conversation with a bot, even when they are supposedly treating the conversation the same as a human-human interaction? Overall, we found two major reasons that explain the model's high performance. First, modern bots, from transformer-based to retrieval-based models, generate bad responses. For instance, in the conversation below from the VOLUNTEER dataset, the human and bot have already had a conversation about hair color:

Bot: What is your hair color?
Human: I said its black....
Bot: I know the feeling.
Human: What?
From the example, we can see that the bot asks about hair color, which was already discussed, and then responds with a nonsensical output, "I know the feeling." Thus, the final human response does not accommodate with "I" or any other first person singular pronoun. The lower alignment score for humans (compared to bots) for the word category I in Figure 2 supports this finding. Furthermore, from Table 4, we know that accommodation in terms of pronoun usage is highly predictive of a human-bot conversation. Similar posts can be found in all datasets, where bad outputs can explain why humans align differently with respect to "I" in bot-human interactions than in human-human interactions. This finding is interesting because it is similar to findings in human-human interactions, where alignment is positively correlated with understanding in cooperative decision-making tasks (Fusaroli et al., 2012; Kacewicz et al., 2014). If humans don't understand the conversation, they align less, at least with regard to pronoun usage in our case. In a second, more coherent conversation from the VOLUNTEER dataset, the final human response aligns with the first person singular word "me" after the bot's use of "I". A potential use that warrants future work in bot evaluation involves using accommodation information as an additional metric in human evaluation studies, if certain types of accommodation correlate with bad system responses. If human ratings of bot performance are noisy, accommodation information can provide insight into how well a bot performs, i.e., whether a human interacts with it differently than they do in human-human conversations.
The second reason for the strong predictive performance can be seen in adversarial conversations between the human and bot. Specifically, in instances where the human is trying to trick the bot or is simply asking questions, bot-human alignment can differ from human-human conversations. The example below from the TickTock dataset

Human: Are you a robot?
Bot: No.
Human: Where are you from?
shows repeated "you" usage by the user, who asks multiple questions in a row. Again, differences in pronoun usage can reflect adversarial users. In terms of bot evaluation, if a human is acting adversarially, and it is possible to differentiate adversarial interactions from the human simply not understanding, this can provide further insight into a bot's performance. More examples can be found in the Supplementary Material.

Conclusion
This study is an exploration of the levels of linguistic alignment in human-bot conversations and presents evidence about how bot-human interactions differ from human-human conversations. This work's insights have implications for future bot detection methodologies. Understanding how humans and bots converse, as manifested by their level of alignment, is a first step toward understanding human interactions with general conversational agents. Furthermore, as future work, we will explore the use of linguistic alignment to improve bot evaluation frameworks involving human-bot interactions. Moreover, we will also explore methods of controlling the accommodation level of the bots themselves. Given the implicit social implications of certain types of accommodation, controlling it can help in important applications (e.g., counseling or customer-service bots).