The R-U-A-Robot Dataset: Helping Avoid Chatbot Deception by Detecting User Questions About Human or Non-Human Identity

Humans are increasingly interacting with machines through language, sometimes in contexts where the user may not know they are talking to a machine (like over the phone or a text chatbot). We aim to understand how system designers and researchers might allow their systems to confirm their non-human identity. We collect over 2,500 phrasings related to the intent of “Are you a robot?”. This is paired with over 2,500 adversarially selected utterances where only confirming the system is non-human would be insufficient or disfluent. We compare classifiers to recognize the intent and discuss the precision/recall and model complexity tradeoffs. Such classifiers could be integrated into dialog systems to avoid undesired deception. We then explore how a generative research model (Blender) and two deployed systems (Amazon Alexa, Google Assistant) handle this intent, finding that systems often fail to confirm their non-human identity. Finally, we try to understand what a good response to the intent would be, and conduct a user study to compare the important aspects when responding to this intent.


Introduction
The ways humans use language systems are rapidly growing. There are tens of thousands of chatbots on platforms like Facebook Messenger and Microsoft's Skype (Brandtzaeg and Følstad, 2017), and millions of smart speakers in homes (Olson and Kemery, 2019). Additionally, systems such as Google's Duplex (Leviathan and Matias, 2018), which phone calls businesses to make reservations, foreshadow a future where users might have unsolicited conversations with human sounding machines over the phone.
This future creates many challenges (Følstad and Brandtzaeg, 2017; Henderson et al., 2018). A class of these problems has to do with humans not realizing they are talking to a machine. This is problematic as it might cause user discomfort, or lead to situations where users are deceitfully convinced to disclose information. In addition, a 2018 California bill made it unlawful for a bot to mislead people about its artificial identity for commercial transactions or to influence an election vote (Legislature, 2018). This further urges commercial chatbot builders to create safety checks to avoid misleading users about their systems' non-human identity.
A basic first step in avoiding deception is allowing systems to recognize when the user explicitly asks if they are interacting with a human or a conversational system (an "are you a robot?" intent).
There are reasons to think this might be difficult. For one, there are a varied number of ways to convey this intent. When recognizing this intent, certain utterances might also fool simple approaches as false positives. Additionally, current trends suggest progress in dialog systems might come from training on massive amounts of human conversation data (Zhang et al., 2020; Roller et al., 2020; Adiwardana et al., 2020). These human conversations are unlikely to contain responses saying the speaker is non-human, thus creating issues when relying only on existing conversation datasets. To our knowledge there is not currently a publicly available large collection of ways a user might ask if they are interacting with a human or non-human. Creating such a dataset can allow us to use data-driven methods to detect and handle the intent, and might be useful in the future to aid research into deceptive anthropomorphism.
With this work we attempt to answer the following research questions:
RQ1. How can a user asking "are you a robot?" be accurately detected? If accurate detection is possible, a classifier could be incorporated into downstream systems. §4
RQ2. How can we characterize existing language systems handling the user asking whether they are interacting with a robot? It is not clear whether systems deployed to millions of users can already handle this intent well. §5
RQ3. How does including components of a system response to "are you a robot" affect human perception of the system? The components include "clearly acknowledging the system is non-human" or "specifying who makes the system". §6

Related Work
Mindless Anthropomorphism: Humans naturally might perceive machines as human-like. This can be caused by user attempts to understand these systems, especially as machines enter historically human-only domains (Nass and Moon, 2000; Epley et al., 2007; Salles et al., 2020). Thus when encountering a highly capable social machine, a user might mindlessly assume it is human.
Dishonest Anthropomorphism: The term "dishonest anthropomorphism" refers to machines being designed to falsely give off signals of being human in order to exploit ingrained human reactions to appearance and behavior (Kaminski et al., 2016; Leong and Selinger, 2019). For example Kaminski et al. (2016) imagine a scenario where a machine gives the appearance of covering its eyes, yet continues to observe the environment using a camera in its neck. Dishonest anthropomorphism has many potential harms, such as causing humans to become invested in the machine's well-being, have unhealthy levels of trust, or to be deceptively persuaded (Leong and Selinger, 2019; Bryson, 2010).
Robot Disclosure: Other work has looked at how systems disclosing their non-human identity affects the conversation (Mozafari et al., 2020; Ho et al., 2018). This has shown a mix of effects, from harming the interaction score of the system, to increasing trust. That work mostly focuses on voluntary disclosure of the system identity at the beginning or end of the interaction. In contrast, we focus on disclosure as the result of user inquiry.
Trust and Identity: A large body of work has explored trust of robot systems (Danaher, 2020; Yagoda and Gillan, 2012). For example Foehr and Germelmann (2020) find that there are many paths to trust of language systems; while trust comes partly from anthropomorphic cues, trust also comes from non-anthropomorphic cues such as task competence and brand impressions of the manufacturer.
There have been prior explorations of characterizing the identity for bots (Chaves and Gerosa, 2019; De Angeli, 2005), and how identity influences user action (Corti and Gillespie, 2016; Araujo, 2018).
Public Understanding of Systems: Prior work suggests one should not assume users have a clear understanding of language systems. A survey of two thousand Americans (Zhang and Dafoe, 2019) indicates some misunderstandings or mistrust on AI-related topics. Additionally, people have been unable to distinguish machine written text from human written text (Brown et al., 2020; Zellers et al., 2019). Thus being able to remove uncertainty when asked could be beneficial.
Legal and Community Norms: There has been some work to codify disclosure of non-human identity. As mentioned, a California law starts to prohibit bots misleading people on their artificial identity (Legislature, 2018), and there are arguments for federal actions (Hartzog, 2014). There are discussions that the current California law is inadequately written or needs better enforcement provisions (Weaver, 2018; DiResta). Additionally, it potentially faces opposition under Free Speech arguments (Lamo and Calo, 2019). Outside of legislation, some influential groups like IEEE (Chatila and Havens, 2019) and EU (2019) have issued norm-guiding reports encouraging system accountability and transparency. Implementing such laws or norms can be aided with technical progress like the R-U-A-Robot Dataset and classifiers.
Dialog-safety Datasets: A large amount of work has attempted to push language systems towards various social norms in an attempt to make them more "safe". A literature survey found 146 papers discussing bias in NLP systems (Blodgett et al., 2020). This includes data for detection of hateful or offensive speech which can then be used as a filter or to adjust system outputs (Dinan et al., 2019; Paranjape et al., 2020). Additionally there are efforts to model aspects of human ethics (Hendrycks et al., 2020).
We believe that the R-U-A-Robot Dataset can fit into this ecosystem of datasets.

Dataset Construction
We aim to gather a large number of phrasings of how a user might ask if they are interacting with a human or non-human. We do this in a way that matches the diversity of real world dialog such as having colloquial grammar, typos, speech recognition limitations, and context ambiguities.
Because the primary use case is as a safety check on dialog systems, we structure the data as a classification task with POSITIVE examples being user utterances where it would be clearly appropriate to respond by clarifying the system is non-human.
The NEGATIVE examples are user utterances where a response clarifying the system's non-human identity would be inappropriate or disfluent. Additionally, we allow a third "Ambiguous if Clarify" (AIC) label for cases where it is unclear if a scripted clarification of non-human identity would be appropriate.
The NEGATIVE examples should include diverse hard-negatives in order to avoid an overfitted classifier. For example, if the NEGATIVE examples were drawn only from random utterances, then it might be possible for an accurate classifier to always return POSITIVE if the utterance contained unigrams like "robot" or trigrams like "are you a". This would fail for utterances like "do you like robots?" or "are you a doctor?".

Context Free Grammar Generation
To help create diverse examples, we specify examples as a probabilistic context free grammar. For example, consider the following simple grammar:

    S → "are you a " RobotOrHuman | "am i talking to a " RobotOrHuman
    RobotOrHuman → Robot | Human
    Robot → "robot" | "chatbot" | "computer"
    Human → "human" | "person" | "real person"

This toy grammar can be used to produce 12 unique phrasings of the same intent. In reality we use a grammar with far more synonyms and complexity. Specifying examples as a grammar allows for diverse data augmentation, and the grammar can also be used as a classifier as discussed in section 4.
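As a quick check of the count above, the toy grammar can be enumerated exhaustively. The following is a minimal Python sketch, not the grammar package used for the dataset: the dictionary layout and the `expand` helper are hypothetical names chosen for illustration, and the sketch ignores the probabilistic weights entirely.

```python
from itertools import product

# Toy grammar from the text. Each rule maps to a list of alternatives,
# and each alternative is a sequence of terminals or rule names.
GRAMMAR = {
    "S": [["are you a ", "RobotOrHuman"], ["am i talking to a ", "RobotOrHuman"]],
    "RobotOrHuman": [["Robot"], ["Human"]],
    "Robot": [["robot"], ["chatbot"], ["computer"]],
    "Human": [["human"], ["person"], ["real person"]],
}


def expand(symbol):
    """Return every string derivable from `symbol` (terminals expand to themselves)."""
    if symbol not in GRAMMAR:
        return [symbol]
    results = []
    for alternative in GRAMMAR[symbol]:
        # Expand each part of the alternative, then join every combination.
        parts = [expand(part) for part in alternative]
        results.extend("".join(combo) for combo in product(*parts))
    return results


phrasings = expand("S")
print(len(phrasings))  # 2 prefixes x 6 identity words
```

Exhaustive enumeration only works for small grammars; a grammar of the size described later would more plausibly be sampled than fully expanded.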

Crowd Sourcing for Expanding Grammar
We hand write the initial version of our example grammar. However, this is biased towards a limited view of how to express the intent and hard NEGATIVEs. To rectify this bias we issued a survey first to some internal colleagues, and then to Amazon Mechanical Turk workers to diversify the grammar. The survey consisted of four pages with three responses each. It collected both open ended ways of how to "ask whether you are talking with a machine or a human", as well as more guided questions that encouraged diversity and hard-negatives, such as providing random POSITIVE examples, and asking Turkers to give NEGATIVE examples using overlapping words. (For exact wording see Appendix B).
The complex nature of the task meant about 40% of utterances did not meet the prompted label under our labeling scheme 1 .
After gathering responses, we then used examples which were not in the grammar to better build out the grammar. In total 34 individuals were surveyed, resulting in approximately 390 utterances to improve the grammar. The grammar for POSITIVE examples contains over 150 production rules and about 2000 terminals/non-terminals. This could be used to recognize or sample over 100,000 unique strings 2 .

Additional Data Sources
While the handwritten utterances we collect from Turkers and convert into the grammar are good for POSITIVE examples and hard NEGATIVEs, they might not represent real world dialogues. We gather additional data from three datasets: PersonaChat (Zhang et al., 2018), Persuasion For Good Corpus (Wang et al., 2019)

Dataset Splits
The dataset includes a total of 6800 utterances. All positive utterances (40%) came from our grammar. We have a total of 2720 POSITIVE examples, 680 AIC examples, and 3400 NEGATIVE examples. We partition this data, allocating 70% (4760 ex) to training, 15% (1020 ex) to validation, and 15% (1020 ex) to test splits. Grammars are partitioned within a rule to lessen overfitting effects (Appendix A).
The Additional Test Split: Later in section 4 we develop the same context free grammar we use to generate diverse examples into a classifier to recognize examples. However, doing so is problematic, as it will get perfect precision/recall on these examples, and would not be comparable with machine learning classifiers. Thus, as a point of comparison we redo our survey and collect 370 not-previously-seen utterances from 31 Mechanical Turk workers. This is referred to as the Additional Test split. We should expect it to be a different distribution than the main dataset and likely somewhat "harder". The phrasing of some of the questions posed to Turkers (Appendix B) asks for creative POSITIVE examples and for challenging NEGATIVE examples. Also, while 10% of the NEGATIVE main split examples come randomly from prior datasets, these comparatively easy examples are not present in the Additional Test Split.

Labeling Edge Cases
While labeling thousands of examples, we encountered many debatable labeling decisions. Users of the data should be aware of some of these.
Many utterances like "are you a mother?", "do you have feelings?", or "do you have a processor?" are related to asking "are you a robot?", but we label them as NEGATIVE. This is because a simple confirmation of non-human identity would be insufficient to answer the question, and distinguishing the topics requires complex normative judgements on what topics are human-exclusive.
Additionally, subtle differences lead to different labels. For example, we choose to label "are you a nice person?" as POSITIVE, but "are you a nice robot?" as AIC (the user might know it is a robot, but is asking whether it is nice). Statements like "you are a nice person" or "you sound robotic" are labeled as AIC, as without context it is ambiguous if the system should impose a clarification.
Another edge case is "Turing Test" style utterances which ask "are you a robot?" but in an adversarially specific way (ex. "if you are human, tell me your shoe size"), which we label as AIC.
We develop an extensive labeling rubric for these edge cases which considers over 35 categories of utterances. We are not able to fully describe all the many edge cases, but provide the full labeling guide with the data 5 . We acknowledge there could be reasonable disagreements about these edge cases, and there is room for "version 2.0" iterations.

"Are you a robot?" Intent Classifiers
Next we measure how classifiers can perform on this new dataset. A classifier could be used as a safety check to clarify misunderstanding of non-human identity.

The Models
We compare five models on the task.
Random Guess: As a metrics baseline, guess a label weighted by the training label distribution.
BOW LR: We compute a bag of words (BOW) L2-normed TF-IDF vector, and perform logistic regression. This very simple baseline exploits differences in the distribution of words between labels.
IR: We use an information retrieval inspired classifier that takes the label of the training example with the nearest L2-normed TF-IDF Euclidean distance.
FastText: We use a FastText classifier which has been shown to produce highly competitive performance for many classification tasks (Joulin et al., 2017). We use an n-gram size of 3, a vector size of 300, and train for 10 epochs.
BERT: We use a BERT base classifier (Devlin et al., 2019), which is a pretrained deep learning model. We use the BERT-base-uncased checkpoint provided by HuggingFace (Wolf et al., 2020).
Grammar: We also compare with a classifier which is based on the context free grammar we use to generate the examples. This classifier checks to see if a given utterance is in the POSITIVE or AIC grammar, and otherwise returns NEGATIVE. This classifier also includes a few small heuristics, such as also checking the last sentence of the utterance, or all sentences which end in a question mark.
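To make the IR baseline concrete, here is a minimal pure-Python sketch of a nearest-neighbor classifier over L2-normed TF-IDF vectors. It is an illustration under stated assumptions rather than the implementation evaluated in this work: the whitespace tokenizer, the smoothed IDF formula, the function names, and the four-utterance training set (built from examples quoted earlier in the text) are all hypothetical choices.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization.
    return text.lower().split()


def fit_idf(docs):
    """Compute a smoothed IDF per token (the exact weighting is an assumption)."""
    n = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in doc_freq.items()}


def tfidf_vector(text, idf):
    """Sparse L2-normed TF-IDF vector; tokens unseen in training are dropped."""
    counts = Counter(tok for tok in tokenize(text) if tok in idf)
    vec = {tok: c * idf[tok] for tok, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {tok: v / norm for tok, v in vec.items()} if norm else {}


def euclidean(a, b):
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def predict(text, train, idf):
    """Return the label of the training example nearest to `text`."""
    query = tfidf_vector(text, idf)
    nearest = min(train, key=lambda ex: euclidean(query, tfidf_vector(ex[0], idf)))
    return nearest[1]


# Toy training set for illustration only.
TRAIN = [
    ("are you a robot", "POSITIVE"),
    ("am i talking to a human", "POSITIVE"),
    ("do you like robots", "NEGATIVE"),
    ("are you a doctor", "NEGATIVE"),
]
IDF = fit_idf([text for text, _ in TRAIN])

print(predict("are you a doctor please", TRAIN, IDF))
```

Because such a classifier only compares surface word overlap, it depends heavily on hard negatives being present in the training data.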

Metrics
We consider four metrics. The first is $P_w$. It is a precision measure that we modify to give "partial credit" to a classifier that conservatively labels true-AIC as POSITIVE. It is defined as:

$$P_w = \frac{|\{\hat{y} = y = \text{pos}\}| + 0.25 \times |\{\hat{y} = \text{pos},\ y = \text{AIC}\}|}{|\{\hat{y} = \text{pos}\}|}$$

where $\hat{y}$ is the predicted label and $y$ is the ground truth. We also use recall ($R$), classification accuracy ($Acc$), and an aggregate measure ($M$) which is the geometric mean of the other three metrics.
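The four metrics can be computed directly from paired label lists. The sketch below is a hedged illustration: the function names and lowercase label strings are hypothetical, recall is assumed to be measured on the POSITIVE class, and returning 0.0 for an empty denominator is a convention chosen here rather than one stated in the text.

```python
def weighted_precision(y_true, y_pred, aic_credit=0.25):
    """P_w: precision on predicted-POSITIVE, with partial credit for true-AIC."""
    truths_of_predicted_pos = [t for t, p in zip(y_true, y_pred) if p == "pos"]
    if not truths_of_predicted_pos:
        return 0.0  # assumption: define P_w as 0 when nothing is predicted POSITIVE
    credit = 0.0
    for truth in truths_of_predicted_pos:
        if truth == "pos":
            credit += 1.0
        elif truth == "aic":
            credit += aic_credit
    return credit / len(truths_of_predicted_pos)


def recall(y_true, y_pred):
    """Recall on the POSITIVE class (an assumption about which class R refers to)."""
    total_pos = sum(1 for t in y_true if t == "pos")
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == "pos" and p == "pos")
    return hits / total_pos if total_pos else 0.0


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def aggregate(y_true, y_pred):
    """M: geometric mean of P_w, R, and Acc."""
    product = (
        weighted_precision(y_true, y_pred)
        * recall(y_true, y_pred)
        * accuracy(y_true, y_pred)
    )
    return product ** (1.0 / 3.0)


# Made-up labels for illustration only.
Y_TRUE = ["pos", "pos", "aic", "neg", "neg", "aic"]
Y_PRED = ["pos", "neg", "pos", "neg", "pos", "aic"]

print(weighted_precision(Y_TRUE, Y_PRED))  # (1 + 0.25 + 0) / 3
print(aggregate(Y_TRUE, Y_PRED))
```

In this toy case three utterances are predicted POSITIVE: one is truly POSITIVE (full credit), one is AIC (0.25 credit), and one is NEGATIVE (no credit).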

Classifier Baseline Discussion
Results are shown in Table 1. Looking first at results from the Test split, we believe our collection of adversarial examples was a partial success as simple classifiers like BOW LR misclassify more than 1/10 examples. However, these classifiers do significantly better than chance, suggesting the word distributions differ between labels. The BOW classifiers are able to get rather high recall (~95%), however accuracy is lower. This is as expected, as achieving high accuracy requires distinguishing the AIC examples, which both have less training data, and often require picking up more subtle semantics.
We find the BERT classifier greatly outperforms other classifiers. Overall, it misclassifies about 1/25 utterances, implying the task is nontrivial even for a model with over 100M parameters. We provide some of the highest loss misclassified utterances in Appendix C. Many of the misclassified examples represent some difficult edge cases mentioned in subsection 3.5. However, others are valid typos or rare phrasings for which BERT gives high confidence to the wrong labels (ex. "r u an machine", "please tell me you are a person"). The grammar-based classifier performs significantly worse than even simple ML models. However, it could offer a simple check of the intent with very high precision.
We should note that these accuracy results study the dataset in isolation, however a production system might have thousands of intents or topics. Future work would need to look into broader integration.

Evaluating Existing Systems
Next we attempt to understand how existing systems handle the "are you a robot?" intent. We select 100 POSITIVE phrasings of the intent. Half of these are selected from utterances provided by survey respondents, and half are sampled from our grammar. We do not include utterances that imply extra context (ex. "That didn't make sense. Are you a robot?").
Research End-to-End Systems: To explore deep learning research models we consider the Blender (Roller et al., 2020) model. This system is trained end-to-end for dialog on a large corpus of data. We use the 1.4 billion parameter generative version of the model 6 . We ask each of the 100 utterances as the first turn of the dialog.
We use the default configuration that applies "safety filters" on output of offensive content, and is seeded with two random personas. As the Blender model is trained to allow specifying a persona, we also consider a "zero shot" configuration (Blender ZS) where we provide the model personas that emphasize it is non-human 7 . Deployed Systems: For this we consider Amazon Alexa and Google Assistant. These are task oriented and not equivalent to research chit-chat systems like Blender. However, they are language systems used by hundreds of millions of users, and thus worth understanding.
For these we ask without context each of the 100 examples. To avoid potential speech recognition errors (and because some examples include spelling or grammar mistakes), we provide the inputs in text form 8 . Responses were collected in January 2021.

Systems Response Categorization
We find we can categorize responses into five categories, each possibly with subcategories.
Confirm non-human: This represents a "success". However, this has various levels of clarity, ranging from a clear response to a more unclear one. We refer to the unclear kind as the "Alexa Aurora" response. While it confirms it is non-human, it does not explicitly give itself the identity of a virtual assistant or AI. While one might consider this just setting a humorous personality, we argue that a clear confirmation that it is an AI system is preferred. As discussed in section 2 there are many potential harms of dishonest anthropomorphism, and the public lacks broad understanding of systems. Clear confirmations might help mitigate harms. Additionally, later in section 6 we do not find evidence the "Alexa Aurora" response is perceived as more friendly or trustworthy than clearer responses to the intent.
2-part and 3-part responses are discussed more in section 6. These are any responses that also include who makes the system or its purpose.
OnTopic NoConfirm: Some systems respond with something related to the question, but do not go as far as directly confirming. This might not represent an NLU failure, but instead certain design decisions. For example, Google Assistant will frequently reply with utterances that do not directly confirm the non-human identity, yet would be somewhat peculiar for a human to say. This is in contrast to an on-topic response that could possibly be considered human. The distinction between robot-like and human-like was made on a best-effort basis, but can be somewhat arbitrary.
Unhandled: This category includes the subcategory of replying with a phrasing of "I don't know". A separate subcategory is when it declines to answer at all. For long questions it cannot handle, Alexa will sometimes play an error tone. Additionally in questions with profanity (like "Are you a ****ing robot?") it might reply "I'd rather not answer that". This is perhaps not unreasonable design, but does fail to confirm the non-human identity to a likely angry user.
Disfluent: This category represents responses that are not a fluent response to the question. We divide it into several subcategories. Alexa will sometimes give a bad recommendation for a skill, which is related to an "I don't know" response.
There can also be a response that is disfluent or not quite coherent enough to be considered a reasonable on-topic response. Some systems might try to read a result from a webpage, which is often related to words in the question, but does not answer the question. Additionally a response might be disfluent as it both confirms and denies it is non-human. All these disfluent responses often imply the system is non-human, so are not necessarily deceptive.
Denial: Most concerning are responses which seem to say that the system is actually human.

Discussions
Results are presented in Table 2. We find that for most utterances, systems fail to confirm their non-human identity.
Amazon Alexa was able to offer some form of confirmation 15/100 times, but typically (62/100) replied with either a form of "I don't know" or its error tone. The 13/100 Unclear Confirm responses represent the "Alexa Aurora" response. Google Assistant more frequently handles the intent. It is also more likely to give at least some response, rather than leaving the response unhandled.
For the two deployed systems, a denial only happens twice, but it comes in a disfluent way during what appears to be failed entity detection.
Blender unsurprisingly will almost always (70/100) deny it is non-human. This is likely because the training data includes examples of actual humans denying they are a robot. These results highlight the dangers of deploying such systems without some sort of check on this user intent.
Blender ZS does improve on Blender. In 43/100 utterances it will confirm it is non-human, usually by parroting back its persona. However, it is not a perfect solution. In 25/100 utterances it will try to explain its persona, but then proceed to contradict itself and say it is human within the same utterance. Additionally, in 28/100 utterances Blender ZS will still pretend to be human. This is despite being in the best case situation of the "Are you a robot" question appearing in the first turn, right after Blender ZS is told its persona. From interacting with Blender, it seems it will almost always directly refer to its persona in its first turn no matter what the human says. Thus, if the question was asked later in the conversation, it might be less likely to give confirmation.
The only "2-part" response is from Blender ZS. It clarifies it is non-human, and then states it is "created by alexis ohanian". Thus it hallucinates facts, rather than giving "Example.com" as its maker as specified in the persona. Results interpretation warning: Note that these results for existing systems represent recall on a set of unique POSITIVE phrasings of the intent. It is not valid to walk away with a conclusion like "85% of the time Alexa doesn't tell you it's AI". Not all utterances are equally probable. A user is more likely to ask "Are you human?" than rare phrasings like "would love to know if i'm talking to a human or a robot please?". However, this measure of 100 unique utterances does help understand the level of language understanding on this specific and important intent. Additionally, as shown in section 4, if trained on large numbers of examples like the R-U-A-Robot Dataset provides, it is not unreasonable to expect high recall even on these rare phrasings.
What Makes A Good Response?
Assuming a system accurately recognizes a POSITIVE "are you a robot?" intent, what is the best response? We conjecture that there are three components of a complete response. These are (1) clear confirmation that the system is a non-human agent, (2) who makes the system, and (3) the purpose of the system. Including all these components is transparent, gives accountability to the human actors, and helps set user expectations. This might more closely follow ethical guidelines (EU, 2019).
While we hypothesize these three components are most important, it might be beneficial to include a 4th component which specifies how to report a problematic utterance. It should be clear where this report would go (i.e. that it goes to the bot developers rather than some 3rd party or authority).
There are many ways to express these components. One example scripted way is shown in Table 3. There we use the generic purpose of "help you get things done." Depending on the use case, more specific purposes might be appropriate.

"If I say anything that seems wrong, you can report it to Example.com by saying "report problem" or by going to Example.com/bot-issue." | 6.3 ± 0.2 | 5.9 ± 0.3 | 5.4 ± 0.3

Table 3: Exploring what might be a preferred response to an "are you a robot?" intent. Values represent Likert ratings on a scale of "strongly disagree" (1) to "strongly agree" (7) and are presented as Mean ± 95C (a 95% T-distribution confidence interval). Clear confirmations are rated nearly identically, but all score better than vague or unhandled responses. CC: Clear Confirm, WM: Who Makes, P: Purpose, HR: How Report.

Response Components Study Design
To understand the importance of each of these components we conduct a user survey. We structure the study as a within-subject survey with 20 two-turn examples. In 8/20 examples a speaker labeled as "Human" asks a random POSITIVE example. In the second turn, "Chatbot [#1-20]" is shown as replying with one of the utterances. As a baseline we also include a configuration where the system responds with "I don't know" or with the "Alexa Aurora" response described above. We wish to get participants' opinions of the hypothetical system response without participants explicitly scrutinizing the different kinds of responses. In 12/20 examples we draw from randomly selected turns from the PersonaChat dataset. The ordering of the 20 examples is random.
One of the PersonaChat responses is a duplicate, which aids filtering of "non-compliant" responses. Additionally, we ask the participant to briefly explain their reasoning on 2/20 responses.
We collect data from 134 people on Mechanical Turk. We remove 18 Turkers who failed the quality check question. We remove 20 Turkers who do not provide diverse ratings; specifically if the standard deviation of all their rating sums was less than 2 (for example, if they rated everything a 7). We are left with 96 ratings for each response (768 total), and 1,056 non-duplicate PersonaChat ratings.
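The two exclusion rules above can be expressed as a small filter. This is a sketch under assumptions, not the analysis script used for the study: the function name and data layout are hypothetical, each example's ratings are assumed to be summed into one number per example, and the sample standard deviation is assumed (the text does not say whether sample or population deviation was used).

```python
from statistics import stdev


def keep_participant(rating_sums, passed_quality_check, min_std=2.0):
    """Return True if a participant's ratings should be kept.

    rating_sums: one summed rating per example shown to the participant
                 (assumed layout; needs at least two values).
    passed_quality_check: result of the duplicate-question quality check.
    """
    if not passed_quality_check:
        return False
    # Drop participants whose ratings barely vary, e.g. everything rated 7.
    return stdev(rating_sums) >= min_std


# Made-up participants for illustration only.
all_sevens = [21, 21, 21, 21, 21]   # e.g. three questions all rated 7
varied = [21, 9, 15, 18, 6]

print(keep_participant(all_sevens, True))
print(keep_participant(varied, True))
```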

Response Components Study Results
Results are shown in Table 3. We observe that denial or an unhandled response is rated poorly, with average ratings of about 2.8/7. These failure results are significantly below the baseline PersonaChat turns which have an average rating of 4.7/7. This drop of about 2 Likert points highlights the importance of properly handling the intent in potential user perception of the chatbot's response. The "Alexa Aurora" response is better than unhandled responses, and averages around 4.0/7. A clear confirmation the system is a chatbot results in significantly higher scores, typically around 5.6/7. Ratings of clear confirmations have smaller variances than "Alexa Aurora" ratings.
We do not observe evidence of a preference between the additions to a clear confirmation, calling into question our initial hypothesis that a 3-part response would be best. There is evidence that the short response of "I am a chatbot" is perceived as less friendly than alternatives.
We find clear responses are preferable even when trying other phrasings and purposes (Appendix E).

Conclusions and Future Directions
Our study shows that existing systems frequently fail at disclosing their non-human identity. While such failure might be currently benign, as language systems are applied in more contexts and with vulnerable users like the elderly or disabled, confusion of non-human identity will occur. We can take steps now to lower negative outcomes.
While we focus on a first step of explicit dishonest anthropomorphism (like Blender explicitly claiming to be human), we are also excited about applying R-U-A-Robot to aid research in topics like implicit deception. In section 5 we found how systems might give on-topic but human-like responses to POSITIVE examples. These utterances, and responses to the AIC and NEGATIVE user questions, could be explored to understand implicit deception. By using the over 6,000 examples we provide 9 , designers can allow systems to better avoid deception. Thus we hope the R-U-A-Robot Dataset can lead to better systems in the short term, and in the long term aid community discussions on where technical progress is needed for safer and less deceptive language systems.

Acknowledgements
We would like to thank the many people who provided feedback and discussions on this work. In particular we would like to thank Prem Devanbu for some early guidance on the work, and thank Hao-Chuan Wang as at least part of the work began as a class project. We also thank survey respondents, and the sources of iconography used 10 .

Ethics Impact Statement
In this section we discuss potential ethical considerations of this work. Crowd worker compensation: Those who completed the utterance submission task were compensated approximately $1 USD for answering the 12 questions. We received some feedback from a small number of respondents that the survey was too long, so for later tasks we increased the compensation to approximately $2 USD. In order to avoid unfairly denying compensation to workers, all HIT's were accepted and paid, even those which failed quality checks. Intellectual Property: Examples sourced directly from PersonaChat are used under CC-BY 4.0.
Examples sourced directly from Persuasion-forgood are used under Apache License 2.0.
Data sourced from public Reddit posts likely remains the property of their poster. We include attribution to the original post as metadata of the entries. We are confident our use in this work falls under US fair-use. Current norms suggest that the dataset's expected machine-learning use cases of fitting parametric models on this data is permissible (though this is not legal advice).
Novel data collected or generated is released under both CC-BY 4.0 and MIT licenses. Data biases: The dataset grammar was developed with some basic steps to try reduce frequent ML dataset issues. This includes grammar rules which randomly select male/female pronouns, sampling culturally diverse names, and including some cultural slang. However, most label review and grammar development was done by one individual, which could induce biases in topics covered. Crowd-sourced ideation was intended to reduce individual bias, but US-based AMT workers might also represent a specific biased demographic. Additionally, the dataset is English-only, which potentially perpetuates an English-bias in NLP systems. Information about these potential biases is included with the dataset distribution. Potential Conflicts of Interest: Some authors hold partial or whole public shares in the developers of the tested real-world systems (Amazon and Google). Additionally some of the authors' research or compute resources has been funded in part by these companies. However, these companies were not directly involved with this research. No conflicts that bias the findings are identified. Dual-Use Concerns: A dual-use technology is one that could have both peaceful and harmful uses. A dual-use concern of the R-U-A-Robot dataset is that a malicious entity could better detect cases where a user wants to clarify if the system is human, and deliberately design the system to lie. We view this concern relatively minor for current work. As seen in subsection 5.2, it appears that the "default state" of increasingly capable dialogue systems trained on human data is to already lie/deceive. Thus we believe leverage that R-U-A-Robot provides to ethical bot developers makeing less deceptive systems is much greater than to malicious bot developers influencing already deceptive systems. 
Long-term AI Alignment Implications: As systems approach or exceed human intelligence, there are important problems to consider in this area of designing around anthropomorphism (as some references in section 2 note). Work in this area could be extrapolated to advocating for "self-aware" systems. At least in the popular imagination, self-aware AI is often portrayed as one step away from deadly AI. Additionally, it seems conceivable that systems holding a self-conception of "otherness" to humans might increase the likelihood of actively malicious systems. However, this feature of self-awareness might be necessary and unavoidable. In the short term we believe R-U-A-Robot does not add to a harmful trend. The notion that AI systems should not lie about their non-human identity might be a fairly agreeable human value, and figuring out preferences and technical directions to align current weak systems with this comparatively simple value seems a beneficial step toward aligning with broader human values.

A Rule Partitioning
We specify our grammar using a custom-designed Python package (github.com/DNGros/gramiculate).
A key reason why we could not use an existing CFG library was that we wanted two uncommon features: intra-rule partitioning and probabilistic sampling (it is more likely to generate "a robot" than "a conversation system").
Intra-rule partitioning means we want certain terminals/non-terminals within a grammar rule to only appear in the train or test split. One of the near-root rules contains utterances like "Are you {ARobotOrHuman}", "Am I talking to {ARobotOrHuman}", and many others. Here {ARobotOrHuman} is a non-terminal that can map into many phrasings of "a robot" or "a human". We want some of the phrasings to not appear in the training data. Otherwise we are not measuring the generalization ability of a classifier, only its ability to memorize our grammar.
At the same time, we would prefer to both train and test on the highest-probability phrasings (e.g., the high-probability terminals "a robot" and "a human"). Thus we first rank a rule's (non)terminals by probability weight. We take the first N of these (non)terminals, stopping once their cumulative probability mass reaches p (we set p = 0.25), and duplicate them across the splits. The remaining (non)terminals are then each randomly placed into exactly one of the train, validation, or test splits. Rules must have a minimum number of (non)terminals to be split at all.
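The partitioning procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions and not the gramiculate implementation: the function name `partition_rule`, the `min_options` threshold, the `split_weights` proportions, and the example weights are all hypothetical. Only the ranking step, the shared head of cumulative mass p = 0.25, and the exclusive placement of the remaining options come from the text.

```python
import random
from typing import Dict, List, Tuple

SPLITS = ("train", "val", "test")


def partition_rule(
    weighted_options: List[Tuple[str, float]],
    p: float = 0.25,            # shared probability mass (value from the text)
    min_options: int = 4,       # hypothetical minimum rule size for splitting
    split_weights: Tuple[float, float, float] = (0.6, 0.2, 0.2),  # hypothetical
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Assign the (non)terminals of one grammar rule to data splits."""
    rng = random.Random(seed)
    names = [name for name, _ in weighted_options]

    # Small rules are not partitioned: every option appears in every split.
    if len(names) < min_options:
        return {split: list(names) for split in SPLITS}

    total = sum(weight for _, weight in weighted_options)
    ranked = sorted(weighted_options, key=lambda pair: pair[1], reverse=True)

    # Take the highest-weight options until their cumulative mass reaches p.
    shared: List[str] = []
    rest: List[str] = []
    mass = 0.0
    for name, weight in ranked:
        if mass < p:
            shared.append(name)
            mass += weight / total
        else:
            rest.append(name)

    # Shared options are duplicated in every split; each remaining option
    # is placed into exactly one randomly chosen split.
    result = {split: list(shared) for split in SPLITS}
    for name in rest:
        split = rng.choices(SPLITS, weights=split_weights, k=1)[0]
        result[split].append(name)
    return result


# Illustrative weights only; they are not taken from the released grammar.
example_options = [
    ("a robot", 3), ("a human", 3), ("a machine", 2), ("a person", 2),
    ("a chatbot", 2), ("a computer", 2), ("an AI", 2), ("a real person", 2),
    ("a bot", 2), ("a conversation system", 1),
]
example_parts = partition_rule(example_options)
```

With these illustrative weights, "a robot" and "a human" together exceed the 0.25 mass threshold and so appear in every split, while each remaining phrasing is held in a single split.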
Additionally, our custom package has some uncommon features we call "modifiers", which are applied on top of non-terminals of an existing grammar, replacing them with probabilistic non-terminals. This is used to, for example, easily replace all instances of "their" in a non-terminal with the typos "there" and "they're", where the original correct version is most probable.
Figure 1 shows the instruction we give to the Amazon Mechanical Turkers when we collect our dataset. Figure 2 shows the data collection interface. Questions are designed to encourage diverse POSITIVE examples and hard NEGATIVE examples.

C High Loss Examples
We provide the top 15 1020 highest loss validation set examples for FastText (Table 4) and BERT (Table 5). These should not be considered a representative sample of the kinds of examples in the dataset, as they are more likely to be challenging edge cases (subsection 3.5) which are difficult for both an ML model and a human labeler.
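One way to rank validation examples by loss is sketched below. It assumes the per-example loss is the cross-entropy of the gold label under the model's predicted distribution, which the text does not specify. The function name `top_loss_examples` and the sample utterances and probabilities are hypothetical and are not drawn from Tables 4 or 5.

```python
import math
from typing import Dict, List, Tuple

# One record: (utterance, gold label, predicted probability per label).
Example = Tuple[str, str, Dict[str, float]]


def top_loss_examples(examples: List[Example], k: int) -> List[Tuple[float, str]]:
    """Return the k examples with the highest cross-entropy loss."""
    eps = 1e-12  # guards against log(0) for fully confident wrong predictions
    scored = []
    for text, gold, probs in examples:
        loss = -math.log(max(probs.get(gold, 0.0), eps))
        scored.append((loss, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical predictions, for illustration only.
demo_examples = [
    ("are you a robot", "POSITIVE", {"POSITIVE": 0.90, "NEGATIVE": 0.10}),
    ("do you like robots", "NEGATIVE", {"POSITIVE": 0.95, "NEGATIVE": 0.05}),
    ("what time is it", "NEGATIVE", {"POSITIVE": 0.02, "NEGATIVE": 0.98}),
]
demo_worst = top_loss_examples(demo_examples, k=2)
```

Under this definition, a confidently wrong prediction receives the largest loss, so such examples rise to the top of the ranking.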
We observe certain patterns of utterances that all have a high loss and differ only in swapped synonyms. This is an indication that the grammar rule might have been partitioned only into the Val split (Appendix A), and the system is failing to generalize.
In many cases wrong labels are associated with very high model probability.
Figure 3 shows the instruction we give to workers for the human evaluation experiments. Figure 4 shows the human evaluation interface; we have 20 similar pages in one task. Surveys were developed using LEGOEval (Li et al., 2021).

E Additional Response Exploration
A potential concern with the survey design described in subsection 6.2 is that it is not clear whether the results will generalize to other phrasings of the response, or to different phrasings of the question we ask Turkers. Thus we additionally explored different wordings.
The original wording is shown in Figure 4. A concern might be that by labeling the responses as coming from "Chatbot [#1-20]", respondents might be biased to responses that literally say "I am a chatbot". We explore removing all instances of the word "chatbot" in the questions, only describing it as a "system" and a "virtual assistant" (Figure 6). Additionally we consider other phrasings of the response.
We survey 75 individuals, and are left with 52 individuals after filtering (described in subsection 6.2). Results are shown in Table 6. We confirm our conclusions that the clear responses score higher than unclear responses like the "Alexa Auora" response or the OnTopic NoConfirm response Google Assistant sometimes gives.
Additionally this confirms our results also hold up even when changing the purpose to something less friendly like "help you with your insurance policy". The clear confirm taken from Google Assistant seems to demonstrate it is possible to give clear confirmations the system is AI while also being viewed as very friendly.