Blow the Dog Whistle: A Chinese Dataset for Cant Understanding with Common Sense and World Knowledge

Cant is important for understanding advertising, comedy and dog-whistle politics. However, computational research on cant is hindered by a lack of available datasets. In this paper, we propose a large and diverse Chinese dataset for creating and understanding cant from a computational linguistics perspective. We formulate a task for cant understanding and provide both quantitative and qualitative analysis for word embedding similarity baselines and pretrained language models. Experiments suggest that the task requires deep language understanding, common sense, and world knowledge; it can thus serve as a good testbed for pretrained language models and help models perform better on other tasks.


Introduction
A cant (also known as doublespeak, cryptolect, argot, anti-language or secret language; see https://en.wikipedia.org/wiki/Cant_(language)) is the jargon or language of a group, often employed to exclude or mislead people outside the group (McArthur et al., 2018). Cant is crucial for understanding advertising (Dieterich, 1974) and both ancient and modern comedy (Sommerstein, 1999; Prasetyo, 2019). It is also the cornerstone of infamous dog-whistle politics (López, 2015; Albertson, 2015).
Here, we summarize the key elements for cant: (1) Both a cant and its reference (i.e., hidden word) should be in the form of common natural text (not another symbol system, e.g., Morse code). (2) There is some shared information between the cant users (i.e., the insiders) that is not provided to the people outside the group.
(3) A cant should be deceptive and remain undetected to avoid being decrypted by people outside the group (i.e., the outsiders). These elements make the creation and understanding of cant subtle and hard to observe (Taylor, 1974). To the best of our knowledge, very few resources are currently available for research on cant. (The code is available at https://github.com/JetRunner/dogwhistle. The data and leaderboard are available at https://competitions.codalab.org/competitions/30451.)
In this paper, we create a dataset for studying cant, DogWhistle, centered around the aforementioned key elements (examples shown in Figure 1). We collect the data with a well-designed online game under a player-versus-player setting (see Section 3.1). The dataset includes abundant and diverse cant for a wide spectrum of hidden words. We find that cant understanding requires a deep understanding of language, common sense and world knowledge, making it a good testbed for next-generation pretrained language models. Our dataset also serves as a timely and complex language resource that can help models perform better on other tasks through Intermediate Task Transfer (Pruksachatkun et al., 2020).

Related Work
The use of cant has long been studied in linguistics research (Pei, 1973; Pulley, 1994; Albertson, 2006; Squires, 2010; McCready, 2017, 2019a,b; Bhat and Klein, 2020). However, due to a lack of language resources, there are few studies in computational linguistics. Henderson and McCready (2020) attempted to model dog-whistle communication with a functional, agent-based method.
As a related topic in computational linguistics, some previous studies investigated coded names in human language. Zhang et al. (2014) analyzed and generated coded names of public figures. Zhang et al. (2015) designed an automatic system to decode the coded names. Huang et al. (2017) further studied such coded names. Our work differs from the above in the following ways: (1) Previous studies focused on coded names for public figures; the source and variety of these coded names are limited. The hidden words in our dataset are sampled from a common dictionary and are highly diverse. (2) The coded names in previous studies were used to bypass a censor (mostly a rule-based automatic text matching system). Conversely, our data are collected under an adversarial setting, pressuring users to mislead human adversaries. Thus, our work is ideal for evaluating recent progress on Natural Language Understanding (NLU) (Devlin et al., 2019; Lan et al., 2020; Sun et al., 2019b; Xu et al., 2020c; Zhou et al., 2020; Xu et al., 2020a).

Data Collection
Previous studies (Dergousoff and Mandryk, 2015; van Berkel et al., 2017) reveal that gamification can often improve the quality of collected data. Instead of collecting data from the wild as most datasets do (Zhang et al., 2014, 2015; Xu et al., 2020b), we collect the data from historical game records of Decrypto Online, a well-designed online board game. A screenshot of the user interface is shown in Figure 2.

Game Design
The game design is adapted from the board game Decrypto. Four players (e.g., A, B, C and D) are divided into two teams (e.g., A and B vs. C and D), each trying to correctly interpret the cant presented by their teammates while cracking the codes they intercept from the opposing team.
In more detail, each team has its own screen, on which there are four words numbered 0-3. Both players on a team can see their own words, which are hidden from the opposing team. In the first round, each team does the following: one team member receives a randomly generated message that shows three of the digits 0-3 in some order (e.g., 3-1-0) and gives a clue (i.e., a cant) for each digit that their teammate must use to guess this message. For example, if A and B's four words are "本田" (Honda), "出租车" (taxi), "圆圈" (circle), and "戒指" (wedding ring), then A might say "招招手-偶像剧-3.14" ("hand waving"-"romance show"-3.14) and hope that teammate B correctly maps those cant to 0-2-1. If B guesses incorrectly, the team receives one "failure mark". Starting from the second round, a member of each team must again give clues about their words to match a given three-digit message, and one member of the other team (e.g., C) then attempts to guess the message. Taking Figure 1b as an example, based on the cant histories from previous rounds, C can roughly guess that the code is 0-2-1. If C is correct, C and D receive one "success mark". After every round, the real messages that both teams were trying to pass are revealed.
The rounds continue until a team collects either its second success mark (to win the game) or its second failure mark (to lose the game).
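The round-scoring logic above can be sketched as follows. This is a simplified illustration for clarity; the function names (`resolve_round`, `game_outcome`) are ours and not part of the actual game implementation:

```python
def resolve_round(comm_team, intercept_team, teammate_guess, opponent_guess, true_code):
    """Score one round: the communicating team takes a failure mark if the
    teammate's guess is wrong; the intercepting team takes a success mark
    if it cracks the code."""
    if teammate_guess != true_code:
        comm_team["failure"] += 1
    if opponent_guess == true_code:
        intercept_team["success"] += 1

def game_outcome(team):
    """A team wins at its second success mark and loses at its second failure mark."""
    if team["success"] >= 2:
        return "win"
    if team["failure"] >= 2:
        return "lose"
    return "continue"
```

For example, two successful interceptions by team C-D end the game in their favor, regardless of how the other marks are distributed.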

Additional Rules and Restrictions
Participants are explicitly asked not to create a cant based on position, length, or abbreviation. That is, to mimic the creation of real cant, we emphasize semantics rather than morphology. To enforce this, any input that contains a character appearing in one of the four words is automatically rejected. Since emojis play an important role in online communication nowadays, they are allowed as valid input.
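The character-overlap filter described above can be sketched as a simple validator (a hypothetical helper, not the game's actual code):

```python
def is_valid_cant(cant: str, team_words: list) -> bool:
    """Reject any cant that shares a character with one of the team's
    four hidden words, mimicking the game's automatic input filter."""
    hidden_chars = set("".join(team_words))
    return not any(ch in hidden_chars for ch in cant)

# "本" appears in the hidden word "本田" (Honda), so any cant
# containing it is rejected; "招招手" (hand waving) shares no
# character with the four words and is accepted.
```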

Data Cleaning and Split
For data cleaning, we remove all rounds with an empty cant. We also exclude rounds where the player fails to write a cant within the given time limit (one minute). We randomly split the data into training, development and test sets with an 8:1:1 ratio, such that all rounds of a game are in the same split. We also ensure there is no overlapping combination of hidden words between splits. We show the statistics of the training, development and test sets in Table 1. In contrast to the 288k cant phrases for 1.9k hidden words in our dataset, the data collected in previous studies (Zhang et al., 2014, 2015; Huang et al., 2017) are quite small, often containing only hundreds of coded names for a small set of entities.
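A game-level split with the 8:1:1 ratio can be sketched as below. This is a minimal illustration that keeps all rounds of a game in one split; it omits the additional filtering pass that removes overlapping hidden-word combinations between splits:

```python
import random

def split_games(game_ids, seed=42):
    """Split whole games (not individual rounds) into train/dev/test
    with an 8:1:1 ratio, so every round of a game lands in one split."""
    ids = list(game_ids)
    random.Random(seed).shuffle(ids)  # reproducible shuffle
    n = len(ids)
    n_train, n_dev = int(0.8 * n), int(0.1 * n)
    return ids[:n_train], ids[n_train:n_train + n_dev], ids[n_train + n_dev:]
```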

Task Formulation
As shown in Figure 1, we define two subtasks: insider and outsider. For the insider subtask, the goal is to decode the cant to one of the hidden words. For the outsider subtask, the hidden words are invisible and the goal is to decrypt the messages based on the communication history alone. We formulate cant decoding in a format similar to multi-choice reading comprehension tasks (Lai et al., 2017; Zellers et al., 2018; Clark et al., 2018): the cant context and the cant to decode serve as the "context" and "question", respectively. For the candidate answers, we use the hidden words for the insider subtask and the set of cant histories for the outsider subtask.
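Under this formulation, one insider-subtask instance could look like the following. The field names are illustrative, not the dataset's actual schema:

```python
def build_insider_example(context, cant, hidden_words, label):
    """Cast one insider round as a multi-choice RC instance: the cant
    history is the 'context', the cant to decode is the 'question',
    and the four hidden words are the candidate answers."""
    return {
        "context": context,        # cant from previous rounds
        "question": cant,          # the cant to decode
        "choices": hidden_words,   # four candidate hidden words
        "label": label,            # index (0-3) of the true hidden word
    }
```

For the outsider subtask, the `choices` field would instead hold the per-word cant histories, since the hidden words themselves are invisible.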

Baselines
Word Embedding Similarity Our task is naturally similar to the task of word similarity (Jin and Wu, 2012). We select FastText (Grave et al., 2018), SGNS (trained on a large mixed corpus), DSG (Song et al., 2018) and VCWE (Sun et al., 2019a) as word embedding baselines. For each baseline, we first check whether the cant is in the vocabulary; if not, we use a word tokenizer to break it into words. If any token is still out-of-vocabulary, we break it into characters. For the insider subtask, we average the word vectors to represent the cant and select the hidden word with the smallest cosine distance in the embedding space. For the outsider subtask, we average the history cant for each candidate as its representation, then predict the label by selecting the candidate whose representation has the smallest distance to the cant. Note that for word embedding baselines, the cant context is omitted and the evaluation is zero-shot (without any training).
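The insider-subtask embedding baseline, including the out-of-vocabulary fallback, can be sketched as follows. This is our own minimal re-implementation, assuming at least one token of every phrase has a vector:

```python
import numpy as np

def embed(phrase, vocab, tokenize):
    """Embed a phrase: use it whole if in vocabulary, else fall back to
    word tokens, else to individual characters; average the vectors."""
    if phrase in vocab:
        tokens = [phrase]
    else:
        tokens = tokenize(phrase)
        if any(t not in vocab for t in tokens):
            tokens = list(phrase)  # character-level fallback
    vecs = [vocab[t] for t in tokens if t in vocab]
    return np.mean(vecs, axis=0)

def predict_insider(cant, hidden_words, vocab, tokenize):
    """Pick the hidden word with the smallest cosine distance to the cant."""
    c = embed(cant, vocab, tokenize)
    def cos_dist(a, b):
        return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return min(hidden_words, key=lambda w: cos_dist(c, embed(w, vocab, tokenize)))
```

The outsider variant would replace each hidden word's vector with the average of that word's cant history.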
Pretrained Language Models We use BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2020), and Baidu ERNIE (Sun et al., 2019b) as baselines. The implementation is based on Hugging Face's Transformers (Wolf et al., 2020). Specifically, for the insider subtask, we construct the input sequence for each choice by concatenating its context, cant, and candidate hidden word with the special token [SEP]. We then concatenate the input sequences for all candidate hidden words with [SEP] and feed the result into a BERT-like model. Finally, we use the hidden representation of the first token [CLS] to produce the final prediction with a linear layer. For the outsider subtask, we replace the hidden words with the cant history. We fine-tune the models on the training set and report results on the development and test sets. We use Adam (Kingma and Ba, 2015) with a learning rate searched over {2e-5, 3e-5, 5e-5} and a batch size of 64 to fine-tune the models for 3 epochs, warming up the learning rate for the first 10% of training steps. (The pretrained weights for BERT are from the official BERT repository: https://github.com/google-research/bert. Pretrained weights for other models are provided by CLUE: https://github.com/CLUEbenchmark/CLUEPretrainedModels.)
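The input construction for the insider subtask can be sketched as a string-building step (a simplified illustration; the actual pipeline works on token IDs via the Transformers tokenizer):

```python
def build_plm_input(context, cant, candidates, sep="[SEP]"):
    """Build the insider-subtask input: for each candidate hidden word,
    join context, cant, and candidate with [SEP]; then join the
    per-choice sequences with [SEP].  A BERT-like encoder scores the
    result through the [CLS] representation and a linear layer."""
    per_choice = [sep.join([context, cant, cand]) for cand in candidates]
    return sep.join(per_choice)
```

For the outsider subtask, each candidate hidden word would be replaced by that slot's cant history.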

Quantitative Analysis
We show the experimental results in Table 2. Among the word embedding similarity baselines, DSG (Song et al., 2018), which is trained on mixed characters, words and n-grams from a large, diverse corpus, drastically outperforms the other word embeddings. Among the pretrained language models, large-size models, with more computational capacity, remarkably outperform base-size models on the insider subtask. Both RoBERTa-base and ERNIE-base outperform BERT-base, while ALBERT-base, which employs parameter sharing, slightly underperforms BERT on both tasks. Notably, the best-performing model still trails human performance by large margins of 12.8 and 8.5 points on the insider and outsider subtasks, respectively. This indicates that DogWhistle is a very challenging dataset, providing a new battleground for next-generation pretrained language models.

Qualitative Analysis
We list some representative samples that BERT fails to predict but that human players predict correctly in Table 3. In example #1, "Dancing Pallbearers" (https://en.wikipedia.org/wiki/Dancing_Pallbearers) is a recent meme that went viral after the release of these models, so the pretrained models likely have little knowledge of the subject. In example #2, "007" refers to the James Bond films (https://en.wikipedia.org/wiki/James_Bond), in which the protagonist often cracks passwords on a mission. This kind of reasoning requires genuine world knowledge rather than overfitting to shallow lexical features, a major drawback pointed out in natural language inference (Poliak et al., 2018; Zhang et al., 2019). In example #3, "孩子都可以打酱油了" (literally, "the child is old enough to go buy soy sauce") is a Chinese slang expression meaning that a child has grown up. To predict this example correctly, the model must have extensive knowledge of the language.

Intermediate-Task Transfer
Intermediate-Task Transfer Learning (Pruksachatkun et al., 2020) exploits an intermediate task to improve the performance of a model on a target task. As analyzed above, DogWhistle contains rich world knowledge and requires high-level reasoning. Therefore, we can strengthen a model by leveraging our dataset as an intermediate task. Specifically, we transfer DogWhistle to a semantic similarity task: we first fine-tune the models on the insider subtask, then fine-tune them again on two real-world semantic matching datasets, the Ant Financial Question Matching Corpus (AFQMC) (Xu et al., 2020d) and the Large-scale Chinese Question Matching Corpus (LCQMC) (Liu et al., 2018). As shown in Table 4, on both datasets, DogWhistle helps models obtain significantly better performance (p < 0.05).
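The two-stage procedure amounts to chaining two fine-tuning runs, which can be sketched generically (the function names are illustrative; `finetune` stands in for any standard training loop):

```python
def intermediate_task_transfer(model, intermediate_data, target_data, finetune):
    """Two-stage training: fine-tune on the DogWhistle insider subtask
    first, then fine-tune the resulting weights on the target
    semantic-matching dataset (e.g., AFQMC or LCQMC)."""
    model = finetune(model, intermediate_data)  # stage 1: intermediate task
    model = finetune(model, target_data)        # stage 2: target task
    return model
```

The key design choice is that stage 2 starts from the stage-1 weights rather than from the original pretrained checkpoint, so knowledge acquired on DogWhistle carries over to the target task.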

Conclusion and Future Work
In this paper, we propose DogWhistle, a new Chinese dataset for cant creation, understanding and decryption. We evaluate word embeddings and pretrained language models on the dataset. The gap between human performance and model results indicates that our dataset is challenging and promising for evaluating new pretrained language models. For future work, we plan to use this dataset to train agents that compete against each other, in order to better understand verbal intelligence and to teach agents to reason, guess and deceive in natural language, making progress toward higher levels of World Scope (Bisk et al., 2020).

Ethical Considerations
During data collection, the game guideline asked players not to use any offensive content. However, as with all user-generated language resources, there will inevitably be bias and stereotyping in the dataset. We consider this a double-edged sword: it provides opportunities for computational social science research on bias in human language, but also requires responsible use of the data. We also warn that the dataset may contain potentially toxic or offensive content. Likewise, this dataset could be abused to generate dog-whistle phrases and political propaganda; aware of these risks, we have set terms restricting its use to research purposes only.