Lying Through One’s Teeth: A Study on Verbal Leakage Cues

Although many studies use the LIWC lexicon to show the existence of verbal leakage cues in lie detection datasets, none examine how verbal leakage cues are influenced by the data collection method, or the impact thereof on the performance of models. In this paper, we study verbal leakage cues to understand the effect of the data construction method on their significance, and examine the relationship between such cues and models' validity. The LIWC word-category dominance scores of seven lie detection datasets are used to show that audio statements and lie-based annotations yield a greater number of strong verbal leakage cue categories. Moreover, we evaluate the validity of state-of-the-art lie detection models with cross- and in-dataset testing. Results show that in both types of testing, models trained on a dataset with more strong verbal leakage cue categories—as opposed to only a greater number of strong cues—yield superior results, suggesting that verbal leakage cues are a key factor for selecting lie detection datasets.


Introduction
One theory of lie detection concerns cues to lying: why and when do liars and truth-tellers display different behavior? Ekman and Friesen (1969) proposed two categories of cues: deception cues and leakage cues. Deception cues relate to the content of lies, such as an inconsistency in one's story; leakage cues appear because liars' emotions betray their true feelings. Leakage cues can be further classified into non-verbal and verbal leakage cues. Zuckerman et al. (1981) reject the utility of focusing on liars' emotions and instead link such cues to cognitive load, a view supported by Vrij et al. (2008, 2016, 2017). DePaulo et al. (2003) analyze 158 cues to deception, including non-verbal and verbal leakage cues, and find that verbal leakage cues are more reliable than others. Studies such as Adams (1996), Smith (2001), and Levitan (2019) show that verbal leakage cues can be found through psycholinguistic dictionaries such as the LIWC lexicon (Pennebaker et al., 1999), LDI (Bachenko et al., 2008; Enos, 2009), and Harbingers (Niculae et al., 2015).
Many NLP studies have recently collected lie detection datasets and detected lies using computational models (Hirschberg et al., 2005;Pérez-Rosas et al., 2014;Peskov et al., 2020); most of these ignore traditional lie detection methods and findings, and have no follow-up studies, making it difficult to know which datasets are suitable for model training.
To use machine learning approaches together with lie detection research in psychology and linguistics, and to seek a way to evaluate and select proper datasets, this study focuses on analyzing the verbal leakage cues within such datasets; hereafter, leakage cues refers to verbal leakage cues. We study leakage cues in terms of the data collection method and model performance. Seven lie detection datasets are adopted for experiments. We analyze these datasets using word categories defined in LIWC2015 (Pennebaker et al., 2015). Through this study, we aim to answer three questions: (1) How do data collection methods affect strong leakage cues? (2) What is the role of the quantity and the category of strong leakage cues in the lie detection task? (3) Do strong leakage cues contribute to model validity? We expect these answers to help in the construction and selection of appropriate datasets for lie detection tasks.

Leakage Cues and Datasets
To understand how leakage cues contribute to lie detection, we first measure the extent of leakage cues in lie detection datasets.

Datasets
We consider seven lie detection datasets:
• Diplomacy (DM) (Peskov et al., 2020): conversation logs collected from Diplomacy, an online text-based board game.
• …: true and false courtroom testimonies collected from InnocenceProject.org.
Dataset statistics are provided in Table 1.

Datasets and Dominant Word Categories
We start by measuring the representation of leakage cues in datasets using word-category dominance scores. The word categories C used here are defined in LIWC2015. LIWC is a psycholinguistic dictionary that groups words into 93 categories relevant to psychological processes, and has been used to detect leakage cues in multiple deception studies (Newman et al., 2003; Ott et al., 2011). Example LIWC word categories and their words are given in Table 2.
To calculate the dominance score of a word category $C_i \in C$ in a dataset $D$, we first divide the samples in $D$ into a lie set $L$ and a truth set $T$. We calculate the lie and truth coverage rates of $C_i$ as

$$p_i^L = \frac{\sum_j v(L, i, j)}{|L|}, \qquad p_i^T = \frac{\sum_j v(T, i, j)}{|T|},$$

where $v(\cdot, i, j)$ measures the occurrence count of word $w_{i,j} \in C_i$ within the given set, and $|L|$ and $|T|$ represent the number of tokens in $L$ and $T$, respectively. The dominance score of $C_i$ is calculated as

$$r_i = p_i^L / p_i^T.$$

An $r_i \geq 1.2$ indicates a more deceptive category; an $r_i \leq 0.8$ indicates a more truthful category. In both cases, $C_i$ is a strong word category (Mihalcea and Strapparava, 2009). Thus, we define the set of strong word categories and the number of dominant words as

$$C^s = \{C_i \mid r_i \geq 1.2 \text{ or } r_i \leq 0.8\}, \qquad N_d = \sum_{C_i \in C^s} \sum_j \big(v(L, i, j) + v(T, i, j)\big).$$

We refer to the number of dominant words $N_d$ as the number of strong leakage cues, and the number of strong word categories $|C^s|$ as the number of strong leakage cue categories.
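The dominance-score computation above can be sketched as follows; the function names and the data layout (one token list per sample, a word set per category) are our own illustration, not the paper's implementation:

```python
from collections import Counter

def dominance_scores(lies, truths, categories):
    """Compute the dominance score r_i of each word category.

    lies / truths: lists of token lists; categories: dict mapping a
    category name to its set of words. Thresholds (1.2 / 0.8) follow
    the paper; everything else here is a sketch.
    """
    lie_counts = Counter(t for sample in lies for t in sample)
    truth_counts = Counter(t for sample in truths for t in sample)
    n_lie = sum(lie_counts.values())      # |L|: total tokens in the lie set
    n_truth = sum(truth_counts.values())  # |T|: total tokens in the truth set
    scores = {}
    for name, words in categories.items():
        p_lie = sum(lie_counts[w] for w in words) / n_lie      # lie coverage
        p_truth = sum(truth_counts[w] for w in words) / n_truth  # truth coverage
        if p_truth > 0:
            scores[name] = p_lie / p_truth  # r_i
    return scores

def strong_categories(scores, hi=1.2, lo=0.8):
    """Categories with r_i >= 1.2 (deceptive) or r_i <= 0.8 (truthful)."""
    return {c: r for c, r in scores.items() if r >= hi or r <= lo}
```

Categories whose truth coverage is zero are skipped here to avoid division by zero; the paper's 1% coverage filter (below) removes such cases in practice.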
Similar to Levitan (2019), LIWC word categories that cover less than 1% of truths or lies are first removed to minimize noise. LIWC word categories related to punctuation are also removed for normalization, as punctuation is not included in some transcriptions. The number of remaining LIWC word categories is denoted as |C|.

Analysis by Dataset
As shown in Table 2, studies collected lie detection data using various approaches. We are interested in how data collection methods affect leakage cues; specifically, how we can construct datasets to obtain more leakage cues for model learning. We list in Table 3 the number of strong leakage cues and strong leakage cue categories for each dataset. Results show that audio datasets contain more strong leakage cue categories than textual ones, which is consistent with the cognitive-load account of Zuckerman et al. (1981). Comparatively, deceiving using text incurs a smaller cognitive load, given that when typing, liars have more time to think and no need to control nonverbal behaviors. The only exception is the audio-recorded MU3D, which asks interviewees to record four statements, honestly and dishonestly, about their social relationships. Since interviewees can prepare the statements beforehand, and interviewers do not predict which statement is true, their cognitive load may not be as heavy as in other audio datasets where lies are generated on the fly.
Lie- vs. Liar-based Annotation Results also show that datasets with liar-based annotation have fewer strong leakage cue categories. Comparing textual datasets, there is no strong leakage cue category in the liar-based MS; comparing audio datasets, the number of strong leakage cue categories in the liar-based RLT is considerably smaller than in the lie-based BOL. Note that liars tend to wrap their lies with true information in an effort to be convincing (Peskov et al., 2020), i.e., lies are diluted by truth.

Analysis by Word Category
We find 53 strong word categories across the 7 experimental datasets; 10 of these dominate in more than 3 datasets. To dig deeper into each category, for each dataset we calculate the normalized word frequency of $w_{i,j}$, the proportion of the frequency of $w_{i,j}$ to the total count of words that belong to $C_i$:

$$f_{i,j} = \frac{v(D, i, j)}{\sum_{j'} v(D, i, j')}.$$

We plot the normalized word frequencies of the most common strong leakage cue categories in Figure 1 and give examples of lies and truths with these salient cues in Table 4. In general, word categories capture words that are frequently used in lying and truth-telling. The upper results in Figure 1 show that liars in most datasets consistently use both Discrep and Certain words, suggesting that when lying, people tend to obscure facts with the subjunctive mood, but also attempt to use definite words to increase credibility (see the first and second lies in Table 4). The middle results in the figure show that liars seldom use Number and Quant words, suggesting that liars are unwilling to include details (see the first and second truths in Table 4). This supports the cognitive theory: describing details is a cognitively complex task.
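The normalized word frequency can be sketched as follows; the function name and data layout are illustrative, not the paper's code:

```python
from collections import Counter

def normalized_word_freq(dataset_tokens, category_words):
    """Proportion of each category word's count to the total count of
    words from that category in the dataset (f_{i,j} above).

    dataset_tokens: list of token lists; category_words: set of words in C_i.
    """
    counts = Counter(
        t for sample in dataset_tokens for t in sample if t in category_words
    )
    total = sum(counts.values())
    # Words never observed get frequency 0; an absent category yields {}.
    return {w: counts[w] / total for w in category_words} if total else {}
```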
In some cases, word categories dominate on different sides in different datasets. For example, Differ is a truthful category in LIAR but a deceptive category in BOL and RLT. The lower-left results show that the Differ words used in LIAR differ from those used in RLT and BOL, suggesting that word usage affects which side the category dominates on. Another interesting cause of a word category dominating in both lies and truths is the nature of the scenario. Affiliation is a deceptive category in DM but a truthful category in RLT and MU3D. The lower-right results show that Affiliation words are used in these three datasets in a similar way: we is used frequently. However, data in DM are collected from a board game, and people in that game tend to deceive others when they are in alliances (see the third lie in Table 4). Therefore, Affiliation dominates on different sides in these three datasets. These results suggest that word categories provide insight into how humans lie in different scenarios.

Table 4: Examples of lies and truths. We select three lie and three truth samples from the top 10 samples with the most dominant words; dominant words are marked in italics.
Lies:
• Says he never said he would keep education funding the same. (LIAR)
• No sir I did not. I absolutely did not. No sir I was not. No sir. (RLT)
• We're friends, right? I believe that every message I've sent you has been truth. Are we still friends? (DM)
Truths:
• One is dusty. One's got big hair. One's got hundreds and thousands on it. (BOL)
• We've created more than 850,000 jobs, more than all the other states combined. (LIAR)
• ... we would share a room together even though we had our own separate bedrooms... (RLT)

Leakage Cues and Model Validity
To explore the effect of leakage cues on model validity, we conduct both cross- and in-dataset evaluations. We adopt three lie detection models for the experiments: UniGRU (Cho et al., 2014), CNN (Zhang and Wallace, 2017), and BERT (Devlin et al., 2019).

Experimental Details
For lie-based datasets, as each sample is an utterance labeled as lie or truth, we do not consider the speakers of samples when splitting them into train/eval/test sets. For liar-based datasets, on the other hand, we concatenate all samples of one speaker into one sample and assign it a speaker-level annotation in preprocessing before splitting.
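The speaker-level preprocessing for liar-based datasets can be sketched as follows; the field names (`speaker`, `text`, `label`) are hypothetical, chosen only for illustration:

```python
from collections import defaultdict

def to_speaker_level(samples):
    """Concatenate all utterances of a speaker into one sample with a
    speaker-level annotation, as done for liar-based datasets before
    the train/eval/test split. A sketch; field names are ours.
    """
    grouped = defaultdict(list)
    labels = {}
    for s in samples:
        grouped[s["speaker"]].append(s["text"])
        # Liar-based annotation: every utterance of a speaker shares one label.
        labels[s["speaker"]] = s["label"]
    return [
        {"speaker": sp, "text": " ".join(texts), "label": labels[sp]}
        for sp, texts in grouped.items()
    ]
```

Splitting by these merged samples guarantees that no speaker appears in both the training and the test set.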
For all three models, we use Adam with a learning rate of 3e-4 as the optimizer, and set the maximum number of input tokens to 256, as 95% of the samples are shorter than this, except for liar-based samples. Some liar-based samples are too long to input into the NN model; as we find no significant difference in F1 score when configuring the maximum number of input tokens as 256 or 512, we keep the same setting and use 256. We apply batch sizes from 10 to 300 for different datasets, depending on their training set size. To deal with the unbalanced labels in some datasets, we apply a weighted binary cross-entropy loss, using the ratio of labels as the weight.
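One plausible reading of "using the ratio of labels as the weight" is to weight each class by the share of the opposite class, so the minority class contributes more to the loss. A minimal sketch under that assumption (the paper does not spell out the exact weighting):

```python
import math

def label_weights(labels):
    """Weight each class by the frequency share of the opposite class,
    one way to counter label imbalance. An assumption, not the paper's
    documented formula."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return {1: n_neg / len(labels), 0: n_pos / len(labels)}

def weighted_bce(probs, labels, weights):
    """Weighted binary cross-entropy over predicted lie probabilities."""
    total = 0.0
    for p, y in zip(probs, labels):
        w = weights[y]
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)
```

With this weighting, a 1:3 label split gives the minority class three times the per-sample weight of the majority class.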

Reliability
To discuss model validity, we first test reliability and remove unreliable settings. We seek to evaluate model validity only on reliable models, that is, models with low label inconsistency rates. We define this rate as the percentage of samples that are sometimes predicted as true but sometimes as false by models trained with different random seeds. We train 50 models with different random seeds, $M = \{m_1, \ldots, m_{50}\}$, and test them on the testing set $D_{ts}$. The inconsistent label rate is measured as

$$\epsilon = \frac{|\{d \in D_{ts} \mid \exists\, m_a, m_b \in M,\ m_a(d) \neq m_b(d)\}|}{|D_{ts}|}. \quad (2)$$

A smaller $\epsilon$ indicates lower label inconsistency. Table 5 shows that all three models trained on MU3D, and UniGRU trained on RLT, yield an $\epsilon$ of 1, indicating models with highly inconsistent labels; we exclude these from further analysis.
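The inconsistent label rate can be sketched in code as follows, given one 0/1 prediction list per random seed over the same test set (a sketch; the paper's implementation is not shown):

```python
def label_inconsistency_rate(predictions):
    """Fraction of test samples on which at least two models disagree.

    predictions: list over models (one per random seed), each a list of
    0/1 labels over the same test set, in the same order.
    """
    n_samples = len(predictions[0])
    inconsistent = sum(
        1
        for j in range(n_samples)
        # A sample is inconsistent if the models produce >1 distinct label.
        if len({preds[j] for preds in predictions}) > 1
    )
    return inconsistent / n_samples
```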

Validity
Inter-dataset Validity In this experiment, we examine how the number of strong leakage cue categories affects model validity when training on one dataset and testing on another. Inspired by Chen et al. (2020), inter-dataset validity is measured as the F1 drop $= \text{F1}_{in} - \text{F1}_{cross}$: a small F1 drop across datasets indicates good validity. Results in Figure 2 show that for each model, the F1 drop in the black cluster is smaller than in the others, indicating that training on datasets with many strong leakage cue categories (BOL, RLT, LIAR, and OD) yields comparable or even better testing results on other datasets, i.e., good inter-dataset validity. Accordingly, to acquire a generalizable model, a lie detection dataset containing many strong leakage cue categories should be selected.
In-dataset Validity In this experiment, to understand how model performance changes, we control the training set (1) by varying the number of strong leakage cues while fixing the dataset size, and (2) by varying the dataset size while fixing the number of strong leakage cues. The model used here is UniGRU, and the datasets are DM and LIAR, both of which include samples with many strong leakage cues.
We first set the dataset size to 1,000 samples, and use seven different numbers of strong leakage cues. The results are shown in Figure 3a. Models trained on samples with many strong leakage cues (DM, blue) yield significantly higher F1 scores. In particular, models trained on DM improve the F1 score by more than 40% when the number of strong leakage cues in the training set increases from 10 to 100. The results also show that the F1 score increases with the number of strong leakage cues in training, and that this increase plateaus once the number of strong leakage cues exceeds 1,000.
To evaluate the impact of dataset size, we fix the number of strong leakage cues to 2,000 and use three different dataset sizes. As shown in Figure 3b, models trained on datasets with many samples also achieve higher F1 scores in some settings, but this improvement is smaller than that obtained by increasing the number of strong leakage cues. Moreover, in some settings, performance fails to improve when the dataset size is increased.
These two experiments suggest that the number of strong leakage cues in datasets is more critical for model validity than the dataset size. Therefore, we argue that a good lie detection dataset should contain many strong leakage cues.

Conclusion
In this paper, we study the connections among leakage cues, datasets, and models. Various conditions are analyzed, with results showing that leakage cues help increase model validity, and that they are most prevalent in datasets containing audio statements and lie-based annotations. These findings and the testing methods are good references for selecting appropriate data and models when building lie detection applications. As no benchmark has yet been established, we expect this research to serve as a guide for researchers new to this problem, saving them unnecessary effort and helping them quickly get up to speed.

Ethical Considerations
We analyze the relationship between verbal leakage cues and existing lie detection datasets and models, providing a principled way to select and collect lie detection data. We find that a good lie detection dataset should contain many strong leakage cue categories, which can be achieved with audio statements and lie-based annotation; these properties are not related to race, sex, or other factors that may raise ethical issues. We believe that this study can help improve the quality of lie detection datasets and models, and protect people from being deceived.