“Does it Matter When I Think You Are Lying?” Improving Deception Detection by Integrating Interlocutor’s Judgements in Conversations

It is well known that humans are not good at deception detection because of a natural inclination toward truth-bias. However, during a conversation, when an interlocutor (interrogator) is explicitly asked to assess whether his/her interacting partner (deceiver) is lying, this perceptual judgment depends highly on how the interrogator interprets the context of the conversation. While deceptive behaviors can be difficult to model due to their heterogeneous manifestation, we hypothesize that this contextual information, i.e., whether the interlocutor trusts or distrusts what his/her partner is saying, provides an important condition under which the deceiver's deceptive behaviors are more consistently distinct. In this work, we propose a Judgmental-Enhanced Automatic Deception Detection Network (JEADDN) that explicitly considers the interrogator's perceived truths and deceptions together with three types of speech-language features (acoustic-prosodic, linguistic, and conversational temporal dynamics features) extracted during a conversation. We evaluate our framework on a large Mandarin Chinese deception dialog database. The results show that the method significantly outperforms the current state-of-the-art approach, which does not condition on the interrogators' judgements, on this database. We further demonstrate that the behaviors of interrogators are important in detecting deception when the interrogators distrust the deceivers. Finally, with the late fusion of audio, text, and turn-taking dynamics (TTD) features, we obtain promising accuracies of 87.27% and 94.18% under the conditions that the interrogators trust and distrust the deceivers, improvements of 7.27% and 13.57%, respectively, over the model that does not consider the interlocutor's judgements.


Introduction
Deception behaviors frequently appear in human daily life, such as in politics (Clementson, 2018), news (Conroy et al., 2015a; Vaccari and Chadwick, 2020), and business (Grazioli and Jarvenpaa, 2003; Triandis et al., 2001). Despite their frequent occurrence, researchers have repeatedly shown that humans are not good at detecting deceptions (54% accuracy on average for both police officers and college students (Vrij and Graham, 1997)), even highly-skilled professionals such as teachers, social workers, and police officers (Hartwig et al., 2004; Vrij et al., 2006). Due to the difficulty of identifying deceptions by humans, researchers have developed automatic deception detection (ADD) systems applied in various fields, such as cybercrime (Mbaziira and Jones, 2016), fake news detection (Conroy et al., 2015b), employment interviews (Levitan et al., 2018b,a), and even court decisions (Venkatesh et al., 2019; Pérez-Rosas, Verónica and Abouelenien, Mohamed and Mihalcea, Rada and Burzo, Mihai, 2015). Although many works have studied approaches to automatic deception detection, few works, if any, have investigated whether human judgements can provide a condition that enhances ADD recognition rates.
In recent years, ADD has gained popularity and attention; however, almost all studies on ADD focus on Western cultures (countries), and there is very little literature that focuses on Eastern cultures (countries). Deception behavior often varies across cultures (Aune and Waters, 1994), and every culture has its own ways of deceiving others. Additionally, Rubin (2014) suggested that researchers need to study and understand more deception behaviors in Asia. Besides, many researchers have utilized various behavioral cues to build ADD systems, such as facial expressions (Thannoon et al., 2019), internal physiological measures (Ambach and Gamer, 2018), and even functional brain MRI (Kozel et al., 2009a,b). While these indicators can be useful in detecting deceptions, many of them require expensive and invasive instrumentation that is not practical for real-world applications. Instead, speech and language carry substantial deceptive cues that can be modeled in ADD tasks for potential large-scale deployment (Zhou et al., 2003). Hence, the proposed method models the speech and language cues of humans with real-world data in Mandarin Chinese. Despite these important advances in understanding and automatically identifying deceptions, there has been little work investigating whether the performance of ADD models can be significantly improved by considering the behaviors and perceptions of interrogators. Several questions remain: is there a difference in the linguistic and acoustic-prosodic characteristics of an utterance from either interlocutor given the trusted/distrusted judgments of interrogators? How do the judgments of interrogators help the ADD model detect deceptions?
To investigate these questions, we first follow previous studies to segment a dialog into Questioning-Answering (QA) pair turns, and then extract acoustic-prosodic features, linguistic features (e.g., Part-Of-Speech tags (POS), Named Entity Recognition (NER), and Linguistic Inquiry and Word Count (LIWC)), and conversational temporal dynamics (CTD) features. We then train machine learning and deep learning classifiers on this large set of lexical and speech features to automatically identify deceptions, and evaluate the results on the Daily Deceptive Dialogues corpus of Mandarin (DDDM). Also, to investigate the differences between the interlocutors' behaviors, we perform Welch's t-test (Delacre et al., 2017) on the characteristics of utterances from both interlocutors in three different scenarios: (A) human-distrusted deceptive and truthful statements, (B) human-trusted deceptive and truthful statements, and (C) successful/unsuccessful deceptive and truthful statements.
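As a concrete illustration of the significance testing used throughout the analyses, Welch's unequal-variances t-statistic and its Welch–Satterthwaite degrees of freedom can be computed directly from the standard formulas (a minimal sketch, not the implementation used in the experiments):

```python
import math

def welch_t(a, b):
    """Welch's t-test: compares two sample means without assuming
    equal variances, as used for the feature-set comparisons."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variance of a
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)  # sample variance of b
    se = math.sqrt(va / na + vb / nb)              # std. error of the mean difference
    t = (ma - mb) / se
    # Welch–Satterthwaite approximation of the degrees of freedom
    df = (va / na + vb / nb) ** 2 / (
        (va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1)
    )
    return t, df
```

In practice a library routine (e.g., `scipy.stats.ttest_ind` with `equal_var=False`) would be used to obtain the p-value; the sketch above only shows the statistic itself.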
In our further analyses, we found that (i) human judgments indeed significantly improve the performance of the proposed method in detecting deceptions, (ii) the behaviors of interrogators should be incorporated into the model when the interrogator distrusts the deceivers, and (iii) additional evidence indicates that humans are bad at detecting deceptions: very few significant indicators overlap between trusted truths-deceptions and successful-unsuccessful deceptions. We believe that these overlapping indicators could be useful for training humans to detect deceptions more successfully. Finally, we summarize our 3 main contributions as follows.
• We are, to our knowledge, the first to include the judgements of the interrogator as a condition to improve the recognition rates of a deception detection model.
• We demonstrate that the interrogators' features are more effective for detecting deceptions than the deceivers' features under the condition that the interrogator disbelieves the deceiver.
• The proposed model has high potential for practical deception detection applications and impact on the ADD area.

Related Work
Automatic deception detection in a dialogue Previous studies have trained deception detectors with various features in a dialog. Levitan et al. (2018a) extracted acoustic features of utterances to build a detection framework using a global-level label as the ground truth in employment interviews. Others indicated that the interlocutor's vocal characteristics and conversational dynamics should be jointly modeled to better perform deception detection in dialogues. The grammatical and syntactical POS features have been widely used in automatic deception detection (Pérez-Rosas, Verónica and Abouelenien, Mohamed and Mihalcea, Rada and Burzo, Mihai, 2015; Levitan et al., 2016; Abouelenien et al., 2017; Kao et al., 2020). In addition, Liu et al. (2012); Levitan et al. (2018b) modeled behaviors of language use with LIWC features. Dando and Bull (2011); Sandham et al. (2020) found that police officers can be trained to identify criminal liars with advanced interrogation strategies (e.g., tactical use procedures) because these interview techniques maximize deceivers' cognitive load (Dando et al., 2015). In addition, Chou and Lee (2020) tried to learn from the behaviors of both interlocutors to identify perceived deceptions, but their learning targets come from the perception of the interrogators, not from the deceivers. Therefore, to the best of our knowledge, we are the first to exploit the interrogators' behaviors for detecting deceptions automatically.
The perceptions of interrogators for detecting deceptions Levitan et al. (2018b) studied the perception (judgment) of deception by identifying characteristics of statements that are perceived as truths or lies by interrogators, but they did not use the perceptions for detecting deceptions. Kleinberg and Verschuere (2021) used LIWC variables and POS frequencies as input features to train random forest classifiers, and asked subjects to mark scores ranging from 0 (certainly truthful) to 100 (certainly deceptive) on deceptive or truthful text data. They then presented the output probabilities of the two trained classifiers on each data point, allowing the subjects to revise their scores. Their results showed that human perceptions impair automatic deception detection models. Our work differs from Kleinberg and Verschuere (2021) mainly in how the judgements are utilized; in our work, they provide a condition that improves the prediction results.

Figure 1: Illustration of Questioning-Answering (QA) pair turns. We only use complete QA pair turns and exclude questioning turns for which no corresponding answering turn can be found. Note that each turn may contain multiple utterances.

DDDM Database
We used conversational utterances from the Daily Deceptive Dialogues corpus of Mandarin (DDDM). The entire DDDM contains about 27.2 hours of audio recordings from 96 unique speakers and 283 "question-level" conversational data samples. This corpus is particularly useful for our study, and all annotations in the DDDM come from "human" raters. Most deception databases lack recordings and perceptions (judgments) of the interrogators, whereas DDDM recorded the whole interrogator-deceiver conversations and the judgements of both interlocutors, allowing us to study deception detection given the judgements of the interrogators. With the judgements of both interlocutors, we group the data samples into four classes (shown in Table 1): (1) successful deceptions, (2) trusted truths, (3) unsuccessful deceptions, and (4) distrusted truths. We segment each dialog into QA pair turns (Figure 1) because the interrogator tended to ask follow-up questions when judging the deceiver's statements.
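The four-way grouping follows from crossing the deceiver's ground-truth label with the interrogator's judgement, which can be sketched as (function and argument names are illustrative, not the corpus's identifiers):

```python
def group_sample(deceiver_lied, interrogator_trusted):
    """Four-way grouping of DDDM samples (cf. Table 1): the deceiver's
    ground truth crossed with the interrogator's judgement."""
    if deceiver_lied:
        return "successful deception" if interrogator_trusted else "unsuccessful deception"
    return "trusted truth" if interrogator_trusted else "distrusted truth"
```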

The definition of deception
Deception is different from lying. Deception is human behavior that aims to make receivers believe true (or false) statements that the deceiver believes to be false (or true) through consciously planned acts, such as sharing a mix of truthful and deceptive experiences to change the perceptions of the interrogators when being asked to answer questions. Lying, in contrast, is simply saying that something is true (or false) when in fact it is false (or true) (Mitchell, 1986; Sarkadi, 2018). Hence, it is challenging for interrogators to detect deceptions through the behaviors of deceivers. Humans need to engage in higher-order cognitive processing to detect these consciously planned deceptions (Street et al., 2019). The deceiver can act in ways that change the perceptions of the potential deception detector, which shifts a heavier burden onto the interrogator's cognitive processing. The interrogator must therefore engage in "higher-order" cognitive processing to detect these advanced lies, because they usually cannot just detect the behavior (e.g., signs of nervousness in voice) but must interpret why this individual may be nervous, including the honest reason why (e.g., being afraid of being disbelieved).

Deception detection with judgments of human
Humans rarely perform better than chance at detecting deceptions, but interrogators make their judgements according to context information in an interrogator-deceiver conversation. People may find it hard to remember all the detailed information, but their judgements may draw on context-general information based on their own experience, which results in a truth-bias. Therefore, we build deception detection models conditioned on human perceptions (human-trusted or human-distrusted). We use human judgements as criteria to define the following conditions (we also include the condition with no human judgements, which is the setting of most conventional ADD studies):
(i) Truthful and deceptive statements detection: detecting deceptions without the perceptions of interrogators (human judgements);
(ii) Trusted truthful and deceptive statements detection: detecting deceptions given the interrogator's believing judgments;
(iii) Distrusted truthful and deceptive statements detection: detecting deceptions given the interrogator's disbelieving judgments.
In the proposed method, human judgements are the criterion for choosing the classifier for a given condition (they are not used as features). That is, when the interrogator believes the deceiver's statements, we use the condition (ii) classifier; when the interrogator disbelieves the deceiver, we use the condition (iii) classifier. We fuse the best feature set from each modality by late fusion with three additional dense layers. There are two main goals: one is to investigate the effectiveness and robustness of the speech and language features of both interlocutors; the other is to show whether a model that detects deceptions with the interrogators' judgements can outperform a model without them.
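The judgement-conditioned routing described above can be sketched in a few lines (a minimal sketch; the classifier interface is hypothetical):

```python
def detect(qa_pair, interrogator_trusts, clf_trusted, clf_distrusted):
    """Judgement-conditioned deception detection: the interrogator's
    judgement selects the classifier (condition (ii) vs. (iii));
    it is a routing criterion, not an input feature."""
    clf = clf_trusted if interrogator_trusts else clf_distrusted
    return clf(qa_pair)
```

The key design choice is that the judgement never enters the feature vector; it only decides which condition-specific model scores the QA pair.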
More specifically, we split the four-class data samples in Table 1 into two conditions based on the interrogators' judgements (human-trusted/human-distrusted). The unit of the interrogators'/deceivers' features incorporates all of the utterances from a complete QA pair because interrogators ask questions to seek detailed information. The closest previous study is Chou and Lee (2020). They investigated perceived deception in the condition that the deceiver is telling either truths or deceptions, but they focus only on perceived deception recognition. Our objective is to classify the deceiver's answer to each question, whereas the learning targets of Chou and Lee (2020) are the interrogator's guessed answers; our learning targets are therefore different. Moreover, their setup is difficult to apply in real life since it requires knowing the judgements of the deceivers. In this paper, we hypothesize that (i) we can achieve better performance if the model takes the interrogators' judgements into account, and (ii) there are differences in both interlocutors' behaviors between the trusted/distrusted truthful and deceptive dialogues. In the rest of the sections, we describe the feature extraction in detail (note that all of the following feature sets are normalized per speaker using z-score normalization) and the deception detection framework. Table 2 summarizes the 8 feature sets, which were extracted from the acoustic and linguistic characteristics of all speakers based on the questioning turns of interrogators and the answering turns of deceivers within QA pairs. In this work, we use features extracted from the audio and text recordings to build the models, and we describe each feature set below.
- Utterance-duration ratio: the reciprocal ratio between the utterance length (u) and the turn duration (d), denoted as Int_ud and Int_du respectively.
- Silence-duration ratio: the reciprocal ratio between the silence duration (s) and the turn duration, denoted as Int_sd and Int_ds respectively.
- Silence-utterance ratio: the reciprocal ratio between the silence duration and the utterance length, denoted as Int_su and Int_us respectively.
- Silence times (st): the number of times a subject produces a pause longer than 200 ms, denoted as Int_st and Dec_st.
• XLSR: due to the scarcity of deception databases in Mandarin Chinese, we use the multilingual pre-trained model XLSR-53 (Conneau et al., 2020) to extract acoustic representations. XLSR-53 is trained for the automatic speech recognition (ASR) task with more than 56,000 hours of speech data in 53 languages, including Taiwanese Mandarin, based on wav2vec 2.0 (Baevski et al., 2020). The feature vector is 512-dimensional per frame; applying the 15 statistics to the frame-level vectors yields the final 7680-dimensional feature vectors.

Text Recordings
• BERT: we utilize the Traditional Chinese BERT-Base pre-trained model (Devlin et al., 2019) to extract turn-level 768-dimensional feature vectors. BERT was trained on a large amount of plain text publicly available on the web using unsupervised objectives (such as the masked-language modeling (MLM) objective) and works at the character level, so we do not have to perform word segmentation when extracting representations.
• RoBERTa: we also use the Chinese RoBERTa pre-trained model (Cui et al., 2020) to extract turn-level representations in the same manner.
• NER: to the best of our knowledge, the NER feature set has never been used to train a deception detector. We are inspired by the findings of psychological studies on crime interrogation to use the NER feature set as input for detecting deceptions. Vrij et al. (2021) suggest that interrogators need to design and manipulate questions that ask the deceivers for detailed information, i.e., complications, because truth-tellers often report more complications than lie-tellers at each stage of the interview. A complication refers to details associated with personal experience or knowledge learned from any personal experience. In the DDDM, most recruited subjects are university students, and the three designed questions the researchers assigned each subject to ask are mainly about general activities or experiences of an average college student. For instance, scores of department border cups, professional knowledge about instruments, and the detailed process of events held by different clubs are regarded as personal experiences. Therefore, we extracted the NER features to capture this detailed information.
• LIWC: we use the LIWC 2015 toolkit to extract 82-dimensional features (excluding all punctuation-related feature dimensions and total word counts (WP)) after performing word segmentation pre-processing with CKIP Tagger.
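As a concrete sketch, the turn-level CTD ratio features listed earlier reduce to simple arithmetic over turn timings (the dictionary keys mirror, but are not exactly, the paper's notation):

```python
def ctd_turn_features(utt_len, sil_dur, turn_dur, num_pauses):
    """Conversational temporal dynamics ratios for one speaker's turn:
    reciprocal ratios among utterance length, silence duration, and turn
    duration, plus the count of pauses longer than 200 ms."""
    return {
        "ud": utt_len / turn_dur, "du": turn_dur / utt_len,
        "sd": sil_dur / turn_dur, "ds": turn_dur / sil_dur,
        "su": sil_dur / utt_len,  "us": utt_len / sil_dur,
        "st": num_pauses,  # pauses > 200 ms
    }
```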

Experimental Setup
We conduct our experiments to show whether the judgements and the speech and language cues of interrogators are helpful for detecting deceptions. The closest deception database is the Columbia X-Cultural Deception (CXD) Corpus (Levitan et al., 2015), but we have no access to it. To establish baselines, we compare all the models that have been used on the CXD corpus and report their overall performance on the DDDM corpus. These baseline models include Support Vector Machines (SVM) and the neural networks used by Chou et al. (2021). The whole framework is implemented in PyTorch (Paszke et al., 2019). The evaluation metric is macro F1-score based on dyad-independent 10-fold cross-validation. We use zero-padding to ensure each data sample has the same number of timestamps if its length is less than the maximum (40).

Table 3: Results on produced deception detection on the DDDM database in macro F1-score (%). The Who's Feature column indicates whom the feature comes from: the interrogator (Int.), the deceiver (Dec.), or both interlocutors (directly concatenating the interlocutors' features at the feature level).

The following hyper-parameters of the LSTM-DNN and BLSTM-DNN models are grid-searched: the number of nodes in the LSTM and BLSTM layers ranges over {2, 4, 8}, the batch size over {16, 32}, and the learning rate over {0.01, 0.005} with an adjustment mechanism that multiplies the rate by 1/√(1+epoch) per epoch. The maximum number of epochs is 10000. These hyper-parameters are chosen with early stopping in all conditions to minimize cross-entropy with balanced class weights on the validation set. Table 3 presents a summary of the complete results in the three conditions. There are 283, 183, and 100 question-level data samples under conditions (i), (ii), and (iii) respectively. More detailed information about the composition of the DDDM is shown in Table 1. Besides, the human performance is a 54.7% macro F1-score on the DDDM corpus.
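Two details of the setup above can be sketched briefly; the learning-rate function below is one plausible reading of "multiplying by 1/√(1+epoch) per epoch", and the padding helper's interface is hypothetical:

```python
import math

def decayed_lr(base_lr, epoch):
    """Learning-rate adjustment: scale the base rate by 1/sqrt(1+epoch)."""
    return base_lr / math.sqrt(1 + epoch)

def pad_to_max(seq, max_len=40):
    """Zero-pad a variable-length sequence of feature vectors to the fixed
    maximum number of timestamps (40) used for the (B)LSTM inputs."""
    if not seq:
        return []
    dim = len(seq[0])
    padded = list(seq[:max_len])
    padded += [[0.0] * dim for _ in range(max_len - len(padded))]
    return padded
```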
The performance of the DNN (Mendels et al., 2017) is very competitive, but modeling time-series information is important in a conversational setting. Hence, we only present results with the BLSTM-DNN model in conditions (ii) and (iii).

Experimental Results
In Table 3, the performance of the BLSTM-DNN with the interrogators' judgments is consistently higher than that of the models without them. These findings corroborate the ALIED theory (Street, 2015; Street et al., 2019), which claims that human perceptions can serve as a potential lie detector even though human judgments are error-prone. We also found that the interrogators' features contribute more to deception detection in condition (iii). This finding suggests that the interrogators' features should be considered when building deception detection models for cases where the interrogators distrust the deceivers. However, in conditions (i) and (ii), the performance of most models trained with the deceivers' feature sets consistently surpasses that of models trained with features from the interrogators or from both interlocutors.

Ablation Study
To investigate the effectiveness of the audio, text, and turn-taking dynamics (TTD) modalities, we take the best-performing feature set from each modality in Table 3: Emobase, BERT, and CTD represent the audio, text, and TTD modalities respectively. In conditions (i) and (ii), Emobase and BERT come from the deceivers; in condition (iii), they come from the interrogators. For fusion, we follow Chou et al. (2021): we first freeze the weights of all models trained with the above-mentioned feature sets and concatenate their final dense layers' outputs as the input to an additional three-layer feed-forward neural network that performs late fusion. Table 4 summarizes the results of the ablation study, and the text modality is the most effective. Finally, we obtain promising results of 87.27% and 94.18%, significant improvements of 7.27% and 13.57% over the model without human judgements in conditions (ii) and (iii) respectively.
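The late-fusion step can be sketched with plain matrix arithmetic (a minimal sketch: the weight shapes and the ReLU activation are illustrative assumptions, not the paper's exact architecture):

```python
import numpy as np

def late_fuse(frozen_outputs, layers):
    """Late fusion: concatenate the final dense-layer outputs of the
    frozen per-modality models (e.g., Emobase, BERT, CTD) and pass the
    result through a three-layer feed-forward network."""
    x = np.concatenate(frozen_outputs)
    for W in layers:
        x = np.maximum(W @ x, 0.0)  # dense layer followed by ReLU
    return x
```

Only the three fusion layers are trained; the per-modality models stay frozen, so each modality's representation is fixed before fusion.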

Analyses
Having established the presence and characteristics of each speech and language cue, we explore the differences in both interlocutors' speech and language cues across the interrogators' judgements in three scenarios: (A) human-distrusted deceptive and truthful statements, (B) human-trusted deceptive and truthful statements, and (C) successful/unsuccessful deceptive and truthful statements. We first performed Welch's t-test (Delacre et al., 2017) for each speaker's turn (e.g., questioning/answering turns) within QA pairs representing a question and answer from the 3 daily questions. The QA pairs shown in Figure 1 were marked manually, and each deceiver's answer was labeled as truth or deception using the daily-life questionnaire response sheet, resulting in 2764 QA pairs. Using this data, the significant indicators from Welch's t-tests on each feature set under the different conditions are shown in Appendix A.1 Table A.1. We then calculate the ratio of significant features in each feature set divided by its dimensionality, since the feature sets differ in dimension; e.g., for the NER feature set under scenario (A), there are 7 significant indicators over a dimension base of 17, so the ratio is 7/17. Additionally, although XLSR, NER, POS, BERT, and RoBERTa are all extracted by pre-trained models that are not error-free, and the LIWC word counts are computed after word segmentation by CKIP Tagger, all of these feature sets contain significant indicators with p-values smaller than 0.05. For example, BERT and RoBERTa features from the deceivers have a high proportion of significant indicators. However, since the meanings of the XLSR, BERT, and RoBERTa representations are difficult to interpret intuitively, we focus on the other feature sets to examine the following research questions.
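The normalization by dimensionality described above is a simple ratio, sketched below for clarity:

```python
def significance_ratio(p_values, alpha=0.05):
    """Fraction of a feature set's dimensions whose Welch's t-test
    p-value falls below alpha, normalizing for the set's dimensionality
    (e.g., 7 significant NER dimensions out of 17)."""
    return sum(p < alpha for p in p_values) / len(p_values)
```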
Is there a difference in both interlocutors' behaviors between distrusted truths and deceptions (Scenario A)? According to the experimental results in Table 3, the features of interrogators are significant indicators for detecting deceptions. After performing Welch's t-test on each feature set between distrusted truthful and deceptive questioning/answering responses (there are 898 QA pairs in scenario A), we found that the NER, POS, and LIWC feature sets have a higher ratio of statistically significant indicators. Moreover, when we check their predictions on the DDDM, we observe that the interrogators tend to ask more complex questions to elicit detailed information about the deceivers' statements. That is, the interrogators would check numerical information about scores of games, frequency of presentations, or length of music concerts (PERCENT, QUANTITY, Neqb, and DM); things such as musical instruments or events such as concert presentations and ball games (EVENT, PRODUCT, and WORK OF ART); and places/locations (i.e., elementary schools and universities) (Nc). This result is interesting because psychology studies have also shown that how interrogators question deceivers about details affects their success in catching liars. Besides, there are some significant indicators in LIWC, such as words describing movements in a sports game (death: "殺"球 (殺球 means kill and spike)) and words asking the deceivers to provide more detailed information (focusfuture: "然後"你之後還有繼續打球/彈樂器嗎? ("then", did you keep playing ball/the instrument afterward?)).

Is there a difference in both interlocutors' behaviors between trusted truths and deceptions (Scenario B)? In scenario B, the results of Welch's t-test reveal that NER has the highest ratio of significant indicators. When we go back to read the data in the DDDM (Appendix A.1), we also find that the intensity of the deceivers' utterances relates to the interrogators' judgements.
That is, the interrogator tended to judge high-intensity utterances as truths because louder utterances may be perceived as more confident, even though these utterances could in fact be deceptive. Additionally, the significance test shows that some CTD features of interrogators are important indicators of whether the deceiver is telling the truth when the interrogator trusted the deceivers. For example, in Appendix A.2 Table A.3, we can find that the interrogator spends more time coming up with more complex questions to inquire of the deceiver; although the interrogator eventually believes the deceiver's statements, the proposed method can still successfully detect the deceptions from the interrogator's temporal TTD behaviors. This finding is consistent with the previous study.

Is there any common significant indicator between distrusted truths and deceptions and successful/unsuccessful deceptions (Scenario C)? In this analysis, we provide additional evidence that humans are poor at detecting deceptions: very few indicators overlap across all feature sets in this condition (Appendix A.1 Table A.1, rightmost column). However, the results repeatedly highlight the ways the interrogators ask questions about detailed information (MONEY, PRODUCT, and DM) and the meaningful information in the deceivers' answering statements (A, one of the POS features, marks words that describe nouns, such as female, big, and small). Hence, the more detailed information we have, the higher the chance of detecting deceptions.

Conclusion and Future Work
This paper investigates whether the judgements and the speech and language cues of interrogators in conversation are useful for detecting deceptions. We analyzed a full suite of acoustic-prosodic features, linguistic cues, and conversational temporal dynamics under different conditions. Finally, with the late fusion of audio, text, and turn-taking dynamics (TTD) modality features, JEADDN obtains promising accuracies of 87.27% and 94.18% under the conditions that the interrogators trust and distrust the deceivers, improvements of 7.27% and 13.57%, respectively, over the model that does not consider the interlocutor's judgements. While there is some research on perceived deception detection, this is one of the first studies to explicitly model acoustic-prosodic characteristics, linguistic cues, and conversational temporal dynamics using the judgments of interrogators in conversations for detecting deceptions. Furthermore, we analyze the significance of different feature sets in three scenarios and show additional evidence that humans are bad at detecting deceptions. In particular, the content of the questions the interrogators ask is an indicator for distinguishing deceptions from truths when the interrogators distrust the deceivers. Verigin et al. (2019) also reveal that truthful and deceptive information interact to influence detail richness, which provides insight into liars' strategic manipulation of information when statements contain a mixture of truths and lies.
In immediate future work, we aim to extend our multimodal fusion framework to combine semantic information across multiple QA pairs to enhance the model's robustness and predictive power. That is, we observe that some interrogators finally trusted the deceivers after many follow-up questions even though the deceivers' statements were deceptive. Kontogianni et al. (2020) also pointed out that follow-up open-ended questions prompt additional reporting; however, practitioners should be cautious to corroborate the accuracy of newly reported details.

Table A.1: The Welch's t-test results on each feature set in three scenarios: (A) human-distrusted deceptive and truthful statements, (B) human-trusted deceptive and truthful statements, and (C) successful/unsuccessful deceptive and truthful statements ("*" indicates the significance threshold, p-value, is smaller than 0.01; "**" is smaller than 0.001).
Table A.2: The Welch's t-test results on Emobase in three different scenarios ("*" indicates the significance threshold, p-value, is smaller than 0.01; "**" is smaller than 0.001).