1. Introduction
Humans are notoriously poor at detecting when someone is lying. Ekman [1] highlights five reasons why this is so difficult: (1) for most of human history, people lived in small societies in which liars had a higher chance of being caught and faced worse consequences than today; (2) children are not taught how to detect lies, since even their parents want to hide some things from them; (3) people prefer to trust what they are told; (4) people often prefer not to know the truth; and (5) people are taught to be polite and not to steal information that is not offered. However, it has been argued that a person can learn to detect lies in another, given sufficient feedback (e.g., knowing that the other person lies 50% of the time) and a focus on micro-expressions [1,2].
Building on the above, the detection of deceptive behavior using facial analysis has proved feasible using macro- and, especially, micro-expressions [3,4,5]. However, micro-expressions are difficult to capture at standard frame rates and, given that humans can learn to spot them to detect lies, liars might use the same training to learn to hide them. Thus, there has been interest in detecting facial patterns of deceptive behavior that are not visible to the naked eye, such as the heat signature of the periorbital [6] or perinasal [7] region in thermal imagery, which cannot be perceived by human vision.
A crucial requirement for lie-detection research is the availability of adequate datasets; public datasets are a fundamental element of open innovation and accelerate research, as opposed to closed or private datasets, which characterize the opposite paradigm (closed innovation) [8]. Despite recent progress, however, acquiring training and, especially, evaluation material for lie detection remains challenging, particularly regarding the need for ground truth, namely, knowing whether a person is lying or not. The main difficulty is that such knowledge is not useful if the scenario is naively simulated (e.g., it is not sufficient to instruct a person to simply tell a lie): research on high-stakes lies suggests that deceptive behavior depends heavily on the potential consequences for the liar [9]. Thus, researchers have attempted to create artificial setups that convincingly reproduce situations where two factors converge: (1) there is a potential for genuine deceptive behavior; and (2) we know when a lie takes place and when the recorded subjects are telling the truth. Most attempts so far have focused on interview scenarios in which the participants are instructed to lie [6,7,10], although it is hard to simulate a realistic setting for genuine deceptive behavior. Alternatively, some researchers have worked in collaboration with police departments, benefiting from a scenario that, in many cases, is fully realistic, as it is based on interviews of criminal suspects. The problem in this setting, however, is the ground truth: legal decisions cannot be relied upon [11], and even the validity of confessions has been questioned [12].
In contrast, in this paper, we explore an alternative scenario in which participants are recorded while playing a competitive game where convincingly lying to the opponent(s) produces an advantage. On the one hand, participants are intrinsically motivated to lie convincingly; on the other hand, given the knowledge of the game rules, we can accurately determine whether a given behavior is honest or deceptive. The use of card games also benefits from the occurrence of unexpected events that produce genuine surprise for the potential liar, which has been highlighted as beneficial for lie-detection scenarios [9].
Thus, the goals of this paper are twofold. First, we present an annotated dataset, the Game Lie Dataset (GLD), based on frontal facial recordings of 19 participants who try their best to fool their opponents in the liar card game. Second, we depart from the dominant trend of micro-expression-based lie detection and investigate whether a lie can be detected by analyzing solely the facial patterns contained in single images, fed as input to state-of-the-art machine learning [13,14,15] and deep learning [16,17,18,19] facial analysis algorithms.
Using our collected dataset and several automatic lie-detection models, we perform lie-detection experiments under three different settings: (1) a generalization test to evaluate performance on unseen subjects; (2) a person-specific test to evaluate the possibility of learning how a given participant lies; and (3) a cross-task test to evaluate how the models generalize to a different acquisition setup. The overall contributions of this work can be summarized as follows:
We present GLD, a novel dataset containing color facial data together with ground-truth (lie/truth) annotations, captured during a competitive card game in which participants are rewarded for their ability to lie convincingly.
We also present quantitative comparison results of several machine learning (ML) and deep learning (DL) models tested on the newly captured dataset.
We provide several experiments that outline the current limitations of face-based lie detection across different lying tasks.
The combination of our novel lie-detection dataset with the respective evaluations of current ML and DL methods is expected to benefit research on automatic lie-detection systems, and can also be relevant for several real-life tasks in which understanding lie intentions in daily settings is important, such as the following:
Health practitioners/psychiatrists in general counseling: understanding whether patients are lying is important for improving their condition, e.g., in treating drug addiction.
Educators: knowing whether students might be lying during a test or experiment.
Credit assessment in the finance sector: knowing whether a prospective client is lying about their background and past.
The rest of this paper is organized as follows: in Section 2, we provide an overview of related work, covering both previous lie-detection methods and existing lie-detection datasets. In Section 3, we describe the characteristics of our collected dataset alongside the recording pipeline. In Section 4, we describe the ML- and DL-based techniques used to evaluate our dataset, along with the associated evaluation metrics. In Section 5, we present our experimental results, divided into generalized, person-specific and cross-task lie-detection settings. Finally, in Section 6, we provide our conclusions.
2. Related Work
Different approaches and techniques have been applied to the lie-detection task, with physiological cues being the most widely used. The most popular is the polygraph, commonly known as a lie-detector machine. Other approaches have used brain activity to detect deception via different neuro-imaging methods, such as fMRI [10,20,21,22]. For example, Markowitsch [22] compared brain scans of volunteers in a lie-detection experiment in which some participants were asked to lie while others had to tell the truth. When people were telling the truth, the brain region associated with certainty was activated, while in the case of lies, the area associated with mental imagination was activated. Similarly, the brain's hemoglobin signals (fNIRS) or electrical activity (EEG) can be measured to define physiological features for lie detection [23,24,25,26].
The main drawback of the above techniques, however, is their invasive and expensive nature, due to the special instruments needed for data collection. This has led to the emergence of less obtrusive approaches involving verbal and non-verbal cues. Several studies utilized thermal imaging for deception detection, since skin temperature has been shown to rise significantly when subjects are lying [7,27]. Speech has also been explored [28,29], e.g., by extracting features based on transcripts, part-of-speech (PoS) tags, or acoustic analysis (Mel-frequency cepstral coefficients).
The use of several modalities for lie detection has also been investigated to assess its impact on improving detection algorithms. In [30,31,32], both verbal and non-verbal features were utilized: the verbal features were extracted from linguistic cues in transcriptions, while the non-verbal ones consisted of binary features encoding facial and hand gestures. In addition, Soldner et al. [32] introduced dialogue features consisting of interaction cues. Other multi-modal approaches combined the previously mentioned verbal and non-verbal features with micro-expressions [3,4,5], thermal imaging [33], or spatio-temporal features extracted with 3D CNNs [34,35].
In the last decade, there has been growing interest in the use of facial images for lie detection, often based on micro-expressions [3,4,5,13,15] or facial action units [14], achieving the current state-of-the-art accuracy. Table 1 shows an overview of the major related works outlined in this section.
Existing Lie Detection Datasets
The availability of public datasets for a given task (in this case, lie detection) is important to stimulate and accelerate progress on the respective problem. In a way, this approach is an instance of open innovation, which has been shown to benefit a wider community (including universities and companies) and thereby increase direct impact [8], in comparison to privately developed and kept datasets (i.e., closed innovation). Although several works address the lie-detection task, only a few datasets have been published. In the literature, there are only two existing multi-modal, audio-visual datasets specifically constructed for lie detection: a multi-modal dataset based on the Box-of-Lies® TV game [32] and a multi-modal dataset using real-life Trial-Data [31].
Both the Box-of-Lies® and Trial-Data datasets include 40 labels for each gesture that a participant shows, as well as full transcripts for all videos. The difference between them lies in the interactions: in the Trial-Data, there is only a single speaker per video, and lies are judged from the information of this single speaker. In contrast, in the Box-of-Lies® data, the lies are identified from the interaction between two people while playing a game, with emphasis on their dialogue context. Thus, the Box-of-Lies® dataset also contains annotations of participants' feedback, in addition to veracity tags for each statement made. Further details of these two datasets can be seen in Table 2.
Even though the previous datasets provide a way to analyze the respective lying characteristics, some limitations remain. First, the interactions between participants are fairly limited, usually constrained to one-to-one lying settings. Furthermore, the facial areas are usually captured in highly varying settings and poses, which may hinder model learning [36,37]. In this work, we present a novel dataset that involves richer interactions between participants during lying, along with multiple different tasks. We also record our data in a controlled environment to reduce the variability of irrelevant image characteristics, such as lighting and extreme poses, thus allowing for more precise machine learning-based modeling that capitalizes on extracted features inherently more relevant for achieving a high level of prediction (in this case, lie detection) [38,39].
3. Game Lie Dataset
To establish an appropriate scenario for eliciting lies, we opted to use a card game called "The Liar", whose characteristics incentivize the participants to lie well in order to win. Furthermore, its simplicity and easy-to-learn rules allow for more efficient data collection. The winner of this game is the first participant to run out of cards.
Specifically, the game consists of dealing all cards among three or more players. Players may discard as many cards as they want, as long as all of them supposedly have the same number. However, the cards are placed face down, and thus players can lie about the number on the cards. A round starts when a player throws some cards; the player on their right then decides whether or not to believe them. If the next player believes the previous player, he/she has to throw some cards, stating that they have the same number as the ones already thrown. If, on the contrary, the next player does not believe the previous player, the thrown cards are checked: if the previous player was telling the truth, the current player has to take the cards; otherwise, the previous player takes the cards back. Thus, all players are encouraged to lie convincingly in order to get rid of their cards as quickly as possible, as illustrated by the sketch below.
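As an illustration, the core challenge rule can be sketched as follows. This is a minimal sketch with hypothetical helper names; the actual game involves full hands and turn order.

```python
# Minimal sketch of the challenge rule in "The Liar": when a claim is
# challenged, the thrown cards are revealed and the loser of the challenge
# picks them up.

def resolve_challenge(thrown_cards, claimed_number, previous_player, challenger):
    """Return the player who must take the thrown cards."""
    was_truthful = all(card == claimed_number for card in thrown_cards)
    # If the claim was true, the challenger takes the pile;
    # otherwise the liar takes their cards back.
    return challenger if was_truthful else previous_player

# Example: player A claims three 7s, but one card is actually a 5.
loser = resolve_challenge([7, 7, 5], 7, "A", "B")
print(loser)  # -> "A" (the lie is exposed, so A takes the cards back)
```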
These interactions between several players, along with the incentive to lie, enable us to observe the gestures that people exhibit when lying. Furthermore, the interactions between players also allow us to capture the dynamics as time progresses. The general workflow used to record this game is shown in Figure 1 and explained in the following sections.
3.1. Materials
We used the following materials for data collection: a deck of cards for the game scenario, RGB color cameras for face recording, video cameras for card recording, and a pair of lamps to improve the lighting conditions. Specifically, we operated two Intel RealSense D415 cameras for face recording, capturing RGB images at 30 fps. For recording the game cards, two Xiaomi Mi Action Camera 4K video cameras were used. The overall table setup for the data recording can be seen in Figure 2.
3.2. Participants
We recorded a total of 19 participants: 8 male and 11 female. The participants were a mix of graduate and undergraduate students from different universities and diverse study backgrounds. Their ages ranged between 21 and 26 years, and they expressed themselves in Spanish and Catalan throughout the data collection and interactions. Lastly, we obtained explicit consent from all participants to use and analyze the recorded facial images for research purposes.
3.3. Data Collection
We performed the data collection in a total of eight sessions, with participants assigned to groups of between 3 and 6; several rounds of the game were played in every session. Two participants were recorded at a time in each round. The setup was arranged so that each camera recorded a single face from the front, while the video cameras were located next to the recorded players' hands in order to record their cards. This allowed us to listen to the players' statements and determine whether they were lying according to the cards in the recording, which was crucial during the annotation process.
3.4. Data Annotation and Pre-Processing
We began data annotation and pre-processing by synchronizing the recorded face videos with the corresponding card videos, so that we could determine whether the corresponding player was lying. These synchronized videos were then annotated with the ELAN software, creating time-stamped comments over the selected time intervals. With these annotations, we were able to match each statement to its corresponding frames. Finally, we extracted the facial area from the relevant RGB frames using [40] and cropped it, saving it in the final collected dataset both as an image and as a point-cloud file.
3.5. Dataset Contents
We created a structured folder (the exact structure can be seen in Figure A1 in Appendix A) to ease future data loading and understanding during dissemination, with all recorded data stored in a root folder named Game Lies Dataset. Both images and 3D objects were named following a convention, e.g., 1_2.PNG or 3_4.PLY: the first number (1 and 3 in the examples) corresponds to the number of the statement, and the second number (2 and 4) to the frame within that statement. In this instance, the PNG example corresponds to the second frame of the first statement made by the participant in the recording.
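For illustration, this naming convention can be parsed with a few lines of code (a minimal sketch; the helper name is ours):

```python
# Minimal sketch for parsing the GLD naming convention: <statement>_<frame>.<ext>
from pathlib import Path

def parse_gld_name(filename):
    """Return (statement_id, frame_id) from a name such as '1_2.PNG' or '3_4.PLY'."""
    stem = Path(filename).stem          # e.g., "1_2"
    statement, frame = stem.split("_")
    return int(statement), int(frame)

print(parse_gld_name("1_2.PNG"))  # -> (1, 2): second frame of the first statement
```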
In the end, our collected Game Lie Dataset (GLD) contains data from 26 recordings with 18 different faces and a total of 15,566 frames, of which 6476 correspond to lies (41.6%) and 9090 to truths (58.4%). These frames correspond to a total of 417 statements, of which 170 are lies (40.8%) and 247 are truths (59.2%). Hence, on average, each lie statement spans about 38 frames, and each true statement about 37 frames.
Examples of the recorded participants can be seen in Figure 3. Notice that in several examples, the overall facial expressions are relatively similar, which makes this a challenging task for any vision-based lie-detection algorithm. Using these data, we can thus expect to perform an appropriate test of the effectiveness of current machine learning-based lie-detection approaches, which we detail in the next sections.
5. Experiments
We perform three different experiments for the lie-detection task: a generalization test, a person-specific test, and a cross-task lie-detection test. The first experiment evaluates the generalization capacity of the trained lie detector (cf. Section 4) to predict the lie status of never-before-seen participants (i.e., participants not used for training).
The second test assesses the full potential of the lie detector when dealing with a single participant (i.e., customized to a person). This is motivated by a recent report [49] suggesting that personal lying expressions may not be universal. Furthermore, the feelings about and willingness to lie may also differ per person: while some people feel displeased when lying, others may enjoy it [50]. Thus, by building and testing a specialized model for each participant, we can estimate the theoretical upper limit of our proposed lie detector.
Finally, the cross-task test demonstrates the potential real-life use of the lie detector when dealing with different kinds of lying conditions and with limited data. This test consists of taking the best-performing models from the two previous experiments and assessing their performance in real-time lie detection on different tasks.
5.1. Generalized Models
This section evaluates several ML and DL models in the general lie-task setting, using the metrics defined in the methodology section.
5.1.1. Experiment Settings
We used our recorded GLD dataset to perform the experiments, splitting the available recordings following a five-fold cross-validation scheme. For each split, we extracted both handcrafted and VGG features, and used them to train all classifiers (SVM, AdaBoost, LDA and FC). Finally, we tested the classifiers on the associated test split and measured performance using the defined metrics (cf. Section 4.3).
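The evaluation protocol can be sketched as follows. This is a minimal sketch assuming pre-extracted features X, per-frame labels y, and recording identifiers that group frames by recording so that test recordings never appear in training; classifier choice and fold seeding are illustrative.

```python
# Minimal sketch of the five-fold evaluation protocol with recording-level
# splits, so that test recordings are unseen during training.
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score

def evaluate_five_fold(X, y, recording_ids, make_clf=lambda: SVC(kernel="rbf")):
    """X: (n_frames, d) features; y: 1 = lie, 0 = truth."""
    accs, f1s = [], []
    for tr, te in GroupKFold(n_splits=5).split(X, y, groups=recording_ids):
        clf = make_clf()
        clf.fit(X[tr], y[tr])
        pred = clf.predict(X[te])
        accs.append(accuracy_score(y[te], pred))
        f1s.append(f1_score(y[te], pred))
    return float(np.mean(accs)), float(np.mean(f1s))
```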
5.1.2. Experiment Results of Classical Machine Learning
Table 3 shows the five-fold cross-validation accuracy and F1-score of LBP descriptors combined with several classifiers (SVM, AdaBoost and LDA). The best results were obtained with AdaBoost, reaching 52.6% accuracy and a 52 F1-score (indicated in bold face in the table), outperforming the other classifiers (SVM and LDA). Furthermore, in general, we notice that using 12 neighboring points (i.e., P = 12) and dividing the image into a 2 × 2 grid produces the best results, suggesting that modest parameter values are advantageous for the lie estimates.
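For reference, grid-based LBP features of this kind can be computed as in the sketch below. This is a minimal sketch assuming grayscale face crops; the radius R = 2 is an assumed value, since only P and the grid size are discussed in the text.

```python
# Minimal sketch of grid-based uniform-LBP feature extraction.
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_features(gray_face, P=12, R=2, grid=(2, 2)):
    """Concatenate uniform-LBP histograms computed over a grid of cells."""
    lbp = local_binary_pattern(gray_face, P, R, method="uniform")
    n_bins = P + 2                      # number of distinct uniform-LBP codes
    gh, gw = grid
    h, w = lbp.shape
    hists = []
    for i in range(gh):
        for j in range(gw):
            cell = lbp[i * h // gh:(i + 1) * h // gh,
                       j * w // gw:(j + 1) * w // gw]
            hist, _ = np.histogram(cell, bins=n_bins, range=(0, n_bins),
                                   density=True)
            hists.append(hist)
    return np.concatenate(hists)
```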
Table 4 shows the results of the HOG descriptor, obtained using the same five-fold cross-validation settings. We can see a pattern similar to the LBP results, where an 8 × 8 grid size with a modest 2 × 2 block of cells for computing the histogram produces better results. Again, the best accuracy is achieved by AdaBoost, with 53% accuracy and a 52.8 F1-score.
Finally, the results obtained with SIFT descriptors can be seen in Table 5, with a varying number of bag-of-words clusters (BoW, K). Here, we found that, in general, a K value of 800 is beneficial. Furthermore, the AdaBoost classifier achieves the best results, with an accuracy of 53% and an F1 value of 52.2.
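A typical SIFT bag-of-words pipeline of this kind is sketched below: local descriptors are clustered into K visual words, and each image is then described by its word histogram. This is a minimal sketch; K = 800 follows the best-performing value above, while the clustering algorithm is an assumed choice.

```python
# Minimal sketch of a SIFT bag-of-words (BoW) feature pipeline.
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

sift = cv2.SIFT_create()

def sift_descriptors(gray_face):
    _, desc = sift.detectAndCompute(gray_face, None)
    return desc if desc is not None else np.empty((0, 128), np.float32)

def fit_vocabulary(train_faces, K=800):
    """Cluster all training descriptors into K visual words."""
    all_desc = np.vstack([sift_descriptors(f) for f in train_faces])
    return MiniBatchKMeans(n_clusters=K, random_state=0).fit(all_desc)

def bow_histogram(gray_face, vocab):
    """Describe one face crop as a normalized visual-word histogram."""
    desc = sift_descriptors(gray_face)
    if len(desc) == 0:
        return np.zeros(vocab.n_clusters)
    words = vocab.predict(desc.astype(np.float32))
    hist = np.bincount(words, minlength=vocab.n_clusters)
    return hist / hist.sum()
```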
Figure 6 shows examples of TP, FP, TN and FN cases for the best performer among the classical machine learning models. Notice that the facial expressions are quite similar across the examples, with slight changes in the mouth area in the correctly classified cases (TP and TN). In the failure cases (FP and FN), however, the facial area is mostly neutral, which may confuse the proposed methods in their predictions.
5.1.3. Experiment Results of Deep Learning
Table 6 presents the results of using CNN features with both classical classifiers (LDA, AdaBoost, SVM) and the neural network-based FC classifier. The results with classical classifiers are quite similar to the modest results from the previous sections, suggesting their limitations. Furthermore, we found that using SVM leads to degenerate predictions (i.e., all lie values are predicted as a single class), producing the NA value. With the FC-based classifier, however, the results improve, reaching 57.4% accuracy and a 58.3 F1 value. We also note that, in one fold, the VGG + FC model reached 62.76% accuracy and a 64.34 F1 value, as shown in Table 7. This indicates the suitability and superiority of the deep learning-based model for these lie-detection tasks.
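The FC classifier on top of the 512-dimensional VGG-face features can be sketched as follows. This is a minimal sketch in PyTorch under assumed layer sizes and label coding; the exact architecture and training setup are not reproduced here.

```python
# Minimal sketch of an FC classifier over 512-dimensional VGG-face features.
import torch
import torch.nn as nn

class LieFC(nn.Module):
    def __init__(self, in_dim=512, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(hidden, 2),       # two classes: lie / truth
        )

    def forward(self, x):
        return self.net(x)

model = LieFC()
features = torch.randn(8, 512)          # a batch of VGG-face features
logits = model(features)
pred = logits.argmax(dim=1)             # 1 = lie, 0 = truth (assumed coding)
```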
Figure 7 shows visual examples of the TP, FP, TN and FN cases of the deep feature-based lie detector. We can observe that, in general, there is more variety in the facial expressions across examples compared to those of the classical machine learning-based detector. In the failure cases (FP and FN), the expressions are also more visible than neutral. However, there is a similarity in that, for the correctly classified cases (TP and TN), the visual changes again occur in the mouth area. This variety of expressions suggests the expressiveness of the VGG features, which may help classify lies more accurately than the hand-crafted descriptors.
5.1.4. Overall Comparisons
The overall comparison of the best performers among all evaluated models is shown in Table 8. Overall, the classical machine learning techniques for lie detection yield quite modest results (close to 50% accuracy). On the other hand, the deep learning-based model produces more accurate estimates, achieving the best accuracy on this dataset so far, with 57.4% accuracy and a 58.3 F1 value. Indeed, our results are quite comparable with those of other relevant works on lie detection, such as [32], where a classical machine learning approach (random forest) was used, and [31], where human evaluators were employed. Given the current results, we conclude that the general lie-detection task is quite challenging (even in comparison with human ability), and that evaluating various ML/DL-based models provides further insight into how much current automatic lie-detection approaches can handle. In the next section, we show how our predictions can be improved by targeting each specific individual separately to learn his/her unique personal characteristics when attempting to lie.
5.2. Person Specific Models
This section evaluates the upper accuracy limit achieved by the best-performing model from the previous section when dealing with the specific lying characteristics of each participant.
5.2.1. Experiment Settings
In this experiment, we use the best-performing model from the previous comparisons (i.e., VGG + FC) for individual-based lie detection. We do this by training the model on an equal number of frames from each participant and testing it on that participant's remaining frames.
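The per-participant split can be sketched as follows (a minimal sketch; variable and function names are ours, and the per-participant training budget n_train is an assumed parameter):

```python
# Minimal sketch of the person-specific protocol: for each participant, train
# on a fixed number of their frames and test on the remaining ones.
import numpy as np

def person_specific_split(features, labels, n_train, seed=0):
    """Split one participant's frames into train/test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(features))
    train, test = idx[:n_train], idx[n_train:]
    return (features[train], labels[train]), (features[test], labels[test])
```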
5.2.2. Experiment Results
Table 9 summarizes the obtained test accuracy for all participants, with the column "ALL" containing the mean of the achieved results. Here, we can observe that the overall prediction accuracy is higher, with an average accuracy of 65% and an F1 score of 63.12, and a maximum accuracy of 97.8% (with a 65.7 F1 score) in the case of participant 11. This higher accuracy may reflect the easier task that the model handles, given the narrower set of examples and the specialized facial expressions that each person projects while lying. It thus further confirms the previously mentioned hypothesis that each person has unique characteristics when lying. Therefore, building personal, specialized models tailored to each individual could serve as an additional step to improve lie detection in the application domain.
5.3. Cross Lie Detection Tasks
In this section, the best-performing models from the two previous sections are integrated into a single real-time detection algorithm and exposed to different lying tasks.
5.3.1. Experiment Settings
We perform two cross-task lie experiments: card number uttering and sentence filling. The first test is a simulation of the card game, where the subject holds a deck of cards, takes one card, and either utters the real number or produces a fake one. The second, on the other hand, involves reading sentences with blank spaces that the subject has to fill with either real or fake information at the time of reading (example sentences can be found in Appendix B). We perform both tests with one training participant and two test subjects. That is, we first train the model using the data from the training participant performing both tasks (which is comparable to the person-specific setup of Section 5.2, though now on different tasks). Subsequently, we use the pre-trained model to detect the lies of the two test subjects performing the same tasks.
To collect the samples, we implemented a simple application that integrates several modules: face tracking and cropping [51], VGG-face 512-dimensional feature extraction [52], and sample prediction as True (not lie) or False (lie). An example of the proposed program can be seen in Figure 8. Using this program on the fly, we can then classify a statement made by the participants: a statement is considered a lie if more than 30% of its frames are predicted as "lie" by the program.
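This statement-level decision rule can be expressed in a few lines (a minimal sketch; the function name is ours):

```python
# Minimal sketch of the statement-level decision rule: a statement is flagged
# as a lie when the fraction of frames predicted as "lie" exceeds 30%.

def classify_statement(frame_predictions, threshold=0.30):
    """frame_predictions: iterable of per-frame labels, 1 = lie, 0 = truth."""
    preds = list(frame_predictions)
    lie_ratio = sum(preds) / len(preds)
    return lie_ratio > threshold       # True -> the statement is judged a lie

print(classify_statement([1, 1, 1, 1, 0, 0, 0, 0, 0, 0]))  # 40% lie frames -> True
print(classify_statement([1, 1, 0, 0, 0, 0, 0, 0, 0, 0]))  # 20% lie frames -> False
```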
5.3.2. Experiment Results
Table 10 presents the results obtained from the evaluation of both tasks. As expected, the proposed model struggles to correctly predict the true lie labels on both the training and test sets, judging by the low accuracy. Specifically, the best training accuracy of 52% and F1 score of 54.9 are far lower than those of the person-specific test (cf. Section 5.2), which reached 65% and 63.12, respectively. Furthermore, the test predictions are also considerably low, only reaching 43.59% accuracy and an F1 score of 38.1. This indicates the difficulty of this prediction task, given the different characteristics of the lying condition itself combined with people's personalized ways of lying. These findings are relevant for applications, especially as a reminder that lie-detector models may inherently perform better when tailored to the specific cases for which they are exclusively trained/designed (i.e., training the model for each specific lying task).
6. Conclusions
In this paper, we presented a comparison of several machine learning-based lie-detection models applied to our newly collected Game Lie Dataset (GLD). We did so by first collecting the new dataset, using several instruments and involving 19 participants in a customized card game designed to incite lying. Second, we pre-processed the data in a structured way to allow for easier loading and future dissemination. Lastly, we cropped the facial area and performed the annotation to complete the dataset production.
Using our collected dataset, we built classical machine learning models by adopting three handcrafted features (LBP, HOG and SIFT) that were then used for lie classification with the classical classifiers SVM, AdaBoost and LDA. Furthermore, we included the deep learning-based VGG features to build a fully end-to-end system with fully connected layers, to be compared against its semi-classical counterparts that use the aforementioned classical classifiers for prediction.
To evaluate the proposed models on the lie-detection tasks, we performed three main experiments: generalized tests, person-specific tests, and cross-task lie-detection tests. In the generalized tests, we found classical methods to be limited compared to deep learning-based models, based on the higher accuracy reached by the latter; visual inspection further revealed more diverse expressions captured by the deep learning-based model, suggesting its effectiveness. In the second task, our model achieved generally higher accuracy, given its simpler task of dealing with a single individual, allowing for more effective learning; this also supports the hypothesis that each individual shows unique facial expressions when lying. In the last task, we observed the models' difficulty in properly predicting the lie labels, given the inherent characteristics of the new tasks and their associated unique ways of lying.
In the future, we plan to record additional physiological signals to improve the model estimates and enable more diverse analyses. Lastly, the findings of our work could be utilized in several potential applications where knowledge of one's true intentions (lie detection) on a daily basis is of paramount importance, such as health counseling, academic examinations, or banking credit scenarios.