Machine Learning-based Lie Detector applied to a Novel Annotated Game Dataset

Lie detection is a concern for everyone in their day-to-day life, given its impact on human interactions. Thus, people normally pay attention both to what their interlocutors are saying and to their visual appearance, including the face, to try to find any signs that indicate whether the person is telling the truth or not. While automatic lie detection may help us to understand these lying characteristics, current systems are still fairly limited, partly due to the lack of adequate datasets to evaluate their performance in realistic scenarios. In this work, we have collected an annotated dataset of facial images, comprising both 2D and 3D information of several participants during a card game that encourages players to lie. Using our collected dataset, we evaluated several types of machine learning-based lie detectors through generalization, person-specific and cross-domain experiments. Our results show that models based on deep learning achieve the best accuracy, reaching up to 57\% for the generalization task and 63\% when dealing with a single participant. Finally, we also highlight the limitations of deep learning-based lie detectors when dealing with cross-domain lie detection tasks.


I. INTRODUCTION
It is very hard for humans to detect when someone is lying.
Ekman [1] highlights five reasons to explain why it is so difficult for us: 1) during most of human history, people lived in smaller societies in which liars had more chances of being caught, with worse consequences than nowadays; 2) children are not taught how to detect lies, since even their parents want to hide some things from them; 3) people prefer to trust what they are told; 4) people may prefer not to know the real truth; and 5) people are taught to be polite and not to steal information that is not given to them. However, it has been argued that it is possible for someone to learn how to detect lies in another person given sufficient feedback (e.g. knowing that 50% of the time that person is lying) and focusing on micro-expressions [1], [2]. Building from the above, the detection of deceptive behavior using facial analysis has been proved feasible using macro- and, especially, micro-expressions [3], [4], [5]. However, micro-expressions are difficult to capture at standard frame rates and, given that humans can learn how to spot them to perform lie detection, the same training might be used by liars to learn how to hide them. Thus, there has been interest in detecting facial patterns of deceptive behavior that might not be visible to the naked eye, such as the heat signature of the periorbital [6] or perinasal region [7] in thermal imagery, which cannot be perceived by human vision.

Nuria Rodriguez Diaz, D. Aspandi, F. Sukno and X. Binefa are with the Department of Information and Communication Technology, Universitat Pompeu Fabra, Barcelona, Spain, 08026. E-mail: nuriarodriguezdiaz@gmail.com. D. Aspandi is also with the Institute for Parallel and Distributed Systems, University of Stuttgart, Stuttgart, Germany, 70569. E-mail: decky.aspandi-latif@ipvs.uni-stuttgart.de.
One of the crucial aspects to appropriately address lie-detection research is the availability of adequate datasets. The acquisition of training and, especially, evaluation material for lie detection is a challenging task, particularly regarding the necessity to gather ground truth, namely, to know whether a person is lying or not. The main difficulty arises because such knowledge is not useful if the scenario is naively simulated (e.g. it is not sufficient to instruct a person to simply tell a lie). Research on high-stakes lies suggests that deceptive behavior can depend heavily on the potential consequences for the liar [8]. Thus, researchers have attempted to create artificial setups that can convincingly reproduce situations where two factors converge: 1) there is a potential for truthful deceptive behavior; 2) we know when deception takes place and when the recorded subjects are telling the truth. Most attempts so far have focused on interview scenarios in which the participants are instructed to lie [7], [6], [9], although it is hard to simulate a realistic setting for genuine deceptive behavior. Alternatively, some researchers have worked in collaboration with police departments, with the benefit of a scenario that in many cases is 100% realistic, as it is based on interviews with criminal suspects. However, the problem in this setting is the ground truth: it is not possible to rely on legal decision-making [10], and even the validity of confessions has been questioned [11].
In contrast, in this paper we explore an alternative scenario in which participants are recorded while playing a competitive game in which convincingly lying to the opponent(s) produces an advantage. On the one hand, participants are intrinsically motivated to lie convincingly. On the other hand, given the knowledge of the game rules, we can accurately determine whether a given behavior is honest or deceptive. The use of card games can also benefit from the occurrence of unexpected events that produce genuine surprise in the potential liar, which has been highlighted as beneficial for lie detection scenarios [8].
Thus, the goals of this paper are two-fold. Firstly, we present an annotated dataset, the Game Lie Dataset (GLD), based on frontal facial recordings of 19 participants who try their best to fool their opponents in the liar card game. Secondly, we depart from the dominating trend of lie detection based on micro-expressions and investigate whether a lie can be detected by analyzing solely the facial patterns contained in single images as input to cutting-edge machine learning [12], [13], [14] and deep learning [15], [16], [17], [18] facial analysis algorithms.
Using our collected dataset and several automatic lie detection models, we perform lie detection experiments under 3 different settings: 1) Generalization test, to evaluate the performance on unseen subjects; 2) Person specific test, to evaluate the possibility to learn how a given participant would lie, and 3) Cross-domain test, to evaluate how the models generalize to a different acquisition setup. The contributions of this work can be summarized as follows: 1) We present the GLD dataset, which contains coloured facial data as well as ground truth (lie/true) annotations, captured during a competitive card game in which participants are rewarded for their ability to lie convincingly. 2) We also present quantitative comparisons results of several machine learning models tested on the new captured dataset. 3) We present several experiments that outline the current limitations of facial-based lie detection when dealing with several different lie tasks.
II. RELATED WORK

Different approaches and techniques have been applied to the lie detection task, with physiological cues widely and commonly used. The most popular one is the polygraph, commonly known as a lie detector machine. Other approaches have used brain activity to detect deception, utilising different neuro-imaging methods such as fMRI [19], [20], [21], [9]. For example, Markowitsch [21] compared brain scans from volunteers in a lie-detection experiment, in which some participants were asked to lie and others had to tell the truth. It was found that when people were telling the truth, the brain region associated with sureness was activated, while in the case of lies the area associated with mental imagination was activated. Similarly, the brain's hemoglobin signals (fNIRS) or electrical activity (EEG) can be measured to define physiological features for lie detection [22], [23], [24], [25].
The main drawback of the above techniques, however, is their invasive and expensive nature, due to the need for special instruments for data collection. This has led to the emergence of less invasive approaches involving verbal and non-verbal cues. Several studies focus on utilising thermal imaging to perform the deception detection task, since skin temperature has been shown to rise significantly when subjects are lying [26], [27]. Furthermore, speech has also been explored [28], [29], e.g. by extracting features based on transcripts, part-of-speech (PoS) tags, or acoustic analysis (Mel-frequency cepstral coefficients).
The use of several modalities for lie detection has also been investigated to assess its impact on improving the detection algorithms. In [30], [31], [32], both verbal and non-verbal features were utilised. The verbal features were extracted from linguistic features in transcriptions, while the non-verbal ones consisted of binary features containing information about facial and hand gestures. In addition, [32] introduced dialogue features, consisting of interaction cues. Other multi-modal approaches have combined the previously mentioned verbal and non-verbal features with micro-expressions [3], [4], [5], thermal imaging [33], or spatio-temporal features extracted from 3D-CNNs [34], [35].
In the last decade, there has been growing interest in the use of facial images to perform lie detection, often based on micro-expressions [12], [14], [3], [4], [5] or facial action units [13], achieving the current state-of-the-art accuracy.

A. Existing Lie detection datasets
Despite the existing work on lie detection tasks, only a few datasets have been published. In the literature, there are only two existing multi-modal, audio-visual datasets that were specifically constructed for lie detection tasks: a multi-modal dataset based on the Box-of-Lies® TV game [32] and a multi-modal dataset using real-life trial data [31].
Both the Box-of-Lies and Trial-Data datasets include 40 labels for each gesture a participant shows, as well as full transcripts for all videos. The difference between them lies in the interactions: in the Trial-Data there is only a single speaker per video, and lies are judged from the information of this single speaker. In contrast, in the Box-of-Lies®, the lies are identified from the interaction between two people while playing a game, with emphasis on their dialogue context. Thus, the Box-of-Lies® dataset also contains annotations of participants' feedback, in addition to veracity tags for each statement made.
Even though previous datasets have provided a way to analyse the respective lying characteristics, some limitations still exist. The first is that the interactions between participants are fairly limited, usually constrained to a one-to-one lying setting. Furthermore, the faces are usually captured in widely different settings and poses, which may hinder model learning. In this work, we present a novel dataset that involves more interaction between participants while lying. We also record our data in a controlled setting (environment) to reduce the variability of irrelevant image characteristics, such as lighting and extreme poses, to allow for more precise machine learning-based modelling.

III. DECEPTION DATASET
In order to establish an appropriate scenario for eliciting lies, we opt to use a card game called "The Liar", due to the unique characteristics of this game that incentivise the participants to lie well in order to win. Furthermore, its simplicity and easy-to-learn nature allow for more efficient data collection. The winner of this game is the first participant to run out of cards.
Specifically, the game consists of dealing all cards among three or more players. In principle, players may throw as many cards as they want as long as all of them have the same number. However, the cards are turned face down, and thus players can lie about the number on the cards. A game round starts when a player throws some cards; then, the player on the right decides whether to believe the previous player or not. If the next player believes the previous player, he/she has to throw some cards, stating that they have the same number as the ones already thrown. If, on the contrary, the next player does not believe the previous player, the thrown cards are checked. Finally, if the previous player was telling the truth, the current player has to take the cards; otherwise, the previous player takes the cards back. Thus, all players are encouraged to lie well in order to get rid of as many cards as possible. These interactions between several players, along with the incentive to lie, enable us to observe the gestures that people exhibit when lying. Furthermore, the interactions between players also allow us to capture the dynamics as time progresses. The general workflow used to record this game is shown in Figure 1 and explained in the following sections.
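Under these rules, the ground-truth label follows mechanically from comparing the number a player claims with the cards actually thrown. A minimal sketch of this labelling rule (the helper name is hypothetical, not part of our tooling):

```python
def is_lie(claimed_number, thrown_cards):
    # A statement is a lie as soon as any face-down card
    # differs from the number the player claimed.
    return any(card != claimed_number for card in thrown_cards)
```

For example, claiming "two sevens" while actually throwing a 7 and a 3 counts as a lie, whereas throwing two genuine 7s does not.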

A. Materials
We used the following materials to perform the data collection: a deck of cards for the game scenario, RGB color cameras for face recording, video cameras for card recording, and a pair of lamps to improve the lighting conditions. Specifically, we operated two Intel RealSense D415 cameras for face recording, with a frame rate of 30 fps for RGB images. For recording the game cards, two Mi Action Camera 4K video cameras by Xiaomi were used. The overall table setup for the data recording can be seen in Figure 2.

B. Participants
We recorded a total of 19 participants, 8 male and 11 female. The participants are a mix of graduate and undergraduate students from different universities and from diverse study areas (backgrounds). Their ages range between 21 and 26 years old, and they expressed themselves in Spanish and Catalan throughout the data collection and interactions. Lastly, we obtained explicit consent from all participants to use and analyse the recorded facial images for research purposes.

C. Data Collection
We performed the data collection in a total of eight sessions, with the participants assigned to different groups. These groups varied between 3 and 6 participants, and several rounds of the game were played in every session. Furthermore, two participants were recorded at a time in each round. The scenario was set up such that each camera was able to record a single face from the front, while the other video cameras were located next to the recorded players' hands, in order to record their cards. This allows us to listen to the players' statements and determine whether they are lying according to the cards in the recording, which is crucial during the annotation process.

D. Data Annotation and Pre-processing
We begin our data annotation and pre-processing by synchronizing the recorded videos of each face with the corresponding cards. This is done in order to determine whether the corresponding player is lying. These synchronized videos are subsequently annotated with the ELAN software to create comment stamps over selected spans of time. With these annotations, we are able to match the statements with their corresponding frames. Finally, we extract the facial area from the relevant RGB frames using [36] and crop it, to be saved in the final collected dataset as an image, as well as a point-cloud file.
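Since the faces are recorded at 30 fps, each time-stamped ELAN annotation interval can be mapped to the frame indices it covers. A small sketch of this mapping (assuming start and end times in seconds; the function name is ours, not part of ELAN):

```python
def annotation_to_frames(start_s, end_s, fps=30):
    # Convert an annotated time interval (in seconds) into the
    # inclusive range of video frame indices it spans at `fps`.
    first = int(start_s * fps)
    last = int(end_s * fps)
    return list(range(first, last + 1))
```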

E. Dataset Contents
We created a structured folder hierarchy (as seen in Figure 3) to ease future data loading and understanding during dissemination, with all recorded data stored under a root folder named Game Lies Dataset. Both images and 3D objects are named following the convention 1_2.PNG or 3_4.PLY, where the first number (1 and 3 in the examples) corresponds to the number of the statement and the second number (2 and 4) to the corresponding statement frame. In this instance, the PNG file contains the cropped RGB facial image, and the PLY file the associated point cloud. Examples of the recorded participants can be seen in Figure 4. Notice that in several examples the overall facial expressions are relatively similar, which makes this a challenging task for any vision-based lie detection algorithm. Thus, using this data, we can expect to perform an appropriate test of the effectiveness of current machine learning-based lie detection approaches, which we detail in the next sections.
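Assuming an underscore separates the two numbers in each file name (e.g. 1_2.PNG), a loader can recover the statement and frame indices directly from the name; a hypothetical sketch:

```python
import os

def parse_sample_name(filename):
    # Split a GLD file name such as '1_2.PNG' or '3_4.PLY' into
    # (statement number, frame number, file type).
    stem, ext = os.path.splitext(filename)
    statement, frame = stem.split("_")
    return int(statement), int(frame), ext.lstrip(".").upper()
```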

IV. METHODOLOGY
We use our recorded GLD dataset to evaluate both classical machine learning approaches and deep learning techniques for this specific lie detection task. In this context, we use the facial area as the main modality, with the lie labels obtained following the protocol explained in our dataset collection.

A. Classical Machine Learning
We use three different handcrafted features extracted from the RGB facial images: Local Binary Patterns (LBP), Histograms of Oriented Gradients (HOG), and Scale-Invariant Feature Transform (SIFT) descriptors, following [40] and [41] for the technical implementation. Using these handcrafted features, we then employ three classifiers to predict the lie (all implementations are based on the Scikit-learn library [41]): Support Vector Machine (SVM), AdaBoost, and Linear Discriminant Analysis (LDA). These processes are summarised in Figure 5.
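As an illustration of this pipeline, the sketch below computes a basic 8-neighbour LBP code per pixel, pools the codes into per-cell histograms over a 2x2 grid, and feeds the result to an AdaBoost classifier. It is a simplified re-implementation on random stand-in data, not the exact descriptors of [40], [41] or the real facial crops:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def lbp_histogram(img, grid=(2, 2)):
    # Compare each interior pixel with its 8 neighbours to form an
    # 8-bit LBP code, then pool codes into per-cell 256-bin histograms.
    c = img[1:-1, 1:-1]
    code = np.zeros(c.shape, dtype=np.uint8)
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = img.shape
    for bit, (dy, dx) in enumerate(shifts):
        neigh = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        code |= (neigh >= c).astype(np.uint8) << bit
    feats = []
    gh, gw = code.shape[0] // grid[0], code.shape[1] // grid[1]
    for i in range(grid[0]):
        for j in range(grid[1]):
            cell = code[i * gh:(i + 1) * gh, j * gw:(j + 1) * gw]
            hist, _ = np.histogram(cell, bins=256, range=(0, 256))
            feats.append(hist / max(hist.sum(), 1))
    return np.concatenate(feats)

rng = np.random.default_rng(0)
X = np.stack([lbp_histogram(rng.integers(0, 256, (64, 64)))
              for _ in range(40)])
y = rng.integers(0, 2, 40)                 # stand-in lie / truth labels
clf = AdaBoostClassifier(n_estimators=50).fit(X, y)
```

The same `X` matrix could equally be fed to `SVC` or `LinearDiscriminantAnalysis`, matching the three classifiers used in the paper.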

B. Deep Learning
For the deep learning-based approach, we perform transfer learning by means of the embedded features from the VGG-Very-Deep-16 CNN [42]. Specifically, we feed the cropped facial images to the pre-trained VGG model and store the embedded features. Using these embedded features, we then train the same classifiers as explained in the previous section to obtain baseline results. To enable a fully trained deep learning model, we then use the CNN features as input to a fully connected neural network consisting of two hidden layers with 256 and 128 units respectively, and an output layer. Both hidden layers use the rectified linear unit (ReLU) activation function, whereas the output unit uses the sigmoid activation function to classify True (lie) and False (not lie) samples. The model is compiled with the Adam optimizer with a learning rate of 0.001 and uses the binary cross-entropy loss. We utilised the Keras library [43] for the concrete implementation. Finally, Figure 6 shows an overview of these processes.
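The fully connected head can be sketched as follows. Since Keras may not be available everywhere, this sketch uses scikit-learn's `MLPClassifier` as a stand-in with the same layer sizes, ReLU activations, Adam optimizer and learning rate; the random vectors merely stand in for real VGG embeddings:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 512))   # stand-in for VGG-face embeddings
y = rng.integers(0, 2, 120)       # lie (1) / truth (0) labels

# Two ReLU hidden layers of 256 and 128 units, Adam with lr = 0.001;
# the log-loss on two classes matches the binary cross-entropy objective.
clf = MLPClassifier(hidden_layer_sizes=(256, 128), activation="relu",
                    solver="adam", learning_rate_init=0.001,
                    max_iter=50, random_state=0).fit(X, y)
```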

C. Comparison Metrics
We used both Accuracy and F1-score to judge the quality of the lie estimations of all evaluated approaches. These metrics are calculated as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

F1 = 2TP / (2TP + FP + FN)

where TP, FP, TN, and FN are the numbers of True Positive, False Positive, True Negative, and False Negative examples, respectively.
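In code, these two metrics reduce to simple ratios over the confusion-matrix counts:

```python
def accuracy(tp, fp, tn, fn):
    # Fraction of all examples that are classified correctly.
    return (tp + tn) / (tp + tn + fp + fn)

def f1(tp, fp, fn):
    # Harmonic mean of precision tp/(tp+fp) and recall tp/(tp+fn),
    # which simplifies to 2*tp / (2*tp + fp + fn).
    return 2 * tp / (2 * tp + fp + fn)
```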

V. EXPERIMENTS
We perform three different experiments for the lie detection task: the Generalization Test, the Person Specific Test, and the Cross Lie Detection Test. The first experiment evaluates the generalization capacity of the trained lie detectors (cf. Section IV) to predict the lie status of never-seen-before participants (i.e. not used for training).
The second test assesses the full potential of the lie detector when dealing with a unique participant (i.e. customised to a person). This is motivated by a recent report [44] suggesting that personal lying expressions may not be universal. Furthermore, the feelings about and willingness to lie may also differ per person: while someone may feel displeased when lying, other people could enjoy it [45]. Thus, by building and testing a specialized model for each participant, we can see the theoretical limit of our proposed lie detector. Finally, the Cross Lie Detection Test demonstrates the potential real-life use of the lie detector when dealing with different kinds of lying conditions and with limited data. This test consists of taking the model with the best performance in both of the previous experiments and assessing its performance in real-time lie detection (on a different task).
A. Generalized Models

1) Experiment Settings: We used our recorded GLD dataset to perform the experiments by splitting the available recordings following a 5-fold cross-validation scheme. We extracted both handcrafted and VGG features using the corresponding split, then used them to train all classifiers (SVM, LDA and FC). Finally, we tested on the associated test split and measured the performance using the defined metrics (cf. IV-C).
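To keep the test participants unseen, the 5-fold split can group frames by participant; a sketch on random stand-in features using scikit-learn's `GroupKFold` (the group sizes, feature dimension and SVM classifier here are illustrative only):

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(1)
X = rng.normal(size=(190, 64))          # stand-in frame features
y = rng.integers(0, 2, 190)             # stand-in lie / truth labels
groups = np.repeat(np.arange(19), 10)   # 19 participants, 10 frames each

accs, f1s = [], []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups):
    # Every participant falls in exactly one test fold, so the
    # classifier is always evaluated on never-seen-before subjects.
    clf = SVC().fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    accs.append(accuracy_score(y[test_idx], pred))
    f1s.append(f1_score(y[test_idx], pred, zero_division=0))
```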
2) Experiment Results of Classical Machine Learning: Table II shows the 5-fold cross-validation accuracy and F1 score of the LBP descriptors combined with several classifiers (SVM, AdaBoost and LDA). We can see that the best results are obtained with AdaBoost, reaching 52.6% accuracy and an F1 score of 52, which is better than other classifiers such as SVM and LDA. Furthermore, in general we notice that using 12 neighbouring points (i.e. P = 12) and dividing the image with a 2x2 grid produces the best results. This suggests that modest parameter values are advantageous to improve the lie estimates.
The results of the HOG descriptor, obtained using the same 5-fold cross-validation settings, can be seen in Table III. We can see a similar pattern to the LBP results: using an 8x8 grid size with a modest 2x2 block of cells to compute the histogram produces better results. Furthermore, we note that the best accuracy is again achieved by AdaBoost, reaching 53% accuracy and an F1 score of 52.8.
Finally, the results obtained for the SIFT descriptors can be seen in Table IV, with a varying number of bag-of-words clusters (BoW K). Here we find that, in general, a K value of 800 is beneficial. Furthermore, the AdaBoost classifier achieves the maximum results, with an accuracy of 53% and an F1 score of 52.2. Figure 7 shows examples of TP, FP, TN and FN cases for each of the best-performing classical machine learning models. Notice that the facial expressions are quite similar across the examples, with slight changes in the mouth area in the correctly classified cases (TP and TN), whereas in the failed recognitions (FP and FN) the facial area is mostly neutral, which may confuse the proposed models in their predictions.
3) Experiment Results of Deep Learning: We present the results of using CNN features with both the classical classifiers (LDA, AdaBoost, SVM) and the neural network-based FC classifier in Table V. We can see that the results from the classical classifiers are quite similar to the results from the previous sections, i.e. modest, suggesting their limitations. Furthermore, we found that using SVM led to degenerate estimates (the lie values were all predicted as one class, i.e. no change), hence the NA values. However, with the FC-based classifier, the results improve, reaching 57.4% accuracy and an F1 score of 58.3. We also note that in one fold the VGG + FC model was able to reach 62.76% accuracy and an F1 score of 64.34, as shown in Table VI. We also see that, in the case of failures (FP and FN), the expressions are more visible compared to neutral ones. However, there is also a similarity in the correctly classified cases (TP and TN), where the visual changes happen in the mouth area in these examples. This variety of expressions suggests the expressiveness of the VGG features, which may help to classify lies more accurately than the handcrafted descriptors.

4) Overall Comparisons:
We can see the overall comparison of the best performers among all evaluated models in Table VII. Overall, the classical machine learning techniques for lie detection yield quite modest results (close to 50% accuracy). On the other hand, the deep learning-based model produces more accurate estimates, achieving the best accuracy so far on this dataset, with 57.4% accuracy and an F1 score of 58.3. Indeed, our results are quite comparable with other relevant works on lie detection, such as the report from [32], where a classical machine learning approach (i.e. random forest) was used, and [31], which involved real humans.
B. Person Specific Models

1) Experiment Settings: In this experiment, we use the best-performing model from the previous comparisons (i.e. VGG + FC) for individual-based lie detection. We do this by training the model on an equal number of frames from each participant and testing it on the remaining frames.
2) Experiment Results: Table VIII summarizes the test accuracy obtained for all participants, with the column "ALL" containing the mean of the achieved results. Here we can observe that the overall prediction accuracy is higher, with an average accuracy of 65% and an F1 score of 63.12, and an even higher maximum accuracy for some participants.

C. Cross Lie Detection Tasks

1) Experiment Settings: We perform two major cross-lie tasks in this experiment: Card Number Uttering and Sentence Filling. The first is a simulation of the card game in which the subject, holding a deck of cards, has to take one card and either utter the real number or produce a fake number. The second, on the other hand, involves reading sentences with blank spaces that have to be filled by the subject with either real or fake information at the time of reading each sentence (example sentences can be found in the appendix). We perform both tests with one training participant and two test subjects. That is, we first train the model using the data from the training participant performing both tasks (thus quite comparable to the person-specific task in Section V-B, though now on a different task). Subsequently, we use the pre-trained model to detect lies from the two test subjects while they conduct the same tasks. To collect the samples, we implemented a simple application that integrates different modules: face tracking and cropping [46], VGG-face 512-dimensional feature extraction [47], and sample prediction as True (lie) or False (not lie). An example of the proposed program can be seen in Figure 9. Using this program on the fly, we can predict a statement made by the participants: a statement is considered a lie if more than 30% of its frames are predicted as "lie" by the proposed program.
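The statement-level decision rule described above can be sketched as follows, where `frame_preds` holds the per-frame model outputs (1 = "lie"):

```python
import numpy as np

def statement_is_lie(frame_preds, threshold=0.3):
    # Flag the whole statement as a lie when more than `threshold`
    # of its frames were predicted as lies by the frame-level model.
    return float(np.mean(frame_preds)) > threshold
```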
2) Experiment Results: Table IX presents the results obtained from the evaluation of both tasks. As expected, we can see that the proposed model struggles to correctly predict the true lie label on both the training and test sets, judging by the low accuracy. Specifically, the best training accuracy of 52% and F1 score of 54.9 are far lower than those of the person-specific test (cf. subsection V-B), which were 65% and 63.12 respectively. Furthermore, the test predictions are also considerably low, only reaching an accuracy of 43.59% and an F1 score of 38.1. This indicates the difficulty of this prediction task, considering the different characteristics of the lying condition itself in combination with the personalized way in which people lie.

VI. CONCLUSION
In this paper we presented a comparison of several machine learning-based lie detection models applied to our newly collected Game Lie Dataset (GLD). We did so by first collecting the new dataset using several instruments and involving 19 participants in a customised card game designed to incite lying. Secondly, we pre-processed the data in a structured way to allow for easier loading and future dissemination. Lastly, we cropped the facial area and performed the annotation to complete the dataset production.
Using our collected dataset, we built classical machine learning models by adopting three handcrafted features (LBP, HOG and SIFT), which were later used for lie classification with the classical classifiers SVM, AdaBoost and LDA. Furthermore, we also included the deep learning-based VGG features to build a fully end-to-end system involving fully connected layers, to be compared with its semi-classical counterparts that use the aforementioned classical classifiers for prediction.
To evaluate the proposed models on the lie detection tasks, we performed three main experiments: the Generalized Test, the Person Specific Test, and the Cross Lie Detection Test. In the generalized test, we found limitations of the classical methods compared to the deep learning-based models, based on the higher accuracy reached by the latter. Visual inspection also revealed more diverse expressions captured by the deep learning-based model compared to the classical approaches, suggesting its effectiveness. In the second task, we showed the higher accuracy achieved by our model, given the simpler task allowing for more accurate learning. This also confirms the hypothesis that each individual makes unique facial expressions while lying. Finally, in the last task, we observed the difficulty of the models in properly predicting the lie, given the inherent characteristics of the new tasks associated with each person's unique way of lying.