Meta Learning Based Deception Detection from Speech

Abstract: It is difficult to overestimate the importance of detecting human deception, specifically by using speech cues. Indeed, several works attempt to detect deception from speech. Unfortunately, most works use the same people and environments in training and in testing. That is, they do not separate training samples from test samples according to the people who said each statement or by the environments in which each sample was recorded. This may result in less reliable detection results. In this paper, we take a meta-learning approach in which a model is trained on a variety of learning tasks to enable it to solve new learning tasks using only a few samples. In our approach, we split the data according to the persons (and recording environment), i.e., some people are used for training, and others are used for testing only, but we do assume a few labeled samples for each person in the data set. We introduce CHAML, a novel deep learning architecture that receives as input the sample in question along with two more truthful samples and non-truthful samples from the same person. We show that our method outperforms other state-of-the-art methods of deception detection based on speech and other approaches for meta-learning on our data-set. Namely, CHAML reaches an accuracy of 61.34% and an F1-score of 0.3857, compared to an accuracy of only 55.82% and an F1-score of only 0.3444, achieved by the most recent previous approach.


Introduction
Lying and deception are an inherent part of human nature. While some lies are considered small and may even be helpful for smoother interaction between humans, others may be devastating and cause major damage. However, despite deception detection being essential to everyone in their daily life, it is challenging for humans to determine whether a person is being deceptive [1,2]. Therefore, throughout history, many methods and devices were developed for that task [3], and more recently, machine learning methods based on text, video, and speech [4][5][6].
Typically, works on speech-based deception detection do not separate train and test sets based on the person, so an individual may have examples in both train and test [5]. Nor do most works split the training and test data according to the recording environment. This results in less reliable models, as in practice, the model must learn from some population in some recording environment and then be used to produce predictions for a different population in a different recording environment. Consequently, in this work, we split the data according to the persons, i.e., some people are used for training, and others are used for testing only.
One of the main difficulties in deception detection based on speech is to learn the features of lying from some people in some recording environments and apply this knowledge to others, despite the fact that different people tend to lie differently. Therefore, in order to achieve high performance, the model is required to learn which speech-related features are specific to a person and which are general.
In addition, we consider a meta-learning approach for deception detection based on speech. For that, we assume that we have very few labeled samples for each person (namely, two positive samples and two negative ones). This approach is inspired by the well-known polygraph test [7], in which several comparison questions are asked at the beginning of the polygraph interview in order to obtain physiological measures of the subject when telling the truth and when lying. Namely, in our approach, the model is trained on a set of training tasks; each task represents a person from the subjects in our data set and consists of a support set used for learning about the task and a query set used to evaluate the performance on this task. The support set contains four examples from the person's samples, two positive and two negative, and the query set contains the remaining samples of the person.
Our approach to meta-learning differs from the typical one. In a typical meta-learning problem, there are several classes in each task, which differ from task to task. In addition, training and test tasks typically have different classes. However, in our setting, we have the same two classes for all tasks, but each task represents a different person from our data, and therefore, the data features are very different between different tasks. We present an innovative method of deception detection in the meta-learning setting and show that it outperforms the existing state-of-the-art methods of deception detection based on speech and other approaches for meta-learning. In our method, we use the comparative hint approach, which gives the model hints about the new environment (in our case, a new person) by providing some true and false examples from the same person's sample set together with the tested sample. Namely, our model processes the positive and negative pairs and combines the result of this process with the tested sample into a vector fed into a neural classifier.
We believe that by using our unique architecture, the model can compare the tested sample to the given hints, learn the person's way of lying, and improve its detection performance.
To summarize, our main contribution in this work is tackling the problem of deception detection based on audio signals when having different train and test environments (i.e., persons), and when the model is provided with very few true and false labeled samples for each person in the test set. To that end, we gathered a massive amount of data and developed CHAML, a novel solution based on the meta-learning approach that uses samples from each person to learn their way of lying. This method outperforms state-of-the-art methods and can be applied to other environments that include different tasks, using any neural network as its classifier.

Deception Detection
The deception detection task has been explored in many types of research using different approaches and techniques: some based on text, some on speech, some on video, and others on physiological measures. In the beginning, most research was focused on analyzing physiological measures, such as breathing rate, heart rate, blood pressure, and body temperature [8][9][10]. Other studies found connections between deception and human behaviors [11][12][13].
However, detecting the physiological measures of a human requires special instruments and may be invasive and expensive [13,14]. Therefore, many studies researched the use of machine learning methods for deception detection.
For detection based on text, Ott et al. [4] develop a corpus of deceptive reviews and use Naïve Bayes and Support Vector Machine (SVM) classifiers utilizing Linguistic Inquiry and Word Count (LIWC) combined with bigrams. Feng et al. [15] show that using Context Free Grammar (CFG) parse trees consistently improves detection performance. Barsever et al. [16] use the BERT (Bidirectional Encoder Representations from Transformers) network and show that, compared with truthful text, deceptive text tends to be more formulaic and less varied.
There have been only a few works that attempt to detect deception based on speech cues. Hirschberg et al. [17] develop a corpus of deceptive speech using one-on-one interviews. Nasri et al. [18] use an SVM model utilizing Mel Frequency Cepstral Coefficients (MFCC). The MFCC is a feature representation commonly used in speech processing and speech recognition tasks. MFCCs are derived from the spectral representation of a speech signal and are used to capture the spectral characteristics of the signal, such as the frequencies of the various speech sounds and the power spectrum of the signal. MFCCs are commonly used for speech analysis in various domains, such as voice recognition [19], speech recognition [20], emotion recognition [21], animal vocalizations [22], and neonatal bowel sound detection [23,24].
Graciarena et al. [25] train a classifier combining linguistic features with acoustic features (a combination of MFCC and prosodic features), and Marcolla et al. [26] use an LSTM neural network on a set of MFCC characteristics extracted from audio speech to detect deception based on voice stress. Xie et al. [27] extract variable-length frame-level speech features from speech samples of different lengths and use a recurrent neural network combined with a convolution operation as their model.
Other studies focus on deception detection in videos using different feature extraction methods, such as IDT (Improved Dense Trajectory) features and high-level features representing facial micro-expressions extracted from the videos, with machine learning techniques [28][29][30][31]. Ding et al. [28] develop an automated deception detection model that consists of three main modules: a face-focused cross-stream network that performs deep joint feature learning from facial expressions and body motions in the video, a meta-learning module, and an adversarial learning module that generates a 256-dimension feature vector for each synthesized video. The meta-learning module was used to deal with their data scarcity problem by using pairwise comparison. Each deceptive video sample was combined with four true samples to generate five pairs. The model outputs the probability for each pair to be from the same class.
Other researchers apply a multi-modal approach for deception detection on video data sets. Pérez-Rosas et al. [6] introduce a collected data-set consisting of videos from public court trials. They apply a multi-modal approach for deception detection on their data set using inputs from different modalities, i.e., video, audio, and text. Wu et al. [32] use common machine learning techniques such as SVM, Naïve Bayes, Decision Trees, Random Forests, Logistic Regression, and Adaboost. They test different combinations of multi-modal features: IDT features and high-level features representing facial micro-expressions extracted from the videos as their motion features, MFCC as the audio features, and video transcripts encoded using GloVe (Global Vectors for Word Representation). Other studies [33,34] introduce a deep learning multi-modal approach on the same data-set, using Multi-Layer Perceptron (MLP) and Convolutional Neural Network (CNN) models. More specifically, Gogate et al. [33] present a deep CNN approach that utilizes both early and late multi-modal fusion methods, incorporating audio, visual, and textual features. In the early fusion approach, audio, visual, and textual features are extracted using an openSMILE feature extractor, a 3D-CNN, and a CNN applied to the GloVe embedding of the video transcript, respectively. These features are then concatenated and input into an MLP classifier. In the late fusion approach, separate unimodal classifiers are trained to obtain predicted labels, which are concatenated and fed into an MLP classifier to obtain the final predicted label. Krishnamurthy et al.
[34] propose a similar approach using an MLP model that takes a multi-modal feature representation of a video as input. This representation includes visual features extracted using a 3D CNN, textual features extracted using a CNN applied to the Word2Vec embedding of the video transcript, audio features extracted using the openSMILE toolkit, and micro-expression features derived from ground truth annotations. The authors use two data fusion techniques for combining these features: concatenation, and Hadamard product followed by concatenation. Both papers achieve an accuracy of 96% on the [6] data-set with their multi-modal model, outperforming other baseline methods that only considered visual and textual features. However, as noted by the authors, their performance may not extend to larger data-sets or out-of-domain scenarios.

Speech Emotion Recognition
The speech classification task has many fields, such as emotion recognition, speaker identification, language identification, etc. The Speech Emotion Recognition (SER) task is recognizing the emotional aspects of speech and classifying them into emotion categories. This task may be considered very close to our task, as an emotional aspect may be involved in people's lying.
Many previous works on the SER problem proposed solutions based on classical ML methods such as Hidden Markov Models (HMM), SVM, and Random Forests [35][36][37]. However, in the past several years, deep learning has become one of the main approaches for solving the SER problem [38]. Trigeorgis et al. [39] develop an end-to-end speech emotion recognition system using a combination of CNN and LSTM networks for learning a representation of the speech signal directly from the raw time representation. Han et al. [40] introduce a method that uses a neural network for extracting high-level features from audio and producing an emotion state probability distribution for each speech segment. Fayek et al. [41] propose a deep-learning framework using CNNs and a spectrogram of a speech signal as the input. Another promising approach used in various speech-related tasks is the Wav2Vec 2.0 framework [42]. This framework is used for self-supervised learning of vector representations from speech audio. Recent research has shown that the Wav2Vec 2.0 framework is also a robust alternative for SER and speaker identification tasks [43][44][45]. Therefore, we also use this framework as one of the feature extraction types in our task.

Few-Shot Learning
The method in our paper is based on the idea of the meta-learning framework, or "learning to learn" [46], specifically in the field of few-shot learning. The meta-learning framework is based on learning from prior experience with other tasks, i.e., learning how to learn to classify given a set of training tasks, such that the model can solve new learning tasks. One of the main challenges in the meta-learning framework is to train an accurate model using only a few training examples, given prior experience with similar tasks. This is called few-shot learning.
Few-shot learning is training a model to learn from very few samples and generalize to many other new examples; most approaches for few-shot learning follow the meta-learning framework. It measures a model's ability to quickly adapt to new environments and tasks using only a few examples and training iterations. For that, the model is trained on a set of tasks in a meta-learning phase, allowing it to adapt quickly to new tasks with just a few examples. Each task consists of a support set, used for learning how to solve that specific task, and a query set containing further examples of the task, which are used to evaluate the performance on it. Tasks may be entirely non-overlapping; the classes from one task may never appear in another. The model's performance is evaluated by the average test accuracy across the query sets of many testing tasks. As the meta-learning process proceeds, the model parameters are updated based on the training tasks. The loss function is derived from the classification performance on the query set of each of the training tasks, based on the knowledge gathered from its support set. The network is given a different task at each time step, so it must learn to discriminate data classes in general rather than specific subsets.
A common way of attempting to solve a few-shot learning problem is by using prior knowledge about similarity. This is done by learning class embeddings that tend to separate classes even if they have never been seen before. One of the earlier methods for solving few-shot problems is the pairwise comparator [47,48], which was developed to classify two examples as belonging or not to the same class based on their similarity, even though the model had never seen those classes before. This method can be adapted to few-shot learning by classifying an example from the query set according to its maximum similarity to an example in the support set. A more elegant way is multi-class comparators [49,50], which learn a common representation for each class in the training set and match each new test example using cosine similarity. Snell et al. [50] propose Prototypical Networks, which average the embeddings of the examples from the same class to compute the class prototype (mean vector). Then, a distance metric (a negative multiple of the Euclidean distance) is used to calculate the similarity between each query embedding and each of the class prototypes in order to find the most similar class.
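The prototype-based classification rule described above can be sketched in a few lines. This is an illustrative NumPy version of the inference step only (no embedding network is trained here); the toy 3-dimensional embeddings are invented for the example.

```python
import numpy as np


def prototypes(support_emb, support_labels, n_classes):
    """Class prototype = mean embedding of the support examples of that class."""
    return np.stack([support_emb[support_labels == c].mean(axis=0)
                     for c in range(n_classes)])


def classify(query_emb, protos):
    """Assign each query to the class whose prototype is nearest
    (argmin of squared Euclidean distance = argmax of its negative)."""
    d = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)


# toy 2-way, 2-shot example with hand-made 3-dim embeddings
sup = np.array([[0., 0, 0], [0, 0, 1], [5, 5, 5], [5, 5, 6]])
lab = np.array([0, 0, 1, 1])
protos = prototypes(sup, lab, 2)
print(classify(np.array([[0., 0, 0.4], [5, 5, 5.2]]), protos))  # [0 1]
```

In a full Prototypical Network, the gradient of the query-set loss flows back through the prototypes into the embedding network, which is what makes the embedding space class-separating.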
Alternatively, the few-shot learning problem can be solved by learning parameters that generalize better to similar tasks and can be fine-tuned very quickly when applied to different tasks. An implementation of that approach is Model-Agnostic Meta-Learning (MAML), introduced by Finn et al. [51]. The model is initialized with random weights, and iteratively, for each task in a meta-batch of tasks, a copy learner is fine-tuned on the task's support set, starting from the weights of the primary model (meta-learner), by stochastic gradient descent. At the end of each training task, the losses and gradients from the query samples are accumulated, the derivative of the mean loss with respect to the primary model's weights is computed, and the weights of the primary model are updated. The primary model's weights improve during this process so that the model can fine-tune to other tasks faster.
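The meta-update described above can be sketched as follows. This is a simplified first-order MAML step (the second-order term through the inner-loop gradients is dropped, as in FOMAML), written for illustration; the task tuple layout, learning rates, and single inner step are assumptions, not the hyper-parameters of [51].

```python
import copy
import torch


def maml_meta_step(model, tasks, inner_lr=0.01, meta_lr=0.001, inner_steps=1):
    """One first-order MAML meta-update over a batch of tasks.

    Each task is a tuple (xs, ys, xq, yq): support inputs/labels and
    query inputs/labels. The learner adapts on the support set; its
    post-adaptation query gradients are averaged and applied to the
    primary model (first-order approximation).
    """
    loss_fn = torch.nn.functional.cross_entropy
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]
    for xs, ys, xq, yq in tasks:
        learner = copy.deepcopy(model)                 # task-specific copy
        opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                   # adapt on support set
            opt.zero_grad()
            loss_fn(learner(xs), ys).backward()
            opt.step()
        learner.zero_grad()
        loss_fn(learner(xq), yq).backward()            # evaluate on query set
        for g, p in zip(meta_grads, learner.parameters()):
            g += p.grad / len(tasks)
    with torch.no_grad():                              # meta-update
        for p, g in zip(model.parameters(), meta_grads):
            p -= meta_lr * g
```

Full (second-order) MAML instead backpropagates the query loss through the inner-loop updates themselves, which is more faithful but considerably more expensive.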
Recently, few-shot learning methods have been applied to the SER problem. Guibon et al. [52] use Prototypical Networks for emotion sequence labeling, Feng and Chaspari [53] use a Siamese neural network for emotion recognition of spontaneous speech, and [54] use MAML for solving the multilingual SER problem. Hence, as this approach seems promising for the SER task, we adapt it to our problem.

Data Collection
The data presented in this work is based on the "Cheat-Game", which is a turn-taking card game. Each player is dealt 8 cards, and the goal is to play all of them. The central pile accumulates all cards played by the players, and every turn, the value of the recently played card is supposed to either go up by one or down by one. On each turn, a player may place up to four cards (face down) on the central pile. The player then states which cards she played; however, the player may claim to have put down cards that are different from what she actually played (i.e., a false claim). Nevertheless, the player must claim to play cards that have a value of either one above or one below the recently played card(s). If a player suspects that her opponent is cheating, i.e., played cards that are different from what she stated, the player may call out a cheat. If the opponent did in fact cheat, she collects all the cards; otherwise, the player who called out a cheat collects the cards.
We use the implementation of Mansbach et al. [55] for the "Cheat-Game" (see Figure 1 for a screenshot).During game-play, the claims of the players are recorded and added to our data set along with the actual cards played.This information allows us to determine whether a claim is true or false.
To obtain high-quality results, we improved the game environment by adding an audio test at the beginning of the game and an option for the player to hear her claim so that she could make sure it sounds clear; if not, it could be re-recorded. Unfortunately, some players who played illegal cards (i.e., opted to cheat), rather than claiming that they played legal cards, stated the cards they actually played. Such statements cannot be treated as untruthful statements, nor can they be used as truthful statements. Therefore, we attempted to identify such statements and remove them from the data set. To that end, we used the Google Speech-to-Text API and provided it with relevant words from our domain. Then, we checked whether the players who played cards not according to the rules stated that they had played illegal cards, and if so, we removed these samples from the data-set. That method resulted in 3350 samples being removed.

We recruited 156 test subjects who played the game in English using Amazon's Mechanical Turk service [56]. Subjects' demographic information is shown in Table 1. We collected a total of 10,788 labeled samples. 7585 samples were labeled as true (70.3%), and 3203 samples were labeled as false (29.7%). Each sample lasts about 4 s. As mentioned, we divided the subjects into two groups: subjects whose data is used only for training and subjects whose data is used only for testing. In the training set, we had 111 subjects, and in the test set, we had 45 subjects. It should be noted that in our data set, different recordings are made in different environments, unlike most data sets in which all recordings are gathered from the same environment (i.e., the same microphones). This fact makes our problem more complex, but also more realistic, as, when used in practice, it is anticipated that the data is gathered from many different sources, and each person uses her own recording device. See Table 2 for a summary of the data-set division.

Comparative Hint Approach Meta-Learning (CHAML)
We present CHAML, a model that uses a Comparative Hint Approach and Meta-Learning for deception detection. The model is composed of the following three modules: the embedding module, the core process, and the classifier.

Embedding Types
In order to classify the audio samples, they must first be vectorized. Therefore, we use embeddings, which encompass the samples' audio features, and feed them to the core process. We consider two different types of embeddings: Five Sound Features and Wav2Vec 2.0.

Five Sound Features
The Five Sound Features is a vector embedding developed by Mansbach et al. [55]. It contains the following five features, which are extracted from the audio samples: MFCC, Mel-scale spectrogram, spectral contrast, Short-time Fourier transform (STFT), and Tonnetz. The vector embedding size is 193. We note that although [55] used a Voice Activity Detector (VAD) to trim the silent parts and background noise from all samples, we observed that this practice did not improve the performance of this embedding method. This is likely because the samples are short, and silent segments may contain clues on whether a statement is true or false. Therefore, in this paper, we do not use VAD.
Wav2Vec 2.0

Wav2Vec 2.0 [42] is a framework for self-supervised learning of speech representations. The model training process is divided into two phases: first, the model is pre-trained in a self-supervised manner on large quantities of unlabeled audio samples to achieve the best speech representation it can, and second, it is fine-tuned on a smaller amount of labeled data.
The architecture of the model is composed of three parts: a feature encoder, a context network, and a quantization module. The feature encoder reduces the dimensionality of the input raw waveform by converting it into a sequence of T latent audio representation vectors of 25 ms each. It consists of a 1-d convolutional neural network with 7 layers and 512 neurons at each layer. These representations are then fed into both the context network and the quantization module. The context network takes the T latent audio representations and processes them through Transformer blocks, adding information from the entire audio sequence to obtain the contextualized representations. The quantization module discretizes the latent audio representations into a finite set of quantized representations via product quantization. The set of possible quantized representations is composed of a concatenation of codewords sampled from codebooks. The pretraining process uses contrastive learning, in which parts of the audio from the latent feature encoder space are randomly masked, requiring the model to identify, for each masked frame, the correct quantized latent representation from a set of distractors. As mentioned, after the pretraining phase, the model is fine-tuned on labeled data.
In this study, during the fine-tuning phase, the quantization module is not used. Instead, the model is fine-tuned on a supervised speech classification task by adding an average pooling layer on top of the context encoder output for calculating the averaged vector of the context representations, followed by a fully connected layer using the Tanh activation function and an output layer with two classes. The hyper-parameters used for fine-tuning the Wav2Vec 2.0 model on our data are presented in Table 3. We note that the total train batch size was set to the maximum allowed by the GPU used for training, and since fine-tuning the Wav2Vec 2.0 model required extensive training time, we could not consider many different hyper-parameters. This fine-tuning architecture is inspired by [57], which fine-tunes the Wav2Vec 2.0 model for the speech classification task.
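The classification head just described (mean pooling, Tanh-activated dense layer, two-class output) can be sketched in PyTorch as follows. The hidden size of 1024 matches the xlsr-53 context dimension; keeping the dense layer at the same width is an assumption, as the paper does not state its size.

```python
import torch
import torch.nn as nn


class Wav2Vec2ClassificationHead(nn.Module):
    """Sketch of the fine-tuning head described in the text: average
    pooling over the context representations, a Tanh-activated fully
    connected layer, and a two-class output layer. The dense layer
    width (= hidden size) is an assumption."""

    def __init__(self, hidden_size=1024, num_classes=2):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, num_classes)

    def forward(self, context):           # context: (batch, frames, hidden)
        pooled = context.mean(dim=1)      # average pooling over time
        return self.out(torch.tanh(self.dense(pooled)))
```

In the full model, `context` would be the output of the Wav2Vec 2.0 context network; here any `(batch, frames, 1024)` tensor stands in for it.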
The pre-trained model used in our study is the Wav2Vec 2.0 xlsr-53 model (https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english, accessed 4 October 2022), which was fine-tuned on English using the Common Voice [58] data set, which currently consists of 7335 h of transcribed speech in 60 different languages. The model outputs an embedding vector of length 1024.

CHAML-Core
In this model, we were inspired by the idea of meta-learning for few-shot learning for deception detection based on speech. We define each person in our data-set as a "meta-learning task". Each task's support set contains four randomly sampled examples, two per class, and the rest are the query set. In the training phase, each sample was trained with four examples randomly sampled from the training set of the person, i.e., each sample appears as a query, and may appear as an example in the support set of other samples. Clearly, in the evaluation phase, we do not have different pairs of labeled examples for each query sample, and we use the same four examples for all query samples of a task. As mentioned in Table 2, there are 45 testing subjects (i.e., tasks), and for each of them, we have four labeled examples, two per class. Therefore, in total, we have 90 True and 90 False samples in the support sets of the testing tasks. See sample partition details in Table 4. The CHAML model fixes the order of the provided examples by first using the true examples and then the false examples. For each class' examples, we calculate their element-wise product; this is known to have the ability to capture similarities or discrepancies between the vectors. Each pair is then concatenated with the resulting product and fed into three fully connected layers with the ReLU activation function and dropout. The number of neurons in each fully connected layer depends on the original embedding size, which differs between the two embedding types. That is, for the Five Sound Features embedding mentioned in Section 4.1.1, the embedding size is 193, and for the Wav2Vec 2.0 embedding mentioned in Section 4.1.2, the embedding size is 1024. The size of the first fully connected layer is three times the embedding length, which matches the size of its input (the embedding of the first example, the embedding of the second example, and their element-wise product). The size of the second and final fully connected layers is twice the embedding size. The intuition behind this architecture is to allow CHAML to identify the patterns present in each class and return a general representation that contains the features that characterize each class. The results from both the positive and negative pairs are combined with the tested sample into a vector that is fed into a neural classifier. Finally, CHAML returns the probability that the tested sample belongs to each of the classes. In the training phase, the weights of the preprocess layers and the neural classifier are all updated after each epoch.
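The pair processing and combination steps above can be sketched in PyTorch. This is an illustrative reconstruction, not the authors' code: whether the true and false pairs share encoder weights, the dropout rate, and the placeholder classifier dimensions are all assumptions (the paper feeds the combined vector to the MNA classifier).

```python
import torch
import torch.nn as nn


class PairEncoder(nn.Module):
    """Per-class pair processing, as described in the text: two same-class
    example embeddings and their element-wise product are concatenated and
    passed through three ReLU/dropout fully connected layers of sizes
    3e, 2e, 2e (e = embedding size)."""

    def __init__(self, e, p_drop=0.5):  # dropout rate is an assumption
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * e, 3 * e), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(3 * e, 2 * e), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(2 * e, 2 * e), nn.ReLU(), nn.Dropout(p_drop),
        )

    def forward(self, a, b):
        return self.net(torch.cat([a, b, a * b], dim=-1))


class CHAMLCore(nn.Module):
    """Combine the true-pair and false-pair representations with the tested
    sample and feed the result to a classifier (here a placeholder MLP)."""

    def __init__(self, e, num_classes=2):
        super().__init__()
        self.true_enc = PairEncoder(e)   # separate weights per class: assumed
        self.false_enc = PairEncoder(e)
        self.classifier = nn.Sequential(
            nn.Linear(2 * e + 2 * e + e, e), nn.ReLU(),
            nn.Linear(e, num_classes),
        )

    def forward(self, query, t1, t2, f1, f2):
        combined = torch.cat([self.true_enc(t1, t2),
                              self.false_enc(f1, f2), query], dim=-1)
        return self.classifier(combined)   # class logits for the query
```

With `e = 193` (Five Sound Features) or `e = 1024` (Wav2Vec 2.0), the layer sizes match the dimensions stated in the text.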
We believe that the model compares the tested sample with the examples of each class in order to find which class is more similar to the tested sample, which helps it to provide a more accurate prediction.An illustration of CHAML's architecture is depicted in Figure 2.

MNA Classifier
For our model's classification task, we use the same architecture as the Five Sound Feature Model (FSFM) from [55], which achieved the highest scores for the deception detection task in a very similar environment. We term this classifier MNA (Mansbach, Neiterman, and Azaria [55]). MNA's architecture consists of three fully connected layers, using the ReLU activation function and dropout after each. The output layer uses a softmax activation function with two classes.
The classifier illustration is provided in Figure 3. The complete CHAML training process is presented in Algorithm 1. During the test phase, instead of randomly sampling the support set for each query sample, we use the support set provided for each task for all queries of that task.

Algorithm 1: CHAML Training Process
Input: samples divided into tasks
Output: predicted labels

Embedding Collection:
Use Five Sound Features / Wav2Vec 2.0 / other, for creating an embedding for each of the samples.

CHAML-Core:
Create an empty list modelINPUTS.
a. foreach task in train tasks do
       /* create support set for each sample */
       foreach sample in task do
           T_support := randomly sample two True samples from the current task.

Baselines

We consider four baseline methods, and we compare their performance to that of CHAML. Namely, we consider a method using only the MNA classifier without fine-tuning, and a method using the MNA classifier fine-tuned on the support sets. In addition, we consider the Prototypical [50] and MAML [51] networks, which are commonly used for meta-learning.
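The per-sample support-set construction used in the CHAML training process can be sketched as follows. This is an illustrative reconstruction: the `(sample, label)` task representation and the fixed seed are assumptions made for the example.

```python
import random


def build_training_inputs(tasks, seed=0):
    """For every sample (query) in each training task, randomly draw two
    True and two False samples from the same task, excluding the query
    itself, as described in the CHAML training process."""
    rng = random.Random(seed)
    inputs = []
    for task in tasks:                                  # task: list of (sample, label)
        for i, (query, label) in enumerate(task):
            trues = [s for j, (s, y) in enumerate(task) if y and j != i]
            falses = [s for j, (s, y) in enumerate(task) if not y and j != i]
            inputs.append((query, rng.sample(trues, 2),
                           rng.sample(falses, 2), label))
    return inputs
```

At test time this resampling is replaced by the single fixed support set provided for each test task, as stated above.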

MNA-Fine-Tuning
In this method, we trained the MNA classifier mentioned in Section 4.3 on the embedding vectors. During the evaluation phase, for each test task, we first continued training the model on the four examples provided (the support set) and then predicted the labels of that test task's query set.

Prototypical Network
A Prototypical network [50] is one of the well-known meta-learning methods and is based on similarity. For each task, the model learns a prototypical embedding, which is a common representation for each class in the support set, and matches each query embedding to each class's prototypical embedding to find the most similar class. The Prototypical network uses cosine similarity for measuring the similarity between each query embedding and each class's prototypical embedding.

Model-Agnostic Meta-Learning (MAML)
Model-Agnostic Meta-Learning (MAML) [51] is a meta-learning framework that learns model parameters that can be fine-tuned very quickly when applied to different tasks. The model is initialized with random weights, and iteratively, for each task in the training tasks, a copy of the primary model is fine-tuned on the task's support set by stochastic gradient descent. At the end of each training epoch, the losses and gradients from all query samples are accumulated. MAML then calculates the derivative of the mean loss with respect to the primary model's weights and updates those weights in the primary model. In the evaluation phase, a copy of the primary model is fine-tuned on the support set of each test task and evaluated on the query set of the same task.

Results
In this section, we describe our results for deception detection. We considered training the basic MNA model using 25, 50, and 100 epochs, as well as batch sizes of 32, 128, 512, and 1000. Since using 50 epochs and a batch size of 512 performed best for the basic MNA model, we used these hyper-parameters for all the models that we tested. In addition, we use a weighted categorical cross-entropy loss to deal with the imbalance of our data. The results presented in this paper are the average of 30 different executions, each with a different seed. The scores are calculated on the query sets of the test tasks and can be seen in Table 5. We note that since the False samples are the minority, and those which we are trying to identify as being deceptive, we use them as the positive class, and the True samples are used as the negative class. Therefore, False samples classified as being False are considered true positives. Similarly, True samples classified as being False are considered false positives.
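The weighted loss mentioned above can be sketched as follows. The inverse-frequency weighting scheme shown here is a common choice and an assumption on our part; the paper does not state which class weights were used, only the True/False sample counts from the data collection section.

```python
import torch

# Inverse-frequency class weights from the stated sample counts
# (7585 True vs. 3203 False); the weighting scheme itself is assumed.
counts = torch.tensor([7585.0, 3203.0])        # [True, False] sample counts
weights = counts.sum() / (2 * counts)          # rarer class gets a larger weight
loss_fn = torch.nn.CrossEntropyLoss(weight=weights)

logits = torch.tensor([[2.0, 0.5]])            # one prediction (favors True)
loss = loss_fn(logits, torch.tensor([1]))      # ground truth: False class
```

With these counts, misclassifying a False sample contributes roughly 2.4 times more to the loss than misclassifying a True sample, counteracting the 70/30 class imbalance.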
As depicted in the table, the CHAML-Wav2Vec 2.0 model outperformed all other methods with an accuracy of 61.34% and an F1-score of 0.3857. Figure 4 provides the confusion matrix for the CHAML model on the Wav2Vec 2.0 embedding. Next, we compare CHAML to the MNA classifier. As shown in the table, CHAML outperformed the fine-tuning method and improved the results of the MNA classifier for both embedding types. We note that the original FSFM work [55] (which inspired MNA's architecture) reported higher accuracy and F1-score results than in our experiments. This is because, in the original work, the entire data set was first shuffled and then split into training and test sets; therefore, the same person could appear in both the training set and the test set, allowing the model to learn from all types of people. In our current work, by contrast, samples of the same person never appear in both the training and test sets. Consequently, performance decreases, as the model is trained on some people but tested on others.
In addition, CHAML is compared to the prototypical and MAML methods, which are commonly used for meta-learning. The prototypical network was trained using 50 epochs for each support set and used a weighted categorical cross-entropy loss. To obtain the prototypical embedding for each class, we use the FSFM architecture without the last layer (which is used for the classification task); the prototypical embedding length is 128. MAML was trained for 50 epochs, with 15 epochs for each task's fine-tuning, and used the FSFM model as the meta-learner. MAML performed better when using random weighted sampling rather than a weighted categorical cross-entropy loss to account for the imbalance in the data. Since the prototypical network and MAML are trained on each task separately, they must use a single support set for all the query samples, unlike CHAML, which has a different support set for each query sample in the training tasks. The results show that CHAML outperforms the other meta-learning methods. The prototypical network achieved the lowest accuracy of all the methods and a lower F1-score than CHAML, while MAML achieved similar results. The average computation times of the three models can be seen in Table 6. As shown, CHAML is over 25 times faster than MAML while having a computation time similar to that of the prototypical network, yet it achieves much higher performance.

Table 6. Average computation time of each model.

Model              Avg. time (s)
MAML [51]          1047.71
Prototypical [50]  27.22
CHAML              41.07

Moreover, we find that using the Wav2Vec 2.0 embedding (Section 4.1.2) for the deception detection task on our data yields better results in the main models: MNA, MAML, and CHAML. Interestingly, even for the MNA classifier, which was originally designed specifically for the five-sound-features embedding, the Wav2Vec 2.0 embedding performs better.
To confirm that CHAML uses the examples and does not ignore them, we conducted another experiment in which the support set was shuffled, so that it did not contain the two true examples first and the two false examples after them (as required by CHAML). In this case, the F1-score decreased significantly, from 0.3857 to 0.3707 in the Wav2Vec 2.0 setting and from 0.3826 to 0.332 in the Five-Features setting. This indicates that CHAML relies on the examples to provide an accurate prediction.

Conclusions & Future Work
In this study, we have proposed CHAML, a comparative-hint meta-learning approach for deception detection based on speech. The method follows the meta-learning paradigm, in which a model is trained on various learning tasks to enable it to solve new learning tasks using only a few samples. In our approach, we add a few labeled samples (the support set) to each unlabeled sample (the query) from the same task in order to predict its label. Our setting differs from the typical meta-learning problem, since our data comes from different environments (and different people), whereas in the classical meta-learning framework, data from one environment is divided into different tasks. In addition, in typical meta-learning there is no overlap between the classes in the training tasks and those in the test tasks; in our approach, the classes are the same in all tasks, but the task environment differentiates each task from the others. Therefore, typical meta-learning methods do not perform well on our task, while CHAML, our proposed method, manages to gather relevant information from the support sets and improves the model's performance.
In future work, we will attempt to add attention to CHAML, so that it learns to attend to the relevant parts of each example when making a prediction. We note that our method is not limited to deception detection and can also be applied to other settings in which all tasks, both in the training set and in the test set, include the same classes (or regression problems), but the samples in different tasks differ greatly from each other. Examples of such settings include emotion recognition (with many different people, where each person is a different task), handwriting recognition (with different people, where each person is a different task), age and sex determination of different animals (where each animal is its own task), and determining whether sensor data collected from different environments indicates that human intervention is required. In future work, we intend to test CHAML in some of these settings.

Figure 4. CHAML-Wav2Vec 2.0 performance on the query set of the test samples.

Table 3. Hyper-parameters used for fine-tuning the Wav2Vec 2.0 model.

Table 5. Comparison of model performance.