Article

DDA-MSLD: A Multi-Feature Speech Lie Detection Algorithm Based on a Dual-Stream Deep Architecture

1 School of Computer Science, Jiangsu University of Science and Technology, Zhenjiang 212100, China
2 School of Electrical and Information Engineering, Jingjiang College, Jiangsu University, Zhenjiang 212013, China
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Information 2025, 16(5), 386; https://doi.org/10.3390/info16050386
Submission received: 28 March 2025 / Revised: 25 April 2025 / Accepted: 29 April 2025 / Published: 6 May 2025

Abstract

Speech lie detection is a technique that analyzes speech signals in detail to determine whether a speaker is lying. It has significant application value and has attracted attention from various fields. However, existing speech lie detection algorithms still have certain limitations. These algorithms fail to fully explore manually extracted features based on prior knowledge and also neglect the dynamic characteristics of speech as well as the impact of temporal context, resulting in reduced detection accuracy and generalization. To address these issues, this paper proposes a multi-feature speech lie detection algorithm based on a dual-stream deep architecture (DDA-MSLD). This algorithm employs a dual-stream structure to learn different types of features simultaneously. Firstly, it combines a gated recurrent unit (GRU) network with the attention mechanism. This combination enables the network to more comprehensively capture the context of speech signals and focus on the parts that are more critical for lie detection. It can perform in-depth sequence pattern analysis on manually extracted static prosodic features and nonlinear dynamic features, obtaining high-order dynamic features related to lies. Secondly, the encoder part of the transformer is used to simultaneously capture the macroscopic structure and microscopic details of speech signals, specifically for high-precision feature extraction from the Mel spectrogram of the speech signal, obtaining deep features related to lies. This dual-stream structure processes various features of speech simultaneously, describing the subjective state conveyed by the speech signal from different perspectives and thereby improving detection accuracy and generalization. Experiments were conducted on the multi-person scenario lie detection dataset CSC, and the results show that this algorithm outperformed existing state-of-the-art algorithms in detection performance. Considering the significant differences in lie speech across different lying scenarios, and to further evaluate the algorithm's generalization performance, a single-person scenario Chinese lie speech dataset, Local, was constructed, and experiments were conducted on it. The results indicate that the algorithm has a strong generalization ability in different scenarios.

1. Introduction

Lies [1] exist in many scenarios of daily life. Liars often mislead others by fabricating facts to achieve their own goals, which negatively affects the security, stability, and harmonious development of human society. Accurate lie detection can help people obtain information more authentically and prevent the occurrence of unexpected events. Speech-based lie detection methods have attracted widespread attention from researchers in psychology and computational linguistics, as well as practitioners in law enforcement, military, and intelligence agencies, due to their advantages in terms of temporal and spatial span, high concealment, and practicality [2]. Research [3] has shown that compared with normal states, lying can cause subtle changes in vocal pressure, tone, speech rate, and the vocal organs, which in turn lead to changes in certain acoustic feature parameters. After a series of attempts, previous researchers have preliminarily determined that prosody, vocabulary, and paralinguistic information are effective indicators for lie detection [4]. For example, Ekman et al. [5] collected speech data of true and false statements by obtaining feedback from experimental participants who watched film clips and conducted statistical analysis on the prosodic features of the speech, finding that the fundamental frequency of lie speech is significantly higher than that of truthful speech. DePaulo et al. [6] analyzed the lexical features proposed in existing lie detection studies and found that when people lie, there are phenomena such as a shorter speaking duration, fewer details in expression, more repetitions, and a higher fundamental frequency. Fradkov et al. [7] extracted prosodic features and deep features from the speech data of 32 experimental participants and achieved a recognition rate of 64.4% by inputting the combined features into a classifier.
Although current speech lie detection technologies have achieved certain breakthroughs, these methods still have certain limitations. First, existing lie datasets are mostly constructed in the form of interviews in multi-person scenarios, resulting in a narrow range of lying scenarios. In multi-person lying scenarios, there are relationships such as cooperation and competition, which affect the motives and methods of lying, and speech features are easily contaminated by the interleaving of multiple voices. In single-person lying scenarios, there are fewer interpersonal factors, individual factors are more evident, and speech features are relatively independent and pure. Second, in terms of feature processing, traditional lie detection methods mostly rely on the static prosodic features (ACOs) of speech [8]. Relying on a single type of feature also makes lie detection susceptible to individual emotions and psychological states. For example, when a person is in a state of tension, anxiety, or excitement, there may be changes such as an increased speech rate, raised pitch, and voice trembling [9]. These changes may be similar to the speech features observed when lying, leading to misjudgments. Moreover, different people may have different vocal responses when lying. Some people may more easily exhibit obvious vocal changes, while others may be better at controlling their voices, which increases the difficulty of lie detection. The literature [10] points out that speech signals actually have chaotic characteristics, captured by nonlinear dynamic features (NLDs), which are affected by changes in psychological and physiological states and have a significant effect on lie detection. However, their effectiveness and feasibility in real scenarios still need in-depth research and practical verification. Therefore, to achieve more accurate lie detection, we should comprehensively analyze both the dynamic and static features of speech to improve detection accuracy and generalization. In recent years, the use of deep learning to extract speech features for lie detection has also attracted the attention of researchers. Deep learning can learn higher-level deep features from speech; for example, Liu et al. [11] used a deep belief network (DBN) to extract sparse representations of speech for lie detection. Handcrafted features and features extracted using deep learning techniques occupy different feature spaces, describe the subjective emotional state conveyed by speech from different angles, and are complementary to each other. Therefore, integrating multiple speech features for a comprehensive judgment of lies can effectively reduce the risk of misjudgments caused by individual differences and environmental factors, thereby improving detection accuracy and generalization.
To address these issues, this paper first designs a comprehensive scheme for collecting lie speech data. After the collection is completed, the audio is segmented and labeled, thereby constructing a monologue Chinese lie dataset named "Local". Based on this, a multi-feature speech lie detection algorithm based on a dual-stream architecture (DDA-MSLD) is proposed to obtain richer information from speech that is conducive to lie identification, thereby enhancing the accuracy and generalization of lie detection. The algorithm adopts a dual-stream structure. One stream combines a gated recurrent unit (GRU) with the attention mechanism, mainly processing acoustic prosodic features (ACOs) and nonlinear dynamic features (NLDs) manually extracted via traditional signal processing techniques. ACOs cover relatively stable prosodic elements such as pitch, intensity, and duration, while nonlinear dynamic features capture the complex changes of speech over time, reflecting the more subtle and dynamic characteristics of speech [12]. The unique gating mechanism of a GRU can effectively handle long sequence dependencies and enhance the model's expressive power. The attention mechanism can allocate different weights to different parts of the speech data according to their importance, enabling the network to more accurately identify features related to lies. The other stream uses the encoder part of the transformer to analyze the Mel spectrogram of the speech signal [13]. The Mel spectrogram intuitively presents the energy distribution of the speech signal across frequencies and over time in the form of an image. Its design simulates the nonlinear frequency response characteristics of the human ear and can suppress, to a certain extent, the impact of noise and other interference factors on speech features. The multi-head self-attention mechanism of the transformer can effectively capture long-term dependencies, and the processed features retain the abstract expression of spatial texture while integrating the global correlation of temporal context. By deeply analyzing the Mel spectrogram with the transformer, deep features related to lies can be extracted, which more comprehensively reflect the subtle changes and patterns in the speech signal. Finally, the features extracted from the two streams are organically integrated, allowing different types of features to complement each other and jointly provide a more comprehensive and representative feature set for lie detection. The algorithm has been tested on both the CSC polygraph dataset in a multi-person scenario and the Local dataset in a monologue scenario constructed in this paper. The DDA-MSLD algorithm shows better accuracy and generalization in both scenarios. The main contributions of this paper include the following:
1. By simulating stressful situations in real environments, we constructed a single-person scenario lie speech dataset called "Local". During the construction process, we fully considered the psychological pressure and real-world impacts faced when lying, which enhanced the ecological validity of the collected data. This dataset not only provides solid data support for the algorithm proposed in this paper but also offers high-quality training resources for future lie detection research, promoting the performance improvement and optimization of speech-based lie detection technology in practical applications.
2. The algorithm proposed in this paper integrates three network structures (the gated recurrent unit (GRU), the attention mechanism, and the transformer) to process the prosodic features, nonlinear dynamic features, and Mel spectrogram features of speech signals. It not only takes into account both static and dynamic characteristics but also combines manually extracted features with deep features adaptively learned by neural networks, capturing more comprehensive lie-related information and thereby improving the accuracy and generalization of lie detection.
3. Through experiments on the CSC dataset and the self-constructed Local dataset, the DDA-MSLD algorithm demonstrated better accuracy and generalization than existing methods, proving its potential and value in practical applications.

2. Related Work

This section reviews existing work on speech lie detection, covering approaches based on handcrafted feature extraction and approaches based on deep learning.

2.1. Lie Detection Based on Handcrafted Feature Extraction

Traditional speech lie detection methods identify deception by manually extracting acoustic features from speech signals. This process involves extracting parameters such as the fundamental frequency, energy [14], and formants [15] from speech signals. By conducting statistical analyses of the mean and variance of these parameters, a feature set is formed to distinguish between truthful and deceptive speech patterns. Frank [16] found that liars often exhibit physiological phenomena such as a higher pitch and faster speech rate due to the fear associated with lying, which becomes more pronounced as the fear intensifies. Inspired by Ekman, Gopalan [17] analyzed the speech signals corresponding to truthful and deceptive statements and discovered that the pitch and amplitude of deceptive speech change, indicating that acoustic features can help differentiate between truth and lies. Mathur et al. [18] constructed the first large-scale deceptive speech database, Columbia-SRI-Colorado (CSC), and achieved a classification accuracy of 70% by extracting acoustic-prosodic and lexical features. The study particularly emphasized that prosodic features show significant utility in capturing deceptive information, which is crucial for refining lie detection. However, the extraction and analysis of prosodic features rely on high-quality audio data, which may be limited by noise and individual speaker differences in practical applications, affecting the stability of recognition. Hansen et al. [19] combined Mel-frequency cepstral coefficients (MFCCs) and their first- and second-order differences, autocorrelation, and cross-correlation functions into a feature set. They studied speech under different stress levels and further determined whether the speaker was lying based on this feature set. Levitan et al. [20] participated in the 2016 ComParE Deception Sub-Challenge at INTERSPEECH, using the provided Deception Speech Database (DSD) and a baseline acoustic feature set for lie detection. This feature set consisted of statistics computed by various functionals over the contours of low-level descriptors (LLDs). By combining the baseline feature set with LIWC and DAL features, they achieved an accuracy of 69.4% on the DSD test set. Professor Zhao Heming's team from Soochow University proposed an 18-dimensional nonlinear dynamic feature set which also showed remarkable performance in lie detection, achieving a recognition rate of 70.7% [21]. Despite these advancements, the high dependency of algorithms on feature sets poses a challenge; how to efficiently select the most discriminative feature combinations from a vast pool of features remains an urgent issue to be addressed. A research group at Purdue University [22] analyzed features such as the Teager energy, amplitude envelope, and formant frequency. Their study indicated that these features exhibit subtle changes when a speaker lies, whereas changes in the fundamental frequency are less correlated with deception. These subtle feature changes demand higher sensitivity and precision from the algorithms. Moreover, most studies are conducted in laboratory settings, and further validation is needed to ensure the accuracy of detection in dynamic real-world scenarios. Traditional manual feature extraction for speech-based lie detection relies on subjectively chosen feature sets, which struggle to comprehensively cover the diversity of deceptive behaviors.
Manually selected features are often confined to low-dimensional spaces, making it difficult to capture the complex and subtle variations in deceptive speech signals. They are also more sensitive to external noise and individual differences, and their limitations are particularly evident when dealing with complex and dynamic speech data.

2.2. Lie Detection Based on Deep Learning

In recent years, with the rapid development of deep learning, researchers have continuously attempted to utilize deep learning techniques to process speech signals and extract richer acoustic features for lie detection, achieving certain results. Levitan and Maredia [23] studied 11 stress levels in liars' speech and constructed a set of features related to Mel-frequency cepstral coefficients (MFCCs), using a neural network as a classifier. The results indicated that slight tremors in the vocal organs led to changes in the related acoustic feature parameters. Vrij et al. [24] built a ladder network and trained it in a semi-supervised manner, achieving a high lie recognition rate. Inspired by Busso, Fang et al. [25] proposed a semi-supervised speech lie detection model constructed with an autoencoder network and a bidirectional long short-term memory network, and they further proposed a feature fusion method based on the attention mechanism. The study showed that better recognition results could be achieved with a small number of labeled samples. Fu et al. [26] used a denoising autoencoder for speech lie detection, which involved dimensionality reduction of manually extracted statistical features to obtain more robust features and compress redundant information in the features, enabling the model to achieve good recognition performance with a small number of labeled samples. Mendels et al. [27] proposed a deep learning model for speech lie detection, which first applied a fully connected neural network for detailed feature extraction of speech data. Subsequently, GloVe word vector models were used to create distributed vector representations for each speech segment. Finally, bidirectional long short-term memory networks were introduced to process these distributed vectors from two temporal directions, fully capturing the temporal dependencies and contextual information in the speech sequence. The combined model achieved an F1 score of 0.64 and a precision of 0.64. However, the complexity of technology integration may also lead to a decrease in the interpretability of the algorithm, affecting the trustworthiness of the results. Zhou et al. proposed a deep belief network based on the K-SVD algorithm, which could extract deep features from speech. The study showed that deep features extracted using the K-SVD algorithm were better at representing lie-related information in speech compared with basic acoustic features. Subsequently, Zhou et al. performed dimensionality reduction on speech signals to obtain sparse features and further processed these features with a deep belief network, achieving good recognition results. Dimensionality reduction of speech signals to obtain sparse features is an effective means of improving model efficiency, but this process may result in some information loss, affecting recognition accuracy. Additionally, the limitations of the datasets used in their studies may also impact the generalization ability of the models. Despite the great potential of deep learning-based lie detection technologies, they still face challenges such as the difficulty of obtaining high-quality, diverse training data, limited ability to capture subtle cues of lies, and difficulty in adapting to diverse lying strategies.
Speech lie detection technology has evolved from exploring the effectiveness of single features to integrating multiple features and then to leveraging deep learning to mine complex feature representations. However, traditional lie detection methods based on manually extracted features are susceptible to noise and individual differences [28]. While lie detection using deep learning can significantly improve recognition rates, it faces issues such as strict data quality requirements and poor model interpretability. The DDA-MSLD algorithm proposed in this paper integrates ACO and NLD features extracted by traditional manual techniques, as well as Mel spectrogram features extracted by deep learning techniques. It not only cleverly combines the strengths of manual and deep features but also fully considers both the static and dynamic characteristics of speech, effectively enhancing detection accuracy and generalization. The algorithm has demonstrated its superior performance on datasets from different lying scenarios.

3. Datasets

In the research of speech lie detection, there are significant differences between multi-person and single-person lying scenarios. In multi-person lying scenarios [29], there are factors such as cross-interference of speech and the influence of emotions between people, and the logic of lying is more structured. In contrast, in single-person lying scenarios, individuals have to bear the pressure of lying alone [30], which may lead to more obvious tension and unnatural speech features, and the logic of lying is relatively simple. Existing lie speech datasets are mostly based on multi-person scenarios, such as the CSC dataset used in this paper. To further analyze the generalization performance of our algorithm in different scenarios, we also constructed a single-person narrative scenario lie detection dataset called Local by simulating stressful situations in real environments. The following sections introduce the two datasets.

3.1. CSC Dataset

This paper employs the Columbia-SRI-Colorado Corpus (CSC) [26] for experiments in multi-person lying scenarios. The dataset comprises 32 h of audio interviews with 32 native speakers of Standard American English (16 males and 16 females), who were recruited from Columbia University students and the community. Participants were informed that they were taking part in a communication experiment aimed at identifying individuals who fit the image of top American entrepreneurs. To this end, they were tasked with performing assignments and answering questions across six domains. Subsequently, they were informed that they had scored low in some domains and did not fit the image. Thereafter, the subjects attended an interview where the interviewer asked them to persuade him that they had actually achieved high scores in all domains and did indeed fit the image. The interviewer's task was to determine the subjects' actual performance, and he was free to ask them any questions beyond the tasks. For each question posed by the interviewer, the subjects were required to indicate whether their answers were truthful or contained any false information by pressing one of two pedals hidden under the table. The interviews were conducted in a double-walled soundproof room and recorded on digital tape in two channels using a Crown CM311A Differoid headset close-talk microphone; the recordings were downsampled to 16 kHz before processing.
The original CSC corpus language data are in the form of a single audio file and a manually transcribed text file corresponding to each of the 32 participants. In this paper, the audio data are segmented based on the start and end times of each utterance and the corresponding truth or lie labels recorded in the text files. A total of 4178 utterances were selected from the CSC dataset, including 2514 truthful utterances and 1661 deceptive utterances. The specific details are shown in Table 1.
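Below is an illustrative sketch of this utterance-level segmentation step, not the authors' actual preprocessing script. It assumes the start time, end time, and truth/lie label of each utterance have already been parsed from the transcript into in-memory triples; the file names and output naming scheme are likewise placeholders.

```python
import soundfile as sf

def segment_interview(wav_path, annotations, out_dir, target_sr=16000):
    """Slice one long interview recording into labeled utterance files."""
    audio, sr = sf.read(wav_path)
    assert sr == target_sr, "CSC audio is assumed to already be downsampled to 16 kHz"
    for i, (start_s, end_s, label) in enumerate(annotations):
        # Convert utterance boundaries from seconds to sample indices and cut the clip.
        clip = audio[int(start_s * sr):int(end_s * sr)]
        sf.write(f"{out_dir}/utt_{i:04d}_{label}.wav", clip, sr)

# Hypothetical example: two utterances from one speaker, labeled "truth" or "lie".
segment_interview(
    "speaker_01.wav",
    [(12.3, 17.8, "truth"), (42.0, 49.5, "lie")],
    out_dir="csc_segments",
)
```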

3.2. Local Dataset

This paper designs a lie collection protocol that simulates real-life stressful situations. A total of 32 participants were involved in the recording, with 12 recruited from the general public and 20 being current students at Jiangsu University of Science and Technology. The participants' ages ranged from 22 to 26 years old with a balanced gender ratio, and all spoke Mandarin as their native language. The formal recording was divided into three main parts. The first part was the "simulated crime", where participants were first taken into a sealed room and informed that they could steal any item within the room. This part recorded the participants' testimonies as they tried to clear themselves of suspicion after performing the simulated crime task. This simulation of a real crime under stressful conditions enhances the ecological validity of the audio data. The second part was "deliberate lying", where three topics were given: "my university", "my best friend", and "my teacher". Participants were asked to choose a topic related to their personal experiences and then lie about it, providing a statement that contradicted their actual experiences, which was then recorded. Before starting these two parts, all participants were informed that our developed lie detection system could effectively identify lies. If they successfully deceived the system, they would receive an additional reward; otherwise, a corresponding fee would be deducted. The third part was "recounting experiences", where participants chose one of the remaining two topics to make a truthful statement. The recorded audio data served as truthful references.
The recording process was conducted in a professional recording studio, where all recording equipment was rigorously calibrated before the formal recording to ensure efficiency and compliance with the required recording standards. The audio annotation and segmentation tasks were completed by three skilled annotators using the Praat audio processing tool. All three annotators had undergone standardized training and adhered to unified editing standards and procedures. A 0.1-s silent buffer was retained before and after each labeled speech waveform to prevent clipping [31]. The segmentation strategy employed in this paper is sentence-level segmentation. Annotators attempted to segment complete sentences, allowing for brief silence periods due to breathing or pausing within a sentence. Non-speech segments such as prolonged periods of silence, laughter, breathing, or coughing were labeled as “NOISE” and subsequently removed during the sentence segmentation process. Unlike the interpausal unit (IPU) segmentation method described in [25], we believe that segmenting sentences based on pauses could potentially disrupt the integrity of the sentence expression and lead to the loss of contextual information. Ultimately, a total of 2920 audio data samples from 32 speakers were collected, with the basic information of the dataset presented in Table 2. All data were stored in WAV format and segmented using the same standards. During the recording process, it was observed that compared with truthful statements, deceptive statements had longer durations, more irregular pauses, and less clear articulation.

3.3. Local Dataset Evaluation

We analyzed the diversity of lies, speaker variability, and potential biases that may have existed in the dataset collection process. The participants included both students and members of the general public, representing a range of backgrounds. The balanced gender ratio helped minimize biases associated with gender. Moreover, the three distinct collection scenarios cover different types of lies. The simulated crime scenario, through the “sealed room theft task”, mimics defensive lies under high-stress conditions [32], which are often accompanied by physiological stress responses such as rapid breathing and irregular pauses, closely resembling real interrogation scenarios. The active lying task requires participants to fabricate statements that contradict their personal experiences, increasing the cognitive load during the lying process. This type of lie may manifest as semantic contradictions or logical flaws. The experiential statement provides baseline vocal features for the same speaker, aiding the model in distinguishing between individual habitual speech patterns and lie-specific changes. This design enhances the diversity of the data and helps the model learn lie features under different scenarios. The data collection process was conducted in a recording environment, minimizing noise interference. The annotation process was standardized. The three annotators, who were trained and used uniform criteria, improved the consistency and quality of the annotations. The sentence-level segmentation approach maximized the integrity of the statements while retaining contextual information, which positively impacted the accuracy of the model.

4. Methods

The algorithm in this paper takes the late-fusion strategy [33] as its core and combines a gated recurrent unit (GRU), the attention mechanism, and the transformer, as shown in Figure 1. After the prosodic features and nonlinear features are manually extracted and spliced together, they are processed by the upper GRU-Attention network. Firstly, through the pooling operation, multi-level spatial features are extracted from the input data. Subsequently, the preprocessed features are input into the GRU to capture the temporal dependencies of the sequence and enhance the model's understanding of the sequence information. Finally, the key information output by the GRU is weighted by the attention mechanism to improve the model's sensitivity to important features, ensuring that the model can not only identify lies accurately but also clearly indicate the key parts on which its decision is based. The encoder part of the transformer is used at the bottom to process the Mel spectrogram. The Mel spectrogram of the speech is likewise compressed through max pooling to meet the input requirements of the transformer encoder. Subsequently, through a stack of four transformer encoder layers, each containing four attention heads, the long-term dependency relationships in the time sequence are captured, improving the ability to analyze sequence data. The features extracted in the two directions are linearly combined in the linear layer through the weight matrix and bias term, thereby realizing the fusion of different features. The fused features ensure that the algorithm can comprehensively analyze features from multiple dimensions. Then, the Softmax layer converts the output of the linear layer into the predicted probability of each category, providing a clear decision-making basis for the model and thus outputting an accurate category label (truth or lie). To describe the function of the algorithm precisely, we formally define the speech lie detection problem.
The lie detection problem can be defined as follows. Let $X = \{X_a, X_n, X_m\}$ represent the multi-feature input sample, where $X_a$, $X_n$, and $X_m$ denote the ACO features, NLD features, and Mel spectrogram features extracted from the speech signal, respectively. Each feature sequence can then be written as $X_l = \{X_{l,1}, X_{l,2}, \ldots, X_{l,N}\}$, $l \in \{a, n, m\}$, where $N$ is the feature dimension. The goal is to predict the category of a sample according to its deceptive content, with labels $y \in \{y_0, y_1\}$, where $y_0$ represents a lie and $y_1$ represents the truth. Finally, we obtain the embedded feature vectors $F_x$ ($x \in \{a, n, m\}$) that fuse the three features, and the predicted label is obtained as follows:
$$\bar{y} = M\left(F_a, F_n, F_m\right)$$
where $M$ is the fusion function and $\bar{y} \in \{y_0, y_1\}$.

4.1. Feature Extraction

4.1.1. Acoustic Prosodic Features

Prosodic features reflect the static characteristics of speech signals within a short time frame (10-30 milliseconds) [27], so short-time analysis methods are generally used. In this paper, the openSMILE 3.0.2 toolkit was used to extract the INTERSPEECH 2013 ComParE Challenge feature set [34]. This feature set contains 6373 static features, specifically including low-level descriptors related to energy, spectrum, and phonation, such as the log harmonic-to-noise ratio (HNR), spectral harmonicity, and psycho-acoustic spectral sharpness. The statistical functionals include the mean, moments, quartiles, and the 1% and 99% percentiles, as well as contour-related measurements, such as the (relative) rise and fall times, magnitudes, standard deviations of local maxima, and linear and quadratic regression coefficients. The specific contents are shown in Table 3.
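A minimal sketch of this extraction step is shown below. The paper uses openSMILE 3.0.2 with the INTERSPEECH 2013 ComParE configuration; the openSMILE Python wrapper exposes the ComParE 2016 set, which contains the same 6373 functionals, so it is used here as a stand-in. The file name is a placeholder.

```python
import opensmile

# Configure openSMILE to return the 6373 utterance-level functionals of the ComParE set.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)

# Returns a pandas DataFrame with one row per file and 6373 static feature columns.
aco_features = smile.process_file("utt_0001_truth.wav")
print(aco_features.shape)  # (1, 6373)
```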

4.1.2. Nonlinear Dynamic Features

Air flow propagating in the vocal tract does not always propagate in the form of a plane wave [35]. Instead, it sometimes separates and attaches to the walls of the vocal tract. NLD features precisely stem from such nonlinear aerodynamic phenomena within the vocal tract, such as non-laminar flow or the generation and propagation of vortices, providing supplementary analysis for lie information. For the extraction of NLD features, the amplitude modulation-frequency modulation (AM-FM) model was adopted [36]. In this model, taking the frequency of a single formant in the speech signal as the carrier frequency, frequency modulation and amplitude modulation are carried out. Afterward, by using the energy separation algorithm, the instantaneous frequency corresponding to each formant is separated from the speech signal. Using this instantaneous frequency, the NLD features of the speech signal are further obtained.
In the AM-FM model, it is assumed that a speech signal is the result of the superposition of several amplitude-modulated and frequency-modulated formants. For a carrier frequency $f_c$, a frequency-modulating signal $q(t)$, and an amplitude-modulating signal $a(t)$, a single resonance can be expressed as follows:
$$r(t) = a(t)\cos\!\left(2\pi\!\left[f_c t + \int_{0}^{t} q(\tau)\, d\tau\right] + \theta\right)$$
The carrier frequencies here correspond to the formant frequencies, and the full speech signal is
$$s(t) = \sum_{k=1}^{K} r_k(t)$$
Here, $K$ represents the total number of formants, and $r_k(t)$ is the amplitude- and frequency-modulated signal whose carrier is the $k$th formant frequency. For the modulated signal of a single formant, an energy separation algorithm can be used to separate the amplitude envelope $a(t)$ and the instantaneous frequency $f(t)$ from the speech signal. This energy separation algorithm was developed based on the Teager energy operator. The Teager energy operator is quite helpful for signal analysis in both the continuous and discrete domains, and it has convenient properties for time scaling, composite functions, and arithmetic operations on functions; these properties can be used to simplify calculations and make the expressions clearer. In this paper, 18 NLD features, including fractal features, Lyapunov exponents, and Kolmogorov entropy, are extracted. Compared with the emotional cues captured by prosodic features, NLD features focus more on cues of cognition, memory, and strategic communication.
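The sketch below illustrates the discrete Teager energy operator and one common energy separation variant (DESA-2), which underlie the AM-FM decomposition described above. It operates on a band-passed formant signal; the full 18-dimensional NLD set (fractal features, Lyapunov exponents, Kolmogorov entropy, etc.) would be computed on top of such AM-FM quantities and is not reproduced here.

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager energy operator: psi[x](n) = x(n)^2 - x(n-1) * x(n+1)."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa2(x, sr):
    """DESA-2 energy separation: instantaneous frequency (Hz) and amplitude envelope."""
    psi_x = teager_energy(x)
    y = x[2:] - x[:-2]              # symmetric difference y(n) = x(n+1) - x(n-1)
    psi_y = teager_energy(y)
    psi_x = psi_x[1:-1]             # align lengths after the extra differencing step
    eps = 1e-10
    omega = 0.5 * np.arccos(np.clip(1.0 - psi_y / (2.0 * psi_x + eps), -1.0, 1.0))
    inst_freq = omega * sr / (2.0 * np.pi)          # radians/sample -> Hz
    amp_env = 2.0 * psi_x / (np.sqrt(psi_y) + eps)  # amplitude envelope estimate
    return inst_freq, amp_env
```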

4.1.3. Mel Spectrograms

The calculation process of the Mel spectrogram is shown in Figure 2. Considering that the human ear is less sensitive to high-frequency sounds than low-frequency sounds, it is necessary to pre-emphasize the signal first to enhance the energy of the high-frequency part and make the spectrum more uniform. Subsequently, the audio signal is divided into short frames, and a window function is applied to each frame to reduce edge effects and spectral leakage. A fast Fourier transform (FFT) is applied to each window to convert the time-domain signal into a frequency-domain signal [37], also known as the short-time Fourier transform (STFT), and the calculation formula is as follows:
$$X(t, f) = \int_{-\infty}^{\infty} x(\tau)\, w(t - \tau)\, e^{-j 2 \pi f \tau}\, d\tau$$
Here, $x(\tau)$ is the original signal, $w(t - \tau)$ is the window function, $t$ represents time, and $f$ represents frequency. The STFT results are filtered through the Mel filter bank. The design of the Mel filter bank is based on the auditory characteristics of the human ear, whose sensitivity to sounds of different frequencies is nonlinear. Therefore, performing a logarithmic operation on the filtered results helps simulate the way the human ear perceives sounds, making the subsequent processing more consistent with human auditory characteristics. The final result is the Mel spectrogram representation of the speech signal. In the experiment, we used the librosa library to extract the Mel spectrogram. The sampling rate was 48 kHz, the Hamming window length was 512, and the Fourier transform window size was 1024. The training data were augmented by adding additive white Gaussian noise (AWGN) to enhance the diversity of the data and improve the accuracy and generalization of the algorithm. Figure 3 compares the spectrograms of truthful and deceptive speech. The horizontal axis of the spectrogram is time, the vertical axis is frequency, and the color intensity represents the energy level. It can be observed that, compared with truthful speech, the spectrogram of lie speech has more brightly colored areas in the high-frequency regions, indicating greater energy in the high-frequency range. Additionally, the spectrogram of lie speech shows a more uneven energy distribution with more drastic color changes, reflecting the instability of speech when lying.
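A minimal sketch of the Mel spectrogram extraction and AWGN augmentation is given below, using the parameters stated in the text (48 kHz audio, 1024-point FFT, 512-sample Hamming window). The hop length and the number of Mel bands are not specified in the text and are assumptions here.

```python
import librosa
import numpy as np

def mel_spectrogram(path, sr=48000, n_fft=1024, win_length=512, hop_length=256, n_mels=64):
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, win_length=win_length,
        hop_length=hop_length, window="hamming", n_mels=n_mels,
    )
    return librosa.power_to_db(mel)  # log-Mel spectrogram in dB, shape (n_mels, frames)

def add_awgn(y, snr_db):
    """Add white Gaussian noise at a target signal-to-noise ratio (in dB)."""
    signal_power = np.mean(y ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return y + np.random.normal(0.0, np.sqrt(noise_power), size=y.shape)
```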

4.2. Network Structure

4.2.1. GRU-Attention

The GRU is a variant of the recurrent neural network (RNN) and can effectively solve the vanishing gradient problem of traditional RNNs. Compared with LSTM, the GRU has a relatively simple structure and requires less computation. It is suitable for processing sequence data and can effectively capture context information [38]. For example, when processing the sentence “My best friend is Chen Yaohui, but recently he did a very disloyal thing”, the combination of information from both directions enables the model to accurately infer that “he” refers to “Chen Yaohui” and understand that the key point of this sentence lies in pointing out the contrast between the behavior of “Chen Yaohui” and his identity (best friend). After the features are processed by the GRU network, they are weighted by the attention mechanism [34]. The model can focus more on the key features in the speech, helping the model make more accurate judgments, thereby improving the accuracy of lie detection.
We selected features with emotional representativeness from the prosodic features, including the Mel-frequency cepstral coefficients (MFCCs), linear predictive cepstral coefficients (LPCCs), energy, and amplitude. In addition, we used the AM-FM method to extract 18-dimensional nonlinear dynamic features from the speech and spliced them with the prosodic features. The combined features were input into the GRU after dimension reduction by the max pooling layer. Inside the GRU network, there are two gates, namely the update gate and the reset gate. The reset gate determines how to combine newly input information with previous memories, and the update gate defines how much of the previous memory is carried over to the current time step. Their formulas are as follows:
$$z_t = \sigma\left(W_z x_t + U_z h_{t-1}\right), \qquad r_t = \sigma\left(W_r x_t + U_r h_{t-1}\right)$$
Here, $x_t$ is the input vector at time step $t$. This vector and the hidden state from time step $t-1$, $h_{t-1}$, undergo linear transformations; that is, they are multiplied by the weight matrices $W_z$ and $U_z$ (or $W_r$ and $U_r$ for the reset gate). Each gate then adds the two parts and passes the sum through the Sigmoid function, compressing the activation to between 0 and 1. The reset gate mainly determines how much past information needs to be forgotten; it enters the candidate memory as follows:
$$\tilde{h}_t = \tanh\left(W x_t + r_t \odot U h_{t-1}\right)$$
The Hadamard product of the reset gate $r_t$ and $U h_{t-1}$ (i.e., $r_t \odot U h_{t-1}$) determines which past information is retained and which is forgotten. In the calculation of the final memory, the update gate decides how much of the current candidate content $\tilde{h}_t$ and of the previous time step $h_{t-1}$ is collected. This process can be represented as follows:
$$h_t = z_t \odot h_{t-1} + \left(1 - z_t\right) \odot \tilde{h}_t$$
where $z_t$ is the activation of the update gate, which controls the inflow of information in a gated manner, and $z_t \odot h_{t-1}$ represents the information retained from the previous time step in the final memory. This, plus the information retained from the current candidate memory, gives the output of the gated recurrent network; at the same time, this formulation effectively mitigates the vanishing gradient problem. Next, the attention mechanism is implemented through a linear layer. This linear layer maps the GRU output at each time step to a scalar value used for calculating the attention weights. For each time step of the GRU output, the corresponding embedding vector is extracted, and its attention weight is then calculated through the linear layer. With weight matrix $W$ and bias $b$, the linear transformation is
$$y = W x + b$$
The importance of each position in the time series is determined from the embedding vector at that time step. The resulting scores $x = \left(x_1, x_2, \ldots, x_n\right)$ are then stacked along the last dimension, and the Softmax function is applied to obtain the normalized attention weights:
$$\sigma\left(x_i\right) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$$
Suppose the attention weights have shape $(B, 1, T)$ and the attention embeddings have shape $(B, T, 2h)$, where $h$ is the hidden layer size. Following the rules of batch matrix multiplication, the attention-weighted GRU embedding is
$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$
The entire network structure combines the sequence-modeling ability of the GRU, the key-focusing ability of the attention mechanism, and the discriminative ability of the linear classification layer. As shown in Figure 4, in this way, effective feature representations are screened from the prosodic feature sequence $X_a$ and the NLD sequence $X_n$, and the sequence feature vector $F_{an}$ obtained by fusing the two types of features is output.
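A compact PyTorch sketch of this stream is given below: a bidirectional GRU, a linear layer that scores each time step, Softmax-normalized attention weights, and a weighted sum via batch matrix multiplication. The input dimension follows the 1024-dimensional pooled ACO+NLD vector mentioned in Section 5.1; how that pooled vector is framed as a sequence of time steps, and the exact dropout placement, are assumptions here rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class GRUAttention(nn.Module):
    def __init__(self, input_dim=1024, hidden=64, dropout=0.2):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden, 1)   # y = Wx + b, one scalar score per time step
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                                     # x: (B, T, input_dim)
        h, _ = self.gru(x)                                    # h: (B, T, 2*hidden)
        weights = torch.softmax(self.score(h), dim=1)         # attention weights: (B, T, 1)
        context = torch.bmm(weights.transpose(1, 2), h)       # weighted sum: (B, 1, 2*hidden)
        return self.drop(context.squeeze(1))                  # F_an: (B, 2*hidden)
```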

4.2.2. Transformer

The transformer encoder is a structure that focuses on long-time-series dependency analysis and deepened feature expression, relying mainly on the multi-head self-attention mechanism to process the input data [39]. This mechanism can deeply analyze the long-time-series dependency relationships in the input data and fully mine the global temporal information it contains. In terms of feature deepening, stacking multiple transformer encoder layers gradually improves the expressive power of the features, enabling the model to better capture the complex patterns of the input data along the time dimension, which gives it unique advantages for this task.
The input data are the Mel spectrogram of the speech. During operation of the network, the spectrogram is first compressed in the spatial dimension through the max pooling operation to obtain the input feature vector:
$$X_{\mathrm{pool}} \in \mathbb{R}^{B \times H \times W}$$
After this, they are input into the transformer encoder. The transformer encoder is composed of multiple layers, and each layer contains a self-attention layer and a fully connected feedforward neural network layer. The self-attention layer allows the model to consider the context information of all other words simultaneously when processing each word and generates output by calculating the correlations among the query, key, and value vectors of each word. The calculation formula is as follows:
$$\mathrm{Attention}\left(Q, K, V\right) = \mathrm{Softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$
Here, $Q$, $K$, and $V$ are the query, key, and value matrices, respectively, $d_k$ is the dimension of the key, and the Softmax function is used to compute the attention weights. We use the multi-head self-attention mechanism to dynamically compute the time-series-dependent weights:
$$\mathrm{MultiHead}\left(Q, K, V\right) = \mathrm{Concat}\left(\mathrm{head}_1, \ldots, \mathrm{head}_4\right) W$$
The feature expression is deepened through the feedforward neural network, and the gradient propagation is optimized with layer normalization and residual connection. Finally, the transformer-embedded features are obtained by averaging along the time dimension. As shown in Figure 5, by using the long-term dependency modeling ability of the transformer encoder, effective deep feature vectors are selected from the Mel spectrogram.
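A minimal PyTorch sketch of this stream is shown below: max pooling with kernel and stride [2, 4], a stack of four transformer encoder layers (model dimension 64, 4 heads, feedforward size 512, dropout 0.4, ReLU), and mean pooling over time to obtain the deep feature vector. The linear projection from pooled Mel bins to the 64-dimensional model size and the number of Mel bands are assumptions, as the text does not specify them.

```python
import torch
import torch.nn as nn

class MelTransformer(nn.Module):
    def __init__(self, n_mels=64, d_model=64, n_heads=4, n_layers=4, d_ff=512, dropout=0.4):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=(2, 4), stride=(2, 4))
        self.proj = nn.Linear(n_mels // 2, d_model)          # pooled Mel bins -> model dim
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=d_ff,
            dropout=dropout, activation="relu", batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, mel):                                   # mel: (B, n_mels, T)
        x = self.pool(mel.unsqueeze(1)).squeeze(1)            # (B, n_mels/2, T/4)
        x = self.proj(x.transpose(1, 2))                      # (B, T/4, d_model)
        x = self.encoder(x)                                   # self-attention over time
        return x.mean(dim=1)                                  # F_m: (B, d_model), time-averaged
```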

4.3. Multi-Feature Fusion

After being processed by the two stream networks, we obtained two latent vectors, $F_{an}$ and $F_m$, from the ACO features, NLD features, and spectrogram features, where $F_{an}$ is the feature vector obtained by fusing $F_a$ and $F_n$. These feature representations were assumed to be able to reflect the intention to lie based on the various types of feature information. After that, in order to achieve better classification performance, further fusion of $F_{an}$ and $F_m$ was still required to obtain the feature vector $F_{anm}$ for the Softmax classifier. Our fusion at the decision-making level was carried out in the form of hard voting or soft voting. The fusion technique is defined as follows:
$$\bar{y} = \begin{cases} \mathrm{mode}\left(C_{an}\left(F_{an}\right),\, C_{m}\left(F_{m}\right)\right), & \text{if hard voting} \\[4pt] \operatorname*{arg\,max} \sum_{j \in \{an, m\}} \dfrac{\mathrm{softmax}\left(F_j\right)}{4}, & \text{if soft voting} \end{cases}$$
where $C_j = \arg\max\left(\mathrm{softmax}\left(F_j\right)\right)$ and $j \in \{an, m\}$. We combined the feature vectors output from the two parts through the late-fusion strategy to form a comprehensive feature representation. This fusion process ensured that the algorithm could obtain lie-related information from both the temporal dynamics and the frequency-domain structure of the speech signal simultaneously. Finally, the fused feature vector was sent to a Softmax classifier for training, which is responsible for mapping the input data onto the predefined truth and lie categories. The training goal of the entire algorithm is to optimize the cross-entropy loss function and achieve effective prediction and classification of lie speech, as shown in the following formulas:
$$p\left(y = k\right) = \mathrm{Softmax}\left(\mathrm{logits}\right)_k = \frac{\exp\left(\mathrm{logits}_k\right)}{\sum_{j} \exp\left(\mathrm{logits}_j\right)}$$
$$\tau = -\sum_{k} y_k \log\left(p_k\right)$$
In this way, our algorithm can not only make full use of the advantages of deep learning in feature learning and pattern recognition but also enhance the understanding and expression ability of complex paralinguistic content in speech by combining different types of features and network modules.
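The sketch below illustrates the fusion step under the assumptions stated in the text: the two stream outputs can either be concatenated and classified by a linear+Softmax head (as in Section 5.1) or combined at the decision level by hard or soft voting. The per-stream classifier heads and the feature dimensions are placeholders, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    def __init__(self, dim_an, dim_m, n_classes=2):
        super().__init__()
        self.head_an = nn.Linear(dim_an, n_classes)          # per-stream heads used for voting
        self.head_m = nn.Linear(dim_m, n_classes)
        self.fused = nn.Linear(dim_an + dim_m, n_classes)    # feature-level late fusion

    def forward(self, f_an, f_m):
        return self.fused(torch.cat([f_an, f_m], dim=-1))    # logits over {truth, lie}

    def soft_vote(self, f_an, f_m):
        # Sum the per-stream class probabilities and pick the most likely class.
        p = torch.softmax(self.head_an(f_an), -1) + torch.softmax(self.head_m(f_m), -1)
        return p.argmax(-1)

    def hard_vote(self, f_an, f_m):
        # Majority (mode) of the two streams' independent class decisions.
        votes = torch.stack([self.head_an(f_an).argmax(-1),
                             self.head_m(f_m).argmax(-1)], dim=0)
        return votes.mode(dim=0).values
```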

5. Experiment

5.1. Experimental Set-Up

The experiments in this paper were carried out in a computing environment with a 13th Gen Intel(R) Core(TM) i7-13700HX processor and an NVIDIA GeForce RTX 4060 GPU. PyTorch 2.1.2 was adopted as the main deep learning framework. During the data preprocessing phase, a leave-one-out cross-validation training method was employed to assess the model's performance and generalization capability. The speech sampling rate was uniformly 48 kHz, and the duration of each speech segment was 3 s. By adding Gaussian white noise with SNRs in the range of 15 to 30 dB, the sample size was doubled. A max pooling layer was used before the GRU to map the features to 1024 dimensions and meet the requirements for network input. The GRU network was set to be bidirectional, with a hidden layer size of 64 and an output dimension of 128. A dropout layer with an internal dropout rate of 0.2 and an external dropout rate of 0.4 was set up to prevent overfitting. Subsequently, the importance weights of each time step in the sequence were calculated through the attention mechanism. In the transformer part, there was first a max pooling layer with a kernel size of [2, 4] and a stride of [2, 4], aiming to reduce the spatiotemporal dimensions of the input and adapt to the input dimension standard of the transformer. Then, a transformer block was defined, which consisted of four identical encoder layers. Each layer had a dimension size of 64, the number of heads in the multi-head attention was 4, the hidden layer size of the feedforward neural network was 512, and the dropout rate was set to 0.4. ReLU was used as the activation function.
The features extracted by the two parts were combined in a late fusion manner to generate the final feature vector. After dimension reduction, it was input into a linear layer with an input size of 320 and an output size of 2 (truth or lie). The cross-entropy function was used to calculate the loss, and the stochastic gradient descent optimizer (SGD) was used to optimize the model parameters to avoid the problems of vanishing gradients or exploding gradients to the greatest extent. Finally, the Softmax classifier converted the output of the linear layer into a probability distribution and output the classification result.
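A sketch of this training configuration is shown below: cross-entropy loss, SGD, and a leave-one-speaker-out loop. The learning rate, momentum, epoch count, DataLoader construction, and the model's two-input call signature are assumptions, as they are not given in the text.

```python
import torch
import torch.nn as nn

def train_leave_one_speaker_out(build_model, speaker_loaders, epochs=50, lr=1e-3):
    """speaker_loaders: dict mapping speaker id -> DataLoader of (feats, mel, labels)."""
    accuracies = []
    for held_out, test_loader in speaker_loaders.items():
        model = build_model()
        optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
        criterion = nn.CrossEntropyLoss()
        train_loaders = [l for s, l in speaker_loaders.items() if s != held_out]
        model.train()
        for _ in range(epochs):
            for loader in train_loaders:
                for feats, mel, labels in loader:
                    optimizer.zero_grad()
                    loss = criterion(model(feats, mel), labels)   # logits over {truth, lie}
                    loss.backward()
                    optimizer.step()
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for feats, mel, labels in test_loader:
                correct += (model(feats, mel).argmax(-1) == labels).sum().item()
                total += labels.numel()
        accuracies.append(correct / total)
    return sum(accuracies) / len(accuracies)   # mean held-out-speaker accuracy
```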

5.2. Ablation Experiment

The three types of features extracted in the methodology section each occupied their own feature space and could describe the characteristics of a lie from different perspectives [27]. To verify the effectiveness of each feature type, we created independent feature sets from the ACOs, NLDs, and Mel features, using the GRU as the baseline model with parameters consistent with the GRU-Attention part of our algorithm. Experiments were conducted on the Local dataset, and the results were comprehensively evaluated using the accuracy, precision, recall, and F1 score metrics. The results are shown in Table 4, where the Mel features performed exceptionally well. Their accuracy reached 65.19%, and when used alone for speech lie detection, the overall probability of correct judgment was high, with precision and recall rates of 65.43% and 66.21%, respectively, indicating that a high proportion of samples predicted as lies were actually lies, having a low likelihood of misjudgment. The ACOs showed moderate performance, contributing to lie detection but not standing out. The NLDs showed poor performance across all metrics, with a weak overall discrimination ability when used alone for speech lie detection and a limited ability to accurately determine lies, potentially requiring further exploration or use in conjunction with other features.
Next, we combined the feature sets to evaluate their contributions to lie detection. The experimental results are shown in Table 5, where the All (ACO + NLD + Mel) combination performed the best, with the highest metrics among all combinations. This indicates that the combination of these three types of features had the strongest comprehensive performance in speech lie detection, providing the best representation of lies. The NLDs did not perform well when used alone, but when combined with other features, they enhanced the experimental results. This suggests that although NLDs have weak individual discrimination ability, they contain information that is complementary to other features. This information can compensate for shortcomings when used alone with ACOs, Mel features, and other features, thereby improving the overall detection accuracy. This also implies that there is a synergistic effect between different features in speech lie detection, and the combination of multiple features helps to more comprehensively and accurately detect lies in speech.
We assessed the contribution of each module in the algorithm through ablation experiments. Specifically, we decomposed DDA-MSLD into the GRU, GRU-Attention, and transformer components and input different combinations of features to observe their impact on the detection performance. On the Local dataset, we compared the results of the different ablation experiments with the performance of the complete algorithm, again using accuracy, precision, recall, and F1 score to measure the performance of each model, verifying the role and importance of each part in the speech lie detection task. The comparison results are shown in Table 6.
It can be observed that DDA-MSLD, when integrating multiple features, achieved an accuracy rate of 80.28%, which was higher than the other sub-models, demonstrating the importance of the complementary nature of multimodal features. The transformer encoder using Mel spectrograms outperformed the GRU using pure acoustic features in terms of accuracy and precision, indicating that the Mel spectrogram features had stronger discriminative power. In terms of model architecture, the GRU had the capability to model acoustic temporal features and achieved considerable detection performance with the combination of ACO + NLD. However, when relying solely on the attention mechanism, it struggled to explore multimodal relationships, resulting in weaker performance compared with the complete algorithm. The transformer, when using only Mel spectrograms, had a lower recall and F1 score compared with the complete model, indicating the need to combine it with other modalities to balance precision and recall. Finally, this paper’s DDA-MSLD algorithm enhanced the overall accuracy, recall, and F1 score by integrating features processed by different modules, validating the rationality of the modular design, with the recall rate performing the best, which could reduce the risk of missed detections.

5.3. Experimental Comparison on the CSC Datasets

Lies in multi-person scenarios tend to be more public and strategic. Such lies are often structured and intended to mislead others for a specific purpose. To verify the detection performance of our algorithm in multi-person lying scenarios, we conducted experiments on the CSC dataset. We used random forests (RFs) and a deep feedforward network (DFNN) as baseline models. RFs can handle high-dimensional speech feature data and improve detection accuracy and stability by integrating the results of multiple decision trees, and they are robust to noise and outliers in the data. The DFNN has strong feature learning capabilities, performing nonlinear transformations and classification of speech features through forward propagation across multiple layers of neurons to determine whether speech is deceptive, and it generalizes well across various speech scenarios. Additionally, we selected three recent state-of-the-art lie detection algorithms, RVM, CovBiLSTM, and HAN, to compare with the DDA-MSLD model proposed in this paper. The comparison results are shown in Table 7.
We found that our algorithm and the latest deep learning methods outperformed the baseline models RF and DFNN across all metrics. RVM showed certain improvements under different settings, such as when using hard voting and soft voting strategies with ACO + NLD features, which led to increased accuracy and other metrics. CovBiLSTM achieved a high accuracy of 74.84% with ACOs, demonstrating good performance. The HAN algorithm also performed well when combining ACOs and NLDs, reaching an accuracy of 74.71%. Our DDA-MSLD algorithm showed the best performance, with considerable metrics when using hard voting and soft voting with ACO + NLD. In particular, with soft voting and the combination of ACO + NLD + Mel, the accuracy reached 82.27%. Although there was some fluctuation in precision with hard voting under the same feature set, the overall F1 score still reached 81.88%, showing a clear advantage over other algorithms in terms of accuracy, recall, and F1 score. Soft voting strategies generally performed better than hard voting ones, indicating that integrating various features and using appropriate voting strategies can significantly improve detection performance.

5.4. Experimental Comparison on the Local Datasets

In single-person scenarios, lies often manifest as private exchanges between individuals, primarily aimed at concealing personal emotions, avoiding conflict, or self-protection. Such deceptions are usually more subtle and may include partial truths to enhance credibility, making them more personalized and contextual. Table 8 presents the experimental results of the selected algorithms on the Local dataset, which represents single-person lying scenarios.
Although DDA-MSLD achieved the best accuracy rate, its detection accuracy on the Local dataset was slightly lower than that on the CSC dataset. This may be due to the imbalance of true and false speech samples in the Local dataset, where true speech accounted for the majority at a rate of 59.45%. In such cases, algorithms often predict the most frequent category, leading to inaccuracies in predicting the minority category and causing fluctuations in metrics such as the precision. Furthermore, the Local dataset may contain unique speech characteristics or noise interference that the algorithm did not fully adapt to, thus affecting the detection accuracy. At the same time, the high proportion of true speech samples in the dataset may have led to insufficient learning of false speech samples during the training process, making it difficult to make accurate judgments when facing fewer false speech samples and thereby reducing the detection accuracy on the Local dataset.

6. Conclusions

To address the shortcomings of current speech lie detection algorithms, namely the underuse of hand-crafted features based on prior knowledge and the neglect of the dynamic characteristics and temporal context of speech, this paper proposed DDA-MSLD, a lie detection algorithm that fuses multiple speech features. The algorithm combines GRU-Attention with a transformer encoder, extracting the emotional information carried by prosodic and nonlinear dynamic features while also capturing the high-order statistical properties and local patterns of the speech Mel spectrogram. By analyzing the speech signal in depth and from multiple complementary perspectives, it improves both detection accuracy and generalization. In addition, considering how lies differ across lying scenarios, we constructed the Local dataset for single-person lying scenarios and compared the proposed algorithm with several strong speech lie detection algorithms on both this dataset and the CSC dataset for multi-person lying scenarios. The results show that DDA-MSLD, combined with the three types of speech features, effectively improves the accuracy and generalization of speech lie detection.
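For concreteness, the following PyTorch sketch illustrates the dual-stream design summarized above and in Figure 1: a GRU-Attention stream over frame-level ACO/NLD sequences, a transformer-encoder stream over Mel spectrogram frames, fusion in a linear layer, and a Softmax classifier. All layer sizes and sequence lengths are placeholder assumptions; this is an illustrative reconstruction, not our released implementation.

```python
import torch
import torch.nn as nn

class DualStreamSketch(nn.Module):
    """Illustrative dual-stream model: GRU + attention over hand-crafted
    ACO/NLD sequences, plus a transformer encoder over Mel spectrogram frames."""

    def __init__(self, aco_nld_dim=40, mel_dim=80, hidden=128, n_classes=2):
        super().__init__()
        # Stream 1: GRU with additive attention pooling over hand-crafted features.
        self.gru = nn.GRU(aco_nld_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)
        # Stream 2: transformer encoder over projected Mel spectrogram frames.
        self.mel_proj = nn.Linear(mel_dim, hidden)
        enc_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Fusion in a linear layer followed by a Softmax classifier.
        self.fuse = nn.Linear(2 * hidden, n_classes)

    def forward(self, x_aco_nld, x_mel):
        h, _ = self.gru(x_aco_nld)                  # [B, T1, hidden]
        w = torch.softmax(self.attn(h), dim=1)      # attention weights over time
        f_an = (w * h).sum(dim=1)                   # pooled hand-crafted stream (F_an)
        z = self.encoder(self.mel_proj(x_mel))      # [B, T2, hidden]
        f_m = z.mean(dim=1)                         # pooled Mel stream (F_m)
        logits = self.fuse(torch.cat([f_an, f_m], dim=-1))
        return torch.softmax(logits, dim=-1)

model = DualStreamSketch()
probs = model(torch.randn(2, 100, 40), torch.randn(2, 200, 80))
print(probs.shape)  # torch.Size([2, 2])
```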
Future work includes further optimizing the structure and parameter settings of DDA-MSLD to improve its detection performance and extending its application range so that it can handle more complex and more diverse speech lie detection tasks. In addition, expanding the Local lie detection dataset will help further enhance the effectiveness and practical value of speech lie detection.

Author Contributions

P.G. conceived the study idea and led the overall research project. He was also responsible for the initial experimental design, determining the key parameters and experimental framework. S.H. carried out the majority of the data collection. This involved conducting numerous field investigations and laboratory tests. In addition, S.H. contributed to the data analysis by performing basic statistical calculations. M.L. focused on the in-depth data analysis. Using advanced statistical and computational techniques, he interpreted the data to extract meaningful trends and relationships. P.G. wrote the first draft of the manuscript. He organized the research findings into a logical structure, covering the introduction, methods, results, and a preliminary discussion. S.H. and M.L. participated in the critical review and revision of the manuscript. They provided valuable insights and suggestions to improve the scientific rigor, clarity, and overall quality of the paper. All authors have read and agreed to the final version of the manuscript.

Funding

This research was funded by “Pedestrian Detection via Robust Object Appearance Modeling” and “Visual Tracking via Robust Object Appearance Modeling” (grant numbers 62276118 and 61772244).

Data Availability Statement

The datasets generated or analyzed during the current study are not publicly available due to the data containing the research secrets of the research group but are available from the corresponding authors upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhou, Y.; Zhao, H.; Pan, X.; Shang, L. Deception detecting from speech signal using relevance vector machine and non-linear dynamics features. Neurocomputing 2015, 151, 1042–1052. [Google Scholar] [CrossRef]
  2. Landis, C.; Wiley, L.E. Changes of blood pressure and respiration during deception. J. Comp. Psychol. 1926, 6, 1. [Google Scholar] [CrossRef]
  3. Vrij, A.; Granhag, P.A.; Porter, S. Pitfalls and opportunities in nonverbal and verbal lie detection. Psychol. Sci. Public Interest 2010, 11, 89–121. [Google Scholar] [CrossRef]
  4. Graciarena, M.; Shriberg, E.; Stolcke, A.; Enos, F.; Hirschberg, J.; Kajarekar, S. Combining prosodic lexical and cepstral systems for deceptive speech detection. In Proceedings of the 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, Toulouse, France, 14–19 May 2006. [Google Scholar]
  5. Ekman, P.; O’Sullivan, M.; Friesen, W.V.; Scherer, K.R. Invited article: Face, voice, and body in detecting deceit. J. Nonverbal Behav. 1991, 15, 125–135. [Google Scholar] [CrossRef]
  6. DePaulo, B.M.; Lindsay, J.J.; Malone, B.E.; Muhlenbruck, L.; Charlton, K.; Cooper, H. Cues to deception. Psychol. Bull. 2003, 129, 74. [Google Scholar] [CrossRef] [PubMed]
  7. Fradkov, A.L.; Evans, R.J. Control of chaos: Methods and applications in engineering. Annu. Rev. Control 2005, 29, 33–56. [Google Scholar] [CrossRef]
  8. Krajewski, J.; Kröger, B.J. Using prosodic and spectral characteristics for sleepiness detection. In Proceedings of the INTERSPEECH, Antwerp, Belgium, 27–31 August 2007. [Google Scholar]
  9. Zhou, Y.; Zhao, H.; Pan, X. Lie detection from speech analysis based on k–svd deep belief network model. In Proceedings of the Intelligent Computing Theories and Methodologies: 11th International Conference, ICIC 2015, Fuzhou, China, 20–23 August 2015; Volume 11, pp. 189–196. [Google Scholar]
  10. Srivastava, N.; Dubey, S. Deception detection using artificial neural network and support vector machine. In Proceedings of the 2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA), Coimbatore, India, 29–31 March 2018; pp. 1205–1208. [Google Scholar]
  11. Liu, Z.-T.; Wu, M.; Cao, W.-H.; Mao, J.-W.; Xu, J.-P.; Tan, G.-Z. Speech emotion recognition based on feature selection and extreme learning machine decision tree. Neurocomputing 2018, 273, 271–280. [Google Scholar] [CrossRef]
  12. Levitan, S.I.; An, G.; Wang, M.; Mendels, G.; Hirschberg, J.; Levine, M.; Rosenberg, A. Cross-cultural production and detection of deception from speech. In Proceedings of the 2015 ACM on Workshop on Multimodal Deception Detection, Seattle, WA, USA, 13 November 2015. [Google Scholar]
  13. Mannepalli, K.; Sastry, P.N.; Suman, M. Analysis of emotion recognition system for Telugu using prosodic and formant features. In Speech and Language Processing for Human-Machine Communications: Proceedings of CSI 2015; Springer: Singapore, 2018; pp. 137–144. [Google Scholar]
  14. Dai, J.B.; Sun, L.X.; Shen, X.B. Research on speech spoofing detection based on big data and machine learning. In Proceedings of the 2021 2nd International Conference on Artificial Intelligence and Education (ICAIE), Dali, China, 18–20 June 2021; pp. 137–140. [Google Scholar]
  15. Nahari, G. ‘Language of lies’: Urgent issues and prospects in verbal lie detection research. Legal Criminol. Psychol. 2019, 24, 1–23. [Google Scholar] [CrossRef]
  16. Merkx, D.; Frank, S.L.; Ernestus, M. Language learning using speech to image retrieval. arXiv 2019, arXiv:1909.03795. [Google Scholar]
  17. Gopalan, K.; Wenndt, S. Speech analysis using modulation-based features for detecting deception. In Proceedings of the 2007 15th International Conference on Digital Signal Processing, Cardiff, UK, 1–4 July 2007; pp. 619–622. [Google Scholar]
  18. Mathur, L.; Matarić, M.J. Affect–aware deep belief network representations for multimodal unsupervised deception detection. In Proceedings of the 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021), Jodhpur, India, 15–18 December 2021; pp. 1–8. [Google Scholar]
  19. Hansen, J.H.L.; Womack, B.D. Feature analysis and neural network-based classification of speech under stress. IEEE Trans. Speech Audio Process. 1996, 4, 307–313. [Google Scholar] [CrossRef]
  20. Levitan, S.I.; Maredia, A.; Hirschberg, J. Linguistic cues to deception and perceived deception in interview dialogues. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 1941–1950. [Google Scholar]
  21. Mittal, T.; Bhattacharya, U.; Chandra, R.; Bera, A.; Manocha, D. Emotions don’t lie: An audio–visual deepfake detection method using affective cues. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 2823–2832. [Google Scholar]
  22. Chen, X.; Ita Levitan, S.; Levine, M.; Mandic, M.; Hirschberg, J. Acoustic-prosodic and lexical cues to deception and trust: Deciphering how people detect lies. Trans. Assoc. Comput. Linguist. 2020, 8, 199–214. [Google Scholar] [CrossRef]
  23. Levitan, S.I.; Maredia, A.; Hirschberg, J. Acoustic–Prosodic Indicators of Deception and Trust in Interview Dialogues. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 416–420. [Google Scholar]
  24. Vrij, A.; Granhag, P.A.; Mann, S.; Leal, S. Outsmarting the liars: Toward a cognitive lie detection approach. Curr. Dir. Psychol. Sci. 2011, 20, 28–32. [Google Scholar] [CrossRef]
  25. Fang, Y.; Fu, H.; Tao, H.; Liang, R.; Zhao, L. A novel hybrid network model based on attentional multi-feature fusion for deception detection. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 2021, 104, 622–626. [Google Scholar] [CrossRef]
  26. Fu, H.; Lei, P.; Tao, H.; Zhao, L.; Yang, J. Improved semi-supervised autoencoder for deception detection. PLoS ONE 2019, 14, e0223361. [Google Scholar] [CrossRef] [PubMed]
  27. Mendels, G.; Levitan, S.I.; Lee, K.-Z.; Hirschberg, J. Hybrid Acoustic–Lexical Deep Learning Approach for Deception Detection. In Proceedings of the Interspeech 2017, Stockholm, Sweden, 20–24 August 2017; pp. 1472–1476. [Google Scholar]
  28. Vrij, A. Baselining as a lie detection method. Appl. Cogn. Psychol. 2016, 30, 1112–1119. [Google Scholar] [CrossRef]
  29. Xie, Y.; Liang, R.; Tao, H.; Zhu, Y.; Zhao, L. Convolutional bidirectional long short-term memory for deception detection with acoustic features. IEEE Access 2018, 6, 76527–76534. [Google Scholar] [CrossRef]
  30. Deng, J.; Xu, X.; Zhang, Z.; Frühholz, S.; Schuller, B. Semisupervised autoencoders for speech emotion recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 26, 31–43. [Google Scholar] [CrossRef]
  31. Parthasarathy, S.; Busso, C. Semi-supervised speech emotion recognition with ladder networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 2697–2709. [Google Scholar] [CrossRef]
  32. Ma, H.-G.; Han, C.-Z. Selection of embedding dimension and delay time in phase space reconstruction. Front. Electr. Electron. Eng. China 2006, 1, 111–114. [Google Scholar] [CrossRef]
  33. Vrij, A.; Hartwig, M. Deception and lie detection in the courtroom: The effect of defendants wearing medical face masks. J. Appl. Res. Mem. Cogn. 2021, 10, 392–399. [Google Scholar] [CrossRef]
  34. Choromanski, K.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J.; Mohiuddin, A.; Kaiser, L.; et al. Rethinking attention with performers. arXiv 2020, arXiv:2009.14794v4. [Google Scholar]
  35. Sun, M.; Gao, M.; Kang, X.; Wang, S.; Du, J.; Yao, D.; Wang, S.-J. CDSD: Chinese Dysarthria Speech Database. arXiv 2023, arXiv:2310.15930. [Google Scholar]
  36. Nugroho, R.H.; Nasrun, M.; Setianingsih, C. Lie detector with pupil dilation and eye blinks using hough transform and frame difference method with fuzzy logic. In Proceedings of the 2017 International Conference on Control, Electronics, Renewable Energy and Communications (ICCREC), Yogyakarta, Indonesia, 26–28 September 2017; pp. 40–45. [Google Scholar]
  37. Huang, C.-H.; Chou, H.-C.; Wu, Y.-T.; Lee, C.-C.; Liu, Y.-W. Acoustic Indicators of Deception in Mandarin Daily Conversations Recorded from an Interactive Game. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 1731–1735. [Google Scholar]
  38. Chou, H.-C. Automatic deception detection using multiple speech and language communicative descriptors in dialogs. APSIPA Trans. Signal Inf. Process. 2021, 10, e5. [Google Scholar] [CrossRef]
  39. Levitan, S.I.; An, G.; Ma, M.; Levitan, R.; Rosenberg, A.; Hirschberg, J. Combining Acoustic–Prosodic, Lexical, and Phonotactic Features for Automatic Deception Detection. In Proceedings of the Interspeech 2016, San Francisco, CA, USA, 8–12 September 2016; pp. 2006–2010. [Google Scholar]
Figure 1. DDA-MSLD algorithm architecture. ACO and NLD features are fed into the upper GRU-Attention network and the Mel spectrogram into the lower transformer network; the two streams are fused in a linear layer, and a Softmax classifier produces the final classification.
Figure 2. Mel spectrogram calculation process. After the original speech signal undergoes pre-emphasis, framing, and windowing processing, the time-domain signal is transformed into a frequency-domain signal through an STFT and then filtered by the Mel filter bank. A logarithmic operation is performed on the filtered result to obtain the Mel spectrogram.
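A minimal librosa-based sketch of this pipeline is given below; the sampling rate, frame parameters, and number of Mel bands are illustrative assumptions, and the file path is hypothetical.

```python
import numpy as np
import librosa

def log_mel_spectrogram(path, sr=16000, n_fft=400, hop_length=160, n_mels=80):
    """Compute a log-Mel spectrogram roughly following the pipeline in Figure 2."""
    y, sr = librosa.load(path, sr=sr)
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])            # pre-emphasis
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length,   # framing, windowing, and STFT
        n_mels=n_mels)                                     # Mel filter bank
    return librosa.power_to_db(mel)                        # logarithmic compression

# Example (hypothetical file path):
# mel = log_mel_spectrogram("utterance.wav")
# print(mel.shape)  # (n_mels, n_frames)
```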
Figure 3. Comparison of Mel spectrograms between lies and real speech: (a) lies and (b) the truth.
Figure 4. Visualization of manual feature processing. The ACOs and NLDs extracted from the input audio clips are processed in sequence by the GRU and the attention mechanism, and the fused sequence feature vector F_an is finally obtained.
Figure 5. Visualization of deep feature processing. The Mel spectrogram is extracted from the input audio, and the transformer encoder then learns the lie-related feature representations F_m.
Table 1. Content information of the CSC dataset.
| Speaker ID | Number of Truths/Lies | Speaker ID | Number of Truths/Lies |
|---|---|---|---|
| S1A | 45/12 | S17A | 94/113 |
| S2B | 57/31 | S18B | 85/70 |
| S3C | 51/8 | S19C | 25/77 |
| S4D | 65/34 | S20D | 220/32 |
| S5A | 87/8 | S21A | 117/33 |
| S6B | 62/28 | S22B | 126/111 |
| S7C | 89/82 | S23A | 66/63 |
| S8D | 124/70 | S24D | 47/90 |
| S9A | 82/45 | S25A | 49/78 |
| S10B | 58/50 | S26B | 73/82 |
| S11C | 63/52 | S27C | 84/41 |
| S12D | 21/47 | S28D | 92/21 |
| S13A | 70/24 | S29A | 60/32 |
| S14B | 150/73 | S30B | 63/63 |
| S15C | 99/26 | S31C | 83/41 |
| S16D | 61/99 | S32D | 94/25 |
Table 2. Content information of the Local dataset.
| Collecting Scene | Truth/Lie | Number of Speakers | Number of Segments |
|---|---|---|---|
| Crime simulation | Lie | 32 | 392 |
| Deliberate lying | Lie | 32 | 792 |
| Recounting experiences | Truth | 32 | 1736 |
Table 3. Content of the INTERSPEECH 2013 ComParE Challenge feature set (an illustrative feature extraction sketch follows the table).
| Feature Category | Feature Meaning |
|---|---|
| 5 related to energy | The basic energy and frequency distribution of sound |
| 15 related to the frequency spectrum | Analyze and identify the characteristics of sound |
| 5 related to sound quality | Understand and quantify various physical and perceptual attributes of sound |
| 6 related to pitch | Gain a deeper understanding of the structure and function of sound |
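As a practical note, feature vectors of this kind can be extracted with the opensmile Python package, which ships the closely related ComParE 2016 functionals set; the call below is a general illustration under that assumption and is not the exact extraction setup used in this work.

```python
import opensmile

# Extract utterance-level ComParE functionals. The 2016 set is the one shipped
# with the opensmile package; it is closely related to, but not identical with,
# the 2013 challenge set referenced above.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)
# features = smile.process_file("utterance.wav")   # hypothetical file path
# print(features.shape)                            # one row of ~6373 functionals
```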
Table 4. Single-feature detection results. Among them, the MEL features achieved the best detection performance.
| Feature | Accuracy (%) | Precision (%) | Recall (%) | F1 Score (%) |
|---|---|---|---|---|
| ACO | 63.22 | 62.19 | 63.54 | 62.78 |
| NLD | 62.20 | 61.35 | 62.56 | 60.08 |
| MEL | 65.19 | 65.43 | 66.21 | 65.78 |
Table 5. Detection results of combined features. Among them, “All” represents ACO + NLD + Mel, and this feature set achieved the best detection performance.
| Features | Accuracy (%) | Precision (%) | Recall (%) | F1 Score (%) |
|---|---|---|---|---|
| ACO + NLD | 64.28 | 64.05 | 63.76 | 63.47 |
| ACO + Mel | 66.30 | 67.21 | 66.58 | 65.34 |
| NLD + Mel | 65.73 | 65.19 | 66.33 | 65.28 |
| All | 67.85 | 67.16 | 66.87 | 67.06 |
Table 6. Results of the algorithm ablation experiment. Among them, the DDA-MSLD algorithm combined with the “All” feature set achieved the best detection results.
| Model | Features | Accuracy (%) | Precision (%) | Recall (%) | F1 Score (%) |
|---|---|---|---|---|---|
| GRU | ACO | 74.61 | 66.34 | 67.26 | 69.86 |
| GRU | ACO + NLD | 75.34 | 70.08 | 72.56 | 74.39 |
| GRU-Attention | ACO | 76.33 | 67.51 | 68.24 | 68.49 |
| GRU-Attention | ACO + NLD | 74.32 | 72.34 | 71.86 | 72.60 |
| Transformer | Mel | 78.77 | 74.65 | 75.24 | 76.33 |
| DDA-MSLD | All | 80.28 | 80.11 | 81.64 | 81.17 |
Table 7. Algorithm comparison results on the CSC dataset. “All” represents ACO + NLD + Mel, and “*” indicates that no feature fusion was carried out.
| Algorithm | Features | Fusion Type | Accuracy (%) | Precision (%) | Recall (%) | F1 Score (%) |
|---|---|---|---|---|---|---|
| RF | ACO | * | 60.04 | 62.17 | 60.31 | 61.86 |
| DFNN | ACO | * | 63.25 | 64.28 | 63.19 | 62.88 |
| RVM | ACO | * | 67.21 | 66.31 | 66.45 | 66.21 |
| RVM | ACO + NLD | Hard Voting | 70.37 | 71.42 | 72.63 | 70.15 |
| RVM | ACO + NLD | Soft Voting | 69.56 | 70.25 | 70.98 | 70.24 |
| CovBiLSTM | ACO | * | 74.84 | 73.29 | 74.66 | 73.26 |
| HAN | ACO | * | 70.31 | 72.44 | 71.76 | 70.03 |
| HAN | ACO + NLD | Hard Voting | 74.71 | 73.28 | 75.36 | 74.39 |
| HAN | ACO + NLD | Soft Voting | 74.38 | 74.22 | 73.68 | 73.49 |
| DDA-MSLD | ACO + NLD | Hard Voting | 76.59 | 75.67 | 76.59 | 76.71 |
| DDA-MSLD | ACO + NLD | Soft Voting | 77.50 | 77.62 | 77.38 | 75.81 |
| DDA-MSLD | All | Hard Voting | 81.22 | 80.19 | 79.34 | 81.70 |
| DDA-MSLD | All | Soft Voting | 82.27 | 82.69 | 81.42 | 81.88 |
Table 8. Algorithm comparison results on the Local dataset. “All” represents ACO + NLD + Mel, and “*” indicates that no feature fusion was carried out.
| Algorithm | Features | Fusion Type | Accuracy (%) | Precision (%) | Recall (%) | F1 Score (%) |
|---|---|---|---|---|---|---|
| RF | ACO | * | 61.23 | 62.86 | 62.46 | 61.77 |
| DFNN | ACO | * | 65.49 | 64.33 | 64.27 | 64.92 |
| RVM | ACO | * | 67.52 | 65.26 | 66.33 | 66.91 |
| RVM | ACO + NLD | Hard Voting | 70.27 | 70.58 | 71.49 | 71.68 |
| RVM | ACO + NLD | Soft Voting | 70.21 | 69.59 | 71.24 | 70.81 |
| CovBiLSTM | ACO | * | 74.80 | 74.65 | 74.21 | 73.68 |
| HAN | ACO | * | 71.07 | 72.96 | 71.53 | 72.08 |
| HAN | ACO + NLD | Hard Voting | 73.36 | 73.44 | 73.49 | 75.25 |
| HAN | ACO + NLD | Soft Voting | 74.50 | 74.64 | 72.82 | 73.49 |
| DDA-MSLD | ACO + NLD | Hard Voting | 76.60 | 76.27 | 76.71 | 74.50 |
| DDA-MSLD | ACO + NLD | Soft Voting | 77.92 | 76.23 | 77.07 | 76.59 |
| DDA-MSLD | All | Hard Voting | 80.47 | 80.08 | 79.46 | 79.93 |
| DDA-MSLD | All | Soft Voting | 80.28 | 80.11 | 81.64 | 81.17 |
