3.1. Design Rationale and Comparison to Alternatives
Speech deception cues are expressed across complementary representations: high-level prosodic statistics and nonlinear dynamics (ACO/NLD) offer interpretable, domain-established descriptors, whereas Mel time–frequency patterns capture finer texture that CNNs/Transformers can exploit. A single-stream end-to-end model (e.g., a waveform CNN or a spectrogram Transformer alone) can underuse the inductive biases that have proven useful in deception and paralinguistics benchmarks, while using only handcrafted functionals discards non-local spectro-temporal regularities. We therefore adopt a dual-branch design: (i) a sequence model with gating and attention for ACO+NLD, following evidence that temporal weighting helps highlight deception-related segments [21,40]; (ii) a parallel Transformer–CNN branch for Mel features to combine local convolutional structure with long-range attention [23,24]; and (iii) cross-attention fusion instead of fixed concatenation or voting, so each modality can query the other before the final decision. Simpler fusion rules (hard voting) are included later as ablations to quantify the gain from learned interaction.
This paper designs a speech deception detection network based on multi-feature fusion, effectively combining traditional acoustic features with deep learning methods to mine acoustic, dynamic, and time-frequency discriminative information related to deception from speech signals. It contains three modules, as shown in
Figure 1. The Gated-BiLSTM-Attention module performs context-aware sequence modeling on acoustic and nonlinear dynamic features, enhancing the model’s ability to capture temporal evolution patterns in speech. The Transformer-CNN-LM hybrid module jointly mines local structures and global temporal dependencies from Mel-spectrograms, improving the discriminative power and robustness of feature representation. In the final multi-feature fusion module, a cross-attention based feature fusion mechanism is designed to achieve deep interaction of heterogeneous features, rather than simple weighted fusion, and the Softmax classifier is used to complete the classification decision between truth and deception. Before describing each module in detail, the following definitions are given:
Let $X = \{X_{aco}, X_{nld}, X_{mel}\}$ represent a multi-feature input sample, where $X_{aco}$, $X_{nld}$, and $X_{mel}$ denote the acoustic-prosodic features, nonlinear dynamic features, and Mel-spectrogram features extracted from the speech signal, respectively. The feature data sequence can be represented as $x = (x_1, x_2, \ldots, x_n)$, where $x_i \in \mathbb{R}$ and $n$ represents the feature dimension. The goal is to predict the category of the sample based on its deceptive content. The data label is $y \in \{0, 1\}$, where $y = 1$ represents a lie and $y = 0$ represents the truth. Finally, the embedded feature vectors $e_{aco}$, $e_{nld}$, and $e_{mel}$ for the three fused features are obtained, where $e_{aco}, e_{nld}, e_{mel} \in \mathbb{R}^d$. The predicted label $\hat{y}$ is obtained as follows:
$$\hat{y} = F(e_{aco}, e_{nld}, e_{mel}),$$
where $F(\cdot)$ is the fusion function and $\hat{y} \in \{0, 1\}$.
3.2. The Gated-BiLSTM-Attention Module
The design of this module is inspired by the Convolutional Bidirectional Long Short-Term Memory network (ConvBiLSTM) in the literature [40]. This paper selects emotionally representative features from the prosodic feature family, including Mel-Frequency Cepstral Coefficients (MFCCs), Bark-band energies, and pitch, among others. Additionally, the box-counting method is used to extract 15-dimensional nonlinear dynamic features from speech, which are concatenated with the prosodic features. The core function of the module is to perform deep sequential pattern parsing on the handcrafted acoustic-prosodic (ACO) and nonlinear dynamic (NLD) features, thereby extracting high-level sequential features related to deception. The module structure enhances the BiLSTM-Attention framework by introducing a secondary gating mechanism to improve the model's ability to capture key information.
Prosodic features reflect the static characteristics of the speech signal over short time intervals (20 ms) [41] and are generally analyzed using short-time analysis methods. This paper uses the openSMILE toolkit to extract the INTERSPEECH 2016 ComParE Challenge and eGeMAPSv02 feature sets at the functionals level. The resulting integrated feature set contains a large number of static features: low-level descriptors (LLDs) such as loudness, pitch, energy distribution, and spectral bands, together with statistical functionals that characterize the dynamic changes of the LLDs, covering five categories (central tendency, dispersion, distribution shape, order statistics, and regression), for a total of 6461 static features.
In the analysis of nonlinear dynamic features of speech signals, the Box-Counting Method is a classic computational method based on fractal geometry, used to quantify the complexity and irregularity of signal waveforms [
42]. This method analyzes the self-similarity characteristics of the speech signal at different scales to extract its Fractal Dimension, thereby revealing the implicit nonlinear dynamic behavior in the signal. Its core idea is to cover the signal waveform with grids of different sizes, count the minimum number of boxes required for coverage, and then estimate the fractal dimension through a power-law relationship. The specific method is as follows:
First, long speech segments are divided into short-time frames and converted into time-series waveforms; this step also requires preprocessing such as denoising and normalization to eliminate effects like baseline drift and high-frequency interference. Second, the waveform is treated as a curve in a two-dimensional (time–amplitude) plane and covered by a grid of squares with side length $\varepsilon$. Then, the box size $\varepsilon$ is gradually reduced (e.g., taking $\varepsilon, \varepsilon/2, \varepsilon/4, \ldots$ in turn), and the minimum number of boxes $N(\varepsilon)$ required to cover the curve at each scale is counted. According to fractal theory, $N(\varepsilon)$ and $\varepsilon$ satisfy a power-law relationship: $N(\varepsilon) \propto \varepsilon^{-D}$. After taking the logarithm of both sides, linear regression is used to fit the data points of $\log N(\varepsilon)$ versus $\log(1/\varepsilon)$; the slope of the resulting line is the estimated value of the fractal dimension $D$. The calculation formula is:
$$D = \lim_{\varepsilon \to 0} \frac{\log N(\varepsilon)}{\log(1/\varepsilon)}.$$
The fractal dimension $D$ obtained by the box-counting method directly reflects the nonlinear dynamic characteristics of the speech signal. A higher $D$ value indicates a more complex and irregular signal waveform, possibly corresponding to nonlinear phenomena such as turbulence and vortices in speech production. The fractal dimension describes the complexity and self-similarity of the system's trajectory in phase space and can capture the chaotic characteristics caused by physical mechanisms such as glottal airflow and vocal-tract wall vibration during speech production, effectively complementing linear features (such as formants). Finally, this paper extracts 15 nonlinear features, including fractal features, the Lyapunov exponent, and Kolmogorov entropy (K-entropy). Compared with the emotional cues captured by prosodic features, NLD features focus more on cues related to cognition, memory, and strategic communication, enhancing the system's ability to model nonlinear dynamic behavior.
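To make the procedure concrete, the following is a minimal, illustrative NumPy sketch of the box-counting estimate (the grid scales, unit-square normalization, and function name are our own assumptions, not the paper's implementation):

```python
import numpy as np

def box_count_dimension(signal, scales=(2, 4, 8, 16, 32, 64)):
    """Estimate the box-counting fractal dimension of a 1-D waveform.

    The (time, amplitude) curve is overlaid with square grids of
    decreasing box size eps; the slope of log N(eps) vs log(1/eps) is D.
    """
    # Normalize time and amplitude into the unit square.
    n = len(signal)
    t = np.linspace(0.0, 1.0, n)
    a = (signal - signal.min()) / (np.ptp(signal) + 1e-12)

    log_inv_eps, log_n = [], []
    for k in scales:                      # grid of k x k boxes, eps = 1/k
        eps = 1.0 / k
        # Distinct boxes occupied by the samples of the curve.
        boxes = set(zip((t // eps).astype(int), (a // eps).astype(int)))
        log_inv_eps.append(np.log(1.0 / eps))
        log_n.append(np.log(len(boxes)))

    # Least-squares slope of log N(eps) against log(1/eps).
    slope, _ = np.polyfit(log_inv_eps, log_n, 1)
    return slope

# A smooth line should have dimension close to 1, while white noise
# fills the plane more densely, pushing the estimate toward 2.
d_line = box_count_dimension(np.linspace(0.0, 1.0, 4096))
d_noise = box_count_dimension(np.random.default_rng(0).normal(size=4096))
```

Counting occupied boxes over sample points (rather than the continuous curve) is a common discrete approximation; it slightly underestimates $N(\varepsilon)$ at fine scales.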
The combined features obtained using the above method are dimensionality-reduced through a max-pooling layer and then input to the Gated-BiLSTM. The memory units of an LSTM are called cells; they determine which previous information and states should be retained and which should be erased, effectively preserving relevant information from much earlier time steps. Meanwhile, the cell state is protected and controlled through the three gate structures of the LSTM: the forget gate decides what information to discard from the cell state, the input gate determines what new information to store in the cell state, and the output gate decides which part of the cell state is output. The formulas are as follows:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f),$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i),$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C),$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t,$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o),$$
$$h_t = o_t \odot \tanh(C_t),$$
where $h_{t-1}$ and $x_t$ represent the output from the previous time step and the input at the current time step, $\sigma$ is the sigmoid activation function, $W$ and $b$ are the corresponding weights and biases, $\tilde{C}_t$ represents the new candidate value vector, and $o_t$ represents the output gate controlling the output of the current hidden state. To overcome the limitations of the traditional BiLSTM in regulating information flow in deep networks, this paper introduces a gating enhancement mechanism. This mechanism adds an information control gate $g_t$ on top of the standard LSTM's forget, input, and output gates to achieve fine-grained regulation of cell-state updates. The final hidden state is:
$$g_t = \sigma(W_g \cdot [h_{t-1}, x_t] + b_g), \qquad \tilde{h}_t = g_t \odot h_t,$$
where $\odot$ denotes element-wise multiplication.
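As an illustration of the gate equations above, the following NumPy sketch computes one time step of an LSTM augmented with the extra information-control gate $g_t$ (weight shapes and initialization are hypothetical; the paper's module is a trained bidirectional network in PyTorch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step with an extra information-control gate g_t
    that rescales the hidden state: h_tilde = g_t * h_t."""
    v = np.concatenate([h_prev, x_t])              # [h_{t-1}, x_t]
    f = sigmoid(p["Wf"] @ v + p["bf"])             # forget gate
    i = sigmoid(p["Wi"] @ v + p["bi"])             # input gate
    c_hat = np.tanh(p["Wc"] @ v + p["bc"])         # candidate values
    c = f * c_prev + i * c_hat                     # new cell state
    o = sigmoid(p["Wo"] @ v + p["bo"])             # output gate
    h = o * np.tanh(c)                             # standard hidden state
    g = sigmoid(p["Wg"] @ v + p["bg"])             # information-control gate
    return g * h, c                                # gated hidden state, cell

rng = np.random.default_rng(0)
nx, nh = 4, 8                                      # illustrative dimensions
p = {f"W{k}": rng.normal(scale=0.1, size=(nh, nh + nx)) for k in "ficog"}
p.update({f"b{k}": np.zeros(nh) for k in "ficog"})
h, c = gated_lstm_step(rng.normal(size=nx), np.zeros(nh), np.zeros(nh), p)
```

A bidirectional variant runs this recurrence forward and backward over the sequence and concatenates the two gated hidden states at each time step.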
Given that the primary task of this module is to evaluate and focus on the most discriminative segments from the temporal sequences of acoustic-prosodic and nonlinear dynamic (NLD) features—rather than to model long-range contextual dependencies—a context-based additive attention mechanism is judiciously employed for the weighted fusion of the Gated-BiLSTM outputs. This choice is motivated by the mechanism’s computational efficiency and directness in highlighting critical information.
First, the attention score for each time step is calculated. A learnable context vector interacts with a non-linear projection of the hidden state to produce an unnormalized score:
$$e_t = u^{\top} \tanh(W h_t + b),$$
where $u$ is a learnable weight (context) vector, $W$ is a weight matrix, and $b$ is a bias term. The attention scores are then normalized across all time steps using the Softmax function to obtain the attention weights $\alpha_t$:
$$\alpha_t = \frac{\exp(e_t)}{\sum_{k=1}^{T} \exp(e_k)}.$$
Subsequently, a fixed-dimensional context vector $r$ is obtained by performing a weighted sum of the hidden states of all time steps:
$$r = \sum_{t=1}^{T} \alpha_t h_t.$$
Finally, this module outputs the weighted feature vector $r$. This vector not only integrates the temporal dynamic information from the prosodic feature sequence and the NLD feature sequence but also, due to the gating enhancement and the attention mechanism, accentuates the subtle variations in key segments of deceptive speech. It thereby forms a higher-level feature representation with stronger discriminative power, providing a solid foundation for subsequent multi-feature fusion and classification.
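The three attention equations above can be sketched as follows (a minimal NumPy version with randomly initialized, untrained parameters; shapes are illustrative):

```python
import numpy as np

def additive_attention_pool(H, W, b, u):
    """Context-based additive attention over BiLSTM outputs.

    H : (T, d) hidden states.  Scores e_t = u . tanh(W h_t + b),
    weights alpha = softmax(e), output r = sum_t alpha_t h_t.
    """
    e = np.tanh(H @ W.T + b) @ u            # (T,) unnormalized scores
    e = e - e.max()                         # numerical stability
    alpha = np.exp(e) / np.exp(e).sum()     # softmax over time steps
    r = alpha @ H                           # (d,) weighted context vector
    return r, alpha

rng = np.random.default_rng(0)
T, d = 5, 6                                 # illustrative sequence length, width
H = rng.normal(size=(T, d))
r, alpha = additive_attention_pool(
    H, rng.normal(size=(d, d)), np.zeros(d), rng.normal(size=d))
```

Because the attention weights are a proper distribution over time steps, the pooled vector $r$ has a fixed dimension regardless of the utterance length.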
The entire module combines the fine-grained modeling capability of the Gated-BiLSTM for long-range dependencies with the attention mechanism's focus on key information, effectively enhancing the discriminability and robustness of the feature representation. The process of this module is shown in
Figure 2.
3.3. Transformer-CNN-LM Module
For a given signal $x(t)$, we typically seek to understand it from both time- and frequency-domain perspectives. In the feature extraction process of speech signals, the Mel-spectrogram is a key time–frequency representation. Its calculation process is shown in
Figure 3. First, pre-emphasis is applied to the original speech signal to compensate for the attenuation of high-frequency components during transmission, thereby boosting high-frequency energy and making the overall spectrum flatter. Subsequently, the pre-emphasized signal is segmented into consecutive short-time frames, and a window function (e.g., Hamming window) is applied to each frame to reduce spectral leakage caused by signal truncation. Next, the Fast Fourier Transform (FFT) is performed on each framed, windowed signal, converting it from the time domain to the frequency domain. This process is also known as the Short-Time Fourier Transform (STFT), mathematically expressed as:
$$X(t, f) = \int_{-\infty}^{+\infty} x(\tau)\, w(\tau - t)\, e^{-j 2\pi f \tau}\, d\tau,$$
where $x(\tau)$ is the original signal, $w(\cdot)$ is the window function, $t$ represents time, and $f$ represents frequency. The STFT converts the signal from the time domain to the frequency domain, revealing the energy distribution of the signal across different frequency components and thus characterizing its oscillation properties, which is crucial for subsequent analysis. To simulate the nonlinear frequency perception of the human auditory system, the linear spectrum obtained from the STFT is passed through a Mel-scale triangular filter bank. This filter bank is equally spaced on the Mel frequency scale, with higher resolution in low-frequency regions and lower resolution in high-frequency regions, making it more consistent with the human auditory mechanism. Applying a logarithmic operation to the output energy of the filter bank compresses the dynamic range, further approximating human perception of sound intensity, and ultimately yields the Mel-spectrogram. In the experiments of this paper, the librosa library was used to extract Mel-spectrograms with a sampling rate of 48 kHz, a Hamming window length of 512, and an FFT size of 1024. To enhance model robustness, additive white Gaussian noise (AWGN) was introduced during the training phase to augment the original dataset, increasing data diversity and thereby improving accuracy and generalization in complex environments.
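For illustration, the STFT-to-Mel pipeline described above can be sketched in plain NumPy as follows (the paper uses librosa; the hop size, number of Mel bands, and HTK-style Mel formula here are our assumptions, so numerical outputs will differ from librosa's defaults):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spectrogram(x, sr=48000, n_fft=1024, win_length=512, hop=256, n_mels=40):
    """Minimal framing -> windowed FFT -> Mel filter bank -> log pipeline."""
    window = np.hamming(win_length)
    frames = []
    for start in range(0, len(x) - win_length + 1, hop):
        frame = x[start:start + win_length] * window            # window each frame
        frames.append(np.abs(np.fft.rfft(frame, n_fft)) ** 2)   # power spectrum
    power = np.array(frames).T                                  # (n_fft//2+1, T)

    # Triangular filters equally spaced on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, cen, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, cen):                  # rising edge of triangle
            fb[m - 1, k] = (k - lo) / max(cen - lo, 1)
        for k in range(cen, hi):                  # falling edge of triangle
            fb[m - 1, k] = (hi - k) / max(hi - cen, 1)

    return np.log(fb @ power + 1e-10)             # log-Mel energies

t = np.arange(48000) / 48000.0
mel = mel_spectrogram(np.sin(2 * np.pi * 440.0 * t))  # 1 s of a 440 Hz tone
```

Pre-emphasis is omitted here for brevity; in the paper's pipeline it is applied to the raw waveform before framing.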
Figure 4 shows a comparison of the linear spectrograms and Mel-spectrograms of truthful and deceptive speech, where the horizontal axis represents time, the vertical axis represents frequency, and the color depth represents energy intensity. By comparison, it can be observed that the spectrograms of speech deception often exhibit stronger energy distribution in high-frequency regions, which is a distinguishable difference from truthful speech.
To achieve high-precision mining of deception-related features from Mel spectrograms, this paper designs a parallel hybrid model that possesses both local feature extraction and global temporal modeling capabilities, and introduces an advanced second-order optimization algorithm to enhance its nonlinear fitting performance. The overall structure of the model is shown in
Figure 5, mainly comprising three core parts: the Convolutional Neural Network module, the Transformer encoder module, and the optimization strategy based on the Levenberg-Marquardt algorithm.
First, the input Mel-spectrogram is fed in parallel to the CNN module and the Transformer module. In the CNN branch, the data flows through four consecutive convolutional blocks. Each convolutional block performs a convolution operation:
$$y = W * x + b.$$
This is followed by batch normalization and the ReLU activation function to introduce non-linearity, and the spatial dimensions are compressed by a max-pooling operation, ultimately resulting in a flattened convolutional embedding $e_{cnn}$.
In the Transformer branch, the spectrogram first undergoes max-pooling for dimensionality compression and adjustment, yielding a serialized feature representation, which is then fed into a multi-layer encoder. The encoder utilizes a multi-head self-attention mechanism:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
to dynamically compute global temporal dependencies, and applies a deep nonlinear transformation through a feedforward network:
$$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2,$$
supplemented by layer normalization and residual connections to ensure training stability. Finally, a 64-dimensional Transformer embedding $e_{trans}$ is obtained by average pooling over the time dimension. At this point, the model has obtained $e_{cnn}$ and $e_{trans}$, representing local spatial patterns and long-term contextual relationships, respectively.
Subsequently, the feature fusion layer concatenates $e_{cnn}$ and $e_{trans}$ to form a unified deep feature fed into the subsequent fully connected layers for the classification decision. To optimize the parameters of this critical step, this paper introduces the LM algorithm, an efficient optimization method specifically designed for nonlinear least-squares problems. By adaptively adjusting the damping factor, it combines the fast convergence of the Gauss–Newton method with the global stability of gradient descent. Considering the large computational cost of the LM algorithm due to the need to compute the Jacobian matrix, this paper adopts a two-stage training strategy. First, the Adam optimizer is used to pre-train the entire Transformer-CNN feature extraction backbone to obtain stable initial parameters. Subsequently, during the optimization phase, the backbone network parameters are frozen, and the LM algorithm is applied only to the top-level classifier. The core is to define the residual $r_i$:
$$r_i = y_i - \hat{y}_i,$$
and update the parameters by solving the linear system:
$$\left(J^{\top}J + \mu I\right)\Delta W = -J^{\top} r,$$
where $J$ is the Jacobian matrix of the residuals with respect to the top-layer parameters $W$, and $\mu$ is the damping factor. This algorithm adaptively switches between gradient descent and the Gauss–Newton method, achieving superlinear convergence near the optimal solution, thereby accurately learning the complex mapping from the fused features to the final deception judgment and effectively improving the model's convergence speed and generalization performance.
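A minimal NumPy illustration of the damped LM update on a toy problem follows (a logistic head fitted to binary targets; the fixed damping factor, toy data, and iteration count are our assumptions, whereas the paper adapts the damping factor during training):

```python
import numpy as np

def lm_step(W, X, y, mu):
    """One Levenberg-Marquardt update for a logistic head y_hat = sigmoid(X @ W).

    Residuals r_i = y_i - y_hat_i; solve (J^T J + mu I) dW = -J^T r,
    where J is the Jacobian of the residuals w.r.t. W.
    """
    y_hat = 1.0 / (1.0 + np.exp(-(X @ W)))
    r = y - y_hat
    # d r_i / d W = -y_hat_i (1 - y_hat_i) x_i
    J = -(y_hat * (1.0 - y_hat))[:, None] * X
    A = J.T @ J + mu * np.eye(len(W))
    dW = np.linalg.solve(A, -J.T @ r)
    return W + dW, float(np.mean(r ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # toy features
w_true = rng.normal(size=5)
y = (X @ w_true > 0).astype(float)            # separable binary targets

W, mu = np.zeros(5), 1e-2
for _ in range(30):                           # iterate the damped update
    W, mse = lm_step(W, X, y, mu)
```

Large $\mu$ makes the step behave like damped gradient descent; small $\mu$ recovers the Gauss–Newton step, which is the mechanism behind the adaptive switching described above.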
During the LM refinement stage only (the Mel-branch top fully connected layers, with the CNN/Transformer backbone frozen), we treat the binary targets as continuous scores in $[0, 1]$ and minimize a mean squared error (MSE) objective so that the Levenberg–Marquardt update matches its standard nonlinear least-squares formulation:
$$L = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2,$$
where $N$ is the batch size, $y_i$ denotes the true label of the $i$-th sample, and $\hat{y}_i$ represents the corresponding predicted value before thresholding. The LM algorithm iteratively minimizes this loss to refine the Mel-branch head parameters $W$.
Implementation (LM head in PyTorch). The Levenberg–Marquardt refinement is implemented in PyTorch 2.1.2. After freezing all CNN and Transformer parameters, we collect residuals from the last one or two fully connected (nn.Linear) layers of the Mel branch (the head that maps fused CNN/Transformer embeddings to scalar scores in $[0, 1]$). We use PyTorch's automatic differentiation to form the Jacobian–vector products needed for each LM step (equivalently, a damped Gauss–Newton update on those head weights only), iterating until convergence criteria on the parameter update $\Delta W$ or on the MSE over the current mini-batch are met. The backbone remains non-trainable during this stage; no LM update is applied to the cross-attention fusion or the ACO/NLD branch.
Clarification (training objective for the full model). The Transformer–CNN Mel backbone is first pre-trained with Adam as in
Table 1, then frozen. The Gated-BiLSTM–Attention branch, the cross-attention fusion module, and the final softmax classifier are trained with stochastic gradient descent (SGD) and cross-entropy (CE) on the two-way softmax output. Thus, CE governs the main end-to-end classification path, whereas MSE appears only in the LM polishing step on the Mel-branch head described above—not as the global loss for MFF-Net.
In summary, this module ensures the comprehensiveness of feature extraction through the parallel structure of Transformer and CNN, and achieves efficient and precise optimization of model parameters by applying the LM algorithm at the key decision layer, jointly providing strong technical support for the deception detection task. The high-level representation of the Mel-spectrogram features output by this module will be fused with the acoustic features via cross-attention.
3.4. Multi-Feature Fusion Module Based on Cross-Attention
Acoustic-prosodic features (ACO) primarily characterize the prosody, voice quality, and spectral statistics of speech. Nonlinear dynamic features (NLD) reflect the chaotic dynamic behavior of speech production. Mel-spectrogram features provide global pattern information from a time-frequency perspective. These three types of features have significant differences in representation granularity, physical meaning, and data structure, constituting a typical heterogeneous feature fusion problem. Traditional multi-feature fusion in speech deception detection often relies on simple weighted summation or feature concatenation. Such methods can introduce a large amount of redundant information and struggle to fully explore the nonlinear correlations and deep complementarity between heterogeneous features. Targeting the characteristic that these three types of features, though from the same source, have distinct representational perspectives, this paper designs a deep multi-feature fusion module based on a cross-attention mechanism. This mechanism allows each type of feature to actively query relevant information in the feature space of the others, rather than being passively fused. For example, when acoustic features detect abnormal fluctuations in fundamental frequency, they can automatically enhance the attention to corresponding frequency band features in the Mel-spectrogram, achieving synergistic enhancement between features. This mechanism enables fine-grained feature fusion through bidirectional interaction and adaptive weight allocation between features.
Specifically, the high-level representation of the acoustic and nonlinear dynamic features $F_a$ output by the Gated-BiLSTM-Attention module and the high-level representation of the Mel-spectrogram features $F_m$ extracted by the Transformer-CNN-LM module are taken as inputs. First, they are mapped to a unified hidden dimension $d$ through independent linear projection layers:
$$H_a = F_a W_a, \qquad H_m = F_m W_m,$$
where $W_a$ and $W_m$ are projection weights, ensuring feature dimensionality consistency and laying the foundation for the subsequent attention calculation. For the attention flow from the acoustic features to the Mel-spectrogram:
$$Q_a = H_a W_Q, \qquad K_m = H_m W_K, \qquad V_m = H_m W_V,$$
where $W_Q$, $W_K$, and $W_V$ are learnable parameter matrices. Then, the bidirectional cross-attention outputs are calculated:
$$A_{a \to m} = \mathrm{Softmax}\!\left(\frac{Q_a K_m^{\top}}{\sqrt{d}} + B_{am}\right)V_m, \qquad A_{m \to a} = \mathrm{Softmax}\!\left(\frac{Q_m K_a^{\top}}{\sqrt{d}} + B_{ma}\right)V_a,$$
where $B_{am}$ and $B_{ma}$ are optional attention bias matrices for introducing prior knowledge. $A_{a \to m}$ represents the context-enhanced representation obtained when the acoustic features attend to the Mel-spectrogram features, and $A_{m \to a}$ is the representation of the Mel-spectrogram features enhanced by the acoustic features. This design enables the model to dynamically capture cross-feature correlations such as "abnormal fundamental frequency accompanied by formant shifts". To fully utilize the attention outputs, residual connections and a gating mechanism are employed:
$$Z_a = H_a + \sigma(\lambda) \odot A_{a \to m},$$
where $\lambda$ is a learnable parameter; $Z_m$ is obtained similarly. Finally, a unified representation is generated through a hierarchical fusion strategy:
$$Z = \mathrm{FFN}\left([Z_a \,\|\, Z_m]\right),$$
where $\|$ denotes the concatenation operation and FFN is a two-layer feedforward network. The fused feature vector $Z$ is then input to the Softmax classifier, and the mapping from the feature space to the category space is completed through
$$\hat{y} = \mathrm{Softmax}(W_c Z + b_c),$$
ultimately achieving binary classification of lies and truth. This fusion mechanism effectively addresses the representation-gap problem of heterogeneous features, providing a unified feature representation with stronger discriminative power for deception detection.
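The fusion path can be sketched end-to-end in NumPy as follows (a single pooled embedding per branch, untrained random parameters, and illustrative dimensions; the optional bias matrices $B$ are omitted, and the FFN is collapsed into one classifier layer for brevity):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(Hq, Hk, Wq, Wk, Wv):
    """One direction of cross-attention: Hq queries Hk."""
    Q, K, V = Hq @ Wq, Hk @ Wk, Hk @ Wv
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
d = 16                                     # shared hidden dimension
Fa = rng.normal(size=(1, 24))              # ACO/NLD branch embedding
Fm = rng.normal(size=(1, 64))              # Mel branch embedding

# Project both branches to the shared hidden dimension d.
Ha = Fa @ rng.normal(scale=0.1, size=(24, d))
Hm = Fm @ rng.normal(scale=0.1, size=(64, d))

Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
lam = 0.0                                  # learnable gate parameter (init)
g = 1.0 / (1.0 + np.exp(-lam))             # sigmoid gate

# Bidirectional cross-attention with gated residual connections.
Za = Ha + g * cross_attend(Ha, Hm, Wq, Wk, Wv)
Zm = Hm + g * cross_attend(Hm, Ha, Wq, Wk, Wv)

# Hierarchical fusion: concatenate and classify with Softmax.
Z = np.concatenate([Za, Zm], axis=-1)      # (1, 2d)
logits = Z @ rng.normal(scale=0.1, size=(2 * d, 2))
probs = softmax(logits)                    # truth vs. deception probabilities
```

With length-1 sequences the attention weights are trivially 1, so this sketch mainly illustrates the projection, gated-residual, and concatenation shapes; with frame-level sequences the same code performs genuine cross-modal weighting.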