3.1. Design Rationale and Comparison to Alternatives
Speech deception cues are expressed across complementary representations: high-level prosodic statistics and nonlinear dynamics (ACO/NLD) offer interpretable, domain-established descriptors, whereas Mel time–frequency patterns capture finer texture that CNNs/Transformers can exploit. A single-stream end-to-end model (e.g., a waveform CNN or a spectrogram Transformer alone) can underuse the inductive biases that have proven useful in deception and paralinguistics benchmarks, while using only handcrafted functionals discards non-local spectro-temporal regularities. We therefore adopt a dual-branch design: (i) a sequence model with gating and attention for ACO+NLD, following evidence that temporal weighting helps highlight deception-related segments [21,40]; (ii) a parallel Transformer–CNN branch for Mel features to combine local convolutional structure with long-range attention [23,24]; and (iii) cross-attention fusion instead of fixed concatenation or voting, so each modality can query the other before the final decision. Simpler fusion rules (hard voting) are included later as ablations to quantify the gain from learned interaction.
This paper designs a speech deception detection network based on multi-feature fusion, effectively combining traditional acoustic features with deep learning methods to mine acoustic, dynamic, and time-frequency discriminative information related to deception from speech signals. It contains three modules, as shown in
Figure 1. The Gated-BiLSTM-Attention module performs context-aware sequence modeling on acoustic and nonlinear dynamic features, enhancing the model’s ability to capture temporal evolution patterns in speech. The Transformer-CNN-LM hybrid module jointly mines local structures and global temporal dependencies from Mel-spectrograms, improving the discriminative power and robustness of feature representation. In the final multi-feature fusion module, a cross-attention based feature fusion mechanism is designed to achieve deep interaction of heterogeneous features, rather than simple weighted fusion, and the Softmax classifier is used to complete the classification decision between truth and deception. Before describing each module in detail, the following definitions are given:
Let $X = \{X_{aco}, X_{nld}, X_{mel}\}$ represent a multi-feature input sample, where $X_{aco}$, $X_{nld}$, and $X_{mel}$ denote the acoustic-prosodic features, nonlinear dynamic features, and Mel-spectrogram features extracted from the speech signal, respectively. The feature data sequence can be represented as $x = (x_1, x_2, \ldots, x_n)$, where $x_i \in \mathbb{R}$ and $n$ represents the feature dimension. The goal is to predict the category of the sample based on its deceptive content. The data label is $y \in \{0, 1\}$, where $y = 1$ represents a lie and $y = 0$ represents the truth. Finally, the embedded feature vectors $e_{aco}$, $e_{nld}$, and $e_{mel}$ for the three fused features are obtained, where $e_{aco}, e_{nld}, e_{mel} \in \mathbb{R}^d$. The predicted label $\hat{y}$ is obtained as follows:
$$\hat{y} = F(e_{aco}, e_{nld}, e_{mel}),$$
where $F(\cdot)$ is the fusion function and $\hat{y} \in \{0, 1\}$.
3.2. The Gated-BiLSTM-Attention Module
The design of this module is inspired by the Convolutional Bidirectional Long Short-Term Memory network (ConvBiLSTM) in the literature [40]. This paper selects emotionally representative features from the prosodic feature family, including Mel-Frequency Cepstral Coefficients (MFCCs), Bark-band energies, and pitch, among others. Additionally, the box-counting method is used to extract 15-dimensional nonlinear dynamic features from speech, which are concatenated with the prosodic features. The core function of the module is to perform deep sequential pattern parsing on the handcrafted acoustic-prosodic (ACO) and nonlinear dynamic (NLD) features, thereby extracting high-level sequential features related to deception. The module structure enhances the BiLSTM-Attention framework by introducing a secondary gating mechanism to improve the model's ability to capture key information.
Prosodic features reflect the static characteristics of the speech signal over short time intervals (20 ms) [41] and are generally analyzed using short-time analysis methods. This paper uses the openSMILE toolkit to extract the INTERSPEECH 2016 ComParE Challenge and eGeMAPSv02 feature sets at the functionals level. The resulting integrated feature set contains a large number of static features: low-level descriptors (LLDs) such as loudness, pitch, energy distribution, and spectral bands, together with statistical functionals that characterize the dynamic changes of the LLDs, covering five categories (central tendency, dispersion, distribution shape, order statistics, and regression), for a total of 6461 static features.
In the analysis of nonlinear dynamic features of speech signals, the Box-Counting Method is a classic computational method based on fractal geometry, used to quantify the complexity and irregularity of signal waveforms [
42]. This method analyzes the self-similarity characteristics of the speech signal at different scales to extract its Fractal Dimension, thereby revealing the implicit nonlinear dynamic behavior in the signal. Its core idea is to cover the signal waveform with grids of different sizes, count the minimum number of boxes required for coverage, and then estimate the fractal dimension through a power-law relationship. The specific method is as follows:
First, long speech segments are divided into short-time frames and converted into time-series waveforms; this step also requires preprocessing such as denoising and normalization to eliminate effects like baseline drift and high-frequency interference. Second, the waveform is treated as a curve in a two-dimensional (time–amplitude) plane and covered by a grid of squares with side length $\varepsilon$. Then, the box size $\varepsilon$ is gradually reduced (e.g., taking $\varepsilon, \varepsilon/2, \varepsilon/4, \ldots$ in turn), and the minimum number of boxes $N(\varepsilon)$ required to cover the curve at each scale is counted. According to fractal theory, $N(\varepsilon)$ and $\varepsilon$ satisfy a power-law relationship: $N(\varepsilon) \propto \varepsilon^{-D}$. After taking the logarithm of both sides, linear regression is used to fit the data points of $\log N(\varepsilon)$ versus $\log(1/\varepsilon)$; the slope of the resulting line is the estimated value of the fractal dimension $D$. The calculation formula is:
$$D = \lim_{\varepsilon \to 0} \frac{\log N(\varepsilon)}{\log(1/\varepsilon)}.$$
The fractal dimension $D$ obtained by the box-counting method directly reflects the nonlinear dynamic characteristics of the speech signal. A higher $D$ value indicates a more complex and irregular signal waveform, possibly corresponding to nonlinear phenomena such as turbulence and vortices in speech production. The fractal dimension describes the complexity and self-similarity of the system's trajectory in phase space and can capture the chaotic characteristics caused by physical mechanisms such as glottal airflow and vocal-tract wall vibration during speech production, effectively complementing linear features (such as formants). Finally, this paper extracts 15 nonlinear features, including fractal features, the Lyapunov exponent, and Kolmogorov entropy (K-entropy). Compared with the emotional cues captured by prosodic features, NLD features focus more on cues related to cognition, memory, and strategic communication, enhancing the system's ability to model nonlinear dynamic behavior.
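To make the procedure concrete, the following is a minimal, illustrative NumPy sketch of the box-counting estimate (the grid scales, unit-square normalization, and function name are our own assumptions, not the paper's implementation):

```python
import numpy as np

def box_count_dimension(signal, scales=(2, 4, 8, 16, 32, 64)):
    """Estimate the box-counting fractal dimension of a 1-D waveform.

    The (time, amplitude) curve is overlaid with square grids of
    decreasing box size eps; the slope of log N(eps) vs log(1/eps) is D.
    """
    # Normalize time and amplitude into the unit square.
    n = len(signal)
    t = np.linspace(0.0, 1.0, n)
    a = (signal - signal.min()) / (np.ptp(signal) + 1e-12)

    log_inv_eps, log_n = [], []
    for k in scales:                      # grid of k x k boxes, eps = 1/k
        eps = 1.0 / k
        # Distinct boxes occupied by the samples of the curve.
        boxes = set(zip((t // eps).astype(int), (a // eps).astype(int)))
        log_inv_eps.append(np.log(1.0 / eps))
        log_n.append(np.log(len(boxes)))

    # Least-squares slope of log N(eps) against log(1/eps).
    slope, _ = np.polyfit(log_inv_eps, log_n, 1)
    return slope

# A smooth line should have dimension close to 1, while white noise
# fills the plane more densely, pushing the estimate toward 2.
d_line = box_count_dimension(np.linspace(0.0, 1.0, 4096))
d_noise = box_count_dimension(np.random.default_rng(0).normal(size=4096))
```

Counting occupied boxes over sample points (rather than the continuous curve) is a common discrete approximation; it slightly underestimates $N(\varepsilon)$ at fine scales.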
The combined features obtained using the above method are dimensionality-reduced through a max-pooling layer and then input to the Gated-BiLSTM. The memory units of an LSTM are called cells; they determine which previous information and states should be retained and which should be erased, effectively preserving relevant information from much earlier time steps. Meanwhile, the cell state is protected and controlled through the three gate structures of the LSTM: the forget gate decides what information to discard from the cell state, the input gate determines what new information to store in the cell state, and the output gate decides which part of the cell state is output. The formulas are as follows:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f),$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i),$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C),$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t,$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o),$$
$$h_t = o_t \odot \tanh(C_t),$$
where $h_{t-1}$ and $x_t$ represent the output from the previous time step and the input at the current time step, $\sigma$ is the sigmoid activation function, $W$ and $b$ are the corresponding weights and biases, $\tilde{C}_t$ represents the new candidate value vector, and $o_t$ represents the output gate controlling the output of the current hidden state. To overcome the limitations of the traditional BiLSTM in regulating information flow in deep networks, this paper introduces a gating enhancement mechanism. This mechanism adds an information control gate $g_t$ on top of the standard LSTM's forget, input, and output gates to achieve fine-grained regulation of cell-state updates. The final hidden state is:
$$g_t = \sigma(W_g \cdot [h_{t-1}, x_t] + b_g), \qquad \tilde{h}_t = g_t \odot h_t,$$
where $\odot$ denotes element-wise multiplication.
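As an illustration of the gate equations above, the following NumPy sketch computes one time step of an LSTM augmented with the extra information-control gate $g_t$ (weight shapes and initialization are hypothetical; the paper's module is a trained bidirectional network in PyTorch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step with an extra information-control gate g_t
    that rescales the hidden state: h_tilde = g_t * h_t."""
    v = np.concatenate([h_prev, x_t])              # [h_{t-1}, x_t]
    f = sigmoid(p["Wf"] @ v + p["bf"])             # forget gate
    i = sigmoid(p["Wi"] @ v + p["bi"])             # input gate
    c_hat = np.tanh(p["Wc"] @ v + p["bc"])         # candidate values
    c = f * c_prev + i * c_hat                     # new cell state
    o = sigmoid(p["Wo"] @ v + p["bo"])             # output gate
    h = o * np.tanh(c)                             # standard hidden state
    g = sigmoid(p["Wg"] @ v + p["bg"])             # information-control gate
    return g * h, c                                # gated hidden state, cell

rng = np.random.default_rng(0)
nx, nh = 4, 8                                      # illustrative dimensions
p = {f"W{k}": rng.normal(scale=0.1, size=(nh, nh + nx)) for k in "ficog"}
p.update({f"b{k}": np.zeros(nh) for k in "ficog"})
h, c = gated_lstm_step(rng.normal(size=nx), np.zeros(nh), np.zeros(nh), p)
```

A bidirectional variant runs this recurrence forward and backward over the sequence and concatenates the two gated hidden states at each time step.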
Given that the primary task of this module is to evaluate and focus on the most discriminative segments from the temporal sequences of acoustic-prosodic and nonlinear dynamic (NLD) features—rather than to model long-range contextual dependencies—a context-based additive attention mechanism is judiciously employed for the weighted fusion of the Gated-BiLSTM outputs. This choice is motivated by the mechanism’s computational efficiency and directness in highlighting critical information.
First, the attention score for each time step is calculated. A learnable context vector interacts with a non-linear projection of the hidden state to produce an unnormalized score:
$$e_t = u^{\top} \tanh(W h_t + b),$$
where $u$ is a learnable weight (context) vector, $W$ is a weight matrix, and $b$ is a bias term. The attention scores are then normalized across all time steps using the Softmax function to obtain the attention weights $\alpha_t$:
$$\alpha_t = \frac{\exp(e_t)}{\sum_{k=1}^{T} \exp(e_k)}.$$
Subsequently, a fixed-dimensional context vector $r$ is obtained by performing a weighted sum of the hidden states of all time steps:
$$r = \sum_{t=1}^{T} \alpha_t h_t.$$
Finally, this module outputs the weighted feature vector $r$. This vector not only integrates the temporal dynamic information from the prosodic feature sequence and the NLD feature sequence but also, due to the gating enhancement and the attention mechanism, accentuates the subtle variations in key segments of deceptive speech. It thereby forms a higher-level feature representation with stronger discriminative power, providing a solid foundation for subsequent multi-feature fusion and classification.
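The three attention equations above can be sketched as follows (a minimal NumPy version with randomly initialized, untrained parameters; shapes are illustrative):

```python
import numpy as np

def additive_attention_pool(H, W, b, u):
    """Context-based additive attention over BiLSTM outputs.

    H : (T, d) hidden states.  Scores e_t = u . tanh(W h_t + b),
    weights alpha = softmax(e), output r = sum_t alpha_t h_t.
    """
    e = np.tanh(H @ W.T + b) @ u            # (T,) unnormalized scores
    e = e - e.max()                         # numerical stability
    alpha = np.exp(e) / np.exp(e).sum()     # softmax over time steps
    r = alpha @ H                           # (d,) weighted context vector
    return r, alpha

rng = np.random.default_rng(0)
T, d = 5, 6                                 # illustrative sequence length, width
H = rng.normal(size=(T, d))
r, alpha = additive_attention_pool(
    H, rng.normal(size=(d, d)), np.zeros(d), rng.normal(size=d))
```

Because the attention weights are a proper distribution over time steps, the pooled vector $r$ has a fixed dimension regardless of the utterance length.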
The entire module combines the fine-grained modeling capability of the Gated-BiLSTM for long-range dependencies with the attention mechanism's focus on key information, effectively enhancing the discriminability and robustness of the feature representation. The process of this module is shown in
Figure 2.
3.3. Transformer-CNN-LM Module
For a given signal $x(t)$, we typically seek to understand it from both time- and frequency-domain perspectives. In the feature extraction process of speech signals, the Mel-spectrogram is a key time–frequency representation. Its calculation process is shown in
Figure 3. First, pre-emphasis is applied to the original speech signal to compensate for the attenuation of high-frequency components during transmission, thereby boosting high-frequency energy and making the overall spectrum flatter. Subsequently, the pre-emphasized signal is segmented into consecutive short-time frames, and a window function (e.g., Hamming window) is applied to each frame to reduce spectral leakage caused by signal truncation. Next, the Fast Fourier Transform (FFT) is performed on each framed, windowed signal, converting it from the time domain to the frequency domain. This process is also known as the Short-Time Fourier Transform (STFT), mathematically expressed as:
$$X(t, f) = \int_{-\infty}^{+\infty} x(\tau)\, w(\tau - t)\, e^{-j 2\pi f \tau}\, d\tau,$$
where $x(\tau)$ is the original signal, $w(\cdot)$ is the window function, $t$ represents time, and $f$ represents frequency. The STFT converts the signal from the time domain to the frequency domain, revealing the energy distribution of the signal across different frequency components and thus characterizing its oscillation properties, which is crucial for subsequent analysis. To simulate the nonlinear frequency perception of the human auditory system, the linear spectrum obtained from the STFT is passed through a Mel-scale triangular filter bank. This filter bank is equally spaced on the Mel frequency scale, with higher resolution in low-frequency regions and lower resolution in high-frequency regions, making it more consistent with the human auditory mechanism. Applying a logarithmic operation to the output energy of the filter bank compresses the dynamic range, further approximating human perception of sound intensity, and ultimately yields the Mel-spectrogram. In the experiments of this paper, the librosa library was used to extract Mel-spectrograms with a sampling rate of 48 kHz, a Hamming window length of 512, and an FFT size of 1024. To enhance model robustness, additive white Gaussian noise (AWGN) was introduced during the training phase to augment the original dataset, increasing data diversity and thereby improving accuracy and generalization in complex environments.
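For illustration, the STFT-to-Mel pipeline described above can be sketched in plain NumPy as follows (the paper uses librosa; the hop size, number of Mel bands, and HTK-style Mel formula here are our assumptions, so numerical outputs will differ from librosa's defaults):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spectrogram(x, sr=48000, n_fft=1024, win_length=512, hop=256, n_mels=40):
    """Minimal framing -> windowed FFT -> Mel filter bank -> log pipeline."""
    window = np.hamming(win_length)
    frames = []
    for start in range(0, len(x) - win_length + 1, hop):
        frame = x[start:start + win_length] * window            # window each frame
        frames.append(np.abs(np.fft.rfft(frame, n_fft)) ** 2)   # power spectrum
    power = np.array(frames).T                                  # (n_fft//2+1, T)

    # Triangular filters equally spaced on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, cen, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, cen):                  # rising edge of triangle
            fb[m - 1, k] = (k - lo) / max(cen - lo, 1)
        for k in range(cen, hi):                  # falling edge of triangle
            fb[m - 1, k] = (hi - k) / max(hi - cen, 1)

    return np.log(fb @ power + 1e-10)             # log-Mel energies

t = np.arange(48000) / 48000.0
mel = mel_spectrogram(np.sin(2 * np.pi * 440.0 * t))  # 1 s of a 440 Hz tone
```

Pre-emphasis is omitted here for brevity; in the paper's pipeline it is applied to the raw waveform before framing.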
Figure 4 shows a comparison of the linear spectrograms and Mel-spectrograms of truthful and deceptive speech, where the horizontal axis represents time, the vertical axis represents frequency, and the color depth represents energy intensity. By comparison, it can be observed that the spectrograms of speech deception often exhibit stronger energy distribution in high-frequency regions, which is a distinguishable difference from truthful speech.
To achieve high-precision mining of deception-related features from Mel spectrograms, this paper designs a parallel hybrid model that possesses both local feature extraction and global temporal modeling capabilities, and introduces an advanced second-order optimization algorithm to enhance its nonlinear fitting performance. The overall structure of the model is shown in
Figure 5, mainly comprising three core parts: the Convolutional Neural Network module, the Transformer encoder module, and the optimization strategy based on the Levenberg-Marquardt algorithm.
First, the input Mel-spectrogram is fed in parallel to the CNN module and the Transformer module. In the CNN branch, the data flows through four consecutive convolutional blocks. Each convolutional block performs a convolution operation:
$$y = W * x + b.$$
This is followed by batch normalization and the ReLU activation function to introduce non-linearity, and the spatial dimensions are compressed by a max-pooling operation, ultimately resulting in a flattened convolutional embedding $e_{cnn}$.
In the Transformer branch, the spectrogram first undergoes max-pooling for dimensionality compression and adjustment, yielding a serialized feature representation, which is then fed into a multi-layer encoder. The encoder utilizes a multi-head self-attention mechanism:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
to dynamically compute global temporal dependencies, and applies a deep nonlinear transformation through a feedforward network:
$$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2,$$
supplemented by layer normalization and residual connections to ensure training stability. Finally, a 64-dimensional Transformer embedding $e_{trans}$ is obtained by average pooling over the time dimension. At this point, the model has obtained $e_{cnn}$ and $e_{trans}$, representing local spatial patterns and long-term contextual relationships, respectively.
Subsequently, the feature fusion layer concatenates $e_{cnn}$ and $e_{trans}$ to form a unified deep feature fed into the subsequent fully connected layers for the classification decision. To optimize the parameters of this critical step, this paper introduces the LM algorithm, an efficient optimization method specifically designed for nonlinear least-squares problems. By adaptively adjusting the damping factor, it combines the fast convergence of the Gauss–Newton method with the global stability of gradient descent. Considering the large computational cost of the LM algorithm due to the need to compute the Jacobian matrix, this paper adopts a two-stage training strategy. First, the Adam optimizer is used to pre-train the entire Transformer-CNN feature extraction backbone to obtain stable initial parameters. Subsequently, during the optimization phase, the backbone network parameters are frozen, and the LM algorithm is applied only to the top-level classifier. The core is to define the residual $r_i$:
$$r_i = y_i - \hat{y}_i,$$
and update the parameters by solving the linear system:
$$\left(J^{\top}J + \mu I\right)\Delta W = -J^{\top} r,$$
where $J$ is the Jacobian matrix of the residuals with respect to the top-layer parameters $W$, and $\mu$ is the damping factor. This algorithm adaptively switches between gradient descent and the Gauss–Newton method, achieving superlinear convergence near the optimal solution, thereby accurately learning the complex mapping from the fused features to the final deception judgment and effectively improving the model's convergence speed and generalization performance.
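A minimal NumPy illustration of the damped LM update on a toy problem follows (a logistic head fitted to binary targets; the fixed damping factor, toy data, and iteration count are our assumptions, whereas the paper adapts the damping factor during training):

```python
import numpy as np

def lm_step(W, X, y, mu):
    """One Levenberg-Marquardt update for a logistic head y_hat = sigmoid(X @ W).

    Residuals r_i = y_i - y_hat_i; solve (J^T J + mu I) dW = -J^T r,
    where J is the Jacobian of the residuals w.r.t. W.
    """
    y_hat = 1.0 / (1.0 + np.exp(-(X @ W)))
    r = y - y_hat
    # d r_i / d W = -y_hat_i (1 - y_hat_i) x_i
    J = -(y_hat * (1.0 - y_hat))[:, None] * X
    A = J.T @ J + mu * np.eye(len(W))
    dW = np.linalg.solve(A, -J.T @ r)
    return W + dW, float(np.mean(r ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # toy features
w_true = rng.normal(size=5)
y = (X @ w_true > 0).astype(float)            # separable binary targets

W, mu = np.zeros(5), 1e-2
for _ in range(30):                           # iterate the damped update
    W, mse = lm_step(W, X, y, mu)
```

Large $\mu$ makes the step behave like damped gradient descent; small $\mu$ recovers the Gauss–Newton step, which is the mechanism behind the adaptive switching described above.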
During the LM refinement stage only (the Mel-branch top fully connected layers, with the CNN/Transformer backbone frozen), we treat the binary targets as continuous scores in $[0, 1]$ and minimize a mean squared error (MSE) objective so that the Levenberg–Marquardt update matches its standard nonlinear least-squares formulation:
$$L = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2,$$
where $N$ is the batch size, $y_i$ denotes the true label of the $i$-th sample, and $\hat{y}_i$ represents the corresponding predicted value before thresholding. The LM algorithm iteratively minimizes this loss to refine the Mel-branch head parameters $W$.
Implementation (LM head in PyTorch). The Levenberg–Marquardt refinement is implemented in PyTorch 2.1.2. After freezing all CNN and Transformer parameters, we collect residuals from the last one or two fully connected (nn.Linear) layers of the Mel branch (the head that maps fused CNN/Transformer embeddings to scalar scores in $[0, 1]$). We use PyTorch's automatic differentiation to form the Jacobian–vector products needed for each LM step (equivalently, a damped Gauss–Newton update on those head weights only), iterating until convergence criteria on the parameter update $\Delta W$ or on the MSE over the current mini-batch are met. The backbone remains non-trainable during this stage; no LM update is applied to the cross-attention fusion or the ACO/NLD branch.
Clarification (training objective for the full model). The Transformer–CNN Mel backbone is first pre-trained with Adam as in
Table 1, then frozen. The Gated-BiLSTM–Attention branch, the cross-attention fusion module, and the final softmax classifier are trained with stochastic gradient descent (SGD) and cross-entropy (CE) on the two-way softmax output. Thus, CE governs the main end-to-end classification path, whereas MSE appears only in the LM polishing step on the Mel-branch head described above—not as the global loss for MFF-Net.
In summary, this module ensures the comprehensiveness of feature extraction through the parallel structure of Transformer and CNN, and achieves efficient and precise optimization of model parameters by applying the LM algorithm at the key decision layer, jointly providing strong technical support for the deception detection task. The high-level representation of the Mel-spectrogram features output by this module will be fused with the acoustic features via cross-attention.
3.4. Multi-Feature Fusion Module Based on Cross-Attention
Acoustic-prosodic features (ACO) primarily characterize the prosody, voice quality, and spectral statistics of speech. Nonlinear dynamic features (NLD) reflect the chaotic dynamic behavior of speech production. Mel-spectrogram features provide global pattern information from a time-frequency perspective. These three types of features have significant differences in representation granularity, physical meaning, and data structure, constituting a typical heterogeneous feature fusion problem. Traditional multi-feature fusion in speech deception detection often relies on simple weighted summation or feature concatenation. Such methods can introduce a large amount of redundant information and struggle to fully explore the nonlinear correlations and deep complementarity between heterogeneous features. Targeting the characteristic that these three types of features, though from the same source, have distinct representational perspectives, this paper designs a deep multi-feature fusion module based on a cross-attention mechanism. This mechanism allows each type of feature to actively query relevant information in the feature space of the others, rather than being passively fused. For example, when acoustic features detect abnormal fluctuations in fundamental frequency, they can automatically enhance the attention to corresponding frequency band features in the Mel-spectrogram, achieving synergistic enhancement between features. This mechanism enables fine-grained feature fusion through bidirectional interaction and adaptive weight allocation between features.
Specifically, the high-level representation of the acoustic and nonlinear dynamic features $F_a$ output by the Gated-BiLSTM-Attention module and the high-level representation of the Mel-spectrogram features $F_m$ extracted by the Transformer-CNN-LM module are taken as inputs. First, they are mapped to a unified hidden dimension $d$ through independent linear projection layers:
$$H_a = F_a W_a, \qquad H_m = F_m W_m,$$
where $W_a$ and $W_m$ are projection weights, ensuring feature dimensionality consistency and laying the foundation for the subsequent attention calculation. For the attention flow from the acoustic features to the Mel-spectrogram:
$$Q_a = H_a W_Q, \qquad K_m = H_m W_K, \qquad V_m = H_m W_V,$$
where $W_Q$, $W_K$, and $W_V$ are learnable parameter matrices. Then, the bidirectional cross-attention outputs are calculated:
$$A_{a \to m} = \mathrm{Softmax}\!\left(\frac{Q_a K_m^{\top}}{\sqrt{d}} + B_{am}\right)V_m, \qquad A_{m \to a} = \mathrm{Softmax}\!\left(\frac{Q_m K_a^{\top}}{\sqrt{d}} + B_{ma}\right)V_a,$$
where $B_{am}$ and $B_{ma}$ are optional attention bias matrices for introducing prior knowledge. $A_{a \to m}$ represents the context-enhanced representation obtained when the acoustic features attend to the Mel-spectrogram features, and $A_{m \to a}$ is the representation of the Mel-spectrogram features enhanced by the acoustic features. This design enables the model to dynamically capture cross-feature correlations such as "abnormal fundamental frequency accompanied by formant shifts". To fully utilize the attention outputs, residual connections and a gating mechanism are employed:
$$Z_a = H_a + \sigma(\lambda) \odot A_{a \to m},$$
where $\lambda$ is a learnable parameter; $Z_m$ is obtained similarly. Finally, a unified representation is generated through a hierarchical fusion strategy:
$$Z = \mathrm{FFN}\left([Z_a \,\|\, Z_m]\right),$$
where $\|$ denotes the concatenation operation and FFN is a two-layer feedforward network. The fused feature vector $Z$ is then input to the Softmax classifier, and the mapping from the feature space to the category space is completed through
$$\hat{y} = \mathrm{Softmax}(W_c Z + b_c),$$
ultimately achieving binary classification of lies and truth. This fusion mechanism effectively addresses the representation-gap problem of heterogeneous features, providing a unified feature representation with stronger discriminative power for deception detection.
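The fusion path can be sketched end-to-end in NumPy as follows (a single pooled embedding per branch, untrained random parameters, and illustrative dimensions; the optional bias matrices $B$ are omitted, and the FFN is collapsed into one classifier layer for brevity):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(Hq, Hk, Wq, Wk, Wv):
    """One direction of cross-attention: Hq queries Hk."""
    Q, K, V = Hq @ Wq, Hk @ Wk, Hk @ Wv
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
d = 16                                     # shared hidden dimension
Fa = rng.normal(size=(1, 24))              # ACO/NLD branch embedding
Fm = rng.normal(size=(1, 64))              # Mel branch embedding

# Project both branches to the shared hidden dimension d.
Ha = Fa @ rng.normal(scale=0.1, size=(24, d))
Hm = Fm @ rng.normal(scale=0.1, size=(64, d))

Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
lam = 0.0                                  # learnable gate parameter (init)
g = 1.0 / (1.0 + np.exp(-lam))             # sigmoid gate

# Bidirectional cross-attention with gated residual connections.
Za = Ha + g * cross_attend(Ha, Hm, Wq, Wk, Wv)
Zm = Hm + g * cross_attend(Hm, Ha, Wq, Wk, Wv)

# Hierarchical fusion: concatenate and classify with Softmax.
Z = np.concatenate([Za, Zm], axis=-1)      # (1, 2d)
logits = Z @ rng.normal(scale=0.1, size=(2 * d, 2))
probs = softmax(logits)                    # truth vs. deception probabilities
```

With length-1 sequences the attention weights are trivially 1, so this sketch mainly illustrates the projection, gated-residual, and concatenation shapes; with frame-level sequences the same code performs genuine cross-modal weighting.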