3.1. Data Preprocessing
During the data preprocessing stage, we prepared audio samples from the ASVSpoof 2019 [31] and FoR [32] datasets. In this paper, the audio samples are first divided into multiple short frames, each representing the short-term characteristics of the audio signal. To prevent information loss between frames, the audio signal is segmented into short-time frames with a frame length of 25 ms and a frame shift of 10 ms, so that adjacent frames overlap by 15 ms (60%). This frame overlap ensures signal continuity and facilitates smooth transitions during computation, thereby effectively capturing the time-varying characteristics of the audio signal. Each frame is then weighted by a Hamming window function to mitigate spectral leakage. The Hamming window function is defined as

$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1,$$

where N is the length of the window function and n indexes the sampling points within the current frame.
The purpose of the Hamming window is to weight both ends of the signal so that the amplitude at the frame edges gradually attenuates. This avoids the spectral leakage caused by frame truncation and ensures the accuracy of spectral calculations. Each windowed frame subsequently undergoes a Fast Fourier Transform (FFT) to convert the time-domain signal into a frequency-domain representation:

$$X(f) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi f n / N}, \quad f = 0, 1, \ldots, N-1,$$

where X(f) is the frequency-domain signal, x(n) is the time-domain signal, N is the frame length, f is the frequency component, and j is the imaginary unit.
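The framing, windowing, and FFT steps above can be sketched in a few lines of NumPy; the 16 kHz sample rate and the random placeholder signal are assumptions for illustration, as the text does not state them.

```python
# Sketch of the framing, Hamming-windowing, and FFT steps described above,
# assuming a 16 kHz sample rate (not stated in the text).
import numpy as np

sr = 16000
frame_len = int(0.025 * sr)    # 25 ms -> 400 samples
frame_shift = int(0.010 * sr)  # 10 ms -> 160 samples

signal = np.random.randn(sr)   # placeholder 1-second signal

# Slice the signal into overlapping short-time frames.
n_frames = 1 + (len(signal) - frame_len) // frame_shift
frames = np.stack([signal[i * frame_shift : i * frame_shift + frame_len]
                   for i in range(n_frames)])

# Hamming window: w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)).
window = np.hamming(frame_len)
windowed = frames * window

# FFT per frame; keep the magnitude spectrum of the positive frequencies.
spectrum = np.abs(np.fft.rfft(windowed, axis=1))
print(spectrum.shape)  # (n_frames, frame_len // 2 + 1)
```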
Through the FFT, the frequency components of each frame and their amplitude distribution can be calculated, yielding the spectral information of the frame. However, since the human ear exhibits varying sensitivity to different frequencies, directly using the spectrogram may not adequately capture the characteristics of the audio signal. Therefore, to better align with the perception of the human auditory system, this paper adopts the Mel frequency scale to non-linearly compress the spectrum. The conversion to Mel frequency is given by

$$\mathrm{mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right),$$

where mel is the Mel frequency and f is the linear frequency (Hz).
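As a quick check of the formula, a linear frequency of 1 kHz maps to roughly 1000 mel:

```python
# Direct implementation of the conversion above: mel = 2595*log10(1 + f/700).
import numpy as np

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

print(hz_to_mel(1000.0))  # ~1000.0 -- 1 kHz maps to roughly 1000 mel
```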
The Mel scale simulates the human ear’s varying sensitivity to low and high frequencies by non-linearly compressing the frequency axis. Mel frequency transformation helps reduce redundant information in the high-frequency range while preserving critical information in the low-frequency range, making it more aligned with human auditory characteristics. After the Mel frequency transformation, the spectrogram is filtered through a Mel filter bank. Each Mel filter outputs the energy value of a specific frequency range, resulting in the final Mel spectrogram. In this study, we generated Mel spectrograms with a size of 224 × 224 × 3 using a Hanning window with a size of 2048 and a hop length of 512. The number of Mel filters used in the filter bank was 224. The generation of Mel spectrograms is a critical step in the proposed method, as it prepares the audio data for subsequent feature extraction and analysis stages.
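A minimal sketch of this spectrogram generation using librosa follows; the 16 kHz sample rate and the file path are illustrative, and the normalize/crop/stack step at the end is one plausible way to obtain the stated 224 × 224 × 3 input.

```python
# Minimal sketch of the Mel-spectrogram generation described above
# (Hanning window, n_fft = 2048, hop length = 512, 224 Mel filters).
# The sample rate and file path are illustrative assumptions.
import numpy as np
import librosa

y, sr = librosa.load("sample.wav", sr=16000)
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=2048, hop_length=512,
    n_mels=224, window="hann")
mel_db = librosa.power_to_db(mel, ref=np.max)

# Normalize to [0, 1], pad/trim the time axis to 224 frames, and stack
# three copies to obtain the 224 x 224 x 3 network input.
mel_db = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)
mel_db = librosa.util.fix_length(mel_db, size=224, axis=1)
image = np.stack([mel_db] * 3, axis=-1)  # (224, 224, 3)
```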
3.2. Feature-Enhanced Lightweight Fuzzy Branch Network (LFBN)
Traditional deepfake detection models usually assume that features are deterministic; however, in practical scenarios, interferences such as noise and compression introduce uncertainty. To address this issue, this paper proposes a feature enhancement method based on the Pythagorean Hesitant Fuzzy Set (PHFS). By dynamically modeling the credibility and uncertainty of features, this method improves the robustness and interpretability of the model.
Building on the Mel spectrogram representation produced during preprocessing (with a size of 224 × 224 × 3), this method focuses on the key problem of dynamically modeling feature uncertainty and credibility in the deepfake audio identification task, achieving fuzzy enhancement in the feature space through a Lightweight Fuzzy Branch Network (LFBN).
Figure 2 shows the overall architecture of this method, and its core processing flow includes the following four key stages.
First, the high-level features of ResNet generate the membership degree (μ) and non-membership degree (ν) through two fully connected branches, which are then mapped to the interval (0, 1) via the Sigmoid function. Subsequently, the PHFS projection layer normalizes μ and ν according to the constraint $\mu^2 + \nu^2 \le 1$, so as to satisfy the mathematical definition of the Pythagorean fuzzy set. Furthermore, the hesitation degree (π) is calculated to quantify the uncertainty and hesitation of features. Finally, μ, ν, and π are concatenated with the original features to form enhanced features, providing the downstream network with a hybrid feature space that integrates both semantic information and uncertainty representation.
Compared with traditional Intuitionistic Fuzzy Sets (IFS), Pythagorean Fuzzy Sets (PFS), or Hesitant Fuzzy Sets (HFS), PHFS has more flexible expressive ability in handling conflicts, hesitation, and uncertainty. It is particularly suitable for addressing degraded or ambiguous forgery traces, thereby significantly improving the robustness and interpretability of the model.
The Mel spectrogram is first processed by a pre-trained ResNet-34 network [33] to extract high-level semantic features, yielding an initial feature representation $\mathbf{F} \in \mathbb{R}^{B \times T \times d}$, where B is the batch size, T is the number of time steps (corresponding to the time axis of the Mel spectrogram), and d is the feature dimension. This step aims to capture key patterns in the time-frequency domain of the audio signal. Subsequently, an unnormalized membership degree $\tilde{\mu}$ and non-membership degree $\tilde{\nu}$ are generated through a feature mapping layer composed of two fully connected branches:

$$\tilde{\mu} = \sigma(W_{\mu}\mathbf{F} + b_{\mu}), \qquad \tilde{\nu} = \sigma(W_{\nu}\mathbf{F} + b_{\nu}),$$

where $W_{\mu}, W_{\nu}$ and $b_{\mu}, b_{\nu}$ are learnable parameters and $\sigma(\cdot)$ is the sigmoid function. The dual-branch structure achieves independent modeling of feature credibility and uncertainty through parameter decoupling.
To satisfy the mathematical constraints of Pythagorean fuzzy sets, the unnormalized degrees $\tilde{\mu}$ and $\tilde{\nu}$ are subsequently projected into a normalized space via a PHFS constraint layer:

$$\mu = \frac{\tilde{\mu}}{\max\left(1, \sqrt{\tilde{\mu}^2 + \tilde{\nu}^2}\right)}, \qquad \nu = \frac{\tilde{\nu}}{\max\left(1, \sqrt{\tilde{\mu}^2 + \tilde{\nu}^2}\right)},$$

which guarantees $\mu^2 + \nu^2 \le 1$.
Then, from the normalized membership and non-membership degrees, the hesitancy degree $\pi$ is calculated to quantify the uncertainty of the feature:

$$\pi = \sqrt{1 - \mu^2 - \nu^2}.$$
Finally, the membership degree, non-membership degree, and hesitancy degree are concatenated with the original features as additional channels to form the enhanced feature $\mathbf{F}_{\mathrm{enh}} = \mathrm{Concat}(\mathbf{F}, \mu, \nu, \pi)$. This provides the downstream network with a hybrid feature space that integrates both semantic information and uncertainty representation. By explicitly encoding feature credibility and uncertainty, the model can dynamically weigh the contribution of features from different regions when judging spoofed segments.
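The whole enhancement step can be summarized in a short PyTorch sketch; the module and variable names are illustrative, and the ResNet-34 feature dimension of 512 and the T = 49 time steps are used only for the demo.

```python
# Minimal PyTorch sketch of the PHFS feature-enhancement step described above.
# Names are illustrative, not taken from the paper.
import torch
import torch.nn as nn

class PHFSEnhancer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.mu_branch = nn.Linear(dim, 1)   # membership branch
        self.nu_branch = nn.Linear(dim, 1)   # non-membership branch

    def forward(self, feats):                        # feats: (B, T, d)
        mu = torch.sigmoid(self.mu_branch(feats))    # (B, T, 1)
        nu = torch.sigmoid(self.nu_branch(feats))    # (B, T, 1)
        # PHFS projection: rescale so that mu^2 + nu^2 <= 1.
        norm = torch.clamp(torch.sqrt(mu ** 2 + nu ** 2), min=1.0)
        mu, nu = mu / norm, nu / norm
        # Hesitancy degree pi = sqrt(1 - mu^2 - nu^2).
        pi = torch.sqrt(torch.clamp(1.0 - mu ** 2 - nu ** 2, min=0.0))
        # Concatenate mu, nu, pi with the original features as extra channels.
        return torch.cat([feats, mu, nu, pi], dim=-1)  # (B, T, d + 3)

x = torch.randn(2, 49, 512)        # e.g. ResNet-34 features, d = 512
print(PHFSEnhancer(512)(x).shape)  # torch.Size([2, 49, 515])
```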
3.3. Dual-Path Time-Frequency Attention Network (DPTFAN)
The Mel spectrogram of audio is essentially a two-dimensional time-frequency matrix with explicit structural properties. The temporal dimension captures the dynamic variations of speech, such as phoneme boundaries, rhythmic patterns, and energy transitions; in contrast, the frequency dimension carries static spectral features, including harmonic structures, formant distributions, and high-frequency artifacts. Extensive research has demonstrated that the forgery cues embedded in these two dimensions exhibit significant complementarity, while there exist intrinsic differences in their statistical characteristics and spatial correlations. If only 2D convolution is employed for holistic modeling, temporal and spectral cues tend to be mixed during the convolution process, which may undermine the model’s capability to independently represent the two types of fine-grained forgery patterns.
Based on this insight, the Mel spectrogram is explicitly decoupled into a Temporal Path (T-Path) and a Frequency Path (F-Path) in the proposed model design. The T-Path adopts a 1D convolution structure along the temporal axis, focusing on capturing the dynamic variations and continuity patterns of speech; the F-Path, on the other hand, retains the complete 2D structure and enhances the modeling of spectral details via 2D convolution and channel attention. These two paths are subsequently fused through the Lightweight Fuzzy Branch Network (LFBN), thereby achieving complementary enhancement of temporal and spectral features. This decoupling-fusion design not only conforms to the time-frequency structure of Mel features but also strengthens the expressiveness of forgery traces across different dimensions, thus forming the design rationale for the dual-path structure proposed in this study.
The proposed Dual-Path Time-Frequency Attention Network (DPTFAN) achieves efficient forgery detection with multi-stage fuzzy information fusion by integrating the time-frequency properties of Mel spectrograms and the feature enhancement capability of the Lightweight Fuzzy Branch Network (LFBN). Leveraging the time-frequency decoupling property of speech signals, the network processes features from the temporal and frequency dimensions separately: the Temporal Path (T-Path) focuses on temporal dynamic patterns (e.g., rhythmic anomalies, phoneme discontinuities), while the Frequency Path (F-Path) captures frequency-domain structural features (e.g., harmonic absence, formant distortion). The dual paths improve detection accuracy through a complementary learning mechanism, and the specific workflow is as follows:
Time Path: The preprocessed Mel spectrogram is reshaped into a pseudo-one-dimensional sequence (224 time steps, each containing 224 × 3 frequency-domain features), which incorporates concatenated features from all frequency channels. This step isolates the time axis for independent modeling, avoiding interference from frequency-domain information and enabling the model to focus on temporal dynamics. Subsequently, a convolution operation with a kernel size of 5, stride of 2, and padding of 2 is applied along the time axis of the pseudo-one-dimensional sequence to extract temporal features, yielding initial features along the time path. This captures short-term temporal patterns (such as phoneme boundaries and energy variations).
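This reshaping and stem convolution can be illustrated with a minimal PyTorch sketch; the stem's output channel count (kept at 672) is an assumption, since the text only specifies the kernel, stride, and padding.

```python
# Sketch of the time-path stem: the 224 x 224 x 3 Mel image becomes 224 time
# steps of 224 * 3 = 672 features; a 1D convolution (kernel 5, stride 2,
# padding 2) then runs along the time axis.
import torch
import torch.nn as nn

mel = torch.randn(8, 3, 224, 224)                   # (B, C, F, T)
seq = mel.permute(0, 3, 1, 2).reshape(8, 224, 672)  # (B, T, C*F)
seq = seq.transpose(1, 2)                           # (B, 672, 224) for Conv1d

stem = nn.Conv1d(672, 672, kernel_size=5, stride=2, padding=2)
out = stem(seq)
print(out.shape)  # torch.Size([8, 672, 112])
```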
Next, four lightweight residual modules are stacked, each comprising a depthwise separable convolution (with 8 groups), batch normalization, and ELU activation. Each module employs a stride of 2, progressively downsampling the time steps and expanding the feature dimension from 672 to 1344. Residual connections are used to add the input of each module to its output, preventing degradation in deep networks. The purpose of the lightweight residual modules is to balance computational efficiency with audio feature representation, producing the output features along the time path.
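A sketch of one such residual module follows. Two details are assumptions: the 1 × 1 strided projection on the shortcut (needed to make the residual addition shape-compatible) and the placement of the stride-1 stage, chosen so the sequence ends at the 14 time steps referenced below (four stride-2 stages after the stem would yield 7).

```python
# Sketch of one lightweight residual module: grouped (separable) conv with
# 8 groups, batch norm, ELU, and a projected residual connection.
import torch
import torch.nn as nn

class LightResBlock(nn.Module):
    def __init__(self, c_in, c_out, stride=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(c_in, c_out, kernel_size=3, stride=stride,
                      padding=1, groups=8),   # grouped (separable) convolution
            nn.BatchNorm1d(c_out),
            nn.ELU(),
        )
        # Assumed 1x1 projection so input and output shapes match for addition.
        self.shortcut = nn.Conv1d(c_in, c_out, kernel_size=1, stride=stride)

    def forward(self, x):
        return self.body(x) + self.shortcut(x)

blocks = nn.Sequential(
    LightResBlock(672, 672),               # 112 -> 56 steps
    LightResBlock(672, 1344),              # 56 -> 28 steps
    LightResBlock(1344, 1344),             # 28 -> 14 steps
    LightResBlock(1344, 1344, stride=1),   # stays at 14 (assumed)
)
x = torch.randn(8, 672, 112)  # stem output from the previous sketch
print(blocks(x).shape)        # torch.Size([8, 1344, 14])
```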
The Lightweight Fuzzy Branch Network (LFBN) analyzes the Mel spectrogram and outputs a 14-dimensional vector representing the importance weight (in the range 0 to 1) of each time step. These weights are applied step-wise to the output features of the time path (14 time steps × 1344 dimensions). The membership matrix $\mu$ generated by the LFBN pushes the weights of key time steps (such as abnormal silent segments) toward 1, while the weights of less important segments approach 0. The attention calculation follows the scaled dot-product form:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V.$$
Here, $d_k$ denotes the dimension of the key (and query) feature vectors in the attention mechanism. To prevent excessively large dot-product values (caused by high vector dimensions) from producing vanishingly small Softmax gradients, the scaling factor $1/\sqrt{d_k}$ is introduced, improving the numerical stability of the attention calculation. This is consistent with the standard scaled dot-product attention mechanism in the Transformer.
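The sketch below illustrates both the step-wise membership weighting and the scaled dot-product form; the tensor shapes follow the dimensions stated above, and all names are illustrative.

```python
# Sketch of the membership weighting of the time-path output (14 steps x
# 1344 dims) and the scaled dot-product attention referenced above.
import torch
import torch.nn.functional as F

t_feats = torch.randn(8, 14, 1344)      # time-path output (B, T, d)
mu = torch.rand(8, 14)                  # LFBN membership weights in (0, 1)
weighted = t_feats * mu.unsqueeze(-1)   # step-wise re-weighting

# Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
def scaled_attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    return F.softmax(scores, dim=-1) @ v
```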
Frequency Path: The preprocessed Mel spectrogram is fed into the frequency path. First, a 3 × 3 convolutional kernel slides along the frequency axis to extract correlation features between adjacent frequency points (such as harmonic continuity), yielding the feature $\mathbf{F}_f^{(1)}$. Subsequently, a Squeeze-and-Excitation (SE) module is employed: the squeeze operation performs global average pooling on each channel to produce a 64-dimensional channel descriptor vector, and the excitation operation uses two fully connected layers (with an intermediate dimension of 16) to generate channel weights, enhancing the weights of key frequency bands (4–6 kHz) while suppressing those of noisy bands (such as high-frequency artifacts).
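A minimal PyTorch sketch of this SE stage follows, assuming the 3 × 3 convolution produces the 64 channels implied by the 64-dimensional descriptor.

```python
# Sketch of the frequency-path front end: a 3x3 convolution followed by a
# Squeeze-and-Excitation block with a 16-dim bottleneck.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels=64, hidden=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(),
            nn.Linear(hidden, channels), nn.Sigmoid(),
        )

    def forward(self, x):               # x: (B, C, F, T)
        w = x.mean(dim=(2, 3))          # squeeze: global average pooling
        w = self.fc(w)                  # excitation: per-channel weights
        return x * w[:, :, None, None]  # re-scale each channel

conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)
x = SEBlock()(conv(torch.randn(8, 3, 224, 224)))
print(x.shape)  # torch.Size([8, 64, 224, 224])
```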
Multi-scale frequency-domain context fusion is then achieved by concatenating features processed by spatial pyramid pooling with 4 × 4 max pooling and 2 × 2 average pooling, respectively. This integrates multi-scale frequency-domain contextual information and improves the model's robustness to variations in frequency band structure. The resulting feature is denoted as $\mathbf{F}_f$.
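The pooling fusion can be sketched as follows; because the two pooled branches have different spatial sizes, this sketch upsamples them back to the input resolution before concatenation, which is an assumption the text does not spell out.

```python
# Sketch of the multi-scale pooling fusion: 4x4 max pooling and 2x2 average
# pooling applied in parallel, upsampled (assumed), and concatenated.
import torch
import torch.nn.functional as F

x = torch.randn(8, 64, 224, 224)      # SE-block output
p1 = F.max_pool2d(x, kernel_size=4)   # (8, 64, 56, 56)
p2 = F.avg_pool2d(x, kernel_size=2)   # (8, 64, 112, 112)
p1 = F.interpolate(p1, size=x.shape[2:], mode="bilinear", align_corners=False)
p2 = F.interpolate(p2, size=x.shape[2:], mode="bilinear", align_corners=False)
fused = torch.cat([x, p1, p2], dim=1)  # multi-scale context, (8, 192, 224, 224)
```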
The non-membership degree $\nu$, generated by the Lightweight Fuzzy Branch Network (LFBN), identifies noisy or ambiguous regions (values close to 1 indicate that suppression is required). The frequency-path features are suppressed accordingly:

$$\mathbf{F}_f' = \mathbf{F}_f \odot (1 - \nu).$$
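A minimal sketch of this suppression, assuming the multiplicative form implied by the description:

```python
# Regions where nu approaches 1 are damped by the factor (1 - nu);
# the multiplicative form is an inference from the description above.
import torch

f_feats = torch.randn(8, 192, 224, 224)  # frequency-path features
nu = torch.rand(8, 1, 224, 224)          # LFBN non-membership map in (0, 1)
suppressed = f_feats * (1.0 - nu)        # noisy/ambiguous regions damped
```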
Dual-Path Feature Fusion: First, the final features from the time path are upsampled via bilinear interpolation to align with the dimensions of the features output by the frequency path. Time-frequency cross-attention is then employed to fuse the features from the two paths. The cross-attention is designed as follows:
The query $Q$ is generated from the time-path features, reflecting temporal dynamic patterns; the key $K$ and value $V$ are generated from the frequency-path features, encoding spectral structural information.
The deep hesitancy degree matrix $\pi$ generated by the Lightweight Fuzzy Branch Network (LFBN) is multiplied element-wise with the attention weights to suppress attention responses in uncertain regions (e.g., background noise), thereby fuzzifying the uncertain information:

$$A' = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) \odot (1 - \pi).$$
The frequency path features are then weighted and aggregated using the adjusted attention weights, and concatenated with the time path features. The fused features are mapped to a high-dimensional space through a fully connected layer. Finally, a Softmax classifier is applied to output the probabilities of spoofed and genuine classes, completing the end-to-end detection process.
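Putting the fusion stage together, the following sketch uses 1D linear interpolation in place of the bilinear upsampling mentioned above, a per-position hesitancy vector, and illustrative dimensions throughout; none of these specifics come from the paper.

```python
# End-to-end sketch of the fusion head: Q from the time path, K/V from the
# frequency path, attention modulated by (1 - pi), then a Softmax classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, d = 8, 256
t = torch.randn(B, 14, d)   # time-path features
f = torch.randn(B, 49, d)   # frequency-path features (flattened positions)
pi = torch.rand(B, 49)      # LFBN hesitancy per frequency-path position

# Upsample the time path to the frequency-path length (linear stands in
# for the bilinear upsampling described in the text).
t_up = F.interpolate(t.transpose(1, 2), size=49, mode="linear",
                     align_corners=False).transpose(1, 2)

q, k, v = t_up, f, f                      # Q from T-path, K/V from F-path
attn = F.softmax(q @ k.transpose(1, 2) / d ** 0.5, dim=-1)
attn = attn * (1.0 - pi)[:, None, :]      # suppress uncertain key positions
attn = attn / attn.sum(-1, keepdim=True).clamp_min(1e-8)  # renormalize

fused = torch.cat([attn @ v, t_up], dim=-1)   # aggregate F-path, concat T-path
head = nn.Sequential(nn.Linear(2 * d, 512), nn.ELU(), nn.Linear(512, 2))
logits = head(fused.mean(dim=1))              # pool over positions
probs = F.softmax(logits, dim=-1)             # genuine vs. spoofed
print(probs.shape)  # torch.Size([8, 2])
```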