Article

A Spoofing Speech Detection Method Combining Multi-Scale Features and Cross-Layer Information

1 College of Computer Science and Technology (College of Data Science), Taiyuan University of Technology, Jinzhong 030600, China
2 Information Technology & Decision Sciences Department, Old Dominion University, Norfolk, VA 23529, USA
* Authors to whom correspondence should be addressed.
Information 2025, 16(3), 194; https://doi.org/10.3390/info16030194
Submission received: 31 January 2025 / Revised: 21 February 2025 / Accepted: 26 February 2025 / Published: 2 March 2025

Abstract
Pre-trained self-supervised speech models can extract general acoustic features, providing feature inputs for various downstream speech tasks. Spoofing speech detection, a pressing issue in the age of generative AI, requires both global information and local features of speech. The multi-layer transformer structure in pre-trained speech models can effectively capture temporal information and global context in speech, but there is still room for improvement in handling local features. To address this issue, a spoofing speech detection method that integrates multi-scale features and cross-layer information is proposed. The method introduces a multi-scale feature adapter (MSFA), which enhances the model’s ability to perceive local features through residual convolutional blocks and squeeze-and-excitation (SE) mechanisms. Additionally, cross-layer adaptive weights (CAWs) are used to guide the model in focusing on task-relevant shallow information, thereby enabling the effective fusion of features from different layers of the pre-trained model. Experimental results show that the proposed method achieved equal error rates (EERs) of 0.36% and 4.29% on the ASVspoof2019 logical access (LA) and ASVspoof2021 LA datasets, respectively, demonstrating excellent detection performance and generalization ability.

Graphical Abstract

1. Introduction

Speech synthesis techniques, especially artificial intelligence generated content, can produce speech that closely resembles real human speech. Such synthetic speech is widely used in fields such as human–computer interaction, entertainment, and education. However, it can also be misused to attack automatic speaker verification (ASV) systems [1] or deceive human listeners. Spoofing speech detection techniques, which exploit the differences between bona fide and spoofed speech features to distinguish them, can protect ASV systems and human users from the threat of spoofed speech.
Spoofing speech detection typically adopts a “feature extraction + back-end classification” architecture, in which acoustic features are first extracted from the input audio and then traditional machine learning methods or deep neural networks are used to distinguish between bona fide and spoofed speech. Most current research on spoofing speech detection focuses on extracting more discriminative acoustic features. Mainstream features include handcrafted features designed based on prior human knowledge and deep features learned by neural networks from raw speech. However, the extraction of these features relies on training with task-specific datasets, which limits their generalization ability when faced with unseen data types. In recent years, self-supervised speech pre-trained models have developed rapidly. These models are trained on large-scale unlabeled data and have shown strong generalization ability across various downstream speech tasks. However, pre-trained models have significant computational and storage overheads; therefore, efficiently transferring them to spoofing speech detection tasks is crucial for improving detection performance.
To address the above issues, this paper proposes a novel adapter architecture, the multi-scale feature adapter (MSFA), designed to fine-tune the output of the transformer layers in pre-trained models. The MSFA primarily consists of two components: a residual convolutional block and a squeeze-and-excitation (SE) mechanism. The residual convolutional block incorporates multiple receptive field sizes, enabling the capture of local features at different granularities, which complement the global information extracted by the transformer layers. The SE mechanism introduces channel attention to help the model focus on fine-grained, task-relevant features. Additionally, this paper introduces cross-layer adaptive weights (CAWs), which assign greater weights to task-relevant layers and aggregate multi-layer outputs to enhance the speech representation. The final speech representation is then fed into a back-end classifier for bona fide and spoofed speech detection. The main contributions of this paper are as follows:
  • We propose MSFA, a new adapter design tailored for the spoofing speech detection task. MSFA effectively utilizes the local detail features overlooked by pre-trained models, making it more suitable for spoofing detection.
  • To better leverage the information embedded in different layers of the pre-trained model, we introduce CAWs to assign different weights to each layer. This mechanism enables the model to focus on task-relevant layers, improving the ability of the speech representation to distinguish between bona fide and spoofed speech.
  • We conducted experiments on ASVspoof2019 logical access (LA) and ASVspoof2021 LA datasets. The results demonstrated that our method achieved an equal error rate (EER) of 0.36% on ASVspoof2019 LA and exhibited good generalization capabilities on ASVspoof2021 LA.

2. Related Work

The separability of speech features directly affects the performance of spoofing speech detection systems. Current mainstream features include handcrafted features, deep features, and features extracted from pre-trained models.

2.1. Handcrafted Feature-Based Spoofing Speech Detection

Current mainstream spoofing speech detection models primarily rely on various handcrafted features for feature extraction, including short-term features, long-term features, and their combinations. Short-term features include mel-frequency cepstral coefficients (MFCCs), linear frequency cepstral coefficients (LFCCs) [2], and constant Q cepstral coefficients (CQCCs) [3], while long-term features include spectrograms. Handcrafted feature design is based on human prior knowledge, which provides high interpretability and flexibility. However, this process is labor-intensive and cannot fully adapt to rapidly evolving spoofing techniques.

2.2. Deep Feature-Based Spoofing Speech Detection

With the remarkable performance of deep neural networks (DNNs) [4] in various classification tasks, DNN-based spoofing speech detection algorithms have gained popularity. Considering that differentiating between spoofed and bona fide speech requires both local information (e.g., unnatural stress or intonation) and global information (e.g., excessive smoothing) [5], some studies have taken the raw audio as input and leveraged convolutional neural networks (CNNs) and long short-term memory networks (LSTMs) to extract both local and global information. These features are then jointly trained with back-end classifiers. For example, Yoon et al. [6] proposed a bidirectional feature segmentation method based on bidirectional LSTMs and used the SE-ResNet network for replay attack detection. However, most of these studies have focused on training spoofing speech detection systems from scratch on task-specific datasets, which often suffer from poor model generalization, high computational costs, and significant time overhead.

2.3. Spoofing Speech Detection Based on Pre-Trained Speech Models

Large-scale pre-trained speech models (hereafter referred to as pre-trained models) leverage the transformer architecture, self-supervised learning, and large amounts of unlabeled data to learn effective general speech representations. These models have demonstrated strong generalization capabilities across various downstream speech tasks. Currently, mainstream pre-trained models include WavLM [7], HuBERT [8], UniSpeech-SAT (US-SAT) [9], and wav2vec 2.0 [10]. However, these models often incur substantial computational and storage costs. Fully fine-tuning these models for specific speech tasks is not only time- and resource-intensive, but also prone to catastrophic forgetting [11].
In spoofing speech detection tasks, early research primarily used pre-trained models as feature extractors, freezing their parameters and training them jointly with task-specific back-end classifiers. Lv et al. [12] utilized a variant of wav2vec 2.0 as a feature extractor, achieving the best performance in the partially fake audio detection (PF) task of the ADD 2022 competition. Wang et al. [13] conducted a series of experiments to evaluate pre-trained models and their corresponding back-end classifier architectures for spoofing speech detection. The results indicate that while pre-trained models trained on a large corpus of bona fide unlabeled data can output rich semantic representations, they often lose the local detail features and shallow non-semantic information necessary for distinguishing between bona fide and spoofed speech. Therefore, directly applying these models to spoofing speech detection is insufficient, and further adaptation is required.
A common approach to address this issue is to fine-tune pre-trained models using lightweight modules such as adapters, without altering the parameters or structures of the pre-trained models. Current research in this direction has focused on two main areas: the design of adapter architectures and the selection of insertion points within the model.
A typical adapter architecture used in the speech domain is shown in Figure 1. To make this architecture more suitable for spoofing speech detection, Wu et al. [14] introduced a dynamic convolution-modified Res2Net module at the end of wav2vec to effectively extract multi-scale features of speech. However, this approach does not fully consider the interactions between local features and global information. Because distinguishing between bona fide and spoofed speech requires both local details and global context, we propose the MSFA, which uses multi-scale residual convolution blocks to capture local information and an SE mechanism for adaptive channel weighting, thus achieving an effective combination of local and global contextual information.
Regarding the selection of adapter insertion points, Thomas et al. [15] inserted adapters into the top N layers of pre-trained models to reduce the number of introduced parameters. Peng et al. [16] proposed a parallel adapter, training adapters in parallel with the feed-forward network (FFN) of the transformer layers. This approach enhances the model’s nonlinear expressive power through feature concatenation, but the parallel structure restricts interactions to the features of a single layer. However, related studies [17] suggest that speaker-related information is distributed across different layers of pre-trained models. Considering the complementarity and redundancy of features across layers, we propose CAWs to assign larger weights to layers that are more relevant to the task, thereby enabling the aggregation of outputs from multiple layers.

3. Methods

To better apply pre-trained speech models to the spoofing speech detection task, this paper builds on the XLSR-LLGF model and proposes a new adapter architecture, MSFA, together with CAWs, to fine-tune and fuse the outputs of each layer of the pre-trained model and extract task-relevant speech representations. The overall architecture of the model is shown in Figure 2. This section provides a detailed description of each module in the model.

3.1. XLSR-LLGF

In this study, the XLSR-LLGF model was used as the baseline system, and its overall architecture is illustrated in Figure 2. For front-end feature extraction, the frozen cross-lingual speech representation (XLSR) [18] pre-trained model was employed. The back-end classifier consists of a lightweight convolutional neural network (LCNN), two bi-directional recurrent layers with long short-term memory (LSTM) units, a global average pooling (GAP) layer, and a fully connected (FC) output layer; it is referred to as LLGF. This back end has achieved excellent performance on the ASVspoof2019 LA dataset.
The XLSR model is a self-supervised, cross-lingual speech representation model designed for learning robust speech features. It extends wav2vec 2.0 by increasing the diversity of languages, the amount of training data, and the model size. This extensive training enables XLSR to extract more robust speech features. The architecture of XLSR is shown in Figure 3. The model was trained using 128 languages and 436,000 h of unlabeled speech samples from three datasets: Libri-Speech, Common Voice, and BABEL, with a parameter size of 317,000,000.
The input to the convolutional feature encoder layer is raw speech, which is processed through multiple CNN modules to produce latent speech representations $Z = \{z_1, z_2, \ldots, z_T\}$, where the length $T$ of the output is determined by the convolutional stride. Each CNN module consists of three components: a 1D convolution, layer normalization, and a Gaussian error linear unit (GELU) activation function.
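As a concrete illustration of one such CNN module, the PyTorch sketch below stacks a 1D convolution, layer normalization, and GELU; the channel counts, kernel sizes, and strides are illustrative assumptions rather than the exact XLSR configuration.

```python
import torch
import torch.nn as nn

class ConvFeatureBlock(nn.Module):
    """One CNN module of the feature encoder: 1D convolution -> layer norm -> GELU."""
    def __init__(self, in_ch: int, out_ch: int, kernel: int, stride: int):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel, stride=stride)
        self.norm = nn.LayerNorm(out_ch)   # normalizes over the channel dimension
        self.act = nn.GELU()

    def forward(self, x):                  # x: (batch, channels, samples)
        x = self.conv(x)
        x = self.norm(x.transpose(1, 2)).transpose(1, 2)  # LayerNorm expects channels last
        return self.act(x)

# Raw waveform (batch, 1, samples) -> latent representations Z with a reduced time axis;
# the number of frames T is determined by the convolutional strides.
encoder = nn.Sequential(
    ConvFeatureBlock(1, 512, kernel=10, stride=5),
    ConvFeatureBlock(512, 512, kernel=3, stride=2),
)
z = encoder(torch.randn(2, 1, 16000))      # (2, 512, T)
```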
The contextual encoder layer takes the output Z of the feature encoder as the input and extracts contextual representations C . This layer consists of multiple transformer blocks, each containing multi-head attention, feed-forward layers, and layer normalization. Random masking is applied before passing Z to the first transformer layer. The input to subsequent layers comes from the output of the previous layer.
The quantization module takes Z as the input and discretizes the continuous speech features into a finite set of speech representation units, using a product quantization method. These units are concatenated to form the quantized representation Q .
The loss function for the entire network can be expressed as:
$$L = L_m + \alpha L_d$$
Here, the contrastive loss $L_m$ requires the model to identify the correct quantized representation vector from a set of candidate quantized vectors, and the codebook diversity loss $L_d$ encourages the model to use the codebook entries more evenly; $\alpha$ is a tunable hyperparameter.
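For intuition, the sketch below shows one way the contrastive term $L_m$ could be computed, assuming cosine similarity between contextual vectors and candidate quantized vectors with a temperature hyperparameter; masking, distractor sampling, and the diversity term $L_d$ are omitted, and all names are placeholders rather than the actual wav2vec 2.0/XLSR implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, positives, negatives, temperature=0.1):
    """L_m: identify the true quantized vector among distractors at each masked step.
    context:   (T, D) contextual vectors c_t at masked positions
    positives: (T, D) matching quantized vectors q_t
    negatives: (T, K, D) K distractor quantized vectors per position (sampling not shown)
    """
    candidates = torch.cat([positives.unsqueeze(1), negatives], dim=1)    # (T, K+1, D)
    sims = F.cosine_similarity(context.unsqueeze(1), candidates, dim=-1)  # (T, K+1)
    logits = sims / temperature
    targets = torch.zeros(logits.size(0), dtype=torch.long)               # index 0 = positive
    return F.cross_entropy(logits, targets)

# Total objective: L = contrastive_loss(...) + alpha * diversity_loss(...)  (L_d not shown)
```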

3.2. MSFA

Speech self-supervised pre-trained models can extract richer speech representations than handcrafted features and generalize well. However, applying them to various downstream speech tasks is difficult [19]. Taking the baseline XLSR model in this paper as an example, it has approximately 317 million parameters, so retraining it would incur huge time and computational costs. Another issue is that the final output of the model is not directly suited to the specific task.
To make the pre-trained model more suitable for the spoofing speech detection task while introducing only a small number of parameters, a new adapter framework, MSFA, is proposed. As shown in Figure 2, the adapter consists of six modules: a fully connected layer that projects the feature dimension downward, a Res2Net block, a GELU activation function, a fully connected layer that restores the original feature dimension, an SE block, and a skip connection.
To reduce the number of introduced parameters, the output of the transformer layer is first projected down along its feature dimension before the subsequent transformations are applied, which can be expressed as follows:
$$x = \mathrm{FC}_{\mathrm{down}}(x)$$
Here, $\mathrm{FC}_{\mathrm{down}}$ denotes the down-projection fully connected layer.
In the spoofing speech detection task, the speech signal has multi-scale characteristics, including spectral features, time-domain features, speech content, tone, etc., and these features at different scales are crucial for distinguishing bona fide speech from spoofed speech. The transformer model is good at capturing global information but has limitations in processing local information at different scales. Therefore, the Res2Net module was added to capture multi-scale local features, so that global and local information complement each other and improve the feature representation ability of the model. The structure of the Res2Net module is shown in Figure 4. It improves multi-scale representation by increasing the number of receptive fields: after a 1 × 1 convolution, the input feature maps are evenly split along the channel dimension into $s$ subsets, denoted by $x_i$, where $i \in \{1, 2, \ldots, s\}$. Except for $x_1$, each $x_i$ is processed by a 3 × 3 convolutional filter $K_i$. Starting from $i = 3$, $x_i$ is added to the output of $K_{i-1}$ before being fed into $K_i$. The output of each subset can be expressed as:
$$y_i = \begin{cases} x_i, & i = 1 \\ K_i(x_i), & i = 2 \\ K_i(x_i + y_{i-1}), & 2 < i \le s \end{cases}$$
Here, $s$ is the scale dimension, indicating the number of partitions into which the feature maps are split. Finally, all splits are concatenated and passed through a 1 × 1 convolution filter to maintain the channel size of the residual block.
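A minimal PyTorch sketch of this split-and-convolve scheme follows; the channel count, the use of 1D (rather than 2D) convolutions, and the scale $s = 4$ are assumptions for illustration, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class Res2NetBlock(nn.Module):
    """Res2Net-style block: y_1 = x_1; y_2 = K_2(x_2); y_i = K_i(x_i + y_{i-1}) for 2 < i <= s."""
    def __init__(self, channels: int = 256, scale: int = 4):
        super().__init__()
        assert channels % scale == 0
        self.scale = scale
        width = channels // scale
        self.conv_in = nn.Conv1d(channels, channels, kernel_size=1)
        self.convs = nn.ModuleList(
            [nn.Conv1d(width, width, kernel_size=3, padding=1) for _ in range(scale - 1)]
        )
        self.conv_out = nn.Conv1d(channels, channels, kernel_size=1)  # restores channel size

    def forward(self, x):                            # x: (batch, channels, time)
        splits = torch.chunk(self.conv_in(x), self.scale, dim=1)
        outs = [splits[0]]                           # y_1: identity branch
        prev = None
        for i in range(1, self.scale):
            inp = splits[i] if prev is None else splits[i] + prev
            prev = self.convs[i - 1](inp)            # K_i, a kernel-3 convolution
            outs.append(prev)
        return self.conv_out(torch.cat(outs, dim=1))  # concatenate all splits
```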
The adapter structures currently used in speech tasks are mostly borrowed from natural language processing or computer vision, where the activation function is typically the rectified linear unit (ReLU). However, different tasks may require different activation functions. Considering that the GELU function produces non-zero outputs for negative inputs and can learn more complex nonlinear mappings in deep networks, the GELU activation function was adopted in the proposed MSFA architecture; its formula is shown below:
$$\mathrm{GELU}(x) = 0.5x\left[1 + \tanh\!\left(\sqrt{\tfrac{2}{\pi}}\left(x + 0.044715x^{3}\right)\right)\right]$$
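The snippet below implements this tanh approximation directly and, as a sanity check, compares it with PyTorch's built-in approximate GELU (the `approximate="tanh"` option requires PyTorch 1.12 or later).

```python
import math
import torch

def gelu_tanh(x: torch.Tensor) -> torch.Tensor:
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x.pow(3))))

x = torch.linspace(-3.0, 3.0, 7)
print(gelu_tanh(x))
print(torch.nn.functional.gelu(x, approximate="tanh"))  # should match closely
```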
After modeling the fine-grained features with the Res2Net module, it is necessary to select, from the variety of local information, the features that benefit the spoofing speech detection task and to down-weight unimportant features. Therefore, this paper utilized the SENet module to incorporate a channel attention mechanism, enhancing the model’s ability to perceive key features. The structure of the module is shown in Figure 5.
The squeeze operation uses global average pooling to map each feature channel of a feature map to a value with a global receptive field. At the end of this operation, the feature map is mapped to a vector with the same length as the number of feature channels.
The excitation operation models the correlation between channels, using two fully connected layers to form a bottleneck structure. It generates a weight for each feature channel, so that feature channels that are useful for the task receive a larger weight.
The reweight operation multiplies each channel of the original feature map by its corresponding weight, completing the recalibration of the feature map along the channel dimension.
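A compact sketch of the squeeze, excitation, and reweight steps is given below; the channel count and reduction ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation over channels for features shaped (batch, channels, time)."""
    def __init__(self, channels: int = 256, reduction: int = 8):
        super().__init__()
        self.excitation = nn.Sequential(       # bottleneck of two fully connected layers
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        w = x.mean(dim=-1)                     # squeeze: global average pooling over time
        w = self.excitation(w)                 # per-channel weights in (0, 1)
        return x * w.unsqueeze(-1)             # reweight: rescale each channel
```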
To preserve the architecture of the pre-trained model, the transformed features must be projected back up to the original dimension, as expressed in the following equation:
$$x = \mathrm{FC}_{\mathrm{up}}(x)$$
Here, $\mathrm{FC}_{\mathrm{up}}$ denotes the fully connected layer that restores the original feature dimension.
To enable the pre-trained model to incorporate task-specific speech representations while preserving its original features, a residual connection is also incorporated into the proposed adapter architecture. The final output of the adapter module can be expressed as:
$$x = x + \mathrm{FC}_{\mathrm{up}}\left(\mathrm{SENet}\left(\mathrm{GELU}\left(\mathrm{Res2Net}\left(\mathrm{FC}_{\mathrm{down}}(x)\right)\right)\right)\right)$$
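Putting the pieces together, the sketch below follows the composition in the equation above and reuses the Res2NetBlock and SEBlock sketches from earlier in this section; the hidden and bottleneck dimensions are assumptions for illustration, not the exact values used in the proposed MSFA.

```python
import torch
import torch.nn as nn

class MSFA(nn.Module):
    """Adapter sketch: x + FC_up(SE(GELU(Res2Net(FC_down(x))))), applied to transformer outputs."""
    def __init__(self, dim: int = 1024, bottleneck: int = 256):
        super().__init__()
        self.fc_down = nn.Linear(dim, bottleneck)          # down projection
        self.res2net = Res2NetBlock(channels=bottleneck, scale=4)
        self.act = nn.GELU()
        self.se = SEBlock(channels=bottleneck)
        self.fc_up = nn.Linear(bottleneck, dim)            # restore the original dimension

    def forward(self, x):                                  # x: (batch, time, dim)
        h = self.fc_down(x).transpose(1, 2)                # (batch, bottleneck, time) for 1D convs
        h = self.se(self.act(self.res2net(h)))
        h = self.fc_up(h.transpose(1, 2))
        return x + h                                       # skip connection keeps the original features
```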

3.3. CAWs

Different layers of the pre-trained model capture semantic, intonational, or phonetic aspects of the input speech, and much information may be lost if only the output of the last transformer layer is used as the feature vector [17]. However, simply summing the feature vectors of all layers element-wise introduces redundant and interfering information. To identify the layers that are more important for the spoofing speech detection task, this paper introduces CAWs to aggregate the outputs of the multi-layer transformer. This mechanism determines the importance of each layer and produces the final speech representation as a weighted sum:
$$\tilde{x} = \sum_{i=1}^{k} w_i x_i$$
Here, $k$ denotes the number of transformer layers and $w_i$ the weight assigned to the output $x_i$ of layer $i$. After obtaining the final speech representation, it is fed into the back-end classifier to discriminate between bona fide and spoofed speech.
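A minimal sketch of the CAWs as one learnable scalar per transformer layer is shown below; normalizing the weights with a softmax is an assumption made for the example, since the normalization scheme is not restated here.

```python
import torch
import torch.nn as nn

class CrossLayerAdaptiveWeights(nn.Module):
    """Weighted aggregation of the k transformer-layer outputs: x_tilde = sum_i w_i * x_i."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))     # w_1 ... w_k, learned jointly

    def forward(self, layer_outputs):           # list of k tensors, each (batch, time, dim)
        stacked = torch.stack(layer_outputs, dim=0)               # (k, batch, time, dim)
        w = torch.softmax(self.weights, dim=0).view(-1, 1, 1, 1)  # normalized layer weights
        return (w * stacked).sum(dim=0)                           # final speech representation

# Usage: caw = CrossLayerAdaptiveWeights(24); x_tilde = caw(per_layer_hidden_states)
```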

4. Experiments

4.1. Dataset

This paper used the train set of ASVspoof2019 LA [20] as the training set and the dev set as the validation set. The test sets of ASVspoof2019 LA and ASVspoof2021 LA [21] were used to evaluate the model’s effectiveness and generalization ability.
In ASVspoof2019 LA, bona fide speech was sourced from the voice cloning toolkit (VCTK) corpus, while spoofed speech was generated using various speech synthesis and voice conversion algorithms. The training and development sets shared the same attack algorithms (A01–A06). However, the test set included spoofed speech generated using various attack algorithms (A07–A19) that differed from those in the training set. The test speech in this dataset was relatively clean, without channel or background noise. The ASVspoof2021 LA dataset contained bona fide and spoofed speech transmitted through telephone and VoIP networks, exhibiting various codec and transmission artifacts. This makes it more challenging and requires higher robustness from the model. Table 1 shows the specific details of the two datasets.

4.2. Experiment Environment and Performance Metrics

The experiments were conducted on an Inspur server (manufactured in Shandong, China) running the CentOS 8.3.2011 operating system. The server was equipped with Tesla V100S GPUs, and CUDA version 10.1 was used. The code was implemented with the PyTorch deep learning framework (version 1.6.1) and Python 3.7.
This study employed EER and the minimum tandem detection cost function (min t-DCF) as evaluation metrics. The formulae are as follows:
$$P_{\mathrm{FAR}}(\theta) = \frac{\#\{\text{spoof trials with score} > \theta\}}{\#\{\text{total spoof trials}\}}$$
$$P_{\mathrm{FRR}}(\theta) = \frac{\#\{\text{bona fide trials with score} \le \theta\}}{\#\{\text{total bona fide trials}\}}$$
$$\mathrm{EER} = P_{\mathrm{FAR}}(\theta_{\mathrm{EER}}) = P_{\mathrm{FRR}}(\theta_{\mathrm{EER}})$$
$$\text{t-DCF}(\theta) = C_{fa} \cdot P_{\mathrm{FAR}}(\theta) \cdot (1 - p_{\mathrm{target}}) + C_{fr} \cdot P_{\mathrm{FRR}}(\theta) \cdot p_{\mathrm{target}}$$
Here, $\theta$ represents the decision threshold, and $P_{\mathrm{FAR}}(\theta)$ and $P_{\mathrm{FRR}}(\theta)$ denote the false acceptance rate (FAR) and false rejection rate (FRR) at threshold $\theta$, respectively. The EER is the error rate at the threshold $\theta_{\mathrm{EER}}$ where FAR equals FRR. The min t-DCF metric extends the EER by incorporating prior probabilities and different error costs, making it more applicable to real-life scenarios, and is obtained by minimizing the t-DCF over $\theta$. Specifically, $C_{fa}$ is the cost of falsely accepting a spoofed sample, $C_{fr}$ is the cost of falsely rejecting a bona fide sample, and $p_{\mathrm{target}}$ and $1 - p_{\mathrm{target}}$ are the prior probabilities of positive and negative samples, respectively.
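For concreteness, the sketch below computes the EER from two score lists by sweeping the threshold over all observed scores; it assumes higher scores indicate bona fide speech and is intended only to illustrate the definitions above, not the official ASVspoof scoring tools.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Return (EER, threshold) at the point where false acceptance and false rejection rates meet."""
    bonafide_scores = np.asarray(bonafide_scores, dtype=float)
    spoof_scores = np.asarray(spoof_scores, dtype=float)
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    far = np.array([(spoof_scores > t).mean() for t in thresholds])      # P_FAR(theta)
    frr = np.array([(bonafide_scores <= t).mean() for t in thresholds])  # P_FRR(theta)
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]

eer, threshold = compute_eer([0.9, 0.8, 0.95, 0.7], [0.1, 0.3, 0.85, 0.2])
print(f"EER = {eer:.2%} at threshold {threshold:.2f}")
```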

4.3. Experimental Setup

In this study, training was conducted for a default of 100 epochs with an early stopping patience of 20 epochs. The Adam optimizer was used ($\beta_1 = 0.99$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) with a batch size of 32 and an initial learning rate of $3 \times 10^{-4}$, which was halved every 10 epochs. The adapter’s down-projection dimension was set to 256, and the output dimension of the FC layer following the pre-trained front-end model was set to 128. For the min t-DCF evaluation metric, $C_{fa} = 10$, $C_{fr} = 1$, and $p_{\mathrm{target}} = 0.01$ were used.
All experiments were conducted over three training rounds, with different random seeds used to initialize the network weights in each round, where all parameters of the pre-trained model were frozen. The training process was divided into two steps: (1) Initial Training. The frozen XLSR front-end was used to train the model and obtain the weights for each transformer layer; (2) Fine-tuning with MSFA. The weights obtained in the previous step were added to the model parameters, and the MSFA module was introduced for further training.
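A sketch of this optimization setup in PyTorch is given below; the model is a placeholder, the training and validation loops are omitted, and the learning-rate halving every 10 epochs is expressed with a StepLR scheduler.

```python
import torch

model = torch.nn.Linear(128, 2)                     # placeholder for the full detection model
trainable = [p for p in model.parameters() if p.requires_grad]  # frozen front-end params excluded

optimizer = torch.optim.Adam(trainable, lr=3e-4, betas=(0.99, 0.999), eps=1e-8)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)  # halve LR every 10 epochs

for epoch in range(100):                            # early stopping on the dev set not shown
    # ... run training and validation for one epoch ...
    scheduler.step()
```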

4.4. Experimental Result and Analysis

To investigate the effectiveness of the MSFA module and CAWs, this paper conducted five sets of ablation experiments on ASVspoof2019 LA and ASVspoof2021 LA. In the experiments, ID 1 denotes the baseline system; ID 2 and ID 3 aggregate the multi-layer features of the pre-trained model with average weighting and CAWs, respectively; and ID 4 and ID 5 apply CAWs together with the original adapter and with the proposed MSFA module, respectively. The experimental results are shown in Table 2.
The following conclusions can be drawn from Table 2. (1) Comparing ID 1, ID 2, and ID 3 shows that, compared with using only the last-layer output of the pre-trained model as the speech feature, aggregating multi-layer features improves spoofing speech detection performance. The reason is that the proposed CAWs make the model focus on the layers that are more relevant to the spoofing speech detection task; features beneficial to the task are assigned larger weights, so the extracted speech representations are more discriminative. (2) Comparing ID 4 and ID 5 shows that an adapter module that fine-tunes the pre-trained model helps it extract more information relevant to the spoofing speech detection task while maintaining its original feature extraction capability. On this basis, the MSFA module introduces residual convolution blocks and the squeeze-and-excitation mechanism according to the characteristics of the spoofing speech detection task; it captures task-relevant multi-scale local features that complement the global features of the transformer, further improving the accuracy of the model. (3) To investigate the impact of different random seeds on model performance, we conducted three sets of experiments with different random seeds. The results varied slightly across seeds but remained within a normal fluctuation range, in line with expectations.
Overall, the average EER of the MSFA-CAW method on the ASVspoof2019 LA test set was only 0.36%, a decrease of 2.93 percentage points compared with the baseline system. It should be noted that, compared with ASVspoof2019 LA, the ASVspoof2021 LA test set adds background noise and various codec artifacts, which makes the detection task more challenging. Our method achieved an average EER of 4.29% on ASVspoof2021 LA, a decrease of 10.98 percentage points compared with the baseline system, indicating a large improvement in detection performance. These results demonstrate the strong performance of the MSFA-CAW method.
To further validate the effectiveness of the proposed method, Table 3 and Table 4 compare it with other systems on the ASVspoof2019 LA and ASVspoof2021 LA datasets, respectively. As shown in Table 3 and Table 4, the proposed method achieved average EERs of 0.36% and 4.29% on ASVspoof2019 LA and ASVspoof2021 LA, respectively, demonstrating good detection performance and generalization ability. In the tables, the W2V2-XLSR model, the WavLM + MFA model, and our method all use a pre-trained model as the front-end feature extractor; as a result, their accuracy is considerably higher than that of models trained from scratch, such as the graph attention-based RawGAT-ST and the LCNN-LSTM-sum deep neural network. This confirms the effectiveness of fine-tuning a pre-trained model as a front-end feature extractor for downstream speech tasks.
Meanwhile, WavLM + MFA and W2V2-XLSR fully fine-tune the pre-trained model, whereas our method only adds the MSFA and CAW modules and keeps most of the model parameters frozen. The number of parameters for the final model was 28.3 million, only 8.93% of that of the baseline wav2vec 2.0 XLSR model. Nevertheless, it still achieved an approximately 14% performance improvement on the ASVspoof2019 LA and ASVspoof2021 LA datasets, yielding good results with far fewer trainable parameters.

4.5. Visual Analytics

To further illustrate the effectiveness of our method, the speech representations extracted by the baseline model and by our model were visualized, as shown in Figure 6. After dimensionality reduction, some of the bona fide and spoofed speech representations extracted by the baseline system overlap with each other. In contrast, the representations extracted by our model form two roughly separate clusters, with a clearer boundary between bona fide and spoofed speech.
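A sketch of how such a visualization can be produced with t-SNE is shown below; the embeddings and labels are random placeholders standing in for the 3000 test-set representations.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embeddings = np.random.randn(3000, 128)             # placeholder (num_utterances, embedding_dim)
labels = np.array([0] * 1500 + [1] * 1500)          # 0 = bona fide, 1 = spoofed

points = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
for label, name in [(0, "bona fide"), (1, "spoofed")]:
    mask = labels == label
    plt.scatter(points[mask, 0], points[mask, 1], s=4, label=name)
plt.legend()
plt.title("t-SNE of speech representations")
plt.show()
```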

5. Conclusions

Fine-tuning pre-trained speech models to make them applicable to various downstream speech tasks is an important research problem. As existing adapter architectures cannot fully adapt the features required by the downstream task, this paper proposed a new adapter architecture, MSFA, which extracts local information from speech and complements the global information in the pre-trained model for the spoofing speech detection task. Furthermore, as existing fine-tuning methods do not distinguish the different information captured by different layers of the pre-trained model, cross-layer adaptive weights (CAWs) were added to obtain more discriminative speech embeddings. The proposed method achieved an average EER of 0.36% on ASVspoof2019 LA and showed good generalization ability on ASVspoof2021 LA.
In the future, we plan to investigate how the number of inserted MSFA modules affects model performance, aiming to improve performance while using fewer parameters. Additionally, most existing spoofing speech detection models rely on raw audio as input, which requires access to the complete speech signal and thus raises significant privacy concerns. Future research will explore decoupling semantic information and acoustic features from raw audio, with the goal of advancing the practical application of the model.

Author Contributions

Conceptualization, H.Y., L.Z. and B.N.; Methodology, H.Y. and B.N.; Validation, H.Y., L.Z. and B.N.; Formal analysis, H.Y., L.Z., B.N. and X.Z.; Writing—original draft preparation, H.Y.; Writing—review and editing, H.Y., L.Z., B.N. and X.Z.; Supervision, L.Z., B.N. and X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This project received funding from the Shanxi Major Science and Technology Special Project under grant 202301020101001, the Key R&D Program of Shanxi Province under grant 202302010101004, and the Shanxi Provincial Applied Basic Research Program under grant 202203021222093.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ASV  Automatic Speaker Verification
CAWs  Cross-Layer Adaptive Weights
CQCCs  Constant Q Cepstral Coefficients
CNNs  Convolutional Neural Networks
DNNs  Deep Neural Networks
EER  Equal Error Rate
FFN  Feed-Forward Network
FAR  False Acceptance Rate
FRR  False Rejection Rate
FC  Fully Connected Layer
GAP  Global Average Pooling Layer
GELU  Gaussian Error Linear Unit
LA  Logical Access
LCNN  Lightweight Convolutional Neural Network
LFCCs  Linear Frequency Cepstral Coefficients
LSTMs  Long Short-Term Memory Networks
MFCCs  Mel-Frequency Cepstral Coefficients
min t-DCF  Minimum Tandem Detection Cost Function
MSFA  Multi-Scale Feature Adapter
PF  Partially Fake Audio Detection
ReLU  Rectified Linear Unit
SE  Squeeze-and-Excitation Mechanism
t-SNE  t-Distributed Stochastic Neighbor Embedding

References

  1. Evans, N.; Kinnunen, T.; Yamagishi, J. Spoofing and countermeasures for automatic speaker verification. In Proceedings of the 2013 14th Conference of the International Speech Communication Association, Lyon, France, 25–29 August 2013; pp. 925–929. [Google Scholar]
  2. Sathya, P.; Ramakrishnan, S. Non-redundant frame identification and keyframe selection in DWT-PCA domain for authentication of video. IET Image Process. 2020, 14, 366–375. [Google Scholar] [CrossRef]
  3. Balamurali, B.T.; Lin, K.E.; Lui, S.; Chen, J.-M.; Herremans, D. Toward robust audio spoofing detection: A detailed comparison of traditional and learned features. IEEE Access 2019, 7, 84229–84241. [Google Scholar] [CrossRef]
  4. Tak, H.; Patino, J.; Todisco, M.; Nautsch, A.; Evans, N.; Larcher, A. End-to-end anti-spoofing with rawnet2. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6369–6373. [Google Scholar]
  5. Liu, X.; Liu, M.; Wang, L.; Lee, K.A.; Zhang, H.; Dang, J. Leveraging positional-related local-global dependency for synthetic speech detection. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 6–10 June 2023; pp. 1–5. [Google Scholar]
  6. Yoon, S.H.; Yu, H.J. Multiple points input for convolutional neural networks in replay attack detection. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual Conference, 4–9 May 2020; pp. 6444–6448. [Google Scholar]
  7. Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X.; et al. WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE J. Sel. Top. Signal Process. 2022, 16, 1505–1518. [Google Scholar] [CrossRef]
  8. Hsu, W.N.; Bolte, B.; Tsai, Y.H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3451–3460. [Google Scholar] [CrossRef]
  9. Chen, S.; Wu, Y.; Wang, C.; Chen, Z.; Chen, Z.; Liu, S.; Wu, J.; Qian, Y.; Wei, F.; Li, J.; et al. UniSpeech-SAT: Universal speech representation learning with speaker aware pre-training. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 6152–6156. [Google Scholar]
  10. Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. Wav2vec 2.0: A framework for self-supervised learning of speech representations. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; pp. 12449–12460. [Google Scholar]
  11. Suresh, V.; Ait-Mokhtar, S.; Brun, C.; Calapodescu, I. An adapter-based unified model for multiple spoken language processing tasks. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 10676–10680. [Google Scholar]
  12. Lv, Z.; Zhang, S.; Tang, K.; Hu, P. Fake audio detection based on unsupervised pretraining models. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 9231–9235. [Google Scholar]
  13. Wang, X.; Yamagishi, J. Investigating self-supervised front ends for speech spoofing countermeasures. arXiv 2021, arXiv:2111.07725. [Google Scholar]
  14. Wu, H.; Zhang, J.; Zhang, Z.; Zhao, W.; Gu, B.; Guo, W. Robust Spoof Speech Detection Based on Multi-Scale Feature Aggregation and Dynamic Convolution. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 10156–10160. [Google Scholar]
  15. Thomas, B.; Kessler, S.; Karout, S. Efficient adapter transfer of self-supervised speech models for automatic speech recognition. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 7102–7106. [Google Scholar]
  16. Peng, J.; Stafylakis, T.; Gu, R.; Plchot, O.; Mošner, L.; Burget, L.; Černocký, J. Parameter efficient transfer learning of pre-trained transformer models for speaker verification using adapters. In Proceedings of the ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 6–10 June 2023; pp. 1–5. [Google Scholar]
  17. Li, Y.; Huang, H.; Chen, Z.; Guan, W.; Lin, J.; Li, L.; Hong, Q. SR-HuBERT: An efficient pre-trained model for speaker verification. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 11591–11595. [Google Scholar]
  18. Babu, A.; Wang, C.; Tjandra, A.; Lakhotia, K.; Xu, Q.; Goyal, N.; Singh, K.; von Platen, P.; Saraf, Y.; Pino, J.; et al. XLS-R: Self-supervised cross-lingual speech representation learning at scale. arXiv 2021, arXiv:2111.09296. [Google Scholar]
  19. Chen, Z.; Chen, S.; Wu, Y.; Qian, Y.; Wang, C.; Liu, S.; Qian, Y.; Zeng, M. Large-scale self-supervised speech representation learning for automatic speaker verification. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 6147–6151. [Google Scholar]
  20. Nautsch, A.; Wang, X.; Evans, N.; Kinnunen, T.H.; Vestman, V.; Todisco, M.; Delgado, H.; Sahidullah, M.; Yamagishi, J.; Lee, K.A. ASVspoof 2019: Spoofing countermeasures for the detection of synthesized, converted and replayed speech. IEEE Trans. Biom. Behav. Identity Sci. 2021, 3, 252–265. [Google Scholar] [CrossRef]
  21. Yamagishi, J.; Wang, X.; Todisco, M.; Sahidullah, M.; Patino, J.; Nautsch, A.; Liu, X.; Lee, K.A.; Kinnunen, T.; Evans, N.; et al. ASVspoof 2021: Accelerating progress in spoofed and deepfake speech detection. arXiv 2021, arXiv:2109.00537. [Google Scholar]
  22. Guo, Y.; Huang, H.; Chen, X.; Zhao, H.; Wang, Y. Audio deepfake detection with self-supervised wavlm and multi-fusion attentive classifier. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 12702–12706. [Google Scholar]
  23. Jung, J.; Heo, H.S.; Tak, H.; Shim, H.J.; Chung, J.S.; Lee, B.J.; Yu, H.J.; Evans, N. AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 6367–6371. [Google Scholar]
  24. Tak, H.; Jung, J.W.; Patino, J.; Kamble, M.; Todisco, M.; Evans, N. End-to-end spectro-temporal graph attention networks for speaker verification anti-spoofing and speech deepfake detection. arXiv 2021, arXiv:2107.12710. [Google Scholar]
  25. Wang, X.; Yamagishi, J. A comparative study on recent neural spoofing countermeasures for synthetic speech detection. arXiv 2021, arXiv:2103.11326. [Google Scholar]
  26. Li, X.; Li, N.; Weng, C.; Liu, X.; Su, D.; Yu, D.; Meng, H. Replay and synthetic speech detection with res2net architecture. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6354–6358. [Google Scholar]
  27. Ma, Y.; Ren, Z.; Xu, S. RW-Resnet: A novel speech anti-spoofing model using raw waveform. arXiv 2021, arXiv:2108.05684. [Google Scholar]
Figure 1. Adapter architecture. There are four layers: “Down Project” is a fully connected layer for down-projecting the parameter dimension, “Activation” is a nonlinear activation function, “Up Project” is a fully connected layer to restore the original parameter dimension, and “LN” is a layer normalization.
Figure 2. Pipeline of our proposed model. (Left) Feature extractor, where multi-scale feature adapters are inserted between adjacent transformer blocks. (Upper right) LLGF classifier. (Lower right) MSFA block.
Figure 3. The architecture of XLSR (a large multilingual wav2vec 2.0).
Figure 4. Illustration of a Res2Net block (scale dimension s = 4 ).
Figure 5. A squeeze-and-excitation block. There are three operations: squeeze, excitation, and reweight.
Figure 6. t-distributed stochastic neighbor embedding (t-SNE) visualization of speech embeddings. We visualized 3000 utterances (1500 bona fide, 1500 spoofed) from the ASVspoof2019 LA test set. (a) Baseline model. (b) MSFA-CAW.
Table 1. ASVspoof2019 LA and ASVspoof2021 LA datasets.
Dataset | Subset | Bona Fide Speeches | Spoofed Speeches
2019 LA | Train | 2580 | 22,800
2019 LA | Dev | 2548 | 22,296
2019 LA | Test | 7355 | 63,882
2021 LA | Test | 14,816 | 133,360
Table 2. Model EER (%) and ablation experiments on 19 LA and 21 LA. I, II, and III represent the experimental results of three different random seed initializations. Average refers to the mean result of the three sets of experiments.
ID | System | 2019 LA (I) | 2019 LA (II) | 2019 LA (III) | 2019 LA (Average) | 2021 LA (I) | 2021 LA (II) | 2021 LA (III) | 2021 LA (Average)
1 | Baseline | 3.43 | 3.35 | 3.10 | 3.29 | 16.04 | 16.63 | 13.13 | 15.27
2 | Average weight | 1.48 | 1.42 | 1.53 | 1.48 | 10.57 | 11.35 | 10.50 | 10.81
3 | CAW | 1.17 | 1.27 | 1.29 | 1.24 | 9.49 | 9.48 | 9.51 | 9.49
4 | Adapter-CAW | 1.09 | 0.85 | 0.94 | 0.96 | 8.58 | 8.79 | 7.75 | 8.37
5 | MSFA-CAW | 0.35 | 0.39 | 0.34 | 0.36 | 4.23 | 3.99 | 4.66 | 4.29
Table 3. Comparison with existing models on the 2019 LA dataset.
System | EER (%) | min t-DCF
MSFA-CAW | 0.36 | 0.0269
WavLM + MFA [22] | 0.42 | 0.0126
Rawformer [5] | 0.59 | 0.0184
AASIST [23] | 0.83 | 0.0275
RawGAT-ST [24] | 1.06 | 0.0355
LCNN-LSTM-sum [25] | 1.92 | 0.0524
W2V2-XLSR, fine-tuned [13] | 1.35 | 0.1003
SE-Res2Net50 [26] | 2.50 | 0.0743
RW-Resnet [27] | 2.98 | 0.0820
W2V2-XLSR, fixed [13] | 3.11 | 0.1320
Table 4. EER (%) of the proposed method and existing models on the 2021 LA test set.
System | EER (%)
MSFA-CAW | 4.29
Rawformer [5] | 4.53
WavLM + MFA [22] | 5.08
wav2vec-XLSR [18] | 6.53
AASIST [23] | 6.24
W2V2-XLSR, fine-tuned [13] | 7.35