Article

BPCNN: Bi-Point Input for Convolutional Neural Networks in Speaker Spoofing Detection

1 Department of Artificial Intelligence, Kongju National University, Cheonan 31080, Korea
2 School of Computer Science, University of Seoul, Seoul 02504, Korea
* Author to whom correspondence should be addressed.
Sensors 2022, 22(12), 4483; https://doi.org/10.3390/s22124483
Submission received: 12 May 2022 / Revised: 2 June 2022 / Accepted: 7 June 2022 / Published: 14 June 2022
(This article belongs to the Special Issue Machine Learning-Based Audio Signal Processing via the Use of Sensors)

Abstract

We propose a method, called bi-point input, for convolutional neural networks (CNNs) that handle variable-length input features (e.g., speech utterances). Feeding input features into a CNN in a mini-batch unit requires that all features in each mini-batch have the same shape. A set of variable-length features cannot be directly fed into a CNN because they commonly have different lengths. Feature segmentation is a dominant method for CNNs to handle variable-length features, where each feature is decomposed into fixed-length segments. A CNN receives one segment as an input at one time. However, a CNN can consider only the information of one segment at one time, not the entire feature. This drawback limits the amount of information available at one time and consequently results in suboptimal solutions. Our proposed method alleviates this problem by increasing the amount of information available at one time. With the proposed method, a CNN receives a pair of two segments obtained from a feature as an input at one time. The two segments generally cover different time ranges and therefore carry different information. We also propose various combination methods and provide rough guidance for setting a proper segment length without requiring evaluation. We evaluate the proposed method on spoofing detection tasks using the ASVspoof 2019 database under various conditions. The experimental results reveal that the proposed method reduces the equal error rate (EER) by approximately 17.2% and 43.8% relative, on average, for the logical access (LA) and physical access (PA) tasks, respectively.

1. Introduction

Automatic speaker verification (ASV) is a technique to verify a user’s identity from his or her utterances. It is convenient and has provided consistently low verification errors. However, ASV is vulnerable to various spoofing attacks that attempt to deceive the system and manipulate its results using a fake utterance. A spoofed utterance is one artificially manipulated to sound like the target speaker’s utterance. Current spoofing attacks include text-to-speech (TTS) synthesis, voice conversion (VC), and replay attacks. Although an identity claim made with a spoofed utterance must be rejected to prevent the harm caused by impersonation, ASV may fail to reject such claims because it is hard to distinguish a spoofed utterance from a human being’s genuine utterance.
The development of spoofing detection systems to address the shortcomings of ASV has recently gained importance. Spoofing detection is a technique to protect ASV systems from spoofing attacks by distinguishing between genuine and spoofed utterances. Spoofing detection systems reject the identity claim with a spoofed utterance, regardless of how similar the spoofed utterance is to the target speaker’s genuine utterance. For research and development on spoofing detection techniques, ASV spoofing and countermeasure (ASVspoof) challenges have been organized periodically [1,2,3,4]. In this paper, we focus on ASVspoof 2019 [3], which was the first to address all major types of spoofing attacks (i.e., TTS, VC, and replay attacks).
It is essential for spoofing detection to capture the differences of the frequency attributes between genuine and spoofed utterances, regardless of the type of spoofing attack. The frequency attributes of genuine and spoofed utterances differ [5,6]. Furthermore, the discriminative attributes between genuine and spoofed utterances are spread across the entire time domain of an utterance and there is no strong time dependency. This locally connected information can be effectively modeled by convolutional neural networks (CNNs) [7]. Accordingly, many cutting-edge spoofing detection systems are based on CNNs [8,9,10,11].
Like other types of neural networks, a CNN is also trained in a mini-batch unit [12]. When feeding input features into a CNN in a mini-batch unit, all features in each mini-batch must have the same size. However, the lengths of speech utterances differ because speakers have different uttering styles, even if all speakers are constrained to utter only the same phrase. Therefore, all input utterances must have the same length before we feed them into a CNN (although we use the word ‘utterance’ here, this applies to any kind of sequence, e.g., a waveform signal or a sequence of acoustic feature vectors). There are several ways to make all input features in each mini-batch have the same length, which can be categorized into three methods: (1) padding only, (2) both padding and truncation, and (3) feature segmentation. Throughout this paper, the target length refers to the length that all features in a mini-batch must have.
The first method is to apply padding only, by repeating the feature’s own frames or filling with zeros (i.e., zero-padding). In other words, no feature is truncated. The target length is set long enough (e.g., the length of the longest feature over the dataset) so that no feature is longer than the target length, as in [13,14,15]. This method has the advantage of no information loss caused by truncation. However, padding causes a data redundancy problem: the padded frames are not novel, being either copies of the original frames (for repeating) or meaningless zero values (for zero-padding). Thus, padding adds unnecessary information that offers no diversity or novelty, which may not help improve classification. Furthermore, it increases the computational burden and memory usage. This side effect worsens as the standard deviation of the utterance lengths increases.
The second method is to apply both padding and truncation. The target length is set to a moderate length, neither too long nor too short (e.g., the average length), as in [9,10,15,16,17,18,19,20,21,22,23,24]. Features shorter than the target length are padded and those longer than the target length are truncated. This method has a smaller computational burden and memory usage than the first method. However, it causes information distortion due to both padding and truncation: padding adds unnecessary information, as described above, and truncation loses information because the truncated frames cannot be used.
In [25], an on-the-fly data loader was proposed to handle variable-length input features. It can be regarded as an extension of the aforementioned second method that enables batch-wise variable-length training. For each iteration, the target length (common to all the features in each mini-batch) and the start points (for truncation only; different for each input feature in a mini-batch) are randomly selected. This method achieved improved performance on both text-independent speaker verification and language identification [25]. However, our findings reveal that the on-the-fly data loader is ineffective for improving spoofing detection performance, except in some cases. This issue will be discussed in Section 5.2.
The third method is feature segmentation [8,11,26,27,28,29], which breaks each feature down into fixed-length segments. The target length corresponds to the length of a segment. A CNN receives one segment at one time rather than an entire feature. This method seems to combine the benefits of the previous two methods: it can reduce the information distortion caused by padding and truncation and is therefore more flexible with respect to differences in utterance length. In this paper, we use feature segmentation as the baseline method.
However, segmentation limits the amount of information available at one time because a CNN can consider only one segment at one time, not an entire feature. In spoofing detection, though not only there, this shortcoming is detrimental to performance. For better detection, it is important to capture and model as many of the discriminative attributes in an input feature as possible and then aggregate them into a fixed-size embedding. Moreover, the discriminative attributes are spread across the entire time domain, rather than being concentrated in certain time intervals, as mentioned earlier. In other words, feature segmentation inevitably leads to suboptimal performance on spoofing detection. This problem motivated us to consider how to increase the amount of information available at one time.
We previously proposed a method called multiple points input for CNN [27]. It increases the amount of information available at one time by feeding a pair of two segments as an input to a CNN. It exhibited substantial reductions in detection error in the ASVspoof 2019 PA scenario (i.e., the replay attack detection task). Nonetheless, some issues remain worthy of further discussion because the experiments in [27] were performed under restricted conditions: (1) the use of only one type of CNN-based model, the squeeze-and-excitation [30] residual network [31] (SE-ResNet), (2) a fixed target length of 400 frames, and (3) evaluation only on the replay attack detection task. In this paper, therefore, we present comprehensive experimental results conducted under various conditions and in-depth analyses to verify the ability of the proposed method to reduce detection errors. The contributions of this paper are summarized as follows:
  • Proposing a bi-point input for CNN (BPCNN). It feeds a pair of two segments, rather than one segment, into a CNN at one time. The main advantage of BPCNN is that it increases the amount of information available at one time while hardly changing the CNN structure.
  • Proposing various methods for combining the two segments at various levels: embedding-level combination, feature map-level combination, and two-channel input.
  • Evaluating the performance of the proposed method on both the logical and physical access tasks under various conditions and analyzing the effect and strength of the proposed method via ablation studies.
The structure of this paper is organized as follows. Section 2 outlines the conventional feature segmentation method. Section 3 introduces the proposed method. Section 4 describes the experimental setup, and Section 5 analyzes the obtained results. Finally, Section 6 concludes the paper.

2. Conventional Feature Segmentation

Feature segmentation is a method that decomposes all the variable-length features into fixed-length segments along the time axis. Figure 1 and Algorithm 1 illustrate the overview and pseudocode of conventional feature segmentation. Let $X = [x_0, \dots, x_{T-1}]$ be a sequence of T feature frames extracted from an utterance (e.g., a log power spectrogram). With feature segmentation, X is divided into N segments $\{F_i\}_{i=0}^{N-1}$ using a sliding window of length M (< T) frames with a shift of L frames (this explanation is only for the case of M < T; the case of M ≥ T will be discussed later). All segments have the same length M, which is the target length. The number of segments N differs for each feature and is proportional to the feature length T. The i-th segment $F_i$ corresponds to the sequence of M frames from the iL-th frame $x_{iL}$ to the (iL+M−1)-th frame $x_{iL+M-1}$, i.e., $F_i = [x_{iL}, \dots, x_{iL+M-1}]$, where $iL + M \le T$ for all i.
Algorithm 1 Pseudocode of Conventional Feature Segmentation
Input
   X = [x_0, …, x_{T−1}]: A feature with the length of T
   M: The segment length
   L: The shift interval for segmentation
Output
   S: A list of the segments
1.  if T > M:
2.      S = [ ]                              # Empty list
3.      N = ⌊(T − M)/L⌋ + 1                  # The number of full segments
4.      r = mod(T − M, L)                    # The remainder
5.      for i = 0 to N − 1:
6.          F_i = [x_{iL}, …, x_{iL+M−1}]    # The i-th segment
7.          S.append(F_i)
8.      if r > 0:
9.          F_N = [x_{T−M}, …, x_{T−1}]      # The last M frames of X
10.         S.append(F_N)
11.     return S
12. else if T < M:
13.     F_0 = pad(X, M)                      # Extend X to have the length of M
14.     return {F_0}
15. else:                                    # The case of T == M
16.     F_0 = X                              # X itself becomes a segment
17.     return {F_0}
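For concreteness, the following is a minimal Python sketch of Algorithm 1; the function names and the NumPy representation of a feature as a (T × F) array are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def pad_feature(x: np.ndarray, m: int) -> np.ndarray:
    """Extend a feature x of length T < m to length m by repeating its own frames."""
    t = x.shape[0]
    reps = m // t                        # Q full repetitions of X
    rest = m - reps * t                  # remaining frames taken from the start of X
    return np.concatenate([np.tile(x, (reps, 1)), x[:rest]], axis=0)

def segment_feature(x: np.ndarray, m: int, l: int) -> list:
    """Conventional feature segmentation (Algorithm 1): x has shape (T, F)."""
    t = x.shape[0]
    if t > m:
        n = (t - m) // l + 1             # number of full segments
        r = (t - m) % l                  # remaining frames after segmentation
        segments = [x[i * l:i * l + m] for i in range(n)]
        if r > 0:
            segments.append(x[t - m:])   # the last m frames of X
        return segments
    elif t < m:
        return [pad_feature(x, m)]       # pad a short feature up to length m
    else:
        return [x]                       # T == m: the feature itself is one segment
```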
We do not consider a unified feature map approach [11,16,22] for segmentation. In the unified feature map approach, all features are first extended to a common, sufficiently long length (e.g., usually the length of the longest feature over the dataset) before segmentation. Therefore, all features have the same number of segments regardless of their original lengths. However, we empirically found no significant difference in detection error with and without the unified feature map. Moreover, our approach (i.e., without the unified feature map) is more efficient in terms of data redundancy, computation, and memory usage. This is why the unified feature map approach is not considered throughout this paper.
The following two issues should be considered when implementing feature segmentation. The first issue is how to handle features shorter than or equal to the target length (i.e., M ≥ T), to which segmentation cannot be applied. As described above, M should be set to a moderate length, neither too long nor too short. In most cases, therefore, there are some features shorter than M. In our approach, we extend each feature shorter than M to the length M by padding M − T frames taken from the feature itself. If the number of additionally required frames M − T is less than the feature length T (i.e., M − T < T), a padded segment $F_0$ is obtained by appending the first M − T frames of X to the end of X: $F_0 = \mathrm{pad}(X, M) = [x_0, \dots, x_{T-1}, x_0, \dots, x_{M-T-1}] = [X, x_0, \dots, x_{M-T-1}]$. Otherwise (i.e., M − T ≥ T), $F_0$ is defined as $Q = \lfloor M/T \rfloor$ repetitions of X followed by the first M − QT frames of X: $F_0 = \mathrm{pad}(X, M) = [X, \dots, X, x_0, \dots, x_{M-QT-1}]$. When M = T, $F_0$ is just X itself. In turn, there is only one segment $F_0$ for a feature whose length is shorter than or equal to M.
The second issue is how to handle the remaining frames, fewer than M, left after segmentation. For each feature whose length is T (> M), there are $R = \mathrm{mod}(T-M, L)$ frames left after segmentation, where $\mathrm{mod}(a, b)$ is the modulo operation with dividend a and divisor b. We also use the remaining R frames so that the complete set of frames is exploited. When R > 0, we take an additional segment corresponding to the last M frames of X, regardless of the number of remaining frames R. This segment is defined as the last segment $F_N = [x_{T-M}, \dots, x_{T-1}]$.
In the training phase, each segment is treated as an individual datum, regardless of which feature the segment was obtained from. Even so, all segments obtained from a feature share the same class label. For instance, when the class label of a feature X is c, all the segments $\{F_i\}_{i=0}^{N-1}$ from X also have the class label c. In the evaluation phase, the score (e.g., the log probability of the genuine class) of each feature is calculated by aggregating (e.g., averaging) the scores of all the segments obtained from that feature.

3. The Proposed Method

The proposed method alleviates the problem of the conventional method that the information available at one time is limited because the CNN takes only one segment $F_i$ at one time. The proposed method consists of two steps: (1) bidirectional feature segmentation (cf. unidirectional feature segmentation in the conventional method) and (2) bi-point input (cf. one-point input in the conventional method).

3.1. Bidirectional Feature Segmentation

In the conventional feature segmentation described in Section 2, a feature is decomposed along only one direction: the positive time direction. In contrast, in the proposed method, a feature is decomposed along two opposite directions: the positive and negative time directions. This is inspired by the bidirectional recurrent neural network (BRNN) [32]. BRNNs were developed to increase the amount of available information. They can obtain twice as much information as standard RNNs by using the information in both the forward (responsible for the positive time direction) and backward (responsible for the negative time direction) sequences.
Figure 2 and Algorithm 2 illustrate the overview and pseudocode of bidirectional feature segmentation. Given a feature $X = [x_0, \dots, x_{T-1}]$, bidirectional feature segmentation is processed in two steps: a forward step and a backward step. The forward step is identical to conventional feature segmentation. In the forward step, X is divided by moving the sliding window, whose length and shift interval are M (< T) and L, respectively, along the positive time direction (i.e., from 0 to T − 1) to obtain N forward segments $\{F_i\}_{i=0}^{N-1}$. The i-th forward segment $F_i$ is $[x_{iL}, \dots, x_{iL+M-1}]$, as explained in Section 2. In contrast, in the backward step, X is divided by moving the sliding window along the negative time direction (i.e., from T − 1 to 0). After the backward step is complete, we obtain N backward segments $\{B_i\}_{i=0}^{N-1}$. Let $\tilde{X} = [x_{T-1}, \dots, x_0]$ be X with the frame order flipped. The i-th backward segment $B_i$ is $[x_{T-iL-1}, \dots, x_{T-iL-M}]$, where $iL + M \le T$ for all i.
Algorithm 2 Pseudocode of Bidirectional Feature Segmentation
Input
   X = [x_0, …, x_{T−1}]: A feature with the length of T
   X~ = [x_{T−1}, …, x_0]: X with the frame order flipped
   M: The segment length
   L: The shift interval for segmentation
Output
   S: A list of the segment pairs
1.  if T > M:
2.      S = [ ]                                  # Empty list
3.      N = ⌊(T − M)/L⌋ + 1                      # The number of full segments
4.      r = mod(T − M, L)                        # The remainder
5.      for i = 0 to N − 1:
6.          F_i = [x_{iL}, …, x_{iL+M−1}]        # The i-th forward segment
7.          B_i = [x_{T−iL−1}, …, x_{T−iL−M}]    # The i-th backward segment
8.          S.append((F_i, B_i))
9.      if r > 0:
10.         F_N = [x_{T−M}, …, x_{T−1}]          # The last M frames of X
11.         B_N = [x_{M−1}, …, x_0]              # The last M frames of X~
12.         S.append((F_N, B_N))
13.     return S
14. else if T < M:
15.     F_0 = pad(X, M)                          # Extend X to have the length of M
16.     B_0 = pad(X~, M)                         # Extend X~ to have the length of M
17.     return {(F_0, B_0)}
18. else:                                        # The case of T == M
19.     F_0 = X                                  # X itself becomes a forward segment
20.     B_0 = X~                                 # X~ itself becomes a backward segment
21.     return {(F_0, B_0)}
When M > T, one forward segment $F_0$ and one backward segment $B_0$ are obtained without segmentation, as in the conventional method. The forward segment $F_0$ is obtained exactly as in the conventional method. The backward segment $B_0$ is obtained as follows. If M − T < T, where M − T corresponds to the number of additionally required frames, $B_0$ is obtained by appending the first M − T frames of $\tilde{X}$ to the end of $\tilde{X}$: $B_0 = \mathrm{pad}(\tilde{X}, M) = [x_{T-1}, \dots, x_0, x_{T-1}, \dots, x_{2T-M}]$ (the first M − T frames of $\tilde{X}$ correspond to the last M − T frames of X in flipped order, $[x_{T-1}, \dots, x_{2T-M}]$, where $2T-M = T-(M-T)$). Otherwise (i.e., M − T ≥ T), $B_0$ is defined as $Q = \lfloor M/T \rfloor$ repetitions of $\tilde{X}$ followed by the first M − QT frames of $\tilde{X}$: $B_0 = \mathrm{pad}(\tilde{X}, M) = [\tilde{X}, \dots, \tilde{X}, x_{T-1}, \dots, x_{(Q+1)T-M}]$ (the first M − QT frames of $\tilde{X}$ correspond to the last M − QT frames of X in flipped order, $[x_{T-1}, \dots, x_{(Q+1)T-M}]$, where $(Q+1)T-M = T-(M-QT)$). For the case of M = T, $B_0$ is just $\tilde{X}$ itself.
If R frames remain after bidirectional feature segmentation, the last forward segment $F_N$ is defined as in conventional feature segmentation. Similarly, the last backward segment $B_N$ is defined as the last M frames of $\tilde{X}$, $B_N = [x_{M-1}, \dots, x_0]$.
All the other conditions not described are the same as those in conventional feature segmentation. For example, the class label of the feature is also shared for all backward segments.
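A minimal sketch of bidirectional feature segmentation follows, assuming the segment_feature helper from the sketch in Section 2; it relies on the observation that the backward segments of X are exactly the forward segments of the time-flipped feature.

```python
import numpy as np

def bidirectional_segment(x: np.ndarray, m: int, l: int) -> list:
    """Bidirectional feature segmentation (Algorithm 2) as (forward, backward) pairs."""
    x_flipped = x[::-1]                           # X~: X with the frame order reversed
    forward = segment_feature(x, m, l)            # F_0, ..., F_N
    backward = segment_feature(x_flipped, m, l)   # B_0, ..., B_N
    return list(zip(forward, backward))           # the i-th pair is (F_i, B_i)
```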

3.2. Bi-Point Input

With bidirectional feature segmentation, we obtain two sets of segments from an utterance: a set of forward segments $\{F_i\}_{i=0}^{N-1}$ and a set of backward segments $\{B_i\}_{i=0}^{N-1}$. With bi-point input, the i-th forward segment $F_i$ forms a pair with the i-th backward segment $B_i$, so there are N pairs of segments. The CNN receives one pair (i.e., two segments, $F_i$ and $B_i$) as an input at one time, rather than only one (forward) segment. Therefore, the amount of information available at one time is up to doubled. It is usually less than double that of the conventional one-point input method because the time ranges of $F_i$ and $B_i$ can partially or completely overlap for some i (e.g., around the middle of the feature) under our approach. The wider the overlapping range, the smaller the gain in the amount of information available at one time.
It is reasonable that each pair consists of $F_i$ and $B_i$, not $F_i$ and the (N−1−i)-th backward segment $B_{N-1-i}$. Recall that the primary purpose of the bi-point input is to increase the amount of information available at one time. If the i-th pair were made up of $F_i = [x_{iL}, \dots, x_{iL+M-1}]$ and $B_{N-1-i} = [x_{iL+M-1}, \dots, x_{iL}]$ (for brevity, we assume no remainder frames after segmentation, that is, $R = \mathrm{mod}(T-M, L) = 0$; even if R > 0, the increase in the amount of information available at one time is negligible and has little effect on reducing the detection error), both segments in each pair (i.e., $F_i$ and $B_{N-1-i}$) would always cover the exact same time range. The only difference between the two segments would be the frame order, which indicates that both segments in each pair always carry the same information. In this case, therefore, it is difficult to increase the amount of information available at one time. Even if the detection error were reduced, it might be due to data augmentation based on flipping the time order of the feature, rather than to an increase in the information available at one time. We will discuss this issue in detail in Section 5.2.
Recall that $F_i$ and $B_i$ have opposite frame orders (explained in Section 3.1), which not only avoids meaningless operations but also has the side effect of data augmentation. Although $F_i$ and $B_i$ in the same pair generally cover different time ranges, there can be cases where $F_i$ and $B_i$ cover the same time range at certain i (e.g., around the middle of the feature), as mentioned above. When $F_i$ and $B_i$ cover the same time range, their outputs are exactly the same if they have the same frame order. In this case, feeding $F_i$ and $B_i$ together is merely a redundancy of meaningless operations, which neither increases the information available at one time nor provides data augmentation. This problem can be avoided by making the frame orders of $F_i$ and $B_i$ opposite, although the information available at one time still cannot be increased. As long as the frame orders are opposite, $F_i$ and $B_i$ can be treated as different features by the CNN, even if they cover exactly the same time range. Notice that the words ‘different features’ do not mean that $F_i$ and $B_i$ have different information, but that they merely produce (slightly) different output values.
Figure 3 illustrates the overall frameworks of (a) the conventional system (i.e., based on conventional feature segmentation and one-point input) and (b–d) the proposed systems (i.e., based on bidirectional feature segmentation and bi-point input). The differences between the conventional and proposed systems are highlighted in blue shade. In the proposed systems, both segments in a pair are fed into the same CNN, which indicates that the CNN parameters are shared across both segments. The results from each of the segments in a pair are combined before being fed into the classifier. From a pair of two segments, therefore, we obtain one score value, not two.
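The sharing of CNN parameters across the two segments of a pair can be sketched as a PyTorch wrapper module; the class and argument names are illustrative assumptions, and the combination function corresponds to one of the methods described in the following subsections.

```python
import torch
import torch.nn as nn

class BPCNN(nn.Module):
    """Bi-point input wrapper: one shared CNN, a combination function, and a classifier."""

    def __init__(self, cnn: nn.Module, combine_fn, classifier: nn.Module):
        super().__init__()
        self.cnn = cnn                       # parameters shared across both segments
        self.gap = nn.AdaptiveAvgPool2d(1)   # global average pooling (GAP) layer
        self.combine_fn = combine_fn         # e.g., concat, vmax, or vmean (Section 3.2.1)
        self.classifier = classifier         # FC softmax classifier on top

    def embed(self, segment: torch.Tensor) -> torch.Tensor:
        feature_maps = self.cnn(segment)          # (B, C, H, W)
        return self.gap(feature_maps).flatten(1)  # (B, C) embedding

    def forward(self, f_seg: torch.Tensor, b_seg: torch.Tensor) -> torch.Tensor:
        u = self.embed(f_seg)                # embedding of the forward segment
        v = self.embed(b_seg)                # embedding of the backward segment
        return self.classifier(self.combine_fn(u, v))  # one output (score) per pair
```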
We propose several methods to combine the results from the two segments in a pair, including the methods used in [8,27,29]. These are designed considering that most CNN-based models are composed of the following modules in order: a CNN, a global average pooling (GAP) layer, and a classifier. The GAP layer transforms the CNN output of size (C × H × W), i.e., C feature maps, into a vector of size C by averaging along each channel, where C is the number of channels. The combination methods fall into four categories, depending on where the combination is performed within the network.

3.2.1. Embedding-Level Combination

Figure 3b illustrates the embedding-level combination method. We use the output of the GAP layer as the embedding. In this method, the two embeddings obtained from a pair of segments are combined to form one embedding. The combined embedding is then fed into the classifier. We combine the two embeddings with three approaches: concatenation (denoted as concat), element-wise maximum (denoted as vmax), and element-wise averaging (denoted as vmean). Let $\mathbf{u} = [u_1, \dots, u_D]^{\top}$ and $\mathbf{v} = [v_1, \dots, v_D]^{\top}$ be the two embedding vectors, and $\mathbf{w} = [w_1, \dots, w_D]^{\top}$ be the output. vmax computes the element-wise maximum, so the output becomes $\mathbf{w} = \max(\mathbf{u}, \mathbf{v})$, where $w_i = \max(u_i, v_i)$. vmean computes the element-wise average, so the output becomes $\mathbf{w} = 0.5(\mathbf{u} + \mathbf{v})$ (i.e., just vector addition followed by scalar multiplication), where $w_i = 0.5(u_i + v_i)$. The dimensionality of the combined embedding is doubled when using concat, meaning that the number of parameters of the classifier doubles. In contrast, when using vmax or vmean, the combined embedding has the same dimensionality as each original embedding, so no additional parameters are required.
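As a minimal sketch, assuming the two embeddings are PyTorch tensors of shape (batch, D), the three combinations can be written as follows; these functions could be plugged into the wrapper sketched above.

```python
import torch

def concat(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    return torch.cat([u, v], dim=1)    # (B, 2D): the classifier input size doubles

def vmax(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    return torch.maximum(u, v)         # element-wise maximum, (B, D)

def vmean(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    return 0.5 * (u + v)               # element-wise average, (B, D)
```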

3.2.2. Feature Map-Level Combination

Figure 3c illustrates the feature map-level combination method. The output of the CNN module, with the shape of (C × H × W), corresponds to a set of C feature maps, where C is the number of output channels of the last convolution layer in the module, and H and W are the height and width of each feature map, respectively. Two sets of feature maps are obtained from the two segments in a pair. We combine these two sets of feature maps using an element-wise maximum (denoted as fmax). We do not consider other combination operations, such as concatenation or element-wise averaging per channel, because concatenation and element-wise averaging at the feature map level produce the same results as those at the embedding level (i.e., concat and vmean, respectively).
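A minimal sketch of fmax on two feature map tensors of shape (batch, C, H, W); the comment spells out why averaging at this level is redundant.

```python
import torch

def fmax(fmap_f: torch.Tensor, fmap_b: torch.Tensor) -> torch.Tensor:
    """Element-wise maximum of two (B, C, H, W) feature map tensors."""
    return torch.maximum(fmap_f, fmap_b)

# Global average pooling is linear, so averaging the feature maps and then pooling
# equals pooling each set and then averaging the embeddings (i.e., vmean):
#   GAP(0.5 * (A + B)) == 0.5 * (GAP(A) + GAP(B))
# which is why only the maximum is considered at the feature-map level.
```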

3.2.3. Two-Channel Input

Figure 3d illustrates the two-channel input method. Most acoustic features extracted from utterances, such as spectrograms, which commonly serve as CNN inputs, are 2D matrices with the shape of (H × W). Such a 2D matrix can be viewed as a 3D tensor with the shape of (1 × H × W), where the number of channels is 1. We propose the two-channel input method (denoted as 2ch), where the two segments in a pair are concatenated along the channel axis to form one 3D tensor with the shape of (2 × H × W). This concatenated 3D tensor is then used as the CNN input. To receive an input with two channels, the first convolution layer of the CNN module must accept two input channels, and consequently the number of parameters of the first convolution layer doubles.
A CNN designed for one-channel inputs can also receive two-channel inputs without additional parameters by sharing its kernel weights across both input channels (denoted as 2ch_s). In this method, therefore, the number of parameters is unchanged. However, we do not consider the 2ch_s method further because of its lower gain in reducing detection error; the detection errors of 2ch and 2ch_s are compared in the last part of Section 5.2.
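A minimal sketch of the 2ch input, assuming each segment is a (batch, 1, H, W) spectrogram tensor; the layer shapes are illustrative.

```python
import torch
import torch.nn as nn

def make_two_channel_input(f_seg: torch.Tensor, b_seg: torch.Tensor) -> torch.Tensor:
    """Stack the two segments of a pair along the channel axis: (B, 1, H, W) -> (B, 2, H, W)."""
    return torch.cat([f_seg, b_seg], dim=1)

# The first convolution layer must be declared with in_channels=2 instead of 1:
first_conv_1ch = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1)
first_conv_2ch = nn.Conv2d(in_channels=2, out_channels=16, kernel_size=3, padding=1)
# first_conv_2ch has twice as many weights as first_conv_1ch (2*16*3*3 vs. 1*16*3*3).
```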

3.2.4. Statistics-Level Combination

The statistics-level combination method is only for an x-vector network [33,34], one of the models used in our experiments. Because the x-vector network cannot receive inputs with multiple channels, the feature map-level combination and two-channel input methods cannot be used with the x-vector network.
Accordingly, we propose another statistics-level combination method (denoted as statc) that can be used only for the x-vector network. The statistics pooling layer originally computes the statistics (e.g., mean and standard deviation) over all the M output frames from one (forward) segment. In contrast, with the statc method, the statistics pooling layer computes the statistics over 2 M output frames, of which some M frames are from a forward segment and the other M frames are from a backward segment.
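A minimal sketch of statc, assuming the frame-level TDNN outputs of the two segments are tensors of shape (batch, D, M); the pooling layer simply operates on the concatenation of the 2M frames.

```python
import torch

def statistics_pooling(frames: torch.Tensor) -> torch.Tensor:
    """Mean and standard deviation over the time axis of (B, D, T) frame-level outputs."""
    return torch.cat([frames.mean(dim=2), frames.std(dim=2)], dim=1)   # (B, 2D)

def statc_pooling(frames_f: torch.Tensor, frames_b: torch.Tensor) -> torch.Tensor:
    """statc: pool over the 2M frames of a (forward, backward) segment pair."""
    return statistics_pooling(torch.cat([frames_f, frames_b], dim=2))  # (B, D, 2M) -> (B, 2D)
```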

4. Experiments

4.1. Database

We used the ASVspoof 2019 database [35]. It includes two task scenarios: logical access (LA) and physical access (PA).
The LA scenario addresses the spoofing attacks generated by TTS synthesis and VC. It is divided into the training (2580 genuine and 22,800 spoofed speeches), development (2548 genuine and 22,296 spoofed speeches), and evaluation (7355 genuine and 63,882 spoofed speeches) sets. The utterance length is approximately 3.25 s on average and ranges from 0.47 to 13.19 s with a standard deviation of 1.47. Figure 4a illustrates the histogram of the utterance lengths in the LA scenario.
The PA scenario addresses the spoofing attacks generated by replay attack methods. It is divided into the training (5400 genuine and 48,600 spoofed speeches), development (5400 genuine and 24,300 spoofed speeches), and evaluation (18,090 genuine and 116,640 spoofed speeches) sets. The utterance length is approximately 4.28 s on average and ranges from 1.46 to 10.32 s with a standard deviation of 1.2. Figure 4b illustrates the histogram of the utterance lengths in the PA scenario.
The training sets were used to build spoofing detection systems. The development sets were used to validate the detection errors of the systems for every epoch. The evaluation sets were used to evaluate the detection error rates of the system.

4.2. Experimental Setup

We used a 257-dimensional log power spectrogram as input feature for all types of CNNs used in our experiments. For each utterance, 25 ms frames were extracted at 10 ms intervals. The frame-level preprocessing was performed in the following order: removing the DC offset, pre-emphasis filtering with a coefficient of 0.97, and applying the Hamming window. The number of fast Fourier transform (FFT) points was 512. We performed neither voice activity detection (VAD) [36] nor cepstral mean and variance normalization (CMVN) for the spectrogram. The Kaldi toolkit [37] was used to extract the feature.
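The feature extraction was done with Kaldi; the following torchaudio-based sketch is only an approximation we provide for illustration (it applies DC removal and pre-emphasis to the whole signal rather than per frame) and is not the authors' pipeline.

```python
import torch
import torchaudio

def log_power_spectrogram(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    """Approximate 257-dim log power spectrogram: 25 ms frames, 10 ms shift, 512-point FFT."""
    waveform = waveform - waveform.mean()                        # remove DC offset (global here)
    emphasized = torch.cat(                                      # pre-emphasis, coefficient 0.97
        [waveform[:1], waveform[1:] - 0.97 * waveform[:-1]])
    spec = torchaudio.transforms.Spectrogram(
        n_fft=512,                                               # 512 FFT points -> 257 bins
        win_length=int(0.025 * sample_rate),                     # 25 ms window
        hop_length=int(0.010 * sample_rate),                     # 10 ms shift
        window_fn=torch.hamming_window,
        power=2.0,                                               # power spectrogram
    )(emphasized)
    return torch.log(spec + 1e-10).transpose(0, 1)               # (frames, 257)
```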
For segmentation, we set the length of segment M to 200 (corresponding to 2 s), 400 (corresponding to 4 s), and 600 (corresponding to 6 s). Approximately 81%, 25%, and 5% of utterances in the LA subset are longer than 2, 4, and 6 s, respectively. Approximately 99%, 56%, and 8% of utterances in the PA subset are longer than 2, 4, and 6 s, respectively. The shift interval L was half of M , which is based on the claim in [26] that segmentation with overlap generally outperforms that without overlap (i.e., L = M ).
We used six types of CNN-based models to build spoofing detection systems: SE-ResNet, the x-vector network, the dense convolutional network (DenseNet) [38], MobileNetV2 [39], ShuffleNetV2 [40], and MNASNet [41], designed using an automated mobile neural architecture search (MNAS) approach. All models used in our experiments have a fully connected (FC) softmax classifier at the top. The number of classes is three (i.e., genuine, TTS, and VC) for the LA task and two (i.e., genuine and replayed) for the PA task. AMSGrad [42], a variant of the Adam [43] optimizer, was used to minimize the cross-entropy loss. The hyper-parameters for the optimizer were as follows: a learning rate of $10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, and a weight decay of $10^{-4}$. All weights were initialized from the He normal distribution [44] and no biases were used. We trained the networks for 100 epochs with a mini-batch size of 64. PyTorch [45] with the torchvision.models package was used to implement the systems.
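The optimizer setup translates directly to PyTorch (AMSGrad is exposed as a flag of torch.optim.Adam); the placeholder model below is ours, standing in for any of the CNN-based models above.

```python
import torch

model = torch.nn.Linear(257, 3)        # placeholder; in practice one of the CNN models above
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=1e-4,
    amsgrad=True,                      # AMSGrad variant of Adam
)
criterion = torch.nn.CrossEntropyLoss()  # cross-entropy loss over the class logits
```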
We used the equal error rate (EER) as the evaluation metric. The lower the EER, the lower the detection error. For the LA task, where the number of classes is three, the score of each segment was defined as the log probability of the genuine class, $\log p_g(x)$, corresponding to the log-softmax output for that class. For the PA task, where the number of classes is two, the score of each segment was defined as the log ratio of the genuine probability to the spoof probability, $\log p_g(x) - \log p_s(x)$. The score of each utterance is computed by averaging the scores of all the segments in the utterance.
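A minimal sketch of the scoring, assuming classifier logits of shape (num_segments, num_classes) with class index 0 as the genuine class (an ordering we assume for illustration).

```python
import torch
import torch.nn.functional as F

def segment_scores(logits: torch.Tensor, task: str) -> torch.Tensor:
    """Per-segment scores. LA: log p_genuine; PA: log p_genuine - log p_spoof."""
    log_probs = F.log_softmax(logits, dim=1)      # log-softmax over the classes
    if task == "LA":
        return log_probs[:, 0]
    return log_probs[:, 0] - log_probs[:, 1]

def utterance_score(logits: torch.Tensor, task: str) -> torch.Tensor:
    """Utterance-level score: the average of the scores of all its segments."""
    return segment_scores(logits, task).mean()
```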

5. Results

5.1. Experimental Results and Discussion

Table 1 presents the EERs of the baseline and proposed systems on the development and evaluation trials of the ASVspoof 2019 LA and PA tasks. For each system (corresponding to a combination of model and method), we selected the epoch among 100 that exhibited the lowest EER on the development trials. For guidance, the lowest EERs on the evaluation trials (i.e., the oracle EERs) are additionally reported in Appendix A. The Fusion method indicates the score-level fusion of all the proposed systems using simple summation. Unless otherwise noted, the discussion focuses on the results of the evaluation trials.
The experimental results in Table 1 indicate that the proposed method outperforms conventional feature segmentation on spoofing detection for both the LA and PA tasks. In most cases, the proposed method exhibited lower EERs than the baseline with suitable combination methods, regardless of the type of model, segment length M, and task. There is an exceptional case where the proposed method failed to reduce the EER for MobileNetV2 with M = 600. This result is mainly due to the limitation of the validation (especially for the LA task), rather than to the proposed method being ineffective at reducing the EER. This issue is explained in Appendix A. Our experiments provide evidence that the discriminative attributes between genuine and spoofed utterances are spread across the entire time domain, rather than concentrated primarily in specific ranges, as stated in Section 1. The performance would not improve with the proposed method if the discriminative attributes were concentrated in specific time ranges. However, even if the attributes are spread over the entire time domain, lower EERs would not be obtained if the attributes in different time ranges did not carry distinct information for detection. Therefore, the proposed method is effective for cases where distinct information is spread across the entire time domain, such as spoofing detection.
We discuss the results in Table 1 with respect to the segment length M. The longer the M, the lower the EERs in most cases for both the baseline and proposed systems. This is consistent with our intuition that a longer segment carries more information than a shorter segment (if M became even longer, e.g., thousands of frames or more, the EER could saturate, but we do not compare with extremely long M because such settings are rarely considered owing to limited computing power and memory). Some exceptional cases were observed for the LA task where the EERs increased on average even though M lengthened from 400 to 600: with SE-ResNet, DenseNet (only with the proposed methods), and MNASNet. These cases are discussed in Appendix A.
We compared the baseline with a target length $M_b$ against the proposed methods with a target length $M_p$ shorter than $M_b$ (i.e., $M_p < M_b$). For $M_b = 400$ and $M_p = 200$, one could expect the proposed systems to exhibit EERs similar to those of the baseline systems, because the total numbers of frames fed at one time are the same. By the same logic, for $M_b = 600$ and $M_p = 400$, one could expect the proposed systems to exhibit lower EERs than the baseline systems, because the total number of frames fed at one time is greater with the proposed systems. However, the proposed systems with $M_p = 200$ generally exhibited higher EERs than the baseline systems with $M_b = 400$. The proposed systems with $M_p = 400$ exhibited slightly lower but similar levels of EERs to those of the baseline systems with $M_b = 600$ on average. Recall that the amount of information with the proposed method is usually less than twice that of the baseline method because of the partially overlapping time ranges, as described in Section 3.2. Accordingly, the above results are acceptable and do not undermine our expectation that the longer the M, the more information is available at one time, and thus the lower the EERs.
The above results illustrated that the effectiveness of the proposed method on reducing EERs depends on M . We empirically found that the proposed method was more effective at reducing EER as M was closer to the average length. For M = 200 , 33 out of the 35 proposed systems (including Fusion) exhibited lower EERs than the baseline in both the LA and PA tasks. The EERs were reduced on average by approximately 17.2% (−12.1% to 44.6%) for the LA task and 35.6% (−36.9% to 73.4%) for the PA task. For M = 400 , 28 and 33 out of the 35 proposed systems exhibited lower EERs than the baseline in the LA and PA tasks, respectively. The EERs were reduced on average by approximately 13.4% (−43.5% to 41%) for the LA task and 43.8% (−6.5% to 73.3%) for the PA task. When M is longer than the average, the proposed method exhibited a lower gain in reducing EER. For M = 600 , 18 out of the 35 proposed systems exhibited higher EERs than the baseline for the LA task. The EERs for the LA task increased on average by approximately 2.7% (i.e., decreased approximately −2.7%, range from −74.8% to 25%) when using the proposed method. For the PA task, 27 out of the 35 proposed systems exhibited lower EERs than the baseline. The EERs were reduced on average by approximately 18.2% (−100.5% to 64.4%). The proposed method was most effective for reducing EER when M = 200 for the LA task and when M = 400 for the PA task. These results can be interpreted as follows. For the features longer than M , the proposed method enables the CNN to consider more information at one time. In contrast, the proposed method has a negligible effect for features shorter than M because the CNN already considers the information of the complete feature at one time for these short features, even without the proposed method. Thus, for the features shorter than M , there is no longer room to increase the amount of information available at one time.
We provide rough guidance for setting a proper M when using the proposed method without evaluation by defining the additional gain α = Mρ, where ρ is the proportion of features longer than M in the dataset (stated in Section 4.2). For the LA dataset, where the average number of frames is 325, α = 200 × 0.81 = 162 when M = 200 (i.e., 81% of features have more than 200 frames), α = 400 × 0.25 = 100 when M = 400 (i.e., 25% of features have more than 400 frames), and α = 600 × 0.05 = 30 when M = 600 (i.e., 5% of features have more than 600 frames). The average EER reductions are approximately 17.2%, 13.4%, and −2.7% (i.e., the EERs increased by approximately 2.7% on average) when M = 200, M = 400, and M = 600, respectively. For the PA dataset, where the average number of frames is 428, α = 200 × 0.99 = 198 when M = 200 (i.e., 99% of features have more than 200 frames), α = 400 × 0.56 = 224 when M = 400 (i.e., 56% of features have more than 400 frames), and α = 600 × 0.08 = 48 when M = 600 (i.e., 8% of features have more than 600 frames). The average EER reductions are approximately 35.6%, 43.8%, and 18.2% when M = 200, M = 400, and M = 600, respectively. It is reasonable to infer that the higher the α, the greater the effectiveness of the proposed method in reducing the EER. Therefore, we recommend considering α to find a proper value of M when using the proposed method.
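The guidance boils down to a one-line computation; the function name below is ours.

```python
def additional_gain(m: int, rho: float) -> float:
    """alpha = M * rho, the rough guidance value for choosing the segment length M."""
    return m * rho

# LA dataset from the text: alpha is about 162, 100, and 30 for M = 200, 400, and 600.
la_alphas = [additional_gain(200, 0.81), additional_gain(400, 0.25), additional_gain(600, 0.05)]
```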
Next, we discuss the results in terms of the combination method. Contrary to our prediction that certain combination methods would achieve lower EERs than the others, there is no consistent tendency in terms of the combination method. Instead, the optimal combination method (i.e., the method that yields the underlined EER in Table 1) differs for each case. Furthermore, in many cases, there were mismatches in the optimal combination method between the development and evaluation trials, even though the other conditions (e.g., model, M, and task) were the same. Although Fusion achieved the lowest EER in most cases, or EERs close to the lowest, it is inefficient because it requires multiple networks with different combination methods. Developing a new combination method that achieves stable performance (e.g., a consistent tendency across trials and conditions) is therefore left for future research.

5.2. Ablation Study

Throughout this paper, we have claimed that the proposed method reduces detection errors by increasing the amount of information available at one time. In this section, we support this claim through additional experiments on the following three factors: time range, flipping, and randomness of the target length. We set the segment length M to 400 for all the experiments.
In Section 3.2, we questioned whether the proposed method can still reduce detection errors when the two segments in a pair always cover the same time range. To resolve this question, we first compared the EERs of the proposed systems with those of variants in which both segments in a pair always cover the same time range (denoted as ST). Table 2 presents the oracle EERs of the systems with the proposed method and with ST for the LA and PA tasks. We focus on the results in the evaluation trials. If the detection errors were reduced even with ST, as with the proposed method, we could no longer claim that the mechanism by which the proposed method reduces the errors is the increase in the amount of information available at one time. Instead, it might be because flipping the frame order has a data augmentation effect, which led to the error reduction.
For the LA task, the proposed method exhibited lower EERs than ST, by approximately 3.2% on average (−30.3% to 33.5%). Compared with the baseline, the proposed method exhibited lower EERs, by approximately 8.2% on average, where 23 out of the 29 proposed systems (excluding the average) achieved lower EERs. ST exhibited lower EERs than the baseline, by approximately 2.5% on average, where 14 out of the 29 systems exhibited lower EERs with ST. Contrary to our prediction, there was no dominant performance gap between the proposed method and ST, although these results are still in accordance with our claim about the mechanism of the proposed method.
Unlike in the LA task, ST does not help reduce EER at all for the PA task. The proposed method exhibited considerably lower EERs than ST in all cases except the MobileNetV2 system with vmax. The EERs of the proposed methods are significantly lower than those of ST, by approximately 46% on average (−24.2% to 64.5%). Moreover, compared with the baseline, 27 out of the 29 systems exhibited lower EERs when using the proposed method, by approximately 43.2% on average, whereas only 13 out of the 29 systems exhibited lower EERs when using ST, by merely 6.3% on average.
To further analyze the effect of flipping the frame order, we conducted additional experiments to compare the oracle EERs of the baseline, proposed, backward-only (denoted as BO), and data-augmented (denoted as Augment) systems. BO is a variant of the baseline where only the backward segments $B_i$ were used in both the training and evaluation phases, not the forward segments $F_i$. Augment is identical to the baseline, but the total quantity of data is exactly twice that of the baseline in both the training and evaluation phases. These doubled amounts of data correspond to all the forward and backward segments obtained by bidirectional feature segmentation. Augment treats the two segments in each pair as individual data rather than receiving them at one time. Therefore, the baseline, BO, and Augment have the same amount of information available at one time.
Table 3 presents the oracle EERs of the baseline, proposed, BO, and Augment systems in the development and evaluation trials of the LA and PA tasks. For the proposed systems, we presented the average of the EERs obtained with each combination method (excluding Fusion).
For the evaluation trials of the LA task, both BO and Augment exhibited higher EERs than the proposed systems on average, higher by approximately 14.9% (−9.4% to 40.5%) and 10% (−13.3% to 53.1%), respectively. Compared to the baseline systems, BO exhibited higher EERs, approximately 5.2% (−26.3% to 23%) on average. Augment exhibited almost the same EERs on average as the baseline, lower by only approximately 0.01% (−27.7 to 23.6%).
For the PA task, the proposed method always exhibited the lowest EERs by large margins. In the development trials, BO exhibited higher EERs on average, by approximately 11.5% (−18.5% to 57.7%) and 138.8% (32% to 228.8%) than the baseline and proposed systems, respectively. Augment exhibited higher EERs on average, by approximately 12.9% (−13.9% to 62.3%) and 141.7% (29.3% to 218%) than the baseline and proposed systems, respectively. In the evaluation trials, BO exhibited higher EERs on average, by approximately 3.6% (−19.2% to 40.5%) and 89% (14.5% to 138.6%) than the baseline and proposed systems, respectively. Augment exhibited higher EERs on average, by approximately 35.2% (17.2% to 94.7%) and 145.1% (63.7% to 199.1%) than the baseline and proposed systems, respectively.
As presented in Table 2 and Table 3, the proposed method generally exhibited lower EERs than ST, BO, and Augment in both the LA and PA tasks. In the LA task, the average performance gains over the baseline with ST, BO, Augment, and the proposed method are 2.5%, −5.2%, 0.01%, and 8.2%, respectively. In the evaluation trials of the PA task, the average performance gains over the baseline with ST, BO, Augment, and the proposed method are 6.3%, −3.6%, −35.2%, and 46%, respectively. These results support the claim that increasing the amount of information available at one time is important for robust spoofing detection; this requirement is difficult to satisfy by merely flipping the frame order, whereas the proposed method satisfies it.
The next experiment compares the EERs of 2ch and 2ch_s, introduced in Section 3.2.3. Table 4 presents the oracle EERs of the systems on the development and evaluation trials of the LA and PA tasks. The x-vector network was excluded from this experiment because it cannot receive a multi-channel input, as explained in Section 3.2.4.
For the LA task, 2ch_s exhibited higher EERs than 2ch in the evaluation trials, by approximately 10.8% on average (−11.2% to 32.4%). Three of the five models exhibited lower EERs with 2ch than with 2ch_s.
For the PA task, 2ch_s exhibited distinctly higher EERs than 2ch, regardless of the model: approximately 134.1% (66.8% to 225.1%) and 79.7% (59.6% to 99.3%) higher in the development and evaluation trials, respectively. Consistent with our expectation, the advantage of 2ch is due to the additional parameters, which provide modeling at the input feature level distinct from that of the existing parameters.
The last experiments were conducted to demonstrate that the proposed method is better than the on-the-fly data loader for spoofing detection, as introduced in Section 1. Table 5 presents the oracle EERs of the baseline, on-the-fly data loader (denoted as OTF), and proposed systems in the development and evaluation trials of the LA and PA tasks. For the on-the-fly data loader, we sampled the target length for each iteration from a uniform distribution on the interval [200, 600] to match the average number of frames that the CNNs take at one time with the baseline.
For the evaluation trials of the LA task, the proposed systems showed relatively 10.3% (−3.4% to 28.9%) lower EERs than OTF on average. Compared to the baseline systems, OTF exhibited higher EERs, approximately 4.3% (−21.3% to 39.7%) on average, whereas the proposed systems exhibited lower EERs, approximately 8.2% (−5.7% to 18.6%), than the baseline systems on average.
For the PA task, the proposed method showed relatively 58.6% (35.6% to 69%) and 51.8% (38.4% to 67.5%) lower EERs than OTF on average in the development and evaluation trials, respectively. Compared to the baseline systems, OTF exhibited higher EERs, approximately 20.4% (−4.1% to 59.3%) and 19.2% (−2.3% to 42.4%) on average, in the development and evaluation trials, respectively. In contrast, the proposed systems showed significantly lower EERs, approximately 50.9% (30.9% to 63.3%) and 41.8% (15% to 60.4%) on average, in the development and evaluation trials, respectively.

6. Conclusions

We proposed a bi-point input method based on bidirectional feature segmentation. It enables CNNs that receive variable-length inputs to increase the amount of information available at one time. Each feature is decomposed into two sets of segments (i.e., forward and backward segment sets) along two directions (i.e., the positive and negative time directions). Then, the CNN receives a pair of two segments (i.e., one forward segment and one backward segment) as an input at one time. We also evaluated various combination methods; the results from both segments are combined using one of the proposed combination methods before being fed into the final classifier, aggregating the information from both segments. The proposed method achieved lower EERs than the conventional method on both the LA and PA tasks of ASVspoof 2019 while requiring minimal changes to the network structure. Based on these results, we claim that the proposed method, with suitable combination methods, can serve as a robust spoofing detection approach by increasing the amount of information available at one time. We also provided rough guidance for setting a proper segment length without evaluation.
In the future, we will extend the proposed method to receive more than two segments at one time, for a further reduction in detection errors. To achieve a stable performance without the fusion, we will investigate how to aggregate the information in all the pairs of segments more efficiently. Moreover, we will apply our proposed method to different tasks, such as speaker verification, acoustic scene classification, and others that use different types of time-series input other than audio. We predict that the proposed method would also benefit tasks where the discriminative attributes between the classes are spread across the entire time domain, like spoofing detection.

Author Contributions

Conceptualization, S.Y.; methodology, S.Y.; formal analysis, S.Y.; investigation, S.Y.; writing—original draft preparation, S.Y.; writing—review and editing, H.-J.Y.; supervision, H.-J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by a research grant from Kongju National University in 2022.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study, in the collection, analyses, or interpretation of data, in the writing of the manuscript, or in the decision to publish the results.

Appendix A. The Oracle Evaluation Results

Table A1 presents the oracle EERs of the baseline and proposed systems in the evaluation trials, where each system is the epoch among 100 that exhibited the lowest EER in the evaluation trials rather than in the development trials (i.e., Table 1). As shown in Table 1 in Section 5.1, almost all of the development EERs for the LA task were zero, even though the evaluation EERs were considerably far from zero. Furthermore, most systems easily achieved zero development EER for the LA task for most epochs. These results indicate that it is easy to achieve zero EER on the development trials of the LA task even with systems that are inadequate on the evaluation trials. Therefore, especially for the LA task, model selection based on the development EER is prone to failing to choose systems that are adequate on the evaluation trials. Consequently, even if a proposed system selected on the development EERs presents a worse EER than the baseline on the evaluation trials, it is difficult to determine whether the increase in EER is because the method is ineffective or because a poor-performing system was chosen. Therefore, the results from validation on the development trials (Table 1) should be interpreted with reference to the oracle results (Table A1), especially for the LA task.
Table A1. Oracle EERs (%) of systems on the evaluation trials of ASVspoof 2019 LA and PA scenarios. For each model, the lowest EER is highlighted in bold (comparison including Fusion) or is underlined (comparison excluding Fusion).
| Model | Method | LA, M = 200 | LA, M = 400 | LA, M = 600 | PA, M = 200 | PA, M = 400 | PA, M = 600 |
|---|---|---|---|---|---|---|---|
| SE-ResNet | Baseline | 11.655 | 8.211 | 7.195 | 3.197 | 1.255 | 0.901 |
| | concat | 10.156 | 7.967 | 7.956 | 2.127 | 0.708 | 1.007 |
| | vmax | 10.102 | 9.517 | 7.480 | 2.118 | 1.327 | 1.503 |
| | vmean | 10.605 | 9.081 | 8.226 | 2.599 | 1.156 | 1.526 |
| | fmax | 10.536 | 7.576 | 10.510 | 2.304 | 1.105 | 0.896 |
| | 2ch | 11.543 | 9.261 | 8.019 | 3.282 | 1.040 | 0.790 |
| | Fusion | 10.335 | 8.038 | 7.313 | 1.643 | 0.619 | 0.553 |
| X-vector Network (TDNN) | Baseline | 13.680 | 8.157 | 5.830 | 8.297 | 3.538 | 1.408 |
| | concat | 12.618 | 6.352 | 5.820 | 3.964 | 1.493 | 1.166 |
| | vmax | 11.013 | 6.741 | 7.396 | 3.936 | 1.027 | 1.028 |
| | vmean | 13.068 | 7.804 | 6.349 | 3.626 | 1.497 | 1.248 |
| | statc | 12.071 | 6.309 | 7.028 | 3.555 | 1.581 | 1.156 |
| | Fusion | 10.428 | 6.171 | 5.112 | 1.879 | 0.885 | 0.768 |
| DenseNet | Baseline | 10.986 | 9.095 | 7.127 | 2.675 | 1.360 | 0.702 |
| | concat | 10.129 | 6.868 | 6.378 | 1.393 | 0.542 | 0.557 |
| | vmax | 8.484 | 6.717 | 6.841 | 1.116 | 0.719 | 0.492 |
| | vmean | 10.075 | 7.981 | 7.699 | 1.708 | 0.564 | 0.469 |
| | fmax | 9.762 | 7.628 | 6.881 | 1.470 | 0.702 | 0.493 |
| | 2ch | 9.966 | 7.828 | 6.497 | 2.510 | 0.796 | 0.486 |
| | Fusion | 8.688 | 5.901 | 5.626 | 0.763 | 0.392 | 0.276 |
| MobileNetV2 | Baseline | 9.939 | 7.695 | 7.507 | 4.223 | 1.885 | 0.785 |
| | concat | 8.513 | 6.067 | 7.775 | 2.593 | 1.398 | 0.878 |
| | vmax | 8.987 | 7.439 | 5.138 | 2.769 | 2.150 | 0.714 |
| | vmean | 9.029 | 7.533 | 7.599 | 2.587 | 1.847 | 0.967 |
| | fmax | 9.720 | 6.186 | 5.806 | 2.029 | 1.371 | 0.874 |
| | 2ch | 9.669 | 6.242 | 8.279 | 3.344 | 0.935 | 0.692 |
| | Fusion | 7.017 | 5.748 | 4.961 | 1.664 | 1.006 | 0.510 |
| ShuffleNetV2 | Baseline | 17.919 | 7.303 | 6.744 | 4.374 | 2.102 | 1.283 |
| | concat | 12.101 | 6.609 | 6.812 | 2.102 | 0.885 | 0.790 |
| | vmax | 14.004 | 7.032 | 7.939 | 2.102 | 0.823 | 0.691 |
| | vmean | 15.892 | 6.405 | 6.637 | 2.366 | 0.869 | 0.696 |
| | fmax | 12.397 | 7.454 | 8.840 | 2.123 | 1.061 | 0.768 |
| | 2ch | 15.309 | 8.786 | 6.706 | 3.444 | 1.227 | 0.951 |
| | Fusion | 10.904 | 6.578 | 5.695 | 1.542 | 0.579 | 0.485 |
| MNASNet | Baseline | 9.384 | 6.092 | 8.064 | 4.019 | 2.224 | 0.857 |
| | concat | 8.554 | 6.080 | 9.680 | 3.097 | 0.774 | 0.614 |
| | vmax | 8.403 | 5.398 | 5.682 | 3.384 | 0.818 | 0.526 |
| | vmean | 9.027 | 5.384 | 5.970 | 2.935 | 0.923 | 0.525 |
| | fmax | 8.539 | 4.840 | 7.709 | 3.185 | 0.846 | 0.525 |
| | 2ch | 8.403 | 6.840 | 7.575 | 3.582 | 1.039 | 0.452 |
| | Fusion | 6.839 | 4.690 | 5.998 | 2.188 | 0.542 | 0.321 |
In Section 5.1, we noted exceptional cases for the LA task with SE-ResNet, DenseNet, and MNASNet. The first two cases are due to the limitation of the validation: contrary to the EERs in Table 1, the oracle EERs (i.e., in Table A1) were lower when M = 600 than when M = 400, as in most other cases. For MNASNet, however, even the oracle EERs were still higher when M = 600 than when M = 400; further investigation is needed to understand this exception. On average, the oracle evaluation EERs are lower than the evaluation EERs shown in Table 1, by approximately 29.2% (18.4% to 46.2%) and 10% (2.1% to 21.6%) for the LA and PA tasks, respectively. This means that validation based on the development trials is limited in its ability to select proper systems for the LA task, as we expected. Overall, similar trends were observed between Table 1 and Table A1. For example, the proposed method shows lower EERs than the baseline with suitable combination methods, but there is no consistent tendency in terms of the combination method. In addition, the proposed method is more effective at reducing the EER as M approaches the average length. These are common trends observed in both Table 1 and Table A1. One trend opposite to Table 1 for the LA task is that the proposed method showed higher oracle EERs than the baseline for SE-ResNet with M = 600, regardless of the combination method.

References

1. Wu, Z.; Kinnunen, T.; Evans, N.; Yamagishi, J.; Hanilci, C.; Sahidullah, M.; Sizov, A. ASVspoof 2015: The first automatic speaker verification spoofing and countermeasures challenge. In Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015; pp. 1–5.
2. Kinnunen, T.; Sahidullah, M.; Delgado, H.; Todisco, M.; Evans, N.; Yamagishi, J.; Lee, K.A. The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection. In Proceedings of the 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, 20–24 August 2017; pp. 1–6.
3. Todisco, M.; Wang, X.; Vestman, V.; Sahidullah, M.; Delgado, H.; Nautsch, A.; Yamagishi, J.; Evans, N.; Kinnunen, T.; Lee, K.A. ASVspoof 2019: Future horizons in spoofed and fake audio detection. arXiv 2019, arXiv:1904.05441.
4. Yamagishi, J.; Wang, X.; Todisco, M.; Sahidullah, M.; Patino, J.; Nautsch, A.; Liu, X.; Lee, K.A.; Kinnunen, T.; Evans, N.; et al. ASVspoof 2021: Accelerating progress in spoofed and deepfake speech detection. In Proceedings of the 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge, Online, 16 September 2021; pp. 47–54.
5. Kamble, M.R.; Patil, H.A. Novel energy separation based instantaneous frequency features for spoof speech detection. In Proceedings of the 2017 25th European Signal Processing Conference, Kos, Greece, 26 October 2017; pp. 106–110.
6. Yoon, S.-H.; Koh, M.-S.; Park, J.-H.; Yu, H.-J. A new replay attack against automatic speaker verification systems. IEEE Access 2020, 8, 36080–36088.
7. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
8. Yoon, S.; Yu, H.-J. Multiple-point input and time-inverted speech signal for the ASVspoof 2021 challenge. In Proceedings of the 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge, Online, 16 September 2021; pp. 37–41.
9. Tomilov, A.; Svishchev, A.; Volkova, M.; Chirkovskiy, A.; Kondratev, A.; Lavrentyeva, G. STC antispoofing systems for the ASVspoof2021 challenge. In Proceedings of the 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge, Online, 16 September 2021; pp. 61–67.
10. Chen, X.; Zhang, Y.; Zhu, G.; Duan, Z. UR channel-robust synthetic speech detection system for ASVspoof 2021. In Proceedings of the 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge, Online, 16 September 2021; pp. 75–82.
11. Benhafid, Z.; Selouani, S.A.; Yakoub, M.S.; Amrouche, A. LARIHS ASSERT reassessment for logical access ASVspoof 2021 challenge. In Proceedings of the 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge, Online, 16 September 2021; pp. 94–99.
12. LeCun, Y.; Bottou, L.; Orr, G.B.; Müller, K.-R. Efficient backprop. In Neural Networks: Tricks of the Trade; Springer: Berlin/Heidelberg, Germany, 2012; pp. 9–48.
13. Lai, C.; Abad, A.; Richmond, K.; Yamagishi, J.; Dehak, N.; King, S. Attentive filtering networks for audio replay attack detection. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6316–6320.
14. Wang, H.; Dinkel, H.; Wang, S.; Qian, Y.; Yu, K. Cross-domain replay spoofing attack detection using domain adversarial training. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 2938–2942.
15. Yoon, S.-H.; Yu, H.-J. A simple distortion-free method to handle variable length sequences for recurrent neural networks in text dependent speaker verification. Appl. Sci. 2020, 10, 4092.
16. Cai, W.; Wu, H.; Cai, D.; Li, M. The DKU replay detection system for the ASVspoof 2019 challenge: On data augmentation, feature representation, classification, and fusion. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 1023–1027.
17. Lavrentyeva, G.; Novoselov, S.; Tseren, A.; Volkova, M.; Gorlanov, A.; Kozlov, A. STC antispoofing systems for the ASVspoof2019 challenge. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 1033–1037.
18. Alzantot, M.; Wang, Z.; Srivastava, M.B. Deep residual neural networks for audio spoofing detection. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 1078–1082.
19. Wu, H.; Liu, S.; Meng, H.; Lee, H. Defense against adversarial attacks on spoofing countermeasures of ASV. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6564–6568.
20. von Platen, P.; Tao, F.; Tur, G. Multi-task Siamese neural network for improving replay attack detection. In Proceedings of the Interspeech, Shanghai, China, 26–29 October 2020; pp. 1076–1080.
21. Parasu, P.; Epps, J.; Sriskandaraja, K.; Suthokumar, G. Investigating Light-ResNet architecture for spoofing detection under mismatched conditions. In Proceedings of the Interspeech, Shanghai, China, 26–29 October 2020; pp. 1111–1115.
22. Monteiro, J.; Alam, J.; Falk, T.H. A multi-condition training strategy for countermeasures against spoofing attacks to speaker recognizers. In Proceedings of the Odyssey Speaker Language Recognition Workshop, Tokyo, Japan, 1–5 November 2020; pp. 296–303.
23. Halpern, B.M.; Kelly, F.; van Son, R.; Alexander, A. Residual networks for resisting noise: Analysis of an embedding-based spoofing countermeasure. In Proceedings of the Odyssey Speaker Language Recognition Workshop, Tokyo, Japan, 1–5 November 2020; pp. 326–332.
24. Chettri, B.; Kinnunen, T.; Benetos, E. Subband modeling for spoofing detection in automatic speaker verification. In Proceedings of the Odyssey Speaker Language Recognition Workshop, Tokyo, Japan, 1–5 November 2020; pp. 341–348.
25. Cai, W.; Chen, J.; Zhang, J.; Li, M. On-the-fly data loader and utterance-level aggregation for speaker and language recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 1038–1051.
26. Lai, C.; Chen, N.; Villalba, J.; Dehak, N. ASSERT: Anti-spoofing with squeeze-excitation and residual networks. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 1013–1017.
27. Yoon, S.-H.; Yu, H.-J. Multiple points input for convolutional neural networks in replay attack detection. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6444–6448.
28. Wang, Q.; Lee, K.A.; Koshinaka, T. Using multi-resolution feature maps with convolutional neural networks for anti-spoofing in ASV. In Proceedings of the Odyssey Speaker Language Recognition Workshop, Tokyo, Japan, 1–5 November 2020; pp. 138–142.
29. Yoon, S.-H.; Koh, M.-S.; Yu, H.-J. Phase spectrum of time-flipped speech signals for robust spoofing detection. In Proceedings of the Odyssey Speaker Language Recognition Workshop, Tokyo, Japan, 1–5 November 2020; pp. 319–325.
30. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
31. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
32. Schuster, M.; Paliwal, K.K. Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 1997, 45, 2673–2681.
33. Snyder, D.; Garcia-Romero, D.; Sell, G.; Povey, D.; Khudanpur, S. Deep neural network embeddings for text-independent speaker verification. In Proceedings of the Interspeech 2017, Stockholm, Sweden, 20–24 August 2017; pp. 999–1003.
34. Snyder, D.; Garcia-Romero, D.; Sell, G.; Povey, D.; Khudanpur, S. X-vectors: Robust DNN embeddings for speaker recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5329–5333.
35. Wang, X.; Yamagishi, J.; Todisco, M.; Delgado, H.; Nautsch, A.; Evans, N.; Sahidullah, M.; Vestman, V.; Kinnunen, T.; Lee, K.A.; et al. ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech. Comput. Speech Lang. 2020, 64, 101114.
36. Sahidullah, M.; Kinnunen, T.; Hanilci, C. A comparison of features for synthetic speech detection. In Proceedings of the Interspeech, Dresden, Germany, 6–10 September 2015; pp. 1–6.
37. Povey, D.; Ghoshal, A.; Boulianne, G.; Burget, L.; Glembek, O.; Goel, N.; Hannemann, M.; Motlicek, P.; Qian, Y.; Schwarz, P.; et al. The Kaldi speech recognition toolkit. In Proceedings of the IEEE Automatic Speech Recognition and Understanding (ASRU) Workshop, Waikoloa, HI, USA, 11–15 December 2011; pp. 1–4.
38. Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
39. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520.
40. Ma, N.; Zhang, X.; Zheng, H.-T.; Sun, J. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131.
41. Tan, M.; Chen, B.; Pang, R.; Vasudevan, V.; Sandler, M.; Howard, A.; Le, Q.V. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 2820–2828.
42. Reddi, S.J.; Kale, S.; Kumar, S. On the convergence of Adam and beyond. arXiv 2019, arXiv:1904.09237.
43. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
44. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1026–1034.
45. Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic differentiation in PyTorch. In Proceedings of the NIPS 2017 Autodiff Workshop, Long Beach, CA, USA, 9 December 2017.
Figure 1. The overview of the conventional feature segmentation when (left) $M < T$ and (right) $M > T$, where $X = [x_0, \dots, x_{T-1}]$ is a feature of length $T$, $M$ is the segment length, $L$ is the shift interval for segmentation, and $F_i$ is the $i$-th segment.
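For concreteness, the sketch below mirrors the segmentation depicted in Figure 1 for a feature of shape (T, D). The boundary details are our assumptions rather than the released code: the final segment is anchored at the end of the utterance so that no frames are dropped, and a feature shorter than M frames is repeated along the time axis, which is one plausible reading of the M > T case on the right of the figure.

```python
import numpy as np

def segment_feature(X: np.ndarray, M: int, L: int):
    """Decompose a (T, D) feature X into fixed-length segments F_i of length M,
    starting every L frames (boundary handling is an assumption, see the text)."""
    T = X.shape[0]
    if T < M:
        # Repeat the short feature along the time axis until M frames are available.
        reps = int(np.ceil(M / T))
        return [np.tile(X, (reps, 1))[:M]]
    starts = list(range(0, T - M + 1, L))
    if starts[-1] != T - M:
        starts.append(T - M)  # keep the tail of the utterance
    return [X[s:s + M] for s in starts]

# Example: a 700-frame, 60-dimensional feature with M = 400 and L = 200.
F = segment_feature(np.random.randn(700, 60), M=400, L=200)
print([f.shape for f in F])  # [(400, 60), (400, 60), (400, 60)]
```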
Figure 2. The overview of the proposed bidirectional feature segmentation when (left) $M < T$ and (right) $M > T$, where $X = [x_0, \dots, x_{T-1}]$ is a feature of length $T$, $M$ is the segment length, and $F_i$ and $B_i$ are the $i$-th forward and backward segments, respectively.
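A corresponding sketch for the bi-directional segmentation of Figure 2 follows. Pairing the $i$-th forward segment $F_i$ with the $i$-th backward segment $B_i$, where backward segments are obtained by segmenting the time-reversed feature and then restoring the frame order, is our reading of the figure and not a verbatim reproduction of the method.

```python
import numpy as np

def fixed_length_segments(X, M, L):
    """Fixed-length segments of length M taken every L frames; short features
    are repeated along the time axis (same convention as the sketch above)."""
    T = X.shape[0]
    if T < M:
        return [np.tile(X, (int(np.ceil(M / T)), 1))[:M]]
    starts = list(range(0, T - M + 1, L))
    if starts[-1] != T - M:
        starts.append(T - M)
    return [X[s:s + M] for s in starts]

def bipoint_pairs(X, M, L):
    """Pair the i-th forward segment F_i with the i-th backward segment B_i;
    B_i is taken from the mirrored position at the end of the feature
    (our reading of Figure 2)."""
    F = fixed_length_segments(X, M, L)
    B = [b[::-1] for b in fixed_length_segments(X[::-1], M, L)]  # mirror, segment, un-mirror
    return list(zip(F, B))

pairs = bipoint_pairs(np.random.randn(700, 60), M=400, L=200)
print(len(pairs), pairs[0][0].shape, pairs[0][1].shape)  # 3 (400, 60) (400, 60)
```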
Figure 3. The framework of (a) conventional and (b–d) proposed systems, where GAP stands for global average pooling, Combine in (b,c) stands for one of the appropriate combination methods (described in Section 3.2.1 and Section 3.2.2, respectively), and $F_i$ and $B_i$ are the $i$-th forward and backward segments, respectively. The proposed systems consist of (b) embedding-level combination, (c) feature-map-level combination, and (d) two-channel input. The modules for the proposed method are marked in blue.
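To make the three arrangements in Figure 3 concrete, the sketch below illustrates only the tensor shapes involved. The mapping of the method names (concat, vmax, vmean, fmax, 2ch) to these operations follows our reading of Figure 3 and Sections 3.2.1 and 3.2.2, and all tensor sizes are illustrative placeholders, not the settings used in the experiments.

```python
import torch

B, C, F, T = 8, 64, 20, 50            # batch, channels, frequency bins, frames (illustrative)
emb_f = torch.randn(B, C)             # embedding of the forward segment after GAP
emb_b = torch.randn(B, C)             # embedding of the backward segment after GAP

# (b) embedding-level combinations
e_concat = torch.cat([emb_f, emb_b], dim=1)   # "concat": (B, 2C)
e_vmax   = torch.maximum(emb_f, emb_b)        # "vmax": element-wise max, (B, C)
e_vmean  = (emb_f + emb_b) / 2                # "vmean": element-wise mean, (B, C)

# (c) feature-map-level combination, applied before global average pooling
map_f = torch.randn(B, C, F, T)
map_b = torch.randn(B, C, F, T)
m_fmax = torch.maximum(map_f, map_b)          # "fmax": (B, C, F, T)
e_from_fmax = m_fmax.mean(dim=(2, 3))         # GAP over frequency and time, (B, C)

# (d) two-channel input: the segment pair is stacked on the channel axis of the CNN input
seg_f = torch.randn(B, 1, F, T)
seg_b = torch.randn(B, 1, F, T)
x_2ch = torch.cat([seg_f, seg_b], dim=1)      # "2ch": (B, 2, F, T)

print(e_concat.shape, e_vmax.shape, m_fmax.shape, x_2ch.shape)
```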
Figure 4. The histogram of the utterance lengths in ASVspoof (a) LA and (b) PA scenarios.
Table 1. EERs (%) of systems on the development and evaluation trials of ASVspoof 2019 LA and PA scenarios. For each model, the lowest EER is highlighted in bold (comparison including Fusion) or is underlined (comparison excluding Fusion).

| Model | Method | LA, M = 200 (Dev. / Eval.) | LA, M = 400 (Dev. / Eval.) | LA, M = 600 (Dev. / Eval.) | PA, M = 200 (Dev. / Eval.) | PA, M = 400 (Dev. / Eval.) | PA, M = 600 (Dev. / Eval.) |
|---|---|---|---|---|---|---|---|
| SE-ResNet | Baseline | 0 / 14.099 | 0 / 10.388 | 0 / 12.551 | 2.152 / 3.197 | 0.811 / 1.255 | 0.519 / 0.985 |
| | concat | 0 / 15.486 | 0 / 11.148 | 0 / 9.410 | 1.261 / 2.641 | 0.370 / 0.945 | 0.612 / 1.007 |
| | vmax | 0 / 13.635 | 0 / 12.157 | 0 / 10.633 | 1.279 / 2.118 | 0.704 / 1.327 | 0.998 / 1.975 |
| | vmean | 0 / 13.637 | 0 / 9.081 | 0 / 12.211 | 1.573 / 3.300 | 0.593 / 1.337 | 0.926 / 1.702 |
| | fmax | 0 / 15.812 | 0 / 9.868 | 0 / 13.353 | 1.388 / 2.875 | 0.534 / 1.105 | 0.484 / 0.934 |
| | 2ch | 0 / 13.923 | 0 / 11.434 | 0 / 12.498 | 2.538 / 4.378 | 0.597 / 1.063 | 0.316 / 0.912 |
| | Fusion | 0 / 13.131 | 0 / 9.001 | 0 / 10.236 | 0.737 / 1.940 | 0.261 / 0.653 | 0.168 / 0.614 |
| X-vector Network (TDNN) | Baseline | 0.009 / 17.321 | 0 / 13.395 | 0 / 6.947 | 3.965 / 8.297 | 1.797 / 3.538 | 0.908 / 1.664 |
| | concat | 0.009 / 15.690 | 0 / 7.900 | 0 / 6.718 | 1.762 / 3.964 | 0.667 / 1.520 | 0.834 / 1.366 |
| | vmax | 0 / 14.750 | 0.002 / 8.171 | 0 / 8.811 | 1.296 / 3.936 | 0.721 / 1.072 | 0.721 / 1.056 |
| | vmean | 0.082 / 16.558 | 0 / 11.081 | 0 / 7.763 | 2.170 / 3.897 | 0.797 / 1.708 | 0.856 / 1.520 |
| | statc | 0.040 / 16.927 | 0 / 8.620 | 0.002 / 12.143 | 2.037 / 3.892 | 0.926 / 1.752 | 0.797 / 1.156 |
| | Fusion | 0.004 / 14.451 | 0 / 8.389 | 0 / 8.552 | 0.628 / 2.205 | 0.444 / 0.950 | 0.610 / 0.902 |
| DenseNet | Baseline | 0 / 16.232 | 0 / 10.130 | 0 / 9.818 | 2.296 / 3.155 | 0.850 / 1.360 | 0.337 / 0.713 |
| | concat | 0 / 12.120 | 0 / 7.139 | 0 / 11.405 | 0.854 / 1.593 | 0.353 / 0.658 | 0.296 / 0.580 |
| | vmax | 0 / 11.109 | 0 / 10.744 | 0 / 8.743 | 0.700 / 1.116 | 0.366 / 0.857 | 0.222 / 0.492 |
| | vmean | 0 / 12.509 | 0 / 12.628 | 0 / 12.158 | 1.090 / 1.708 | 0.296 / 0.564 | 0.238 / 0.476 |
| | fmax | 0 / 14.279 | 0 / 14.535 | 0 / 11.828 | 0.630 / 1.609 | 0.333 / 0.845 | 0.242 / 0.508 |
| | 2ch | 0 / 13.963 | 0 / 9.258 | 0 / 11.951 | 2.164 / 2.798 | 0.370 / 0.956 | 0.226 / 0.542 |
| | Fusion | 0 / 11.027 | 0 / 7.410 | 0 / 10.023 | 0.444 / 0.885 | 0.148 / 0.425 | 0.129 / 0.303 |
| MobileNetV2 | Baseline | 0 / 19.511 | 0 / 11.299 | 0 / 8.292 | 2.667 / 4.666 | 1.076 / 2.233 | 0.409 / 0.785 |
| | concat | 0 / 13.584 | 0 / 8.307 | 0 / 10.414 | 1.649 / 3.466 | 0.667 / 1.597 | 0.386 / 0.878 |
| | vmax | 0 / 12.794 | 0 / 10.223 | 0 / 8.767 | 1.333 / 3.217 | 1.022 / 2.206 | 0.279 / 0.872 |
| | vmean | 0 / 12.387 | 0 / 8.280 | 0 / 10.416 | 1.185 / 2.996 | 0.799 / 1.873 | 0.501 / 1.039 |
| | fmax | 0 / 13.433 | 0 / 9.950 | 0 / 8.768 | 1.076 / 3.228 | 0.756 / 1.592 | 0.388 / 0.928 |
| | 2ch | 0 / 14.167 | 0 / 10.727 | 0 / 8.537 | 1.889 / 4.533 | 0.337 / 0.962 | 0.279 / 0.901 |
| | Fusion | 0 / 10.809 | 0 / 7.815 | 0 / 8.363 | 0.741 / 2.387 | 0.407 / 0.978 | 0.185 / 0.520 |
| ShuffleNetV2 | Baseline | 0 / 20.479 | 0 / 11.748 | 0 / 11.051 | 4.170 / 4.374 | 1.392 / 2.255 | 0.904 / 1.382 |
| | concat | 0 / 13.664 | 0 / 9.433 | 0 / 8.849 | 1.240 / 2.161 | 0.462 / 0.917 | 0.388 / 0.867 |
| | vmax | 0 / 19.443 | 0 / 10.265 | 0 / 10.007 | 1.146 / 2.102 | 0.503 / 0.823 | 0.388 / 0.746 |
| | vmean | 0 / 17.975 | 0 / 9.705 | 0 / 9.912 | 1.390 / 2.366 | 0.335 / 0.869 | 0.353 / 0.696 |
| | fmax | 0 / 18.204 | 0 / 9.571 | 0 / 10.048 | 1.094 / 2.123 | 0.514 / 1.128 | 0.412 / 0.779 |
| | 2ch | 0.002 / 19.798 | 0 / 10.472 | 0.038 / 9.764 | 2.168 / 3.737 | 0.739 / 1.260 | 0.593 / 0.951 |
| | Fusion | 0 / 13.923 | 0 / 8.744 | 0 / 8.620 | 0.760 / 1.581 | 0.279 / 0.603 | 0.207 / 0.492 |
| MNASNet | Baseline | 0 / 19.050 | 0 / 10.608 | 0 / 11.205 | 2.263 / 5.052 | 0.834 / 2.232 | 0.353 / 0.995 |
| | concat | 0 / 11.652 | 0 / 10.249 | 0 / 11.624 | 1.207 / 3.570 | 0.255 / 0.912 | 0.370 / 0.614 |
| | vmax | 0 / 17.362 | 0 / 11.895 | 0 / 11.107 | 1.630 / 4.416 | 0.353 / 1.156 | 0.168 / 0.718 |
| | vmean | 0 / 14.333 | 0 / 7.207 | 0 / 8.863 | 1.316 / 4.388 | 0.279 / 1.321 | 0.164 / 0.685 |
| | fmax | 0 / 15.894 | 0 / 7.570 | 0 / 11.327 | 1.037 / 3.642 | 0.316 / 0.969 | 0.152 / 0.615 |
| | 2ch | 0 / 17.933 | 0 / 9.341 | 0 / 10.331 | 1.540 / 3.687 | 0.409 / 1.238 | 0.240 / 0.857 |
| | Fusion | 0 / 13.705 | 0 / 7.029 | 0 / 9.463 | 0.663 / 2.732 | 0.148 / 0.636 | 0.094 / 0.359 |
Table 2. Oracle EERs (%) of systems on the development and evaluation trials of ASVspoof 2019 LA and PA scenarios, with the proposed and ST methods. For each model, the lowest EER is highlighted in bold.

| Model | Method | LA, Proposed (Dev. / Eval.) | LA, ST (Dev. / Eval.) | PA, Proposed (Dev. / Eval.) | PA, ST (Dev. / Eval.) |
|---|---|---|---|---|---|
| SE-ResNet | concat | 0 / 7.967 | 0 / 8.484 | 0.370 / 0.708 | 1.329 / 1.835 |
| | vmax | 0 / 9.517 | 0 / 8.989 | 0.704 / 1.327 | 0.983 / 1.520 |
| | vmean | 0 / 9.081 | 0 / 9.012 | 0.593 / 1.156 | 0.945 / 1.414 |
| | fmax | 0 / 7.576 | 0 / 8.893 | 0.534 / 1.105 | 1.238 / 1.913 |
| | 2ch | 0 / 9.261 | 0 / 7.941 | 0.597 / 1.040 | 1.403 / 1.896 |
| | average | 0 / 8.680 | 0 / 8.664 | 0.560 / 1.067 | 1.180 / 1.716 |
| X-vector Network (TDNN) | concat | 0 / 6.352 | 0 / 8.267 | 0.667 / 1.493 | 1.760 / 3.108 |
| | vmax | 0.002 / 6.741 | 0.004 / 8.674 | 0.721 / 1.027 | 1.630 / 2.665 |
| | vmean | 0 / 7.804 | 0 / 10.524 | 0.797 / 1.497 | 1.817 / 3.814 |
| | statc | 0 / 6.309 | 0 / 9.488 | 0.926 / 1.581 | 1.630 / 2.924 |
| | average | 0.001 / 6.802 | 0.001 / 9.238 | 0.778 / 1.400 | 1.709 / 3.128 |
| DenseNet | concat | 0 / 6.868 | 0 / 6.472 | 0.353 / 0.542 | 0.908 / 1.194 |
| | vmax | 0 / 6.717 | 0 / 6.499 | 0.366 / 0.719 | 1.133 / 1.414 |
| | vmean | 0 / 7.981 | 0 / 7.108 | 0.296 / 0.564 | 1.148 / 1.587 |
| | fmax | 0 / 7.628 | 0 / 5.944 | 0.333 / 0.702 | 0.908 / 1.332 |
| | 2ch | 0 / 7.828 | 0 / 7.548 | 0.370 / 0.796 | 0.737 / 0.961 |
| | average | 0 / 7.404 | 0 / 6.714 | 0.344 / 0.665 | 0.967 / 1.298 |
| MobileNetV2 | concat | 0 / 6.067 | 0 / 6.717 | 0.667 / 1.398 | 1.004 / 2.034 |
| | vmax | 0 / 7.439 | 0 / 5.710 | 1.022 / 2.150 | 0.595 / 1.731 |
| | vmean | 0 / 7.533 | 0 / 8.239 | 0.799 / 1.847 | 1.037 / 2.311 |
| | fmax | 0 / 6.186 | 0 / 6.172 | 0.756 / 1.371 | 0.815 / 1.979 |
| | 2ch | 0 / 6.242 | 0 / 7.738 | 0.337 / 0.935 | 0.980 / 2.217 |
| | average | 0 / 6.693 | 0 / 6.915 | 0.716 / 1.540 | 0.886 / 2.054 |
| ShuffleNetV2 | concat | 0 / 6.609 | 0 / 9.166 | 0.462 / 0.885 | 1.649 / 1.924 |
| | vmax | 0 / 7.032 | 0 / 6.564 | 0.503 / 0.823 | 1.630 / 2.299 |
| | vmean | 0 / 6.405 | 0 / 7.136 | 0.335 / 0.869 | 1.353 / 2.278 |
| | fmax | 0 / 7.454 | 0 / 7.330 | 0.514 / 1.061 | 1.294 / 1.869 |
| | 2ch | 0 / 8.786 | 0 / 8.144 | 0.739 / 1.227 | 1.702 / 2.846 |
| | average | 0 / 7.257 | 0 / 7.668 | 0.511 / 0.973 | 1.526 / 2.243 |
| MNASNet | concat | 0 / 6.080 | 0 / 6.079 | 0.255 / 0.774 | 0.908 / 1.974 |
| | vmax | 0 / 5.398 | 0 / 5.452 | 0.353 / 0.818 | 0.651 / 1.951 |
| | vmean | 0 / 5.384 | 0 / 7.098 | 0.279 / 0.923 | 0.963 / 2.427 |
| | fmax | 0 / 4.840 | 0 / 4.756 | 0.316 / 0.846 | 0.776 / 2.095 |
| | 2ch | 0 / 6.840 | 0 / 6.349 | 0.409 / 1.039 | 0.926 / 2.869 |
| | average | 0 / 5.708 | 0 / 5.947 | 0.322 / 0.880 | 0.845 / 2.263 |
Table 3. Oracle EERs (%) of systems in the development and evaluation trials of ASVspoof 2019 LA and PA, with four different methods (i.e., baseline, BO, Augment, and proposed). For each model, the lowest EER is highlighted in bold.

| Model | Method | LA (Dev. / Eval.) | PA (Dev. / Eval.) |
|---|---|---|---|
| SE-ResNet | Baseline | 0 / 8.211 | 0.811 / 1.255 |
| | BO | 0 / 8.347 | 1.279 / 1.763 |
| | Augment | 0 / 7.940 | 1.316 / 2.443 |
| | Proposed | 0 / 8.680 | 0.560 / 1.067 |
| X-vector Network (TDNN) | Baseline | 0 / 8.157 | 1.797 / 3.538 |
| | BO | 0 / 9.557 | 1.464 / 2.858 |
| | Augment | 0 / 10.415 | 1.595 / 4.145 |
| | Proposed | 0 / 6.802 | 0.778 / 1.400 |
| DenseNet | Baseline | 0 / 9.095 | 0.850 / 1.360 |
| | BO | 0 / 6.705 | 1.131 / 1.581 |
| | Augment | 0 / 6.947 | 1.094 / 1.713 |
| | Proposed | 0 / 7.404 | 0.344 / 0.776 |
| MobileNetV2 | Baseline | 0 / 7.695 | 1.076 / 1.885 |
| | BO | 0 / 7.791 | 0.945 / 1.763 |
| | Augment | 0 / 7.043 | 0.926 / 2.521 |
| | Proposed | 0 / 6.693 | 0.716 / 1.540 |
| ShuffleNetV2 | Baseline | 0 / 7.303 | 1.392 / 2.102 |
| | BO | 0 / 8.985 | 1.514 / 2.024 |
| | Augment | 0 / 6.293 | 1.499 / 2.549 |
| | Proposed | 0 / 7.257 | 0.511 / 0.973 |
| MNASNet | Baseline | 0 / 6.092 | 0.834 / 2.224 |
| | BO | 0 / 6.960 | 0.834 / 2.100 |
| | Augment | 0 / 7.400 | 0.867 / 2.632 |
| | Proposed | 0 / 5.708 | 0.322 / 0.880 |
Table 4. Oracle EERs (%) of systems (excluding the x-vector network) in the development and evaluation trials of ASVspoof 2019 LA and PA, with the methods of 2ch and 2ch_s. For each model, the lowest EER is highlighted in bold.

| Model | Method | LA (Dev. / Eval.) | PA (Dev. / Eval.) |
|---|---|---|---|
| SE-ResNet | 2ch | 0 / 9.261 | 0.597 / 1.040 |
| | 2ch_s | 0 / 8.226 | 0.996 / 1.731 |
| DenseNet | 2ch | 0 / 7.828 | 0.370 / 0.796 |
| | 2ch_s | 0 / 9.257 | 1.203 / 1.527 |
| MobileNetV2 | 2ch | 0 / 6.242 | 0.337 / 0.935 |
| | 2ch_s | 0 / 8.267 | 0.869 / 1.863 |
| ShuffleNetV2 | 2ch | 0 / 8.786 | 0.739 / 1.227 |
| | 2ch_s | 0 / 7.775 | 1.538 / 1.958 |
| MNASNet | 2ch | 0 / 6.840 | 0.409 / 1.039 |
| | 2ch_s | 0 / 8.620 | 0.869 / 1.885 |
Table 5. Oracle EERs (%) of systems in the development and evaluation trials of ASVspoof 2019 LA and PA, with three different methods (i.e., baseline, OTF, and proposed). For each model, the lowest EER is highlighted in bold.

| Model | Method | LA (Dev. / Eval.) | PA (Dev. / Eval.) |
|---|---|---|---|
| SE-ResNet | Baseline | 0 / 8.211 | 0.811 / 1.255 |
| | OTF | 0 / 9.626 | 1.292 / 1.787 |
| | Proposed | 0 / 8.680 | 0.560 / 1.067 |
| X-vector Network (TDNN) | Baseline | 0 / 8.157 | 1.797 / 3.538 |
| | OTF | 0.002 / 8.158 | 1.723 / 3.455 |
| | Proposed | 0 / 6.802 | 0.778 / 1.400 |
| DenseNet | Baseline | 0 / 9.095 | 0.850 / 1.360 |
| | OTF | 0 / 7.162 | 1.111 / 1.393 |
| | Proposed | 0 / 7.404 | 0.344 / 0.776 |
| MobileNetV2 | Baseline | 0 / 7.695 | 1.076 / 1.885 |
| | OTF | 0 / 7.163 | 1.111 / 2.498 |
| | Proposed | 0 / 6.693 | 0.716 / 1.540 |
| ShuffleNetV2 | Baseline | 0 / 7.303 | 1.392 / 2.102 |
| | OTF | 0 / 10.200 | 1.589 / 2.493 |
| | Proposed | 0 / 7.257 | 0.511 / 0.973 |
| MNASNet | Baseline | 0 / 6.092 | 0.834 / 2.224 |
| | OTF | 0 / 5.914 | 0.996 / 2.709 |
| | Proposed | 0 / 5.708 | 0.322 / 0.880 |