New Acoustic Features for Synthetic and Replay Spoofing Attack Detection

Abstract: With the rapid development of intelligent speech technologies, automatic speaker verification (ASV) has become one of the most natural and convenient biometric speaker recognition approaches. However, most state-of-the-art ASV systems are vulnerable to spoofing attack techniques, such as speech synthesis, voice conversion, and replay speech. Due to the symmetric distribution characteristic between the genuine (true) speech and spoof (fake) speech pair, spoofing attack detection is challenging. Many recent research works have focused on ASV anti-spoofing solutions. This work investigates two types of new acoustic features to improve the performance of spoofing attack detection. The first features consist of two cepstral coefficients and one LogSpec feature, which are extracted from the linear prediction (LP) residual signals. The second feature is a harmonic and noise subband ratio feature, which can reflect the difference in the interaction movement of the vocal tract and glottal airflow between genuine and spoofing speech. The significance of these new features has been investigated in both the t-distributed stochastic neighbor embedding (t-SNE) space and the binary classification modeling space. Experiments on the ASVspoof 2019 database show that the proposed residual features can achieve from 7% to 51.7% relative equal error rate (EER) reduction on the development and evaluation sets over the best single-system baseline. Furthermore, more than 31.2% relative EER reduction on both the development and evaluation sets shows that the proposed new features contain substantial information complementary to the source acoustic features.


Introduction
In recent years, the performance of automatic speaker verification (ASV) systems has been significantly improved, and ASV has become one of the most natural and convenient biometric speaker recognition methods. ASV technologies are now widely deployed in many diverse applications and services, such as call centers, intelligent personalized services, payment security, and access authentication. However, many studies in recent years have found that state-of-the-art ASV systems fail to handle speech spoofing attacks, especially three major types of attacks: replay [1], speech synthesis [2,3], and voice conversion. Therefore, great efforts are required to ensure adequate protection of ASV systems against spoofing. Due to the symmetric distribution characteristic between the genuine speech and spoof speech pair, spoofing attack detection is challenging. As commonly used acoustic features of genuine and spoof speech share many similarities, specifically designed features for spoofing attack detection are in great demand.
Replay attacks refer to using pre-recorded speech samples collected from the genuine target speakers to attack ASV systems. They are the simplest and most accessible attacks, and they pose huge threats to ASV because of the availability of high-quality, low-cost recording devices [1]. Unlike replay attacks, speech synthesis and voice conversion attacks are usually produced by sophisticated speech processing techniques. In these attacks, state-of-the-art text-to-speech synthesis (TTS) systems are utilized to generate artificial speech signals, while voice conversion (VC) systems are utilized to adapt a given natural speech to target speakers. Given sufficient training data, both speech synthesis and voice conversion techniques can produce high-quality spoofing signals that deceive ASV systems [2].
To support research on anti-spoofing and provide common platforms for the assessment and comparison of spoofing countermeasures, the Automatic Speaker Verification Spoofing and Countermeasures (ASVSpoof) challenges have been organized by the ASV community every two years since 2015. The most recent, ASVSpoof 2019, was the first to consider the replay, speech synthesis, and voice conversion attacks within a single challenge [2]. In ASVSpoof 2019, these attacks are divided into the logical access (LA; speech synthesis and voice conversion attacks) and physical access (PA; replay attacks) scenarios according to their use cases (https://www.asvspoof.org/ (accessed on 1 November 2021)).
In this study, we focus on using the ASVSpoof 2019 database to explore automatic spoofing detection methods for both logical and physical access spoofing attacks. We investigate two types of new acoustic features to enhance the discriminative information between bonafide and spoofed speech utterances.
The first proposed features consist of two cepstral coefficients and one log spectrum feature, which are extracted from the linear prediction (LP) residual signals of speech utterances. The LP residual signal of each utterance implicitly contains its excitation source information. Although the latest TTS and VC algorithms can produce spoofed speech that is perceptually indistinguishable from bonafide speech under some well-controlled conditions, there is still a big difference in the excitation source signals between such spoofed speech and bonafide speech. Moreover, from the analysis of replayed speech, we find that most of the playback device distortions occur in the excitation signals, while the speech spectrum trajectories are affected much less. Therefore, compared with typical acoustic features that are directly extracted from the speech signal, we expect the cepstral features extracted from the residual signals to capture better discrimination between bonafide and spoofed speech, for synthetic and voice-converted attacks as well as for replayed attacks.
The second new feature is the harmonic and noise subband ratio (HNSR) feature. It is motivated by the conventional Harmonic plus Noise Model (HNM) [4] for high-quality speech analysis. In the HNM, the speech signal is decomposed into a deterministic component and a stochastic (noise or residual) component. Based on this decomposition, we assume that the difference in the interaction movement of the vocal tract and glottal airflow between bonafide and spoofed speech can be reflected, and made distinguishable, in the HNSR features. The usefulness of the proposed features is examined in both the t-distributed stochastic neighbor embedding (t-SNE) space and the spoofing detection modeling space.
Besides these two new features, this paper also compares various single features on two major baseline systems: a Gaussian mixture model (GMM)-based system and a Light Convolutional Neural Network (LCNN)-based system. Then, to verify the complementarity between different features, score-level fusion systems are proposed.

Related Work
In the literature, the continuously organized ASVSpoof challenges [2,5,6] have resulted in a large number of spoofing detection countermeasures. These countermeasures can be divided into two main categories: binary classification modeling algorithms and new discriminative features, for either replayed or synthetic/voice-converted speech detection.

Classifiers
At the classifier level, the algorithms and architectures used for replay speech and synthetic or converted speech detection are almost the same in recent ASVSpoof challenges. They can be categorized into shallow models and deep models. In the literature, the Gaussian mixture model (GMM) is the most widely used shallow model in spoofing detection works [1,[7][8][9][10][11][12]. Most of the deep models are based on DNNs [13][14][15], recurrent neural networks [16,17], and convolutional neural networks (CNNs) [18,19]. It has been found that the deep models perform better than the shallow models given the same input features in in-domain ASV tasks. However, our previous work [20] found that the performance gains obtained on the development set were very difficult to generalize to the evaluation set; the shallow GMM models showed better robustness to unseen or new conditions in our work on the ASVSpoof 2019 challenge.

Features
For replay speech detection, previous works mainly focus on extracting new features that reflect the acoustic-level differences between original and replayed speech. These differences result from the recording or playback devices and the recording environments. Ref. [21] explored the re-recording distortions by using the constant Q cepstral coefficients (CQCC) in the high-frequency sub-band (6-8 kHz). Ref. [22] proposed to extract phase information by incorporating the Constant Q Transform (CQT) with a Modified Group Delay (MGD) on ASVSpoof 2019. Ref. [23] extracted mel-frequency cepstral coefficients (MFCC) from the linear prediction residual signal to capture the playback device distortions. Ref. [24] proposed a low-frequency frame-wise normalization in the CQT domain to capture the artifacts in playback speech. Though MFCC is a widely used feature for speech-related tasks, it is not utilized in the two ASVSpoof baseline systems considered in this work. Besides these hand-crafted features, deep features extracted with neural networks have also been investigated to detect playback speech. For instance, convolutional neural networks (CNNs) were used to learn deep features from the group delay [25] and from Siamese embeddings of the spectrogram [26]. In [27], DNN-based frame-level and RNN-based sequence-level features were extracted to capture a better representation of playback distortions. In general, due to the strong feature modeling ability of deep neural networks, these deep features are more effective than hand-crafted features on in-domain ASV tasks. However, these features may not generalize easily to out-of-domain ASV tasks, because their extraction is highly dependent on the DNN model training data.
To capture the artifacts introduced during TTS and VC speech manipulation, new features mainly focus on both the acoustic and prosodic levels. For instance, the modulation features from the magnitude and phase spectrum proposed in [28] were used to detect temporal artifacts caused by frame-by-frame speech synthesis processing. The best system [29] in ASVSpoof 2015 used a combination of standard MFCCs and cochlear filter cepstral coefficients (CFCCs) with the change in instantaneous frequency (IF) to discriminate between bonafide and spoofed speech. As the phase information is almost entirely lost in speech spoofed with current synthesis/conversion techniques, a modified group delay-based feature and the frequency derivative of the phase spectrum were explored in [7]. In [8], fundamental frequency variation features were proposed to capture the prosodic difference between bonafide and spoofed speech. In the ASVSpoof challenges, many acoustic-level features have proved effective for both logical and physical spoofing speech detection; typical examples are the CQCC [9] and linear frequency cepstral coefficient (LFCC) [30] features that have been used to build the ASVSpoof 2019 baselines.
In this study, we also focus on exploring new acoustic features to capture the artifacts in logical and physical spoofing speech detection. The CQCC and LFCC features extracted from the LP residual signals, and the harmonic and noise subband ratio feature are first investigated together with the shallow GMM classifier. Then we validate the effectiveness of using deep CNNs to model the spectrum of natural speech and its residual signals. Details of all these new features are presented in the next sections.

Linear Prediction Residual Modeling and Analysis
In the LP model of speech, the bonafide speech signal S_t(n) is formulated as

S_t(n) = Σ_{k=1}^{p} a_k S_t(n − k) + r_t(n) = Ŝ_t(n) + r_t(n),    (1)

where Ŝ_t(n) models the vocal-tract component of the bonafide speech signal in terms of the LP coefficients a_k, k = 1, 2, . . . , p, and the prediction error r_t(n), called the LP residual signal, models the excitation component. As discussed in [23], the replayed speech S_r(n) is the convolution of the input speech S_t(n) with the impulse response i(n) of the playback device,

S_r(n) = S_t(n) ∗ i(n).    (2)

According to Equations (1) and (2), S_r(n) can be expanded to

S_r(n) = [Ŝ_t(n) + r_t(n)] ∗ i(n),    (3)

which, by distributing the convolution, becomes

S_r(n) = Ŝ_t(n) ∗ i(n) + r_t(n) ∗ i(n).    (4)

Equation (4) can be expanded to Equation (5) as

S_r(n) = Σ_{k=1}^{p} a_k S_t(n − k) ∗ i(n) + r_t(n) ∗ i(n).    (5)

Applying LP analysis directly to the replayed signal gives a simplified version of Equation (5),

S_r(n) = Ŝ_r(n) + r_r(n),    (6)

where the predicted part of Equation (6) is

Ŝ_r(n) = Σ_{k=1}^{p} c_k S_r(n − k),    (7)

i.e., Equation (7) is the part of Equation (6) generated by linear prediction, and the other part of Equation (6) is the residual signal.
Here, Ŝ_r(n) models the vocal-tract component of the replayed speech in terms of the LP coefficients c_k, k = 1, 2, . . . , p, and r_r(n) corresponds to the excitation component. It is clear from Equation (5) that both the vocal-tract and excitation source components of the bonafide speech S_t(n) are affected by the characteristics of the playback device i(n). Detailed analysis in [23] showed that, compared with the vocal-tract component, the source component is relatively more affected. This indicates that the residual signal affected by i(n) differs more from that of bonafide speech, which makes it advantageous for detecting whether speech is bonafide or replayed.
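To make the inverse-filtering step concrete, the following NumPy sketch computes an LP residual for one frame using the autocorrelation (Levinson-Durbin) method. This is a minimal illustration under assumed settings: the function names, LP order, and frame length are illustrative choices, not the exact configuration used in our experiments.

```python
import numpy as np

def lp_coefficients(frame, order):
    """Estimate LP coefficients a_1..a_p via the autocorrelation
    (Levinson-Durbin) method."""
    # autocorrelation lags r[0..order] of the frame
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = np.zeros(order)
    e = r[0]
    for i in range(order):
        # reflection coefficient for step i+1
        acc = r[i + 1] - np.dot(a[:i], r[i::-1][:i])
        k = acc / e
        a_new = a.copy()
        a_new[i] = k
        a_new[:i] = a[:i] - k * a[i - 1::-1][:i]
        a = a_new
        e *= (1.0 - k * k)
    return a

def lp_residual(frame, order=16):
    """Inverse-filter the frame with its own LP model:
    r(n) = s(n) - sum_k a_k s(n-k)."""
    a = lp_coefficients(frame, order)
    pred = np.zeros_like(frame)
    for k in range(1, order + 1):
        pred[k:] += a[k - 1] * frame[:-k]
    return frame - pred
```

For a well-modeled (near-autoregressive) voiced frame, the residual energy is much lower than the frame energy; it is this residual, rather than the frame itself, that the RCQCC/RLFCC features are computed from.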
For synthesized speech, as most of the latest TTS and VC techniques are parametric approaches rather than the traditional unit selection and waveform concatenation approaches, there is still a big difference in the excitation source signals between the synthesized spoofing speech and bonafide speech. Figures 1 and 2 demonstrate the time- and spectral-domain representations, and the corresponding LP residuals, of a bonafide speech sample and its corresponding replayed and synthesized versions, respectively. There is a clear difference between the time-domain comparisons under the PA and LA scenarios. Compared with the large temporal differences observed in Figure 2, the temporal differences between the bonafide speech and the replayed one, and between their corresponding LP residuals, are much smaller. This indicates that the excitation variations introduced by speech synthesis methods are much larger than those introduced by the impulse response of the playback device.

Conventional Cepstral Features
The constant Q cepstral coefficients (CQCCs) [9] and linear frequency cepstral coefficients (LFCCs) [30] are two types of effective conventional acoustic features used for anti-spoofing speech detection in ASV tasks. Both have been widely used in previous ASVSpoof challenges, such as ASVSpoof 2015 and 2017. A brief description of these two features is presented in the next subsections.

CQCC Features
The extraction of CQCCs is summarized in Figure 3. First, we calculate the constant-Q transform (CQT) X^{CQ}(k, n) of the input discrete time-domain signal x(n) as

X^{CQ}(k, n) = Σ_{j = n − ⌊N_k/2⌋}^{n + ⌊N_k/2⌋} x(j) a_k^*(j − n + N_k/2),

where k = 1, 2, . . . , K is the frequency bin index, a_k^*(n) is the complex conjugate of a_k(n), and N_k are variable window lengths. The a_k(n) are complex-valued time-frequency atoms; their definition can be found in [9]. The notation ⌊·⌋ denotes rounding down to the nearest integer. Unlike the regularly spaced frequency bins used in the standard short-time Fourier transform (STFT), the CQT uses geometrically spaced frequency bins. As a result, it offers higher frequency resolution at lower frequencies and higher temporal resolution at higher frequencies. After the CQT, the log power spectrum of X^{CQ}(k, n) is computed before performing a uniform re-sampling. Cepstral analysis cannot be applied directly to the CQT, because its frequency bins are on a different scale from those of the basis functions of the discrete cosine transform (DCT). By applying the uniform re-sampling, the geometric frequency scale is converted into a linear scale suitable for conventional cepstral analysis. Finally, on the linear scale, the DCT is applied to obtain the CQCCs as

CQCC(p) = Σ_{l=0}^{L−1} log |X^{CQ}(l)|^2 cos[ p (l + 1/2) π / L ],

where p = 0, 1, . . . , L − 1, and l is the index of the newly re-sampled frequency bins. For more details of the CQCC extraction, please refer to [9].
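The CQT-then-resample-then-DCT pipeline can be illustrated with the following simplified NumPy sketch. The naive per-bin correlation, the 12-bins-per-octave resolution, and the interpolation-based uniform re-sampling are deliberate simplifications of the full CQCC toolbox described in [9]; all names and parameter values here are illustrative.

```python
import numpy as np

def simple_cqt_frame(x, fs, fmin=15.0, bins_per_octave=12, n_bins=60):
    """Naive constant-Q analysis of one signal block: for each
    geometrically spaced centre frequency f_k, correlate the signal with
    a windowed complex exponential whose length N_k ~ Q * fs / f_k."""
    Q = 1.0 / (2 ** (1.0 / bins_per_octave) - 1)
    out = np.zeros(n_bins)
    for k in range(n_bins):
        fk = fmin * 2 ** (k / bins_per_octave)
        Nk = min(len(x), int(round(Q * fs / fk)))
        n = np.arange(Nk)
        atom = np.hamming(Nk) * np.exp(-2j * np.pi * fk * n / fs) / Nk
        out[k] = np.abs(np.dot(x[:Nk], atom))
    return out

def cqcc_frame(x, fs, n_ceps=30, n_linear=128):
    """CQT magnitudes -> log power -> uniform re-sampling of the
    geometric frequency axis -> DCT, as in the CQCC pipeline."""
    mag = simple_cqt_frame(x, fs)
    logspec = np.log(mag ** 2 + 1e-12)
    # uniform re-sampling: interpolate the log spectrum onto a linear grid
    geo = np.geomspace(1.0, 2.0, len(logspec))
    lin = np.linspace(1.0, 2.0, n_linear)
    resampled = np.interp(lin, geo, logspec)
    # DCT-II basis for the cepstral coefficients
    l = np.arange(n_linear)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * l + 1)) / (2 * n_linear))
    return basis @ resampled
```

A pure tone fed into `simple_cqt_frame` peaks at the geometric bin closest to its frequency, which is the constant-Q behavior the cepstral stage relies on.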

LFCC
Like mel-frequency cepstral coefficients (MFCCs), linear frequency cepstral coefficients (LFCCs) [30] are extracted with triangular filters, but the filters are spaced on a linear scale rather than the mel scale, as illustrated in Figure 4. The power spectrum is first integrated using overlapping band-pass filters, and then logarithmic compression followed by the DCT is performed to produce the cepstral coefficients.
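The LFCC pipeline just described (power spectrum, linear triangular filter bank, logarithm, DCT) can be sketched as follows; the frame length, FFT size, and filter counts are illustrative defaults rather than the exact baseline configuration.

```python
import numpy as np

def lfcc(frame, fs=16000, nfft=512, n_filters=20, n_ceps=20):
    """LFCC: power spectrum -> linearly spaced triangular filters ->
    log -> DCT (same pipeline as MFCC but with a linear filter scale)."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), nfft)) ** 2
    # triangular filter edge bins, evenly spaced from 0 to Nyquist
    edges = np.linspace(0, nfft // 2, n_filters + 2).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        fbank[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    logE = np.log(fbank @ spec + 1e-12)
    # DCT-II of the log filter-bank energies
    l = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * l + 1)) / (2 * n_filters))
    return basis @ logE
```

Swapping the `edges` grid for a mel-warped one would turn this into an MFCC extractor, which is exactly the one design difference noted above.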

Residual Cepstral Coefficients
Motivated by the discussion in Section 3.1 on the potential effectiveness of the residual signal for detecting replayed and synthesized speech, in this section we investigate extracting the conventional acoustic features, CQCCs and LFCCs, from the LP residual signals instead of the raw audio. These new features are termed residual CQCC (RCQCC) and residual LFCC (RLFCC), respectively. Figure 5 shows the detailed block diagram of RCQCC feature extraction. Given a speech segment s(n), it is first segmented into overlapping short-time frames using a Hamming window. Then we perform p-th order linear prediction analysis and inverse filtering to obtain the residual signal r(n) frame by frame. These residual frames are then taken as the input signal to extract the D-dimensional (D = 30) static CQCC features as shown in Figure 3, followed by a ∆ + ∆∆ operation to obtain the 2D-dimensional dynamic features. The static and dynamic features are concatenated to form the final 3D-dimensional RCQCC acoustic features used to train the spoofing detection classifiers.

Figure 6 demonstrates the extraction diagram of the RLFCC features. The frame-by-frame residual signal r(n) is obtained in the same way as for the RCQCCs. Then the DFT is used to transform the time-domain residual signal into the spectral domain. After applying the linear-frequency scale uniform triangular band-pass filter banks to the power spectrum, the discrete cosine transform (DCT) is applied to the logarithm of the subband energies obtained from the triangular filter banks to get the LFCC features. As with the RCQCCs, we also extract the ∆ + ∆∆ dynamic features to form the final RLFCC features.
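The ∆ + ∆∆ stacking shared by the RCQCC and RLFCC pipelines can be sketched as follows. This is a common two-adjacent-frame delta regression; the function names are illustrative and the regression width is an assumption consistent with the two-adjacent-frame setup described later for the baseline features.

```python
import numpy as np

def deltas(feats, width=1):
    """Delta regression over +/- `width` neighboring frames
    (width=1 means two adjacent frames), with edge padding."""
    padded = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    num = sum(w * (padded[width + w: len(feats) + width + w] -
                   padded[width - w: len(feats) + width - w])
              for w in range(1, width + 1))
    den = 2 * sum(w * w for w in range(1, width + 1))
    return num / den

def add_dynamics(static):
    """Stack static + delta + delta-delta, turning D-dimensional
    static features into the final 3D-dimensional vectors."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return np.hstack([static, d1, d2])
```

Applied to 30-dimensional static RCQCCs this yields the 90-dimensional vectors, and to 20-dimensional static RLFCCs the 60-dimensional vectors, used in the experiments.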

t-SNE Feature Visualization
Before using the above proposed residual features to build ASVSpoof systems, in this section the discrimination ability between genuine and spoof speech segments is investigated using t-distributed stochastic neighbor embedding (t-SNE) [31] visualization. Figures 7 and 8 demonstrate the effectiveness of the proposed residual acoustic features, RCQCCs and RLFCCs, over the standard CQCCs and LFCCs under the logical access and physical access scenarios, respectively. In both t-SNE figures, the two-dimensional visualizations of LFCCs and RLFCCs are transformed from the 60-dimensional raw features (including ∆ + ∆∆) as described in Sections 3.2.2 and 3.3, while the two-dimensional visualizations of CQCCs and RCQCCs are transformed from the 90-dimensional raw features (including ∆ + ∆∆). From both Figures 7 and 8, we see that there is a significant overlap between the genuine and spoof speech segments in both the CQCC (subfigure (c)) and LFCC (subfigure (a)) feature spaces, under both the LA and PA scenarios. However, as sub-figures (b) and (d) show, the acoustic features of the residual signals of genuine and spoof speech are clearly separated, especially the RCQCCs and RLFCCs of the examples under the PA scenario. This shows that the acoustic feature discrimination between the residual signals of bonafide and spoof speech is much larger than that between the original/raw speech signals. Therefore, using RLFCCs and RCQCCs should allow building better ASVSpoof systems.
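The visualization procedure can be reproduced with a short script like the following. It uses scikit-learn's `TSNE` (an assumption on our part; any t-SNE implementation would serve), and the feature matrices are placeholders for the pooled 60- or 90-dimensional frames.

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_embed(genuine, spoof, perplexity=10, seed=0):
    """Project pooled genuine/spoof feature frames to 2-D with t-SNE;
    returns the embedding and a 0/1 label vector for plotting."""
    X = np.vstack([genuine, spoof])
    y = np.concatenate([np.zeros(len(genuine)), np.ones(len(spoof))])
    emb = TSNE(n_components=2, perplexity=perplexity,
               init="pca", random_state=seed).fit_transform(X)
    return emb, y
```

Scatter-plotting `emb` colored by `y` gives figures of the kind shown in Figures 7, 8, and 10; note that t-SNE only suggests, and does not guarantee, class separability in the original space.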

Harmonic and Noise Interaction Features
From the speech production mechanism and the success of the Harmonic plus Noise Model (HNM) [4] for high-quality speech analysis and synthesis, we know that the generation of speech can be regarded as the interaction movement of the vocal tract and glottal airflow. Our previous work [32] proposed a feature called the spectral subband energy ratios (SSER) to reflect this interaction property, and experimental results proved its effectiveness in characterizing speaker identity. We expect that the natural interaction property in synthesized or replayed spoofing speech may be distorted by the speech synthesis algorithms or the speech recording devices, and that these distortions may result in very different interaction features between bonafide and spoofed speech. Therefore, in this section, we explore the effectiveness of the interaction features for the ASVSpoof task.
As in our previous work [32], here we also use spectral subband energy ratios as the interaction features, but we extract them differently from [32]. To distinguish these features from SSER, we call them the "Harmonic and Noise Subband Ratio (HNSR)". The principle of the HNSR is shown in Figure 9.
Given a speech segment s(t), we first estimate the fundamental frequency (F0) and make the unvoiced/voiced decision using the Normalized Cross-Correlation Function (NCCF). Both F0 and NCCF are estimated using the Kaldi toolkit [33]. The pitch markers are computed as in [34,35]. All unvoiced frames are discarded; only voiced frames are used to extract the HNSRs. The pitch-synchronous frames are then extracted using Hamming windows of two pitch periods, centered at each pitch marker. Finally, the harmonic component h(t) is estimated with the HNM analysis [4]. Instead of using the high-pass filtering method to obtain the stochastic noise part as in [4] for speech synthesis, in this study we obtain n(t) by directly subtracting h(t) from the original windowed speech signal s(t). Once h(t) and n(t) are available, we transform them into the spectral domain and compute the frequency subband energies as

E_h(B_s, B_e) = Σ_{f=B_s}^{B_e} |STFT(h(t))(f)|^2,  E_n(B_s, B_e) = Σ_{f=B_s}^{B_e} |STFT(n(t))(f)|^2,

where STFT is the short-time Fourier transform, and B_s and B_e represent the starting and ending frequencies of the spectral subband. For each frequency subband, we set the bandwidth B_w = B_e − B_s to the averaged F0 value (225 Hz) of all the training datasets, to have as many spectral subbands as possible. The maximum voiced frequency used in the HNM analysis is fixed to 8 kHz, so we obtain a 35-dimensional HNSR feature vector for each voiced frame as

HNSR = [E_h^(1)/E_n^(1), E_h^(2)/E_n^(2), . . . , E_h^(35)/E_n^(35)].

As with the RCQCC and RLFCC visualizations, we also apply t-SNE to inspect the distribution of the HNSR features, as shown in Figure 10. Compared with Figures 7 and 8, the discrimination of the HNSRs is much poorer. This indicates that the HNSR features cannot distinguish well between genuine and spoof speech on their own; nevertheless, in this study we can verify whether the HNSR features carry information complementary to the other acoustic features.
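A minimal sketch of the per-frame HNSR computation is given below. The least-squares harmonic fit stands in for the full HNM analysis, and taking the logarithm of each subband ratio is an illustrative numerical-stability choice on our part; F0 handling, windowing, and the function name are all simplified assumptions.

```python
import numpy as np

def hnsr_frame(s, fs, f0, bw=225.0, fmax=8000.0):
    """Sketch of the HNSR feature for one voiced frame: least-squares
    harmonic fit h(t) at multiples of F0, noise n(t) = s - h, then the
    per-subband log energy ratio E_h / E_n on a bw-Hz grid."""
    n = np.arange(len(s))
    K = int(fmax // f0)
    # design matrix of harmonic cosines/sines up to fmax
    A = np.hstack([np.stack([np.cos(2 * np.pi * k * f0 * n / fs),
                             np.sin(2 * np.pi * k * f0 * n / fs)], axis=1)
                   for k in range(1, K + 1)])
    coef, *_ = np.linalg.lstsq(A, s, rcond=None)
    h = A @ coef                 # deterministic (harmonic) component
    noise = s - h                # stochastic component by subtraction
    H = np.abs(np.fft.rfft(h)) ** 2
    N = np.abs(np.fft.rfft(noise)) ** 2
    freqs = np.fft.rfftfreq(len(s), 1.0 / fs)
    ratios, b = [], 0.0
    while b + bw <= fmax:        # 225-Hz subbands up to 8 kHz -> 35 bands
        band = (freqs >= b) & (freqs < b + bw)
        ratios.append(np.log((H[band].sum() + 1e-12) /
                             (N[band].sum() + 1e-12)))
        b += bw
    return np.array(ratios)
```

With bw = 225 Hz and fmax = 8 kHz the loop produces exactly 35 subband ratios per voiced frame, matching the feature dimension stated above.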

Dataset-ASVSpoof 2019
All of our experiments are performed on the 2019 Automatic Speaker Verification Spoofing and Countermeasures (ASVSpoof) challenge database [2]. Different from previous ASVSpoof challenges, the 2019 challenge is the first to focus on countermeasures for all three major attack types, namely those stemming from TTS, VC, and replay spoofing attacks. It contains two sub-challenges, namely the logical access (LA) task and the physical access (PA) task. Brief descriptions [20] of these two tasks are as follows.
Logical access task: Compared with the ten spoofing types in ASVSpoof 2015 [5], the spoofing utterances of the LA sub-challenge in ASVSpoof 2019 were synthesized using the most recent technologies [10]. The quality of the synthetic speech in ASVSpoof 2019 has improved considerably, which poses substantial threats to ASV. This sub-challenge contains training, development, and evaluation partitions. The genuine speech was collected from 107 speakers with no significant channel or background noise. The spoofed speech was generated from the genuine data using various spoofing approaches. There is no speaker overlap among the three subsets.
Physical access task: The replay attacks in ASVSpoof 2019 were based upon simulated and carefully controlled acoustic environments. The training and development data of the PA sub-challenge were created according to 27 different acoustic environments, consisting of combinations of 3 room sizes, 3 levels of reverberation, and 3 microphone distances. There were 9 different replay configurations, generated from 3 categories of recording distances and 3 audio qualities. Detailed information about the training, development, and evaluation sets of ASVSpoof 2019 is given in Table 1.

Official baselines: Two official baseline countermeasure systems were made available to ASVSpoof 2019 participants. Both use a common Gaussian mixture model (GMM) back-end classifier with either constant-Q cepstral coefficient (CQCC) features [9] (B01) or linear frequency cepstral coefficient (LFCC) features [30] (B02). The GMMs for both B01 and B02 have 512 components. The bonafide and spoofing GMM models are trained separately, and the baselines are trained separately for the LA and PA scenarios.
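The baseline scoring scheme (separate bonafide and spoof GMMs, log-likelihood-ratio scoring) can be sketched as follows. This uses scikit-learn's `GaussianMixture` as a stand-in for the official baseline implementation; the component count, covariance type, and data shapes are illustrative, not the 512-component configuration above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm_countermeasure(bona_feats, spoof_feats, n_comp=4, seed=0):
    """Baseline-style detector: one GMM on bonafide frames, one on
    spoof frames; an utterance score is the average per-frame
    log-likelihood ratio between the two models."""
    gmm_b = GaussianMixture(n_comp, covariance_type="diag",
                            random_state=seed).fit(bona_feats)
    gmm_s = GaussianMixture(n_comp, covariance_type="diag",
                            random_state=seed).fit(spoof_feats)
    def score(utt_feats):
        # positive -> more bonafide-like, negative -> more spoof-like
        return np.mean(gmm_b.score_samples(utt_feats) -
                       gmm_s.score_samples(utt_feats))
    return score
```

The same scoring function is reused unchanged whichever front-end features (CQCC, LFCC, RCQCC, RLFCC, HNSR) are plugged in.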
LCNN classifier: Besides the official GMM baselines, we also investigate using Light Convolutional Neural Networks (LCNN) [18] as the classifier, which formed the best system submitted to the ASVSpoof 2017 challenge. As in [18], the same normalized log power magnitude spectrum (logspec) obtained by FFT is utilized as the input of the LCNN. To obtain a unified time-frequency (T-F) shape of the input features, the normalized FFT spectrograms are truncated along the time axis to a size of 864 × 400 × 1 as the input of the LCNN. Short files are extended by repeating their contents to keep the length the same. The LCNN classifier used in this paper is a reduced CNN architecture with the Max-Feature-Map (MFM) activation, which provides a feature selection function for the LCNN. The details of the LCNN are the same as for the LCNN-9 used in [18], with only a minor change: the output of the FC6 layer is 256 × 2 instead of 32 × 2.
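The Max-Feature-Map activation at the core of the LCNN can be illustrated in isolation. The following NumPy sketch shows the channel-splitting maximum on a (batch, channels, freq, time) tensor; in the actual model this operation follows convolutional layers inside a deep learning framework, so this is only a conceptual illustration.

```python
import numpy as np

def max_feature_map(x):
    """Max-Feature-Map (MFM): split the channel dimension in two and
    take the element-wise maximum, halving the number of feature maps.
    The max acts as a competitive feature selector between the halves."""
    c = x.shape[1]
    assert c % 2 == 0, "MFM needs an even channel count"
    return np.maximum(x[:, :c // 2], x[:, c // 2:])
```

Because each output map is the winner of two competing input maps, MFM both compresses the network and suppresses uninformative activations, which is why [18] describes it as a feature selection mechanism.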

Features
The CQCC-based baselines use a constant-Q transform (CQT), which is applied with a maximum frequency of f_max = f_nyq = f_s/2 (f_s = 16 kHz). The minimum frequency is set to nine octaves below the maximum frequency, f_min = f_max / 2^9 ≈ 15 Hz. The number of bins per octave is set to 96. The resulting geometrically scaled CQT spectrogram is re-sampled to a linear scale using a sampling period of 16. The 0th-29th order DCT static coefficients, plus the ∆ + ∆∆ dynamic coefficients computed over two adjacent frames, are taken as our final CQCC features.
The LFCC-based baselines use a short-term Fourier transform. Each frame is 20 ms long with a Hamming window and a 10 ms shift. The power magnitude spectrum of each frame is calculated using a 512-point FFT, and a triangular, linearly spaced filter-bank of 20 channels is applied. Different from the CQCCs, only the 0th-19th order DCT static coefficients are extracted; together with the ∆ + ∆∆, the final LFCCs are 60-dimensional acoustic feature vectors. Table 2 presents all other detailed parameters that are not mentioned for the baseline features or in Section 3; these parameters gave the best performance in our experiments. "FL" and "FS" refer to the frame length and frame shift in milliseconds (ms), respectively, "LA/PA" denotes the parameters for the LA and PA conditions, "LP-order" is the LP order used to obtain the residual signal, and "GMM(LA/PA)" denotes the GMM components for the LA and PA conditions, respectively. All these parameters are tuned on the development datasets of the ASVSpoof 2019 challenge.

Evaluation Metrics
Besides the equal error rate (EER) that was used for evaluating previous ASVSpoof challenge systems, in ASVSpoof 2019 a new ASV-centric metric referred to as the tandem detection cost function (t-DCF) [36] was adopted for the first time as the primary evaluation metric. Use of the t-DCF means that the ASVSpoof 2019 database is designed not for the standalone assessment of spoofing countermeasures but for assessing their impact on the reliability of an ASV system subjected to spoofing attacks. In this study, we use both the EER and the t-DCF to evaluate the effectiveness of our proposed methods.
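The EER can be computed from countermeasure scores as in the following sketch; the simple threshold sweep is an illustrative implementation, not the official ASVSpoof evaluation scripts (which also compute the t-DCF).

```python
import numpy as np

def compute_eer(bona_scores, spoof_scores):
    """EER: sweep a threshold over all observed scores and return the
    operating point where the miss rate (bonafide rejected) is closest
    to the false-alarm rate (spoof accepted)."""
    thresholds = np.sort(np.concatenate([bona_scores, spoof_scores]))
    best = (2.0, None)  # (|miss - fa|, eer)
    for th in thresholds:
        miss = np.mean(bona_scores < th)   # bonafide scored below threshold
        fa = np.mean(spoof_scores >= th)   # spoof scored at/above threshold
        if abs(miss - fa) < best[0]:
            best = (abs(miss - fa), 0.5 * (miss + fa))
    return best[1]
```

Perfectly separated score distributions give an EER of 0, and chance-level scores give 0.5; all EERs reported below sit between these extremes.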

Results on LA Task
As presented in Section 5.2.1, two classifiers are used to build ASVSpoof detection systems, one is the GMM, and the other is the LCNN, either for the LA task or for the PA task. We first evaluate our proposed features for the LA task as follows.

Detection Results of Single Systems
Results for the LA task of ASVSpoof 2019 are shown in Table 3. L1 and L3 are the official GMM-based baselines, and L6 is our LCNN baseline. Comparing L2 with L1, the proposed residual CQCCs achieve results similar to the CQCCs on the development set, but with an EER around 2.14% (absolute) worse than the CQCC baseline on the evaluation set. However, when we compare L4 with its baseline L3, significant performance improvements are obtained in both EER and t-DCF on both the development and evaluation sets: on the development set, the EER is reduced from 2.72% to 1.16% and the t-DCF from 0.0663 to 0.0358, and on the evaluation set consistent performance gains are achieved. Relative EER improvements of 57.2% and 24.8% are achieved on the development and evaluation sets over the official baseline, respectively. This shows that the RLFCC features extracted from the residual signal are much more effective than the LFCCs extracted from the original source signal. Moreover, from L5, it is clear that the proposed HNSR features perform much worse than the other features with the GMM classifier. However, we expect that they may provide some complementary information to the other acoustic features during system fusion, because the HNSRs, RCQCCs, and RLFCCs are extracted in totally different ways and may capture the spoofing acoustic characteristics of speech synthesis and voice conversion in different respects. Furthermore, we also investigate extracting LogSpec features from the residual signals to test their effectiveness for detecting spoofing speech. Comparing system L7 with its baseline L6, it is interesting to see that the EER and t-DCF on the evaluation set are significantly reduced, while the results on the development set move in the opposite direction. We also expect the proposed RLogSpec to provide some complementary information for the LCNN-based countermeasures.
This experiment shows the effectiveness of residual signals compared with their respective baseline features for the ASVSpoof LA task. However, the performance of the RCQCCs and RLFCCs with the LCNN model, and of the RLogSpecs with the GMM model, was not evaluated. The effectiveness of the proposed features for other ASVSpoof LA tasks needs further evaluation in the future.

System Fusion
To verify whether the different acoustic features are complementary, we perform system fusion at the score level instead of the feature level. The BOSARIS toolkit [37] is used. This toolkit provides a logistic regression solution, which performs the fusion by learning the weights from the scores of the development set and applying them to the evaluation set. Results for the LA task of the ASVSpoof 2019 challenge are shown in Table 4. In this table, we investigate many system fusion strategies to exploit the complementary information of the acoustic features. Comparing these results with those in Table 3, it is clear that score-level fusion yields significant performance gains over the single systems, for the CQCCs, LFCCs, RCQCCs, and RLFCCs alike. Considering the performance on both the development and evaluation sets, LF1 + RLFCC achieves the best fusion results among the GMM-based systems.
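The score-level fusion step can be sketched as follows. This is a minimal logistic-regression fusion in the spirit of the BOSARIS toolkit, implemented with scikit-learn as a stand-in; the function name and score layout are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fuse_scores(dev_scores, dev_labels, eval_scores):
    """Logistic-regression score fusion: learn one weight per system
    plus a bias on the development scores, then apply the learned
    affine combination to the evaluation scores.
    dev_scores / eval_scores: arrays of shape (n_trials, n_systems)."""
    lr = LogisticRegression().fit(dev_scores, dev_labels)
    w, b = lr.coef_[0], lr.intercept_[0]
    return eval_scores @ w + b   # fused log-odds-style score
```

Because the weights are trained only on development scores, any gain observed on the evaluation set reflects genuine complementarity between the fused systems rather than overfitting to the test condition.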
Furthermore, comparing LF7 with L6 in Table 3, we see that adding the residual LogSpec features provides significant complementary information to the original LogSpec features; e.g., the EERs are reduced from 0.2% and 20.83% to 0.12% and 9.67% on the development and evaluation sets, respectively. From LF8, we see further improvements by combining the best GMM-based system (LF5 + HNSR) with the LCNN-based system (LF7): the EER and t-DCF are reduced to zero on the development set, and they are also significantly reduced to 2.57% and 0.073931 on the evaluation set. All these performance gains demonstrate that the proposed residual acoustic features are useful and effective. From the big performance difference between the development and evaluation sets, it is also clear that the generalization ability of the current GMM-based and LCNN-based systems is limited.

Results on PA Task
Following the same experimental investigations as for the LA task, results for the PA task are presented below. The same acoustic features and system fusion strategies are validated.

Detection Results of Single Systems
Results for the PA task of ASVSpoof 2019 are shown in Table 5. As in Table 3, P1 and P3 are the official GMM-based baselines, and P6 is the LCNN baseline. The results on the LA and PA tasks in Tables 3 and 5 show consistent behavior of the residual acoustic features: the RLFCCs achieve better results than the LFCCs, while the RCQCCs are no better or even worse than the baselines; a single GMM system using HNSRs is still very poor; and the RLogSpec features are also very effective for PA spoofing detection. This experiment shows the effectiveness of residual signals compared with their respective baseline features for the ASVSpoof PA task. However, the performance of the RCQCCs and RLFCCs with the LCNN model, and of the RLogSpecs with the GMM model, was not evaluated. The effectiveness of the proposed features for other ASVSpoof PA tasks needs further evaluation in the future.

Table 6 presents the score-level system fusion results for the PA task of the ASVSpoof 2019 challenge. As in Table 4, the same fusion strategies are examined. Unlike the LA task, the best fusion results are achieved by system PF6 + HNSR, whose EER and t-DCF are much lower than those of the official baseline system fusion (PF1), thanks to the complementary information of the proposed residual acoustic features RCQCCs and RLFCCs. For the LCNN systems, the fused system PF7 further reduces the EER from 5.44% and 6.59% to 2.69% and 3.39%, and the t-DCF from 0.1795 and 0.1908 to 0.104703 and 0.13176, on the development and evaluation sets, respectively. Moreover, system fusion between the best GMM-based system (PF6 + HNSR) and PF7 further improves the final system performance; e.g., PF8 achieves relative EER reductions of 31.2% and 37.1% over PF7 on the development and evaluation sets, respectively. All these improvements are consistent across the LA and PA tasks.
From the performance gap between the development and evaluation sets in Tables 5 and 6, we also see that system fusion can reduce the performance gap between different test conditions and improve the system robustness under the PA task.

Conclusions
In this paper, we investigate two types of new acoustic features to improve the detection of synthetic and replay speech spoofing attacks. One type comprises the residual CQCCs, LFCCs, and LogSpecs; the other is the HNSR interaction feature. From the experimental results, we find that the acoustic features extracted from the residual signals behave much better than those extracted from the source speech signals. Although the single system built from the HNSR features obtains poor results, these features still provide some complementary information in score-level system fusion. Furthermore, we investigate different score-level fusion strategies and find that all of the proposed residual features provide significant complementary information to the official baselines, and that the GMM-based and LCNN-based systems are further complementary to each other (more than 30% relative EER reduction), improving the final system performance.
Although the experimental results show the effectiveness of the proposed features, these features may not outperform all other acoustic features for the ASVSpoof task. Future work will focus on evaluating other acoustic features to conduct a comprehensive feature-level comparison for the ASVSpoof task, and on improving the generalization ability of anti-spoofing countermeasures under different conditions.