Exposing Speech Resampling Manipulation by Local Texture Analysis on Spectrogram Images

Abstract: Speech tampering may be aided by the resampling operation. Effectively detecting resampling is significant for speech forensics; however, there are few studies on speech resampling detection. The purpose of this paper was therefore to provide a new idea for detecting speech resampling. After resampling, the speech signal changes regularly in the time–frequency domain. In this paper, we theoretically analyzed the corresponding relationship between the time domain and frequency domain of the resampled speech. Compared with the original speech, the bandwidth of the resampled speech is stretched or compressed. First, the spectrogram was generated from the speech by short-time Fourier transform (STFT). Then, the local binary pattern (LBP) operator was applied to model the statistical changes in the spectrogram, and the LBP histogram was calculated as the discriminative feature. Finally, a support vector machine (SVM) was applied to classify the developed features to identify whether the speech had undergone the resampling operation. The experimental results show that the proposed method has superior detection performance in different resampling scenarios compared with some existing methods, and the proposed features are very robust against the commonly used compression post-processing operation. This highlights the promising potential of the proposed method as a speech resampling detection tool in practical forensic applications.


Introduction
The digital information age has accelerated the development of multimedia, which is now universally present in people's daily lives. Common multimedia includes video, image, audio, and so on. Due to the popularity and intelligence of media editing tools, people can edit and spread media easily, so the originality and authenticity of multimedia can no longer be guaranteed. Multimedia forensics technology has therefore become a leading research topic in the field of information security. Multimedia forensics can be divided into active forensics and passive forensics. Active forensics requires embedding prior information (called a watermark) into the media before the media is released. The extracted watermark is compared with the original watermark at the receiving end, and the originality and authenticity of the media are identified according to the changes in the watermark. On the contrary, passive forensics is a direct detection of media information without prior information. The application scenarios of passive forensics are more extensive, but its technical requirements are also higher. Recently, passive image forensics has been the subject of many state-of-the-art studies [1][2][3][4], but research on passive audio forensics is still in its infancy [5]. Forensic scenarios such as speech source detection [6], speech steganalysis [7], and speech recapture [8] are categories of passive speech forensics.

Principle of Resampling
For this paper, the characteristics of signal resampling were taken as the starting point in order to detect the tampering traces. In this section, the principle of resampling is analyzed, which leads to the rule of change on the spectrogram image. Resampling by an integer factor involves only up-sampling, interpolation, or down-sampling, while resampling by an arbitrary rational factor requires the combination of up-sampling, interpolation, and down-sampling. Suppose the original signal x(n) has m sampling points, p is the interpolation factor, q is the decimation factor, and the resampling factor is expressed as p/q:
(1) Up-sampling: p − 1 zero points are inserted between every two points of the signal, so the number of sampling points of the changed signal x_u(n) becomes p(m − 1) + 1, which stretches the signal x(n) by a factor of p in the time domain. When n = 1, 2, . . . , m, x_u(p(n − 1) + 1) = x(n); otherwise, x_u(n) = 0.
(2) Interpolation: By convolving the signal x_u(n) with a low-pass interpolation filter h(n), the interpolated signal x_c(n) = x_u(n) * h(n) can be obtained. The operation can be divided into linear interpolation, spline interpolation, and cubic interpolation according to the different filters used [17].
(3) Down-sampling: Keep one point out of every q points, that is, the interval between two kept points is q − 1, which shrinks the signal x_c(n) by a factor of q in the time domain. The changed signal is x_d(n) = x_c(1 + q(n − 1)); when n = 1, 2, . . . , ⌊p(m − 1)/q⌋ + 1, the resampled signal is y(n) = x_d(n). Therefore, if p > q, the signal is lengthened in the time domain, whereas if p < q, it is shortened. Different types of resampling algorithms can be divided into the above three steps; the difference between speech resampling methods lies in the interpolation filter used in step (2). In this paper, we also discuss the influence of several common interpolation methods on the spectrogram.
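The three steps above can be sketched in Python with NumPy. This is a minimal illustration rather than the paper's implementation; the triangular kernel realizes plain linear interpolation, and the function names are ours:

```python
import numpy as np

def upsample(x, p):
    """Step (1): insert p - 1 zeros between every two samples (zero-stuffing)."""
    xu = np.zeros(p * (len(x) - 1) + 1)
    xu[::p] = x                      # x_u(p(n-1)+1) = x(n)
    return xu

def linear_interp_kernel(p):
    """Triangular low-pass kernel of length 2p - 1 for linear interpolation."""
    return np.concatenate([np.arange(1, p + 1), np.arange(p - 1, 0, -1)]) / p

def downsample(x, q):
    """Step (3): keep one point out of every q, i.e. x_d(n) = x_c(1 + q(n-1))."""
    return x[::q]

def resample(x, p, q):
    """Resample x by the rational factor p/q."""
    xu = upsample(x, p)
    xc = np.convolve(xu, linear_interp_kernel(p), mode="same")  # step (2)
    return downsample(xc, q)
```

For example, `resample(x, 2, 1)` doubles the number of samples, filling the inserted zeros with the averages of their neighbors.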

Spectrogram
It is well known that speech is a nonstationary signal, so the premise of generating a spectrogram is that the speech can be considered stationary over a short period of time. First, we framed the speech signal x(n). The frame length used in this paper was set to 50 ms and the frame shift to 25 ms. The Fourier transform was performed on each frame; the transform X(k, t) of the t-th frame is given by the following:

X(k, t) = Σ_{n=0}^{d−1} x((t − 1)·d/2 + n) w(n) e^{−j2πkn/d},

where d is the frame length and w(n) is the Hamming window function. The frames, arranged at intervals of one frame shift, form a linear spectrogram L(k, t), which can be formulated as follows:

L(k, t) = |X(k, t)|,

where |·| stands for the amplitude operator and t ∈ {1, . . . , (n − 25)/(50 − 25)}, with the speech length n expressed in milliseconds. A logarithmic spectrogram S(k, t) can be generated by taking the logarithm of the linear spectrogram [18], which can be expressed as follows:

S(k, t) = log(L(k, t)).

The purpose of converting a linear spectrogram to a logarithmic one is to reduce the dynamic range of the amplitude components by logarithmic compression. In fact, the spectrogram presents three-dimensional information: the horizontal axis represents time, that is, the length of the speech; the vertical axis represents frequency; and the depth of color represents the frequency amplitude [19]. The pitch period, frequency bandwidth, and fundamental frequency can be extracted from the spectrogram.
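The framing, windowing, and log compression described above can be sketched as follows. This is a NumPy illustration under the paper's 50 ms / 25 ms settings, not the authors' code; the small epsilon added before the logarithm is our choice to avoid log(0):

```python
import numpy as np

def log_spectrogram(x, fs, frame_ms=50, shift_ms=25):
    """Log-magnitude spectrogram S(k, t) with a Hamming window."""
    d = int(fs * frame_ms / 1000)        # frame length d in samples
    shift = int(fs * shift_ms / 1000)    # frame shift in samples
    w = np.hamming(d)
    n_frames = 1 + (len(x) - d) // shift
    frames = np.stack([x[t * shift : t * shift + d] * w
                       for t in range(n_frames)])
    X = np.fft.rfft(frames, axis=1)      # X(k, t): DFT of each frame
    L = np.abs(X)                        # linear spectrogram L(k, t)
    return np.log(L + 1e-10).T           # log compression; rows = frequency k
```

The transpose at the end puts frequency on the vertical axis, matching the spectrogram orientation described in the text.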

Relationship between Resampling and Spectrogram
As can be seen from Section 3.1, Fourier transform is a crucial part of the spectrogram generation, so the traces introduced by speech resampling will also be reflected on the spectrogram. Speech can be regarded as a complex signal consisting of k-order harmonics. This section analyzes the relationship between the time domain and frequency domain of k-order harmonics to illustrate the relevant phenomena after the resampling manipulation on the spectrogram.
Consider a harmonic periodic function x(t):

x(t) = A sin(wt + ϕ),

where A is the magnitude, ϕ is the phase angle, and w/(2π) is the frequency of the function. Using the trigonometric sum formula, this can be written as

x(t) = A cos(ϕ) sin(wt) + A sin(ϕ) cos(wt).

Since ϕ is a constant, let a = A cos ϕ and b = −A sin ϕ; it can be simplified as follows:

x(t) = a sin(wt) − b cos(wt).

The Fourier expansion of a signal f(t) is as follows:

f(t) = a_0 + Σ_{k=1}^{∞} [a_k cos(kwt) + b_k sin(kwt)].

Sampling must be introduced when an analog signal is converted into a digital signal. Let ∆t be the sampling interval, k the order of the harmonic, N the number of samples, and i the time index, so that t → i∆t. Applying this to the above equation, the discrete Fourier series can be obtained as follows:

x_i = a_0 + Σ_k [a_k cos(2πki/N) + b_k sin(2πki/N)],

with the coefficients

a_k = (2/N) Σ_{i=0}^{N−1} x_i cos(2πki/N), b_k = (2/N) Σ_{i=0}^{N−1} x_i sin(2πki/N).

When k = 0, a_0 is the DC component, and the k-order harmonic of x_i can be obtained as follows:

x_i^(k) = a_k cos(2πki/N) + b_k sin(2πki/N).

The frequency f_k and period T_k of the k-order harmonic are as follows:

f_k = k/T, T_k = T/k,

where T = N∆t is the duration of the speech taken. It can be observed from the equation that the frequency of each harmonic is inversely related to the total time length, which also agrees with the scaling property of the Fourier transform:

F{f(ct)} = (1/|c|) F(w/c),

where f is the signal in the time domain, F represents the Fourier transform of the signal, and c is a constant. Therefore, it can be concluded that when the period in the time domain changes by some factor, the range of the frequency domain changes correspondingly. For example, when the speech is up-sampled by the factor p, it is equivalent to stretching it p times in the time domain, and the frequency of the harmonics is compressed by a factor of p. During down-sampling by the factor q, as shown in Figure 1, the signal shortens by a factor of q in the time domain and stretches by a factor of q in the frequency domain.
From Figure 1, it can be confirmed that the bandwidth of resampled speech is stretched or compressed in the frequency domain, and this extends to the spectrogram. As the sampling factor changes, the period of the speech in the time domain changes, the bandwidth on the vertical axis changes correspondingly, and every harmonic shifts position in the same proportion.
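This time/frequency scaling can be checked numerically: keeping every q-th sample of a pure tone leaves its frequency in Hz unchanged, but relative to the new Nyquist band the tone sits q times higher, which is exactly the stretching seen on the spectrogram's vertical axis. A NumPy sketch (not from the paper):

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs                       # 1 s of samples
x = np.sin(2 * np.pi * 440 * t)              # 440 Hz tone

def peak_freq(sig, fs_sig):
    """Frequency (Hz) of the largest-magnitude DFT bin."""
    spec = np.abs(np.fft.rfft(sig))
    return np.argmax(spec) * fs_sig / len(sig)

x_down = x[::2]                              # down-sampling with q = 2
f_orig = peak_freq(x, fs)                    # still about 440 Hz
f_down = peak_freq(x_down, fs / 2)           # also about 440 Hz
rel_orig = f_orig / (fs / 2)                 # position within [0, Nyquist]
rel_down = f_down / (fs / 4)                 # about twice as high: stretched by q
```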
Figure 2 shows the influence of different sampling factors and compression bit rates on the spectrogram. In the bottom left corner is the spectrogram of the original speech. The horizontal axis represents the change in the sampling factor, and the vertical axis represents the change in the MP3 compression bit rate. It can be observed along the horizontal axis, from the frequency positions of the harmonics, that after down-sampling the corresponding spectrogram is stretched and its characteristics are changed compared with the original speech. After up-sampling, the spectrogram is compressed correspondingly, which can be observed from the bandwidth of the whole spectrogram. We observe along the vertical axis that different compression bit rates have no noticeable effect on the spectrogram, while the effect of resampling remains. Therefore, the spectrogram-based resampling detection method has good robustness against MP3 compression.

Figure 3 shows the changes in the spectrogram under different interpolation methods, with all resampling factors set to 1.5. It can be seen that, except for zero-complementing interpolation, the three common interpolation methods leave an energy residue in the blank band of the spectrogram, but the change pattern still exists. In Section 5, we present the experimental comparison for the different interpolation cases.

Feature Extraction
After the speech is resampled, the spectrogram is compressed or stretched, and the texture features of the image change accordingly. An LBP histogram is a kind of texture description operator, which is used to extract rich local texture information from images [20]. It puts more emphasis on texture features and the variation of elements in adjacent areas, and can extract local features of the spectrogram well, thereby reflecting the variation pattern of the resampled speech. Therefore, this paper used the LBP histogram to model the local texture change in the spectrogram image.
The change information in the arbitrary direction of the spectrogram was then represented by the difference between the eight-neighborhood elements and the center element. The steps to extract the spectral features using an LBP operator are as follows.
First, the spectrogram was divided into 3 × 3 regions, as shown in Figure 4a. The LBP compares the value of the center element with the surrounding eight elements; a neighborhood element smaller than the center element is marked as 0, otherwise it is marked as 1. The binary value output is calculated using Equation (16). The binary values are multiplied by their weights and then summed to obtain the LBP code C_L of the central element [21]. Here, C_L represents the LBP code of the central element, T is the threshold (set to 0 in this paper), S_i stands for the value of each element in the spectrogram, and C_i is the binary value output of the LBP. Finally, the histogram of these codes is calculated to represent the characteristic image [22]. Figure 5 compares the LBP histograms computed from an original speech and its resampled versions.
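Under these definitions (threshold T = 0, neighbors at or above the center contributing a 1 bit), the LBP coding and histogram can be sketched as follows; the vectorized layout and bit ordering are illustrative choices, not the authors' implementation:

```python
import numpy as np

def lbp_histogram(S):
    """256-bin normalized histogram of 8-neighborhood LBP codes of array S."""
    c = S[1:-1, 1:-1]                                    # center elements
    # eight neighbor offsets, ordered clockwise from the top-left
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(c.shape, dtype=np.uint8)
    for bit, (dr, dc) in enumerate(offsets):
        nb = S[1 + dr : S.shape[0] - 1 + dr,
               1 + dc : S.shape[1] - 1 + dc]             # shifted neighbor view
        codes |= (nb >= c).astype(np.uint8) << bit       # C_i weighted by 2^bit
    hist = np.bincount(codes.ravel(), minlength=256)
    return hist / hist.sum()                             # LBP histogram feature
```

The returned 256-dimensional histogram is the feature vector that is later fed to the SVM.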

As can be seen from Figure 5, the local texture features extracted from the spectrogram of the original speech are significantly different from those of the resampled speech. The underlying reason is that the spectrogram is stretched or compressed after the resampling manipulation, which changes the statistical correlation between the spectrogram's elements. The local details and textures change correspondingly, and this can be depicted effectively by the LBP operator. Furthermore, the obtained LBP histogram feature can be fed into LIBSVM [23] to distinguish the original speech from the resampled one. LIBSVM is a software package based on the support vector machine. For the proposed method, we take the maximum complexity of each part as the overall complexity of the algorithm: the time complexity of the spectrogram generation is T(n) = O(n log2 n) and that of the LBP is T(n) = O(n^2). Thus, the time complexity of the proposed algorithm is T(n) = O(n^2). The proposed speech resampling detection process is as follows and shown in Figure 6.
(1) Use the short-time Fourier transform to convert the input speech into a spectrogram;
(2) Then divide the spectrogram into 3 × 3 regions to calculate the LBP codes;
(3) Subsequently, calculate the histogram of these codes for feature representation;
(4) Finally, feed the obtained LBP histogram feature into LIBSVM to distinguish the original speech from the resampled one.

Figure 6. Proposed speech resampling detection process.

Experimental Results and Analysis
The original speech dataset in the experiment was selected from the TIMIT corpus. This corpus contains 6300 utterances, with an average duration of three seconds, from 630 speakers in different regions of the United States. The original speech files are mono, PCM-encoded, with a 16 kHz sampling rate and 16-bit quantization [16]. In this paper, 3000 original speech samples were selected from the TIMIT corpus. The original speech was resampled by linear, spline, and cubic interpolation separately, with the sampling factor varied from 0.6 to 2 in steps of 0.1, to obtain the resampled databases for linear, spline, and cubic interpolation. Then, MP3 compression with three different bit rates, namely 32 kbps, 64 kbps, and 128 kbps, was applied to the original and resampled speech, yielding a compressed speech database. To facilitate the experiments, a mixed corpus was also constructed by randomly selecting resampled speech from the different interpolation databases (i.e., linear, spline, and cubic). There was a total of 684,000 speech samples used for the experiment, and the process of making the corpus is shown in Figure 7. The number of speech samples in the complete corpus is as follows: original speech: 3000; compressed speech: 3000 × 3 = 9000; resampled speech: 3000 × 3 × 14 = 126,000; resampled speech through compression: 126,000 × 3 = 378,000.
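The corpus layout described above can be enumerated directly; this pure-Python sketch reproduces the counts given in the text, with variable names of our choosing:

```python
# Sampling factors 0.6-2.0 in steps of 0.1, excluding the trivial factor 1.0
factors = [x / 10 for x in range(6, 21) if x != 10]             # 14 factors
interpolations = ["linear", "spline", "cubic"]
bitrates_kbps = [32, 64, 128]

n_original = 3000
n_compressed = n_original * len(bitrates_kbps)                  # 9000
n_resampled = n_original * len(interpolations) * len(factors)   # 126,000
n_resampled_compressed = n_resampled * len(bitrates_kbps)       # 378,000
```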

Experimental Results under Different Interpolation Methods
In Section 2, this paper introduced the three steps of speech resampling, namely up-sampling, interpolation, and down-sampling. There are several interpolation methods used in resampling, mainly linear interpolation, spline interpolation, and cubic interpolation. Therefore, this paper also discusses the influence of the various interpolation methods on the detection accuracy of resampling. When the resampling factors are 0.8, 0.9, 1.1, and 1.2, the degree of tampering is so small that it is difficult to detect; consequently, this section uses these four representative sampling factors for a comparative study. The kernel of LIBSVM is a radial basis function (RBF), and a grid search with five-fold cross-validation was used to select the parameters with the smallest error, with both parameters c and g ranging from 2^−5 to 2^5. It can be concluded from Tables 1–3 and Figure 8 that the detection accuracy of all methods is high whether the resampling factor is greater or smaller than 1. Generally speaking, under the different interpolation methods, the detection accuracy of the proposed algorithm is higher than that of Wang's and Hou's methods, indicating that the proposed algorithm significantly outperforms these state-of-the-art studies.
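The parameter search described above can be sketched with scikit-learn, whose SVC wraps LIBSVM; the synthetic two-class features below stand in for the real LBP histograms and are purely illustrative:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in features: two Gaussian classes playing the role of the LBP
# histograms of original (label 0) and resampled (label 1) speech.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (40, 16)),
               rng.normal(1.5, 1.0, (40, 16))])
y = np.repeat([0, 1], 40)

# RBF kernel; c and g searched over 2^-5 ... 2^5 with five-fold cross-validation
grid = {"C": 2.0 ** np.arange(-5, 6), "gamma": 2.0 ** np.arange(-5, 6)}
clf = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)
clf.fit(X, y)
```

After fitting, `clf.best_params_` holds the (C, gamma) pair with the smallest cross-validation error, mirroring the selection procedure in the text.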
For experimental convenience, except for this section, the following experiments in this paper use the mixed resampling corpus of linear interpolation, spline interpolation, and cubic interpolation for the research.

Experimental Results under Different Sampling Factors
In previous work, Wang et al. used the logarithmic ratio of statistical energy to find the best threshold for resampling detection, based on the inconsistency between bandwidth and sampling rate. Hou et al. [15] employed the periodicity of the second-order difference of the speech signal to identify resampling. When the sampling factor is greater than 1.2, these methods achieve good detection results, but for sampling factors from 0.8 to 1.1 they cannot achieve satisfactory performance. Therefore, this paper used the mixed corpus to evaluate the detection accuracy for sampling factors from 0.6 to 2, and the detection accuracy was improved to a certain extent. The experimental results are shown in Table 4 and Figure 9.

The following observations can be made from Table 4 and Figure 9:
(1) The detection accuracy improves notably as the up-sampling factor increases. This is because the local texture difference between the original and resampled speech becomes more obvious as the compression degree of the spectrogram increases.
(2) The detection performance degrades when the resampling factor is close to 1.0. The potential reason is that the local texture difference between the original and resampled speech becomes smaller and smaller, which makes the resampled speech hard to detect.
(3) Among the three methods, the proposed method always outperforms the other two, which demonstrates that the proposed features are more discriminative for speech resampling detection, especially for small-scale resampling.

Experimental Results under Different Compression Scenarios
MP3 has become the most popular compression format in recent years. Its principle is that low-frequency speech carries most of the information, and compression removes some of the redundant information [24,25]; therefore, speech compressed by MP3 has no obvious distortion. Resampled speech is often compressed to MP3 during transmission. As a result, detecting speech resampling under compression is meaningful for speech forensics. The higher the MP3 compression bit rate, the better the speech quality, and vice versa. This paper used the standardized MPEG-1 Layer III codec to carry out the experiments [26] with the LAME tool. Three commonly used compression bit rates, namely 32 kbps, 64 kbps, and 128 kbps, were investigated to evaluate the robustness of the algorithms. The detection rates of the proposed method and the two investigated methods are shown in Table 5 and Figure 10.
As can be seen in Table 5 and Figure 10, the proposed detector also works well against MP3 compression. Its performance remains better than that of the other two detectors, which indicates that the proposed detector is more robust in MP3 compression scenarios. Furthermore, when the compression bit rate is fixed, the detection result improves as the up/down-sampling scale increases. Moreover, the detection rate is slightly improved as the compression level increases. Since interpolation introduces some residual traces in the high-frequency component of the spectrogram, and the details related to the high-frequency information are weakened after compression, the partial influence introduced by interpolation is offset in the high-frequency component [27], which slightly increases the detector's performance.

Conclusions
This paper presented a novel forensic method for speech resampling identification. Since the change in the spectrogram varies regularly with the speech resampling scale, we theoretically analyzed the corresponding relationship between the time domain and frequency domain of the resampled speech. Based on this analysis, an LBP operator was used to model the local statistical change in the spectrogram caused by speech resampling. Experimental results showed that the proposed method is universal across different interpolation methods and robust to MP3 compression. Compared with some existing methods, the detection performance was improved notably under various sampling factors, especially when the sampling scale was small. In future work, we will focus on feature optimization to further improve the performance of the algorithm.