Source Cell-Phone Identification in the Presence of Additive Noise from CQT Domain

Abstract: With the widespread availability of cell-phone recording devices, source cell-phone identification has become a hot topic in multimedia forensics. Research on source cell-phone identification in clean conditions has achieved good results, but performance in noisy environments remains unsatisfactory. This paper proposes a novel source cell-phone identification system suitable for both clean and noisy environments, using spectral distribution features of the constant Q transform (CQT) domain and a multi-scene training method. Our analysis shows that the main difficulty lies in distinguishing different models of cell-phones of the same brand, whose small differences are concentrated in the middle and low frequency bands. Therefore, this paper extracts spectral distribution features from the CQT domain, which has a higher frequency resolution at mid-low frequencies. To evaluate the effectiveness of the proposed feature, four classification techniques, Support Vector Machine (SVM), Random Forest (RF), Convolutional Neural Network (CNN) and Recurrent Neural Network with Bidirectional Long Short-Term Memory (RNN-BLSTM), are used to identify the source recording device. Experimental results show that the proposed features perform better than the Mel frequency cepstral coefficient (MFCC) and linear frequency cepstral coefficient (LFCC): they improve the identification accuracy for cell-phones within the same brand, whether the test speech is clean or noisy. Among the classifiers, the CNN performs best. The model is built with the multi-scene training method, which improves its discriminative ability in noisy environments compared with the single-scenario training method. The average accuracy of the CNN for clean speech files on the CKC speech database (CKC-SD) and the TIMIT Recaptured Database (TIMIT-RD) increased from 95.47% and 97.89% to 97.08% and 99.29%, respectively. For noisy speech files with both seen and unseen noise types, performance improved greatly, and most recognition rates exceeded 90%. The proposed source identification system is therefore robust to noise.


Introduction
With the advancement of digital multimedia and Internet technologies, a variety of powerful and easy-to-operate digital media editing software has emerged, bringing new problems and challenges to the usability of collected data, namely multimedia security issues. Recording device identification is a branch of multimedia forensics and is of practical research significance. Compared with recorders, cameras, DVs, etc., mobile phones are more popular and convenient. More and more people use mobile phones to record what they hear, and even submit the recordings as evidence to courts or other law enforcement agencies. Therefore, source cell-phone identification is a hot topic for many forensic researchers.
In recent years, source cell-phone identification has achieved notable results. Initially, many researchers used cepstral coefficients, or features derived from them, as the device fingerprint. C. Hanilci et al. [1] extracted the Mel frequency cepstral coefficient (MFCC) from the recording file as a device-distinguishing feature and evaluated 14 different models of cell-phones. The closed-set recognition rate reached 96.42% using SVM classifiers. In a follow-up study, C. Hanilci et al. [2] used SVM to compare MFCC, linear frequency cepstral coefficient (LFCC), Bark frequency cepstral coefficient (BFCC) and linear predictive cepstral coefficient (LPCC). Their comparison covered several kinds of feature optimization, including feature normalization, cepstral mean normalization, cepstral mean and variance normalization, and delta and double-delta coefficients. The results showed that while baseline MFCCs outperformed the other feature types, applying cepstral mean and variance normalization gave the best performance for LPCCs (only slightly better than MFCCs). In addition, C. Kotropoulos et al. [3] extracted MFCC from recorded speech signals at the frame level. The MFCCs from each recording device were used to train a Gaussian Mixture Model (GMM) with diagonal covariance matrices. A Gaussian supervector (GSV), derived by concatenating the mean vectors and the main diagonals of the covariance matrices, was used as the template for each device. The best identification accuracy (97.6%) was obtained by a Radial Basis Function neural network. The methods above process the original recording file directly. Since the silent segments contain the same device information as the original speech files and are not affected by factors such as speaker emotion, voice, intonation and speech content, some researchers began to extract features from the silent segments to characterize the recording device. C. Hanilci et al. [4] extracted MFCC and LFCC features from the silent segments. The results showed that the MFCC features had the highest recognition rate under SVM, with recognition rates of 98.39% and 97.03% on the two databases.
In addition to these cepstral coefficients, the power-normalized cepstral coefficient (PNCC) gradually entered the field of source cell-phone identification. Zou et al. [5] used a Gaussian Mixture Model-Universal Background Model (GMM-UBM) classifier to compare MFCC and PNCC in terms of source cell-phone recognition performance. Experiments showed that MFCC is more effective than PNCC. The recognition rates on the two databases reached 92.86% and 97.71%, respectively. Wang et al. [6] extracted an improved PNCC feature from the silent segments, using long-term frame analysis to remove the influence of background noise. GMM-UBM was set as the baseline system, which was improved by two-step discriminative training. The experimental results indicated that the average accuracy for 15 kinds of devices was 96.65%.
Although these features have achieved good results in source cell-phone identification, most of these cepstral coefficients are designed around the perceptual characteristics of the human ear. Researchers hope to find features that capture the inherent characteristics of the device and can serve as its fingerprint. Some scholars have therefore begun to extract features directly from the spectrum in the Fourier transform domain. C. Kotropoulos et al. [7] proposed a source cell-phone identification algorithm that uses sketches of spectral features (SSFs) as an intrinsic fingerprint. By applying a sparse-representation-based classifier to the SSFs, identification accuracy exceeded 95% on a set of 8 telephone handsets from the Lincoln-Labs Handset Database. Jin et al. [8] proposed a method for extracting the noise of the recording device from the silent segments. Spectral shape features and spectral distribution features were extracted from the device noise. Combining the two features gave the best results, with recognition rates of 89.23% and 94.53% on the two databases, respectively. Qi et al. [9] obtained the noise signal by de-noising with spectral subtraction and used the Fourier histogram coefficients of the noise signal as the input to deep model classifiers. Among three deep learning classifiers (Softmax, multilayer perceptron (MLP) and CNN), CNN performed well, and a voting model combining multiple classifiers performed best, with a recognition rate reaching 99%. Recently, Luo et al. [10] proposed a new feature, the band energy difference feature, which is obtained from the differences between the energy values of the Fourier transform of the speech file. This feature has low computational complexity and is highly distinctive across mobile devices. It reached an accuracy of over 96% using SVM.
Although most source cell-phone identification systems achieve good accuracy, they have certain limitations. The speech files they identify are almost always clean (with nearly no environmental noise). Few studies have considered noise attacks. In practice, the speech files to be identified are usually recorded in a variety of noisy environments, and the environmental noise degrades recognition accuracy. Identifying the source cell-phone in a noisy environment is therefore more realistic and more challenging. For this reason, this paper proposes a source cell-phone identification algorithm suitable for noisy environments. The algorithm uses the spectral distribution features of the constant Q transform domain as the device fingerprint and uses the multi-scene training method to train a CNN model for source cell-phone identification.
The rest of the paper is organized as follows: Section 2 analyzes the differences between speech files recorded by different brands of cell-phone and by different models of the same brand; Section 3 presents the spectral distribution features of the CQT domain, motivated by the device difference analysis, together with two traditional features, MFCC and LFCC; Section 4 introduces four kinds of classifiers and the flow chart of the source cell-phone identification algorithm; Section 5 describes the construction of the basic speech databases and the noisy speech databases; and Section 6 gives the experimental results. Lastly, we conclude this paper in Section 7.

Device Difference Analysis
A spectrogram is a visual representation of how the spectrum of a speech signal changes with time. To study the differences between speech files recorded by different cell-phones in different frequency bands, Figure 1 shows short-time Fourier transform (STFT) domain spectrograms of speech files recorded by the same speaker simultaneously with 8 cell-phones in a quiet office environment. As can be seen from the figure, the spectrograms of speech files recorded by different brands of cell-phones vary greatly. For example, the energy of the HuaweiMate7 drops rapidly near 0.7 kHz, whereas that of the Mi4 drops near 1 kHz. The energy distributions of the other brands' cell-phones, and the frequency bands in which their energy changes abruptly, also differ. However, for different models of the Apple brand, the spectrograms are very similar.
To analyze the frequency differences between different models of cell-phones from the same brand, Figure 2 plots the spectrograms of speech files recorded by different models of Apple cell-phones. Although the four images are very similar, with a rapid energy change at around 1.5 kHz, there are still some differences. For example, the iPhone 6 has lower energy in the 0-1.5 kHz band than the other three phones. The iPhone 5 and iPhone 6s have distinct energy peaks at around 1.2 kHz and 1 kHz, respectively, which the other two phones lack.
According to the above, different brands of cell-phones differ substantially and are easy to distinguish. Therefore, the key to source cell-phone identification lies in distinguishing different models of cell-phones of the same brand. In other words, identification performance depends on how well the differences between recording devices in the middle and low frequencies can be captured.

Spectral Distribution Features of the CQT Domain
Based on the analysis in Section 2 of the short-time Fourier transform (STFT) domain spectrograms of speech files recorded by different cell-phones, the differences between brands are obvious, whereas different models of the same brand are highly similar, with subtle differences only in the middle- and low-frequency bands. This paper therefore constructs features from the constant Q transform (CQT) domain that can effectively distinguish different recording devices. Compared with the STFT, which has a fixed time-frequency resolution, the CQT has a higher frequency resolution at low frequencies and a higher time resolution at high frequencies. To capture these variations in the CQT-domain spectrogram, this paper uses spectral distribution features to describe the characteristics of the spectrum.
The calculation process of the spectral distribution features of the CQT domain is given below.
(1) If the time domain signal of a speech file is x(n) and its frequency domain signal after the CQT is X^{CQT}(k), then X^{CQT}(k) is defined by:

$$X^{CQT}(k) = \frac{1}{N_k} \sum_{n=0}^{N_k-1} x(n)\, w_{N_k}(n)\, e^{-j 2\pi n f_k / f_s} \tag{1}$$

where k = 1, 2, ..., K is the frequency bin index; f_s is the sampling rate; f_k is the center frequency of bin k, which is exponentially distributed and is defined as

$$f_k = f_1 \cdot 2^{(k-1)/B} \tag{2}$$

where B is the number of bins per octave, and f_1 is the center frequency of the lowest frequency bin, computed from the highest analyzed frequency f_{max} according to:

$$f_1 = \frac{f_{max}}{2^{K/B}} \tag{3}$$

w_{N_k}(n) is a window function (Hanning window). From high frequency to low frequency, the frequency resolution increases and the time resolution is gradually sacrificed. Therefore, the window length N_k varies with k and is inversely proportional to the center frequency f_k, namely:

$$N_k = Q\, \frac{f_s}{f_k} \tag{4}$$

The Q-factor is a constant independent of k and is defined as the ratio of the center frequency to the bandwidth:

$$Q = \frac{f_k}{f_{k+1} - f_k} = \frac{1}{2^{1/B} - 1} \tag{5}$$

(2) For the frequency value X_i(k) of the i-th frame at the k-th frequency point, the amplitude Y_i(k) is:

$$Y_i(k) = \left| X_i(k) \right| \tag{6}$$

(3) Spectral distribution features:

$$SSF(k) = \frac{1}{T_k} \sum_{i=1}^{T_k} Y_i(k) \tag{7}$$

where T_k represents the total number of frames of the speech in the k-th frequency band, k = 1, 2, ..., K. Therefore, for a speech file, its spectral distribution features comprise a 1 × K vector. K is set as 420 in this paper.
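As an illustration of steps (1) to (3), the following minimal Python sketch evaluates the CQT bin by bin with a direct (slow) sum and averages the magnitudes over frames. It is not the implementation used in this paper: the function names, the 50% frame overlap, the relation f_1 = f_max / 2^(K/B) and the small parameter values in the usage lines are assumptions made for the example.

```python
import cmath
import math


def cqt_geometry(f_max, octaves, bins_per_octave, fs):
    """Return (centre frequencies, window lengths, Q) of a CQT filter bank."""
    K = octaves * bins_per_octave                     # total number of bins
    f_1 = f_max / 2 ** (K / bins_per_octave)          # assumed lowest centre frequency
    Q = 1.0 / (2 ** (1.0 / bins_per_octave) - 1.0)    # constant quality factor
    # Bins are 0-indexed here, so bin k has centre frequency f_1 * 2^(k/B).
    freqs = [f_1 * 2 ** (k / bins_per_octave) for k in range(K)]
    lengths = [math.ceil(Q * fs / f) for f in freqs]  # N_k shrinks as f_k grows
    return freqs, lengths, Q


def spectral_distribution(x, fs, f_max, octaves, bins_per_octave):
    """Mean CQT magnitude per bin, averaged over all full frames of x."""
    freqs, lengths, _ = cqt_geometry(f_max, octaves, bins_per_octave, fs)
    features = []
    for f_k, N_k in zip(freqs, lengths):
        hop = max(1, N_k // 2)                        # assumed 50% frame overlap
        # Hanning window of length N_k combined with the complex exponential.
        kernel = [
            (0.5 - 0.5 * math.cos(2 * math.pi * n / (N_k - 1)))
            * cmath.exp(-2j * math.pi * n * f_k / fs) / N_k
            for n in range(N_k)
        ]
        mags = []
        for start in range(0, len(x) - N_k + 1, hop):
            X = sum(x[start + n] * kernel[n] for n in range(N_k))
            mags.append(abs(X))                       # amplitude of this frame
        features.append(sum(mags) / len(mags) if mags else 0.0)
    return features


# Usage with deliberately small, hypothetical parameters: a 1 kHz tone.
fs = 8000
tone = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(2000)]
feats = spectral_distribution(tone, fs, f_max=4000, octaves=3, bins_per_octave=12)
peak_bin = feats.index(max(feats))   # bin whose centre frequency is 1 kHz
```

With the setting reported in this paper (7 octaves and 60 bins per octave), the same function would return a 420-dimensional vector per speech file, at a much higher computational cost than an FFT-based CQT.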

Traditional Features (MFCC, LFCC)
Almost all of the features used in source cell-phone recognition are extracted from the Fourier transform domain of the speech signal. The most popular features for recognition systems are MFCC and LFCC. Their extraction procedure is shown in Figure 3. First, the speech signal is divided into overlapping frames, and each frame is windowed using an appropriate window function. Then, the power spectrum is computed using the fast Fourier transform (FFT) and smoothed with a bank of triangular filters whose center frequencies are uniformly spaced on the chosen frequency scale. Finally, the logarithmic filter bank outputs are converted into features by taking the discrete cosine transform (DCT). In this paper, we use 30 ms frames with a 15 ms overlap and a Hamming window. For every speech file, the dimensions of MFCC and LFCC are both set to 24.
The difference between MFCC and LFCC lies in the frequency scale used to position the triangular filters: MFCC uses a Mel-frequency scale, whereas the triangular filters of LFCC are linearly spaced in frequency.
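The difference in filter placement can be made concrete with a short Python sketch. The Mel mapping below is the widely used 2595·log10(1 + f/700) formula, which is an assumption here since the paper does not state which variant it uses; the function names, the number of filters (24) and the 0 to 8 kHz band are likewise hypothetical choices for illustration.

```python
import math


def hz_to_mel(f):
    """Common Mel mapping (assumed variant)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10 ** (m / 2595.0) - 1.0)


def centre_frequencies(n_filters, f_low, f_high, scale):
    """Centre frequencies (Hz) of n_filters triangular filters spaced uniformly on `scale`."""
    if scale == "mel":
        lo, hi, to_hz = hz_to_mel(f_low), hz_to_mel(f_high), mel_to_hz
    elif scale == "linear":
        lo, hi, to_hz = f_low, f_high, (lambda v: v)
    else:
        raise ValueError("scale must be 'mel' or 'linear'")
    step = (hi - lo) / (n_filters + 1)   # band edges sit at lo and hi
    return [to_hz(lo + step * (i + 1)) for i in range(n_filters)]


mfcc_centres = centre_frequencies(24, 0.0, 8000.0, "mel")      # dense at low frequencies
lfcc_centres = centre_frequencies(24, 0.0, 8000.0, "linear")   # evenly spaced in Hz
```

On the Mel scale more filters fall below 1 kHz than on the linear scale, which is why MFCC resolves low frequencies more finely than LFCC for the same number of filters.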

SVM
The Support Vector Machine (SVM) is a supervised learning algorithm originally designed for two-class problems. Its basic idea is to find the separating hyperplane in the feature space that maximizes the margin between the positive and negative samples of the training set. SVM can solve linear problems directly and nonlinear problems after introducing a kernel method. For the n-class problem in this paper, n × (n − 1)/2 pairwise SVM classifiers are combined for classification, and the Gaussian kernel is used as the kernel function.
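A small Python sketch of the two ingredients named above, the Gaussian kernel and the pairwise (one-vs-one) decomposition, is given below. The majority vote used to combine the pairwise decisions is a common scheme and an assumption here, as are the function names and the kernel width `gamma`; the sketch does not train any SVM.

```python
import math
from itertools import combinations


def gaussian_kernel(u, v, gamma):
    """Gaussian (RBF) kernel exp(-gamma * ||u - v||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def one_vs_one_pairs(classes):
    """All unordered class pairs; one binary SVM would be trained per pair."""
    return list(combinations(classes, 2))


def majority_vote(pairwise_winners):
    """Combine the winners of the pairwise classifiers (assumed voting rule)."""
    counts = {}
    for winner in pairwise_winners:
        counts[winner] = counts.get(winner, 0) + 1
    return max(counts, key=counts.get)


# For the 24 cell-phones used in this paper: 24 * 23 / 2 = 276 binary classifiers.
pairs = one_vs_one_pairs(range(24))
```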

RF
Random Forest (RF) is an ensemble classification algorithm that determines the final classification result by the votes of multiple decision trees. RF has two characteristics, sample randomness and feature randomness, which help reduce over-fitting. Sample randomness means that if the training set size is x, then for each decision tree, x training samples are drawn randomly with replacement from the training set using the bootstrapping method. Feature randomness means that when training the nodes of each decision tree, the candidate features are randomly selected from all the features according to a certain proportion. By calculating the amount of information contained in each candidate feature, the feature with the highest classification ability is selected for node splitting. In this paper, the CART algorithm is used to generate the decision trees, and the Gini index is used as the criterion for selecting the optimal feature and splitting point.
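The bootstrap sampling and the Gini criterion described above can be sketched in a few lines of Python. This is only an illustration of the two building blocks, not a full CART or Random Forest implementation; the function names are hypothetical.

```python
import random


def bootstrap_sample(samples, rng):
    """Draw len(samples) items at random with replacement (sample randomness)."""
    return [rng.choice(samples) for _ in samples]


def gini_index(labels):
    """Gini impurity 1 - sum_c p_c^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_gini(left_labels, right_labels):
    """Size-weighted Gini impurity of a candidate split; lower is better."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini_index(left_labels) + \
           (len(right_labels) / n) * gini_index(right_labels)


rng = random.Random(0)
tree_training_set = bootstrap_sample(list(range(10)), rng)
```

A split that separates the classes perfectly has a weighted Gini impurity of 0, so CART prefers the feature and threshold that minimise `split_gini`.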

CNN
A Convolutional Neural Network (CNN) is a multi-layer neural network consisting mainly of convolutional layers, pooling layers, nonlinear activation layers, and fully connected layers. The nonlinear modeling capability of CNN makes it an excellent classifier. The CNN structure adopted in this paper is shown in Figure 4. It uses dropout, a regularization method to prevent overfitting. The basic idea of dropout is to randomly drop some neurons during the training of the network. The model becomes more robust because it cannot rely too heavily on any particular local feature (any of which may be dropped). This paper uses stochastic gradient descent to train the CNN model.
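The dropout idea can be illustrated with a minimal Python sketch. It uses the "inverted dropout" convention, in which surviving activations are rescaled during training so that nothing changes at test time; this convention, the function name and the drop probability are assumptions, since the paper does not give these details.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Randomly zero activations with probability drop_prob during training."""
    if not training or drop_prob == 0.0:
        return list(activations)          # inference: keep every neuron
    keep_prob = 1.0 - drop_prob
    # Survivors are divided by keep_prob so the expected activation is unchanged.
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)
hidden = [1.0] * 8
train_out = dropout(hidden, 0.5, rng)                  # some entries are zeroed
test_out = dropout(hidden, 0.5, rng, training=False)   # identical to the input
```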

RNN-BLSTM
A Recurrent Neural Network (RNN) is a neural network that models sequence data. The network memorizes previous information and applies it to the calculation of the current output. That is, the nodes within the hidden layer are connected across time steps rather than unconnected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment. The Long Short-Term Memory (LSTM) network is a special type of RNN that has advantages over the basic RNN in sequence generation and sequence tagging. In an RNN-LSTM, each node of the hidden layer is an LSTM unit rather than the basic node of a normal RNN. A unidirectional RNN-LSTM models only the forward sequence, which means that later parts of the sequence cannot affect the modeling of earlier parts. If the entire sequence can be used in both the forward and backward directions at the same time, the accuracy of the model should improve. Therefore, a bidirectional model, RNN-BLSTM, is selected to classify different recording devices. The specific network framework used is shown in Figure 4.

Multi-Scene Training Recognition Systems
In this paper, the multi-scene training method is used to enhance the noise robustness of the source cell-phone identification system. The flow chart of this system is shown in Figure 5. The traditional single-scenario training method uses only clean speech files to extract the distinguishing features of the device and then uses those features to build a recognition model. When the multi-scene training method is used, the training set contains not only clean speech files but also noisy speech files with different noise types and different noise intensities. The model can thus learn how noise affects the differences between speech recorded by different devices, which makes it more robust.

Basic Speech Databases
In the experiment, we used 24 cell-phones from 7 brands to study source cell-phone recognition; their specific information is shown in Table 1. To make the experimental results comparable, we used two different basic speech databases, the TIMIT Recaptured Database (TIMIT-RD) and the CKC speech database (CKC-SD) [11], to investigate the performance of our cell-phone recognition system under different conditions. Both were built using the cell-phone devices in Table 1.
TIMIT-RD is a speech database built from speech files taken from the TIMIT [12] database. These files comprise 1600 speech samples from 160 people (half men and half women), which were played back through a high-fidelity speaker (PhilipsDTM3500, manufacturer: Philips Investment Co., Ltd., Zhongshan, China) in a quiet office environment. For each cell-phone, we used 800 utterances for training and the remaining 800 utterances for testing. The CKC-SD database is the second database, constructed by recording speech spoken by 12 speakers (half of whom were male) in a quiet office environment. All cell-phones were placed in a circular arc around the speaker for simultaneous recording. Each speaker recorded two speech segments, each longer than 5 min, at a normal speech rate and intonation. One segment had fixed content that was the same for every speaker. The other took the form of questions and answers, and its content differed for each speaker. Half of the recording (5 min), segmented into 3 s long chunks, was used to train each phone, and the remaining 5 min portion was segmented into 3 s long chunks for testing.
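The CKC-SD segmentation can be sketched as follows. The sketch assumes that the first 5 min of a recording form the training portion and the rest the test portion, and that a trailing chunk shorter than 3 s is discarded; neither detail is stated in the text, and the function names are hypothetical.

```python
def split_into_chunks(samples, fs, chunk_seconds=3.0):
    """Cut a sample sequence into consecutive, non-overlapping full-length chunks."""
    size = int(round(chunk_seconds * fs))
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


def train_test_chunks(samples, fs, train_seconds=300.0):
    """Split a recording into a training half and a test half, then chunk both."""
    cut = int(round(train_seconds * fs))
    return split_into_chunks(samples[:cut], fs), split_into_chunks(samples[cut:], fs)


# A 10 min recording at a toy sampling rate of 100 Hz gives 100 + 100 chunks of 3 s.
recording = [0.0] * (600 * 100)
train_chunks, test_chunks = train_test_chunks(recording, fs=100)
```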

Noisy Speech Databases
To study the robustness of the source cell-phone identification system in noisy environments, different types of noise at a variety of signal-to-noise ratios were added to the two basic speech databases to simulate real noisy scenes. To add noise to the basic databases, we used the filtering and noise addition tool (FaNT, version), an open-source tool that follows the ITU recommendations on noise addition and filtering. The noise signals were selected from five noise types in the NOISEX-92 noise database (white noise, babble noise, street noise, cafe noise and Volvo noise), and for each type of noise, three signal-to-noise ratio (SNR) levels, i.e., 0 dB, 10 dB and 20 dB, were considered. Therefore, each basic speech database yielded 15 noisy databases with different noise types and noise intensities. The reasons for choosing these five types of noise are as follows: (1) The energy of white noise is evenly distributed over the frequency components. Although it rarely represents real conditions, it is commonly used when studying robust speech processing methods; (2) Babble noise, produced by multiple simultaneous speakers, is one of the most difficult types of noise in speaker-related applications and occurs every day in any crowded place; (3) Street, cafe and Volvo noises are other types of noise that often occur in daily life.
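The core of noise addition at a target SNR can be sketched in Python as below. This is a simplified stand-in and not a description of FaNT, which additionally handles filtering and level measurement; the function names are hypothetical, and the power is computed over the whole signal.

```python
import math


def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(s * s for s in signal) / len(signal)


def add_noise(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) equals snr_db, then mix."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    target_noise_power = power(speech) / (10 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy signals standing in for a speech file and a NOISEX-92 noise file.
speech = [math.sin(0.1 * n) for n in range(4000)]
noise = [math.cos(0.37 * n) for n in range(5000)]
noisy_versions = {snr: add_noise(speech, noise, snr) for snr in (0, 10, 20)}
```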

Experimental Setup
For the multi-scene training recognition system used in this paper, the types of speech files included in the training set are as follows:
Train: clean, white (0, 10, 20 dB), babble (0, 10, 20 dB), street (0, 10, 20 dB).
The training set includes not only the clean speech files of the basic speech database but also noisy speech files of three noise types (white, babble and street) at three signal-to-noise ratio (SNR) levels from the noisy speech databases. To test the model, 16 test sets are used, each containing one type of speech file. Ten of them comprise the clean speech files and the 9 noisy conditions whose noise types and intensities were in the training set (seen noisy speech); the remaining 6 test sets contain noisy speech files of the two noise types that were not in the training set, cafe and Volvo, at the same three SNR levels (unseen noisy speech). This makes it possible to check whether the model built by multi-scene training generalizes, that is, whether it can perform effective source cell-phone recognition on speech files from unseen noisy scenarios.
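The condition counts above can be checked with a short Python sketch that enumerates the training and test conditions. Treating cafe and Volvo as the two unseen noise types follows from the five noise types of the noisy speech databases; the identifier names and label format are hypothetical.

```python
from itertools import product

SNR_LEVELS_DB = (0, 10, 20)
SEEN_NOISES = ("white", "babble", "street")   # present in the training set
UNSEEN_NOISES = ("cafe", "volvo")             # held out for testing only


def noisy_conditions(noise_types):
    """One label per (noise type, SNR) combination."""
    return [f"{noise}_{snr}dB" for noise, snr in product(noise_types, SNR_LEVELS_DB)]


train_conditions = ["clean"] + noisy_conditions(SEEN_NOISES)   # multi-scene training
seen_test_sets = ["clean"] + noisy_conditions(SEEN_NOISES)     # 10 test sets
unseen_test_sets = noisy_conditions(UNSEEN_NOISES)             # 6 test sets
all_test_sets = seen_test_sets + unseen_test_sets              # 16 test sets
```

A single-scenario baseline corresponds to `train_conditions = ["clean"]` evaluated on the same 16 test sets.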

Parameter Setup
When the CQT converts the time domain signal of the speech file into the frequency domain, K determines the precision of the frequency domain information of the speech signal, so the magnitude of K has a certain influence on the recognition performance. As shown by Equation (3):

$$K = B \cdot OCT \tag{8}$$

$$OCT = \left\lfloor \log_2 \frac{f_{max}}{f_1} \right\rfloor \tag{9}$$

where ⌊x⌋ represents the largest integer less than or equal to x, OCT is the number of octaves (generally an integer less than or equal to 9), and B is the number of frequency points per octave (generally about 100). So the size of K is determined by B and OCT. Figure 6 shows the performance of the spectral distribution features for different OCT values when B is 100. The device recognition accuracies are compared using two traditional classifiers, SVM and RF, on the CKC-SD database. It can be seen from the figure that different OCT values have similar effects on the recognition of different test sets under the two classifiers. As the OCT value increases, the accuracy for noisy speech files containing white noise and babble noise gradually increases, whereas that for speech files containing Volvo noise drops sharply; apart from cafe noise, the remaining speech types are not significantly affected. Considering the complexity of the algorithm and the influence of different OCT values on the accuracy for different noise types, the OCT value is set to 7.
Figure 7 compares the influence of different B values on device recognition when the OCT value is 7. Experiments were carried out using two traditional classifiers, SVM and RF, on the CKC-SD database. As the B value increases, there is little effect on the recognition of clean speech files and of noisy speech files containing seen noise types, and the recognition rate for speech files containing unseen noise types improves only slightly. In short, the value of B has little effect on the accuracy of source identification, so we choose 60 as the value of B. Therefore, the K value is 420 (7 × 60) in this paper, and each speech file yields 420 spectral distribution features.
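The relation between OCT, B and K can be checked with a few lines of Python. This is only a minimal sketch: the band edges `F_MAX` and `F_1` are hypothetical values chosen so that OCT comes out as 7 (the paper's actual frequency range is not restated here), and the helper names are illustrative.

```python
import math


def num_octaves(f_max: float, f_1: float) -> int:
    """OCT = floor(log2(f_max / f_1)), following Equation (9)."""
    return math.floor(math.log2(f_max / f_1))


def num_bins(octaves: int, bins_per_octave: int) -> int:
    """K = OCT * B, the total number of CQT frequency points."""
    return octaves * bins_per_octave


# Hypothetical band edges in Hz (assumption, not taken from the paper).
F_MAX = 8000.0
F_1 = 62.5

OCT = num_octaves(F_MAX, F_1)
K = num_bins(OCT, 60)  # B = 60, the value chosen in the text
print(OCT, K)
```

With these assumed band edges the sketch reproduces the setting used in the text, OCT = 7 and K = 7 × 60 = 420.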

Comparison of Features
To compare the influence of the proposed features and traditional features on the performance of source cell-phone recognition, Figure 8 plots the accuracy of four features (MFCC, LFCC, SSF (STFT) and SSF (CQT)) on different test sets under SVM. The figure shows that all four features recognize clean speech files well, but the accuracy decreases once noise is added and continues to fall as the noise intensity increases. Secondly, at the same noise intensity, recognition of seen noisy speech files is significantly better than that of unseen noisy speech files. The recognition rate of the traditional features, MFCC and LFCC, on noisy speech files decreases sharply as the noise intensity increases; the situation is even worse for the unseen noisy scenarios, where the highest accuracy is only 80% and the lowest is 24%. The traditional features are therefore not robust to noise. The SSF (STFT) feature generally performs worse than the traditional features, but is superior to them at strong noise intensity. The SSF (CQT) feature is more robust than the other features. It is clearly better than MFCC, LFCC, and SSF (STFT) for clean speech and seen noisy speech files, with an accuracy higher than 70%. For unseen noisy speech files, however, its recognition performance does not differ significantly from that of the other features: at weak noise intensity the accuracy is slightly improved, while at high noise intensity it is reduced.

In general, the SSF (CQT) feature is significantly superior to the other features as the device fingerprint. The MFCC, LFCC and SSF (STFT) features are extracted from the STFT domain, while SSF (CQT) is derived from the CQT domain. The frequency-domain information obtained from a speech signal therefore differs with the time-frequency transform used, which leads to the difference in accuracy. CQT is more suitable for source cell-phone recognition than the STFT.
Tables 2 and 3 show the specific classification results of the MFCC and SSF (CQT) features, respectively, on the clean test set from CKC-SD. In the tables, AL is the actual device model on which the speech files were recorded, and PL is the predicted device model. Table 2 shows that the average accuracy of MFCC over the 24 devices is 92%. The overall performance of MFCC is good, but the accuracy varies greatly across device models. The recognition rate for Meizu and Xiaomi is almost 100%. The recognition rate is lowest for two HTC models (D610t and D820t), at 56% and 79%, respectively. As with Huawei and Apple, the three HTC models are misjudged within the brand. The misclassifications of Xiaomi and Samsung are mainly within the brand, but also include a small number of misjudgments outside the brand. Table 3 shows that the average accuracy of SSF (CQT) over the 24 devices is 98%, which is 6 percentage points higher than that of MFCC. This feature is almost perfect for recognizing the Meizu, Xiaomi, OPPO and Samsung brands. The errors for HTC, Huawei, and Apple are misjudgments within the brand, and their accuracy is improved compared to MFCC.
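The quantities read off Tables 2 and 3 (per-device accuracy, average accuracy, and whether errors stay within a brand) can be computed from a confusion matrix as sketched below. The 4 × 4 matrix and the device and brand labels are toy values invented for illustration, not results from the paper, and the sketch assumes rows index the actual device (AL) and columns the predicted device (PL).

```python
def per_device_accuracy(confusion):
    """Accuracy (%) per actual device: diagonal count over the row total."""
    return [100.0 * row[i] / sum(row) for i, row in enumerate(confusion)]


def within_brand_error_share(confusion, brands):
    """Fraction of misclassified files whose predicted device has the actual brand."""
    within = 0
    total = 0
    for i, row in enumerate(confusion):
        for j, count in enumerate(row):
            if i == j:
                continue  # correct predictions are not errors
            total += count
            if brands[i] == brands[j]:
                within += count
    return within / total if total else 0.0


# Toy data (hypothetical): two brands, two models each, 100 test files per device.
brands = ["A", "A", "B", "B"]
confusion = [
    [90, 10, 0, 0],   # actual A-1
    [5, 94, 1, 0],    # actual A-2
    [0, 0, 100, 0],   # actual B-1
    [0, 2, 0, 98],    # actual B-2
]

accuracies = per_device_accuracy(confusion)
average_accuracy = sum(accuracies) / len(accuracies)
share = within_brand_error_share(confusion, brands)
print(accuracies, average_accuracy, share)
```

A high within-brand share on real data would correspond to the observation above that most errors are confusions between models of the same brand.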
Table 4 shows the classification results of the MFCC and SSF (CQT) features for different brands on different test sets. For both the clean speech test set and the noisy speech test set containing white noise, the SSF (CQT) feature improves the accuracy for every brand compared to MFCC. This confirms that the high resolution of the CQT in the low-to-middle frequency band can improve the recognition of different cell-phone models of the same brand.
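The resolution argument can be made concrete by comparing bin spacings. The sketch below uses the standard geometric spacing of CQT center frequencies, f_k = f_1 · 2^(k/B), against the constant spacing f_s / N of an STFT. The sampling rate, FFT length and lowest frequency are hypothetical values chosen for illustration; only B = 60 and K = 420 come from the text.

```python
def cqt_center_freqs(f_1, bins_per_octave, n_bins):
    """Geometrically spaced CQT center frequencies f_k = f_1 * 2**(k / B)."""
    return [f_1 * 2 ** (k / bins_per_octave) for k in range(n_bins)]


def cqt_bin_spacing(f_k, bins_per_octave):
    """Distance from f_k to the next CQT bin: f_k * (2**(1 / B) - 1)."""
    return f_k * (2 ** (1 / bins_per_octave) - 1)


def stft_bin_spacing(sample_rate, n_fft):
    """Constant spacing between STFT bins: f_s / N."""
    return sample_rate / n_fft


# Hypothetical settings (assumptions): 16 kHz audio, 512-point FFT, f_1 = 62.5 Hz.
freqs = cqt_center_freqs(62.5, 60, 420)
stft_step = stft_bin_spacing(16000, 512)
lowest_step = cqt_bin_spacing(freqs[0], 60)
highest_step = cqt_bin_spacing(freqs[-1], 60)
print(lowest_step, stft_step, highest_step)
```

Under these assumptions the CQT bins are spaced far more finely than the STFT bins at the low end of the band and more coarsely at the high end, which is the property the text relies on for separating models of the same brand.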
The above experimental results and analysis indicate that SSF (CQT) can be used to determine the unique identity information of a specific device model, and can effectively identify the recording equipment for both clean speech files and noisy speech files.
Figure 9 compares the performance of SSF (CQT) under four different classifiers. The figure shows that the traditional SVM and RF classifiers recognize clean speech files almost equally well, but differ on noisy speech files. RNN classifies the clean speech test set and the noisy speech test sets with white noise, babble noise, and street noise significantly worse than the traditional classifiers, but recognizes unseen noisy speech better than they do, especially for Volvo noise, where the highest accuracy shows an increase of about 20%. Notably, the accuracy of CNN on the 16 test sets is higher than that of the other three classifiers, and the improvement is greatest on the test sets of unseen noisy scenarios. On the test sets of cafe and Volvo noise at different noise intensities, most of the accuracies are higher than 90%, and the lowest accuracy is still greater than 70%.

Overall, the deep-learning CNN classifier performs best: it not only maintains good performance on clean speech files, but also recognizes the source device well on the 15 noisy speech test sets. Even though the training set does not include speech files with cafe noise and Volvo noise, CNN can still distinguish recording devices from noisy speech files containing these two kinds of noise, with accuracy comparable to that for seen noisy speech. Therefore, CNN is more suitable for source identification in noisy environments.

Comparison of Single-Scene and Multi-Scene Training
To verify the effectiveness of the multi-scene training method, Table 5 compares the performance of the single-scene and multi-scene training methods on the two databases, CKC-SD and TIMIT-RD. SSF (CQT) is used as the feature and CNN as the classifier. The table shows that when the test speech consists of clean speech files, the recognition rate of the multi-scene training algorithm is higher than that of the single-scene algorithm on both databases, indicating that adding noisy speech files to the training set improves recognition. Secondly, when the test speech consists of noisy speech files, the accuracy of the multi-scene training algorithm is greatly improved on both databases compared to the single-scene training method, especially for high-intensity noisy speech, where the accuracy can increase by up to 60%. The multi-scene training algorithm using the features proposed in this paper and the CNN classifier not only performs well on speech files with seen noise scenes, but also achieves considerable accuracy on speech files with unseen noise scenes. Therefore, training the model using the multi-scene training method can solve the carrier mismatch problem of the single-scene training method.
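One way to assemble a multi-scene training set is sketched below, assuming that it combines the clean files with copies corrupted by each seen noise type at several signal-to-noise ratios. The SNR values, the synthetic signals and the function names are hypothetical; in particular, Gaussian noise stands in for all three noise recordings, and the paper's actual noise sources and intensity levels are not restated here.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so that 10*log10(P_speech / P_noise) == snr_db.

    Assumes noise has at least as many samples as speech; extra samples are ignored.
    """
    noise = noise[:len(speech)]
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def build_multi_scene_set(clean_files, noises, snrs_db):
    """Return (scene, snr_db, samples) tuples: clean files plus one noisy copy
    of every clean file for each (noise type, SNR) pair."""
    training = [("clean", None, samples) for samples in clean_files]
    for noise_name, noise in noises.items():
        for snr_db in snrs_db:
            for samples in clean_files:
                training.append((noise_name, snr_db, mix_at_snr(samples, noise, snr_db)))
    return training


# Toy stand-ins (assumptions): a sine wave as speech, Gaussian noise for each scene.
rng = random.Random(0)
speech = [math.sin(0.05 * t) for t in range(2000)]
noises = {
    name: [rng.gauss(0.0, 1.0) for _ in range(2000)]
    for name in ("white", "babble", "street")
}
SNRS_DB = [0, 10, 20]  # hypothetical noise intensities

training = build_multi_scene_set([speech], noises, SNRS_DB)
print(len(training))
```

Single-scene training corresponds to using only the `"clean"` entries; the unseen noise types (cafe and Volvo in the experiments above) would appear only in the test sets.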

Comparison of Different Identification Algorithms
To comprehensively evaluate the source cell-phone identification algorithm proposed in this paper, we compare the recognition algorithm from Reference [10] with this paper's recognition algorithm, using the multi-scene training method and testing on our speech databases. The number of training and test speech files is the same as that used elsewhere in this article. The algorithm of Reference [10] extracts the sub-band energy difference as the distinguishing feature, which is obtained by differencing the power values of the original speech file after the Fourier transform, and uses SVM as the classifier. Its parameter settings are consistent with Reference [10].
Table 6 compares the source cell-phone identification accuracy of the algorithm presented in Reference [10] with that of the algorithm proposed in this paper. The table shows that the two algorithms perform almost equally well on clean speech files, but on noisy speech files this paper's algorithm is superior to the algorithm of Reference [10], especially for noisy speech with unseen noise types.

Conclusions
Current source cell-phone identification algorithms use cepstral coefficients, features based on cepstral coefficients, or features extracted directly from the spectrum of the Fourier transform domain as the distinguishing feature of the mobile phone. They perform well, but they have almost always been evaluated on speech files recorded in a quiet environment (which can be considered free of scene noise). When the speech file contains scene noise, the performance of these algorithms drops sharply, and it deteriorates further as the noise intensity increases, so their noise robustness is poor. Considering that the speech files to be recognized usually contain scene noise, and that the traditional recognition algorithms have poor noise robustness, this paper proposes a source cell-phone recognition algorithm suitable for noisy environments, which achieves very good recognition performance on both clean and noisy speech files and has strong noise robustness.
Through analysis, this paper finds that the differences between cell-phones of different brands lie mainly at high frequencies; these differences are obvious and easy to distinguish. However, different cell-phone models of the same brand differ only slightly, in the middle and low frequencies, and are difficult to distinguish. Therefore, the algorithm extracts the spectral distribution feature of the CQT domain of the speech file as the device fingerprint. When the CQT converts the time-domain speech signal to the frequency domain, the frequency resolution increases from high frequency to low frequency, so that the low-frequency information amplifies the slight differences between different models of the same brand and enhances their recognition. The features of the traditional recognition algorithms are extracted from the Fourier transform domain. The Fourier transform uses a fixed frequency resolution, so its representation of the low-frequency information of speech is less accurate than that of the CQT. Secondly, the multi-scene training method adopted in this paper not only improves the accuracy for clean speech files, but also improves the performance for noisy speech files with seen noise types. Despite this, the recognition performance for noisy speech files with unseen noise types is not improved. Finally, this paper uses CNN (a deep learning classifier) as the classifier. Compared with the machine learning classifiers used in traditional source recognition algorithms, it not only improves the recognition of clean speech files and noisy speech files with seen noise types, but also greatly

Figure 1. Spectrogram of speech files recorded by different brands of cell-phones.

Figure 2. Spectrograms of speech files recorded by different models of Apple cell-phones.

Figure 5. Source cell-phone identification algorithm block diagram of multi-scene training.

Figure 6. Comparison of different OCT values.

Information 2018, 9, x FOR PEER REVIEW

Figure 7. Comparison of different B values.

Figure 8. Comparison of accuracy of different features under SVM.

Table 3. Specific classification results of SSF (CQT) for the clean test set (%).

Figure 9. Comparison of the performance of SSF (CQT) under four different classifiers.

Table 2. Specific classification results of MFCC for the clean test set (%).

Table 4. Classification results of MFCC and SSF (CQT) for different brands.

Table 5. Comparison of accuracy of single-scene and multi-scene training.

Table 6. Comparison of accuracy of different algorithms.