1. Introduction
With the development of digital multimedia and Internet technologies, a variety of powerful and easy-to-use digital media editing tools has emerged, posing new problems and challenges to the trustworthiness of collected data, i.e., multimedia security issues. Recording device identification is a branch of multimedia forensics and an important research topic. Compared with dedicated recorders, cameras, DVs, etc., mobile phones are more popular and convenient. More and more people use mobile phones to record the scenes around them, and such recordings may even serve as evidence before courts or other law enforcement agencies. Therefore, source cell-phone identification is a hot topic for many forensic researchers.
In recent years, source cell-phone identification has seen substantial progress. Early on, many researchers used cepstral coefficients, or features derived from them, as the fingerprint of the device. Hanilci et al. [1] extracted Mel frequency cepstral coefficients (MFCCs) from recording files as device-distinguishing features and evaluated 14 different models of cell-phones; the closed-set recognition rate reached 96.42% using an SVM classifier. In a follow-up study, Hanilci et al. [2] used SVM to compare MFCCs, linear frequency cepstral coefficients (LFCCs), Bark frequency cepstral coefficients (BFCCs) and linear predictive cepstral coefficients (LPCCs). Their comparison covered several kinds of feature optimization, including feature normalization, cepstral mean normalization, cepstral mean and variance normalization, and delta and double-delta coefficients. The experimental results showed that while baseline MFCCs outperformed the other feature types, applying cepstral mean and variance normalization to LPCCs yielded the best overall performance (only slightly better than MFCCs). In addition, Kotropoulos et al. [3] extracted MFCCs from each recorded speech signal at the frame level. The MFCCs from each recording device were used to train a Gaussian mixture model (GMM) with diagonal covariance matrices, and a Gaussian supervector (GSV) was derived by concatenating the mean vectors and the main diagonals of the covariance matrices, serving as a template for each device. The best identification accuracy (97.6%) was obtained by a radial basis function neural network. The approaches above process the original recording file directly. Since the silent segments contain the same device information as the original speech files, and are not affected by factors such as speaker emotion, voice, intonation and speech content, some researchers began to extract features from the silent segments to characterize the recording device. Hanilci et al. [4] extracted MFCC and LFCC features from the silent segments. The results showed that the MFCC features achieved the highest recognition rates under SVM: 98.39% and 97.03% on the two databases, respectively.
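The GSV template construction described above (fit a diagonal-covariance GMM to a device's frame-level MFCCs, then concatenate the means and covariance diagonals) can be sketched as follows. The MFCC frames here are random stand-ins, and the component count is illustrative rather than the setting used in [3]:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gaussian_supervector(frame_features, n_components=4):
    """Fit a diagonal-covariance GMM to frame-level features, then
    concatenate the mean vectors with the main diagonals of the
    covariance matrices, yielding one template vector per device."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", random_state=0)
    gmm.fit(frame_features)
    # means_: (n_components, dim); covariances_ for "diag": (n_components, dim)
    return np.concatenate([gmm.means_.ravel(), gmm.covariances_.ravel()])

# Stand-in for MFCC frames from one device: 500 frames x 13 coefficients
rng = np.random.default_rng(0)
mfcc_frames = rng.normal(size=(500, 13))
gsv = gaussian_supervector(mfcc_frames)  # length 2 * 4 * 13 = 104
```

One such supervector per device can then serve as the template matched by the back-end classifier.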
In addition to these cepstral coefficients, the power-normalized cepstral coefficient (PNCC) gradually entered the field of source cell-phone identification. Zou et al. [5] used a Gaussian mixture model-universal background model (GMM-UBM) classifier to compare MFCC and PNCC in terms of source cell-phone recognition performance. Experiments showed that MFCC is more effective than PNCC, with recognition rates on the two databases reaching 92.86% and 97.71%, respectively. Wang et al. [6] extracted an improved PNCC feature from the silent segments, using long-term frame analysis to remove the influence of background noise. A GMM-UBM was set as the baseline system and improved by two-step discriminative training. The experimental results indicated that the average accuracy over 15 kinds of devices was 96.65%.
Although these features have achieved good results in source cell-phone identification, most cepstral coefficients are constructed based on the perceptual characteristics of the human ear. Researchers therefore sought features that characterize the inherent properties of the device itself and can serve as its fingerprint. Some scholars began to extract distinguishing features directly from the spectrum in the Fourier transform domain. Kotropoulos et al. [7] proposed a source cell-phone identification algorithm that uses sketches of spectral features (SSFs) as an intrinsic fingerprint. By applying a sparse-representation-based classifier to the SSFs, identification accuracy exceeded 95% on a set of 8 telephone handsets from the Lincoln-Labs Handset Database. Jin et al. [8] proposed a method for extracting the noise of the recording device from the silent segments. Spectral shape features and spectral distribution features were extracted from the device noise; combining the two feature sets gave the best results, with recognition rates of 89.23% and 94.53% on the two databases, respectively. Qi et al. [9] obtained the noise signal by de-noising with spectral subtraction and used the Fourier histogram coefficients of the noise signal as input to deep-model classifiers. Among the three deep learning classifiers compared, softmax, multilayer perceptron (MLP) and convolutional neural network (CNN), the CNN performed well, and a voting model combining multiple classifiers achieved the best effect, with a recognition rate reaching 99%. Recently, Luo et al. [10] proposed a new feature, the band energy difference, obtained by processing the differences between energy values of the Fourier transform of the speech file. This feature not only has low computational complexity but is also highly discriminative across mobile devices, reaching an accuracy of over 96% with SVM.
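As an illustration of a spectral-difference feature of this kind, the sketch below averages the log power spectrum over frames and takes a first-order difference along frequency. This is a simplified reading of the idea, not the exact feature definition of [10]:

```python
import numpy as np

def band_energy_difference(signal, frame_len=512, hop=256):
    """Average the log power spectrum over windowed frames, then
    difference along the frequency axis. A simplified stand-in for
    a band-energy-difference style feature."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mean_log_energy = np.log(power.mean(axis=0) + 1e-12)
    return np.diff(mean_log_energy)

x = np.random.default_rng(1).normal(size=8000)  # stand-in recording
bed = band_energy_difference(x)  # frame_len // 2 = 256 difference values
```

Because it only needs one FFT pass and a difference, such a feature is cheap to compute relative to full cepstral pipelines.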
Although most source cell-phone identification systems achieve good accuracy, they have certain limitations: the recordings they identify are almost exclusively clean speech files with nearly no environmental noise, and few studies have considered noise attacks. In practice, the speech files that need to be identified are usually recorded in a variety of noisy environments, and environmental noise degrades recognition accuracy. Identifying the source cell-phone in a noisy environment is therefore more realistic and more challenging. Based on this, this paper proposes a source cell-phone identification algorithm suitable for noisy environments. The algorithm uses the spectrum distribution features of the constant Q transform (CQT) domain as the device fingerprint and adopts a multi-scene training method to train a CNN model for source cell-phone identification.
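The geometric frequency spacing that distinguishes the CQT from the ordinary Fourier spectrum can be illustrated by pooling FFT power into logarithmically spaced bands. This is only a rough stand-in for a true constant-Q transform (which uses variable-length analysis kernels), and the band parameters are illustrative, not those used in this paper:

```python
import numpy as np

def cqt_band_energies(signal, sr=16000, fmin=50.0,
                      bins_per_octave=12, n_bins=84):
    """Approximate constant-Q analysis by pooling FFT power into
    geometrically spaced bands (bins_per_octave bands per octave,
    starting at fmin), then normalizing to a distribution."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), 1.0 / sr)
    edges = fmin * 2.0 ** ((np.arange(n_bins + 1) - 0.5) / bins_per_octave)
    energies = np.array([power[(freqs >= edges[k]) & (freqs < edges[k + 1])].sum()
                         for k in range(n_bins)])
    return energies / (energies.sum() + 1e-12)  # normalized distribution

x = np.random.default_rng(2).normal(size=16000)  # 1 s stand-in at 16 kHz
dist = cqt_band_energies(x)  # 84-band spectral distribution
```

The finer low-frequency resolution of such log-spaced bands is what motivates using the CQT domain to capture device-dependent spectral detail.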
The rest of this paper is organized as follows. Section 2 analyzes the differences between speech files recorded by cell-phones of different brands and by different models from the same brand. Section 3 presents the spectrum distribution features of the CQT domain proposed in this paper, motivated by the device difference analysis, alongside two traditional features, MFCC and LFCC. Section 4 introduces four kinds of classifiers and the flow chart of the source cell-phone identification algorithm. Section 5 describes the construction of the basic speech databases and the noisy speech databases. Section 6 gives the experimental results. Lastly, Section 7 concludes this paper.
Author Contributions
Conceptualization, W.R. and Y.D.; Methodology, Q.T.; Validation, L.L.
Funding
This research was funded by the National Natural Science Foundation of China (Grant No. U1736215, 61672302), Zhejiang Natural Science Foundation (Grant No. LZ15F020002, LY17F020010), Ningbo Natural Science Foundation (Grant No. 2017A610123), Ningbo University Fund (Grant No. XKXL1509, XKXL1503), Mobile Network Application Technology Key Laboratory of Zhejiang Province (Grant No. F2018001).
Conflicts of Interest
The authors declare no conflict of interest.
References
- Hanilci, C.; Ertas, F.; Ertas, T. Recognition of Brand and Models of Cell-Phones from Recorded Speech Signals. IEEE Trans. Inf. Forensics Secur. 2012, 7, 625–634.
- Hanilçi, C.; Ertas, F. Optimizing Acoustic Features for Source Cell-Phone Recognition Using Speech Signals. In Proceedings of the First ACM Workshop on Information Hiding and Multimedia Security, Montpellier, France, 17–19 June 2013; pp. 141–148.
- Kotropoulos, C.; Samaras, S. Mobile Phone Identification Using Recorded Speech Signals. In Proceedings of the 19th International Conference on Digital Signal Processing, Hong Kong, China, 20–23 August 2014; pp. 586–591.
- Hanilçi, C.; Kinnunen, T. Source Cell-Phone Recognition from Recorded Speech Using Non-speech Segments. Digital Signal Process. 2014, 35, 75–85.
- Zou, L.; Yang, J.; Huang, T. Automatic cell phone recognition from speech recordings. In Proceedings of the 2014 IEEE China Summit & International Conference on Signal and Information Processing (ChinaSIP), Xi’an, China, 9–13 July 2014; pp. 621–625.
- He, Q.; Wang, Z.; Rudnicky, A.I.; Li, X. A Recording Device Identification Algorithm Based on Improved PNCC Feature and Two-Step Discriminative Training. Electron. J. 2014, 42, 191–198.
- Kotropoulos, C. Telephone Handset Identification Using Sparse Representations of Spectral Feature Sketches. In Proceedings of the 2013 International Workshop on Biometrics and Forensics (IWBF), Lisbon, Portugal, 4–5 April 2013; pp. 1–4.
- Jin, C.; Wang, R.; Yan, D.; Tao, B.; Chen, Y.; Pei, A. Source Cell-Phone Identification Using Spectral Features of Device Self-noise. In Proceedings of the 15th International Workshop on Digital Watermarking (IWDW), Beijing, China, 17–19 September 2016; pp. 29–45.
- Qi, S.; Huang, Z.; Li, Y.; Shi, S. Audio Recording Device Identification Based on Deep Learning. In Proceedings of the 2016 IEEE International Conference on Signal and Image Processing (ICSIP), Beijing, China, 13–15 August 2016; pp. 426–431.
- Luo, D.; Korus, P.; Huang, J. Band Energy Difference for Source Attribution in Audio Forensics. IEEE Trans. Inf. Forensics Secur. 2018, 13, 2179–2189.
- Jin, C. Research on Passive Forensics for Digital Audio. Ningbo University, 2016; pp. 28–35. Available online: http://cdmd.cnki.com.cn/Article/CDMD-11646-1017871275.htm (accessed on 17 August 2018).
- Garofolo, J.S.; Lamel, L.F.; Fisher, W.M.; Fiscus, J.G.; Pallett, D.S.; Dahlgren, N.L. DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM. NIST Speech Disc 1-1.1; U.S. Department of Commerce: Gaithersburg, MD, USA, 1993; p. 93.
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).