To support image forensic investigations, researchers have introduced a range of methods for image source camera identification [9,10]. This section gives an overview of the proposed methods for identifying the source camera of an image, including methods based on intrinsic hardware artifacts that result from manufacturing imperfections and methods that exploit software-related properties. Intrinsic hardware-related flaws that can be exploited in image source camera identification include sensor pattern noise, lens radial distortion, and sensor dust, among others. Software artifact-based methods extract a camera fingerprint from the characteristics and artifacts left by the camera software, such as auto white balance approximation and colour filter array interpolation.
Figure 2 shows the taxonomy of image source camera identification techniques.
3.1. Sensor Pattern Noise-Based Techniques
Sensor pattern noise (SPN) originates from imperfections in the manufacturing of the image sensor chip, which cause pixel-to-pixel variation in sensitivity across the imaging sensor. This pattern noise is distinctive to the individual imaging sensor and therefore provides a “fingerprint” of that specific digital camera. The main component of SPN is the photo response non-uniformity (PRNU) noise, so analyzing the PRNU noise as a unique camera fingerprint is one of the trusted techniques for SPN-based image source camera identification. After the sensing process, the image undergoes further processing stages such as demosaicking, interpolation, and gamma correction. Even after all of these stages, the image still carries the sensor's marks, comparable to the scratches a gun barrel leaves on a bullet, which these processes do not remove.
In a paper widely regarded as a benchmark for image source camera identification using SPN, Lukas et al. [14] introduced a technique that uses the discrete wavelet transform to decompose the original image into four sub-bands. It applies a Wiener denoising filter to the three high-frequency wavelet sub-bands and reconstructs the image from the smoothed sub-bands. The resulting denoised image is subtracted from the input image to obtain the pattern noise of the image. The camera fingerprint, or reference pattern noise, is computed by averaging the pattern noise of a number of images taken by the camera under different conditions. To determine whether an image comes from the reference camera, the normalized cross-correlation between the pattern noise of the query image and the reference pattern noise of the camera is computed. Although the method is computationally demanding, which limits its use for large-scale processing, its reliability tends to be high. The experiments were conducted on roughly 320 images captured by nine consumer digital cameras, and the outcomes were assessed using the false acceptance rate (FAR) and false rejection rate (FRR). Even for cameras of the same model, the reported camera recognition accuracy is 99.8%. Jaiswal and Srivastava in [
15] highlighted that image scenes may highly contaminate the extracted PRNU, resulting in wrong camera identification. Therefore, they proposed a framework based on the frequency and spatial features to increase the size of the image dataset used to train and estimate the camera PRNU. The proposed framework uses discrete wavelet transform (DWT) and local binary pattern (LBP) to extract features from the images. These features are then used to train a multi-class classifier, e.g., support vector machine (SVM), linear discriminant analysis (LDA), and K-nearest neighbor (KNN). The resulting trained classifier is then used to identify the image camera source. Soobhany et al. [
16] proposed a technique similar to [14] that uses a non-decimated wavelet transform to decompose the input image into four wavelet sub-bands. To calculate the SPN of the image, the coefficients of the resulting high-frequency wavelet sub-bands are denoised. The SPN signature of the image is then compared to the reference SPN signature of the camera to identify the source camera. An advantage of this technique is that the non-decimated wavelet transform retains all the details of the wavelet sub-bands during decomposition, so more information is preserved. In addition, the SPN signature can be retrieved after the first level of wavelet decomposition, whereas the decimated approach requires four levels of decomposition to obtain a credible SPN. The method was tested using images from ten different cameras in the Dresden image dataset. The results show that it outperforms the state-of-the-art wavelet-based image source camera identification method at a relatively low computational cost. Al-Athamneh et al. [
17] suggested using only the green component of an RGB image for PRNU extraction, with a method similar to that used in [14]. The rationale is that human eyes are particularly sensitive to green, and the green channel of the sensor carries twice as much information as the red or blue channel. The green colour channel of the video frames was examined to create G-PRNU (green photo response non-uniformity). The technique demonstrated a good level of reliability in identifying digital video cameras and outperformed PRNU in identifying the source of digital videos. Six cameras were used to test the technique (two mobile phones and four consumer cameras). In total, 290 videos were recorded over the course of four months in a variety of settings. The 2-D correlation coefficient detection test was used to determine the source of each of the 290 test videos. The results show an average prediction accuracy of 99.15%. Akshatha et al. [
18] proposed an image source camera identification technique that uses a high-order wavelet statistics (HOWS) method to remove the camera noise from the input image and extract the camera signature. To determine the source camera of a given image, the features were fed to support vector machine classifiers, and the results were validated using ten-fold cross-validation. Images taken with different cell phone cameras were used, and the algorithm identified the source camera of a given image with 96.18% accuracy on average, irrespective of camera model or brand. Georgievska et al. [
19] proposed an image source camera identification method in which images are clustered based on the peak-to-correlation energy (PCE) similarity scores of their PRNU patterns. The image is first converted to grayscale, and an initial estimate of the PRNU pattern is obtained using the first step total variation (FSTV) algorithm. Zero-mean and Wiener filtering steps are then performed to filter out artefacts produced by colour interpolation, on-sensor signal transfer, imaging sensor design, and JPEG compression. PCE is then computed as the ratio between the height of the peak and the energy of the cross-correlation between two PRNU patterns. The technique uses graphics processing units (GPUs) to extract the PRNU patterns from large sets of images and to compute the PCE scores within a reasonable timeframe. The performance of the method was evaluated using the Dresden image dataset, and the authors reported that it is highly effective.
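As a rough illustration of the correlation-based matching described above, the following Python sketch averages noise residuals into a reference pattern, compares two patterns with the normalized cross-correlation, and computes a PCE score. It assumes that the noise residuals have already been extracted (the wavelet and Wiener denoising steps are not shown), that the patterns are aligned and larger than the excluded peak neighbourhood, and that PCE is taken as the squared zero-shift peak divided by the mean squared correlation away from the peak, which is one common formulation and not necessarily the one used in [19]. All function names are hypothetical.

```python
import math


def average_fingerprint(residuals):
    """Average equally sized noise residuals (lists of rows) pixel by pixel."""
    n = len(residuals)
    rows, cols = len(residuals[0]), len(residuals[0][0])
    return [[sum(res[i][j] for res in residuals) / n for j in range(cols)]
            for i in range(rows)]


def ncc(a, b):
    """Normalized cross-correlation of two equally sized 2-D patterns."""
    fa = [v for row in a for v in row]
    fb = [v for row in b for v in row]
    ma, mb = sum(fa) / len(fa), sum(fb) / len(fb)
    num = sum((x - ma) * (y - mb) for x, y in zip(fa, fb))
    den = math.sqrt(sum((x - ma) ** 2 for x in fa) *
                    sum((y - mb) ** 2 for y in fb))
    return num / den if den else 0.0


def pce(a, b, exclude=1):
    """Peak-to-correlation energy, assuming the peak sits at zero shift."""
    rows, cols = len(a), len(a[0])

    def corr(dy, dx):
        # Circular cross-correlation of a and b at shift (dy, dx).
        return sum(a[i][j] * b[(i + dy) % rows][(j + dx) % cols]
                   for i in range(rows) for j in range(cols))

    def near_peak(dy, dx):
        # Circular distance of the shift from the assumed peak at (0, 0).
        return min(dy, rows - dy) <= exclude and min(dx, cols - dx) <= exclude

    peak = corr(0, 0)
    rest = [corr(dy, dx) for dy in range(rows) for dx in range(cols)
            if not near_peak(dy, dx)]
    if not rest:
        raise ValueError("pattern too small for the excluded neighbourhood")
    energy = sum(v * v for v in rest) / len(rest)
    return peak * peak / energy if energy else float("inf")


# Usage: build a reference pattern from two residuals and score a query residual.
reference = average_fingerprint([[[1.0, 2.0], [3.0, 4.0]],
                                 [[3.0, 4.0], [5.0, 6.0]]])
score = ncc(reference, [[2.0, 3.0], [4.0, 5.0]])
```

In practice the score would be compared against a decision threshold chosen from FAR and FRR measurements; the brute-force correlation here is only suitable for tiny patterns, which is why large-scale implementations rely on FFTs and GPUs.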
Rodríguez-Santos et al. [20] proposed a digital camera identification technique that employs the Jensen–Shannon divergence (JSD) to statistically compare the PRNU-based fingerprint of each candidate source camera against the noise residual of the disputed image. Zhang et al. [21] proposed an iterative algorithm, tri-transfer learning (TTL), for source camera identification, which combines transfer learning with tri-training learning. The transfer learning module in TTL transfers knowledge obtained from training sets to improve identification performance. Combining the two modules allows the framework to achieve better efficiency and performance on mismatched camera model identification than other state-of-the-art techniques. Zeng et al. [
22] proposed a dual tree complex wavelet transform (DTCWT)-based approach for extracting the SPN from a given image that performs better near strong edges. Symmetric boundary extension rather than periodized boundary extension was used to improve the quality of SPN as well as the picture border. Balamurugan et al. [
23] proposed an image source camera identification technique that uses an improved locally adaptive discrete cosine transform (LADCT) filter followed by a weighted averaging method to efficiently exploit the image content carrying the PRNU. LADCT is believed to perform well on images with strong image-dependent noise such as multiplicative noise, of which PRNU is one example. The technique divides the image into fixed-size blocks of pixels, obtained by shifting a window one step at a time either horizontally or vertically. A discrete cosine transform (DCT) is applied to each block, and a threshold is applied to the resulting DCT coefficients. The blocks are then reconstructed in the spatial domain by applying the inverse DCT (IDCT) to the thresholded coefficients. The final estimate of each pixel is obtained by averaging the values reconstructed for the same spatial position across the overlapping blocks. The weighted averaging step assigns a weight to each contribution, which the authors report gives a better estimate than a simple average. The Dresden image dataset was used to evaluate the performance of the technique, and the experimental results demonstrated its effectiveness. Qian et al. [
24] introduced a source camera identification technique for web images that uses neural-network-augmented sensor pattern noise to trace web images easily while maintaining confidentiality. Their technique includes three stages: initial device fingerprint registration; fingerprint extraction and secure connection establishment during image collection; and verification of the relationship between images and their source devices. By incorporating metric learning and frequency consistency into the deep network design, the technique provides state-of-the-art performance for dependable source identification of modern smartphone images. It also offers several optimisation sub-modules to reduce fingerprint leakage while improving accuracy and efficiency. It uses two cryptographic techniques, the fuzzy extractor and zero-knowledge proof, to securely establish the correlation between registered and validated image fingerprints.
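The Jensen–Shannon divergence comparison mentioned above for [20] can be sketched as follows. The text does not say how the fingerprint and the noise residual are turned into probability distributions, so this minimal Python example assumes simple value histograms over a fixed range; the helper names, the bin count, and the range are hypothetical choices for illustration only.

```python
import math


def histogram(values, bins, lo, hi):
    """Count values into equal-width bins over [lo, hi], clamping outliers."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        counts[max(0, min(idx, bins - 1))] += 1
    return counts


def _normalise(hist):
    total = sum(hist)
    return [h / total for h in hist]


def _kl(p, q):
    # Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(hist_p, hist_q):
    """Jensen-Shannon divergence (base 2, so the result lies in [0, 1])."""
    p, q = _normalise(hist_p), _normalise(hist_q)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)


def closest_camera(residual_hist, fingerprint_hists):
    """Return the candidate camera whose fingerprint histogram has the lowest JSD."""
    return min(fingerprint_hists,
               key=lambda cam: jsd(residual_hist, fingerprint_hists[cam]))
```

A lower divergence indicates that the residual is statistically closer to that camera's fingerprint; a real system would also need a rejection threshold for images that match none of the candidates.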
Lawgaly and Khelifi [25] proposed a similar technique that uses a locally adaptive DCT (LADCT) filter for image source camera identification. Their technique applies an enhanced LADCT filter before the weighted averaging (WA) approach, as in [23], to effectively exploit the image content conveying the PRNU. Because the physical PRNU is present in all colour planes, the estimated colour PRNUs were concatenated for better matching. To produce denoised estimates of neighboring and overlapping blocks, they used a sliding block window, with each block’s threshold depending on the local block mean and the local noise variance. The system was evaluated through extensive experiments on two separate image datasets with varied image sizes, and the gain obtained with each of its components was highlighted. On images from the Dresden dataset, the results demonstrated superior performance against state-of-the-art techniques. Chen and Thing [26] adopted block matching and 3D filtering (BM3D), a collaborative filtering process. The technique groups similar blocks extracted from the image and stacks each group to form a 3D cylinder-like shape. Filtering is performed on every block group: a linear transform is applied before Wiener filtering, and the transform is then inverted to reproduce all filtered blocks before the image is transformed back to its 2D form. Their results show that PRNU-based methods can provide a certain level of capability in verifying the integrity of images. However, although increasing the number of images used for PRNU pattern estimation might enhance performance, it would also make the approach less practical.
Yaqub [27] proposed a simple, effective, and efficient scaling-based technique for image source camera identification that applies whether the questioned image is at full resolution or has been cropped from an unidentified source. The technique is based on a hierarchy of scaled camera fingerprints. Lower levels of the hierarchy, which contain scaled-down fingerprints, allow many candidate cameras to be eliminated early, which reduces computation time. Test results show that the technique, while applicable to both full-resolution and cropped query images, requires significantly less computation. In a test with 500 cameras, the run-time overhead for non-cropped images was 55 times lower than that of conventional full-resolution correlation, while for cropped images the overhead decreased by a factor of 13.35. Kulkarni and Mane [
28] proposed a hybrid method for extracting sensor noise that uses gradient-based operators and Laplacian operators to generate a third image revealing the noise and edges present in the input. A threshold is then applied to remove the edges and retain the noise present in the image.
This noisy image is then passed to the gray level co-occurrence matrix (GLCM) in the feature extraction module. Numerous features are extracted from the GLCM based on its properties: homogeneity, contrast, correlation, and entropy. To obtain an exact match, the SPN is retrieved from the GLCM features and matched against the test set. The hybrid method, which combines GLCM feature extraction with SPN extraction, improves the results. On the Dresden image dataset, the technique achieves an average accuracy of 97.59%.
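To make the GLCM step more concrete, the sketch below builds a normalized co-occurrence matrix for a single pixel offset and derives the four properties named above. It is a minimal pure-Python illustration, not the implementation of [28]: it assumes the image is already quantized to integer gray levels in the range 0 to levels - 1, uses a non-symmetric matrix for one offset, and takes homogeneity as the sum of p(i, j) / (1 + |i - j|), which is one of several definitions in use. Function names are hypothetical.

```python
import math


def glcm(image, levels, dx=1, dy=0):
    """Normalized co-occurrence matrix of gray levels at pixel offset (dy, dx)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    total = 0
    for i in range(rows):
        for j in range(cols):
            ni, nj = i + dy, j + dx
            if 0 <= ni < rows and 0 <= nj < cols:
                counts[image[i][j]][image[ni][nj]] += 1
                total += 1
    # Assumes the offset fits inside the image, so total > 0.
    return [[c / total for c in row] for row in counts]


def glcm_features(p):
    """Contrast, homogeneity, correlation, and entropy of a normalized GLCM."""
    levels = len(p)
    cells = [(i, j, p[i][j]) for i in range(levels) for j in range(levels)]
    mu_i = sum(i * v for i, _, v in cells)
    mu_j = sum(j * v for _, j, v in cells)
    sd_i = math.sqrt(sum((i - mu_i) ** 2 * v for i, _, v in cells))
    sd_j = math.sqrt(sum((j - mu_j) ** 2 * v for _, j, v in cells))
    cov = sum((i - mu_i) * (j - mu_j) * v for i, j, v in cells)
    return {
        "contrast": sum((i - j) ** 2 * v for i, j, v in cells),
        "homogeneity": sum(v / (1 + abs(i - j)) for i, j, v in cells),
        "correlation": cov / (sd_i * sd_j) if sd_i and sd_j else 0.0,
        "entropy": -sum(v * math.log2(v) for _, _, v in cells if v > 0),
    }


# Usage on a tiny two-level image with a horizontal offset of one pixel.
features = glcm_features(glcm([[0, 0, 1, 1], [0, 0, 1, 1]], levels=2))
```

The resulting feature dictionary would typically be flattened into a vector, possibly over several offsets and directions, before being passed to a matching or classification stage.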
Figure 3 shows the flow chart for source camera identification using large components of sensor pattern noise.
The effect of the choice of wavelet on the performance of the conventional wavelet-based image source camera identification technique was reported in [29]. The authors used plain images from the VISION image dataset, captured using eleven different camera brands, to generate the experimental results. They reported that the conventional wavelet-based technique achieves its highest performance when it uses the sym2 wavelet.
3.3. Colour Filter Array Interpolation
Colour filter array (CFA) interpolation, also known as demosaicing or colour reconstruction, is the method digital cameras use to reconstruct a full digital colour image from the colour samples generated by an image sensor overlaid with a CFA. Traces of this demosaicing process can be extracted from an image and used as a camera fingerprint.
To discern the correlation structure present in each color band for image classification purposes, Bayram et al. [31] investigated the CFA interpolation procedure. The underlying assumption is that each manufacturer’s interpolation algorithm and CFA filter pattern design are distinct, leading to discernible correlation structures in captured images. Using the iterative expectation maximization (EM) algorithm, two sets of features are derived for classification: the interpolation coefficients estimated from the images, and the peak locations and magnitudes within the frequency spectrum of the probability maps. The dataset uses two camera models, the Sony DSC-P51 and the Nikon E-2100, both with a resolution of two megapixels. Assessed using the confusion matrix, the classification accuracy is 95.71% for two separate cameras with a 5 × 5 interpolation kernel; however, it decreases to 83.33% when three cameras are compared. The effect of a larger number of cameras on the classification accuracy was not investigated. The technique has not been tested with cameras of an identical model, but failure could be anticipated because such cameras often use the same CFA filter pattern and interpolation algorithm. In addition, this technique may not perform well on compressed images.
Figure 5 shows the Bayram RGB interpolation values.
Lia and Lin [32] introduced an algorithm that uses image interpolation to determine image characteristic values, combined with a support vector machine (SVM), to lower the required processing power and attain a high true positive rate. The algorithm considers several colour interpolation methods, including bilinear interpolation, adaptive colour plane interpolation, effective colour interpolation, and highly effective iterative demosaicking. Cameras of various brands and models were used for classification in the study. The results showed that the method had a good identification rate, with a recognition rate of up to 90%, but only when a wave filter was additionally introduced. Chen and Stamm [33] proposed a camera brand identification technique. Their method first re-samples the colour components of the input image according to a predetermined CFA pattern and applies M different baseline demosaicing algorithms to re-interpolate the missing colour components. It then subtracts each re-demosaiced image from the input image, generating M demosaicing residual images. Co-occurrence matrices are computed from the residual images using K different geometric patterns. It then uses a multi-class ensemble classification method to extract the camera brand signature. They used the relative error reduction (RER) criterion to measure the performance of their technique and reported an accuracy of 98% for camera model identification using images from the Dresden image dataset.
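The re-demosaicing residual idea described above for [33] can be illustrated with a single baseline algorithm. The sketch below is a simplified assumption-laden example, not the authors' method: it handles only the green plane, assumes a Bayer-like layout in which green samples sit where (row + column) has a given parity, and uses plain bilinear interpolation as the one baseline demosaicing algorithm. Function names are hypothetical, and the co-occurrence and ensemble classification stages are not shown.

```python
def redemosaic_green(green, green_parity=0):
    """Re-interpolate the green plane bilinearly from an assumed Bayer layout.

    Pixels with (row + col) % 2 == green_parity are treated as true sensor
    samples; every other pixel is replaced by the mean of its available
    4-neighbours, all of which are sensor samples under this layout.
    Assumes the image has at least two pixels.
    """
    rows, cols = len(green), len(green[0])
    out = [list(row) for row in green]
    for r in range(rows):
        for c in range(cols):
            if (r + c) % 2 == green_parity:
                continue
            neighbours = [green[nr][nc]
                          for nr, nc in ((r - 1, c), (r + 1, c),
                                         (r, c - 1), (r, c + 1))
                          if 0 <= nr < rows and 0 <= nc < cols]
            out[r][c] = sum(neighbours) / len(neighbours)
    return out


def demosaic_residual(green, green_parity=0):
    """Input green plane minus its bilinear re-demosaiced estimate."""
    estimate = redemosaic_green(green, green_parity)
    return [[g - e for g, e in zip(g_row, e_row)]
            for g_row, e_row in zip(green, estimate)]


# Usage: a linear ramp is reproduced exactly by bilinear interpolation in the
# interior, so its interior residual is zero there.
ramp = [[r + c for c in range(4)] for r in range(4)]
residual = demosaic_residual(ramp)
```

The intuition is that the residual is small when the camera's own interpolation resembles the baseline and structured otherwise, so statistics of the residual carry information about the demosaicing algorithm.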
3.4. Machine Learning
Machine learning is being steadily incorporated into image forensics as artificial intelligence evolves and more image datasets become available. Moreover, machine learning can learn the most appropriate features from a range of training datasets, overcoming the drawbacks of hand-crafted features. Ahmed et al. [34] introduced a deep convolutional neural network for source camera identification comprising a max pooling layer, three convolutional layers with batch normalization, a rectified linear unit as the activation function, one fully connected layer, a dropout layer, and a classification layer. The network is trained with significantly fewer training images to determine the source of an image, while the noise pattern of the images is determined using the algorithm reported in [14]. False positive rate (FPR) and false negative rate (FNR) are computed to assess the performance of both approaches on images taken by eleven different cameras, using the same set of training and test images with dimensions 128 × 128 and 256 × 256 for both approaches. Their research demonstrates that the PRNU-based technique is more effective than the convolutional neural network (CNN)-based approach. Marra et al. in [
35] and Freire-Obregón et al. in [
36] used CNN which is a subset of machine learning and is composed of multiple layers, with each layer containing a set of high-pass filters applied across the input image, for camera source identification. The convolutional process is used for the automatic extraction of features from the data and the subsequent learning from these extracted features. Their result shows an accuracy level of 98.1% for camera model identification and 91.1% accuracy for individual camera identification. Kirchner and Johnson [
37] proposed a technique that uses CNN to train and estimate the camera signature and then compute the noise residual from the test images and uses the maximum likelihood fingerprint estimator (MLE) to estimate the fingerprint of the test images. Using VISION image dataset and Dresden image dataset, the study proved that using a deep learning technique can result in a more appropriate extractor, which leads to better source attribution as it achieves the best results using a certain set of criteria for each potential camera fingerprint.
Ding et al. [38] reported an algorithm that has one pre-processing module, one feature extractor, and one hierarchical multi-task learning method. The pre-processing module brings domain information into the deep learning method for camera identification. By sharing knowledge across all the tasks, the hierarchical multi-task learning approach contributes more supervised information to the classification problem. The ResNet can combine low-, mid-, and high-level features and reuse earlier features through shortcut connections. The input image is first passed to the pre-processing module to generate intermediate features; these features are then fed into the deep network, which is used to identify cameras. The convolutional layer is denoted as “Conv.” Resnet3_1, Resnet4_1, and Resnet5_1 perform down-sampling with a stride of 2. The functions classify 1, classify 2, and classify 3 are used to distinguish between camera brands, models, and devices. Using original and altered images, the framework was assessed for brand-, model-, and device-level identification, and the results show that the technique is robust and reliable. The findings showed a significant improvement in the accuracy of mobile phone device identification, to an average of 84.3%, better than for consumer-level camera device identification. The experiments were conducted on the Dresden dataset and a cell phone dataset, and the results were analysed using t-SNE.
Liu et al. [
39] reported a proficient source camera identification method based on convolutional neural networks. Their method has three essential components: patch selection based on multiple criteria, fine-grained multiscale residual prediction, and a modified visual geometry group (VGG) identification method. Authors argued that the conventional source camera identification’s performance is influenced by image content and falls short of meeting the demands of real-world applications, particularly for small image patches. The proposal advocated the division of all training and test images into 64 × 64 nonoverlapping patches with underlying distributions representative of all training and testing patches. These representative patches obtained by the patch selection module are used as training data to supervise the learning of subsequent residual prediction and classification throughout the training phase. All patches in test images are recognized for final performance evaluation after the parameters have been trained. This boosts robustness while lowering training costs, and representative patches are chosen based on a variety of parameters to increase training data diversity. At the brand, model, and instance levels, a modified VGG network was presented for source camera detection. According to them, this approach performed admirably in terms of both identification accuracy and computational efficiency. Using the Dresden dataset and the classification confusion matrix visualization of the 18 camera models, the classification accuracy of most camera models is higher than 97%. Huang et al. [
40] reported a convolutional neural network-based technique for determining the source camera of digital images. The technique depends on constructing a new network that includes an input layer, three convolutional layers with max pooling and normalization, two fully connected layers, and the Softmax classifier. To reduce the size of sample images from the targeted camera that the network needs to use as training data, the original images are cropped into small-sized patches that the network is meant to assess. To identify the source camera, a local-to-global technique is also implemented that respects the principle of majority voting among the image patches. Using images from the Dresden dataset and confusion matrix as assessment criteria the technique reached an accuracy of up to 99.8%, according to test results. Timmerman [
41] proposed an improved restricted convolutional layer that can handle colour inputs with three colour channels. Colour inputs require three kernels, as opposed to one kernel for grayscale inputs. While [40] used images, ref. [41] used video frames to identify the source camera using SPN. The method classifies individual video frames, and the frame-level predictions are then combined by a majority vote to identify the originating camera. The benchmark VISION dataset, which contains 1539 videos from 28 distinct cameras, was used to evaluate the technique. The tests revealed that the method was robust to the compression applied by YouTube and WhatsApp and still reached up to 93.1% accuracy. Bondi et al. in [
42] and Kang et al. in [
43] proposed different CNN-based techniques for source camera identification, which achieve a high level of accuracy. To protect source camera identification against malicious adversarial attacks, Hui et al. in [44] proposed a defense mechanism. This involves analyzing the image acquisition process, refining the source camera identification problem and its adversarial attacks, modeling feature extraction, and deriving a defense objective based on information monotonicity to suppress adversarial noise amplification during mapping. Additionally, local smooth mapping was used to reduce mapping oscillation. To address the training cost and migration challenges of existing solutions such as adversarial training, they implemented a two-phase migratable pre-defense network. Their results showed that the source camera identification model combined with the pre-defense network maintained high identification accuracy and improved adversarial robustness compared to other defense models.
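The local-to-global strategy used by the patch-based and frame-based methods above ([40], [41]) can be sketched as follows: the image is cut into non-overlapping patches, each patch is classified on its own, and the final decision is the most frequent patch label. In this minimal Python sketch the patch classifier is passed in as a function, standing in for a trained CNN; the function names are hypothetical, and the image is assumed to be at least one patch in size.

```python
from collections import Counter


def split_into_patches(image, size):
    """Cut a 2-D image (list of rows) into non-overlapping size x size patches."""
    rows, cols = len(image), len(image[0])
    return [[row[c:c + size] for row in image[r:r + size]]
            for r in range(0, rows - size + 1, size)
            for c in range(0, cols - size + 1, size)]


def identify_by_majority(image, size, classify_patch):
    """Classify every patch, then return the winning label and its vote share."""
    votes = Counter(classify_patch(patch)
                    for patch in split_into_patches(image, size))
    label, count = votes.most_common(1)[0]
    return label, count / sum(votes.values())
```

The vote share gives a rough confidence indication: a decision backed by 95% of the patches is more trustworthy than one backed by 40%. The same function applies to video if the list of patches is replaced by a list of frames.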
In [45], a unified architectural representation of source camera identification powered by a deep neural network was introduced. The proposed method extracts the residual noise from each input image by first denoising the image using a U-Net and then subtracting the denoised image from the input image. The resulting noise residuals are then encoded into an embedding using a feature modulator, trained with a triplet loss function to minimize the distance between images from the same camera and maximize the distance between images from different cameras. Finally, a one-shot method is used to determine the source camera of the image. The authors reported F-scores of 97.59% and 97.01%. In [46], an adaptive dual-branch fusion residual network based on the SE-BRB module was presented to improve network performance for image source camera identification. The authors claimed that the network is relatively simple in terms of complexity and can be used for small size source identification. They reported an accuracy of 99.33% on images from the Dresden dataset.
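The triplet loss and one-shot matching mentioned above for [45] can be illustrated on plain embedding vectors. The sketch below assumes the common hinge form with Euclidean distances and a margin of 1.0; the exact distance, margin, and matching rule used in [45] are not given here, so these are illustrative assumptions, as are the function names.

```python
import math


def distance(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the same-camera pair is closer than the other pair by the margin."""
    return max(0.0, distance(anchor, positive)
               - distance(anchor, negative) + margin)


def nearest_camera(embedding, references):
    """One-shot matching: pick the camera whose single reference is closest."""
    return min(references,
               key=lambda name: distance(embedding, references[name]))
```

Here `anchor` and `positive` stand for embeddings of two images from the same camera and `negative` for an image from a different camera. During training the loss would be averaged over many such triplets; at test time a single reference embedding per camera is enough for `nearest_camera`.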
3.6. Image Features-Based Techniques
An image feature is a piece of information about an image’s content that is used in computer vision and image processing, often extracted by applying data mining techniques. It often indicates whether a certain region of the image has particular characteristics. Features in an image can be specific elements such as points, edges, or objects. To identify the source camera of an image, such features are extracted and used to train a classifier. Tsai and Wu [49] proposed an image source camera identification technique that uses a total of 33 features. These features are grouped into three categories: colour features, image quality features, and wavelet domain features. The colour features include average pixel value, RGB pairs correlation, neighbor distribution center of mass, and RGB pairs energy ratio. The image quality features include mean square error, mean absolute error, and Minkowski difference (pixel difference-based); structural content, normalized cross-correlation, and Czekanowski correlation (correlation-based); and spectral magnitude error, spectral phase error, spectral phase-magnitude error, block spectral magnitude error, block spectral phase error, and block spectral phase-magnitude error (spectral-based). For the wavelet domain features, the mean of each of the three high-frequency sub-band coefficients of each image color band is calculated. LibSVM was the classifier used in this paper to categorize incoming data and assess the accuracy rate. The authors reported that the feature-based approach significantly outperforms other camera brand identification methods.
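Three of the colour features listed above can be sketched in a few lines. The example assumes each channel is given as a flat list of pixel values with non-zero energy, takes "RGB pairs correlation" as the Pearson correlation between channel pairs, and takes "RGB pairs energy ratio" as the ratio of summed squared values between channels (G/B, G/R, B/R); the precise definitions used in [49] may differ, and the function names are hypothetical.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    mx, my = _mean(x), _mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0


def energy(values):
    """Sum of squared pixel values of one channel."""
    return sum(v * v for v in values)


def colour_features(r, g, b):
    """Average pixel value, pairwise correlation, and pairwise energy ratio."""
    return {
        "mean": (_mean(r), _mean(g), _mean(b)),
        "correlation": (pearson(r, g), pearson(r, b), pearson(g, b)),
        "energy_ratio": (energy(g) / energy(b),
                         energy(g) / energy(r),
                         energy(b) / energy(r)),
    }


# Usage on a tiny four-pixel image.
feats = colour_features([1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1])
```

These nine numbers would form only part of the 33-dimensional feature vector; the image quality and wavelet features would be appended before training the SVM.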
Xu et al. [50] proposed an algorithm for source camera identification that uses image texture features extracted from a carefully chosen colour model and colour channels. With this technique, local binary pattern (LBP) and local phase quantization (LPQ) features are extracted from both the original images and the residual noise images. In the HSV colour space, the H and V colour channels are used to extract the LBP and LPQ features. The combined LBP and LPQ features are then fed into a multi-class LibSVM classifier for source camera prediction. According to the reported results, the technique has satisfactory detection accuracy and robustness in three tasks: distinguishing camera brands and models, distinguishing camera models of the same brand, and distinguishing individual cameras of the same model and brand. On the Dresden image dataset, the results show that the performance of this technique is satisfactory when compared with state-of-the-art techniques. A summary of existing image source camera identification methods, the datasets and assessment criteria they used, and the accuracy achieved is tabulated in Table 1.
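For readers unfamiliar with the LBP texture descriptor used in [15] and [50], the following sketch computes the basic 8-neighbour LBP code of each interior pixel and collects the codes into a normalized 256-bin histogram. It is a minimal illustration of the standard operator only; the neighbour ordering is an arbitrary but fixed choice, the variant and parameters used by the cited works are not specified here, and the function names are hypothetical.

```python
# Clockwise 8-neighbourhood starting at the top-left pixel.
_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image, r, c):
    """8-bit LBP code of the interior pixel (r, c)."""
    centre = image[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(_OFFSETS):
        # A neighbour at least as bright as the centre sets its bit.
        if image[r + dr][c + dc] >= centre:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """Normalized 256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            hist[lbp_code(image, r, c)] += 1
    total = sum(hist) or 1  # avoid division by zero for images without interior
    return [h / total for h in hist]
```

The histogram, computed per colour channel and for both the image and its noise residual, is the kind of feature vector that would be passed to the classifier.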
An overview of some publicly available image datasets for camera identification is presented in Table 2. The VISION image dataset was captured using 35 portable devices of 11 major brands and contains 34,427 images and 1914 videos, both in native and social media formats. It comprises plain and textured images, all in JPEG format [11]. The Dresden image dataset contains over 14,000 images of various indoor and outdoor scenes, captured using 73 digital cameras of 25 different models. All images are in JPEG format [12]. The high dynamic range image dataset contains more than 5000 images captured using 23 different mobile devices of 7 major brands. The images are in JPEG format and comprise plain and textured images [13]. The Forchheim image database consists of more than 23,000 images of 143 scenes taken by 27 smartphone cameras. Each image is provided in six different qualities: the original camera-native version and five copies from social networks. All the images are in JPG format [51].