A Deep Learning Approach in the DCT Domain to Detect the Source of HDR Images

Abstract: Although high dynamic range (HDR) is now a common format of digital images, limited work has been done on HDR source forensics. This paper presents a method based on a convolutional neural network (CNN) to detect the source of HDR images, which is built in the discrete cosine transform (DCT) domain. Specifically, the input spatial image is converted into the DCT domain with the discrete cosine transform. Then, an adaptive multi-scale convolutional (AMSC) layer extracts features related to HDR source forensics from different scales. The features extracted by AMSC are further processed by two convolutional layers with pooling and batch normalization operations. Finally, classification is conducted by a fully connected layer with a Softmax function. Experimental results indicate that the proposed DCT-CNN outperforms the state-of-the-art schemes, especially in accuracy, robustness, and adaptability.


Introduction
Owing to their limited bit depth, conventional 8-bit digital images cannot accurately reflect the current state of the environment, resulting in a loss of visual information in regions with imprecise exposures [1,2]. To reflect more realistic information, the high dynamic range format stores accurate information by using higher bit depth and floating-point formats [3]. As a consequence, the dynamic range of HDR images can reach $10^4$ to $10^9$, that is, 4 to 9 orders of magnitude, which far exceeds the dynamic range of low dynamic range (LDR) images [4,5].
With the development of display techniques, some display devices are now able to display HDR content [6][7][8]. Meanwhile, HDR images can be easily obtained with the advancement of mobile devices and imaging techniques. Since native HDR sensors have not been widely used, HDR images are mainly obtained from LDR images. There are two common types of HDR images according to their source: (1) HDR images synthesized from multiple LDR images of the same scene with different exposures, which are mainly obtained directly through fusion algorithms when shooting images. This type of HDR image is denoted by mHDR [9,10]. (2) HDR images generated by using inverse tone mapping (iTM) to expand the dynamic range of a single LDR image, which are used to replace the existing LDR images [11]. This type of HDR image is denoted by iHDR [12][13][14]. There is evidence that the mHDR image is indistinguishable from the iHDR image [15][16][17]. Hence, source forensics of HDR images has become a new problem in the field of image forensics: distinguishing mHDR images synthesized from multiple exposures from iHDR images generated by iTM from a single LDR image.
Image forensics methods extract features based on numerical values to identify the source of an image or to determine whether it has been tampered with. Identifying the source of an image is an important issue in the field of image forensics, and this paper addresses the problem of HDR image source forensics. More specifically, HDR images are mainly divided into mHDR and iHDR according to their source, and the proposed method is designed to distinguish mHDR images from iHDR images. From the perspective of multimedia security, identifying the source of HDR images can assist in validating the authenticity of image content.
Currently, little research focuses on forensic problems in the HDR domain. All existing HDR source forensic methods are conducted in the spatial domain. According to the way of extracting features, these methods can be divided into two strategies: (1) Manually specified methods extract hand-crafted features and use a support vector machine (SVM) to complete classification [18][19][20][21]. (2) Convolutional neural network (CNN)-based methods use a CNN to automatically extract features related to forensics and determine the type of the input HDR images in an end-to-end way [22]. In this article, a CNN for HDR source forensics is built in the frequency domain, taking advantage of the frequency domain in HDR forensics feature representation. To the best of our knowledge, this is the first time HDR image source forensics has been conducted in the frequency domain.
The main contributions of this paper are as follows. First, to use the decorrelation characteristic of DCT so that the CNN focuses on the features associated with forensics rather than the content of the image, we design a multi-channel DCT (MC-DCT) module to convert the HDR image in the spatial domain into a DCT coefficient matrix. Second, we construct a multi-scale convolutional layer with different kernel sizes to extract features from different scales, which improves the ability of the CNN to extract forensics-related features. Last, the multi-scale features are weighted by a channel attention mechanism, which allows the CNN to focus on the channels that are more relevant to forensics. Extensive experiments show that the performance of the proposed method is significantly improved compared with existing methods.
The remainder of this paper is organized as follows: Section 2 summarizes relevant research on the HDR images source forensics. Section 3 illustrates the architecture of the proposed DCT-CNN in detail. Section 4 describes the details of the datasets used in the experiments and analyses experimental results on different datasets. Section 5 gives the conclusion.

Related Works
Only a little literature exists on image forensics in HDR contents. This is because the HDR format is relatively new in the fields of multimedia and signal processing, and the scarcity of HDR image datasets also limits the development of forensics on HDR contents.
In the first work on forensic problems related to HDR content, Bateman et al. proposed a scheme to extract suitable features for distinguishing tone-mapped HDR images and LDR images using SVM [18]. This work raised a new problem in the field of image forensics: distinguishing the LDR images obtained from tone-mapped HDR images from the original LDR images. This forensics problem still focused on LDR content, and the proposed scheme was conducted on LDR content.
Furthermore, Wei et al. proposed a new forensics problem: distinguishing the mHDR image synthesized from multiple LDR images with different exposures from the iHDR image obtained via inverse tone mapping of a single LDR image [19]. This new problem was related to HDR content and was named the problem of HDR source forensics. This work proposed a powerful HDR forensics feature that distinguished mHDR images from iHDR images by using local high-order statistics (LHS) based on Fisher scores calculated under the Gaussian mixture model. However, manually specified methods cannot fully extract the features related to HDR source forensics. The drawback is that these features need to be manually designed, which limits the ability of forensics methods to extract features associated with forensics.
With the development of deep learning, more CNN-based methods were applied to image forensics. To overcome the drawback of manually specified methods, Huo et al. used a convolutional neural network (CNN) to achieve source forensics of HDR images [22]. In this method, an end-to-end scheme named HDR-CNN was proposed, which validated the feasibility of CNNs for HDR source forensics. The experimental results showed that by using convolutional neural networks to extract features automatically, the accuracy of HDR source forensics is much better than that of conventional manually specified methods. However, HDR-CNN, which is built in the spatial domain, tends to extract features related to the content rather than information about forensics, which limits the performance of this method.
In addition to conducting forensics in the spatial domain, some forensics methods were built in the frequency domain to avoid the interference of image content. To use the decorrelation characteristic of DCT, Zhang et al. proposed a CNN-based method of median filtering forensics in the discrete cosine transform domain by converting the images in the spatial domain into data in the frequency domain through DCT [23]. The drawbacks of this method are that some low-frequency and high-frequency DCT coefficients are discarded and that the DCT coefficients are given a fixed weight, which limits its performance. Singhal et al. proposed a CNN-based method for detecting manipulation by converting image residuals into the DCT domain [24]. The drawbacks of this method are that the DCT is conducted on the Median Filter Residual (MFR) and that there is no multi-scale module to extract features from different scales. Inspired by these works, we develop an HDR source forensics method based on CNN in the DCT domain to avoid the drawbacks of manually specified methods and the interference of image content in the spatial domain. To avoid the drawbacks of the other DCT-based CNNs mentioned above, we introduce a multi-channel discrete cosine transform (MC-DCT) module to keep all the DCT coefficients and the AMSC module to extract multi-scale features.

Deep Learning Architecture
Convolutional neural networks can update weights to extract more specific features in the training process. Therefore, the method proposed in this paper is based on CNN to extract features in the DCT domain. For brevity, DCT-CNN is used as the abbreviation for the proposed method. Figure 1 shows the basic process of the proposed CNN for identifying the source of HDR images. In the spatial domain, CNNs tend to extract features related to image content, which will interfere with the accuracy of HDR source forensics. The discrete cosine transform has the characteristics of decorrelation, which can make the data structure lose the spatial pixel dependence and reduce the influence of the image content on the accuracy of forensics. Therefore, the digital image in spatial domain needs to be transformed into frequency domain with DCT. In the proposed scheme, multi-channel discrete cosine transform is implemented on every channel of the HDR image to obtain multi-channel DCT coefficients, which are used as input to the network instead of using the pixel values of the image.

Overview of the Proposed CNN Model
First, the input HDR image is converted to DCT coefficients by the multi-channel discrete cosine transform block, and then a convolutional layer named Conv1 extracts features from the DCT coefficient matrix. The extracted features are processed with Batch Normalization (BN) [25] and ReLU and then used as the input of the adaptive multi-scale convolution module. This part is represented by Frequency Domain Feature Extraction in Figure 1. The adaptive multi-scale feature extraction process is represented by Adaptive Multi-Scale Feature Extraction in Figure 1. The multi-scale features extracted by the AMSC module are processed with BN, ReLU, and max pooling. Then, a two-layer convolutional stream with max pooling and activation function is used for high-level feature extraction, represented by Hierarchical Feature Extraction in Figure 1. To make the network adaptable to inputs of different sizes, average pooling is used to downsample the feature map to a fixed size. Finally, a fully connected layer with a Softmax activation function is used to implement the classification. Table 1 outlines the proposed DCT-CNN. Multi-channel discrete cosine transform and adaptive multi-scale feature extraction will be discussed in detail in Sections 3.2 and 3.3.

Multi-Channel Discrete Cosine Transform
The expansion of the dynamic range is mainly carried out on the luminance value of the image. The common operation of the existing HDR source forensics methods is to fuse the red channel (R), the green channel (G) and the blue channel (B) of the HDR image according to Equation (1) to obtain the luminance value of the whole image. Traces related to HDR source forensics are then extracted based on the distribution of luminance (L).
This approach reduces the dimensionality of input data at the cost of losing part of the information related to HDR source forensics. To improve the accuracy of HDR source forensics, all image information must be fully used. In the proposed method, to preserve the information in each color channel while converting the input HDR image into the DCT domain, a multi-channel discrete cosine transform as shown in Figure 2 is used. More specifically, the input multi-channel HDR image is split into three channels, denoted by the Red channel, the Green channel and the Blue channel. DCT is then performed on each color channel separately to obtain three individual DCT coefficient matrices. Finally, the three DCT coefficient matrices are concatenated into a 3-channel DCT coefficient matrix. It should be emphasized that the output DCT coefficient matrix has the same size as the input HDR image. The method proposed in this paper therefore differs from the other two DCT-based methods. In Reference [23], the DCT coefficient matrix is multiplied by a weight matrix with values increasing from the upper left corner to the lower right corner. At the same time, some low-frequency DCT coefficients and high-frequency DCT coefficients in the DCT coefficient matrix are discarded. In Reference [24], the DCT is performed on the median filter residual of an image, which discards image information before the DCT operation. The method proposed in this paper uses MC-DCT to retain information in multiple color channels without discarding any DCT coefficients. Therefore, the proposed method is expected, in principle, to perform better than the other two DCT-based methods.
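The per-channel transform described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `dct_matrix` and `mc_dct` are hypothetical, and the sketch assumes an orthonormal type-II DCT applied to the whole image block rather than to smaller sub-blocks, which the text does not specify.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis matrix of size n x n (assumed normalization)."""
    k = np.arange(n).reshape(-1, 1)   # frequency index (rows)
    i = np.arange(n).reshape(1, -1)   # sample index (columns)
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    mat[0, :] = np.sqrt(1.0 / n)      # the DC row uses a different scale factor
    return mat

def mc_dct(image: np.ndarray) -> np.ndarray:
    """Apply a 2D DCT to each colour channel of an H x W x C image."""
    h, w, c = image.shape
    d_h, d_w = dct_matrix(h), dct_matrix(w)
    # Transform rows and columns of every channel independently.
    channels = [d_h @ image[:, :, ch] @ d_w.T for ch in range(c)]
    # Concatenate the per-channel coefficients; the shape matches the input.
    return np.stack(channels, axis=-1)

# Usage: a random 32 x 32 RGB block stands in for an HDR image block.
block = np.random.default_rng(0).random((32, 32, 3))
coeffs = mc_dct(block)
```

Because the basis is orthonormal, no coefficient is discarded or reweighted and the total energy of each channel is preserved, which matches the stated goal of keeping all image information.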

Adaptive Multi-Scale Feature Extraction
To represent the forensic features more efficiently, we develop the adaptive multi-scale block, where the convolution operations with n convolution kernels of different sizes are carried out on the input in a parallel manner. Then, multiple scale features are weighted by a channel attention mechanism.
The adaptive multi-scale feature extraction module is shown in Figure 3. Multiple scale features are extracted using a multi-scale convolutional layer, and these features are weighted by a channel attention mechanism. Convolutional layers with different kernel sizes enable CNN to extract features related to HDR source forensics from diverse scales. In this work, channel attention mechanism is used to assign weights to the features extracted by the multi-scale convolutional layer. This adaptive multi-scale feature extraction block can emphasize features that are positive for forensics and suppress irrelevant features by applying channel-wise weights to every channel of multi-scale feature.
Specifically, four convolutional layers with different kernel sizes are applied to the input to obtain four sets of features that correspond to different scales. Each set of features has 32 channels. Then, a multi-scale feature matrix with 128 channels is derived by concatenating the four 32-channel feature matrices. To extract the most relevant features for HDR source forensics, a channel attention mechanism is used to perform channel-wise weighting operations on the 128-channel multi-scale feature matrix. In this work, the channel attention mechanism is implemented using the Efficient Channel Attention (ECA) module [26]. In the ECA module, global average pooling (GAP) is conducted on the features to obtain aggregated features with a size of 1 × 1 × c, where c denotes the number of channels. Then, a 1D convolution is used to extract the relationship between channels, followed by a Sigmoid activation to generate the weights of different channels. This channel attention mechanism can be formulated as:
$$w = \sigma\left(\mathrm{C1D}_k\left(\mathrm{GAP}(X)\right)\right)$$
where w refers to the weights of channels, σ is a Sigmoid function, C1D indicates 1D convolution, k is the kernel size of the convolution, and X denotes the input features of the ECA module. The obtained weights and the input features are multiplied channel-wise to obtain the weighted multi-scale features. Hence, the subsequent convolutional layers can focus on the channels that are conducive to improving the performance of forensics. The multi-scale structure proposed in [23] involves three different convolution kernel sizes and uses the maxout activation function. During this process, some features related to forensics will be lost. In Reference [24], no multi-scale feature extraction structure is proposed.
Compared with the above DCT-based CNNs, the network proposed in this paper uses an adaptive multi-scale module to extract features without losing features and the features are weighted by a channel attention mechanism to enhance the performance of forensics.
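The channel weighting step can be illustrated with a small NumPy sketch. It is only a sketch under stated assumptions: the names `eca_weights`, `apply_eca` and `amsc_weighting` are hypothetical, the 1D kernel is passed in explicitly (in a trained network it would be learned), zero padding is assumed so that every channel receives a weight, and the four branch outputs are taken as given because the kernel sizes of the multi-scale layer are not stated in the text.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def eca_weights(features: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Channel weights for a C x H x W feature map and a 1D kernel of odd length k."""
    pooled = features.mean(axis=(1, 2))        # global average pooling, length C
    k = len(kernel)
    padded = np.pad(pooled, k // 2)            # zero padding keeps the output length at C
    # 1D convolution across neighbouring channels (cross-correlation form).
    mixed = np.array([padded[i:i + k] @ kernel for i in range(len(pooled))])
    return sigmoid(mixed)

def apply_eca(features: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Multiply every channel by its attention weight."""
    return features * eca_weights(features, kernel)[:, None, None]

def amsc_weighting(branches: list, kernel: np.ndarray) -> np.ndarray:
    """Concatenate branch outputs along the channel axis, then weight them."""
    multi_scale = np.concatenate(branches, axis=0)   # e.g. 4 x 32 = 128 channels
    return apply_eca(multi_scale, kernel)

# Usage: four placeholder 32-channel feature maps stand in for the branch outputs.
rng = np.random.default_rng(0)
branches = [rng.random((32, 16, 16)) for _ in range(4)]
weighted = amsc_weighting(branches, kernel=np.array([0.2, 0.6, 0.2]))
```

Since every weight lies between 0 and 1, channels are scaled rather than dropped, which is consistent with the claim that no features are lost before the subsequent convolutional layers.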

Experimental Results
To evaluate the performance of the proposed DCT-CNN on HDR source forensics, we created several datasets with different types of HDR images and different sizes of image blocks. The performance is assessed by classification accuracy (Acc), receiver operating characteristic (ROC) curve and the area under the curve (AUC), and compared with six state-of-the-art forensics methods in [19][20][21][22][24][27]. The classification accuracy (Acc) is defined as:
$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP denotes true positive, which is an outcome when the model correctly predicts the positive class, TN denotes true negative, which is an outcome when the model correctly predicts the negative class, FP denotes false positive, which is an outcome when the model incorrectly predicts the positive class, and FN denotes false negative, which is an outcome when the model incorrectly predicts the negative class. ROC is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. AUC is defined as the area under the ROC curve enclosed by the coordinate axis.
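As an illustration of these metrics, the following NumPy sketch computes the accuracy from the four confusion counts and the AUC by sweeping the decision threshold over the sorted scores. The function names are hypothetical, and two assumptions are made that the text does not state: label 1 is treated as the positive class, and the scores contain no ties.

```python
import numpy as np

def accuracy(y_true, y_pred) -> float:
    """Fraction of correct predictions, computed from TP, TN, FP and FN."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return float((tp + tn) / (tp + tn + fp + fn))

def roc_auc(y_true, scores) -> float:
    """Area under the ROC curve via the trapezoidal rule (assumes untied scores)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")   # lower the threshold step by step
    sorted_labels = y_true[order]
    tps = np.cumsum(sorted_labels == 1)          # true positives at each threshold
    fps = np.cumsum(sorted_labels == 0)          # false positives at each threshold
    tpr = np.concatenate(([0.0], tps / tps[-1]))
    fpr = np.concatenate(([0.0], fps / fps[-1]))
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Usage with toy labels and scores (not data from the experiments).
labels = [0, 0, 1, 1]
print(accuracy(labels, [0, 1, 1, 1]))        # one false positive out of four samples
print(roc_auc(labels, [0.1, 0.4, 0.35, 0.8]))
```

A curve that passes closer to the point (0, 1) encloses a larger area, so a higher AUC corresponds to a better trade-off between true and false positive rates.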

Training and Testing Datasets
To obtain mHDR images, we chose the following mHDR databases:
• HDRSID dataset includes 232 mHDR images [28].
We chose the datasets mentioned above to produce the mHDR image blocks used in the experiments. All HDR images in these datasets were produced using multi-exposure capturing technique. The mHDR images are denoted by 'M'.
The generation of an iHDR image only requires a single LDR image. In this experiment, we chose the MIT-Adobe FiveK dataset [32] as the source of the LDR images. The MIT-Adobe FiveK dataset includes 5000 high-resolution images of different scenes, which can cover a broad range of scenes, subjects, and lighting conditions. In this paper, we select four inverse tone mapping algorithms for generating iHDR images:
1. Akyüz et al.'s method [33], denoted by 'A'. In this method, the input luminance value is first normalized and non-linearly scaled, and then linearly scaled to extend the low dynamic range to the desired high dynamic range.
2. Huo et al.'s method [34], denoted by 'H'. Huo et al. presented a physiological inverse tone mapping algorithm inspired by the property of the Human Visual System (HVS), which could implement the expansion of the dynamic range only in the specific area of the input LDR image. This method can efficiently generate iHDR images with high visual quality.
3. Kovaleski et al.'s method [35], denoted by 'K'. In this work, an inverse tone mapping algorithm based on cross-bilateral filtering was proposed. This method can generate high quality HDR images and videos suitable for a wide range of exposures by using the expand map in specific areas of the image to linearly expand the input LDR content to the desired high dynamic range.
4. Kuo et al.'s method [36], denoted by 'U'. This work proposed an inverse tone mapping method based on histogram. The method includes a content-adaptive inverse tone mapping operator, which has different responses to different scenarios. This algorithm could adaptively select environmental parameters through classification of scenarios to enhance the image in over-exposed areas as well as in remaining well-exposed areas.
We used all 5000 high-quality LDR images from the MIT-Adobe FiveK dataset to generate 5000 iHDR images with each of the above four inverse tone mapping algorithms. As a result, an mHDR dataset including 386 mHDR images and an iHDR dataset including 20,000 iHDR images were obtained. These mHDR and iHDR images constitute the basic experimental datasets. Figure 4 shows the difference between an mHDR image and iHDR images generated by different iTM methods. Finally, by cropping both types of HDR images into blocks of different sizes, specific datasets for evaluating the performance of forensic methods were generated. Specifically, the block size is set to 32, 64, and 128 to verify the performance of forensics under different image sizes. The experiments were conducted on 12 datasets. Each dataset includes 30,000 mHDR image blocks and 30,000 iHDR image blocks. Details of the datasets are shown in Table 2. For each dataset, 25,000 mHDR image blocks and 25,000 iHDR image blocks were randomly selected to form a training set, with the remaining 5000 mHDR image blocks and 5000 iHDR image blocks forming a testing set. After this operation, 12 training datasets and 12 test datasets for subsequent experiments are obtained. These datasets are subsets of the datasets shown in Table 2.
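The block generation and random split can be sketched as follows. This is an illustrative assumption rather than the procedure used in the experiments: the text does not say whether blocks overlap or how they were sampled, so the hypothetical helper `crop_blocks` tiles the image with non-overlapping blocks and `split_train_test` uses a seeded random permutation.

```python
import numpy as np

def crop_blocks(image: np.ndarray, block_size: int) -> list:
    """Tile an H x W x C image into non-overlapping square blocks (assumed scheme)."""
    h, w = image.shape[:2]
    blocks = []
    for top in range(0, h - block_size + 1, block_size):
        for left in range(0, w - block_size + 1, block_size):
            blocks.append(image[top:top + block_size, left:left + block_size])
    return blocks

def split_train_test(blocks: list, n_train: int, seed: int = 0):
    """Randomly pick n_train blocks for training; the rest form the test set."""
    order = np.random.default_rng(seed).permutation(len(blocks))
    train = [blocks[i] for i in order[:n_train]]
    test = [blocks[i] for i in order[n_train:]]
    return train, test

# Usage: a placeholder 128 x 96 image yields 4 x 3 = 12 blocks of size 32.
image = np.random.default_rng(0).random((128, 96, 3))
blocks = crop_blocks(image, 32)
train, test = split_train_test(blocks, n_train=10)
```

With the sizes reported above, the same split would be called with 30,000 blocks per class and `n_train=25000`, leaving 5000 blocks per class for testing.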

Implementation of the CNN
The DCT-CNN for HDR source forensics is implemented with the PyTorch deep learning framework [37]. Experiments were carried out on a high-performance computer with an Intel Core i7-9800X CPU (3.80 GHz) (Intel, Santa Clara, CA, USA), 64 GB RAM and an NVIDIA GeForce RTX 2080 Ti GPU (NVIDIA, Santa Clara, CA, USA). The parameters of the network are set as follows. The initial learning rate is set to 0.001 and is reduced by a learning rate decay strategy. The batch size is set to 64 images, the loss function is cross-entropy loss, and the optimizer is Adam [38]. Classification accuracy (Acc) is used to evaluate the performance of forensics methods. We chose LHS [19], SPAM [20], HOG [21], HDR-CNN [22], RF-CNN [24] and MISL-net [27] as comparative methods.

Forensics on Images without Anti-Forensics Attack
The classification accuracy averaged over the test datasets with a resolution of 32 × 32 is summarized in Table 3 for all the tested methods. The best results are marked in bold. Since small-size images include less information related to forensics, experiments conducted on small-size images can reflect the feature extraction capability of forensic methods. Table 3 indicates that the performance of HDR source forensics using manually specified feature extraction methods is weaker than that of CNN-based methods, which extract features automatically. For instance, the highest classification accuracy of LHS is 88.59% on the M-A dataset, while the accuracy of the two CNN-based forensic methods reached 94.62% and 98.94%. Among the CNN-based forensic methods, the performance of DCT-CNN in the frequency domain is better than that of HDR-CNN in the spatial domain. This result validates that the decorrelation of DCT helps the CNN extract the most important features related to HDR source forensics. In this experiment, the proposed DCT-CNN shows the best performance on different HDR datasets. For the proposed DCT-CNN, classification accuracy increased by 10.35% compared with the manually specified feature extraction methods. In addition, compared with HDR-CNN, which is a CNN-based forensics method built in the spatial domain, the forensics accuracy increased by 4.32%. The experimental results validate that the proposed DCT-CNN for HDR source forensics, which is built in the DCT domain, can achieve the desired forensic performance on 32 × 32 images. It can be observed from Table 3 that, compared with other methods, the proposed DCT-CNN gained the highest AUC on different datasets. Figure 5 shows the ROC curves of the different methods. The curve of the proposed DCT-CNN is closer to the point (0, 1), which indicates that DCT-CNN has better forensics performance than the other methods.
Table 3. Forensics accuracy and AUC of different methods on datasets with resolution of 32 × 32.
The classification accuracy and AUC averaged over the test datasets with a resolution of 64 × 64 are summarized in Table 4 for all the tested methods. For both the CNN-based forensics methods and the manually specified feature extraction methods, the accuracy improved to a certain extent compared with the results on 32 × 32 images. Taking LHS as an example, the forensic accuracy is 93.15% on the M-A dataset with an image size of 64 × 64, while the accuracy of LHS on the M-A dataset with an image size of 32 × 32 is 88.59%. The forensics accuracy of HDR-CNN on 64 × 64 images is also improved by 2.92-4.74% compared to the result on 32 × 32 images. It should be noted that our proposed method already achieves high forensic accuracy on 32 × 32 images. Hence, the performance of the proposed DCT-CNN only increased by 0.09-0.49% on 64 × 64 images. In this experiment, the proposed DCT-CNN still achieves the highest classification accuracy on four different datasets with a resolution of 64 × 64. The DCT-CNN still achieved the highest AUC on different datasets, which verifies its forensic performance from another perspective.
The classification accuracy and AUC averaged over the test datasets with a resolution of 128 × 128 are listed in Table 5. Clearly, 128 × 128 is a relatively large image size. A larger size means that the image includes more information related to forensics. It can be observed from Table 5 that the manually specified feature extraction methods and the CNN-based forensics methods achieved higher classification accuracy on datasets with a resolution of 128 × 128 compared with the results on 32 × 32 images and 64 × 64 images. In this experiment, the proposed DCT-CNN still achieves the highest classification accuracy and the highest AUC. From these experimental results, we can conclude that a larger image includes more information related to HDR source forensics.
It should be emphasized that among all the methods, RF-CNN and the proposed DCT-CNN were carried out in the DCT domain. The proposed DCT-CNN uses multi-channel DCT to avoid the loss of information and uses an adaptive multi-scale module to extract multi-scale features, which makes the forensic performance of DCT-CNN superior to RF-CNN.

From Tables 3-5, it can be concluded that the proposed method is not sensitive to the size of images. High classification accuracy and AUC can also be achieved on images with low resolution, which validates the strong robustness of DCT-CNN with respect to image size. In addition, we can observe that the performance of forensics methods built in the spatial domain on different types of datasets is not very stable. For instance, HDR-CNN has an accuracy of 90.49-94.62% on different types of datasets with a resolution of 32 × 32, so the fluctuation in accuracy of HDR-CNN is 4.13%. The fluctuation in the forensic performance of SPAM, LHS and HOG on datasets of different types is 3.26-5.56%. The fluctuation in the accuracy of our proposed DCT-CNN on datasets of different types is within 1%, which indicates that the proposed DCT-CNN has strong robustness and adaptability with respect to HDR image types.

Forensics on Images under Anti-Forensics Attack
Image anti-forensics techniques aim to make forensics algorithms fail by modifying images in a visually imperceptible way. Anti-forensics attacks either invalidate a forensics method or decrease its performance, and they are used in this experiment to verify the robustness of forensics methods. Median filtering changes the distribution of image pixel values while preserving the content of the image. Due to this characteristic, median filtering is often used as an anti-forensics attack. Therefore, it is necessary to study the robustness of the forensics methods under the median filtering attack. The median filter replaces a pixel by the median of all pixels in a neighborhood w:
$$y[m, n] = \mathrm{median}\{x[i, j],\ (i, j) \in w\}$$
where x and y denote the input and filtered images, and w represents a neighborhood centered around location [m, n] in the image. In this experiment, the size of the images in the datasets is fixed to 32 × 32. Median filtering operations with two different kernels of 3 × 3 (MF3) and 5 × 5 (MF5) were conducted on all HDR images. The experiments were conducted on these post-processed datasets to verify the robustness of the HDR source forensics methods. The experimental results are shown in Tables 6 and 7.
Compared with Table 3, it can be observed from Table 6 that the performance of all forensics methods decreased under the median filtering attack. In particular, the accuracy of forensics methods built in the spatial domain decreased significantly. For instance, LHS achieves the best performance among manually specified feature extraction methods. However, the accuracy of LHS on the M-A dataset decreased by 7.85% compared to the accuracy without an attack. As a CNN-based forensics method, HDR-CNN also decreased by 8.13% on the M-K dataset.
For our proposed DCT-CNN, the accuracy under the median filtering attack is still the highest among all methods on four different datasets, which validates that DCT-CNN is robust against anti-forensics attacks.
Comparing Tables 6 and 7, it can be concluded that median filtering with a kernel size of 5 × 5 has a greater impact on the performance of forensics methods than that with a kernel size of 3 × 3. In the case of more intense anti-forensics attacks, our proposed method still achieved the highest accuracy and the highest AUC on all the datasets. Compared with the results under median filtering with a kernel size of 3 × 3, the accuracy of the forensics methods built in the spatial domain fluctuates between 2-11%, while the fluctuation of DCT-CNN is between 0.5-2.29%, which indicates that the proposed DCT-CNN is very robust to anti-forensics attacks.
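The median filtering attack used in this section can be sketched in NumPy as follows. The function names are hypothetical, and the sketch makes two assumptions that the text does not state: borders are handled by replicating edge pixels, and the filter is applied to each colour channel independently.

```python
import numpy as np

def median_filter(channel: np.ndarray, size: int) -> np.ndarray:
    """Replace each pixel by the median of its size x size neighbourhood."""
    pad = size // 2
    padded = np.pad(channel, pad, mode="edge")    # assumed border handling
    out = np.empty(channel.shape, dtype=float)
    for m in range(channel.shape[0]):
        for n in range(channel.shape[1]):
            # Window centred on [m, n] in the original (unpadded) coordinates.
            out[m, n] = np.median(padded[m:m + size, n:n + size])
    return out

def median_attack(image: np.ndarray, size: int) -> np.ndarray:
    """Apply the median filter to every channel of an H x W x C image."""
    filtered = [median_filter(image[:, :, c], size) for c in range(image.shape[2])]
    return np.stack(filtered, axis=-1)

# Usage: MF3 and MF5 versions of a placeholder 32 x 32 block.
block = np.random.default_rng(0).random((32, 32, 3))
mf3 = median_attack(block, 3)
mf5 = median_attack(block, 5)
```

A larger window changes more of the local pixel statistics, which is consistent with the observation that the 5 × 5 kernel degrades forensic accuracy more than the 3 × 3 kernel.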

Conclusions
In this paper, we propose a CNN-based model in the DCT domain to detect the source of HDR images. To the best of our knowledge, this is the first attempt to achieve HDR source forensics in the frequency domain. Decorrelation of the image content is conducted by transforming the input image from the spatial domain into the DCT domain with an MC-DCT transformation. Hence, the subsequent network can focus on the features related to forensics. Furthermore, an adaptive multi-scale convolution module is applied to extract forensics-related information from different scales with the aim of improving the forensics performance of the network. The experimental results show that, compared with the manually specified feature extraction methods and the current CNN-based methods, our DCT-CNN achieves the best classification accuracy and AUC on datasets with different resolutions and datasets with different types of HDR images. Extensive experiments also validate the strong robustness of the proposed DCT-CNN with respect to image sizes and HDR image types. Moreover, it is robust against median filtering. We hope that this work will inspire follow-up work in the field of HDR source forensics.