1. Introduction
Color is a key factor in product design, manufacturing, and quality control. In industries such as textiles, printing, coatings, and electronic displays, accurate control and evaluation of color differences (CDs) are crucial for ensuring product quality and consumer satisfaction [1]. The objective of this paper is to propose a high-precision method for image CD measurement that simulates the perceptual process of the human visual system. CD measurement can be applied in medical imaging, video analysis, image transmission, quality control, automated inspection, and image analysis, and it is of great significance for improving photographic quality and optimizing image processing and display. However, owing to the complexity of color perception, developing a formula that can accurately predict and quantify CD has been a long-standing challenge. Modern smartphones employ display technologies such as LCD, OLED, and AMOLED, each with its own color rendition characteristics [2]. With the continuous advancement of display technology, the color gamut and dynamic contrast ratio that screens can display have improved significantly, presenting new challenges and opportunities for CD measurement.
The history of CD measurement dates back to the early 20th century, when scientists began to investigate quantitative methods for measuring color. Early work focused primarily on the physical measurement of color, such as using spectrophotometers to measure the reflectance or transmittance properties of colored samples [3]. The introduction of the CIE XYZ color space by the International Commission on Illumination (CIE) in 1931 laid a foundation for the scientific measurement of color. Since then, researchers have proposed various color spaces, and in 1976 the CIE recommended two CD formulas, CIELAB [4] and CIELUV [4], both based on CIE color-matching functions; they provide methods for quantifying CDs and are mainly used to assess small to medium CDs. CIELAB is widely used due to its simplicity and versatility, while CIELUV is more suitable for specific applications that require a more uniform color space. However, both CIELAB and CIELUV have limitations, particularly for specific colors or in certain color regions where errors are larger. The CMC CD formula, proposed in 1984 [5], uses different weighting factors to adjust differences in lightness, chroma, and hue to meet different application needs. With the development of digital image processing and display technology, CD measurement began to be applied in digital environments. The CIEDE2000 [6] CD formula was developed to account for the complexity of visual perception. Xuemei Zhang proposed Spatial-CIELAB (S-CIELAB) [7], which extends CIELAB color measurement by simulating the spatial blurring of the human visual system, and used it to assess the color reproduction errors of digital images. In 2001, Imai introduced a perceptual CD metric based on the Mahalanobis distance, utilizing covariance matrices to account for differences in color attributes [8]. In 2003, Alexander Toet extended a grayscale image quality index to a perceptually decorrelated color space, creating a novel metric for color image fidelity that was evaluated through observer experiments and found to be highly correlated with human perception [9]. In 2005, Lee proposed a measurement model to predict human judgments of similarity between different images, exploring the model's performance across a broad range of color spaces and identifying the optimal quantification for the selected color space [10]. In 2006, Guowei Hong proposed a new algorithm for CD measurement that accounts for the importance of different regions in the image and the magnitude of the CDs [11]. In 2008, Sonia Ouni proposed a new full-reference image quality assessment metric based on the characteristics of the human visual system, which considers pixel neighborhood information and introduces it through weighted differences for perceptual comparison [12]. In 2009, Gabriele Simone proposed a new Euclidean CD formula for calculating image differences, particularly small to medium CDs, in the logarithmically compressed OSA-UCS color space [13]. In 2014, Dohyoung Lee introduced CDICH, a circular chrominance-based CD index that processes the brightness and chrominance components of image data independently while taking the periodicity of chrominance into account [14]. In 2019, Ortiz-Jaramillo introduced a CD measurement method based on image segmentation and local binary patterns for calculating CDs in natural-scene color images [15]. In 2012, Ingmar Lissner proposed an image difference prediction framework that emphasizes the use of color information to enhance the assessment of gamut-mapped images; the resulting Image Difference Metric (IDM) integrates various image difference features and uses a factor combination model for overall image difference prediction [16]. In 2017, Alakuijala introduced Guetzli, a new JPEG encoder designed to produce visually indistinguishable images at lower bit rates than other common JPEG encoders, using a closed-loop optimizer to refine both the JPEG's global quantization tables and the DCT coefficient values within each JPEG block [17]. Since image quality assessment (IQA) methods can measure image distortion, they can also be used to measure and calculate CDs. Classical IQA methods include algorithms such as SSIM [18], VSI [19], LIP [20], PieAPP [21], LPIPS [22], and DISTS [23], which can be divided into two categories: traditional image evaluation methods that use texture and gradient features, and image quality evaluation methods based on deep-learning features.
The existing CD measurement algorithms extract only local features of the image, neglecting the global semantic features, and they do not use global features to guide the local features for feature fusion. As a result, these algorithms perform poorly in CD measurement for high-resolution photographic images.
The Transformer was originally proposed by Vaswani et al. in 2017, primarily for natural language processing (NLP) tasks [24]. The core of the Transformer model is the self-attention mechanism, which allows the model to consider all other elements in the sequence while processing each element. This mechanism offers powerful global modeling capability because, unlike an RNN, it does not process the sequence step by step but handles the entire sequence in parallel, increasing training efficiency [25]. The Vision Transformer (ViT) extends the Transformer architecture to computer vision tasks. The global feature extraction capability of the ViT enables it to excel at understanding and modeling the overall structure of images, which is particularly important for tasks that require global semantic features [26]. The introduction of the ViT offers a new solution for CD perception: its global modeling capability can be exploited to bring CD measurement closer to human perception.
In this work, an algorithm for CD measurement that integrates local and global features is proposed. The method simulates the human process of perceiving images, extracting both global semantic features and features from key local areas, and thus performs well in image CD measurement tasks. The method is named CD-Attention. CD-Attention extracts the local detail features of a CNN and the global semantic features of a ViT, and uses deformable convolution to fuse them, with the global semantic features guiding the local features; the CD of the image is ultimately obtained through the weighted sum of high-frequency and low-frequency prediction branches. Our main contributions can be summarized as follows:
A hybrid network architecture for CD measurement based on attention mechanisms is proposed, which integrates the global semantic features of Vision Transformer (ViT) into the color measurement process and utilizes its attention mechanism to direct local features towards significant areas of the image.
Comparisons have been made with common classical algorithms on the SPCD dataset, demonstrating that CD-Attention achieves state-of-the-art performance, and the effectiveness of the feature extraction and fusion approach has been verified through ablation experiments. The rest of this article is organized as follows.
Section 2 presents an overview of the CD formula, the Vision Transformer, and deformable convolution. Section 3 presents the architecture and implementation of CD-Attention. In Section 4, we describe the specific implementation, training process, and main experiments of CD-Attention. Section 5 provides a summary.
3. Methodology
This section provides a detailed introduction to the network structure and design concept of the proposed CD-Attention. The CD-Attention structure consists of four parts: feature extraction, attention guidance, feature fusion, and a dual-branch prediction module, as shown in Figure 1. It integrates the global semantic features extracted by the ViT with the local texture features extracted by a CNN, uses deformable convolution to enhance the model's ability to model transformations, and focuses the local features on the key areas of the image. The prediction section introduces high-frequency and low-frequency predictions to determine the final score.
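The overall data flow can be summarized by the following PyTorch-style skeleton. This is a minimal sketch of the four-part pipeline described above; all module interfaces and names (cnn_backbone, vit_backbone, guidance, fusion, predictor) are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn

class CDAttentionSketch(nn.Module):
    """Illustrative skeleton of the four-part CD-Attention pipeline."""
    def __init__(self, cnn_backbone, vit_backbone, guidance, fusion, predictor):
        super().__init__()
        self.cnn = cnn_backbone        # local texture features
        self.vit = vit_backbone        # global semantic features
        self.guidance = guidance       # deformable-conv attention guidance
        self.fusion = fusion           # global/local feature fusion
        self.predictor = predictor     # dual-branch (high/low frequency) head

    def forward(self, ref, dist):
        f_cnn_ref, f_cnn_dist = self.cnn(ref), self.cnn(dist)
        f_vit_ref, f_vit_dist = self.vit(ref), self.vit(dist)
        # Global semantics of the reference image steer the local CNN features
        # of both the reference and the distorted image.
        g_ref = self.guidance(f_cnn_ref, f_vit_ref)
        g_dist = self.guidance(f_cnn_dist, f_vit_ref)
        fused = self.fusion(g_ref, g_dist, f_vit_ref, f_vit_dist)
        return self.predictor(fused)   # scalar color-difference score
```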
3.1. Feature Extraction Model
The focus of this paper is CD perception in photographic images, which are characterized by more complex content and larger image sizes. Human perception of the CD between two images typically uses global features as a reference while conducting a detailed analysis of key local areas. Inspired by this, the feature extraction module must be capable of extracting both local and global image features. Therefore, the designed feature extraction module includes two branches: a ViT feature extraction module and a CNN feature extraction module. The CNN branch is a classic convolution-based feature extractor that captures the local detail and texture features of the image but lacks global semantic features. The ViT, with its self-attention module, can establish long-distance dependencies between image regions and thereby extract the global semantic features of the image.
In the ViT structure, the self-attention module of the Transformer is used to model image features, which differs from the way a CNN extracts local image features through convolutional and pooling layers [34]. The ViT model is built on the Transformer encoder, which includes multi-head attention, and has shown good performance in fields such as panoptic segmentation, instance segmentation, semantic segmentation, and medical image segmentation. By dividing the image into a series of small patches and feeding them into the Transformer encoder as serialized tokens, image feature extraction and classification are achieved. The process of feature extraction in the ViT is shown in Figure 2. Initially, the image is divided into fixed-size patches, which are converted into patch embeddings. Positional encoding information is added, and these embeddings are processed by a Transformer encoder consisting of multi-head self-attention and feedforward neural networks. Finally, the features without position markers are taken as the global features. During forward propagation, the reference image and the distorted image are both input into the CNN and the ViT to extract features. The features extracted from the reference image and the distorted image by the ViT are given by Equations (3) and (4), which represent the ViT feature extraction process with the reference image and the distorted image as inputs. As shown in Equations (5) and (6), ResNet is used to extract the CNN features of the reference image and the distorted image.
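A minimal sketch of the two feature extraction branches is given below. It assumes an ImageNet-pre-trained torchvision ResNet50 and a timm ViT-B/16 (recent timm version), with inputs resized and normalized to the ViT's 224 × 224 format; the paper concatenates features from several CNN stages and ViT blocks (768 × 5 = 3840 channels), which is omitted here for brevity.

```python
import torch
import timm
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

# Pre-trained backbones; the choice of ResNet stage ("layer1") is an assumption.
cnn = create_feature_extractor(resnet50(weights="IMAGENET1K_V2"),
                               return_nodes={"layer1": "local"})   # shallow texture features
vit = timm.create_model("vit_base_patch16_224", pretrained=True)

def vit_grid(x):
    # Recent timm versions return the full token sequence (B, 197, 768);
    # drop the class token and reshape the 196 patch tokens to a 14x14 grid.
    tokens = vit.forward_features(x)[:, 1:, :]
    b, n, c = tokens.shape
    h = w = int(n ** 0.5)
    return tokens.transpose(1, 2).reshape(b, c, h, w)

def extract_features(ref, dist):
    # Local CNN features and global ViT features for both images.
    c_ref, c_dist = cnn(ref)["local"], cnn(dist)["local"]
    return c_ref, c_dist, vit_grid(ref), vit_grid(dist)
```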
3.2. Attention Guidance and Feature Fusion Module
The image features extracted by CNN contain noise and interference, and the sampling points of the convolutional kernels are fixed. Although the content of the input images to neural networks varies, the features of the images are sampled at the same locations by the convolutional kernels. To address this issue, inspired by the human perception of CDs and targeting the large size and variable content of photographic images, an attention guidance module is designed. The aim is to better integrate the global semantic features of the ViT with the local texture features of CNN, thereby enhancing the transformation modeling capability of CNN.
Deformable convolution, an enhancement of the traditional convolutional neural network, enables the convolution kernel to adapt to the shape and size of the input features through learning, improving the performance of traditional CNNs on targets with complex shapes, textures, and large-scale variations [35]. By introducing additional offsets, deformable convolution captures the key features of objects more accurately, thereby enhancing the performance and robustness of the model. It adjusts the positions of the convolution kernel's sampling points by learning offsets on the input feature map, enabling the kernel to adapt to geometric changes in objects, such as scale, pose, viewpoint, and part deformation. The offsets are generated by an additional convolutional layer, which is separate from the kernel that performs the convolution operation. As shown in Figure 3, the ViT features of the reference image are first upsampled by bilinear interpolation, which considers the values of the four closest points around each sampling position and computes its value by linear interpolation. Since the offsets may cause sampling points to fall on non-integer coordinates, bilinear interpolation is also required to compute the values at these non-integer positions. An additional convolutional layer is then defined to generate the offsets, with 768 × 5 input channels and 2 × 3 × 3 output channels. The input to this layer is the feature map and its output is the offset field, which has the same spatial shape as the input feature map but a channel count of 2N, where N is the number of sampling points covered by the convolution kernel; for a 3 × 3 kernel, N = 9. These 2N channels correspond to the offsets in the x and y directions for each sampling point, and each sampling position of the convolution kernel is adjusted according to its offset: if the coordinates of a sampling point are (p, q) and its offset is (Δx, Δy), the adjusted coordinates are (p + Δx, q + Δy). After the sampling positions are adjusted, a standard convolution kernel performs the convolution at these adjusted locations, yielding the adjusted feature maps of the reference and distorted images. In summary, the CNN features, guided by the global semantic features of the reference image, focus more on the salient areas of the image, so the feature extraction process more closely resembles the human perception of CDs in images.
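The attention guidance step can be sketched with torchvision's deform_conv2d as below. The channel sizes follow the figures reported in Section 4.2 (3840 ViT channels, a 3 × 3 kernel, 2 × 3 × 3 = 18 offset channels); the layer structure and remaining hyperparameters are illustrative assumptions, and the sketch keeps the CNN feature resolution rather than reproducing the paper's exact downsampling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import deform_conv2d

class ViTGuidedDeformConv(nn.Module):
    """Sketch: offsets predicted from the reference image's ViT features
    steer where the deformable convolution samples the CNN features."""
    def __init__(self, cnn_channels=768, vit_channels=3840, out_channels=768, k=3):
        super().__init__()
        self.offset_conv = nn.Conv2d(vit_channels, 2 * k * k, kernel_size=3, padding=1)
        self.weight = nn.Parameter(torch.randn(out_channels, cnn_channels, k, k) * 0.01)
        self.k = k

    def forward(self, cnn_feat, vit_ref_feat):
        # Upsample the 14x14 ViT features to the CNN feature resolution (e.g. 56x56).
        vit_up = F.interpolate(vit_ref_feat, size=cnn_feat.shape[-2:],
                               mode="bilinear", align_corners=False)
        # 2*k*k offset channels: (dx, dy) for every kernel sampling point.
        offsets = self.offset_conv(vit_up)
        # Deformable convolution samples the CNN features at the shifted positions.
        return deform_conv2d(cnn_feat, offsets, self.weight, padding=self.k // 2)
```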
3.3. Dual-Branch Prediction Module
In the previous subsection, four feature maps are obtained: the deformable-convolution-adjusted CNN features of the reference and distorted images, and the ViT features of the reference and distorted images. The adjusted CNN feature and the ViT feature of the reference image are fused into a reference feature, and the adjusted CNN feature and the ViT feature of the distorted image are fused into a distorted feature. Finally, the reference feature, the distorted feature, and their difference are fused to obtain the final feature.
A dual-branch prediction approach is adopted to predict the final score, as shown in Figure 4. After feature fusion, a low-frequency prediction module predicts the score of the global features, producing a low-frequency prediction score map: a convolution operation first smooths the features, and a Sigmoid activation function then derives the global scores from them. The high-frequency branch also uses convolution for feature smoothing, after which the high-frequency prediction score map is computed through further convolution operations. Finally, the low-frequency and high-frequency prediction scores are fused through a weighted average to obtain the final image CD score, as shown in Formula (7). Since each element of the feature map is a high-level feature obtained through feature fusion, the weighted average makes maximal use of the information in each feature. Moreover, by using the low-frequency score as the weight, the model learns the influence of the global features on the result, allowing it to focus on the most relevant parts of the input features and to learn more precise feature representations, which enhances its expressive and generalization capabilities. The loss function chosen for training is the MSE loss, and the difference between the predicted score and the ground truth is used for back-propagation.
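A minimal sketch of the dual-branch head follows. Since Formula (7) is not reproduced here, the exact weighted-average form (the low-frequency map acting as spatial weights over the high-frequency score map) is an assumption consistent with the description above, and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class DualBranchPredictor(nn.Module):
    """Sketch: a low-frequency (weight) map and a high-frequency (score) map
    are combined by a weighted average to give the scalar CD score."""
    def __init__(self, channels=768):
        super().__init__()
        self.low = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),   # smoothing
                                 nn.Conv2d(channels, 1, 1), nn.Sigmoid())
        self.high = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),  # smoothing
                                  nn.Conv2d(channels, 1, 1))

    def forward(self, fused):
        w = self.low(fused)                          # low-frequency weights in (0, 1)
        s = self.high(fused)                         # high-frequency per-location scores
        return (w * s).flatten(1).sum(1) / (w.flatten(1).sum(1) + 1e-8)

# Training then uses a plain MSE loss between prediction and human rating:
# loss = nn.MSELoss()(model(ref, dist), ground_truth_cd)
```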
4. Experiment
4.1. Dataset
The large-scale dataset with human ratings is provided by Smartphone Photography Color Difference (SPCD), which is designed to foster further research on the measurement of perceptual CDs in photographic images and assist scholars in developing color measurement methods that are more in tune with human perception. The SPCD dataset comprises 30,000 image pairs, uniformly sampled from 1000 diverse natural scenes, encompassing a range of realistic photographic settings, Some representative images are shown in
Figure 5. The image size is 1024 × 1024 and they are stored in an uncompressed format. The SPCD encompasses four types of distortions: images of the same scene captured by different smartphones, the same image modified through Photoshop, the same image processed with iPhone filters, and the same image reproduced with incorrect ICC profile configurations. Among them, 10,005 image pairs are not perfectly aligned and are used to assess CD perception in the presence of geometric distortions such as translation or parallax. The remaining 19,995 image pairs are aligned and used for CD assessment without the influence of any geometric transformations. Each image pair has a ground truth of human-perceived CDs obtained through large-scale experiments. Utilizing SPCD can validate the stability of CD perception metrics when geometric distortions are introduced, effectively simulating the translation and parallax distortions introduced during the image capture process, which aids in the promotion and application of later algorithms. During the network training phase, 70% of the image pairs from the dataset are randomly allocated as the training set, 20% as the test set, and 10% as the validation set.
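A simple way to reproduce such a 70/20/10 split is sketched below; the pair identifiers and the random seed are illustrative assumptions, not the authors' protocol.

```python
import random

# Illustrative 70/20/10 split of the 30,000 SPCD image pairs.
pairs = list(range(30_000))
random.seed(0)
random.shuffle(pairs)
n_train, n_test = int(0.7 * len(pairs)), int(0.2 * len(pairs))
train_ids = pairs[:n_train]
test_ids = pairs[n_train:n_train + n_test]
val_ids = pairs[n_train + n_test:]
```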
4.2. Implementation Details
Network Architecture. The proposed image CD measurement network, CD-Attention, combines a CNN and the ViT, using ResNet50 and ViT_Base_Patch16 as the feature extraction networks. Local texture features are extracted by ResNet50 and global semantic features by the ViT, and both use pre-trained weights for feature extraction. ResNet50 and the ViT each extract features from the reference image and the distorted image, producing feature maps of size (B, 768, 56, 56) and (B, 3840, 14, 14), respectively, where B denotes the batch size, set to 8. Subsequently, bilinear interpolation and convolution operations are applied to the ViT features of the reference image to obtain the offsets for the deformable convolution. The offset field has size (8, 18, 56, 56) and serves as the global receptive field for the current input image, guiding the CNN features. The features extracted from the reference image and the distorted image by ResNet50 are input into the deformable convolution separately, using the offsets from the reference image for guidance. After the deformable convolution, the ResNet50 features are transformed to dimensions of (8, 768, 14, 14). The deformable-convolution outputs for the reference and distorted images are then fused with the corresponding ViT features of the reference and distorted images, yielding fused feature maps of size (8, 4608, 14, 14). Dimensionality reduction is applied to these feature maps, reducing them to (8, 256, 14, 14). Fusing the reference feature, the distorted feature, and their difference yields the final feature map of size (8, 768, 14, 14). Finally, the dual-branch prediction module is used for score prediction: convolutional operations produce the high-frequency and low-frequency prediction score maps, both with dimensions of (8, 1, 14, 14), and the final image CD score is computed as their weighted average.
Training and Testing Details. CD-Attention is trained on a computer equipped with an A100 GPU, using the SPCD dataset for model training. The training loss curve is shown in Figure 6. The learning rate is set to 10^-4, and one training epoch takes 16 min and 30 s. A total of 200 epochs are trained, ultimately yielding a best model loss of 0.05224.
Evaluation Criteria. The results are evaluated using three metrics: the standardized residual sum of squares (STRESS), the Pearson linear correlation coefficient (PLCC), and Spearman's rank correlation coefficient (SRCC). These metrics are used to validate the consistency and correlation of CD-Attention with the human visual system (HVS). STRESS is calculated as

$$\mathrm{STRESS} = 100\sqrt{\frac{\sum_{i=1}^{M}\left(\Delta E_i - F\,\Delta V_i\right)^2}{\sum_{i=1}^{M}F^2\,\Delta V_i^2}},$$

where M is the number of test pairs, $\Delta E_i$ and $\Delta V_i$ are the predicted and perceptual CDs of the i-th pair, and F is the scale correction factor between $\Delta E$ and $\Delta V$, defined as

$$F = \frac{\sum_{i=1}^{M}\Delta E_i^2}{\sum_{i=1}^{M}\Delta E_i\,\Delta V_i}.$$

PLCC and SRCC are used to measure the correlation and monotonicity of the CD evaluation network. PLCC is calculated as

$$\mathrm{PLCC} = \frac{\sum_{i=1}^{M}\left(\Delta E_i - \overline{\Delta E}\right)\left(\Delta V_i - \overline{\Delta V}\right)}{\sqrt{\sum_{i=1}^{M}\left(\Delta E_i - \overline{\Delta E}\right)^2\sum_{i=1}^{M}\left(\Delta V_i - \overline{\Delta V}\right)^2}},$$

where $\overline{\Delta E}$ and $\overline{\Delta V}$ are the mean predicted and perceptual CDs, respectively. A preprocessing step linearizes the model predictions by fitting a four-parameter monotonic function before computing PLCC. SRCC is calculated as

$$\mathrm{SRCC} = 1 - \frac{6\sum_{i=1}^{M}d_i^2}{M\left(M^2-1\right)},$$

where $d_i$ is the difference between the rank orders of the i-th pair in $\Delta E$ and $\Delta V$.
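For reference, the following is a minimal NumPy/SciPy sketch of how the three metrics can be computed from arrays of predicted (ΔE) and perceptual (ΔV) color differences. The STRESS scale factor and the logistic form of the four-parameter monotonic fit are standard choices from the literature and are assumptions here, not the authors' exact implementation.

```python
import numpy as np
from scipy import stats, optimize

def stress(dE, dV):
    """STRESS between predicted (dE) and perceptual (dV) color differences."""
    dE, dV = np.asarray(dE, float), np.asarray(dV, float)
    F = np.sum(dE ** 2) / np.sum(dE * dV)          # standard scale correction factor
    return 100.0 * np.sqrt(np.sum((dE - F * dV) ** 2) / np.sum((F * dV) ** 2))

def plcc_after_fit(dE, dV):
    """PLCC after a four-parameter monotonic (logistic) mapping of the predictions;
    the exact fitting function used by the authors is an assumption."""
    dE, dV = np.asarray(dE, float), np.asarray(dV, float)
    def logistic(x, a, b, c, d):
        return (a - b) / (1.0 + np.exp(-(x - c) / abs(d))) + b
    p0 = [dV.max(), dV.min(), dE.mean(), dE.std() + 1e-6]
    try:
        popt, _ = optimize.curve_fit(logistic, dE, dV, p0=p0, maxfev=10000)
        dE = logistic(dE, *popt)
    except RuntimeError:
        pass  # fall back to the raw predictions if the fit does not converge
    return stats.pearsonr(dE, dV)[0]

def srcc(dE, dV):
    """Spearman rank correlation between predicted and perceptual CDs."""
    return stats.spearmanr(dE, dV)[0]
```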
4.3. Main Results
Dataset validation results. As shown in Table 1, CD-Attention is compared with 30 common CD measurement algorithms. These methods can be categorized into five groups based on their application scenarios and algorithmic characteristics. The first category mainly includes methods aimed at CD measurement for natural images: CIELAB [4], CIE94 [36], CMC [5], CIEDE2000 [6], CIECAM02 [37], CIECAM16 [38], S-CIELAB [7], Imai01 [8], Toet03 [9], Lee05 [10], Hong06 [11], Ouni08 [12], Simone09 [13], and Pedersen12 [39]. The second category consists of typical image quality assessment algorithms: SSIM [18], VSI [19], LIP [20], PieAPP [21], LPIPS [22], and DISTS [23]. The third category includes just-noticeable-difference (JND) measures: Chou07 [40], Lissner12 [16], and Butteraugli [17]. Additionally, classic CNN networks such as VGG [41], ResNet-18 [42], UNet [43], and CAN [44] are trained, and their performance in CD measurement is tested. Finally, CD-Attention is compared with the state-of-the-art methods CD-Net [28] and CD-Flow [29].
From the table, the following conclusions can be drawn. First, traditional CD calculation methods designed for natural images do not perform particularly well on images captured by smartphones. The reason is that direct computation of image CDs does not align with human perception of CDs, as subjective ratings tend to favor regions of interest within the image. Among these, CIELAB [4], CIE94 [36], CIEDE2000 [6], CIECAM02 [37], CIECAM16 [38], S-CIELAB [7], and Ouni08 [12] perform relatively better because they all calculate CDs in the CIELAB color space. Second, typical image quality assessment algorithms such as SSIM [18], VSI [19], LIP [20], PieAPP [21], LPIPS [22], and DISTS [23] have been proven effective in the field of image quality assessment and are capable of identifying and detecting common image distortions. However, they do not perform well on CD image datasets, because quality changes caused by CDs are not necessarily perceived as image distortions, which also confirms the hypothesis proposed in this paper. Third, CD measurement algorithms based on JND can perceive the smallest changes detectable by humans, but they perform poorly on the SPCD dataset, with low correlation and poor STRESS values, owing to weak perception of changes above the threshold. Fourth, training models directly on classic CNN networks yields better results than traditional algorithms and image quality assessment algorithms, as neural networks have strong generalization capabilities, and transfer learning on the SPCD dataset achieves more accurate CD perception. This proves that using CNN networks for feature extraction in CD perception is feasible. Fifth, most algorithms perform well on the aligned subset but poorly on the misaligned subset, owing to insufficient adaptability to geometric changes in images. Sixth, compared with the state-of-the-art methods CD-Net and CD-Flow, CD-Attention's correlation metrics PLCC and SRCC are significantly higher. This is because CD-Net and CD-Flow are designed around CNN networks and mainly extract local texture features without incorporating global features. In summary, CD-Attention achieves state-of-the-art performance, accurately perceiving the quality loss caused by CDs and showing a high correlation with human ratings.
Generalization Analysis. To verify the generalization of the proposed CD-Attention model, the TID2013 dataset is used to train the model. TID2013 is an image quality assessment dataset containing 25 reference images and 3000 distorted images, generated from 24 distortion types with 5 distortion levels each. Mean Opinion Score (MOS) values were obtained from 971 observers from five countries. The distortion types related to CD are selected from the dataset for model training. The results are evaluated using the STRESS, PLCC, and SRCC metrics, as shown in Table 2. According to the data in the table, although CIEDE2000 [6] is currently considered the best-performing CD formula for predicting experimental datasets, its performance here is mediocre. Compared with the existing image quality assessment methods PieAPP [21], LPIPS [22], and DISTS [23], CD-Attention shows improved performance, demonstrating better generalization. Compared with the current best algorithms CD-Net [28] and CD-Flow [29], CD-Attention shows a higher correlation, further demonstrating its superior generalization performance.
4.4. Ablation Study
To analyze the effectiveness of CD-Attention, ablation experiments are conducted on the key modules using the SPCD dataset. These include the feature extraction and attention guidance modules.
Feature Extraction Module. The feature extraction module is a crucial part of the network, and the quality of feature extraction directly affects the network's predictions and outputs. Common feature extraction backbones are compared, with the results shown in Table 3. The compared backbones include ResNet50 [42], ResNet101 [42], ResNet152 [42], HRNet [45], InceptionResNetV2 [46], ViT-B/16, and ViT-B/8 [31]. The feature extraction model combining ResNet50 [42] and ViT-B/16 [31] is found to be the most effective. A decrease in performance is noted as the depth of the CNN network increases. This is because the deeper the CNN network, the more high-level features it captures, whereas CD-Attention requires the CNN only to extract local features, with the ViT responsible for extracting global, deep features. Comparing the Transformer backbones ViT-B/8 [31] and ViT-B/16 [31], ViT-B/16 [31] performs better.
Attention Guidance Module. An ablation study is conducted on the attention guidance module of CD-Attention to verify the effectiveness of introducing deformable convolution for attention guidance. As shown in Table 4, the first method employs deformable convolution for attention guidance, as used in CD-Attention, combining CNN features with ViT global semantic features before feature fusion. The second method uses traditional feature concatenation to merge CNN features with ViT features directly. In CD measurement tasks, the first method significantly outperforms the second, demonstrating that applying attention guidance before feature fusion effectively integrates local texture features with global semantic features, and further proving that the global semantic features extracted by the ViT can guide the local shallow features extracted by the CNN, bringing CD measurement results closer to human perception. The third and fourth methods in the table use only ResNet50 or ViT-B/16 as the backbone, and their performance drops significantly. This proves that the ResNet50 and ViT-B/16 features both contribute substantially to the performance of CD-Attention, demonstrating that local texture features and global semantic features are both essential in CD measurement tasks, and further confirming that the design concept of the proposed CD-Attention model is sound.