1. Introduction
Underwater images serve as an important medium for transmitting ocean information, supporting autonomous underwater vehicles, remotely operated vehicles, and other marine instruments to help them efficiently complete tasks such as ocean resource exploration and deep-sea facility monitoring [
1]. However, due to the unique properties of the underwater environment, the imaging process often suffers from light absorption and scattering, resulting in low contrast, color casts, haze effects, and other visual impediments that affect subsequent image processing systems in underwater instruments [
2]. Consequently, extensive research on Underwater Image Enhancement (UIE) algorithms [
3] has received significant attention, aiming to improve the visual quality of underwater images and to aid downstream tasks (e.g., saliency detection, object detection, and image segmentation). However, due to the lack of high-quality underwater reference images, most existing UIE algorithms rely on synthetic datasets, and none are universally effective, limiting their applicability in real-world scenarios. Enhanced underwater images may still retain unresolved issues or introduce additional distortions, such as artifacts related to over-enhancement or detail loss related to under-enhancement. Therefore, Image Quality Assessment (IQA) [
4,
5,
6,
7] methods are often necessary for objective assessment, promoting the development of image systems for various optical sensors.
Thus far, many traditional in-air IQA methods have been proposed [
8,
9,
10,
11]. Among them, Mean Squared Error (MSE) and Peak Signal-to-Noise Ratio (PSNR) are classic Full Reference (FR) methods that often serve as benchmarks for optimizing UIE algorithms. However, MSE and PSNR, defined based on pixel-wise differences, fail to adequately capture features affecting human perception, such as texture details [
12]. Given the unavailability of reference images in many cases, the application value of No-Reference (NR) IQA methods becomes more prominent. Existing NR-IQA methods primarily gauge image distortion by extracting Natural Scene Statistics (NSS) features in spatial or transform domains. These encompass classic NSS features, like Mean Subtracted Contrast Normalized (MSCN) coefficients [
8], wavelet transformation coefficients [
9], and discrete cosine transformation coefficients [
10], among others, which have shown success in in-air images. However, due to the unique imaging environment, statistical NSS features change in underwater images, making it difficult to capture the increased noise and non-Gaussian distribution variations present in these images. Additionally, some progress has been made in deep learning (DL)-based IQA methods for in-air images [
13,
14,
15,
16,
17,
18,
19,
20]. For example, some methods utilize transfer learning [
13] and meta-learning [
16] to learn distortion features of in-air images, thereby achieving distortion classification and discrimination. However, the imaging characteristics and primary degradations of underwater images differ from those of in-air images. More importantly, UIE algorithms may introduce more complex and variable artificial distortions into underwater images. Therefore, directly applying IQA methods designed for in-air images to evaluate underwater image quality may yield suboptimal results.
To this end, many traditional underwater IQA (UIQA) methods [
21,
22,
23,
24,
25,
26,
27,
28,
29,
30] have been proposed. Most of the existing traditional UIQA methods rely on extracting handcrafted features to reflect the distortion situations. For instance, methods like underwater color image quality evaluation (UCIQE) [
21] gauge underwater image quality by extracting statistical features like chroma, saturation, and contrast from the CIELab color space, but do not consider any human perception factors. The underwater image quality measure (UIQM) [
22], inspired by characteristics of the Human Vision System (HVS), incorporates measurements of colorfulness, sharpness, and contrast to evaluate underwater image degradation. However, the method of assigning manual weights to the extracted features to represent image quality has poor generalization and fails to adequately represent diverse degradation scenarios. Although these methods have a certain ability to assess common distortions such as low contrast and blurriness, they struggle to precisely quantify compound distortions caused by complex and dynamic underwater environments. Particularly, they inadequately perceive the intricate variations in color and structure introduced by UIE algorithms, like redshifts and artifacts. Due to the excellent feature learning capabilities of DL, some deep models for UIQA have been proposed [
31,
32,
33]. Most of the DL-based UIQA methods use additional priors or conditions beyond the input to better learn the quality features. For example, some methods integrate color histogram priors as quality tokens to supplement global degradation information [
31], but they fail to reflect the impact of local spatial structure on human perception. Others utilize depth maps as the weight to differentiate between foreground and background [
32], but lack consideration of phenomena such as low contrast caused by luminance distortions. Previous works [
34,
35] used saliency maps as weights to measure perceptual quality. Typically, they directly apply or resize the saliency map to fit the required input size without delving into its inherent perceptual information. Although these embedded conditions or priors help the methods understand image quality to some extent, there is still a lack of sufficient consideration of the visual information introduced by UIE algorithms. Therefore, it is necessary to improve UIQA methods to evaluate the quality of enhanced underwater images more accurately.
In this paper, we propose a novel joint luminance–saliency prior UIQA model called JLSAU. Specifically, considering distortions at different scales, such as texture details and haze effects, a pyramid-structured backbone network is constructed to capture multi-scale distortion features. Additionally, to measure the impact of UIE algorithms on visual information, luminance and saliency priors are incorporated into the multi-scale features, where the luminance prior and the saliency prior reflect the enhancement effect on global and local quality information, respectively. To this end, the Luminance Feature Extraction Module (LFEM) is proposed to comprehensively measure the global distortion of luminance, to which the HVS is highly sensitive. It supplements the multi-scale features to enhance the perception of global distortions, such as low contrast. Furthermore, the Saliency Weight Learning Module (SWLM) is designed to learn saliency features that reflect variations in local quality, serving as supplementary priors to enhance the perception of textures and details within the multi-scale features. Finally, multi-scale features containing rich perceptual quality information are obtained, and the Attention Feature Fusion Module (AFFM), based on the attention mechanism, is proposed to model the visual information at different levels. The experimental results show that, compared to existing UIQA methods, the proposed JLSAU achieves better performance on two different underwater image datasets. The main contributions of our work are summarized as follows:
To enhance the global quality perception of multi-scale features, the LFEM is proposed to learn quality-aware representations of luminance distribution as prior knowledge. This includes learning quantitative statistical features through histogram statistics and designing a convolutional network to encode and supplement the positional information missing from the statistical features.
To improve the local quality perception of multi-scale features, the SWLM is designed to extract saliency features in the channel and spatial domains as prior knowledge. These features reflect the enhancement effect of UIE algorithms on local quality, thereby enhancing the perception of structure and texture.
To model the relationships of perceptual information among the multi-scale features augmented by the luminance and saliency priors, the AFFM is introduced. It effectively integrates perceptual quality information from different levels of features using attention mechanisms, aiming to comprehensively perceive distortions in enhanced images.
3. The Proposed Method
To capture the distortion present in enhanced underwater images, we propose a novel multi-scale UIQA model named JLSAU, with a joint luminance–saliency prior. To focus on the variations in image saliency caused by local distortions, saliency features are extracted as visual weights to measure local perceptual quality. Additionally, the perceptual quality representation of luminance information is learned by extracting grayscale histogram statistical features, and positional information is supplemented. Finally, considering the multi-scale perceptual characteristics of HVS, a fusion module based on the attention mechanism is constructed to model the distortion information between different scale features. The overall framework of the proposed JLSAU is shown in
Figure 1. In this section, the key parts of JLSAU will be explained in detail, including its pyramid-structured backbone (hereinafter referred to as backbone), Luminance Feature Extraction Module (LFEM), Saliency Weight Learning Module (SWLM), Attention Feature Fusion Module (AFFM), and loss function setting.
3.1. Backbone
Enhanced underwater images not only include distortions caused by the inherent properties of the water medium, but also artificial distortions introduced by UIE algorithms. The distortions can be categorized at different scales into local distortions, such as structure loss, artifacts, and false edges, as well as global distortions, such as low contrast and color deviations. To capture the different scales of distortion information, we design a pyramid structure module as the backbone for JLSAU. The backbone extracts multi-scale features from the RGB space containing abundant raw color information. The extracted multi-scale features contain rich perceptual quality information such as color, texture, etc. The backbone consists of four stages. Each stage is composed of a downsampling layer, a Multi-Scale Conv Attention (MSCA), and a Feed-Forward Network (FFN). To reduce information loss, overlapping convolution is used instead of pooling for downsampling, which also facilitates modeling the local continuity of structural information. The MSCA consists of a batch normalization layer, a depth-wise convolution, multi-branch strip convolutions, and a 1 × 1 convolution. As shown in
Figure 1, the MSCA can be represented as follows:
$$\mathrm{MSCA}(X) = \mathrm{Conv}_{1\times1}\Big(\sum_{n} \mathrm{Branch}_n\big(\mathrm{DWConv}\big(\mathrm{BN}(X)\big)\big)\Big) \odot X,$$
where $X$ represents the input feature, $\odot$ represents element-wise matrix multiplication, $\mathrm{BN}(\cdot)$ represents batch normalization, $\mathrm{Conv}_{1\times1}(\cdot)$ and $\mathrm{DWConv}(\cdot)$ represent the $1 \times 1$ convolution and the depth-wise convolution, respectively, and $\mathrm{Branch}_n(\cdot)$ represents the $n$-th branch of the multi-branch strip convolutions.
Specifically, depth-wise convolution aggregates local information, multi-branch strip convolutions capture multi-scale features, and 1 × 1 convolution aggregates the information between channels. FFN is composed of BN-Conv-GELU-Conv, serving to efficiently integrate and transform channel-wise features, enhancing the representational power of the network for each stage. Finally, the output features are obtained at every stage, denoted as $F_i \in \mathbb{R}^{C_i \times H_i \times W_i}$, where $i \in \{1, 2, 3, 4\}$ represents the stage, and $C_i$, $H_i$, and $W_i$ represent the number of channels, height, and width of the feature maps at the $i$-th stage, respectively.
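To make the composition of one backbone stage concrete, the following is a minimal PyTorch sketch of an MSCA block consistent with the description above. The depth-wise kernel size (5) and the strip-convolution kernel sizes (7, 11, 21) are illustrative assumptions, not the exact configuration used in JLSAU.

import torch
import torch.nn as nn

class MSCA(nn.Module):
    # Multi-Scale Conv Attention (sketch): BN -> depth-wise conv -> multi-branch
    # strip convolutions -> 1x1 conv, then element-wise multiplication with the input.
    def __init__(self, dim, branch_kernels=(7, 11, 21)):
        super().__init__()
        self.bn = nn.BatchNorm2d(dim)
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=5, padding=2, groups=dim)
        self.branches = nn.ModuleList()
        for k in branch_kernels:
            # One strip-convolution branch: 1xk followed by kx1 depth-wise convolutions.
            self.branches.append(nn.Sequential(
                nn.Conv2d(dim, dim, (1, k), padding=(0, k // 2), groups=dim),
                nn.Conv2d(dim, dim, (k, 1), padding=(k // 2, 0), groups=dim),
            ))
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)  # aggregates information across channels

    def forward(self, x):
        u = self.dwconv(self.bn(x))                       # aggregate local information
        attn = sum(branch(u) for branch in self.branches)  # multi-scale strip-convolution context
        attn = self.proj(attn)                            # 1x1 convolution across channels
        return attn * x                                   # element-wise attention weighting of the input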
3.2. Luminance Feature Extraction Module
Underwater images typically exhibit global distortions such as low contrast and low visibility in terms of luminance, which significantly affects their quality. To analyze the performance of enhanced underwater images of varying quality in terms of luminance attributes, adaptive histogram equalization is applied. Histogram equalization can distribute pixel values uniformly across the entire luminance range, enhance image contrast, and improve detail. As shown in
Figure 2, four groups of enhanced underwater images and their corresponding histogram equalization results are selected from the UIQE dataset. Each group includes enhanced images obtained by different UIE algorithms that vary from low quality to high quality. Obviously, compared to low-quality images, high-quality images exhibit smaller differences before and after processing, resulting in higher visual similarity. Conversely, in low-quality images, global distortions such as low contrast are significantly improved by equalization, producing larger differences. Viewing histogram equalization as a mapping process, the mapping distance between the before and after results can reflect image similarity and thereby indicate image quality. Thus, given an enhanced underwater image $I$ and its histogram equalization result $I_{he}$, the difference between them can reflect the quality of $I$. To quantify the difference, the LFEM is designed to extract high-dimensional luminance features containing differential information.
Specifically, $I$ and $I_{he}$ are converted to grayscale and the results are denoted as $G$ and $G_{he}$, respectively. Let $D$ be the difference map obtained by subtracting $G_{he}$ from $G$, which serves as the input of the LFEM. Then, the statistical features of $D$ are obtained through histogram statistics and sent into a fully connected layer to obtain the final luminance statistical vector, denoted as $V_{lu}$. It can be expressed as follows:
$$V_{lu} = \mathrm{FC}\big(\mathrm{Hist}(D)\big),$$
where $\mathrm{Hist}(\cdot)$ represents the histogram statistical operation in the grayscale space, and $\mathrm{FC}(\cdot)$ represents the fully connected layer.
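As a rough illustration of this branch, the sketch below builds the difference map D between a grayscale image and its equalized version and maps its histogram through a fully connected layer. The bin count (256), the output dimension (64), the input file name, and the use of plain histogram equalization applied directly to the grayscale image (the paper equalizes the enhanced image before conversion, and mentions adaptive equalization) are assumptions made only for illustration.

import cv2
import numpy as np
import torch
import torch.nn as nn

def luminance_statistical_vector(img_bgr: np.ndarray, fc: nn.Linear, bins: int = 256) -> torch.Tensor:
    # Histogram branch of the LFEM (sketch): difference map -> histogram -> FC.
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)            # G: grayscale of the enhanced image I
    gray_he = cv2.equalizeHist(gray)                            # approximation of G_he
    diff = gray.astype(np.float32) - gray_he.astype(np.float32)  # difference map D
    hist, _ = np.histogram(diff, bins=bins, range=(-255.0, 255.0), density=True)
    return fc(torch.from_numpy(hist).float())                   # luminance statistical vector V_lu

# Usage with assumed dimensions: map 256 histogram bins to a 64-D statistical vector.
fc = nn.Linear(256, 64)
img = cv2.imread("enhanced_underwater.png")   # hypothetical input path
v_lu = luminance_statistical_vector(img, fc)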
Histogram statistics only capture the quantitative statistical relationship of grayscale levels without providing the specific spatial information of pixels. Inspired by the encoding ability of the convolution layer with the relative position [53], a convolutional branch is designed for implicit position encoding. Specifically, it comprises three stages, each consisting of an LFE block (i.e., a Conv-ReLU-Conv layer), a skip connection, and a downsampling layer. The LFE block not only captures local features but also preserves the spatial context, allowing the network to infer relative positions implicitly, which can be expressed as follows:
$$L_i = \mathrm{Down}\big(L_{i-1} + \mathrm{Conv}\big(\delta\big(\mathrm{Conv}(L_{i-1})\big)\big)\big), \quad i = 1, 2, 3,$$
where $i$ indexes the stages, $L_i$ represents the output of the convolutional branch at stage $i$, $L_0$ is equivalent to $D$, $\mathrm{Down}(\cdot)$ represents downsampling, and $\delta(\cdot)$ represents the ReLU activation function.
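A minimal sketch of one stage of this convolutional branch is given below; the 3×3 kernel sizes and the stride-2 convolutional downsampling are assumptions chosen for illustration rather than the exact hyperparameters of JLSAU.

import torch.nn as nn

class LFEStage(nn.Module):
    # One stage of the LFEM convolutional branch: LFE block (Conv-ReLU-Conv)
    # with a skip connection, followed by a downsampling layer.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, in_ch, 3, padding=1),
        )
        self.down = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)  # downsampling layer

    def forward(self, x):
        x = x + self.block(x)   # skip connection preserves the spatial context
        return self.down(x)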
Subsequently, since the following processing does not involve the variables whose subscript $i$ is 1, to avoid ambiguity and describe concisely, only the subscripts $i \in \{2, 3, 4\}$ are used below, and the outputs of the three-stage convolutional branch are re-indexed accordingly as $L_i$ to align with the corresponding backbone stages. The obtained features $L_i$ are subjected to a mapping operation that includes global max pooling and convolutional layers to obtain the multi-scale luminance features $F^{lu}_i$, which can be expressed as follows:
$$F^{lu}_i = \mathrm{Conv}\big(\mathrm{GMP}(L_i)\big), \quad i \in \{2, 3, 4\},$$
where $\mathrm{GMP}(\cdot)$ represents global max pooling.
Finally, by pixel-wise multiplication with $F^{lu}_i$, which carries the relative position information learned by the convolutional branch, the multi-scale features $F_i$ are supplemented with luminance information, which can be expressed as follows:
$$F^{L}_i = F_i \odot F^{lu}_i, \quad i \in \{2, 3, 4\}.$$
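One plausible reading of this mapping is sketched below: global max pooling reduces the convolutional-branch output to a channel descriptor, a 1×1 convolution aligns the channel dimension, and the result is broadcast-multiplied with the backbone feature. The operator ordering and channel widths are assumptions; the actual JLSAU configuration may differ.

import torch.nn as nn

class LuminanceSupplement(nn.Module):
    # Sketch: map a convolutional-branch output L_i to a luminance descriptor via
    # global max pooling and a 1x1 convolution, then weight the backbone feature F_i.
    def __init__(self, lum_ch, feat_ch):
        super().__init__()
        self.gmp = nn.AdaptiveMaxPool2d(1)          # global max pooling
        self.proj = nn.Conv2d(lum_ch, feat_ch, 1)   # align channel dimensions

    def forward(self, f_i, l_i):
        f_lu = self.proj(self.gmp(l_i))             # luminance feature, shape (B, C_i, 1, 1)
        return f_i * f_lu                           # pixel-wise (broadcast) multiplication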
3.3. Saliency Weight Learning Module
Influenced by attributes such as color, contrast, and texture, human perception tends to focus on certain parts of a scene, namely regions of saliency. Saliency detection and UIQA are essentially related, as they both depend on how HVS perceives images. If visual saliency is affected, it implies a change in visual quality. Moreover, visual salient regions are also crucial aspects that UIQA needs to consider.
Figure 3 shows different enhanced underwater images and their corresponding saliency maps, including two control groups of high and low quality. It can be observed that different enhancement effects lead to varying degrees of improvement in the results of saliency detection, which can reflect the quality of the input enhanced images. Therefore, to better perceive local quality, SWLM is designed to extract saliency features. This module indirectly reflects image-quality changes introduced by UIE algorithms and assigns visual weights to different regions. The specific framework is shown in
Figure 4.
Firstly, the method proposed by Montalone [54] is utilized to calculate the saliency map. For an enhanced underwater image $I$, its saliency map is denoted as $S_m$, which is input into two convolutional-activation layers to expand the channel dimension and obtain the low-level saliency features $F_s$, which can be represented as follows:
$$F_s = \delta\big(\mathrm{Conv}\big(\delta\big(\mathrm{Conv}(S_m)\big)\big)\big).$$
Then, a dual-branch network is designed to obtain the perceptual features of $F_s$ in both the spatial and channel domains. The spatial branch aims to capture complex spatial relationships, while the channel branch aims to capture uneven channel degeneration. The channel branch includes a global average pooling layer and a global max pooling layer to refine the feature, followed by a fully connected layer and a sigmoid activation function to generate the channel weights $W_c$, which can be expressed as follows:
$$W_c = \sigma\big(\mathrm{FC}\big(\mathrm{GAP}(F_s) \,©\, \mathrm{GMP}(F_s)\big)\big),$$
where $\sigma(\cdot)$ represents the sigmoid activation function, $\mathrm{GAP}(\cdot)$ represents global average pooling, and © represents splicing (concatenation) on a specific dimension.
The spatial branch contains an average pooling layer and a max pooling layer, followed by a convolution layer and a sigmoid activation function to obtain the spatial weight $W_s$, which can be expressed as follows:
$$W_s = \sigma\big(\mathrm{Conv}\big(\mathrm{AP}(F_s) \,©\, \mathrm{MP}(F_s)\big)\big),$$
where $\mathrm{AP}(\cdot)$ and $\mathrm{MP}(\cdot)$ represent average pooling and max pooling, respectively.
To maintain correspondence with the size of the multi-scale feature $F_i$, $W^i_c$ and $W^i_s$ are obtained by adjusting the output dimension of $W_c$ and the convolution kernel size of $W_s$. Finally, the feature $F_i$ is weighted with the corresponding $W^i_c$ and $W^i_s$ obtained from the saliency prior, and the results are concatenated along the channel dimension to obtain the weighted features $F^{S}_i$, which can be expressed as follows:
$$F^{S}_i = \big(F_i \odot W^i_c\big) \,©\, \big(F_i \odot W^i_s\big), \quad i \in \{2, 3, 4\}.$$
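The dual-branch weighting can be sketched in PyTorch as follows. Several details are assumptions for illustration only: the channel counts, interpreting the spatial-branch pooling as pooling across channels, and resizing the spatial weights by interpolation (the paper instead adjusts the convolution kernel size per stage).

import torch
import torch.nn as nn

class SWLM(nn.Module):
    # Sketch: low-level saliency features F_s yield channel weights W_c
    # (GAP/GMP -> FC -> sigmoid) and spatial weights W_s (AP/MP -> conv -> sigmoid),
    # which then reweight a backbone feature f_i.
    def __init__(self, sal_ch=32, feat_ch=64):
        super().__init__()
        self.stem = nn.Sequential(                      # two convolutional-activation layers
            nn.Conv2d(1, sal_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(sal_ch, sal_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.fc = nn.Linear(2 * sal_ch, feat_ch)        # channel branch
        self.spatial_conv = nn.Conv2d(2, 1, 7, padding=3)  # spatial branch

    def forward(self, saliency_map, f_i):
        f_s = self.stem(saliency_map)                                     # (B, sal_ch, H, W)
        gap = f_s.mean(dim=(2, 3))                                        # global average pooling
        gmp = f_s.amax(dim=(2, 3))                                        # global max pooling
        w_c = torch.sigmoid(self.fc(torch.cat([gap, gmp], dim=1)))        # channel weights W_c
        w_c = w_c[:, :, None, None]
        ap = f_s.mean(dim=1, keepdim=True)                                # average pooling (across channels)
        mp = f_s.amax(dim=1, keepdim=True)                                # max pooling (across channels)
        w_s = torch.sigmoid(self.spatial_conv(torch.cat([ap, mp], dim=1)))  # spatial weights W_s
        w_s = nn.functional.interpolate(w_s, size=f_i.shape[-2:], mode="bilinear", align_corners=False)
        return torch.cat([f_i * w_c, f_i * w_s], dim=1)                   # weighted features F_i^S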
3.4. Attention Feature Fusion Module
Feature fusion, as a common component of DL models, typically integrates low-level and high-level features through long skip connections or summation to obtain features with high resolution and strong semantic information [
55]. Finding the appropriate fusion method is crucial for achieving the desired goals for different tasks. For UIQA, the HVS exhibits multi-scale perception characteristics, where features at different scales have complementary advantages. Low-level features possess a small receptive field, demonstrating strong representations of geometric details that can reveal texture structure distortion. Conversely, high-level features have large receptive fields and a robust ability to represent semantic information, capturing global distortions such as the haze effect. To this end, considering that employing an attention module can enhance the understanding of contextual information within complex features and emphasize more important parts [
56,
57,
58], we propose an attention-based fusion module that is used to model the distortion information contained between multi-scale features. The main framework is illustrated in
Figure 5.
Firstly, to achieve dimensional transformation and feature extraction, the input feature is divided into fixed-size patches and linearly mapped, thus enabling the capture of essential representations for subsequent relationship modeling. The multi-scale features $F_i$ have different receptive fields and resolutions. To ensure the mapped patches maintain consistency in image content, different convolutions are used for patch embedding, followed by a flattening layer to reduce the dimension. The size of the convolution kernel is consistent with the patch size, and the number of output channels is consistent with the embedding size. After feature mapping, the spatial and channel attention are calculated independently. Spatial attention is used to model multi-scale contextual information as local features, enhancing expressive ability. Different scales of contextual information help capture the relationship between structure and texture in enhanced images. Then, the spatial attention feature $F_{sa}$ is obtained as follows:
$$F_{sa} = \mathrm{Softmax}\!\left(\frac{Q K^{T}}{\sqrt{d}}\right) V, \quad Q = \mathrm{LM}_q(X),\ K = \mathrm{LM}_k(X),\ V = \mathrm{LM}_v(X),$$
where $X$ denotes the embedded patch tokens, $\mathrm{LM}(\cdot)$ represents linear mapping, $\mathrm{Softmax}(\cdot)$ represents the softmax function, $\sqrt{d}$ represents the square root of the feature dimension, and $T$ represents transpose.
In addition, to model the interdependence between different channels and improve the perception of key channel information, we utilize the channel attention feature $F_{ca}$, which is obtained analogously by computing the attention map along the channel dimension:
$$F_{ca} = V\,\mathrm{Softmax}\!\left(\frac{Q^{T} K}{\sqrt{d}}\right).$$
Finally, the perceptual feature $F_p$ is obtained by element-wise addition of $F_{sa}$ and $F_{ca}$, which can be expressed as follows:
$$F_p = F_{sa} + F_{ca}.$$
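A compact sketch of this spatial/channel attention is shown below, assuming the patch-embedded multi-scale features have already been arranged into a token sequence X of shape (B, N, d); the single-head formulation and dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class AFFMAttention(nn.Module):
    # Sketch: spatial attention over tokens plus channel attention over feature
    # dimensions, fused by element-wise addition.
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # linear mappings LM_q, LM_k, LM_v
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):                                # x: (B, N, d) patch-embedded tokens
        q, k, v = self.q(x), self.k(x), self.v(x)
        d = x.shape[-1]
        spatial = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1) @ v   # F_sa: (B, N, d)
        channel = v @ torch.softmax(q.transpose(-2, -1) @ k / d ** 0.5, dim=-1)   # F_ca: (B, N, d)
        return spatial + channel                         # perceptual feature F_p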
3.5. Regression
By combining the luminance statistical vector $V_{lu}$ and the perceptual feature $F_p$, the final quality feature is obtained, which is then input into the fully connected layer to calculate the final quality score $S$, which can be expressed as follows:
$$S = \mathrm{FC}\big(V_{lu} \,©\, F_p\big).$$
The optimization of the perceived quality prediction results employs MSE as the loss function $\mathcal{L}$, which can be expressed as follows:
$$\mathcal{L} = \frac{1}{N}\sum_{k=1}^{N}\big(S_k - Q_k\big)^{2},$$
where $S$ represents the quality predicted by the proposed JLSAU, $Q$ represents the Mean Opinion Score (MOS), and $N$ denotes the number of images in a training batch.
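The final prediction and training objective can be sketched as follows; the feature dimensions (a 64-D luminance statistical vector and a 512-D pooled perceptual feature) are placeholders, not the values used in the paper.

import torch
import torch.nn as nn

regressor = nn.Linear(64 + 512, 1)   # assumed dimensions for V_lu and pooled F_p
criterion = nn.MSELoss()

def quality_score(v_lu, f_p):
    # Concatenate the luminance statistical vector with the perceptual feature
    # and regress a single quality score S per image.
    return regressor(torch.cat([v_lu, f_p], dim=1)).squeeze(1)

def training_step(v_lu, f_p, q):
    # S: predicted score; q: MOS label Q. Optimize with the MSE loss.
    s = quality_score(v_lu, f_p)
    return criterion(s, q)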
5. Further Discussion
As mentioned previously, we visualize the comparison of enhanced underwater images of different qualities before and after histogram equalization in
Figure 2. Generally, high-quality enhanced images exhibit smaller changes after equalization, resulting in greater similarity between $I$ and $I_{he}$. Consequently, the mapping distance between them is smaller. The similarity between $I$ and $I_{he}$ should have a positive correlation with image quality. Therefore, we employ indicators such as cosine similarity, SSIM, histogram similarity distance, and pHash to evaluate the similarity between $I$ and $I_{he}$. Each indicator is utilized as a luminance feature in place of the LFEM, and their ability to represent mapping relationships is compared in Table 4 to verify the superiority of the LFEM. As shown in Table 4, experiments are conducted on the UIQE dataset. Compared to SSIM, histogram similarity distance, and cosine similarity, the proposed JLSAU using the LFEM generally achieves the best overall performance in learning the correlation between mapping relationships and image quality. This indicates its advantage in learning high-dimensional mapping relationships, with the second-best result obtained by pHash. Therefore, to better model the consistent correlation with perceived quality, we propose using the LFEM to learn quality-aware representations of the luminance distribution.
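For reference, two of the simpler indicators in this comparison can be computed as in the sketch below (cosine similarity and histogram similarity between the grayscale image and its equalized version); SSIM and pHash would come from libraries such as scikit-image and ImageHash and are not shown. Applying the equalization directly to the grayscale image is a simplifying assumption.

import cv2
import numpy as np

def baseline_similarities(img_bgr):
    # Similarity indicators between the grayscale image G and its equalized version G_he.
    g = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    g_he = cv2.equalizeHist(g)
    a = g.astype(np.float32).ravel()
    b = g_he.astype(np.float32).ravel()
    cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    h1 = cv2.calcHist([g], [0], None, [256], [0, 256])
    h2 = cv2.calcHist([g_he], [0], None, [256], [0, 256])
    hist_sim = float(cv2.compareHist(h1, h2, cv2.HISTCMP_CORREL))
    return cosine, hist_sim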
To validate the role of the convolutional block in LFEM, ablation experiments are conducted while keeping all structures consistent except for the convolutional block. The results are shown in
Table 5, where “W/O” represents without and “W/” represents with, indicating a notable overall performance decline after the removal of the convolutional block. This decline phenomenon is observed in both the UIQE and UWIQA datasets, highlighting the significance of the convolutional block in LFEM. This importance may stem from the ability of CNN to complement missing spatial information in grayscale statistics through relative positional encoding, thereby enhancing the perception of luminance information comprehensively.
To assess the impact of various saliency maps in the SWLM on the performance of JLSAU, we select four state-of-the-art salient object detection methods and conduct ablation experiments on the UIQE dataset. The overall structure of the model remains unchanged, and saliency maps obtained by different methods are used as inputs. The results are presented in
Table 6. It can be observed that the saliency maps obtained by different methods indeed affect the overall performance of quality assessment. Fortunately, the proposed method consistently achieves the best PLCC and KROCC, demonstrating the superiority of the selected saliency map acquisition method. Compared to the selected method, these saliency methods used for comparison have a weaker ability to distinguish between different-quality underwater images, which may be due to their stronger generalization capabilities. Even in the presence of various distortions, the obtained saliency maps remain relatively consistent.
Furthermore, the impact of feature selection in the backbone network on the experimental results is explored. Specifically, while keeping the overall structure unchanged, some necessary hyperparameters within the module are modified to match the dimensions of features at different stages for training. The results, shown in
Table 7, reveal that the model generally achieves optimal correlation indicators with the feature combination of $F_2$, $F_3$, and $F_4$ (i.e., (4)), compared to other combinations, indicating superior performance. Additionally, within the combinations containing $F_4$ (i.e., (2), (3), (4)), generally better results are obtained compared to those obtained for (1). This is attributed to the superior semantic representation and richer global information provided by high-level features, which reflect the quality of enhanced underwater images at a deeper level. Therefore, the features $F_2$, $F_3$, and $F_4$ are selected in the proposed JLSAU.
In addition, an effective UIQA method should have relatively low complexity. Fifty images from the UIQE dataset are utilized to test the running time of various methods, and the average time spent is taken as the experimental result, as shown in
Table 8. The experiment is carried out on a computer with an Intel i5-12500 4.08 GHz CPU, an NVIDIA RTX 3070ti GPU, and 32 GB of memory, using MATLAB 2016b. It can be found that the average running time of the proposed JLSAU is less than that of DIIVINE, BMPRI, TReS, HyperIQA, CCF, FDUM, UIQI, Twice-Mixing, and Uranker. Compared with BRISQUE, CNN-IQA, UIQM, UCIQE, and CSN, JLSAU runs slightly longer but achieves better performance.
6. Conclusions
In this paper, we present JLSAU, a multi-scale underwater image quality assessment (UIQA) method built on a joint luminance–saliency prior. By utilizing a pyramid-structured backbone network, it extracts multi-scale features that align with the perceptual characteristics of the Human Visual System (HVS), with a particular focus on distortions at different scales, such as texture details and haze effects. Then, to measure the enhancement effect of Underwater Image Enhancement (UIE) algorithms, we extract visual information such as luminance and saliency as supplementary priors. The luminance prior is obtained by utilizing histogram statistics to learn the luminance distribution and by designing a convolutional network to encode the positional information missing from the statistical features. Furthermore, to improve the perception of local structure and details, we separately learn saliency features in the channel and spatial domains, reflecting changes in visual attention. Finally, by leveraging the attention mechanism, we effectively model the rich perceptual quality and enhancement information contained in the multi-scale features of enhanced images, thereby comprehensively improving the representation capability for image quality. Compared to existing UIQA methods, the proposed JLSAU demonstrates superior performance. However, there are still some limitations, such as the lack of a dedicated color prior for addressing color distortions, and the saliency detection method used has yet to be optimized. In the future, we plan to further explore the incorporation of color priors and to develop a saliency detection method tailored specifically to underwater images in order to better leverage the saliency prior.