This section provides a comprehensive evaluation of the proposed methodology. It begins with a description of the evaluation metrics, continues with the implementation details, and then presents an overview of the datasets used in the experiments. The methodology's performance is first evaluated in terms of classification accuracy, including the initial training results, and subsequently in terms of its effectiveness in improving underwater-specific metrics. Experimental results on both the test set and an in-the-wild dataset are then reported, along with a discussion of the associated computational costs.
4.1. Evaluation Metrics
For evaluating the effectiveness of the enhancement methods, two key factors influenced the selection of metrics. The first was the necessity of working without ground truth, as no reference images for the enhanced outputs were available. The second was the importance of choosing metrics specific to underwater environments, where typical image processing challenges, such as color distortion and low visibility, are particularly pronounced. A table with all the metric acronyms and their definitions is included in Table 2 for clearer understanding and reference. It is important to note that all metric scores are computed on the original high-resolution images, whereas resizing is used solely for selecting the most suitable enhancement method.
The first metric employed was UCIQE [25], a specialized quality metric designed to assess the visual quality of underwater images. Whereas traditional image quality metrics are often based on subjective perception or predefined ground-truth images, UCIQE accounts for the unique characteristics of underwater scenes, such as the color distortion caused by light absorption in water and the overall visibility of objects, which makes it more suitable for underwater evaluation. UCIQE analyzes three primary factors, namely chroma variation, luminance contrast, and saturation, and compares the enhanced image with the typical characteristics of a natural, undistorted image, providing a quality score that reflects how visually pleasant and clear the image appears after enhancement. The metric ranges from 0 to 1, where a higher score indicates better enhancement. The formula for calculating UCIQE is given by
$$\mathrm{UCIQE} = c_1 \, \sigma_c + c_2 \, \mathrm{con}_l + c_3 \, \mu_s,$$

where $\sigma_c$ is the standard deviation of chroma and measures the intensity of color variation in the image, $\mathrm{con}_l$ is the luminance contrast, which quantifies the contrast in brightness across the image, $\mu_s$ is the mean saturation, which indicates the overall color saturation of the image, and $c_1$, $c_2$, $c_3$ are constants that serve as weights for chroma, luminance contrast, and saturation.
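To make the computation concrete, a minimal Python sketch is given below. It assumes OpenCV's 8-bit CIELab conversion, a percentile-based definition of luminance contrast, and an LCh-style saturation term; these are plausible implementation choices rather than the exact code used in this work.

```python
import cv2
import numpy as np

def uciqe(image_bgr: np.ndarray,
          c1: float = 0.4680, c2: float = 0.2745, c3: float = 0.2576) -> float:
    """UCIQE = c1 * sigma_c + c2 * con_l + c3 * mu_s, with components roughly in [0, 1]."""
    lab = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2LAB).astype(np.float64)
    L = lab[..., 0] / 255.0                    # luminance, scaled to [0, 1]
    a = (lab[..., 1] - 128.0) / 128.0          # green-red opponent channel
    b = (lab[..., 2] - 128.0) / 128.0          # blue-yellow opponent channel

    chroma = np.sqrt(a ** 2 + b ** 2)
    sigma_c = chroma.std()                     # standard deviation of chroma

    # Luminance contrast: spread between the brightest and darkest 1% of pixels.
    con_l = np.percentile(L, 99) - np.percentile(L, 1)

    # Mean saturation, using the LCh-style definition chroma / sqrt(chroma^2 + L^2).
    mu_s = np.mean(chroma / np.sqrt(chroma ** 2 + L ** 2 + 1e-12))

    return c1 * sigma_c + c2 * con_l + c3 * mu_s
```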
In addition to the UCIQE metric, UIQM [26] was also utilized to evaluate the enhancement of underwater images. UIQM is a composite metric designed to assess the quality of underwater images by combining quality indicators for colorfulness, sharpness, and contrast. It incorporates multiple individual measures, each focusing on a specific aspect of image quality, which are then weighted and combined to produce a final score. UIQM integrates the following three components: UICM (Underwater Image Colorfulness Measure), UISM (Underwater Image Sharpness Measure), and UIConM (Underwater Image Contrast Measure), combined using a weighted sum. The formula for UIQM is as follows:
$$\mathrm{UIQM} = c_1 \cdot \mathrm{UICM} + c_2 \cdot \mathrm{UISM} + c_3 \cdot \mathrm{UIConM},$$

where $c_1$, $c_2$, and $c_3$ are the empirical weighting coefficients, and UICM, UISM, and UIConM represent the individual underwater image quality measures for colorfulness, sharpness, and contrast, respectively.
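Since each component measure involves its own multi-step computation, the brief sketch below shows only the weighted-sum fusion step, using the published coefficients that are also listed in Section 4.2; the component values are assumed to come from an existing implementation.

```python
def uiqm(uicm: float, uism: float, uiconm: float,
         c1: float = 0.0282, c2: float = 0.2953, c3: float = 3.5753) -> float:
    """Weighted sum of the colorfulness (UICM), sharpness (UISM),
    and contrast (UIConM) measures."""
    return c1 * uicm + c2 * uism + c3 * uiconm
```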
The third metric used was UIF [27], which is designed to quantify the fidelity of an underwater image by measuring the uniformity of the chroma distribution after applying robust normalization. It focuses on color information by calculating the deviation of the chroma distribution, making it a crucial metric for assessing color balance and consistency in underwater images. Unlike traditional metrics that focus only on global image characteristics, UIF accounts for local variations in chroma, ensuring that the color fidelity of the image is properly evaluated. UIF is calculated using the following formula:
$$\mathrm{UIF} = 1 - \sum_{i=1}^{K} w_i \, \frac{\lvert h_i - u_i \rvert}{h_i + u_i + \epsilon},$$

where $h_i$ is the normalized chroma histogram for the $i$-th bin, $u_i$ is the reference uniform histogram, $w_i$ is the weight for each histogram bin, $K$ is the number of bins in the chroma histogram, and $\epsilon$ is a small constant to prevent division by zero. For a specific value of $K$, the deviation of the metric remains stable, though $K$ can be adjusted depending on the specific requirements.
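A sketch consistent with the definitions above follows. The exact binning, weighting, and normalization used in the original UIF implementation may differ, so the equal per-bin weights and the simple chroma histogram here are assumptions, not the reference code.

```python
import cv2
import numpy as np

def uif(image_bgr: np.ndarray, K: int = 10, eps: float = 1e-8) -> float:
    """Chroma-uniformity score following the formulation above."""
    lab = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2LAB).astype(np.float64)
    a, b = lab[..., 1] - 128.0, lab[..., 2] - 128.0
    chroma = np.sqrt(a ** 2 + b ** 2)

    # Normalized chroma histogram h_i over K bins.
    h, _ = np.histogram(chroma, bins=K, range=(0.0, chroma.max() + eps))
    h = h / max(h.sum(), 1)

    u = np.full(K, 1.0 / K)   # reference uniform histogram u_i
    w = np.full(K, 1.0 / K)   # equal per-bin weights w_i (an assumption)

    # Weighted deviation of the chroma distribution from uniformity.
    deviation = np.sum(w * np.abs(h - u) / (h + u + eps))
    return 1.0 - float(deviation)
```

With $u_i = 1/K$ and equal weights, each summand is bounded, so the score remains in a stable range for a fixed $K$, consistent with the remark above.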
A composite metric was then created that integrates all three key image quality measures of UCIQE, UIQM, and UIF to leverage the strengths of each and provide a comprehensive evaluation of image enhancement techniques in underwater environments. The UCIQE metric is particularly sensitive to color and luminance quality, making it useful for assessing color fidelity and contrast, which are crucial for underwater images. The UIQM focuses on the image’s structural integrity and its perceptual similarity to human visual preferences, while UIF emphasizes the uniformity and distribution of chroma across the image. By combining these three metrics, the aim is to capture not only the technical aspects of image enhancement, such as color distribution and contrast, but also the perceptual aspects, ensuring that the resulting composite metric provides a balanced and holistic view of image quality.
The fusion of these metrics was achieved by applying a weighted combination of the z-scores derived from each individual metric. This approach normalizes the metrics and integrates them into a single composite score that can be used to rank the quality of enhanced images. The statistical analysis of the combined results (e.g., mean and standard deviation) allows the overall performance of the enhancement methods to be assessed. By aggregating the strengths of UCIQE, UIQM, and UIF, the composite metric serves as a more robust and reliable measure of image quality, particularly in the complex and diverse conditions found in underwater environments. It thereby eliminates the need for subjective judgment or reliance on a single metric, providing a more consistent and comprehensive evaluation framework. The formula for calculating the z-scores is given by
$$z = \frac{X - \mu}{\sigma},$$

where $X$ is the value being transformed, $\mu$ is the mean of the distribution, and $\sigma$ is the standard deviation of the distribution. The composite z-score is then given by

$$Z_{\mathrm{composite}} = \frac{z_{\mathrm{UCIQE}} + z_{\mathrm{UIQM}} + z_{\mathrm{UIF}}}{3}.$$
This approach effectively combines the information from all three metrics, giving equal weight to each in the final composite score. It is a simple yet effective way to leverage the strengths of different image quality metrics, under the assumption that each metric contributes equally to the final evaluation. If different weights are desired for the individual metrics, the formula can be modified to use a weighted average instead of a simple average.
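A compact sketch of this fusion step is shown below; the array names are illustrative, and the default equal weights can be replaced by a custom weighting as just described.

```python
import numpy as np

def composite_scores(uciqe_vals, uiqm_vals, uif_vals, weights=(1/3, 1/3, 1/3)):
    """Fuse per-image metric values into a composite z-score over a dataset."""
    def zscore(x):
        x = np.asarray(x, dtype=np.float64)
        return (x - x.mean()) / (x.std() + 1e-12)   # z = (X - mu) / sigma

    z = np.stack([zscore(uciqe_vals), zscore(uiqm_vals), zscore(uif_vals)])
    return np.average(z, axis=0, weights=weights)   # equal weights by default

# Example: rank candidate enhancements of one image set by composite score.
# best_index = composite_scores(uciqe_list, uiqm_list, uif_list).argmax()
```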
For the evaluation of the model, the F1 macro score was utilized alongside accuracy throughout the training process. The F1 macro score provides a balanced measure of precision and recall across all classes, regardless of their distribution. Unlike accuracy, which may be biased towards more frequent classes, the F1 macro considers both false positives and false negatives, making it particularly valuable for imbalanced datasets. The metric computes the F1 score for each class individually and then averages the results, ensuring that the model's ability to correctly classify all classes is taken into account. This provides a more comprehensive view of overall performance in cases where accuracy alone may not fully reflect the model's effectiveness across different classes, and it helps ensure that the model is robust in handling varying data distributions, which is crucial for tasks involving complex or imbalanced data. Its formula is given by
$$F1_{\mathrm{macro}} = \frac{1}{C} \sum_{cl=1}^{C} \frac{2 \, TP_{cl}}{2 \, TP_{cl} + FP_{cl} + FN_{cl}},$$

where $TP_{cl}$ are the true positives for class $cl$, $FP_{cl}$ are the false positives for class $cl$, $FN_{cl}$ are the false negatives for class $cl$, and $C$ represents the number of classes in the classification problem.

All of the enhancement methods are visualized in Figure 3. The figure includes images that were not used during training, as well as in-the-wild samples that are completely unseen by the model.
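The per-class computation in the F1 macro formula can be made concrete with a short sketch, which is equivalent to scikit-learn's f1_score(y_true, y_pred, average="macro"):

```python
import numpy as np

def f1_macro(y_true, y_pred, num_classes: int) -> float:
    """Per-class F1 averaged with equal weight, matching the formula above."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f1s = []
    for cl in range(num_classes):
        tp = np.sum((y_pred == cl) & (y_true == cl))   # true positives for class cl
        fp = np.sum((y_pred == cl) & (y_true != cl))   # false positives for class cl
        fn = np.sum((y_pred != cl) & (y_true == cl))   # false negatives for class cl
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(f1s))
```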
4.2. Implementation Details
The training of the Swin Transformer model was performed on a high-performance desktop system with 62 GB of RAM and an Intel Core i9-13900F processor. Computational acceleration was provided by an NVIDIA GeForce RTX 3070 Ti GPU with 8 GB of GDDR6X VRAM. The other experiments, including image enhancement techniques and model evaluation, were conducted on a secondary desktop system featuring 32 GB of RAM, an Intel Core i5-10600K processor, and an NVIDIA GeForce GTX 1060 GPU.
For the UCIQE metric, the coefficients were set to $c_1 = 0.4680$, $c_2 = 0.2745$, and $c_3 = 0.2576$. Similarly, for the UIQM metric, the empirical weighting coefficients were $c_1 = 0.0282$, $c_2 = 0.2953$, and $c_3 = 3.5753$. These UIQM coefficients are not arbitrarily selected in this work but are taken from the original UIQM formulation proposed in the literature, where they were empirically determined to balance the relative contributions of colorfulness, sharpness, and contrast in underwater image quality assessment. Standard fixed coefficients are adopted to ensure consistency and fair comparison with existing methods rather than re-optimizing them. All other implementation parameters are likewise adopted from the corresponding original works or standard settings reported in the literature to ensure fair and reproducible evaluation. For the UIF metric, the number of bins $K$ for the chroma histogram is set to 10, offering a balanced resolution for chroma normalization, although this value can be adjusted depending on the requirements of the application. For the gamma correction, the value of $\gamma$ was set to 0.7. For the CLAHE enhancement, a rectangular grid with a tile size of (4, 4) and a clip limit of 2 was used. For the Shades-of-Gray white balance method, the Minkowski p-norm was set to $p = 6$, a value commonly used to compute the channel normalization and achieve a more accurate white balance correction. It is important to note that all images were resized to 224 × 224 pixels to reduce computational cost.
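Under these settings, the three classical operations with explicitly stated parameters can be sketched as follows; applying CLAHE to the CIELab luminance channel is an assumption here, as the color space used in the original pipeline is not stated.

```python
import cv2
import numpy as np

def gamma_correction(img_bgr: np.ndarray, gamma: float = 0.7) -> np.ndarray:
    """Pointwise gamma correction via a 256-entry lookup table."""
    lut = ((np.arange(256) / 255.0) ** gamma * 255.0).astype(np.uint8)
    return cv2.LUT(img_bgr, lut)

def clahe_enhance(img_bgr: np.ndarray, clip_limit: float = 2.0,
                  tile=(4, 4)) -> np.ndarray:
    """CLAHE with clip limit 2 on a (4, 4) tile grid (applied to luminance)."""
    lab = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2LAB)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile)
    lab[..., 0] = clahe.apply(lab[..., 0])
    return cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)

def shades_of_gray(img_bgr: np.ndarray, p: int = 6) -> np.ndarray:
    """Shades-of-Gray white balance with Minkowski p-norm, p = 6."""
    img = img_bgr.astype(np.float64) + 1e-8
    norm = np.power(np.power(img, p).mean(axis=(0, 1)), 1.0 / p)  # per-channel p-norm
    gain = norm.mean() / norm          # scale channels toward a gray illuminant
    return np.clip(img * gain, 0, 255).astype(np.uint8)
```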
The Swin Transformer base variant was utilized, which employs a patch size of 4 and a window size of 7. The model was pretrained on the ImageNet dataset and subsequently adapted for the image enhancement classification task. It was trained for 50 epochs using a batch size of 16, and the dataset was divided into training (70%), validation (15%), and test (15%) subsets. The training and validation sets were used for model optimization, while the test set was reserved for final evaluation. To ensure reproducibility across different runs, a fixed random seed was applied to both data shuffling and model initialization. The model training was conducted using the Adam optimizer with a learning rate of , and the loss function used was Cross-Entropy Loss.
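A minimal training-setup sketch under these settings is shown below, using the timm implementation of Swin-Base; the seed value is illustrative, and the learning rate is left at the optimizer default, where the value reported in the configuration should be substituted.

```python
import torch
import timm

torch.manual_seed(42)   # fixed seed for reproducibility; the actual value is illustrative

# Swin-Base: patch size 4, window size 7, 224x224 input, ImageNet-pretrained,
# with the head adapted to the 3-class enhancement-selection task.
model = timm.create_model("swin_base_patch4_window7_224",
                          pretrained=True, num_classes=3)

optimizer = torch.optim.Adam(model.parameters())  # substitute the reported learning rate
criterion = torch.nn.CrossEntropyLoss()

def train_one_epoch(loader, device: str = "cuda") -> None:
    """One pass over the training loader (batch size 16, images at 224x224)."""
    model.to(device).train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```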
4.3. Dataset
The training dataset consisted of six sub-datasets, each collected at different depths and under varying environmental conditions in order to capture a wide range of underwater scenes. In total, the dataset contained 4051 images. From these, 321 images were reserved as an in-the-wild test dataset, while the remaining 3730 images were used for the training, validation, and testing processes. The data were split following a 70–15–15 ratio, with 70% used for training, 15% for validation, and 15% for testing.
The Mermaid Underwater Dataset [28] (dataset 1), which is publicly available, was the first subset and was used as an in-the-wild test set. It contains 321 high-resolution images (3840 × 2880 pixels) captured in 2022 at a depth of approximately 20 m at the La Sirène site in Saint-Raphaël, France, using a GoPro Hero 3 Silver Edition camera. The dataset covers an area of roughly 150 m² and provides sub-millimeter ground sampling resolution. It includes a variety of underwater environments, such as a mermaid statue, sandy plains, and rocky habitats. The images were acquired by divers using a single camera under natural lighting conditions and are provided without any preprocessing.
Dataset 2 consists of 2171 images of a Rubik's Cube placed at a depth of 28 m in Chrousso, Greece. The cube was firmly embedded in the seabed to ensure stability and reduce movement caused by underwater currents. During image acquisition, the diver moved around the cube to capture it from different viewpoints.
Dataset 3 consists of 540 images captured at a depth of 20 m in Akti Kalogrias, Greece. Dataset 4 includes 342 images captured at a depth of 40 m in Porto Valitsa, Greece. Dataset 5 contains 77 images captured at a depth of 50 m in Avlaki, Greece. The second, third, fourth, and fifth datasets were captured using a GoPro Hero 9 camera equipped with an underwater housing and have Full HD resolution. Dataset 6 consists of 600 images captured at a depth of 62 m in Porto Valitsa, Greece, with Full HD resolution, acquired using a Vaquita Paralenz camera. This dataset includes images captured under both natural illumination and artificial lighting from a diving torch, which is useful for visibility in deep-water environments; this combination enhances the dataset's diversity by incorporating conditions beyond natural lighting alone, thereby better representing realistic underwater operational environments. The composition of the dataset is illustrated in Figure 4.
The GoPro HERO9 Black features a 1/2.3-inch sensor with approximate dimensions of 6.17 × 4.55 mm and a sensor area of 28.07 mm². It is equipped with a fixed lens of approximately 3 mm focal length and an aperture of f/2.5 under the standard configuration. The Paralenz Vaquita camera also employs a 1/2.3-inch sensor; however, its aperture and focal length are not publicly specified by the manufacturer. Both cameras were used with their default lenses, and no additional optical filters (e.g., anti-reflection or color filters) were used during image acquisition.
Since the proposed approach is camera-agnostic, it does not incorporate optical calibration or explicitly model distortions, but instead addresses them implicitly through learned image representations. It is important to note that the datasets were extracted from in situ underwater videos under real-world conditions, where the use of color calibration charts is neither feasible nor beneficial to the study’s purpose. Therefore, the evaluation focuses on relative performance under realistic conditions rather than absolute color calibration.
4.4. Experimental Results
First, each enhancement method was evaluated for every image in the dataset using the UCIQE, UIQM, and UIF metrics, together with the composite z-score calculated across the entire dataset. Based on these metrics, the optimal enhancement for each individual image was identified and used as the ground truth to train and guide the image classification network. For the dataset used, out of the seven enhancement methods applied, only three consistently produced the best results; their relative distribution is illustrated in Figure 5.
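This labeling step can be sketched as follows; the names enhancers and metric_fns are hypothetical stand-ins for the seven enhancement callables and the three metric functions, and normalizing over all image-method pairs is one plausible reading of "across the entire dataset".

```python
import numpy as np

def build_ground_truth(images, enhancers, metric_fns):
    """Score every enhancement on every image; the argmax of the composite
    z-score becomes that image's ground-truth class for the classifier."""
    names = list(enhancers)
    # scores[i, m, k]: metric k of enhancement m applied to image i
    scores = np.array([[[fn(enhancers[m](im)) for fn in metric_fns]
                        for m in names] for im in images])

    # z-normalize each metric over the whole dataset, then average (composite Z)
    mu = scores.mean(axis=(0, 1), keepdims=True)
    sd = scores.std(axis=(0, 1), keepdims=True) + 1e-12
    composite = ((scores - mu) / sd).mean(axis=2)

    best = composite.argmax(axis=1)          # optimal enhancement per image
    return [names[b] for b in best]
```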
The fact that the top enhancement methods for the dataset are not limited to a single method, but rather span three distinct ones, highlights the importance of automatically determining the optimal enhancement for each case. This variability is particularly notable, as even within the same dataset originating from the same location, the most effective enhancement method can differ. Such differences likely arise from varying image characteristics, such as lighting conditions, noise, or contrast, which affect how each enhancement technique performs. This emphasizes the necessity for a dynamic model that adapts to the specific characteristics of each image and selects the most suitable enhancement method to ensure optimal results.
The results from the training demonstrate the model's progressive improvement across both the training and validation datasets. Initially, the model's performance was modest, with training accuracy gradually increasing and validation accuracy fluctuating in the first few epochs. By Epoch 3, noticeable improvement in both training and validation accuracy was observed, signaling that the model was beginning to generalize well. Throughout the training process, the loss consistently decreased and the accuracy steadily increased, indicating that the model was effectively learning from the data and adapting to the underlying patterns.
While the model continued to improve across epochs, Epoch 22 was selected for further analysis due to its balanced performance and the highest validation accuracy. At Epoch 22, the training accuracy reached 95.86%, the validation accuracy was 88.37%, and the test accuracy was 88.04%. The F1 macro score also peaked at 0.869913, demonstrating a strong balance of precision and recall across all classes. This epoch reflects the model's ability to generalize well and make reliable predictions on unseen data, and the overall trend of increasing performance underscores the model's effectiveness across both the training and evaluation datasets.
The results presented in Table 3 compare the model performance metrics on the test set, specifically Accuracy and F1 Macro. The baseline models, random and major, serve as reference points for understanding the performance of the proposed approach. The Baseline random model achieved an accuracy of 33.24% and an F1 Macro of 29.27%, corresponding to random predictions across the three classes. The Baseline major model, which always selects the most frequent class, performed better in accuracy, at 50.18%, but reached an F1 Macro of only 22.28%. In contrast, the proposed Swin model, leveraging a more advanced methodology, achieved significantly higher performance, with an accuracy of 88.04% and an F1 Macro of 87.88%. Compared to the Baseline random and Baseline major models, the proposed model improves the F1 Macro by 58.61 and 65.60 percentage points, respectively. The confusion matrix is provided in Figure 6.
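The two reference baselines can be reproduced with scikit-learn's DummyClassifier, which ignores the input features and predicts from the label distribution alone; the feature arrays here are placeholders.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

def baseline_scores(X_train, y_train, X_test, y_test):
    """Accuracy and macro F1 for the 'random' and 'major' reference baselines."""
    results = {}
    for name, strategy in [("random", "uniform"), ("major", "most_frequent")]:
        clf = DummyClassifier(strategy=strategy, random_state=0)
        clf.fit(X_train, y_train)            # features are ignored by design
        pred = clf.predict(X_test)
        results[name] = (accuracy_score(y_test, pred),
                         f1_score(y_test, pred, average="macro"))
    return results
```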
Beyond classification accuracy, a further assessment was conducted by computing the quality metrics for the predicted classes and comparing them to the metrics that would result from always selecting a single enhancement method. These evaluations were carried out on the testing set. The values presented in Table 4 are averages computed over the testing set for each enhancement method, with metrics including UCIQE, UIQM, UIF, and the composite score. These values reflect the overall effectiveness of each enhancement technique in improving image quality. Ideally, the best result would be obtained if the method selected for each image perfectly matched the optimal enhancement, yielding the highest possible values across all metrics. The "Optimal" row, which corresponds to the enhancement with the highest composite score for every image, represents this ideal outcome.
While the goal is to maximize these values, the methods demonstrate varying levels of effectiveness, and the proposed Swin model's predictions align closely with the best-performing methods on average. The proposed model shows notable improvements over the individual enhancement methods, demonstrating that its predictions are very close to the optimal enhancement technique. Moreover, it outperforms several other enhancement methods in terms of UCIQE, UIQM, and UIF values, indicating that it is highly effective in selecting the best enhancement for a given input image. The small gap between the proposed model and the optimal method reflects the model's ability to consistently identify high-quality enhancement techniques.
In addition to the improvements demonstrated by the proposed model, one of its most significant contributions is its ability to automatically identify the best enhancement method, a task that would otherwise require prior knowledge of the image characteristics. The key challenge in image enhancement is that, in practice, the optimal method is not always apparent or consistent across different images; users typically lack the expertise or means to determine which technique would work best for a given image without trial and error. The proposed model effectively addresses this problem by selecting the most appropriate enhancement method based on the specific features of each input image, ensuring the best possible outcome. This capability is crucial, as it removes the need for manual selection, streamlining the process and delivering high image quality without requiring user intervention or prior knowledge. A visualization of all the enhancement methods, including both the predicted and ground-truth results, is presented in Figure 3.
In addition to the controlled test-set evaluations, an in-the-wild dataset [28] was used to assess the methodology's performance under more realistic, unconstrained conditions. In-the-wild evaluation is crucial for image enhancement models because it simulates real-world scenarios in which images come from diverse sources and vary significantly in lighting, noise, image quality, and content. Unlike controlled datasets, in-the-wild images often contain unpredictable challenges, such as varying exposure, complex backgrounds, and inconsistent color distributions. Testing the model on such data makes it possible to evaluate its robustness and generalizability across diverse image types and conditions.
The second part of Table 3 presents the performance metrics for the in-the-wild dataset. The Baseline random model achieves an F1 macro of 26.18%, while the Baseline major model reaches a higher F1 macro of 39.55%. The proposed Swin model surpasses both with an F1 macro of 52.14%, a relative improvement of 31.87% over the stronger baseline. The higher accuracy of the Baseline major model is attributed to the dataset's inherent class imbalance, with a larger number of images belonging to the majority category. However, the significant improvement achieved by the proposed Swin model suggests its ability not only to handle such class imbalance but also to boost overall performance. The model's performance across both majority and minority classes shows its effectiveness in recognizing patterns and making predictions even in the presence of imbalanced data. This also reflects the model's potential to generalize well across varying datasets and adapt to different class distributions, which is critical for robust performance in real-world applications. The ability to balance accuracy between classes further reinforces the model's effectiveness in complex scenarios, where real-world data is often skewed or uncertain.
Subsequently, the metrics for each enhancement method were calculated for the in-the-wild dataset, which is crucial for assessing how well these methods generalize to real-world, unseen data. These metrics highlight the effectiveness of the various techniques in improving image quality across different dimensions. The proposed Swin model demonstrates superior performance with a composite score of 0.6491, surpassing all of the traditional methods; notably, it outperforms the RGHS method, which achieves a composite score of 0.6465. The results are shown in Table 5.
It is equally important to recognize that the optimal enhancement method for the in-the-wild dataset differs from the best method identified on the testing set. This distinction underscores the value and necessity of the proposed model, as it actively identifies the most effective method for each dataset. Without such a model, users would have to manually test multiple enhancement methods for each new dataset, a process that is not only time-consuming but also prone to error, especially when the best method varies even within the same dataset. The proposed model automates this selection process, delivering significant time savings while ensuring high-quality enhancement. This ability to dynamically choose the best method, adjusting to the specific characteristics of each dataset, demonstrates a level of flexibility and efficiency that would be nearly impossible to achieve through manual trial and error. The model's adaptability to real-world conditions makes it a valuable tool for smart, data-driven enhancement.
The Swin Transformer model used in this study contains 86,746,299 parameters, reflecting its high capacity to learn complex features from underwater images. At 15.47 GFLOPs per inference, the model is relatively light in computational complexity compared to other large-scale architectures, indicating efficient performance at a manageable cost for practical deployment. The computational costs of the various image enhancement methods were evaluated using the Mermaid dataset [28], which consists of 321 high-resolution images resized to 512 × 512 pixels. The results indicate that the proposed Swin model achieved a processing speed of 26.75 fps for selecting the optimal enhancement method, demonstrating that it introduces only a minimal delay before the enhancement is applied and that the decision-making step integrates efficiently into the overall processing pipeline. The average RAM usage of the Swin model was 1375.44 MB, a higher memory requirement than the simpler methods, which had minimal memory usage. The proposed Swin model showed 49.74% CPU usage and 47.79% GPU usage, whereas traditional methods such as RGB Stretching and RGHS exhibited near-zero CPU and GPU usage, indicating minimal computational demand. Notably, the CPU percentage reported here refers to the usage of a single CPU core, providing a more granular view of the computational demands. The results are shown in Table 6.
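A sketch of how the throughput and memory figures can be measured is given below; model and loader stand for the trained classifier and a preprocessed-image loader, and psutil's resident-set size is assumed as the RAM measure. The exact profiling procedure used in this work may differ.

```python
import time
import psutil
import torch

@torch.no_grad()
def profile(model, loader, device: str = "cuda"):
    """Measure end-to-end decision throughput (fps) and resident memory (MB)."""
    model.to(device).eval()
    proc = psutil.Process()
    n, start = 0, time.perf_counter()
    for images, _ in loader:               # batches of preprocessed images
        model(images.to(device))
        n += images.size(0)
    if device.startswith("cuda"):
        torch.cuda.synchronize()           # wait for queued GPU work before stopping the clock
    fps = n / (time.perf_counter() - start)
    ram_mb = proc.memory_info().rss / 2 ** 20   # resident set size of this process
    return fps, ram_mb
```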