Article

Automating an Encoder–Decoder Incorporated Ensemble Model: Semantic Segmentation Workflow on Low-Contrast Underwater Images

Department of Computer Engineering, Mersin University, 33110 Mersin, Türkiye
Appl. Sci. 2024, 14(24), 11964; https://doi.org/10.3390/app142411964
Submission received: 8 November 2024 / Revised: 13 December 2024 / Accepted: 18 December 2024 / Published: 20 December 2024
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Numerous methods have been proposed for semantic segmentation, and the state of the art is largely dominated by deep learning-based methods, which show salient performance. This study addresses the challenge of semantic segmentation in low-contrast, class-imbalanced underwater images. Nine encoder–decoder fusion models were trained as a downstream workflow task using Dice Loss and Focal Loss to address the imbalanced data. The two most effective encoder–decoder fusions, Res34+Unet and VGG19+FPN, yielded better performance than the other models, with average mIoU values of 0.592 and 0.590 and F1-scores of 0.510 and 0.491, respectively. Using a weight-optimization algorithm, the ensemble model built from the recreated IoU results improves on both Res34+Unet and VGG19+FPN, reaching an average mIoU of 0.652, an improvement of about 0.06. The ensemble model combines the independent models by considering their inference accuracy on a per-class basis and emphasizing the better model for each class.

1. Introduction

Many semantic segmentation studies in the computer vision area have been conducted with multi-class datasets taken from imaging systems with different characteristics [1,2]. Thanks to deep learning models, very successful results can be obtained in segmentation and detection studies on large image datasets even when the classes are unbalanced. Unfortunately, performance-improvement studies on underwater datasets are still needed [3]. One line of work aims to increase the visual quality of underwater images, which must cope with low contrast and poor visibility, by combining global and local contrast-enhancement techniques [4]. Due to the difference in pressure between water and air, detecting moving objects underwater is a problem that needs to be dealt with; in addition, decreasing visibility and changing illumination levels can make the situation more difficult. A dataset presented in this regard, Semantic Segmentation of Underwater Images (SUIM), is the largest underwater dataset labeled for semantic segmentation [5]. Class imbalance, a problem that needs to be dealt with in performance evaluations, is also encountered in this dataset.
In the computer vision research domain, semantic segmentation tasks predict pixel-wise masks and class categories of objects of interest based on RGB or hexadecimal-coded pixels. Numerous methods have been proposed for semantic segmentation, and the state of the art is largely dominated by deep learning-based methods, which show salient performance. Methods now exist that add modules to the Mask R-CNN algorithm, one of the first deep learning-based semantic segmentation methods, expanding the receptive fields of each layer and improving target classification [6]. Trending studies today focus on determining more precise boundary regions. Various deep learning architectures have addressed semantic segmentation in numerous domains. Ref. [7] proposed a new method for tackling multi-class semantic segmentation, obtaining better results than state-of-the-art methods by fine-tuning with annotations to address the problem of background ambiguities. In multi-class semantic segmentation, deep symmetric convolutional neural networks dominate the algorithmic state of the art. The effect of label noise on the performance of semantic segmentation with U-Net has been examined using satellite images [8]. Semantic segmentation was performed in another study to predict damage classes using Convolutional Neural Networks (CNNs) on a multi-class, class-imbalanced dataset [9].
A new framework that leverages CoordConv and Group Normalization based on Mask R-CNN to improve generalization for underwater marine animals has been proposed [3]. In light of these studies addressed by deep learning architectures, an encoder–decoder incorporated workflow was carried out, relying on the generalization ability of the architectures without any image-enhancement preprocessing. The key points of this study are as follows:
  • Working on images taken underwater brings the problem of low-contrast image quality. The images in the SUIM dataset have different resolutions. Considering the computational cost, and to obtain a standard input suitable for all networks, all the RGB images and annotations were cropped into 256 × 256 × 3 patches. Afterwards, patches that do not contain any class information were detected and eliminated from both the annotation list and the corresponding image path list. Cropping the images into patches helps to capture the broad connections between objects in the image; moreover, focusing only on class objects without any pixel loss remains sufficient for whole-scene segmentation, especially for such a low-contrast image dataset.
  • As a workflow strategy to evaluate the training schemes, different variations of encoder–decoder fusions were used, increasing the generalization ability of the models. Dice loss and Focal loss were used in the models to address issues such as class imbalance and insufficient segmentation, and with this method, the predictions for all objects were generalized.
  • Finally, instead of evaluating the success of the two best models separately, a general performance success was achieved with an ensemble model. The ensemble model combines the independent models by considering their inference accuracy on a per-class basis and emphasizing the better model for each class. This makes the ensemble model feasible for real-time underwater applications, especially under offline circumstances. In this study, the ensemble model was constituted from the fusion of the Res34+Unet and VGG19+FPN models, which obtained the best mIoU results.

2. Related Work

In model-development studies on underwater images, various image-restoration methods apply mathematical models to the images in the preprocessing stage, eventually obtaining higher-quality underwater images, or evaluate the results by combining pre-trained segmentation and classification methods within a deep learning framework. To investigate and detect objects in satellite images, ten different object classes were determined using the U-Net model for semantic segmentation [10]. Studies have been conducted on satellite images with a feature pyramid network (FPNet), which uses convolutional neural networks to combine feature maps from different scales [11]. Experiments have been conducted using the SegNet application and the SUIM dataset, comparing image quality-enhancement and data-augmentation methods [12]. These are the image-segmentation networks used initially, but nowadays, backbone modifications or ensemble methods are widely used due to their efficiency. In this regard, semantic segmentation was performed on breast cancer images with the DenseNet-121 model and the attention-based pyramid scene-parsing network called Att-PSPnet [13]. A new method has been proposed for segmenting underwater images using the Unet architecture with DenseNet as an encoder for semantic segmentation tasks [14].
Another work introduces a new data-driven framework called DatUS [15]. This framework provides dense semantic segmentation for unlabeled image datasets using a self-supervised image transformer and is trained on low-contrast images. A loss function based on the active contour and level set methods has been proposed, and the integration of these methods with CNN architectures such as U-Net and DeepLabV3+ is studied on underwater datasets such as SUIM, RockFish, and DeepFish [16]. In a study on multi-class semantic segmentation of high-resolution aerial images [17], a modified residual block and a dense spherical spatial pyramid pooling module were combined with the U-Net scheme, achieving better results than SOTA methods. Adding two objects with Gaussian blurring gave the best results, and the performance was improved by using synthetic data [18]. In another study, a visual serving method that is more efficient than SOTA approaches and provides a balance between performance and computational efficiency was proposed [19]. In a study where a dual-resolution representation was developed [20] using fine superpixels for rare classes and coarse superpixels for background areas, a balanced semantic segmentation result was achieved within the scope of CNN feature representation for both frequent and rare classes, thanks to the shape information obtained. The backbone network, which is used in semantic segmentation as well as image classification tasks, represents the main structure of the general network. This backbone structure is adopted by networks such as VGG-Net, AlexNet, GoogLeNet, and ResNet, which significantly outperform much deeper models on various challenging tasks [21]. Another study proposes UISS-Net, a network for underwater scenes that uses ResNet18, ResNet34, and ResNet50 as backbone networks on top of Unet to improve the feature-extraction performance of the backbone network [22].
Vision Transformers (ViTs) are also successful in image classification; however, they face several challenges when directly applied to dense prediction tasks such as semantic segmentation and object detection [23]. To overcome challenges like low visibility and variable lighting conditions, where traditional methods fall short, a hybrid model combining Swin Transformer and ConvMixer mechanisms, called SwinConvMixerUNet, has been developed [24]. To improve the quality of semantic segmentation, another work introduces a new deep learning framework called Swin Transformer U-Net (DS-TransUNet), utilizing the hierarchical structure of Swin Transformer in both the encoder and decoder [25]. Another study proposed a UNet model based on multiple attention mechanisms for the semantic segmentation of remote sensing images, known as Multi Attention–Unet (MA-Unet) [26]. MA-UNet uses a residual encoder with a simple attention module to improve the extraction of fine-detailed features. For CT image segmentation, an attention-augmented U-Net model (AA-U-Net) has been proposed [27]. AA-U-Net integrates attention-augmented convolutions in the encoder and decoder architecture to more accurately detect COVID-19 lesions.

3. Materials and Methods

In this section, the proposed workflow is presented along with specific details regarding the algorithmic steps. The workflow consists of several parts: Section 3.1 describes the original RGB images and annotation statistics, and Section 3.2 presents the patchifying procedure. Section 3.3 elaborates on the encoder–decoder architectures, which consist of two parts: Section 3.3.1 gives the encoder variants that extract feature maps from the preprocessed images based on pre-trained backbones, while Section 3.3.2 gives the deep learning-based decoder architectures. Finally, Section 3.4 describes the model-weighting calculation and the weight optimization.

3.1. Data Description

The dataset, released in 2020 by the University of Minnesota, consists of 1550 images split into separate sets: 1440 annotated images belong to the training and validation sets and 110 images to the test folder. Sample images and annotations are given in Figure 1.
The class distribution consists of 5000 labelled objects categorized into eight classes: waterbody_background, fish_and_vertebrates, reefs_and_invertebrates, sea-floor_and_rocks, human_divers, wrecks_and_ruins, aquatic_plants_and_sea-grass, and robots. Class colors and the corresponding labels are shown in the table at the bottom right of Figure 2. The image curve shows in how many images each class is observed, while the objects-on-images curve denotes the number of images containing pixel-wise objects of each class.

3.2. Patch Generation

Since all the underwater images and annotations are three-channel RGB with different resolutions, segmenting them directly may lead to high computational costs. To reduce the computational cost while avoiding resizing operations that cause pixel loss, the images and annotations were transformed in the pre-training stage into 256 × 256 × 3 patches from their original sizes, resulting in a different number of patches for each original image. After extracting the patches for the input images and annotations, patches that do not contain any class information were removed from both the annotation list and the corresponding input patch list. Finally, the 256 × 256 × 3 image and annotation patches were given as input to the encoder region of the network architectures.
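The sketch below illustrates this patch-extraction step. It is a minimal example, assuming the patchify library is used for non-overlapping cropping and that a patch is considered empty when its colour-coded annotation contains only a single colour; the file paths and function names are hypothetical and not taken from the paper's code.

```python
import numpy as np
import cv2
from patchify import patchify  # assumed helper for non-overlapping cropping

PATCH = 256

def extract_patches(image_path, mask_path):
    """Crop an RGB image and its colour-coded annotation into 256x256x3 patches
    and drop patch pairs whose annotation carries no class information."""
    image = cv2.imread(image_path)   # H x W x 3
    mask = cv2.imread(mask_path)     # same size, colour-coded labels

    # Non-overlapping 256x256x3 patches; leftover border pixels are discarded.
    img_patches = patchify(image, (PATCH, PATCH, 3), step=PATCH).reshape(-1, PATCH, PATCH, 3)
    msk_patches = patchify(mask, (PATCH, PATCH, 3), step=PATCH).reshape(-1, PATCH, PATCH, 3)

    keep_imgs, keep_msks = [], []
    for img_p, msk_p in zip(img_patches, msk_patches):
        # Keep the pair only if the annotation patch contains more than one colour,
        # i.e., it carries some class information (assumed emptiness criterion).
        if len(np.unique(msk_p.reshape(-1, 3), axis=0)) > 1:
            keep_imgs.append(img_p)
            keep_msks.append(msk_p)
    return np.array(keep_imgs), np.array(keep_msks)
```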

3.3. Encoder–Decoder Architectures

It is important to design an efficient architecture for semantic segmentation tasks that will detect all objects in an image at the pixel level and extract regions where objects belong to predefined classes. In the development process of encoders, the use of convolutional layers is a very common solution to extract the form, visual look, and spatial relationship between categories. Some deep architectures have been presented where convolutional layers are used to lead the segmentation phase in the backbone part of the semantic segmentation task.

3.3.1. Encoder Architectures

The encoder learns gradually during the feature-extraction process and reduces the spatial relationship with gradual convolution operations. The features learned by the encoder include low-level features such as edge, contour, and light, while high-level features have more semantic category information.
VGGNet as Backbone. VGG is a deep learning network that combines the ability to perform 3 × 3 convolutions with low computational cost and the capacity to extract complex features. It gained this generalization ability thanks to its network-in-network style architecture [28]. Therefore, it is considered a backbone option in segmentation applications.
ResNet as Backbone. ResNet has a deep network architecture and uses a 7 × 7 convolution kernel. It avoids large fully connected layers and performs average pooling operations that help prevent overfitting [29]. This increases performance and improves precision. Therefore, ResNet is a priority architecture among backbone preferences.
MobileNet as Backbone. To reduce the computational complexity of the VGG family, the MobileNet network model was developed; it improves the real-time performance of computer vision tasks [30]. MobileNet generates six feature maps with different dimensions for back-end detection, which makes the network precise and well suited to semantic segmentation tasks.

3.3.2. Decoder Architectures

The decoder reloads the spatial relationships of the features produced by the encoder and leads the upsampling processes to produce prediction results with the same resolution as the input image. During decoding, the existence of both large and small samples of the same object class at different scales in the same underwater image must be considered. In this study, VGG19, MobileNet, and ResNet34 were used as backbones on the SUIM dataset. The encoder–decoder fusion architectures VGG19+FPN and Res34+Unet are given in Figure 3.
Unet as decoder. Unet has a unique U-shaped architecture comprising an encoder–decoder path. The encoder path contains four blocks, each with two convolutional layers followed by a ReLU activation function and one max-pooling layer for downsampling. In the decoder section, upsampled feature maps are concatenated with the corresponding encoder outputs [31].
Linknet as decoder. Linknet [32] provides low computational cost with 11.5 M parameters, in contrast to UNet, which is widely used in medical segmentation. During the feature-extraction process in the UNet architecture, which creates a unified mask, problems such as masks overlapping each other or appearing too close to each other may occur.
FPN (Feature Pyramid Network) as decoder. FPN [33] is a feature extractor that takes a single-scale image of arbitrary size as input and extracts feature maps at multiple levels in a fully convolutional manner. It can fully connect the information of feature maps at different scales, balancing the spatial and semantic information of each feature map.
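As a minimal sketch of how the two best-performing fusions described above could be instantiated, the example below assumes the qubvel segmentation_models Keras library (a library choice made for illustration, not stated in the paper), eight SUIM classes, and 256 × 256 × 3 input patches.

```python
import segmentation_models as sm

N_CLASSES = 8                 # SUIM classes
INPUT_SHAPE = (256, 256, 3)   # patch size used in this study

# ResNet34 encoder with a U-Net decoder (Res34+Unet)
res34_unet = sm.Unet(
    backbone_name="resnet34",
    input_shape=INPUT_SHAPE,
    classes=N_CLASSES,
    activation="softmax",
    encoder_weights="imagenet",
)

# VGG19 encoder with an FPN decoder (VGG19+FPN)
vgg19_fpn = sm.FPN(
    backbone_name="vgg19",
    input_shape=INPUT_SHAPE,
    classes=N_CLASSES,
    activation="softmax",
    encoder_weights="imagenet",
)
```

Swapping the backbone_name and the decoder constructor (Unet, FPN, or Linknet) in the same way would yield the other fusion combinations evaluated in this study.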

3.4. Weighted-Based Model-Optimization Algorithm

Instead of evaluating the success of several models separately, a general performance success can be achieved with a new single model that incorporates the capabilities of the successful models that created it, by searching for the most appropriate weights through a weighting-strategy algorithm. Moreover, more efficient mIoU values can be obtained for the new model from the fusion of the base models, as stated in Algorithm 1.
Algorithm 1 Weighted-based model-optimization algorithm
1. Input
   Data: test images D = {x1, x2, …, xm}, i = 1, 2, …, m
   Ypred: predictions of each base model
2. Initialization
   Set the dataset parameters: Cnum (number of classes), Range_ofWeights = (0, 1)
   Wofmodel = [w1/10, w2/10, …, wk/10], a weight array over the base models, where k is the number of finalist models
3. Prediction
   Search for the combination of Wofmodel that gives the maximum IoU:
   for each wi in Range_ofWeights:
       compute the mean IoU value for each class: Wts_IoU = meanIoU(Cnum)
       sum the predictions of all models along the specified axis and take the maximum prediction: Wts_Ensemble = tensordot(Ypred, Wofmodel).max
       measure the metrics (AUC, IoU) and store them: wts_ensemble_IoU = GenerateMetrics(D, Wts_Ensemble)
   return wts_ensemble_IoU results
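The snippet below is an illustrative re-implementation of Algorithm 1: a coarse grid search over per-model weights that maximizes the mean IoU of the weighted ensemble prediction. It assumes each base model outputs softmax probability maps of shape (N, H, W, C) and that the ground truth is one-hot encoded; all variable and function names are hypothetical, not taken from the paper's code.

```python
import itertools
import numpy as np

def mean_iou(y_true_onehot, y_pred_labels, n_classes):
    """Per-class IoU = TP / (TP + FP + FN), averaged over the classes present."""
    y_true_labels = np.argmax(y_true_onehot, axis=-1)
    ious = []
    for c in range(n_classes):
        t, p = (y_true_labels == c), (y_pred_labels == c)
        union = np.logical_or(t, p).sum()
        if union > 0:
            ious.append(np.logical_and(t, p).sum() / union)
    return float(np.mean(ious))

def search_weights(preds, y_true_onehot, n_classes, steps=10):
    """Grid search over weights in (0, 1) for each base model.

    preds: list of probability maps, one (N, H, W, C) array per base model.
    Returns the best weight tuple and the corresponding mean IoU.
    """
    stacked = np.stack(preds)                       # (k, N, H, W, C)
    grid = [i / steps for i in range(steps + 1)]
    best_weights, best_score = None, -1.0
    for weights in itertools.product(grid, repeat=len(preds)):
        if sum(weights) == 0:
            continue
        # Weighted sum of probability maps, then per-pixel argmax prediction.
        ensemble = np.tensordot(np.array(weights), stacked, axes=1)
        labels = np.argmax(ensemble, axis=-1)
        score = mean_iou(y_true_onehot, labels, n_classes)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score
```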

4. Experiments and Analysis

4.1. Evaluation Metrics

To validate the semantic segmentation performance of the trained models, the F1-score and IoU-score between the predicted results and the ground-truth annotations were measured as the main performance metrics on the patchified SUIM dataset. The IoU metric, often named the Jaccard index, is the ratio of the intersection to the union between the prediction and the target. The mean intersection over union (mIoU) is one of the common measures used in semantic segmentation tasks. The intersection corresponds to the true positives (TP), while the union corresponds to the sum of TP, false positives (FP), and false negatives (FN).
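For reference, the pixel-wise IoU and its class-averaged form described above can be written as follows (a standard formulation, restated here for clarity):

```latex
\mathrm{IoU} = \frac{|P \cap G|}{|P \cup G|} = \frac{TP}{TP + FP + FN},
\qquad
\mathrm{mIoU} = \frac{1}{C}\sum_{c=1}^{C}\mathrm{IoU}_c
```

where P is the predicted segmentation mask, G is the ground-truth annotation, and C is the number of classes.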

4.2. Dice Loss and Focal Loss Functions

The use of the similarity-based Dice Loss and the Focal Loss function is quite common in image-segmentation applications on data with imbalanced class distributions. Dice Loss focuses on the problem of data imbalance between the foreground and background, while Focal Loss handles the problem of imbalance between positive and negative examples [34]. Balancing the data during training allows the model to focus on the regions that are more difficult to segment and increases its sensitivity to complex samples rather than simple scenarios. In this study, the total loss function for SUIM semantic segmentation was generated by evaluating the Dice loss and Focal loss functions together.
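A hedged sketch of this combined objective is shown below, assuming the Dice and categorical Focal loss implementations of the segmentation_models Keras library (a library choice made for illustration, not a statement of the paper's exact implementation).

```python
import segmentation_models as sm

dice_loss = sm.losses.DiceLoss()               # mitigates foreground/background imbalance
focal_loss = sm.losses.CategoricalFocalLoss()  # down-weights easy, well-classified pixels

# Summed Dice + Focal objective used as the total training loss.
total_loss = dice_loss + (1 * focal_loss)
```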

5. Results

All the hyperparameters, such as the number of epochs, the optimization function, the activation functions, and the learning rate, need to be set before training. The image patch dataset was randomly split into 80% for training and 20% for validation. Experiments were performed on a combined dataset of 256 × 256 patch images with annotations, with 1152 images used for training and 288 images for validation under three-fold cross-validation. A total of 110 of the 1550 images were reserved for testing. During training, random affine deformations were applied for data augmentation. The Adam optimizer [35], one of the most common optimizers for faster learning in deep learning, was used in this study; with fine-tuning, the Adam optimizer achieves good results quickly. Sparse categorical cross-entropy was implemented and an initial learning rate of 0.0001 was chosen. The training and testing processes were run on an NVIDIA A100 graphics card. The batch size was set to 16 and 100 epochs were used for each of the nine training processes.
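A training-configuration sketch matching the hyperparameters reported above (Adam optimizer, learning rate 1e-4, batch size 16, 100 epochs) is given below; the metric objects, the model variable, and the train/validation arrays are assumptions of this example rather than the paper's code.

```python
import segmentation_models as sm
from tensorflow import keras

# Validation metrics reported in this study: IoU-score and F1-score.
metrics = [sm.metrics.IOUScore(threshold=0.5), sm.metrics.FScore(threshold=0.5)]

model = res34_unet  # e.g., one of the fusion models sketched in Section 3.3
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-4),
    loss=total_loss,              # combined Dice + Focal loss from Section 4.2
    metrics=metrics,
)

history = model.fit(
    X_train, y_train,             # hypothetical 256x256x3 patches and one-hot masks
    validation_data=(X_val, y_val),
    batch_size=16,
    epochs=100,
)
```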
The study undertakes nine experiments based on VGG19, ResNet34, and MobileNet as encoder backbones combined with decoder stages. The deep learning-based decoders U-Net, FPN, and LinkNet were evaluated for segmenting class objects using the mIoU and F1-score metrics. The training results of the fusion combinations are given in Figure 4.
The encoder–decoder networks trained with the Dice and Focal loss functions obtained the overall best F1-scores and mIoU-scores, and the Res34+Unet and VGG19+FPN fusions were observed to be the most stable across all encoder–decoder fusions on validation, as shown in Table 1. A similar statistical trend was observed for the training results. With a validation F1-score of 0.510 and mIoU of 0.592, Res34+Unet outperformed the other encoder–decoder fusions. VGG19+FPN, with a validation loss of 0.849, F1-score of 0.491, and mIoU of 0.590, ranked second with these values in descending order. These two models were adopted for the ensemble weighted model algorithm for final model testing.
Since the Res34+Unet and VGG19+FPN models were adopted for the ensemble weighted model algorithm, the final test mIoU results are given in Table 2. The proposed ensemble model's mIoU of 0.652 is 0.06 higher than that of Res34+Unet and VGG19+FPN.
The raw and annotated images are shown in Figure 5. Objects belonging to the eight classes are marked in the annotations. The prediction of VGG+FPN dominates the ensemble prediction for raw image #1, while the prediction of Res34+Unet dominates the ensemble prediction for raw image #4. This emphasizes that the most successful model is more dominant in the ensemble model's performance. The proposed model performed similarly to or better than the other models and visually gave more robust and precise segmentation results, with better definition of the semantic class distinctions.

6. Conclusions

In this study, the performances of semantic segmentation models on low-contrast SUIM images were compared using nine different encoder–decoder networks trained with a softmax activation function. Moreover, Dice Loss and Focal Loss were employed to address the problem of data imbalance in the SUIM dataset.
Different variations of encoder–decoder dense framework fusions were evaluated as a workflow strategy for semantic segmentation tasks, rather than using only Unet, Linknet, or FPN. The mIoU and F1-score values obtained from the nine training processes were ranked from high to low according to performance. The two most successful encoder–decoder fusion models, Res34+Unet and VGG19+FPN, were selected for the final ensemble stage. Section 3.4 described how to obtain a new model that incorporates the capabilities of the Res34+Unet and VGG19+FPN models using a weighted-based model-optimization algorithm. The ensemble model with the recreated IoU results proved to be a better alternative to both Res34+Unet and VGG19+FPN, emphasizing that the most successful model is more dominant in the ensemble model's performance.
Finally, instead of evaluating the success of these two models separately, a general performance success was achieved with a single model that searches for the most appropriate weights through a weighting-strategy algorithm. The ensemble model combines the independent models by considering their inference accuracy on a per-class basis and emphasizing the better model for each class. This makes the ensemble model feasible for real-time underwater applications, especially under offline circumstances. In this study, a better segmentation result was obtained with an automated encoder–decoder incorporated ensemble model for multi-class, low-contrast semantic segmentation tasks. Going forward, by combining two or more models that stand out for different datasets with the same algorithmic technique, a new ensemble model can be obtained that is still better than its base models.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset is provided by Minnesota Interactive Robotics and Vision Laboratory, University of Minnesota, and can be downloaded from the link https://drive.google.com/drive/folders/10KMK0rNB43V2g30NcA1RYipL535DuZ-h (accessed on 20 May 2024).

Conflicts of Interest

This study received no financial benefits from any organization and has no conflicts of interest.

References

  1. Van Rijthoven, M.; Balkenhol, M.; Siliņa, K.; Van Der Laak, J.; Ciompi, F. HookNet: Multi-resolution convolutional neural networks for semantic segmentation in histopathology whole-slide images. Med. Image Anal. 2021, 68, 101890. [Google Scholar] [CrossRef] [PubMed]
  2. Schutera, M.; Rettenberger, L.; Pylatiuk, C.; Reischl, M. Methods for the frugal labeler: Multi-class semantic segmentation on heterogeneous labels. PLoS ONE 2022, 17, e0263656. [Google Scholar] [CrossRef] [PubMed]
  3. Yi, D.; Ahmedov, H.B.; Jiang, S.; Li, Y.; Flinn, S.J.; Fernandes, P.G. Coordinate-Aware Mask R-CNN with Group Normalization: A underwater marine animal instance segmentation framework. Neurocomputing 2024, 583, 127488. [Google Scholar] [CrossRef]
  4. Ulutas, G.; Ustubioglu, B. Underwater image enhancement using contrast limited adaptive histogram equalization and layered difference representation. Multimed. Tools Appl. 2021, 80, 15067–15091. [Google Scholar] [CrossRef]
  5. Islam, M.J.; Edge, C.; Xiao, Y.; Luo, P.; Mehtaz, M.; Morse, C.; Enan, S.S.; Sattar, J. Semantic Segmentation of Underwater Imagery: Dataset and Benchmark. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 25–29 October 2020. [Google Scholar]
  6. He, W.; Wang, J.A.; Wang, L.; Pan, R.; Gao, W. A semantic segmentation algorithm for fashion images based on modified mask RCNN. Multimed. Tools Appl. 2023, 82, 28427–28444. [Google Scholar] [CrossRef]
  7. Lai, L.; Chen, J.; Zhang, C.; Zhang, Z.; Lin, G.; Wu, Q. Tackling background ambiguities in multi-class few-shot point cloud semantic segmentation. Knowl. Based Syst. 2022, 253, 109508. [Google Scholar] [CrossRef]
  8. Pranto, T.H.; Noman, A.A.; Noor, A.; Deepty, U.H.; Rahman, R.M. Effect of label noise on multi-class semantic segmentation: A case study on Bangladesh marine region. Appl. Artif. Intell. 2022, 36, 2039348. [Google Scholar] [CrossRef]
  9. Bajcsy, P.; Feldman, S.; Majurski, M.; Snyder, K.; Brady, M. Approaches to training multiclass semantic image segmentation of damage in concrete. J. Microsc. 2020, 279, 98–113. [Google Scholar] [CrossRef]
  10. Yadavendra, S.; Chand, S. Semantic segmentation and detection of satellite objects using U-Net model of deep learning. Multimed. Tools Appl. 2022, 81, 44291–44310. [Google Scholar] [CrossRef]
  11. Duan, C.; Belgiu, M.; Stein, A. Efficient Cloud Removal Network for Satellite Images Using SAR-optical Image Fusion. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
  12. Nunes, A.; Matos, A. Improving Semantic Segmentation Performance in Underwater Images. J. Mar. Sci. Eng. 2023, 11, 2268. [Google Scholar] [CrossRef]
  13. Samudrala, S.; Mohan, C.K. Semantic segmentation of breast cancer images using DenseNet with proposed PSPNet. Multimed. Tools Appl. 2024, 83, 46037–46063. [Google Scholar] [CrossRef]
  14. George, G.; Anusuya, S. Enhancing Underwater Image Segmentation: A Semantic Approach to Segment Objects in Challenging Aquatic Environment. Procedia Comput. Sci. 2024, 235, 361–371. [Google Scholar] [CrossRef]
  15. Kumar, S.; Sur, A.; Baruah, R.D. DatUS: Data-driven Unsupervised Semantic Segmentation with Pre-trained Self-supervised Vision Transformer. IEEE Trans. Cogn. Dev. Syst. 2024, 16, 1775–1788. [Google Scholar] [CrossRef]
  16. Chicchon, M.; Bedon, H.; Del-Blanco, C.R.; Sipiran, I. Semantic segmentation of fish and underwater environments using deep convolutional neural networks and learned active contours. IEEE Access 2023, 11, 33652–33665. [Google Scholar] [CrossRef]
  17. Priyanka, N.; Lal, S.; Nalini, S.; Reddy, J.; Dell’Acqua, F. DIResUNet: Architecture for multiclass semantic segmentation of high resolution remote sensing imagery data. Appl. Intell. 2022, 52, 15462–15482. [Google Scholar] [CrossRef]
  18. Pergeorelis, M.; Bazik, M.; Saponaro, P.; Kim, J.; Kambhamettu, C. Synthetic data for semantic segmentation in underwater imagery. In Proceedings of the OCEANS 2022, Hampton Roads, VA, USA, 17–20 October 2022; pp. 1–6. [Google Scholar]
  19. Kabir, I.; Shaurya, S.; Maigur, V.; Thakurdesai, N.; Latnekar, M.; Raunak, M.; Reza, M.A. Few-Shot Segmentation and Semantic Segmentation for Underwater Imagery. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; pp. 11451–11457. [Google Scholar]
  20. Yu, L.; Fan, G. DrsNet: Dual-resolution semantic segmentation with rare class-oriented superpixel prior. Multimed. Tools Appl. 2021, 80, 1687–1706. [Google Scholar] [CrossRef]
  21. Liu, X.; Deng, Z.; Yang, Y. Recent progress in semantic image segmentation. Artif. Intell. Rev. 2019, 52, 1089–1106. [Google Scholar] [CrossRef]
  22. He, Z.; Cao, L.; Luo, J.; Xu, X.; Tang, J.; Xu, J.; Chen, Z. UISS-Net: Underwater Image Semantic Segmentation Network for improving boundary segmentation accuracy of underwater images. Aquac. Int. 2024, 32, 5625–5638. [Google Scholar] [CrossRef]
  23. Thisanke, H.; Deshan, C.; Chamith, K.; Seneviratne, S.; Vidanaarachchi, R.; Herath, D. Semantic segmentation using Vision Transformers: A survey. Eng. Appl. Artif. Intell. 2023, 126, 106669. [Google Scholar] [CrossRef]
  24. Pavithra, S. An efficient approach to detect and segment underwater images using Swin Transformer. Results Eng. 2024, 23, 102460. [Google Scholar]
  25. Lin, A.; Chen, B.; Xu, J.; Zhang, Z.; Lu, G.; Zhang, D. Ds-transunet: Dual swin transformer u-net for medical image segmentation. IEEE Trans. Instrum. Meas. 2022, 71, 1–15. [Google Scholar] [CrossRef]
  26. Sun, Y.; Bi, F.; Gao, Y.; Chen, L.; Feng, S. A multi-attention UNet for semantic segmentation in remote sensing images. Symmetry 2022, 14, 906. [Google Scholar] [CrossRef]
  27. Rajamani, K.T.; Rani, P.; Siebert, H.; ElagiriRamalingam, R.; Heinrich, M.P. Attention-augmented U-Net (AA-U-Net) for semantic segmentation. Signal Image Video Process. 2023, 17, 981–989. [Google Scholar] [CrossRef]
  28. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  29. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  30. Chiu, Y.C.; Tsai, C.Y.; Ruan, M.D.; Shen, G.Y.; Lee, T.T. Mobilenet-SSDv2: An improved object detection model for embedded systems. In Proceedings of the 2020 International Conference on System Science and Engineering (ICSSE), Kagawa, Japan, 31 August–3 September 2020; pp. 1–5. [Google Scholar]
  31. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  32. Chaurasia, A.; Culurciello, E. LinkNet: Exploiting The encoder Representations for Efficient Semantic Segmentation. In Proceedings of the IEEE Visual Communications and Image Processing (VCIP), St. Petersburg, FL, USA, 10–13 December 2017. [Google Scholar]
  33. Jin, L.; Liu, G. An approach on image processing of deep learning based on improved SSD. Symmetry 2021, 13, 495. [Google Scholar] [CrossRef]
  34. Zhao, R.; Buyue, Q.; Xianli, Z.; Yang, L.; Rong, W.; Yang, L.; Yinggang, P. Rethinking dice loss for medical image segmentation. In Proceedings of the 2020 IEEE International Conference on Data Mining (ICDM), Sorrento, Italy, 20 November 2020; pp. 851–860. [Google Scholar]
  35. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
Figure 1. Image samples with corresponding annotations from the SUIM dataset.
Figure 2. The image curve shows in how many images the classes are observed. The objects-on-images curve denotes the number of images containing pixel-wise objects by class.
Figure 3. ResNet34+Unet and VGG19+FPN architecture fusions shown as encoder–decoder flows. Input image patches are fed to the first layer.
Figure 4. (a) Loss, F1-score, and IoU-score progress for training. (b) Loss, F1-score, and IoU-score progress for validation.
Figure 5. Test image and annotation samples, prediction of Res34+Unet, VGG+FPN, and ensemble model, respectively.
Table 1. Evaluation performance results of deep learning models for multi-class semantic segmentation on the SUIM dataset.

Model                   Training                      Validation
                        Loss    F1-Score  mIoU        Loss    F1-Score  mIoU
Res34+Unet              0.851   0.515     0.601       0.852   0.510     0.592
VGG19+FPN               0.818   0.630     0.723       0.849   0.491     0.590
Mobilenet V2+FPN        0.839   0.520     0.623       0.851   0.468     0.566
Res34+FPN               0.837   0.536     0.641       0.856   0.463     0.561
VGG19+Linknet           0.883   0.439     0.500       0.878   0.443     0.501
Res34+Linknet           0.881   0.409     0.476       0.876   0.425     0.488
Mobilenet V2+Unet       0.883   0.359     0.416       0.882   0.380     0.436
VGG19+Unet              0.889   0.371     0.433       0.886   0.375     0.436
Mobilenet V2+Linknet    0.897   0.267     0.310       0.895   0.283     0.326
Table 2. mIoU evaluation of the proposed ensemble model.

Model                      mIoU
Res34+Unet                 0.592
VGG19+FPN                  0.590
Proposed Ensemble Model    0.652