Article

Semantic Segmentation of Corrosion in Cargo Containers Using Deep Learning

Institute of Electronics and Informatics Engineering of Aveiro (IEETA), 3810-193 Aveiro, Portugal
* Author to whom correspondence should be addressed.
Sustainability 2025, 17(14), 6480; https://doi.org/10.3390/su17146480
Submission received: 28 March 2025 / Revised: 30 June 2025 / Accepted: 1 July 2025 / Published: 15 July 2025

Abstract

As global trade expands, the pressure on container terminals to improve efficiency and capacity grows. Several inspections are performed during the loading and unloading process to minimize delays. In this paper, we focus on corrosion, a persistent threat that compromises the durability of containers and leads to costly repairs. Identifying this threat is no simple task: corrosion can take many forms, progress unpredictably, and be influenced by various environmental conditions and container types. In collaboration with the Port of Sines, Portugal, this work explores a potential solution based on a real-time computer-vision system, with the aim of improving container inspections using deep-learning algorithms. We propose a system based on the semantic segmentation model DeepLabv3+ for precise corrosion detection using images provided by the terminal. After preparing the data and annotations, we explored two approaches. First, we leveraged a pre-trained model originally designed for bridge corrosion detection. Second, we fine-tuned a version specifically for cargo container assessment. With a corrosion detection performance of 49%, this work showcases the potential of deep learning to automate inspection processes. It also highlights the importance of generalization and training in real-world scenarios and explores innovative solutions for smart gates and terminals.

Graphical Abstract

1. Introduction

In recent decades, globalization has increased economic activity and international trade. Consequently, the aggregate volume of containers has exhibited an average annual growth rate of 3.84%, resulting in a substantial 56% increase in total volume from 2010 to 2022 [1]. Containers are standardized metal boxes, typically composed of Corten steel panels designed to protect cargo from harsh weather conditions and theft. They come in two main sizes, the twenty-foot equivalent unit (TEU) and the forty-foot equivalent unit (FEU), with fixed width and length and varying heights. This uniform structure makes them versatile for intermodal transport, allowing them to be properly loaded and stacked. Containers are classified across several categories, such as general cargo, open-top, flat-rack, platform, refrigerated, and tank units. Each category is distinguished by unique attributes and structural designs that comply with the ISO 1496 [2] and 668 [3] series, as well as related standards [4,5,6]. Containerization replaced traditional transportation methods, such as break-bulk cargo handling, significantly reducing costs and loading times while maximizing cargo capacity. This transformation has enabled more cost-effective and efficient global trade connections [7].
Container terminals function as transfer stations for cargo flow and integrity control. Terminals are generally composed of two external regions: the quayside, where ships are loaded and unloaded, and the landside, where containers are transferred to trucks and trains. Globalization and technological advancements have accelerated the growth of container terminals in seaports, requiring terminals to constantly adapt and look for innovative and efficient operational strategies. These solutions reduce transshipment times and costs to maintain a smooth flow of logistic operations in ports. In response to this demand, terminals continue to evolve and incorporate faster and more efficient technologies [4,5,6]. Smart gates have revolutionized the way truck drivers process container pick-ups at maritime ports. These systems function as integrated gate systems at entry and exit points, optimizing operations, procedures, and security through the use of technologies such as sensors, cameras, optical character recognition (OCR), and artificial intelligence (AI) [8].
Inspection processes in container terminals and gates, such as code recognition, seal detection, and damage detection, are among the operations suited to automation. Containers are constantly exposed to harsh environmental conditions, heavy loads, and frequent handling, all of which contribute to wear and tear over time. Traditionally, inspections are a manual process based on the subjective judgment of trained professionals, making this conventional approach not only labor-intensive and time-consuming but also susceptible to human error and inconsistencies. Without regular and reliable monitoring, there is an increased chance that containers will be damaged, which puts cargo integrity at risk. To minimize manual labor and increase efficiency, terminals are exploring computer-vision (CV) solutions capable of performing autonomous inspections [9]. Computer-vision technology allows devices to understand visual data as closely as possible to human perception. Through the use of imaging devices and algorithms, these systems can recognize patterns and anomalies within objects [10]. These algorithms require a large amount of prepared data to perform visual tasks. Given the vast amount of data collected from port cameras in container terminals, with proper preparation it is possible to perform automatic inspections using deep-learning (DL) algorithms. This technology allows the system to detect anomalies in containers, improving not only logistic operations but also the accuracy and reliability of inspections. Although these combined methods provide solutions that could revolutionize the whole sector, the port industry still faces notable challenges in fully deploying these systems [9].
This study proposes a novel and comprehensive blueprint for the detection of corrosion in cargo containers using DL. The blueprint includes critical steps, such as the careful preparation of data and pixel-level annotations, the evaluation of annotation tools, and the review of optimal training configurations. Notwithstanding the limited dataset size, the proposed approach exhibits a marked improvement in performance when compared to existing pre-trained models. This finding demonstrates the viability of real-world deployment and underscores the importance of domain-specific fine-tuning. In contrast to previous related studies that employed image classification [11], object detection [12,13], or instance segmentation [14] for container corrosion detection, our work adopts a semantic segmentation approach using the state-of-the-art DeepLabv3+ architecture. This enables precise, pixel-level localization of corrosion, offering a more detailed and scalable solution for real-world inspection scenarios compared to traditional object-detection or image-classification methods. While object detection identifies bounding boxes and classification assigns global labels to images, semantic segmentation provides dense, per-pixel labeling that is critical for accurately mapping the extent and shape of corrosion.
Importantly, semantic segmentation strikes an effective balance between accuracy and computational efficiency. Although instance segmentation is more detailed, it requires multiple processing stages. Semantic segmentation, on the other hand, typically involves a single, streamlined prediction step. This makes semantic segmentation significantly less demanding on local hardware and practical for systems with limited resources. Given that our training data was collected from images captured in uncontrolled environments, the trained model demonstrates promising generalization capabilities in challenging real-world scenarios, as later discussed in Section 4. While the current system offers a broad estimation of corrosion presence and extent, the results remain preliminary. Further iterations, particularly with a larger and more diverse dataset, are essential to enhance robustness and accuracy. Additionally, practical deployment will require system-level design, including the development of an inspection gate, camera placement strategy, and criteria for triggering human intervention based on the ratio of corroded area to total container surface. These components are critical steps for future work toward a fully operational corrosion detection system.
The following paper is structured as follows. Section 2 describes the methodology, including the pre-processing steps for building a custom dataset, a review of segmentation models and the selection of the model for training and testing, along with the strategies and experiments conducted. Section 3 describes the experiment results, highlighting the performance of the proposed approach on several configurations. Section 4 discusses the findings, limitations, and potential improvements. Finally, Section 5 concludes the study and outlines directions for future work.

Related Work

Several articles have studied the detection of surface defects. Regarding steel surface detection, two strategies emerged. Duan et al. [15] proposed an improved faster region-based convolutional neural network (R-CNN) and compared its performance against that of different object-detection algorithms. The model focuses on improving the feature extraction network by using a ResNet backbone, a deformable convolution layer, and a multi-scale feature layer similar to a feature pyramid network (FPN). Region of interest (ROI) align was also used, resulting in greater accuracy than that of base object-detection algorithms. Huang et al. [16] proposed an improved algorithm based on the You Only Look Once (YOLO)v5 model to detect defects on steel surfaces. This model uses a deformable convolution layer instead of the original convolution layer. Additionally, the model uses an attention mechanism called convolutional block attention module (CBAM) and an alternative loss function called focal efficient intersection over union (EIOU). To cluster the anchor boxes, the model uses a k-means algorithm. The model was tested with different variations, with a new feature from the prior techniques added in each new test. The authors achieved the best performance by combining all features in the final model. The final model surpassed the performance of the original YOLOv5 model and some faster R-CNN variations.
One significant problem that has been addressed among various challenges is the detection of road damage. Pham et al. [17] proposed a faster R-CNN model with an FPN and Detectron2, an open-source object-detection library. While the model obtained a high ranking in its respective competition, there is still potential for refinement, as evidenced by the comparatively low prediction results. Chun et al. [18] proposed an alternative system that addresses road damage as a segmentation problem. To segment road defect masks, the authors used an FCN model with batch normalization and a ReLU activation function. To increase data variability, several brightness modifications were used, achieving high accuracy in the dataset. Zhang et al. [19] proposed an improved YOLOv3 model to detect damage on concrete bridge surfaces. In order to enhance the model, the authors introduced a novel transfer-learning approach with fully pretrained weights, a batch normalization layer, and a focal loss function. This improved version of YOLOv3 outperformed the original model and the two-stage faster R-CNN model, producing promising results.
Tabernik et al. [20] proposed a novel network for surface anomaly detection and compared it against Cognex ViDi Suite (commercial software [21]) and two state-of-the-art models: U-Net [22] and DeepLabv3+ [23]. The model, which is divided into a segmentation network and a decision network for classifying defects, can be trained with a small number of samples. This approach was found to be more efficient than the commercial software and the other models, achieving high accuracy on the dataset.
Detecting defects on containers is challenging due to the complexity of the task; despite these challenges, some attempts have been made in this area. Wang et al. [11] proposed a multi-type damage detection model for shipping containers based on transfer learning and MobileNetv2. This model was trained on a dataset with the following classes: damage, hole, bent, dent, rust, surroundings (port environment), open (door not closed), collapse (container stacks), and norm (normal container). This research uses the weakly supervised photo enhancer (WESPE) algorithm to enhance images for the dataset while also structuring a relationship map between low- and high-quality images. MobileNetv2 and Inceptionv3 were used as pre-trained models, with the original weights preserved. This model integrates feature extraction, global average pooling techniques, dropout regularization to avoid overfitting, and ultimately a classification layer that returns the predicted output. Subsequent to the initial training phase, both MobileNetv2 and Inceptionv3 were subjected to post-training refinements involving fine-tuning to enhance the model’s accuracy. In summary, the authors contended that the models were well suited to the actual container detection situation at the port. Their work facilitates the use of mobile devices for the collection of images, which are subsequently transmitted to a central unit. In the experiments, the MobileNetv2 model showed better overall generalization performance than the Inceptionv3 model. The findings indicate that the model exhibited difficulty in differentiating the surroundings of a standard container image. However, with regard to damage classification, the outcomes appear to be encouraging. Despite the authors’ acknowledgment of potential areas for enhancement, the proposal yielded noteworthy results, demonstrating its capability to influence the shipping industry.
Nguyen Thi Phuong et al. [12] proposed a novel approach for container damage detection, employing a modified YOLO with advanced neural architecture search (YOLO-NAS) model. The custom dataset under consideration contains a total of 4736 images, which have been resized to 640 × 640 pixels while preserving aspect ratios. The images have been divided into three sets: training (70%), validation (15%), and testing (15%). A total of 27,807 manually labeled annotations have been assigned to a single “damage” class. This class encompasses all visible defects, thereby facilitating the detection task. The authors refined a pretrained YOLO-NAS model from Roboflow, incorporating preprocessing steps (including normalization and image augmentation), prediction, and postprocessing. This pipeline improves object detection in seaport environments by refining predictions and filtering false positives. The model performed well, achieving a mean average precision (mAP) of 91.2%. Furthermore, the dataset is publicly available in the Roboflow datasets, promoting transparency and reproducibility. However, the model has limitations, including dependence on high-quality annotations, difficulty with cluttered or occluded scenes, and difficulty detecting subtle defects like rust. Moreover, the computational demands of the model may impose constraints on its applicability in contexts characterized by limited resources.
Bahrami et al. [13] focused on the detection of corrosion on shipping containers through two different approaches. First, the authors developed an optimized deep-learning architecture that relies on predefined bounding boxes (anchor boxes) to detect objects at different scales and ratios, leading to the detection of corrosion. This process improves performance and generalization and was combined with the following networks: faster R-CNN, single shot detector (SSD) MobileNet, and SSD Inceptionv2. These architectures demonstrate their suitability for the challenging task of detecting corrosion defects. Two experiments were proposed with anchor boxes, one with fixed-size boxes and another with flexible sizes. In the experiments, the fixed-size anchor box model failed to detect all corrosion defects and did not provide high performance. With flexible anchor box sizes, there was an improvement of over 5%. The faster R-CNN network showed better results than the SSD models. Despite these improvements, the three models did not provide solid results (in the 90% range) in terms of performance, as the best model achieved an accuracy of 66%. Nevertheless, these advances have provided the shipping industry with a fundamental framework for corrosion detection.
Lastly, Bahrami et al. [14] implemented a subsequent approach that aims to address the challenge of generalization, a problem that affects most models to some extent. The proposed framework consists of two main modules: a high-resolution and temporal context region-based CNN (HRTC R-CNN) applied for corrosion defect detection, in conjunction with a corrosion defect characterization (CDC) module for corrosion defect inspection. Initially, the first part of the framework incorporates a backbone network coupled with a region proposal network (RPN) to generate corrosion proposal regions. These features are subsequently refined in the second phase through the use of ROI align to sharpen instance-specific features for classification and mask generation purposes. The RPN incorporates a multi-depth multi-scale image pyramid network (MD-MS IPN) that operates at varying depths and resolutions, implementing multiple CNNs. This results in feature maps from the MD-MS IPN spanning a wide range of scales and hierarchical levels, further enhanced through the integration of an MS-FPN. Moreover, given the camera setup’s inability to capture the entire container within a single frame, the authors proposed a methodology for storing all of the key features derived from the enhanced FPN box proposals in memory banks. The model employed an attention mechanism to prioritize significant contextual information and disregard irrelevant data. It also incorporated a short-term memory bank to assess temporal context within neighboring frames and a long-term memory bank to store features from a broader range of frames, thereby considering context over time. For the second part of the framework, the CDC calculates the percentage of corroded area on the container, using optical flow to measure the displacement of sequential images and avoid overlaps. Following the determination of the full dimensions, edge detection techniques are used to define the boundaries, obtain the area, and calculate the percentage of pixels in corrosion regions. The precision of the CDC in detecting corrosion under various weather conditions was shown to be above 88%, with a mean error between 5% and 11%. In conclusion, the proposed framework was compared with conventional segmentation techniques and metallic defect detection techniques. HRTC R-CNN outperformed every algorithm in the comparison, making it the leading work in this area in terms of performance. Although the work of Bahrami et al. [13,14] provides significant progress in the domain of corrosion defect detection, it does not fully address the issue of generalization to other forms of damage. Nevertheless, it lays the foundation for the potential adaptation of this methodology to a wider range of damage types.

2. Materials and Methods

Deep-learning (DL) tasks require the use of annotated data for the training and identification of patterns. In the context of semantic segmentation, it is imperative that the dataset under consideration contains mask annotations at pixel level. Preparing a suitable dataset is crucial for the success of the model. In the absence of a well-prepared dataset, the model will be incapable of learning patterns within the data and generalizing to new samples. In order to construct an algorithm capable of accessing real-time image input, the following steps were necessary to prepare the data:
  • Pre-processing of raw data.
  • Data annotation.
  • Data splitting.
  • Data rescaling.
Subsequent to ensuring an adequate balance of the data, a series of experiments were conducted with the DeepLabv3+ model. In conclusion, an evaluation of the model’s performance in a new set of images is presented, along with an analysis of its ability to adapt to new scenes.

2.1. Data Preparation

In order to obtain a more generalized dataset with a wider variety of containers, an attempt was made to retrieve data from authors of similar research. However, the response was not favorable, as the data was classified and could not be disseminated to the scientific community. The subsequent stage of the investigation entailed the pursuit of publicly accessible datasets. Unfortunately, the data obtained in this stage proved to be unreliable due to its composition of images from diverse origins, characterized by substandard quality and the presence of watermarks. This rendered the data unsuitable for deployment with DL models. Additionally, the data retrieved from the Port of Sines was unprocessed and unfiltered, consisting of images captured from straddle carriers without any selection criteria suited to DL models. This underscored the necessity to filter and pre-process the provided data, which predominantly consisted of blurry and noisy images.

2.1.1. Pre-Processing

Two image processing techniques were applied to detect blur and noise: the Laplacian operator and the Canny edge detector [24,25]. The Laplacian operator calculates the gradient rate of change to find zero-crossings and detect edges. Prior to the application of the Laplacian filter, each image was converted to grayscale, and the Laplacian value was calculated for each image. An illustration of a satisfactory image and a rejected image upon application of the filter can be observed in Figure 1.
Canny edge detection is a multi-stage algorithm that detects strong edges in images. Initially, the image’s intensity gradient is calculated, followed by the determination of its direction. Subsequently, non-maximum suppression is employed to eliminate non-significant pixels. This process entails the comparison of each pixel’s gradient with its neighbors in the gradient direction, with the objective of retaining only the local maxima. This guarantees the production of edges that are both sharp and thin, a consequence of the examination of each individual pixel. Finally, hysteresis thresholding compares the gradient values with two thresholds: strong edge pixels are retained, weak edge pixels are kept only when connected to strong edge pixels, and the remaining weak edge pixels are discarded.
A series of tests were conducted using diverse image samples to determine the optimal thresholds. The values for these thresholds were set to 50 and 150, with a ratio of 3:1 between them. The kernel size was set to 3. Examples of images that passed the filter and images that did not pass the filter are shown in Figure 2.
Based on the threshold defined from the two image processing techniques, the images that did not meet the criteria were removed. Selecting a proper threshold value is an arduous task, as it either removes too many acceptable images or keeps too many rejected images. After several tests, since the amount of data provided was immense, the boundaries were defined with a broad margin, as represented in the following Equation (1):
$$\text{Edges} < 30000 \qquad \text{Laplacian} < 20 \tag{1}$$
All images falling below the established thresholds were deemed unacceptable. These values were chosen based on the analysis of several lighting conditions and camera perspectives. An example of images used for training is represented in Figure 3 and rejected images in Figure 4.
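As an illustration of this filtering step, the snippet below sketches the two checks using OpenCV. It assumes, beyond what is stated above, that the Laplacian score is the variance of the Laplacian response (a common blur measure), that the edge count is the number of non-zero pixels returned by the Canny detector, and that an image is rejected when both values fall below their thresholds.

```python
# Minimal sketch of the blur/noise filter, assuming the Laplacian score is the
# variance of the Laplacian response and "Edges" is the Canny edge-pixel count.
import cv2

EDGE_THRESHOLD = 30000       # minimum number of Canny edge pixels
LAPLACIAN_THRESHOLD = 20     # minimum Laplacian score

def is_acceptable(image_path: str) -> bool:
    gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)
    laplacian_score = cv2.Laplacian(gray, cv2.CV_64F).var()
    edges = cv2.Canny(gray, 50, 150, apertureSize=3)   # thresholds 50/150, 3x3 kernel
    edge_count = int((edges > 0).sum())
    # Reject images that fall below both thresholds (assumed combination rule).
    return not (edge_count < EDGE_THRESHOLD and laplacian_score < LAPLACIAN_THRESHOLD)
```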
Based on the previous examples, the following Table 1 represents the number of edge pixels and Laplacian values for the images shown in Figure 3 and Figure 4.
Following the completion of the filtering process, the subsequent phase in preparing the images for training the DL model was annotation, which prompted a brief review of recent annotation tools. This evaluation led to the selection of the most practical tool for annotating corrosion masks.

2.1.2. Annotation

Crucially, this process is of extreme importance for the model, as it defines the ground truth of corrosion found in all the container data. These annotations are used to train the model to recognize the corrosion in the images. Annotation can be done in different ways, such as bounding boxes or pixel-level labeling. In this work, pixel-level labeling was used, since this is a segmentation task and pixel-level labels provide more detailed information. Annotation tools are very helpful in this process, and they can be generic or specific to a particular task. To understand which tool is the most suitable, an analysis and testing of the available tools was performed. This analysis was based on self-testing, reviews, surveys, and recommendations from the community [26,27,28], covering CVAT [29], LabelStudio [30], LabelBox [31], VIA [32], and LabelMe [33]. A comparison regarding generic features of these tools is presented in Table 2.
All of the tools analyzed are open-source, though some also offer premium versions with additional features. A key aspect considered during the analysis was the user interface and its ease of navigation, which is important for time-limited projects. Another significant feature considered was the ability to handle large-scale datasets, as some tools struggle with large volumes of data. Lastly, the most relevant feature considered was support for pixel-level labeling, given that it is the annotation type used in this work.
In terms of annotation tools for CV tasks, only the most relevant ones were considered for this project, such as the brush tool, AI magic wand, intelligent scissors, bounding box, and points. The brush tool gives the user free control to draw the annotation mask, while the AI magic wand uses AI to predict the annotation mask. Intelligent scissors allow the user to add points around the object, and the tool automatically detects the object. The bounding box is more related to object detection, where the user draws a box around the object, while with points, the user marks points on the boundaries of the object. This analysis results in the comparison presented in Table 3.
In this study, CVAT was selected as the primary tool due to its user-friendly interface and all the features needed for this project. An example of a fully annotated container is shown in Figure 5.
In this manner, 500 images were annotated, 250 of which exhibited corrosion and 250 of which did not. This enables the model to differentiate between containers that are corroded and those that are not, thereby ensuring the balance of the data.

2.1.3. Rescaling Data

Considering that the images provided had a resolution of 3840 × 2160 pixels and most models take as input 512 × 512 or 1024 × 1024, the following technique was used to resize the images to a proper resolution before being analyzed by the model. Thus, the resizing process was executed by employing the PIL library [34], which provides a variety of resampling filters. These filters are used to determine how the pixels in the image are combined to create the new image. Given the relevance of image quality to the task and based on the table from PIL documentation, which shows the quality of downsizing and upscaling images, the LANCZOS filter was chosen for the resizing process.
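A minimal sketch of this resizing step is shown below; the target size and file paths are illustrative, not the project’s actual layout.

```python
# Downscale a 3840 x 2160 frame to a square model input using the LANCZOS filter.
from PIL import Image

def resize_image(src_path: str, dst_path: str, size: int = 512) -> None:
    with Image.open(src_path) as img:
        img.resize((size, size), resample=Image.LANCZOS).save(dst_path)
```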

2.1.4. Data Splitting and Limitations

Upon completion of the annotation process, the data were divided and balanced equally in terms of corroded and non-corroded images for the train, validation, and test sets. In this process, 80% of the data were allocated for training, 10% for validation, and 10% for testing. With regard to the criteria employed for the annotation of corrosion in the images, these were predominantly determined by the presence of rust, with the annotations guided in part by the subjective assessments of the annotator and informed by feedback from port collaborators. The annotations encompass all forms of corrosion, as the primary objective of this study is to comprehensively document all defects present. To ensure comprehensive and meticulous labeling, small areas were also annotated. For a more precise annotation, it would be best to consult a professional with a background in corrosion analysis, who could guide the process of creating such annotations. Given the level of detail required, the annotation process for each image was on average four hours in duration. This renders the process exhaustive and labor-intensive. Although the dataset is small, it was carefully treated to maintain high annotation quality. This deliberate trade-off prioritizes annotation precision over volume. Thus, the dataset serves as a reliable foundation for initial experimentation with the potential for future expansion to improve generalization and robustness. These annotations were then used to train the model.

2.2. Corrosion Segmentation

Based on the review and comparison performed on the semantic segmentation models, the DeepLab model demonstrated remarkable performance, outperforming previous state-of-the-art models, as illustrated in Table 4.

2.2.1. DeepLabv3+ Architecture

DeepLab is a series of models developed by the Google Research team, essentially formed by a deep neural network (DNN) trained for image classification and transformed into a fully convolutional network (FCN). This is achieved by replacing the fully connected layers with convolutional layers and restoring the feature resolution to the original input size with atrous convolution layers. DeepLabv3+ [23] introduced a decoder module to the atrous spatial pyramid pooling (ASPP) of the DeepLabv3 model [39], which is responsible for recovering spatial information lost during the encoding process. The resolution of the extracted encoder features is freely controlled by atrous convolution, allowing the model to trade off precision and runtime. The DeepLabv3+ model is characterized by a set of parameters that can be adjusted to improve performance during training. The main parameters tested were the backbone network, output stride, epochs, batch size, and learning rate. It is also important to mention that the model already incorporates data augmentation using random scales, crops, and horizontal flips, improving model generalization to unknown environments. The model’s performance was evaluated by comparing the predicted mask to the actual mask using the intersection over union (IoU) metric. This metric quantifies the overlap of two masks by calculating the ratio of the overlap area to the union area, as represented in Figure 6. A higher IoU value indicates closer agreement between the predicted and real masks, reflecting improved model accuracy.
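For binary corrosion masks, the IoU computation reduces to a simple ratio of overlapping to combined pixels, as in the sketch below (not the exact evaluation code used in this work).

```python
# IoU between a predicted and a ground-truth binary mask.
import numpy as np

def iou(pred_mask: np.ndarray, true_mask: np.ndarray) -> float:
    pred, true = pred_mask.astype(bool), true_mask.astype(bool)
    intersection = np.logical_and(pred, true).sum()
    union = np.logical_or(pred, true).sum()
    return float(intersection / union) if union > 0 else 1.0
```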

2.2.2. Background Extraction

To achieve background subtraction, a transformer architecture with a memory encoder called segment anything model (SAM) [40] was used. This model aims to segment any object in an image, regardless of its size, shape or appearance. It supports segmentation through prompts, which are simple instructions passed to the mask decoder that is capable of outputting segmentation masks in real-time.
Initially, the images were processed with the automatic mask generator. SAM returns all the masks in binary format, where the object is represented by white pixels and the background by black pixels and creates a file with several parameters, mainly, the area of the mask, the stability score and the bounding box. To find the container in the image, the mask with the largest area was considered, resulting in a new image where only the container is visible. Regarding the background, two options were tested: black, which is the default color, and white, which is obtained by inverting the binary mask. The initial results of the pre-processing steps are shown in Figure 7.
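A sketch of this step with SAM’s automatic mask generator is given below; the checkpoint path, model type, and image path are placeholders rather than the exact setup used.

```python
# Extract the container (assumed to be the largest mask) and black out the background.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")   # placeholder checkpoint
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("container.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)   # dicts with 'segmentation', 'area', 'stability_score', 'bbox'

container = max(masks, key=lambda m: m["area"])["segmentation"]   # largest-area mask
result = image.copy()
result[~container] = 0   # black background (default); setting these pixels to 255 gives white
```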
This method was determined to be unreliable. It was observed that in some cases, the model would segment the outside area at a higher level than the container area, resulting in the failure to extract the background.

2.2.3. Points as Prompt

To address the aforementioned issue, two points were utilized as a prompt to guide the model in segmenting the container, rather than employing the automatic mask generator. In the following figures, the points utilized as prompts and the results of the model prediction are shown. This method was successful in isolating the container from the background, as shown in Figure 8.
While the model demonstrated the ability to segment the container from its background, it was observed that the model encountered difficulties in accurately delineating the container’s borders. To avoid mismatches between the container mask and the annotations, an erosion operation and a dilation operation were applied to the mask.
Both of these operations were performed with the OpenCV library, using a 5 × 5 kernel. This kernel for the erosion operation only considers the pixel as 1 if all the pixels under the kernel are 1. For the dilation operation, it only considers the pixel as 1 if at least one pixel under the kernel is 1. This results in a mask that ensures that all borders are correctly segmented, as shown in Figure 9.
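A minimal sketch of this clean-up step with OpenCV is shown below, assuming a binary (0/255) container mask as input.

```python
# Erode then dilate the container mask with a 5 x 5 kernel to clean up its borders.
import cv2
import numpy as np

def refine_mask(container_mask: np.ndarray) -> np.ndarray:
    kernel = np.ones((5, 5), np.uint8)
    eroded = cv2.erode(container_mask, kernel, iterations=1)   # pixel kept only if all 5x5 neighbours are set
    return cv2.dilate(eroded, kernel, iterations=1)            # pixel set if any 5x5 neighbour is set
```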
The efficacy of both black and white backgrounds was evaluated in comparison to the standard background in both the pre-trained and trained model, as will be discussed in the experimental section.

2.3. Pre-Trained Model Experiments

Before training the model on the prepared dataset, an effort was made to explore alternative methods that would not require extensive time in annotations. In this context, pre-trained models were identified as a potential solution, given the similarity between corrosion and other segmentation tasks, such as the analysis of metallic surfaces or bridges, despite the absence of training on the specific dataset.
Four variations of a DeepLabv3+ pre-trained model were available, each trained on 440 images for bridge inspections with a different loss function [41]. These four models were trained for 40 epochs with image sizes of 512 × 512, a batch size of two, horizontal flip augmentations, and a ResNet50 backbone. To distinguish between the models, they were denoted as w18, w27, w35, and w40, respectively. The loss functions used were the following:
  • w18: Cross entropy loss.
  • w27: L1 loss.
  • w35: L2 loss.
  • w40: Cross entropy loss with weighted classes.
Before obtaining the results, a series of experiments were conducted to understand how well the DeepLabv3+ model would behave in different scenarios. Given that the model was trained with input images measuring 512 × 512 pixels, the first step was to resize the images to this size using the method in Section 2.1.3. Afterwards, three experiments were conducted using images from the provided dataset to assess the model’s performance. The following experiments were conducted:
  • Experiment 1: 20 random images.
  • Experiment 2: 40 images without red cargo containers.
  • Experiment 3: 20 images with only red cargo containers.
All of these experiments were tested in different settings, as described in Section 2.2.2, to see if the model could generalize better. However, firstly, it was necessary to prepare the ground truth masks in order to evaluate the pre-trained model performance. All experiments previously referenced were annotated with CVAT to obtain the ground truth masks. The model predictions were then evaluated in comparison to the ground truth using the IoU metric.
Preliminary experiments have indicated that the model is ineffective in identifying corrosion in red cargo containers. Since this process of degradation typically appears in nature as a brown or orange color, closely similar to the color of red cargo containers, the model struggles to differentiate between the two.

2.4. Pre-Trained Results

As previously mentioned, to measure the precision of the pixels, IoU is the best metric for classifying the performance of the model. The metrics were computed using the Torchmetrics library [42], which offers an accessible method for calculating these metrics with built-in functions. Based on preliminary experiments, it was noted that the model had difficulty identifying corrosion in red cargo containers. It became evident that this was a limitation of the model, given that its training was based on a distinct dataset. Taking into account the experiments performed on the three data samples, the results can be seen in Table 5. The model performed better in the first two experiments, where images were randomly selected, compared to the third experiment, where only red cargo containers were selected. This is due to the model’s inability to differentiate between corrosion and red containers. It is also noteworthy that the model demonstrated superior performance when presented with a white or black background in comparison to the default setting. Considering that the images used in the experiments also include corrosion on other elements, while only corrosion on the main container was annotated, replacing the background with a white or black color improved the model performance.
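As an illustration, the per-image IoU could be obtained with Torchmetrics roughly as follows; the binary-task call shown is an assumed usage, not the exact evaluation script.

```python
# Binary IoU (Jaccard index) between predicted and ground-truth masks.
import torch
from torchmetrics import JaccardIndex

iou_metric = JaccardIndex(task="binary")
preds = torch.randint(0, 2, (1, 512, 512))    # dummy predicted mask
target = torch.randint(0, 2, (1, 512, 512))   # dummy ground-truth mask
print(f"IoU: {iou_metric(preds, target):.3f}")
```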
In addition, the model performed better with cross-entropy-weighted classes, as seen in the w40 percentage results. In the first two experiments, the model performed around 20% IoU, but in the third experiment, all variations performed poorly, achieving less than 3% IoU. Despite this and considering that it is a model trained for a different task, the results were better than expected for the first two experiments. From the results, we can compare the predictions against the ground truth masks, when the background is similar to the color of corrosion and when it is distinct, as seen in Figure 10.
Although the IoU values achieved were low, this was expected, since semantic segmentation is a complex task due to the pixel-by-pixel comparison. Considering all this, the pre-trained models were not used for the corrosion segmentation task, as the models were unprepared for the dataset used and were unable to distinguish containers from corrosion when the colors were identical. Therefore, the next step was to train the model on the dataset that had been prepared.

2.5. Fine-Tuned Model Experiments

To train the DeepLabv3+ model (described in Section 2.2.1) on the prepared dataset, a Pytorch implementation [43] of the model was used with transfer learning on the backbone networks. Before starting the training process, the dataset was transformed to fit the shape and size of the input of the model.

Data Transformation

Firstly, since the model was pre-trained on PASCAL VOC dataset, it was required to create a custom dataset class to load the images and masks. This class was created based on the Pytorch Dataset class, which requires the implementation of two functions: __len__ which returns the size of the dataset and __getitem__ that returns the image and mask of a specific index. For the __len__ function, the size of the dataset was returned by reading the number of images in the respective text file (Train/Val/Test), while for the __getitem__ function, both images and masks were read from the respective folders, using the PIL library. After that, the mask was converted to a binary mask, where all non-zero pixels were converted to 1, resulting in an input shape of [batch size, channels, height, width] and a mask shape of [batch size, height, width], with the following values for the data input: [12, 3, 1024, 1024] and mask: [12, 1024, 1024]. These input shapes were also transformed by the data augmentation applied to the data, which were:
  • Random Resize: resize the image with a random scale;
  • Random Crop: crop the image to the model input size;
  • Random Horizontal Flip: flip the image horizontally;
  • ToTensor: convert image and mask to a Pytorch tensor;
  • Normalize: normalize the image tensor with the mean and standard deviation of the ImageNet dataset.
After these conversions, the shape and size of the data were checked to ensure that the model could receive the data correctly, resulting in the following values for the input of the data: [12, 3, 512, 512] and mask: [12, 512, 512].
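A condensed sketch of such a dataset class is shown below; the folder layout, file naming, and joint transform interface are hypothetical, standing in for the actual implementation.

```python
# Custom dataset returning an image tensor and a binary corrosion mask.
import os
import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset

class CorrosionDataset(Dataset):
    def __init__(self, root: str, split: str = "train", transform=None):
        self.root, self.transform = root, transform
        with open(os.path.join(root, f"{split}.txt")) as f:     # Train/Val/Test split file
            self.ids = [line.strip() for line in f if line.strip()]

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, index):
        name = self.ids[index]
        image = Image.open(os.path.join(self.root, "images", name + ".jpg")).convert("RGB")
        mask = Image.open(os.path.join(self.root, "masks", name + ".png"))
        if self.transform is not None:
            image, mask = self.transform(image, mask)           # joint resize/crop/flip/ToTensor/normalize
        mask = (torch.as_tensor(np.array(mask)) > 0).long()     # non-zero pixels -> corrosion class 1
        return image, mask
```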

2.6. Training Experiments

To further evaluate the performance of the models, a few experiments were conducted on the training set, which consisted of 400 images with 200 corroded images and 200 non-corroded images. These experiments were trained and evaluated with the respective partitioning of the dataset and the training plots were created using the Visdom library [44].
Firstly, the model was trained with the default configuration, without any adjustments, to check the performance of the model on the dataset. However, due to the limited memory capacity of the graphics processing unit (GPU), it was necessary to make a slight adjustment to the batch size, reducing it to 12 and adjusting the number of epochs to 300. Regarding the rest of the parameters, they were kept as default, with the following values:
  • Network model: MobileNetV2.
  • Output Stride: 16.
  • Learning rate: 0.01.
  • Loss Function: Focal loss.
  • Learning Rate Policy: Poly.
  • Optimizer: Stochastic Gradient Descent (SGD).
Afterwards, based on a few parameters and techniques used in pretests, such as various learning rates, batch sizes, loss functions, and output strides, a standard baseline model was used in the subsequent experiments with the following configurations:
  • Background: Normal, white background, and black background.
  • Data augmentation: With and without data augmentation.
  • Network comparison: MobileNetV2, ResNet50, and ResNet101.
  • Optimizer comparison: SGD, Adam, and AdamW.

2.6.1. Default Configuration

Initially, given that this implementation utilized MobileNetV2 as the default backbone, it was decided to train the model with this configuration, because it is a lightweight model that can be trained faster than the other backbones. Based on DeepLabv2 and DeepLabv3 experiments [39,45], the learning rate policy was set to Poly and the output stride was set to 16, respectively. The Poly policy is defined in Equation (2), where the policy gradually decreases the learning rate over time using a polynomial function.
$$\eta_t = \eta_0 \times \left(1 - \frac{t}{T}\right)^{\text{power}} \tag{2}$$
For the step policy, the learning rate is decreased by a constant factor at fixed intervals, as shown in Equation (3):
$$\eta_t = \eta_0 \times \gamma^{\left\lfloor t / \text{step\_size} \right\rfloor} \tag{3}$$
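For reference, the two policies can be expressed directly as functions of the current iteration, as in the sketch below (the power, gamma, and step_size defaults are illustrative).

```python
# Learning-rate policies from Equations (2) and (3).
def poly_lr(eta0: float, t: int, T: int, power: float = 0.9) -> float:
    """Poly policy: polynomial decay over the total number of iterations T."""
    return eta0 * (1 - t / T) ** power

def step_lr(eta0: float, t: int, step_size: int, gamma: float = 0.1) -> float:
    """Step policy: multiply the rate by gamma every step_size iterations."""
    return eta0 * gamma ** (t // step_size)
```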

2.6.2. Background

From the experiments made in the pre-trained models, it was observed that changing the background had a significant impact on the model performance. Consequently, the decision was made to assess the model with three distinct backgrounds: normal, white, and black. This preliminary step was previously delineated in Section 2.2.2.

2.6.3. Data Augmentation

Corrosion has a wide variety of sizes and shapes; therefore, it was decided to test the model with and without data augmentation to see if it could generalize better with the transformed data. Since the transformations, already described in Section 2.5, include random resizing and random cropping, they could negatively impact the model performance by removing important spatial information from the images.

2.6.4. Network Backbone Models

Unfortunately, the Pytorch implementation of the model only had the MobileNetV2, ResNet50, and ResNet101 backbones available, so it was not possible to test the more recent Xception backbone. Despite this, ResNet50 and ResNet101 were tested to check whether they could improve the performance of the model, as described in the DeepLabv3 article [39].

2.6.5. Optimizers

All the variations from DeepLab papers utilize SGD as the optimizer; however, it was also decided to test the Adam [46] and AdamW [47] optimizers to check if they could improve the model performance.

3. Results

These experiments were performed with different configurations to evaluate the behavior of the model, as described in Section 2.6. Based on these, the model performance was evaluated using the best weights from validation and compared against the results on the test set. Lastly, from all experiments, the best model configuration was identified for the corrosion segmentation task.

3.1. Default Configuration

Firstly, the model was trained using the default parameters with small adjustments (batch size of 12 and 300 epochs). Analyzing the loss of the training set, the model was able to quickly learn the patterns in the training data, as shown in Figure 11. When comparing the training and validation loss in Figure 12, the model does not overfit, as the validation loss follows a pattern similar to the training loss. Around 100 epochs, the two losses stabilize and remain relatively low and stable, indicating that the model has converged.
Regarding the performance (IoU) of the model, the results are shown in Figure 13, showing a rapid increase in the initial epochs, indicating that the model was able to learn quickly. Around 100 epochs, the model learns more slowly and stabilizes above 40% IoU. After 250 epochs, the learning curve starts to fluctuate slightly, indicating that the model has reached its learning capacity with the current data and parameters.

3.2. Background

Initially, the background removal technique was considered due to the fact that, in addition to the corrosion in the main container, there were other containers and other objects that could contain corrosion as well. This could potentially confuse the model and decrease its performance, since the annotations were made only on the main container. However, despite the improvements made in the pre-trained models, when the background was removed in the training experiments, the results were not as expected. The background removal technique did not improve the model performance, as shown in the smoothed graph represented in Figure 14.
Clearly, the white background had the worst convergence due to instability during training, while the black background had a convergence similar to that of the normal background. Despite this, since there was no significant improvement in model performance, this technique was not used in the following experiments.

3.3. Data Augmentation

Sequentially, it was also tested whether the transformations predefined in the DeepLabv3+ model would affect the learning behavior of the model. This was due to the fact that the transformations could remove important spatial information from the images, which could be crucial for the model to learn the patterns in the data. The results are shown in Figure 15.
Looking into the results, when applying transformations to the data, the model learns slower. However, this approach enables the model to generalize more effectively, resulting in slightly better performance. Without transformations, the model learned faster since the data patterns were simpler, but the learning curve stagnated relatively quickly, indicating that the model reached its learning capacity. Consequently, the transformations were kept in the following experiments, as it was observed that the model could generalize better with the transformed data.

3.4. Backbone Models

Subsequently, the model was evaluated using various backbone models, including MobileNetv2, ResNet50, and ResNet101. With regard to the MobileNetv2 and ResNet50 models, the standard batch size of 12 and 300 epochs with a learning rate of 0.01 was maintained. However, for the ResNet101 model, the batch size was reduced to 4 due to increased memory usage, consequently adjusting the learning rate to 0.005, resulting in Figure 16.
Similarly to the article [39], ResNet101 proved to be superior to ResNet50. MobileNetv2, despite being the fastest model to train, performed worse than the ResNet101 model. With an increase of 5% in performance compared to the previous best model (default configuration), the ResNet101 model achieved a better learning curve, despite being slower, increasing the performance to 46%. This finding led to the conclusion that the ResNet101 model was the most suitable for this training task, and it was used in the following experiments.

3.5. Optimizers

Lastly, the model was tested with different optimizers, SGD being the currently used one. An additional test was made with the adaptive moment estimation (Adam) and Adam with decoupled weight decay (AdamW) optimizers to check if they could improve the model performance. The results are shown in Figure 17.
Based on Figure 17, since the two tested optimizers had a greater oscillation in the learning curve, the learning rate was reduced to 0.0002 for the Adam and AdamW optimizers. Although the original article [23] does not include any other optimizer besides the SGD optimizer, Adam and AdamW improved the model performance quite significantly. Both optimizers were able to achieve a better learning curve compared to the SGD optimizer, with AdamW being less oscillatory compared to Adam. With an increase of almost 10% in performance compared to the previous best model (ResNet101 model), the AdamW optimizer was able to achieve 53% performance (IoU).
In conclusion of the training experiments, the best configuration of the model was the ResNet101 backbone with the AdamW optimizer, achieving a performance of 53% IoU, resulting in the best model configuration for the corrosion segmentation task. This model includes the following parameters, with an illustrative instantiation sketched after the list:
  • Network model: ResNet101.
  • Output stride: 16.
  • Learning rate: 0.0002.
  • Learning rate policy: Poly.
  • Optimizer: AdamW.
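The sketch below shows how such a configuration could be instantiated in PyTorch; torchvision’s DeepLabv3 (not v3+) with a ResNet101 backbone is used purely as a stand-in, since the exact constructor of the implementation used [43] is not reproduced here, and the iteration count is illustrative.

```python
# Stand-in instantiation of the best configuration (ResNet101 backbone, AdamW, poly policy).
import torch
from torchvision.models.segmentation import deeplabv3_resnet101

model = deeplabv3_resnet101(weights=None, num_classes=2)        # corrosion + background
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

total_iters = 30_000                                            # illustrative
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda t: (1 - min(t, total_iters - 1) / total_iters) ** 0.9   # poly policy, Equation (2)
)
```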

3.6. Model Evaluation

Subsequent to the conclusion of the training, a comparison was made between the validation and the performance of the test set. The best weights derived from the validation process were used to evaluate the model in the test set. It is noteworthy that the model’s training and validation sets were distinct, with the model being trained on the former and validated on the latter, to find the best weights. Thereafter, for each new experiment, the model demonstrating the highest degree of efficacy from the preceding experiment was selected, as seen in Table 6.
Although in early experiments comparing the validation and test set, the model struggled to generalize, with a drop of 10% in the performance, ResNet101 was able to generalize slightly better, with a drop of only 6% in the performance. Further improvements were implemented with the AdamW optimizer, resulting in a mere 3% decline in performance from the validation to the model testing set.

4. Discussion

Semantic segmentation is a challenging task and model performance can be influenced by several factors. In this section, the impact of the different configurations and hyperparameters used to train the model will be discussed. Despite the promising results obtained, it is also important to highlight that the model performance can be influenced by the quality of the dataset, since the annotations were made manually and can contain errors. Additionally, with only 500 images, the model may not have enough data to learn the patterns in the images, which can also influence the model performance. Furthermore, the quality of the images can also impact the model performance, since the images were taken in different lighting conditions, with different cameras and with low resolution.
Based on the previous challenges, it is important to understand that achieving a high performance in detecting corrosion is not trivial, as the task is complex due to the variety of shapes and colors and the model needs to learn pixel by pixel the patterns in the images. Adding this to the low quality of the dataset, the task becomes even more challenging. However, the results were promising, especially since this is a new technology for this specific task. This result is significant considering the constraints and challenges faced during the development of the model. Regardless, the model performance can be further improved by increasing the dataset size and improving the quality and variety of the images and annotations. Additionally, by using more advanced techniques that require more computational power, such as using more complex models, smaller output strides and increased resolution input images, the model performance can also be increased.
Furthermore, several examples of predictions during the validation process can be observed in Figure 18.
Upon evaluating the best model, several examples of predictions can also be visualized during the test process, as shown in Figure 19. Overall, the model demonstrated good results, which is apparent when comparing the ground truth and the predicted mask. When the background class is included, the model achieved a mean IoU (mIoU) of 76%, which is impressive considering the constraints related to the dataset size.

4.1. Background Class

Until this point, only the corrosion class performance had been considered for the sake of clarity. However, the background also plays a role in the performance evaluation, especially since the model also needs to distinguish this class. Therefore, it was decided to include the background class when describing the results, in order to check the model performance on the whole image. The results can be seen in Figure 20.
As illustrated in Figure 20, the model was able to achieve a mean IoU of 76%, which is a good result considering the complexity of the task.

4.2. Comparison with Pre-Trained DeepLabv3+

To conclude the analysis of the fine-tuned model, an additional comparison was made with the pre-trained DeepLabv3+ model. For the pre-trained model, the optimal configuration of the pre-trained experiments was used, as described in Section 2.4. This configuration was the pre-trained model with a white background using cross-entropy weighted classes.
Regarding the fine-tuned model, the optimal configuration from the fine-tuned experiments was used, as described in Section 3.6, with a ResNet101 backbone network and the AdamW optimizer with a learning rate of 0.0002. As shown in Table 7, the fine-tuned model attained an IoU of 49.4%, a significant improvement compared to the pre-trained model, which achieved an IoU of 4.0%.
Despite the fact that the pre-trained model demonstrated substandard performance in the detection of corrosion in cargo containers, this outcome was anticipated, given that the model was trained on a different dataset, while the fine-tuned model was trained on the appropriate dataset. This comparison highlights the importance of fine-tuning models on the specific dataset, as it can significantly improve the model performance.

4.3. Model Complexity Testing

To evaluate the computational complexity and runtime performance of the model, a simulation of a continuous test was conducted on a local workstation for about 60 s. The workstation was equipped with an 18-core Intel® Core™ i9-10980XE CPU and an NVIDIA GeForce RTX 3080 GPU with 10 GB of VRAM. A few measures were analyzed for this 60-s simulation: CPU utilization, GPU utilization, GPU memory usage, and GPU temperature. The CPU, which has 18 cores, used an average of 20% of its capacity during the 60 s, with an initial spike to 48% due to the model loading, as illustrated in Figure 21. On average, 40% of the GPU’s capacity was used, as seen in Figure 22. Figure 23 and Figure 24 show GPU memory consumption and temperature, respectively.
Memory usage was around 5000 MB and the temperature reached 60 °C. The test ran for 60 s under controlled conditions. During this period, the model demonstrated an average inference time of approximately 0.65 s per image. This suggests that the model is capable of efficient inference and suitable for near real-time applications when deployed on high-end hardware. These measurements provide insights into the model’s computational demands and can serve as a reference for future deployment.
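A minimal sketch of how such a measurement could be taken is shown below; the model and data loader are assumed to come from the training code, and the 60-second duration mirrors the test above.

```python
# Average per-image inference time over a fixed-duration GPU run.
import time
import torch

def measure_inference(model: torch.nn.Module, loader, duration_s: float = 60.0) -> float:
    model.eval().cuda()
    times, start = [], time.time()
    with torch.no_grad():
        for image, _ in loader:
            t0 = time.time()
            _ = model(image.cuda())
            torch.cuda.synchronize()            # wait for the GPU before stopping the clock
            times.append(time.time() - t0)
            if time.time() - start > duration_s:
                break
    return sum(times) / len(times)              # mean seconds per image
```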

5. Conclusions

This research presents a comprehensive approach to detecting corrosion in cargo containers using DL techniques, specifically semantic segmentation, which provides pixel-wise predictions and therefore the exact location of the damage. Notwithstanding the challenges posed by dataset constraints, noisy data, and model generalization issues, this work achieved promising results. A custom dataset was created from port terminal data, consisting of 500 annotated images (250 corroded and 250 non-corroded). The data were carefully processed and split for training, validation, and testing. Annotation was performed with the CVAT tool, which was evaluated and selected on the basis of its efficiency, user-friendliness, and labeling accuracy. Two variants of the DeepLabv3+ architecture were tested: one with pre-trained weights (from a bridge inspection dataset) and another trained and fine-tuned on the provided, filtered dataset. While the pre-trained model demonstrated poor adaptability, achieving a modest IoU of 22% in the first experiments and 4% on the custom test set, the fine-tuned model showed a notable improvement, reaching an IoU of 49% for the corrosion class and a mean IoU of 76% when the background class is included.
This result represents an important step towards automating corrosion detection in cargo containers, a task that traditionally relies on manual inspections, which are time-consuming, labor-intensive, and prone to human error. Besides showcasing a possible pipeline for creating a dataset with pre-processing steps and annotation tools, this work provides valuable insights into the challenges and opportunities of applying DL techniques to real-world problems in the logistics industry. Through systematic experiments, notable improvements were achieved and a performance baseline was established on which future studies can build. Working with a small dataset of limited diversity reflected the inherent challenges of real-world data; container colors, lighting conditions, and image resolutions in particular had a clear impact on generalization. While the fine-tuned model exhibited encouraging results, further efforts are necessary for practical implementation. Despite these challenges, the work provides a solid proof-of-concept and highlights clear directions for future improvement. Importantly, the system demonstrates the potential to significantly reduce labor time, improve inspection reliability, and streamline logistics operations by automating a traditionally manual process.

Future Work

Several opportunities for future work result from this research. Model performance can be improved by enlarging the dataset and its annotations and by employing more efficient backbone architectures that balance performance and computational cost. To enhance generalization, greater data diversity and quality are required; for instance, the impact of diverse lighting conditions, environmental settings, and container colors on corrosion detection should be investigated. Integrating this model with a server-side application for reporting and archiving would enable faster and more effective damage assessments at port terminals. This work addressed the challenge of creating a dataset from raw data and detecting corrosion in cargo containers using DL. Through pre-processing techniques to build the dataset and the training of a segmentation model, it demonstrated the potential of DL for damage detection. Despite the remaining limitations, the results are promising given the complexity of the task and the constraints related to the size of the dataset. Future research can focus on addressing the shortcomings highlighted here, enhancing performance, and transitioning to real-time solutions that meet the demands of real-world applications.

Author Contributions

Conceptualization, D.O., D.C. and A.J.R.N.; methodology, D.O. and D.C.; software, D.O.; validation, D.O., D.C. and A.J.R.N.; formal analysis, D.O., D.C. and A.J.R.N.; investigation, D.O.; resources, A.J.R.N.; data curation, D.O.; writing—original draft preparation, D.O.; writing—review and editing, D.O., D.C. and A.J.R.N.; visualization, D.O.; supervision, D.C. and A.J.R.N.; project administration, A.J.R.N.; funding acquisition, A.J.R.N. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the PRR—Recovery and Resilience Plan and by the NextGenerationEU funds at Universidade de Aveiro, through the scope of the Agenda for Business Innovation “NEXUS: Pacto de Inovação—Transição Verde e Digital para Transportes, Logística e Mobilidade” (Project no. 53 with the application C645112083-00000059). This work was also supported by the research unit IEETA (UIDB/00127/2020).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy issues.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
Adam	Adaptive Moment Estimation
AdamW	Adam with decoupled weight decay
AI	Artificial Intelligence
ASPP	Atrous Spatial Pyramid Pooling
APS	Port of Sines Authority
CASAE	Cascaded AutoEncoder
CBAM	Convolutional Block Attention Module
CDC	Corrosion Defect Characterization
CNN	Convolutional Neural Network
CV	Computer Vision
DL	Deep Learning
DNN	Deep Neural Network
EIOU	Efficient Intersection over Union
FCN	Fully Convolutional Network
FEU	Forty-feet-Equivalent-Unit
FPN	Feature Pyramid Networks
GPU	Graphics Processing Unit
HRNet	High-Resolution Network
HRTC	High-Resolution and Temporal Context
IoU	Intersection over Union
IPN	Image Pyramid Network
ISO	International Organization for Standardization
MD	Multi-Depth
mAP	mean Average Precision
mIoU	mean IoU
MS	Multi-Scale
OCR	Optical Character Recognition
R-CNN	Region-Based CNN
ReLU	Rectified Linear Unit
ResNet	Residual Network
RPN	Region Proposal Network
ROI	Region of Interest
SAM	Segment Anything Model
SGD	Stochastic Gradient Descent
SSD	Single Shot Detector
TEU	Twenty-feet-Equivalent-Unit
YOLO	You Only Look Once
WESPE	Weakly Supervised Photo Enhancer

References

  1. United Nations Conference on Trade and Development (UNCTAD). Review of Maritime Transport 2024. United Nations publication. 2024. Available online: https://unctadstat.unctad.org/datacentre/dataviewer/US.ContPortThroughput (accessed on 10 March 2025).
  2. ISO 1496-1:2013; Series 1 Freight Containers—Specification and Testing—Part 1: General Cargo Containers for General Purposes. International Organization for Standardization: Geneva, Switzerland, 2013.
  3. ISO 668:2020; Series 1 Freight Containers—Classification, Dimensions and Ratings. International Organization for Standardization: Geneva, Switzerland, 2020.
  4. Altunlu, O.; Elmas, G. Container Inspection and Repair Standards. In Proceedings Book; 2016; pp. 404–412. Available online: https://www.ics-shipping.org/resource/ucirc_revision_3/ (accessed on 10 March 2025).
  5. Carlo, H.J.; Vis, I.F.A.; Roodbergen, K.J. Transport operations in container terminals: Literature overview, trends, research directions and classification scheme. Eur. J. Oper. Res. 2014, 236, 1–13.
  6. Steenken, D.; Voß, S.; Stahlbock, R. Container terminal operation and operations research—A classification and literature review. OR Spectr. 2004, 26, 3–49.
  7. Demil, B.; Lecocq, X. The Box: How the Shipping Container Made the World Smaller and the World Economy Bigger. M@n@gement 2006, 9, 73–79.
  8. Basulo-Ribeiro, J.; Pimentel, C.; Teixeira, L. Digital Transformation in Maritime Ports: Defining Smart Gates through Process Improvement in a Portuguese Container Terminal. Future Internet 2024, 16, 350.
  9. Delgado, G.; Cortés, A.; Loyo, E. Pipeline for Visual Container Inspection Application using Deep Learning. Int. Jt. Conf. Comput. Intell. 2022, 1, 404–411.
  10. Orhei, C.; Mocofan, M.; Vert, S.; Vasiu, R. End-to-End Computer Vision Framework. In Proceedings of the 2020 International Symposium on Electronics and Telecommunications (ISETC), Timisoara, Romania, 5–6 November 2020; pp. 1–4.
  11. Wang, Z.; Gao, J.; Zeng, Q.; Sun, Y. Multitype Damage Detection of Container Using CNN Based on Transfer Learning. Math. Probl. Eng. 2021.
  12. Nguyen Thi Phuong, T.; Cho, G.S.; Chatterjee, I. Automating container damage detection with the YOLO-NAS deep learning model. Sci. Prog. 2025, 108.
  13. Bahrami, Z.; Zhang, R.; Rayhana, R.; Wang, T.; Liu, Z. Optimized Deep Neural Network Architectures with Anchor Box Optimization for Shipping Container Corrosion Inspection. In Proceedings of the 2020 IEEE Symposium Series on Computational Intelligence, SSCI 2020, Canberra, ACT, Australia, 1–4 December 2020; pp. 1328–1333.
  14. Bahrami, Z.; Zhang, R.; Wang, T.; Liu, Z. An End-to-End Framework for Shipping Container Corrosion Defect Inspection. IEEE Trans. Instrum. Meas. 2022, 71, 1–14.
  15. Duan, H.; Huang, J.; Liu, W.; Shu, F. Defective surface detection based on improved Faster R-CNN. In Proceedings of the IEEE International Conference on Industrial Technology, Canberra, ACT, Australia, 1–4 December 2022.
  16. Huang, B.; Liu, J.; Liu, X.; Liu, K.; Liao, X.; Li, K.; Wang, J. Improved YOLOv5 network for steel surface defect detection. Metals 2023, 13, 1439.
  17. Pham, V.; Pham, C.; Dang, T. Road damage detection and classification with Detectron2 and Faster R-CNN. In Proceedings of the 2020 IEEE International Conference on Big Data, Atlanta, GA, USA, 10–13 December 2020; pp. 5592–5601.
  18. Chun, C.; Ryu, S.-K. Road surface damage detection using fully convolutional neural networks and semi-supervised learning. Sensors 2019, 19, 5501.
  19. Zhang, C.; Chang, C.; Jamshidi, M. Concrete bridge surface damage detection using a single-stage detector. Comput.-Aided Civ. Infrastruct. Eng. 2020, 35, 389–409.
  20. Tabernik, D.; Šela, S.; Skvarč, J.; Skočaj, D. Segmentation-based deep-learning approach for surface-defect detection. J. Intell. Manuf. 2020, 31, 759–776.
  21. Cognex. VISIONPRO VIDI: Deep learning-based software for industrial image analysis. Cognex. 2018. Available online: https://www.cognex.com/products/machine-vision/vision-software/visionpro-vidi (accessed on 16 July 2024).
  22. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Munich, Germany, 5–9 October 2015; pp. 234–241.
  23. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. Lect. Notes Comput. Sci. (LNCS) 2018, 11211, 833–851.
  24. OpenCV. OpenCV-Python Tutorials. Available online: https://docs.opencv.org/4.x/d6/d00/tutorial_py_root.html (accessed on 16 July 2024).
  25. Bansal, R.; Raj, G.; Choudhury, T. Blur image detection using Laplacian operator and Open-CV. In Proceedings of the 5th International Conference on System Modeling and Advancement in Research Trends, SMART 2016, Moradabad, India, 25–27 November 2017; pp. 63–67.
  26. de Sousa Reis, P.M.L. Data Labeling Tools for Computer Vision: A Review. Master’s Thesis, Universidade Nova de Lisboa, Lisbon, Portugal, 2021.
  27. Sager, C.; Janiesch, C.; Zschech, P. A survey of image labelling for computer vision applications. J. Bus. Anal. 2021, 4, 91–110.
  28. Aljabri, M.; AlAmir, M.; AlGhamdi, M.; Abdel-Mottaleb, M.; Collado-Mesa, F. Towards a better understanding of annotation tools for medical imaging: A survey. Multimed. Tools Appl. 2022, 81, 25877–25911.
  29. CVAT.ai Corporation. Computer Vision Annotation Tool (CVAT). Zenodo. 2024. Available online: https://zenodo.org/records/12771595 (accessed on 3 September 2024).
  30. Tkachenko, M.; Malyuk, M.; Holmanyuk, A.; Liubimov, N. Label Studio: Data Labeling Software. Available online: https://labelstud.io/ (accessed on 3 September 2024).
  31. Labelbox. The Data Factory for AI Teams. Available online: https://labelbox.com/ (accessed on 3 September 2024).
  32. Dutta, A.; Zisserman, A. The VIA Annotation Software for Images, Audio and Video. In Proceedings of the 27th ACM International Conference on Multimedia (MM ’19), Nice, France, 21–25 October 2019.
  33. Russell, B.C.; Torralba, A.; Murphy, K.P.; Freeman, W.T. LabelMe: A database and web-based tool for image annotation. Int. J. Comput. Vis. 2008, 77, 157–173.
  34. Pillow Documentation. Pillow Handbook—Concepts. Available online: https://pillow.readthedocs.io/en/stable/handbook/concepts.html (accessed on 5 October 2024).
  35. Papers with Code. Semantic Segmentation on PASCAL VOC 2012. Available online: https://paperswithcode.com/sota/semantic-segmentation-on-pascal-voc-2012 (accessed on 5 October 2024).
  36. Shelhamer, E.; Long, J.; Darrell, T. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651.
  37. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
  38. Fu, J.; Liu, J.; Wang, Y.; Zhou, J.; Wang, C.; Lu, H. Stacked deconvolutional network for semantic segmentation. IEEE Trans. Image Process. 2019.
  39. Chen, L.-C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017.
  40. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y.; et al. Segment Anything. arXiv 2023.
  41. Bianchi, E.; Hebdon, M. Development of Extendable Open-Source Structural Inspection Datasets. J. Comput. Civ. Eng. 2022, 36.
  42. Detlefsen, N.S.; Borovec, J.; Schock, J.; Harsh, A.; Koker, T.; Di Liello, L.; Stancl, D.; Quan, C.; Grechkin, M.; Falcon, W. TorchMetrics—Measuring Reproducibility in PyTorch. J. Open Source Softw. 2022.
  43. VainF. DeepLabV3Plus-PyTorch. 2022. Available online: https://github.com/VainF/DeepLabV3Plus-Pytorch (accessed on 8 October 2024).
  44. FOSSASIA. Visdom. Available online: https://github.com/fossasia/visdom (accessed on 8 October 2024).
  45. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
  46. Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations, Banff, AB, Canada, 14–16 April 2014.
  47. Loshchilov, I.; Hutter, F. Fixing Weight Decay Regularization in Adam. arXiv 2017.
Figure 1. Laplacian filter in selected and rejected examples.
Figure 2. Canny edge detection in selected and rejected examples.
Figure 3. Accepted images for the dataset, respecting the boundaries of the threshold blur (14).
Figure 4. Rejected images for dataset due to quality (1,2), blur (3), and brightness (4).
Figure 5. Example of a fully annotated container.
Figure 6. Intersection over union metric.
Figure 7. Largest area segmented mask output from SAM.
Figure 8. Predictor output with 2 points (red star) as prompt. Best mask score: 0.975.
Figure 9. Black and white background container masks.
Figure 10. Comparison of ground truth masks and model predictions for containers with distinct and similar corrosion colors.
Figure 11. DeepLabv3+: Training loss on 300 epochs using default parameters.
Figure 12. DeepLabv3+: Validation loss on 300 epochs using default parameters.
Figure 13. DeepLabv3+: Validation performance of corrosion IoU on 300 epochs using default parameters.
Figure 14. DeepLabv3+ (bold lines: smoothed): Impact on validation performance using different backgrounds (default, white, and black) on 300 epochs.
Figure 15. DeepLabv3+: Impact on validation performance with or without data augmentation on 300 epochs.
Figure 16. DeepLabv3+ (bold lines: smoothed): Impact on validation performance using MobileNetv2, Resnet50, and Resnet101 backbone network models on 300 epochs.
Figure 17. DeepLabv3+ (bold lines: smoothed): Impact on validation performance using SGD on 300 epochs, Adam and AdamW on 200 epochs.
Figure 18. Validation: Comparison of the output prediction with the ground truth mask and the original image.
Figure 19. Test: Comparison of the output prediction with the ground truth mask and the original image.
Figure 20. Validation mean IoU using ResNet101 and AdamW (lr = 0.002) on 200 epochs.
Figure 21. CPU utilization during 60 s simulation using trained DeepLabv3+ as inference model.
Figure 22. GPU utilization during 60 s simulation using trained DeepLabv3+ as inference model.
Figure 23. GPU memory usage during 60 s simulation using trained DeepLabv3+ as inference model.
Figure 24. GPU temperature during 60 s simulation using trained DeepLabv3+ as inference model.
Table 1. Number of Canny edge pixels and Laplacian values.

Image | Canny Edges | Laplacian Variance
Rejected 1 | ≈77,000 | 11
Rejected 2 | ≈42,000 | 11
Rejected 3 | ≈6000 | 2
Rejected 4 | ≈5500 | 3
Accepted 1 | ≈316,000 | 54
Accepted 2 | ≈129,000 | 23
Accepted 3 | ≈137,000 | 85
Accepted 4 | ≈318,000 | 50
Table 2. Comparison of generic tools. (✓: Feature available, ✗: Feature not available).

Tool | Open Source | Friendly UI | Large-Scale Datasets | Pixel-Level Labeling
CVAT
LabelStudio
LabelBox
VIA
LabelMe
Table 3. Comparison of CV annotation tools. (✓: Feature available, ✗: Feature not available).

Tool | Brush Tool | AI Magic Wand | Intelligent Scissors | Bounding Box | Points
CVAT
LabelStudio
LabelBox
VIA
LabelMe
Table 4. Comparison of semantic segmentation models on PASCAL VOC 2012 dataset [35].

Model | mIoU %
FCN [36] | 67.2
PSPNet [37] | 85.4
SDN+ [38] | 86.6
DeepLabV3-JFT [39] | 86.9
DeepLabV3+ [23] | 89.0
Table 5. DeepLabv3+ pre-trained performance (IoU) on experiments 1 (20 random images), 2 (40 images without red cargo containers), and 3 (20 images with only red cargo containers). The experiments were tested on all different model weights with each respective loss function: w18 (cross entropy), w27 (L1 loss), w35 (L2 loss), w40 (cross entropy with weighted classes). Comparison of model performance in original image against white and black backgrounds.

Experiment | Weights | Original | White Background | Black Background
1 | 18 | 9.3% | 14.5% | 16.3%
1 | 27 | 15.5% | 20.4% | 21.1%
1 | 35 | 14.9% | 18.7% | 18.9%
1 | 40 | 10.3% | 22.0% | 20.5%
2 | 18 | 6.9% | 13.1% | 14.9%
2 | 27 | 12.2% | 17.7% | 18.5%
2 | 35 | 9.2% | 14.4% | 14.9%
2 | 40 | 9.3% | 18.3% | 14.8%
3 | 18 | 0.6% | 1.4% | 1.6%
3 | 27 | 0.9% | 2.4% | 1.9%
3 | 35 | 1.0% | 1.5% | 1.5%
3 | 40 | 0.8% | 0.9% | 1.7%
Table 6. DeepLabv3+ fine-tuned performance (IoU) on validation set and test set for each configuration in all experiments.

Experiment | Configuration | Validation | Test
Background | Default | 41.3% | 30.0%
Background | Black | 40.1% | 30.4%
Background | White | 41.0% | 29.5%
Data augmentation | Default | 41.3% | 30.0%
Data augmentation | Without | 39.4% | 29.3%
Networks | MobileNetV2 (def) | 41.3% | 30.0%
Networks | ResNet50 | 43.2% | 34.7%
Networks | ResNet101 | 46.3% | 40.1%
Optimizers | SGD (def) | 46.3% | 40.1%
Optimizers | Adam | 52.5% | 49.4%
Optimizers | AdamW | 52.7% | 49.4%
Table 7. Performance comparison of DeepLabv3+ pre-trained model with fine-tuned model.

DeepLabv3+ Model | Performance (IoU)
Pre-trained | 4.0%
Fine-tuned | 49.4%