Inspection of Underwater Hull Surface Condition Using the Soft Voting Ensemble of the Transfer-Learned Models

In this study, we propose a method for inspecting the condition of hull surfaces using underwater images acquired from the camera of a remotely controlled underwater vehicle (ROUV). To this end, a soft voting ensemble classifier comprising six well-known convolutional neural network models was used. Using the transfer learning technique, the images of the hull surfaces were used to retrain the six models. The proposed method exhibited an accuracy of 98.13%, a precision of 98.73%, a recall of 97.50%, and an F1-score of 98.11% for the classification of the test set. Furthermore, the time taken for the classification of one image was verified to be approximately 56.25 ms, which is applicable to ROUVs that require real-time inspection.


Introduction
The submerged part of a ship's hull is susceptible to biofouling in the form of pollutants or organisms such as water mosses and seagrass that attach themselves to the bottom and sides of the submerged surfaces. This phenomenon not only damages the surface of ships but also provides unwanted resistance during normal operation, resulting in inferior performance [1][2][3]. In addition, when a ship enters a port, the various pollutants attached to the hull surface can contaminate the seawater in the port. In the case of ships traveling abroad, aquatic alien creatures that are transported to a different region can disrupt local marine ecosystems [4]. Thus, hull surfaces must be cleaned periodically while ships are anchored in a port.
Conventional hull cleaning is performed by divers. Recently, however, studies have been conducted on cleaning hull surfaces using a remotely operated underwater vehicle (ROUV) [5][6][7][8][9]. A human operator observes the submerged hull surface through a camera mounted on the ROUV and checks its condition. The ROUV is subsequently remotely controlled to clean the affected parts of the hull. However, autonomous hull cleaning without human intervention requires ROUVs capable of recognizing the hull condition. In addition, since the hull condition should be immediately fed back to the ROUV, the process of recognition must occur in real-time.
However, owing to underwater conditions, the images observed by an ROUV through its camera are not clear. In addition, the images may differ depending on the depth of operation, underwater conditions, and lighting; consequently, existing image-processing methods are insufficient for the accurate recognition of hull conditions. Therefore, this study proposes a classification method to recognize the hull condition using convolutional neural networks (CNNs) [10] with images of the hull surface acquired through the ROUV camera. Based on the image, the hull condition is categorized into two classes: positive (unclean) and negative (clean). The main points of this study are as follows:
1. The condition of the hull surface was classified with high accuracy by using CNN models with hull surface images.
2. Our own training dataset was obtained under various underwater conditions using the developed ROUV.
3. Transfer learning was used to adapt the pretrained models to the classification of the hull surface.
4. A higher accuracy was obtained using a soft voting ensemble technique comprising several transfer-learned models.
The remainder of this paper is organized as follows. Section 2 describes previous related studies. Section 3 describes the soft voting ensemble classifier used in this study, and Section 4 describes the generation of the dataset used for training the CNNs. Section 5 discusses the proposed method and experimental results. Finally, Section 6 concludes the paper.

Inspection of Products and Underwater Objects
Several studies have used images from cameras to inspect product defects in various fields. Chang et al. [15] proposed a method for the defect inspection of the color filter, which is a component of the TFT-LCD module, using fuzzy inference from the inspection images. Jiang et al. [16] proposed a method for inspecting printed circuit boards (PCBs) for defects using logistic regression from inspection images. Zhang et al. [17] used a genetic algorithm, artificial neural network, and expert system to inspect copper strip images for defects. Zhao et al. [18] studied the image-based defect inspection of concrete surfaces. Siegel et al. [19] and Mumtaz et al. [20] studied aircraft defect inspection. Amosov et al. [21] studied the defect inspection of rivet joints in aircraft. Raouf et al. [22] proposed a machine-learning-based fault classification system for the fault detection of rotating vector reducers.
For hull surface inspection, methods using ultrasound [23,24] or sonar [25] have previously been used to inspect coating breakdown, corrosion, and cracks. However, with the improvement in underwater camera performance and image-processing technology, studies on the automatic inspection of hull surfaces have received significant attention. Neghdaripour and Firoozfam [26] proposed a stereo vision system for underwater hull inspections. Navarro et al. [27] proposed a sensor system and a method for detecting defects on a hull surface using thresholds from the images obtained. Fernández-Isla et al. [28] proposed a method for detecting defects from images of a hull surface using wavelet transform. Masi et al. [29] and Ortiz et al. [30] used artificial neural networks to detect corrosion in seabed pipelines and hulls. Chin et al. [31] classified biofouling images using transfer learning of the Inception V3 model. Gormley et al. [32] classified images of aquatic creatures attached to marine structures using CoralNet [33]. Bloomfield et al. [34] classified images of aquatic creatures using a CNN. Liniger et al. [35] reviewed classification methods using deep learning to categorize marine growth on offshore structures.
In this study, similar to the studies by Chin et al. [31], Gormley et al. [32], and Bloomfield et al. [34], CNNs were used to classify underwater images. However, the classification target in this study was the condition of hull surfaces and not aquatic creatures. This study utilized transfer learning and a soft voting ensemble.

Datasets for Training
Many publicly available datasets exist for training CNN models. The MNIST dataset [36] contains handwritten digits from zero to nine, with 60,000 training images and 10,000 test images, each a 28 × 28 grayscale image. It is often used to train simple models. ImageNet [11] is one of the largest image datasets used in computer vision. It contains more than 14 million images in more than 20,000 categories, with typical categories such as "balloon" or "strawberry". Most image classification studies have used it as a benchmark dataset. The COCO dataset [37] is a large-scale object detection, segmentation, and captioning dataset published by Microsoft. It contains image annotations across 80 categories with over 1.5 million object instances. It is often used as a benchmark to compare object detection performance. In addition, image datasets for indoor scenes [38], celebrities [39], dog breeds [40], and flowers [41] exist.
Only a small number of publicly available datasets exist for underwater images. Moreover, these are mainly datasets for aquatic creatures living underwater. Chin et al. [42] shared a dataset with 1326 labeled images divided into 10 classes, such as algae and balanus. Shihavuddin [43] published a dataset for the identification of coral reef species. CoralNet [33] is a dataset used for benthic image analysis. It also functions as a data repository and collaboration platform. This platform for sharing training data can help overcome the lack of available data. O'Bryne et al. [44] presented a method for overcoming the lack of underwater images. They generated a photorealistic synthetic scene of underwater inspection sites using an encoder-decoder model trained with 2500 images.
In this study, images of underwater hull surfaces were required, but there is no publicly available dataset for them. The existing datasets do not consider underwater objects or focus only on aquatic creatures, such as coral reefs. In this study, we collected images using the ROUV by SLM Global [45].

Problem Definition
The problem to be solved in this study is defined as follows: Given a two-dimensional image x of hull surfaces that is input through the ROUV's camera and labeled as clean (negative class) or unclean (positive class), define a binary classifier h(x) that can classify the hull condition via image x, where the output of h(x) indicates the probability P(unclean|x) that the input image x is unclean. For a given threshold ε, if P(unclean|x) > ε, x is classified as unclean, and if P(unclean|x) ≤ ε, x is classified as clean.

Soft Voting Ensemble Architecture
In this study, we defined a classifier using a soft voting ensemble of the well-known CNN models DenseNets [46], EfficientNets [47], Inceptions [48][49][50], MobileNets [51][52][53], ResNets [54,55], and VGGs [56], as shown in Figure 1. The soft voting ensemble classifier is a combination of multiple models in which the final decision is made by combining the individual decisions, based on the probability values that each model assigns to a particular class [14]. In general, the predictions in a soft voting ensemble are weighted based on each classifier's importance and merged to obtain the sum of weighted probabilities. The classification method using the soft voting ensemble is as follows:
Step 1. Each of the six models is represented by Equation (1):
$$h^{(k)}_{\theta_k}(x) = P_k(\mathrm{unclean} \mid x, \theta_k), \quad k = 1, \ldots, 6 \tag{1}$$
where $k$ is the number representing each model participating in the soft voting, $h^{(k)}_{\theta_k}(x)$ represents the $k$-th model for classification, $\theta_k$ is the weights of the $k$-th model, $x$ is the input image, and $P_k(\mathrm{unclean} \mid x, \theta_k) \in [0, 1]$, the output value of the $k$-th model, is the probability that the input image $x$ is unclean. In this study, $k = 1, \ldots, 6$ represent DenseNet, EfficientNet, Inception, MobileNet, ResNet, and VGG, respectively.
Step 2. Each model is retrained with our dataset using transfer learning to determine the weights $\theta_k$. The dataset creation and transfer learning method are described in Sections 3.3 and 4, respectively.
Step 3. $P_k(\mathrm{unclean} \mid x, \theta_k)$, the probability assigned by the $k$-th classification model that the input image $x$ is unclean, is evaluated for each model.
Step 4. By averaging all $P_k(\mathrm{unclean} \mid x, \theta_k)$ values, the final prediction value $P(\mathrm{unclean} \mid x) \in [0, 1]$ is evaluated using Equation (2):
$$P(\mathrm{unclean} \mid x) = \frac{1}{6} \sum_{k=1}^{6} P_k(\mathrm{unclean} \mid x, \theta_k) \tag{2}$$
Step 5. Finally, for a given threshold $\varepsilon$, image $x$ is classified as unclean if $P(\mathrm{unclean} \mid x) > \varepsilon$ and as clean otherwise.
Even if the number of models participating in soft voting changes, the overall process does not change. Only the number six in Equation (2) changes to the number of models.
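The averaging and thresholding steps above can be illustrated with a minimal sketch. It assumes that each model has already produced its probability of the unclean class for every image; the function name `soft_vote` and the probability values are hypothetical and chosen only for illustration.

```python
import numpy as np

def soft_vote(probs, threshold=0.5):
    """Average per-model P(unclean | x) values and threshold the mean.

    probs: array-like of shape (n_models, n_images), one row per model.
    Returns the averaged probabilities and the resulting class labels.
    """
    probs = np.asarray(probs, dtype=float)
    p_unclean = probs.mean(axis=0)  # unweighted average over the models
    labels = np.where(p_unclean > threshold, "unclean", "clean")
    return p_unclean, labels

# Hypothetical outputs of six models for three images (not measured data).
model_probs = np.array([
    [0.90, 0.20, 0.55],
    [0.80, 0.10, 0.45],
    [0.95, 0.30, 0.60],
    [0.70, 0.25, 0.40],
    [0.85, 0.15, 0.50],
    [0.90, 0.20, 0.35],
])
p_unclean, labels = soft_vote(model_probs)
print(p_unclean, labels)
```

Because the mean is taken over the first axis, the same function applies unchanged when the number of participating models differs from six.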

Transfer Learning of the Pretrained Models
The optimal weights of the six models comprising the soft voting ensemble were selected using transfer learning of the pretrained models. Transfer learning [12,13] is a machine learning technique in which a model developed for a task is reused as the starting point for a model for a second task. The six models used in this study comprised 26 sub-models, as shown in Table 1, and they were pretrained on the ImageNet dataset [11]. For each model, optimal hyperparameters and weights were selected through transfer learning and hyperparameter tuning.
Transfer learning is applied as follows. First, the input size of the pretrained models is redefined to the size of the input image, and the pixel values of the input image are normalized so that each pixel value is between 0 and 1. Second, the layers for multiclass classification used in the pretrained models are replaced with layers for binary classification. For this purpose, a global average pooling layer [57] and a dropout layer [58] are appended to the last convolution layer of the pretrained model. Finally, a fully connected layer with one node is appended, using a sigmoid function as the activation function, as defined in Equation (3):
$$\sigma(z) = \frac{1}{1 + e^{-z}} \tag{3}$$
where $z$ is the input to the activation function of the output node.
The redefined models are trained as follows. First, only the weights of the newly appended layers of the redefined models are tuned by training. Training lasts for 20 epochs with a given learning rate $\alpha_1$ on the training dataset, using the mini-batch gradient descent method. The Adam optimizer [59] is used with the parameters $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-7}$ (here, $\epsilon$ is the numerical stability constant of Adam, not the classification threshold). For the loss function, the average of the binary cross-entropy values between the actual label values $y^{(i)}$ and the predicted values $h^{(k)}_{\theta_k}(x^{(i)})$ of the images is evaluated, as defined in Equation (4):
$$J(\theta_k) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h^{(k)}_{\theta_k}(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h^{(k)}_{\theta_k}(x^{(i)})\right) \right] \tag{4}$$
where $m$ is the number of images used for training.
Following this, the weights of all the layers are fine-tuned for 10 epochs at a new learning rate $\alpha_2 = \lambda \cdot \alpha_1$, obtained by reducing the learning rate $\alpha_1$ by a factor of $\lambda < 1$. Finally, the weights corresponding to the highest validation accuracy among all the epochs are selected.
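The sigmoid output of Equation (3) and the averaged binary cross-entropy loss of Equation (4) can be written out as a minimal NumPy sketch. The function names, the clipping constant, and the example values are assumptions made for illustration; in practice the loss is supplied by the deep learning framework.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation of the single output node."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over m images.

    y_pred is clipped to [eps, 1 - eps] so that log(0) never occurs;
    the clipping constant is an implementation assumption.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    per_image = y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred)
    return float(-np.mean(per_image))

# Hypothetical pre-activation values and labels (1 = unclean, 0 = clean).
z = np.array([2.0, -1.0, 0.0])
y = np.array([1, 0, 1])
predictions = sigmoid(z)
loss = binary_cross_entropy(y, predictions)
print(predictions, loss)
```

A prediction of 0.5 for a positive image contributes log 2 to the loss, which is a convenient sanity check for an implementation of this kind.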
Optimal hyperparameters such as the dropout rate, learning rates, batch size, and sub-models are selected using hyperparameter tuning. The hyperparameters are tuned separately for each sub-model in Table 1 using a random search method [60]. In this method, the value of each hyperparameter is randomly sampled from the search space, the sub-model is trained with the sampled values, and the validation accuracy is measured. This is repeated dozens of times for each sub-model. Finally, among the sub-models of each model, the sub-model with the highest validation accuracy is selected.
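The random search procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the search space values are hypothetical, and `toy_validation_accuracy` is a stand-in for training a sub-model and measuring its validation accuracy, which is far more expensive in practice.

```python
import random

def random_search(search_space, evaluate, n_trials=50, seed=0):
    """Sample hyperparameters at random and keep the best-scoring sample."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        # Draw one value per hyperparameter, independently and uniformly.
        params = {name: rng.choice(values) for name, values in search_space.items()}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search space (the actual values are those of the paper's Table 5).
space = {
    "dropout_rate": [0.2, 0.3, 0.4, 0.5],
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "batch_size": [8, 16, 32],
}

def toy_validation_accuracy(params):
    # Stand-in for "train the sub-model, then measure validation accuracy".
    return 1.0 - abs(params["dropout_rate"] - 0.3) - abs(params["learning_rate"] - 1e-3)

best_params, best_score = random_search(space, toy_validation_accuracy)
print(best_params, best_score)
```

Repeating this search for every sub-model and keeping the best-scoring one per model family yields one tuned model per family for the ensemble.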

Description of the ROUV
In this study, images of the hull surfaces were collected using the ROUV developed by SLM Global [45] to clean underwater hull surfaces, which is illustrated in Figure 2. The ROUV attaches itself to the hull surface and crawls along it using electrically driven magnetic wheels. It is remotely controlled and monitored by an operator through a tether cable. While moving along the hull surface, the ROUV brushes off the pollutants on the hull surface with two brushes installed at the bottom of the ROUV. The ROUV possesses one camera and two lights in the front and one camera at the rear. The front and rear cameras are used to check the condition of the hull surface before and after cleaning, respectively. The videos are recorded at 10 frames per second (FPS). Table 2 lists the main specifications of the ROUV.
Sensors 2022, 22, x FOR PEER REVIEW

Dataset Creation
To retrain the pretrained models, images of the hull surfaces and their labels are required. The size and number of channels of the images were 512 × 512 and 3, respectively. The labels are represented as either clean or unclean. Images were extracted at intervals of 1 s from the video that was recorded by the ROUV. Subsequently, the images were manually labeled.
First, a rectangular area of 512 × 512 pixels containing the camera image was cut from the dashboard image of the ROUV to obtain an image of the camera region, as shown in Figure 3a. As the camera lens has a circular shape, the actual camera image also occupies a circular area. To use only this camera image, the circular portion was extracted using the Boolean intersection of the front camera image (Figure 3a) and a mask (Figure 3b). The resulting image, shown in Figure 3c, was used for training.
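The extraction of the circular camera region can be sketched with NumPy as follows. This is an illustrative sketch only: the mask is assumed to be a centered disk whose diameter equals the image width, and the uniform gray frame is a synthetic stand-in for a real camera image.

```python
import numpy as np

def circular_mask(size, radius=None):
    """Boolean mask that is True inside a centered disk (assumed geometry)."""
    if radius is None:
        radius = size // 2
    yy, xx = np.ogrid[:size, :size]
    center = (size - 1) / 2.0
    return (yy - center) ** 2 + (xx - center) ** 2 <= radius ** 2

def apply_mask(image, mask):
    """Keep the pixels inside the mask and set all other pixels to zero."""
    return np.where(mask[..., None], image, 0).astype(image.dtype)

# Synthetic 512 x 512 RGB frame standing in for the cropped dashboard image.
frame = np.full((512, 512, 3), 200, dtype=np.uint8)
mask = circular_mask(512)
masked = apply_mask(frame, mask)
print(masked.shape, masked[0, 0], masked[256, 256])
```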

In Figure 4d, the tail fin of a fish can be seen, and in Figure 4g, the coating on the hull surface is shown to have peeled off. In Figure 4e, the image is obscured by floating matter, and in Figure 4h, the lighting is too strong. As shown in Figure 4, underwater images include various objects, such as draft marks, fish, and floating matter. Furthermore, underwater conditions such as lighting, as well as the physical state of the hull surface, vary. Thus, identifying the hull condition is difficult. Misclassification due to the similarity in color of different objects is also a concern. For instance, draft marks, peeled sections of the hull surface, and barnacles are all generally white; however, only the images with barnacles should be classified as unclean.
In this study, 5683 images were extracted from videos of 20 hull surfaces at different dates and locations. These were split into two image sets: 2035 clean and 3648 unclean images. To obtain an equal number of images from the two image sets, for training, validation, and testing, 2000 images were randomly selected from each image set. Finally, each image set was split into a training, validation, and testing set in a 60:20:20 ratio. Consequently, for each class, the image set was split into 1200, 400, and 400 images, respectively.
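The balancing and splitting described above can be sketched as follows. The image identifiers are placeholders, and the function name and random seed are assumptions; only the class sizes, the 2000 images per class, and the 60:20:20 ratio are taken from the text.

```python
import random

def balanced_split(clean, unclean, per_class=2000, ratios=(0.6, 0.2, 0.2), seed=0):
    """Sample the same number of images per class, then split each class."""
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for label, items in (("clean", clean), ("unclean", unclean)):
        chosen = rng.sample(list(items), per_class)  # random order, no repeats
        n_train = int(round(per_class * ratios[0]))
        n_val = int(round(per_class * ratios[1]))
        splits["train"] += [(item, label) for item in chosen[:n_train]]
        splits["val"] += [(item, label) for item in chosen[n_train:n_train + n_val]]
        splits["test"] += [(item, label) for item in chosen[n_train + n_val:]]
    return splits

# Placeholder identifiers matching the class sizes reported in the text.
clean_ids = [f"clean_{i}" for i in range(2035)]
unclean_ids = [f"unclean_{i}" for i in range(3648)]
splits = balanced_split(clean_ids, unclean_ids)
print({name: len(items) for name, items in splits.items()})
```

With 2000 images per class, each class contributes 1200, 400, and 400 images to the training, validation, and test sets, so the combined sets hold 2400, 800, and 800 images.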
To increase accuracy, the images of the training set were augmented by randomly applying one or more of the following four methods:
1. Brightness adjustment to randomly adjust the brightness of an image;
2. Contrast adjustment to randomly adjust the contrast of an image;
3. Saturation adjustment to randomly adjust the saturation of an image;
4. Cropping to randomly remove a particular region from an image.
Considering that the images acquired at the same position on the same hull surface may vary according to the depth and ambient brightness of the seawater, the adjustment of the brightness, contrast, and saturation can improve the accuracy. Cropping can also improve accuracy. However, the commonly used augmentation techniques of translation, rotation, flipping, and scaling were avoided because we experimentally verified that they did not improve the accuracy. We assume that such transformations do not significantly alter the images. Using the aforementioned methods, four augmented images were generated from each image. Figure 5 shows examples of augmented images.
The configuration of the final dataset is listed in Table 3. The training set is used to retrain the pretrained models. The validation set is used to optimize the models via hyperparameter tuning. The test set is used for testing the models and the soft voting ensemble classifier.
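The four augmentation methods can be sketched with NumPy as follows. Several points are assumptions made for illustration: images are floating-point arrays in [0, 1], the adjustment ranges are arbitrary, and "cropping" is interpreted as blanking a random rectangular region. The actual implementation and parameter ranges are not specified in the text.

```python
import numpy as np

def adjust_brightness(img, delta):
    return np.clip(img + delta, 0.0, 1.0)

def adjust_contrast(img, factor):
    # Scale the deviation of each channel from its mean value.
    mean = img.mean(axis=(0, 1), keepdims=True)
    return np.clip((img - mean) * factor + mean, 0.0, 1.0)

def adjust_saturation(img, factor):
    # Blend each pixel with its gray value (simple channel average).
    gray = img.mean(axis=2, keepdims=True)
    return np.clip(gray + (img - gray) * factor, 0.0, 1.0)

def erase_region(img, rng, max_frac=0.3):
    # Assumed interpretation of "cropping": blank a random rectangle.
    h, w = img.shape[:2]
    eh = int(rng.integers(1, max(2, int(h * max_frac))))
    ew = int(rng.integers(1, max(2, int(w * max_frac))))
    top = int(rng.integers(0, h - eh + 1))
    left = int(rng.integers(0, w - ew + 1))
    out = img.copy()
    out[top:top + eh, left:left + ew] = 0.0
    return out

def augment(img, rng, n_copies=4):
    """Create n_copies images, each by applying one or more random methods."""
    ops = [
        lambda im: adjust_brightness(im, rng.uniform(-0.2, 0.2)),
        lambda im: adjust_contrast(im, rng.uniform(0.7, 1.3)),
        lambda im: adjust_saturation(im, rng.uniform(0.7, 1.3)),
        lambda im: erase_region(im, rng),
    ]
    copies = []
    for _ in range(n_copies):
        n_ops = int(rng.integers(1, len(ops) + 1))  # between 1 and 4 methods
        chosen = rng.choice(len(ops), size=n_ops, replace=False)
        out = img
        for idx in chosen:
            out = ops[int(idx)](out)
        copies.append(out)
    return copies

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))  # synthetic stand-in for a hull image
augmented = augment(image, rng)
print(len(augmented), augmented[0].shape)
```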

Implementation and Experiments
In this study, the proposed soft voting ensemble classifier was implemented using Python and Google's TensorFlow 2 and was run on computers with an Intel Xeon 3.00 GHz CPU, 128 GB RAM, and two NVIDIA TITAN RTX graphics cards. The pretrained models and weights provided by TensorFlow 2 were used. Retraining and hyperparameter tuning were performed according to the methods described in Section 4. The average time for retraining each model is shown in Table 4. Table 5 lists the search space for hyperparameter tuning. Some of the values of the batch size in Table 5 may have been selected owing to the memory limitations of the graphics cards.

Table 5. Search space for hyperparameter tuning (columns: Hyperparameters, Values; the Sub-models row covers the sub-models of each model).
To determine the optimal values of the hyperparameters for each sub-model in Table 1, 50 samples per sub-model were randomly selected from the values in Table 5. Subsequently, the sub-model was trained with the selected hyperparameter values. For the validation set, the sub-model with the highest accuracy was selected as the optimal model. The accuracy is defined as:
$$\mathrm{Accuracy} = \frac{\text{True positive} + \text{True negative}}{\text{True positive} + \text{True negative} + \text{False positive} + \text{False negative}}$$
where the threshold for classification, ε, is set to 0.5. Table 6 lists the optimal hyperparameter values for each model. Table 7 shows that the training and validation accuracies are greater than 98% and 97%, respectively.
Table 8 presents the classification results of the test set using the soft voting ensemble classifier that comprises six optimal models. The precision, recall, and F1-score were calculated as:
$$\mathrm{Precision} = \frac{\text{True positive}}{\text{True positive} + \text{False positive}}, \quad \mathrm{Recall} = \frac{\text{True positive}}{\text{True positive} + \text{False negative}}, \quad F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
Table 8 shows that both the test accuracies and F1-scores of the six models are higher than 96%. Therefore, even if used independently for classification, the six models can achieve an accuracy of 96% or higher. The soft voting ensemble classifier has a higher accuracy, precision, and F1-score than the six models. Only the recall value of the soft voting ensemble classifier falls behind that of one of the models, i.e., EfficientNet. Therefore, we verified that the images of hull surfaces can be classified with higher accuracy when using the soft voting ensemble classifier.
Figure 6 shows examples of the classification of the images of the underwater hull surfaces using the soft voting ensemble classifier. The soft voting ensemble classifier correctly classifies the images with seagrass and barnacles, shown in Figure 6a-d, as unclean. The images with draft marks, floating matter, and peeled surfaces, shown in Figure 6e-g, were correctly classified as clean. Furthermore, in Figure 6h, the dark surface color due to lighting is not recognized as seagrass but as a clean surface.
Figure 7 shows examples of the misclassified images. Since Figure 7a-c only contain a small area of seagrass, the images were labeled as clean. However, the soft voting classifier seems to classify these images as unclean because of the seagrass. In Figure 7d, the dark colored seagrass and seawater overlap; consequently, the hull surface was erroneously identified as seawater. In Figure 7e, the seagrass was not correctly recognized owing to the disturbance caused by the floating matter. The image in Figure 7f has draft marks and seagrass; however, only the seagrass was recognized.
Figures 8-14 show the receiver operating characteristic (ROC) and precision-recall (PR) curves for varying classification thresholds. For the soft voting ensemble classifier in Figure 8, the area under the curve (AUC) was close to one and higher than those of the six individual models. This indicates that the soft voting ensemble classifier distinguishes the conditions of the hull surface better than any of the six individual models.
Figure 15 shows the results of the sensitivity analysis, in which one of the models was eliminated. Compared with the results in Table 7, the cases using five of the six models were also superior to using only one model. Except for precision, the case using all six models (depicted as None in Figure 15) is superior to the five-model cases in terms of accuracy, recall, and F1-score. The case eliminating EfficientNet is superior in terms of precision but inferior in accuracy, recall, and F1-score. Specifically, the F1-score, which is the harmonic mean of the precision and recall, of the six models was higher than that of the five models. In conclusion, the case using the six models was superior to that using the five models.
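The metric definitions and the leave-one-model-out sensitivity analysis can be sketched as follows. The confusion counts used at the end (390 true positives, 5 false positives, 10 false negatives, 395 true negatives) are not stated in the text; they are inferred under the assumption of 400 test images per class and are shown only because they reproduce the reported accuracy, precision, recall, and F1-score after rounding. All function names are hypothetical.

```python
import numpy as np

def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1-score from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

def confusion_counts(y_true, p_unclean, threshold=0.5):
    """Count TP, FP, FN, TN with unclean as the positive class."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(p_unclean, dtype=float) > threshold
    tp = int(np.sum(y_pred & y_true))
    fp = int(np.sum(y_pred & ~y_true))
    fn = int(np.sum(~y_pred & y_true))
    tn = int(np.sum(~y_pred & ~y_true))
    return tp, fp, fn, tn

def leave_one_out(y_true, probs, names):
    """Metrics of the full ensemble ("None") and with each model removed."""
    probs = np.asarray(probs, dtype=float)
    full = confusion_counts(y_true, probs.mean(axis=0))
    results = {"None": classification_metrics(*full)}
    for k, name in enumerate(names):
        reduced = np.delete(probs, k, axis=0).mean(axis=0)
        results[name] = classification_metrics(*confusion_counts(y_true, reduced))
    return results

# Inferred counts (an assumption, see the lead-in), not values from the paper.
reported = classification_metrics(tp=390, fp=5, fn=10, tn=395)
print({name: f"{100 * value:.3f}%" for name, value in reported.items()})
```

Calling `leave_one_out` with the per-model probabilities of the test set and the six model names would produce one row of metrics per eliminated model, which is the kind of comparison that the sensitivity analysis reports.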
The processing speed for the classification is a key factor for real-time applications. To verify this, the time required to classify an image was measured, and Table 9 presents the results. Classifying one image in the test set requires an average of 56.25 ms using only the CPU, which equates to approximately 17 FPS. Based on the speed of the ROUV, images can be sufficiently processed in real-time, even with ROUVs possessing relatively low CPU performance.
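The relationship between the per-image classification time and the frame rate, together with one way of measuring the average latency, can be sketched as follows. The `classify` callable is a hypothetical stand-in for the ensemble's prediction function, and the warm-up step is an assumption about good measurement practice rather than a description of the authors' procedure.

```python
import time

def mean_latency_ms(classify, images, warmup=1):
    """Average wall-clock time per image, in milliseconds."""
    for img in images[:warmup]:
        classify(img)  # warm-up calls are excluded from the timing
    start = time.perf_counter()
    for img in images:
        classify(img)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(images)

def frames_per_second(latency_ms):
    """Number of images that can be classified per second."""
    return 1000.0 / latency_ms

# 1000 / 56.25 is about 17.78, i.e., at least 17 whole frames per second.
fps = frames_per_second(56.25)
print(f"{fps:.2f} FPS")
```

This rate exceeds the 10 FPS at which the ROUV records video, which is consistent with the claim that the images can be processed in real-time.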

Conclusions
In this study, a method for inspecting the condition of hull surfaces using images from an ROUV camera was proposed. The classification of images was achieved using a soft voting ensemble classifier comprising six well-known CNN models. To tune the models, they were retrained with images of the hull surfaces. The results of the implementation and experiments showed that the classification accuracy and F 1 -score of the test set were approximately 98.13% and 98.11%, respectively. Furthermore, the proposed method was found to be highly applicable to ROUVs, which require real-time inspection performance.
However, the proposed method requires further improvement. As the dataset used in this study was collected from only a small number of inspection videos, the scope of the results of this study is limited. Therefore, many images that include various types of ships, underwater conditions, and lighting are needed. However, because ship owners are reluctant to provide hull images of their ships, collecting images is difficult. Therefore, in future studies, we plan to apply a data augmentation method using generative models to generate artificial images of the hull surfaces.