A Novel Target Detection Method of the Unmanned Surface Vehicle under All-Weather Conditions with an Improved YOLOV3

The USV (unmanned surface vehicle) plays an important role in many tasks, such as marine environmental observation and maritime security, owing to its high autonomy and mobility. Detecting targets on the water surface with high precision is a precondition for subsequent task execution. However, changes in illumination and the surface environment degrade the performance of target detection methods during long-term USV tasks. Therefore, this paper proposes a novel target detection method that fuses DenseNet into YOLOV3 to reduce feature loss as target features are transmitted through the layers of a deep neural network, thereby improving detection stability. All the image data used to train and test the proposed method were obtained in a real ocean environment with a USV in the South China Sea during a one-month sea trial in November 2019. The experimental results demonstrate that, compared with existing methods, the proposed method is better suited to changing weather conditions, and its real-time performance is sufficient for practical ocean tasks with USVs.


Introduction
In recent years, the unmanned surface vehicle (USV), a typical autonomous unmanned system, has developed considerably and rapidly. It plays an important role in both military and civilian missions, reducing human casualties and improving mission efficiency in tasks covering submarine tracking, environmental monitoring, patrol, reconnaissance, and so on [1]. Furthermore, USVs are also employed in hydrographic measurement and bathymetric survey in shallow-water regions because of their special advantages [2][3][4][5]. Autonomous, reliable navigation without obstacle collision is one of the important preconditions for completing these tasks. To achieve superior perception performance, a USV generally employs heterogeneous sensors covering radar, lidar, cameras, and infrared sensors [6]. Cameras provide the advantages of computer vision in terms of power consumption, size, weight, cost, and the readability of data, unlike radar or LIDAR, which may require heavy equipment to be placed on the vehicle [7][8][9]. Therefore, vision-based target detection on the sea for USVs has received much attention, and the camera is becoming an essential perception sensor for USVs. In this work, the training dataset is additionally augmented to cover more environmental conditions by adjusting the brightness and rotation, which improves the robustness of the model to changes in environmental conditions. The rest of this paper is organized as follows. In Section II, we introduce the image data preprocessing, including data acquisition and augmentation. Next, the target detection method with an improved YOLOV3 is proposed in Section III. The evaluation metrics for target detection are addressed in Section IV, while the experimental results are discussed in Section V. Finally, the conclusions are provided in Section VI.

Image Data Acquisition
In this study, image acquisition was conducted using a forward-looking camera with 1280 × 720 pixel resolution. The camera was installed horizontally on top of the USV, developed by the Shenyang Institute of Automation, Chinese Academy of Sciences, to monitor the surface of the water. The USV platform and the visual system installation are shown in Figure 1. The camera is a Hikvision iDS-2DF8837I5X with 8 megapixels. The image data used in this paper were collected in the South China Sea during a one-month sea trial in November 2019 under sunny and cloudy weather conditions. The collection periods spanned the day from 07:00 to 17:00. The ship was selected as the target in this paper, and 3000 images of ships were collected, covering as many different environmental conditions as possible during the day. The environmental conditions and basic parameters of the collected images are shown in Table 1. A total of 1000 ship images were randomly selected as the training dataset; to reflect more environmental conditions, these 1000 images were then expanded to 4000 by data augmentation, yielding the final training dataset.

Image Data Augmentation
The intensity of natural illumination and the USV attitude induced by waves vary greatly during the day, which could influence performance in both the model training and verification steps. Therefore, the training dataset was augmented by adjusting the brightness and rotation. This augmentation not only enriches the deep feature maps of the targets, but also improves the robustness of the target detection method under realistic environmental conditions. The framework of data augmentation is shown in Figure 2.

Data Augmentation: Brightness
To improve robustness to variations in natural light, the original images were augmented by adjusting the brightness, and the pre-processed images were added to the training dataset. The adjustment values were randomly set in the range between $l_{min}$ and $l_{max}$. However, if the brightness is set too high or too low, image annotation becomes difficult because the target edges are unclear for manual annotation; such images would also degrade model training. Therefore, the threshold values must be constrained; in this work, $l_{min}$ and $l_{max}$ were set to 0.3 and 0.7, respectively.
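As a minimal sketch of this step (assuming the sampled value acts as a multiplicative brightness factor applied to the V channel in HSV space, a mapping the text does not state explicitly), the augmentation could look like the following, where L_MIN and L_MAX correspond to $l_{min}$ and $l_{max}$:

```python
import random

import cv2
import numpy as np

L_MIN, L_MAX = 0.3, 0.7  # threshold values l_min and l_max from the text


def augment_brightness(image: np.ndarray) -> np.ndarray:
    """Return a copy of a BGR image with its brightness rescaled.

    The scaling factor is drawn uniformly from [L_MIN, L_MAX]; the V channel
    of the HSV representation is scaled and clipped to the valid range.
    """
    factor = random.uniform(L_MIN, L_MAX)
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 2] = np.clip(hsv[..., 2] * factor, 0, 255)
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)
```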

Data Augmentation: Rotation
Considering the influence of sea waves on camera attitude, especially when the USV is sailing at high speed, the training dataset was also manually augmented by rotating the images by different angles. Here, rotations of 15° and −15° were utilized, so the training data were tripled after rotation. The rotated images can further improve the detection performance of the proposed method.
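A minimal sketch of the rotation step follows; note that the bounding-box annotations would need to be rotated accordingly, and the handling of uncovered border pixels (black fill here) is our assumption:

```python
import cv2
import numpy as np


def rotate_image(image: np.ndarray, angle_deg: float) -> np.ndarray:
    """Rotate an image about its center; uncovered border pixels become black."""
    h, w = image.shape[:2]
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
    return cv2.warpAffine(image, matrix, (w, h))


img = cv2.imread("ship_0001.jpg")  # hypothetical example frame
# Each original image yields two extra samples, tripling the training data.
augmented = [img] + [rotate_image(img, angle) for angle in (15, -15)]
```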

Image Annotation
To compare the performance with other algorithms, the images used for training the model weights were converted to the PASCAL VOC format. The targets in the training images were labeled manually by drawing bounding boxes with LabelImg, a graphical annotation tool widely used for deep learning. The completed dataset is shown in Table 2.
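For illustration, PASCAL VOC annotations are XML files, one per image; a minimal reader for the ship bounding boxes might look as follows (the annotation file name is hypothetical):

```python
import xml.etree.ElementTree as ET


def load_voc_boxes(xml_path: str):
    """Return (class_name, xmin, ymin, xmax, ymax) tuples from a VOC XML file."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        name = obj.findtext("name")  # e.g., "ship"
        bnd = obj.find("bndbox")
        boxes.append((name,
                      int(float(bnd.findtext("xmin"))), int(float(bnd.findtext("ymin"))),
                      int(float(bnd.findtext("xmax"))), int(float(bnd.findtext("ymax")))))
    return boxes


print(load_voc_boxes("ship_0001.xml"))  # hypothetical annotation file
```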

YOLOV3
YOLO (You Only Look Once) was proposed in [26], and the core structure of YOLO is a convolutional neural network that can predict multi-class targets at one time. It realizes end-to-end target detection in a real sense, with the advantages of a high detection accuracy and a fast speed. YOLOV3, released in 2018, is the state-of-the-art version of YOLO [27].
YOLO divides the input image into an $S \times S$ grid. If the center point of an object's ground truth falls within a grid cell, that cell is responsible for detecting the object. Each grid cell outputs $B$ prediction bounding boxes, and the information for each bounding box contains five values (x, y, width, height, and prediction confidence). The prediction confidence is defined as

$$C = P_r(\text{Object}) \times \mathrm{IoU}^{\text{truth}}_{\text{pred}},$$

where IoU (intersection over union), a standard indicator in target detection, measures the detection accuracy by calculating the overlap ratio between the true bounding box and the bounding box predicted by the detection method.
If a target falls in the grid cell, $P_r(\text{Object}) = 1$, and 0 otherwise. Then, a tensor of dimensions

$$S \times S \times (B \times 5 + C)$$

is predicted by a single CNN, where $S \times S$ is the number of grid cells, each grid cell predicts $B$ bounding boxes, and $C$ is the number of object classes in the model. YOLOV3 includes a CNN feature extractor named Darknet-53 as the backbone, which is a 53-layer CNN. Compared with the previous versions, YOLOV3 predicts boxes at three different scales, and the tensor dimensions correspondingly become

$$S \times S \times [3 \times (4 + 1 + C)].$$

The loss function of YOLOV3 consists of three parts, the coordinate prediction error, the IoU error, and the classification error:

$$\text{Loss} = E_{\text{coord}} + E_{\text{IoU}} + E_{\text{class}},$$

where $S^2$ denotes the number of grid cells covering the input image. The coordinate prediction error is defined as

$$E_{\text{coord}} = \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I^{\text{obj}}_{ij} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] + \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I^{\text{obj}}_{ij} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right],$$

where $\lambda_{\text{coord}}$ is the weight of the coordinate prediction error; $I^{\text{obj}}_{ij} = 1$ if the target falls into the $j$th bounding box of grid cell $i$, and $I^{\text{obj}}_{ij} = 0$ otherwise; $(x_i, y_i, w_i, h_i)$ are the true values of a target; and $(\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i)$ are the predicted bounding box center coordinates, width, and height.
The IoU error is defined as

$$E_{\text{IoU}} = \sum_{i=0}^{S^2} \sum_{j=0}^{B} I^{\text{obj}}_{ij} \left( C_i - \hat{C}_i \right)^2 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I^{\text{noobj}}_{ij} \left( C_i - \hat{C}_i \right)^2,$$

where $\lambda_{\text{noobj}}$ is the weight of the no-object IoU error, $C_i$ is the true confidence, and $\hat{C}_i$ is the predicted confidence.
The classification error is defined as

$$E_{\text{class}} = \sum_{i=0}^{S^2} I^{\text{obj}}_{i} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2,$$

where $c$ denotes the class to which a detected target belongs, $p_i(c)$ is the true probability that a target of class $c$ is in grid cell $i$, and $\hat{p}_i(c)$ is the predicted probability.
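As a concrete illustration of the IoU term used above, here is a minimal sketch for axis-aligned boxes given as (x1, y1, x2, y2):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


# Overlap area 1, union area 7, so IoU = 1/7.
assert abs(iou((0, 0, 2, 2), (1, 1, 3, 3)) - 1 / 7) < 1e-9
```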

DenseNet
Targets should be detected as early as possible, especially while the USV is sailing at high speed, to leave sufficient time to plan subsequent operations. This requires the detection method to be sensitive to targets at long distance, even though few target features are available at that range. Furthermore, in the complex ocean environment, targets are often blurred in foggy weather or when water droplets adhere to the camera lens. These factors pose a huge challenge for target detection. Although YOLOV3 is sensitive to small-scale objects, the feature information of the targets is lost during transmission through the neural network owing to convolution and down-sampling. Therefore, in this paper, DenseNet is introduced to improve the original YOLOV3, making more effective use of feature information [31].
Between the transition layers of YOLOV3, a structure referred to as a Dense Block is added to ensure that feature information is not lost in the transition. The structure of DenseNet is demonstrated in Figure 3. The Dense Blocks facilitate feature reuse and mitigate gradient vanishing.



Proposed Method
This paper takes Darknet-53 of YOLOV3 as the basic network structure for feature extraction. Considering that DenseNet reuses features and enhances feature propagation, the down-sampling layers in Darknet-53, which are prone to feature loss, are replaced with DenseNet.
The network structure of YOLOV3-dense is shown in Figure 4. Considering both the computing cost and the network structure, the size of the input images is changed from 1280 × 720 to 416 × 416. In the improved YOLOV3 network, the DenseNet structure replaces the 26 × 26 and 13 × 13 down-sampling layers; it contains the dense-block and the transition-layer. The transfer function of the dense-block is made up of batch normalization (BN), rectified linear units (ReLU), and convolution (Conv), which performs a nonlinear transformation between the layers $x_0, x_1, \ldots, x_{l-1}$. The specific operation is as follows. In the layers with 26 × 26 resolution, the input layer $x_0$ first undergoes a BN-ReLU-Conv(1 × 1) operation, then a BN-ReLU-Conv(3 × 3) operation, and outputs $x_1$. $x_0$ is spliced with $x_1$ as the new input $[x_0, x_1]$, and the above operation is repeated to output $x_2$. $[x_0, x_1]$ is spliced with $x_2$ as the new input $[x_0, x_1, x_2]$, and so on. Finally, the feature layer is spliced into 26 × 26 × 512 and propagates forward. In the layers with 13 × 13 resolution, the feature layer is finally spliced into 13 × 13 × 1024 and propagates forward. The transition-layer connects dense-blocks; in this layer, the feature map undergoes BN-ReLU-Conv(1 × 1) followed by average pooling to reduce its size. In the prediction process, the YOLOV3-dense model proposed in this paper predicts bounding boxes at three different scales, 52 × 52, 26 × 26, and 13 × 13, which improves the detection accuracy for small targets.
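A minimal PyTorch sketch of the dense-block transfer function and transition-layer described above follows; the growth rate, bottleneck width, and layer counts are illustrative assumptions, not the exact configuration of YOLOV3-dense:

```python
import torch
import torch.nn as nn


class DenseLayer(nn.Module):
    """One BN-ReLU-Conv(1x1) -> BN-ReLU-Conv(3x3) transform, spliced onto its input."""

    def __init__(self, in_channels: int, growth_rate: int, bottleneck: int = 4):
        super().__init__()
        inter = bottleneck * growth_rate
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.conv1 = nn.Conv2d(in_channels, inter, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(inter)
        self.conv2 = nn.Conv2d(inter, growth_rate, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))    # BN-ReLU-Conv(1x1)
        out = self.conv2(torch.relu(self.bn2(out)))  # BN-ReLU-Conv(3x3)
        return torch.cat([x, out], dim=1)            # splice: [x0, x1, ...]


class DenseBlock(nn.Sequential):
    """Stack of dense layers; channel count grows by growth_rate per layer."""

    def __init__(self, in_channels: int, growth_rate: int, num_layers: int):
        super().__init__(*[DenseLayer(in_channels + i * growth_rate, growth_rate)
                           for i in range(num_layers)])


class TransitionLayer(nn.Module):
    """BN-ReLU-Conv(1x1) followed by 2x2 average pooling to halve the resolution."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_channels)
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.pool = nn.AvgPool2d(kernel_size=2)

    def forward(self, x):
        return self.pool(self.conv(torch.relu(self.bn(x))))


block = DenseBlock(in_channels=256, growth_rate=32, num_layers=4)
x = torch.randn(1, 256, 26, 26)
y = block(x)                      # torch.Size([1, 384, 26, 26]); resolution preserved
z = TransitionLayer(384, 512)(y)  # torch.Size([1, 512, 13, 13]); size halved
```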


Precision, Recall, and F-Measure
To evaluate the detection performance of the proposed YOLOV3-dense model, the original YOLOV3 and Faster R-CNN with ResNet-101 are also applied to detect targets in the realistic images obtained via USV in the sea trial. Precision and recall analysis is utilized to evaluate the detection results [32]. Precision refers to the percentage of correctly identified targets among all extracted results. A high precision value indicates that the detection results contain a high percentage of useful information (true positives, TP) and a low percentage of false alarms (false positives, FP). The false-positive rate discussed in this study refers to the percentage of false alarms among all results, which equals 1 − precision:

$$\text{Precision} = \frac{TP}{TP + FP}.$$

The term recall indicates the accuracy of detecting the target objects (i.e., ships) and refers to the true-positive rate. A high recall value indicates that most of the targets have been detected. The sum of true positives and false negatives (FN) equals the actual number of targets in all images:

$$\text{Recall} = \frac{TP}{TP + FN}.$$

The average precision (AP) can be calculated from the precision-recall curve as

$$AP = \int_0^1 P(R)\, dR.$$

The F-measure takes both precision and recall into account to evaluate the overall performance of object detection. A high F-measure score indicates that the detection results contain fewer false alarms and more correct detections. The F-measure is calculated as

$$F = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.$$
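A minimal sketch of these metrics from raw counts follows; AP is approximated here by trapezoidal integration of a sampled precision-recall curve, and the counts in the example are illustrative, not the paper's results:

```python
def precision_recall_f(tp: int, fp: int, fn: int):
    """Precision, recall, and F-measure from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def average_precision(recalls, precisions) -> float:
    """Approximate AP = integral of P(R) dR over a sampled PR curve."""
    order = sorted(range(len(recalls)), key=lambda i: recalls[i])
    ap = 0.0
    for a, b in zip(order, order[1:]):  # trapezoidal rule between samples
        ap += (recalls[b] - recalls[a]) * (precisions[a] + precisions[b]) / 2
    return ap


print(precision_recall_f(tp=90, fp=10, fn=10))  # illustrative counts only
```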

Average Detection Time Cost
In addition, the average detection time cost is evaluated in the experiments, because the time cost determines the feasibility of real-time application in practice.
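A minimal sketch of how such an average could be measured is shown below; the detect callable stands in for any of the models, and GPU synchronization details are omitted:

```python
import time


def average_detection_time_ms(detect, images, warmup: int = 5) -> float:
    """Mean wall-clock inference time per image, in milliseconds."""
    for img in images[:warmup]:  # discard first runs (model/GPU warm-up)
        detect(img)
    start = time.perf_counter()
    for img in images:
        detect(img)
    return (time.perf_counter() - start) / len(images) * 1000.0
```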

Experimental Results and Discussions
Several experiments were carried out to evaluate the performance of the proposed model. The 3000 original images were randomly subdivided into three groups: the training dataset, the validation dataset, and the testing dataset. The complete dataset distribution used in the experiments is shown in Table 3, and all experiments were run on a server equipped with an Intel Xeon Gold 5217 CPU and NVIDIA TITAN RTX GPU cards.

Detection Performance Evaluation
The loss curves of the proposed YOLOV3-dense and the original YOLOV3 over 45 thousand iterations are shown in Figure 5. The loss of both models decreases gradually as the iterations increase and eventually converges to a low constant. After the 45 thousand iterations, the final loss of the proposed YOLOV3-dense is 0.67, while that of the original YOLOV3 is 0.68. It is notable that the proposed YOLOV3-dense converges slightly faster than YOLOV3 in the early stages of training, which means the weights of the proposed method can be trained at a lower time cost. The evaluation indexes covering TP, FP, AP, and F-measure of the proposed YOLOV3-dense model and the comparison models are listed in Table 4, and the precision-recall curves of these models during testing are shown in Figure 6.

On the basis of the above results, the F-measure of the proposed YOLOV3-dense model is 0.962, which is higher than that of the other two models. This indicates that the comprehensive performance of YOLOV3-dense, which balances precision and recall, is superior to that of the other two models. The AP of YOLOV3-dense is higher than that of YOLOV3 and basically equal to that of Faster R-CNN. YOLOV3-dense predicted 1317 targets in the testing image data, against 1406 ground-truth targets, and its TP and FP are better than those of YOLOV3. Faster R-CNN predicted 1363 targets, more than both YOLOV3-dense and YOLOV3. However, the TP of Faster R-CNN is only two higher than that of YOLOV3-dense, while its FP reaches 51, more than 7 times that of YOLOV3-dense. This indicates that Faster R-CNN produces more false alarms than YOLOV3-dense; in other words, more noise is falsely identified as targets when Faster R-CNN is used. The experimental results demonstrate that the overall detection performance of the proposed YOLOV3-dense is superior to that of the other two models.

Real-Time Performance Evaluation
The average detection time cost of the proposed YOLOV3-dense model is 67.5 ms per testing image, which is 10 ms slower than YOLOV3 because more features are processed in the YOLOV3-dense model. However, this detection speed is sufficient for real-time applications on a USV. It is notable that the average detection time cost of Faster R-CNN is 963.8 ms, more than 14 times that of the YOLOV3-dense model, even though the AP of Faster R-CNN is slightly higher than that of YOLOV3-dense. This makes Faster R-CNN difficult to apply on a USV, especially for fast-moving targets.

Performance of Data Augmentation
Brightness and rotation transforms were used to augment the training data to simulate changes in the light and the ocean environment. To evaluate the influence of data augmentation on target detection performance, 1000 original images and 4000 augmented images were used to train the proposed YOLOV3-dense model, respectively. The composition of the augmented data is the same as that shown in Table 2, and the same 800 testing images were utilized to evaluate the performance. The results are shown in Table 5, and the precision-recall curves of the two models are shown in Figure 7.


The AP and F-measure of the model trained without data augmentation are 92.44% and 0.957, respectively, while those of the model trained with data augmentation are 93.13% and 0.962. Data augmentation thus increases the AP by 0.69 percentage points and the F-measure by 0.005, which verifies that it is effective, to some extent, in improving detection.

Performance under Different Environment Conditions
In the realistic environment, changes in light and weather, as well as water droplets adhering to the camera lens, would influence target detection performance. All the detection results were reviewed manually, and typical detection results under different environmental conditions are illustrated in Figure 8.

The upper, middle, and lower rows of Figure 8 list the detection results achieved under different environmental conditions by YOLOV3, Faster R-CNN, and the proposed YOLOV3-dense, respectively. In the case of light scattering caused by water droplets adhering to the camera lens (Figure 8a,e,i), YOLOV3-dense and Faster R-CNN can properly identify the target falling into the region of water droplets. In the cases of light reflection (Figure 8c,g,k) and cloudy weather (Figure 8d,h,l), YOLOV3-dense and Faster R-CNN also properly identify more targets than YOLOV3. These results validate that DenseNet is conducive to improving the detection performance under different environmental conditions. Moreover, the proposed YOLOV3-dense suppresses false alarms (marked with a red dotted circle in Figure 8f) compared with Faster R-CNN. These experimental results further demonstrate that the proposed YOLOV3-dense is robust against changes in environmental conditions.

Conclusions
This study proposed an improved YOLOV3 model that fuses DenseNet to detect sea-surface targets under different environmental conditions, which is expected to enhance the environmental adaptability of USVs during long-term tasks. The YOLOV3-dense model proposed in this paper takes advantage of DenseNet's feature-reuse characteristic to optimize the down-sampling layers of the feature extraction part of the YOLOV3 model and to promote feature propagation. Realistic images obtained via USV in the sea trial were used to train the models and to evaluate the performance of the proposed YOLOV3-dense model against YOLOV3 and Faster R-CNN with ResNet-101. The F-measure of the proposed YOLOV3-dense model is 0.962, which is higher than that of YOLOV3 (0.958) and Faster R-CNN (0.954). Simultaneously, the AP of YOLOV3-dense reaches 93.13%, which is higher than that of YOLOV3 (92.47%) and basically equal to that of Faster R-CNN (93.21%). However, Faster R-CNN produces more false alarms than YOLOV3-dense, as its FP is much higher. These experimental results show that the YOLOV3-dense model proposed in this paper is superior to the YOLOV3 model and has better overall performance than Faster R-CNN with ResNet-101. In addition, the YOLOV3-dense model is robust to weather changes in the realistic ocean environment and meets the requirement of real-time prediction (67.5 ms/frame) for USVs.
The focus of future work will be on deploying the proposed model as a hardware module on USVs and implementing it to detect sea-surface targets in actual tasks. Moreover, the detection model will be optimized to accelerate the training process and further improve the detection performance.