Article

A Lightweight Model for Real-Time Monitoring of Ships

1 College of Engineering Science and Technology, Shanghai Ocean University, Shanghai 201306, China
2 College of Electronic and Information Engineering, Tongji University, Shanghai 201804, China
3 Shanghai Aerospace Electronics Co., Ltd., Shanghai 201800, China
4 Ningbo Communication Center, Ningbo 315800, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Electronics 2023, 12(18), 3804; https://doi.org/10.3390/electronics12183804
Submission received: 14 August 2023 / Revised: 3 September 2023 / Accepted: 6 September 2023 / Published: 8 September 2023

Abstract

Real-time monitoring of ships is crucial for inland navigation management. Under complex conditions, it is difficult to balance accuracy, real-time performance, and practicality in ship detection and tracking. We propose a lightweight model, YOLOv8-FAS, for real-time ship detection and tracking to address this issue. First, FasterNet and an attention mechanism are integrated into the backbone to achieve simple and efficient feature extraction. Second, the lightweight GSConv convolution method and a one-shot aggregation module are introduced to construct an efficient network neck that enhances feature extraction and fusion. Furthermore, the loss function is improved based on ship characteristics to make the model more suitable for ship datasets. Finally, the advanced ByteTrack tracker is added to achieve real-time detection and tracking of ship targets. Compared to the YOLOv8 model, YOLOv8-FAS reduces computational complexity from 8.1 × 10^9 to 6.3 × 10^9 FLOPs and reduces model parameters by 20%, resulting in only 2.4 × 10^6 parameters. The mAP-0.5 is improved by 0.9 percentage points, reaching 98.50%, and the real-time object tracking precision of the model surpasses 88%. The YOLOv8-FAS model combines light weight with high precision, and can accurately perform ship detection and tracking tasks in real time. Moreover, it is suitable for deployment on hardware resource-limited devices such as unmanned surface ships.

1. Introduction

Inland waterway transportation is an essential component of the integrated transportation system, playing a significant role in promoting urban development, optimizing resource allocation, and facilitating communication and cooperation [1,2]. In urban inland waters or coastal environments such as harbors, there is a diverse range of ship types and a relatively dense distribution of ships [3]. While inland waterway transportation has its advantages, it presents formidable challenges to navigational supervision [4]. Tasks such as ship density statistics, ship behavior monitoring, accident investigation, and aiding navigation rely on ship monitoring. The core of ship monitoring is efficient detection and real-time tracking of ship objects.
The application of Convolutional Neural Networks (CNNs) [5] in various object detection and tracking fields has become increasingly widespread, and in recent years researchers have introduced them into ship monitoring tasks. Object detection algorithms based on convolutional neural networks can automatically learn essential features from images through training, overcoming the limitations of manual feature extraction and leading to more accurate and efficient detection results. Object tracking, as the task that follows detection, has gained popularity in engineering applications. Currently, deep learning-based object detection algorithms mainly fall into two categories: one-stage detection algorithms, represented by the Single Shot MultiBox Detector (SSD) [6] and the YOLO series, and two-stage detection algorithms, represented by the R-CNN and Faster R-CNN series [7]. Lin et al. [8] employed pretrained networks for feature extraction, reducing redundant feature mapping to enhance the Faster R-CNN detection network and achieving promising detection results. However, two-stage algorithms such as Faster R-CNN fail to meet real-time detection requirements. Zhang et al. [9] proposed a lightweight SSD-based model for ship detection by introducing a bidirectional feature fusion module and an attention mechanism. While they achieved improved detection results, the detection of small and densely packed targets remains an area that needs to be strengthened. Building upon YOLOv3, Yang et al. [10] introduced the K-means clustering algorithm and the Soft-NMS algorithm and modified the output classifier, improving the precision of ship detection, though at the cost of increased model complexity and memory usage. They coupled these modifications with the DeepSORT tracker to provide effective ship tracking [10].
While progress has been made in real-time ship monitoring using deep learning methods, this task involves several further challenges under natural conditions. First, real-time monitoring of ships is mostly based on SAR ship detectors [11], and research on ship detection and tracking in real-world situations needs to be deepened [12]. Second, water surface conditions are complex, with various interference factors such as river structures, buoys, wakes, and other obstacles, all of which impact detection accuracy [13]. Furthermore, different ship types can have very similar appearances, which adds to these challenges. Achieving a balanced trade-off between accuracy, speed, and computational cost while deploying ship monitoring models on devices with limited memory and computing capabilities for practical applications remains a major challenge [14].
In this paper, we introduce a lightweight ship detection and tracking model to address real-time ship monitoring challenges in natural water environments. The model is an enhanced version of the YOLOv8n algorithm designed for deployment on devices with limited memory and computational resources. It combines the improved YOLOv8 detector with the advanced ByteTrack tracker to reduce the parameters and computational complexity of the model while maintaining its performance. In addition, the loss function is refined according to the characteristics of the ship dataset to improve the prediction performance and optimization behavior of the model. Our contributions are as follows:
  • Inspired by FasterNet, we integrate simple and effective FasterNet blocks into the backbone of YOLOv8n. Additionally, we fuse the attention mechanism into the FasterNet block, enhancing the backbone’s lightweight nature and feature extraction capabilities.
  • We introduce a lightweight yet feature-rich neck network, and employ the lightweight GSConv convolution approach as a substitute for conventional convolution modules. Additionally, we replace the complex CSP module with a one-shot VoV-GSCSP aggregation module based on the GSConv design. Flexibly combining GSConv and VoV-GSCSP achieves an improved balance between computational costs and the performance of the feature fusion network.
  • We introduce an IoU loss measure called MPDIoU based on the minimum points distance to address the limitations of existing loss functions, leading to faster convergence speed and more accurate regression results.
  • We collected and processed surveillance images of waterborne ships in order to create a dataset designed explicitly for ship detection and tracking. This dataset includes various types of ships, making it suitable for real-time ship monitoring tasks.

2. Related Works

2.1. Object Detection

As one of the representative examples of one-stage object detection algorithms [15], the YOLO series [16] utilizes deep neural networks to identify and locate objects, offering high operational speeds suitable for real-time monitoring and tracking tasks. The authors of YOLOv5 have recently introduced a novel state-of-the-art model known as YOLOv8. The specific architecture of YOLOv8 is shown in Figure 1. Building upon previous iterations of the YOLO series, YOLOv8 incorporates several improvements that enhance detection accuracy and speed. This makes it particularly well-suited to serve as a baseline for ship detection.
The entire YOLOv8 network’s operation involves feature extraction, feature enhancement, and prediction of object conditions corresponding to prior bounding boxes.
The backbone is the primary feature extraction network within YOLOv8, where input images are initially processed to extract features. These extracted features are referred to as feature layers, constituting a collection of characteristics of the input images. YOLOv8 leverages three effective feature layers within the backbone for constructing subsequent network components. Compared to previous YOLO series algorithms, YOLOv8 employs 3 × 3 convolution kernels with a stride of two for initial feature extraction, sacrificing some receptive field while enhancing the model’s speed. The redesign of the CSP [17] module replaces three successive convolutions with two convolutions, drawing inspiration from the ELAN [18] architecture of YOLOv7. The specific implementation is to expand the number of channels of the first convolution to twice the original number, then split the convolution output in half along the channel dimension. This approach reduces the number of convolutions and accelerates the network.
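To make this split-and-merge idea concrete, the following PyTorch sketch (our own simplified illustration, not the Ultralytics implementation) widens the channels with one convolution, splits the result in half, processes only one half, and concatenates everything before a final fusion convolution:

```python
import torch
import torch.nn as nn

class SplitCSPBlock(nn.Module):
    """Simplified CSP-style block: widen, split, partially process, concatenate.
    Illustrative only; layer counts and activations are assumptions."""
    def __init__(self, channels, num_bottlenecks=1):
        super().__init__()
        self.expand = nn.Conv2d(channels, 2 * channels, kernel_size=1)
        self.bottlenecks = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(channels),
                nn.SiLU(),
            ) for _ in range(num_bottlenecks)
        ])
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x):
        x = self.expand(x)                # widen to 2C channels with one convolution
        a, b = x.chunk(2, dim=1)          # split in half along the channel dimension
        b = self.bottlenecks(b)           # only one half passes through the bottlenecks
        return self.fuse(torch.cat([a, b], dim=1))
```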
The Feature Pyramid Network (FPN) [19] is the enhanced feature extraction network in YOLOv8. The three effective feature layers obtained from the backbone are fused in the FPN component. Feature fusion combines features from different scales to facilitate the extraction of more refined characteristics. Within the FPN segment, the obtained effective feature layers are used to further extract features. YOLOv8 continues to adopt the PANet structure [20], in which features are first upsampled and fused top-down and then downsampled and fused again bottom-up.
The YOLO head serves as the classifier and regressor within YOLOv8. With the contribution of the backbone and neck, the network obtains three enhanced and effective feature layers. Each feature layer has the dimensions of width, height, and channel count. If we consider the feature map as a collection of individual feature points, each feature point acts as a prior point, eliminating the need for prior bounding boxes. Instead, each prior point contains features equal to the number of channels. The role of the YOLO Head is to determine whether an object is associated with each prior point by examining the corresponding priors’ conditions. YOLOv8 transitions from the previous coupled head design to a decoupled head design in which classification and regression are no longer realized within the same 1 × 1 convolution layer.
The loss function of YOLOv8 comprises both regression and classification components. In the classification part, the predicted category for each prior is compared with the category of the true box and the cross-entropy loss is calculated. YOLOv8 employs the Distribution Focal (DF) loss [21] for the final regression prediction, necessitating the inclusion of the DF loss in the regression section. The regression loss in YOLOv8 therefore comprises the CIoU loss [22] and the DF loss.
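In compact form, the overall detection loss is a weighted sum of these terms; the weight symbols below are our shorthand, and the exact default weights depend on the implementation:

L_{det} = \lambda_{box} L_{CIoU} + \lambda_{dfl} L_{DF} + \lambda_{cls} L_{BCE}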

2.2. Lightweight Object Detection Models

In order to achieve effective detection results for ship detection models under constrained memory and computational resources, researchers have proposed a series of lightweight object detection algorithms. The YOLO series [23,24,25,26,27,28,29], MobileNet [30], GhostNet [31], and ShuffleNet [32] are all widely employed as lightweight object detection models. MobileNets extensively employ 1 × 1 convolutions to fuse separately calculated channel information. ShuffleNets introduce channel shuffling to facilitate mutual communication of channel information. GhostNets utilize half the standard convolution operations to maintain inter-channel information exchange.
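As one concrete example of these ideas, the channel shuffle at the core of ShuffleNet can be written in a few lines of PyTorch; this is a generic illustration rather than code from any of the cited models:

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups so that grouped convolutions
    can exchange information (ShuffleNet-style, illustrative)."""
    n, c, h, w = x.shape
    assert c % groups == 0, "channel count must be divisible by the number of groups"
    x = x.view(n, groups, c // groups, h, w)   # (N, g, C/g, H, W)
    x = x.transpose(1, 2).contiguous()         # swap the group and channel axes
    return x.view(n, c, h, w)                  # flatten back to (N, C, H, W)
```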
Recently, Li et al. [33] introduced a novel approach to reduce model complexity while maintaining accuracy. The authors combined the concepts of MobileNet, GhostNet, and ShuffleNet, resulting in a lightweight convolution called GSConv. As illustrated in Figure 2, GSConv first downsamples the input with a standard convolution (SC), then applies a depth-wise separable convolution (DSC, implemented with DWConv) to the result. The SC and DSC outputs are then concatenated. Finally, a shuffle operation is applied to mix the SC information into the DSC channels. The computational complexity of GSConv is approximately half that of an SC while retaining a similar learning capacity.
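The description above can be sketched in PyTorch roughly as follows; kernel sizes, activations, and the exact shuffle are our assumptions rather than the reference implementation:

```python
import torch
import torch.nn as nn

class GSConv(nn.Module):
    """Sketch of GSConv: a standard convolution (SC), a depth-wise convolution
    on its output (DSC), concatenation, and a channel shuffle."""
    def __init__(self, c_in, c_out, kernel_size=3, stride=1):
        super().__init__()
        c_half = c_out // 2
        self.sc = nn.Sequential(                       # standard convolution branch
            nn.Conv2d(c_in, c_half, kernel_size, stride, kernel_size // 2, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())
        self.dwc = nn.Sequential(                      # depth-wise convolution branch
            nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())

    def forward(self, x):
        y1 = self.sc(x)
        y2 = self.dwc(y1)
        y = torch.cat([y1, y2], dim=1)                 # concatenate SC and DSC outputs
        n, c, h, w = y.shape                           # shuffle: interleave the two halves
        y = y.view(n, 2, c // 2, h, w).transpose(1, 2).contiguous()
        return y.view(n, c, h, w)
```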
In addition to lightweight detection models, there are ongoing efforts to design fast neural networks, which hold significant relevance for object detection. Chen et al. [34] introduced a new neural network, FasterNet, with a simple architecture that exhibits remarkable speed; it proved to be highly effective for various visual tasks as well as being hardware-friendly. The authors introduced a simple and rapid partial convolution, PConv, which reduces redundant computation and memory access, enabling better utilization of the computational capabilities of devices and more efficient spatial feature extraction. However, the PConv operation can result in a loss of information, affecting the accuracy and generalization capability of the model, and the optimization methods of FasterNet may need to be adjusted for different tasks and datasets. As depicted in Figure 3, PConv exploits redundancy within the feature map, applying a regular convolution (Conv) to only a subset of the input channels for spatial feature extraction while leaving the remaining channels unchanged. The FasterNet architecture built upon PConv performs well and runs fast on different devices such as GPUs, CPUs, and ARM processors. It is therefore well-suited for real-time ship detection, ship tracking, and similar tasks.
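A minimal sketch of PConv is shown below; it touches only the first channels // n_div channels and passes the rest through unchanged (the partial ratio of 1/4 is a typical choice, not a requirement):

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution sketch: a regular 3x3 convolution applied to only the
    first channels // n_div channels; the remaining channels are passed through."""
    def __init__(self, channels, n_div=4):             # n_div=4 is a typical choice
        super().__init__()
        self.c_conv = channels // n_div
        self.conv = nn.Conv2d(self.c_conv, self.c_conv, 3, 1, 1, bias=False)

    def forward(self, x):
        x1, x2 = torch.split(x, [self.c_conv, x.size(1) - self.c_conv], dim=1)
        return torch.cat([self.conv(x1), x2], dim=1)   # untouched channels are kept as-is
```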

2.3. Attention Mechanism

An attention mechanism allocates limited computing resources to the information that matters most, quickly screening out the features most relevant to the target task. In an object detection model, introducing an attention mechanism helps the model strengthen feature extraction and improves the performance of the detection network. Researchers have proposed various attention mechanisms, primarily classified as channel attention, spatial attention, or a combination of both. The Squeeze-and-Excitation (SE) attention mechanism models inter-channel dependencies to extract channel-wise attention [35]. However, the SE mechanism overlooks critical spatial information in the image, limiting the improvement in model accuracy to an extent. To address this concern, Woo et al. [36] proposed the Convolutional Block Attention Module (CBAM), which establishes cross-channel and cross-spatial relationships and then integrates the cross-dimensional attention weights into the input features. The implementation schematic of CBAM is illustrated in Figure 4.
The channel attention module is used to adjust the weights of each channel in the feature map, aiding the network in selecting relevant feature channels. This keeps the channel dimension unchanged while compressing the spatial dimensions. The input feature layer initially undergoes global average pooling and global maximum pooling. A shared fully connected layer processes the pooling results individually before being added together. Finally, the sigmoid activation function is applied to obtain the weight for each channel in the input feature layer, which is then multiplied by the original input feature layer. The expression for channel attention is as follows:
M_C(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big)
The spatial attention module is employed to adjust weights at different positions in the feature map, aiding in selecting relevant feature regions. It maintains the spatial dimension while compressing the channel dimension. The input feature layer computes the maximum and average values for each feature point’s channel, then stacks these two results. Convolution with a channel count of 1 is applied to adjust the channel count. Finally, the sigmoid activation function is applied to obtain the weight for each feature point in the input feature layer, which is then multiplied by the original input feature layer. The expression for spatial attention is as follows:
M_S(F) = \sigma\big(f^{7 \times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big)
where F is the input feature map, AvgPool denotes average pooling, MaxPool denotes maximum pooling, MLP is the shared fully connected layer module, \sigma is the sigmoid activation function, and f^{7 \times 7} denotes a 7 × 7 convolution.
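The two expressions above translate almost directly into PyTorch; the sketch below is a simplified illustration (the reduction ratio and layer details are assumptions), with the channel module applied before the spatial module as in CBAM:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """M_C(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))"""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3)))             # global average pooling -> MLP
        mx = self.mlp(x.amax(dim=(2, 3)))              # global max pooling -> MLP
        w = torch.sigmoid(avg + mx)[:, :, None, None]  # per-channel weights
        return x * w

class SpatialAttention(nn.Module):
    """M_S(F) = sigmoid(f_{7x7}([AvgPool(F); MaxPool(F)]))"""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)              # channel-wise average
        mx = x.amax(dim=1, keepdim=True)               # channel-wise maximum
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w
```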

2.4. Loss Function

The bounding box regression loss function is a crucial component of the object detection loss, and has a significant impact on the performance of object detection models [37]. The original bounding box loss function of the YOLOv8 network is the CIoU loss. CIoU takes into account the overlap area, the distance between the central points, and the aspect ratio of the width and height of the predicted box and the ground truth box; its loss function is defined as follows:
L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v
where IoU is the intersection over union between the predicted box and the ground truth box, b denotes the center point of the predicted box, b^{gt} denotes the center point of the ground truth box, c denotes the diagonal length of the smallest rectangular box covering both boxes, \rho(\cdot) denotes the Euclidean distance between b and b^{gt}, \alpha is a weight function, and v measures the consistency of the aspect ratios, with \alpha and v defined below.
\alpha = \frac{v}{(1 - IoU) + v}
v = \frac{4}{\pi^2} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^2
Most existing approaches, CIoU included, do not involve the image dimensions and therefore cannot optimize cases in which the predicted box and the ground truth box share the same aspect ratio but have completely different width and height values. As a result, we introduce a novel bounding box similarity metric, MPDIoU, based on the minimum point distance [38].

2.5. Multiple Object Tracking

Object tracking is a computer vision task that involves real-time localization and tracking of specific objects in video sequences. Multiple Object Tracking (MOT) is the task of detecting objects such as ships, pedestrians, and cars in a video sequence and assigning each a unique ID for trajectory tracking, without prior knowledge of the number of targets. MOT typically comprises a detector module and a data association module. With advancements in object detection techniques, 'tracking-by-detection' has emerged as one of the mainstream frameworks for MOT.
Zhang et al. [22] proposed a multi-object tracking model called ByteTrack based on object detection. This model retains all detected boxes and categorizes them into high-score and low-score detection boxes, performing tracking by associating every detection box rather than only the high-score ones. The ByteTrack model utilizes YOLOX as its detector module. A simple and efficient data association method called BYTE is used in the data association part. BYTE leverages the similarity between detection boxes and tracking trajectories. It retains high-score detection results while removing the background from low-score detection results, thereby uncovering genuine objects (e.g., challenging samples such as occluded or blurred instances). This approach reduces missed detections and enhances trajectory coherence. Specifically, BYTE first matches high-score boxes with the previous tracking trajectories and then matches low-score boxes with the tracking trajectories that were not matched with high-score boxes. BYTE creates a new tracking trajectory for any unmatched detection box with a sufficiently high score. Tracking trajectories without matched detection boxes are retained for 30 frames, and matching is attempted again when the object reappears.
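The association logic described here can be summarized in the following hypothetical sketch; the score thresholds and the helpers iou_match and start_track are placeholders rather than ByteTrack's actual API:

```python
def byte_associate(detections, tracks, high_thr=0.6, low_thr=0.1):
    """Simplified, pseudocode-level sketch of BYTE's two-stage association."""
    high = [d for d in detections if d.score >= high_thr]
    low = [d for d in detections if low_thr <= d.score < high_thr]

    # 1) Match high-score boxes to existing trajectories (e.g., by IoU with
    #    Kalman-predicted positions).
    matched, leftover_tracks, leftover_high = iou_match(tracks, high)

    # 2) Match low-score boxes to the trajectories left over from step 1,
    #    recovering occluded or blurred objects instead of discarding them.
    matched_low, still_unmatched_tracks, _ = iou_match(leftover_tracks, low)

    # 3) Unmatched high-score boxes start new trajectories; trajectories that
    #    found no detection are kept alive for up to 30 frames.
    new_tracks = [start_track(d) for d in leftover_high]
    for t in still_unmatched_tracks:
        t.mark_lost(max_age=30)

    return matched + matched_low, new_tracks
```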

3. Methodology

Our detector is an enhancement of the YOLOv8 network. The YOLOv8 project categorizes the network into five sizes (n, s, m, l, x) based on different combinations of depth, width, and maximum channel count. YOLOv8n was selected as the baseline because of its minimal parameter count and balanced detection performance. Building upon YOLOv8n, we incorporated the ideas of FasterNet, the attention mechanism, slim-neck, and MPDIoU to create an optimized model called YOLOv8-FAS. The overall architecture of the YOLOv8-FAS model is illustrated in Figure 5. YOLOv8-FAS enhances detection accuracy while reducing the parameter count and computational complexity. The detection results are fed into the ByteTrack tracker, whose performance depends on the detection accuracy, ultimately realizing real-time monitoring of surface ships.

3.1. Backbone

In this paper, we propose a lightweight yet solid feature extraction backbone. The primary architecture is depicted in Figure 5; the backbone section follows the feature extraction network structure of YOLOv8 while incorporating both the efficient design of FasterNet and the strong feature extraction capability of the attention mechanism. YOLOv8-FAS first employs ordinary 3 × 3 convolutional kernels with a stride of two for initial feature extraction. Drawing inspiration from FasterNet and the attention mechanism, we then modify the original CSP module to create a novel CSP-A module, illustrated in Figure 6.
The CSP-A module combines the FasterNet block and the CBAM attention mechanism, and has fewer parameters, lower computation, and higher feature extraction efficiency than the traditional CSP module. Its implementation involves doubling the channel count of the first convolutional layer and then splitting the convolutional output in half along the channel dimension. One half is fed into the FasterNet block for processing, after which the two halves are concatenated. The concatenated output is passed through the lightweight CBAM attention mechanism, further enhancing the extraction of image features. The structure and functioning of CBAM are detailed in Section 2 of this paper. The FasterNet block comprises an inverted residual structure consisting of a PConv layer and two 1 × 1 convolutional layers, together with batch normalization and ReLU activation layers. The FasterNet block is visualized in Figure 7.
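A rough PyTorch sketch of this block is given below, with the PConv step written inline; the expansion ratio and layer ordering are our assumptions:

```python
import torch
import torch.nn as nn

class FasterNetBlock(nn.Module):
    """Sketch of the FasterNet block used in CSP-A: a partial 3x3 convolution
    followed by two 1x1 convolutions with BN/ReLU and a residual shortcut."""
    def __init__(self, channels, n_div=4, expansion=2):
        super().__init__()
        self.c_part = channels // n_div                       # channels touched by PConv
        self.pconv = nn.Conv2d(self.c_part, self.c_part, 3, 1, 1, bias=False)
        hidden = channels * expansion
        self.pw = nn.Sequential(                              # two point-wise convolutions
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False))

    def forward(self, x):
        x1, x2 = torch.split(x, [self.c_part, x.size(1) - self.c_part], dim=1)
        y = torch.cat([self.pconv(x1), x2], dim=1)            # partial convolution
        return x + self.pw(y)                                 # inverted-residual shortcut
```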

3.2. Neck

Having improved the backbone, we also keep the neck section lightweight while enhancing its feature extraction capability. The primary task of the neck is to fuse and further extract features from the three effective feature layers obtained from the backbone, as illustrated in Figure 5. We introduce GSConv and a one-shot aggregation module based on GSConv, known as VoV-GSCSP, into the neck of YOLOv8. The structure of VoV-GSCSP is depicted in Figure 8. The feature maps received by the neck have the highest channel count and the smallest spatial dimensions, containing minimal redundant information; as a result, no further compression is needed, which makes GSConv particularly effective here for lightweight models. The flexible combination of the GSConv and VoV-GSCSP modules accelerates model inference, reduces computational cost, and concurrently enhances detection accuracy.
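As a rough structural sketch (reusing the GSConv sketch from Section 2.2, with branch widths and the number of bottlenecks as assumptions), a VoV-GSCSP-style block can be written as:

```python
import torch
import torch.nn as nn

class VoVGSCSP(nn.Module):
    """One-shot aggregation sketch: split into two branches, pass one branch
    through stacked GSConv bottlenecks, and aggregate both once at the end.
    Assumes the GSConv class sketched in Section 2.2 is in scope."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_mid = c_out // 2
        self.branch_a = nn.Conv2d(c_in, c_mid, 1, bias=False)     # shortcut branch
        self.branch_b = nn.Conv2d(c_in, c_mid, 1, bias=False)     # processed branch
        self.gs_bottlenecks = nn.Sequential(
            *[GSConv(c_mid, c_mid, kernel_size=3, stride=1) for _ in range(n)])
        self.fuse = nn.Conv2d(2 * c_mid, c_out, 1, bias=False)    # one-shot aggregation

    def forward(self, x):
        a = self.branch_a(x)
        b = self.gs_bottlenecks(self.branch_b(x))
        return self.fuse(torch.cat([a, b], dim=1))
```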

3.3. Loss Function

Computing the loss is a comparison between the predicted results of the network and the ground truth. The loss function of our model has the same structure as that of YOLOv8, consisting of regression and classification components: the regression component concerns the regression parameters (the bounding box) of each feature point, while the classification component determines the category of the object present at that feature point. Considering the advantages and disadvantages of existing BBR loss functions, and in line with the practical requirements of ship object detection, we incorporate MPDIoU [38] into our work, inspired by the geometric characteristics of rectangular boxes. MPDIoU is a loss function for efficient and accurate bounding box regression that encompasses all of the relevant factors considered in existing loss functions. It simplifies the calculation by minimizing the distances between the top-left and bottom-right corners of the predicted bounding box and the annotated bounding box, achieving accurate and efficient bounding box regression. MPDIoU is computed as follows:
d_1^2 = (x_1^B - x_1^A)^2 + (y_1^B - y_1^A)^2
d_2^2 = (x_2^B - x_2^A)^2 + (y_2^B - y_2^A)^2
MPDIoU = \frac{|A \cap B|}{|A \cup B|} - \frac{d_1^2}{w^2 + h^2} - \frac{d_2^2}{w^2 + h^2}
where A and B are two arbitrary convex shapes, w and h denote the width and height of the input image, (x_1^A, y_1^A) and (x_2^A, y_2^A) denote the top-left and bottom-right corner coordinates of A, and (x_1^B, y_1^B) and (x_2^B, y_2^B) denote the top-left and bottom-right corner coordinates of B.
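Implemented directly from these equations, MPDIoU can be computed for batches of corner-format boxes as in the sketch below (the box layout and the final loss form 1 − MPDIoU are our assumptions):

```python
import torch

def mpdiou(pred, target, img_w, img_h, eps=1e-7):
    """MPDIoU for boxes in (x1, y1, x2, y2) format with shape (N, 4),
    following the equations above; the loss would be 1 - mpdiou(...)."""
    # intersection and union areas
    ix1 = torch.max(pred[:, 0], target[:, 0])
    iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2])
    iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # squared distances between matching top-left and bottom-right corners
    d1 = (pred[:, 0] - target[:, 0]) ** 2 + (pred[:, 1] - target[:, 1]) ** 2
    d2 = (pred[:, 2] - target[:, 2]) ** 2 + (pred[:, 3] - target[:, 3]) ** 2

    diag2 = img_w ** 2 + img_h ** 2                    # squared image diagonal
    return iou - d1 / diag2 - d2 / diag2
```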

3.4. Ship Tracking

Building on this well-optimized detection network, we proceeded to select a suitable tracker. Compared to classical methods such as SORT and DeepSORT, ByteTrack exhibits superior performance and offers a more streamlined solution in practical applications. The related work on ByteTrack is described in Section 2 of this paper. The model remains simple and fast, as ByteTrack solely employs a motion model without relying on ReID features for appearance similarity calculations. However, this means that the tracking effectiveness relies heavily on the detection performance: when the detector performs well, the tracking outcome is favorable. Therefore, leveraging the optimized object detection algorithm proposed in this paper, we substitute the original detector by integrating the enhanced YOLOv8 with ByteTrack. This synergy yields a lightweight and efficient ship object detection and tracking model.

4. Experiments

In this section, we first introduce the evaluation metrics and experimental platform; then, the datasets are introduced; finally, we conduct ablation experiments and demonstrate the effectiveness and applicability of our model.

4.1. Evaluation Metrics

In order to clearly and objectively evaluate the effectiveness of algorithm improvement in terms of model weight reduction, we selected FLOPs and the number of parameters to evaluate model complexity and size. In terms of object detection accuracy, the evaluation indicators we chose were the Precision (P), Recall (R), Mean Average Precision (mAP), and P-R curve. The P-R curve uses the Recall and Precision as the horizontal and vertical coordinates, and can directly reflect the global performance of the model.
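For reference, with TP, FP, and FN denoting true positives, false positives, and false negatives, the metrics are computed as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
AP = area under the P-R curve for a single class
mAP = mean of the AP values over all classes, where mAP-0.5 counts a detection as a true positive when its IoU with the ground truth exceeds 0.5.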

4.2. Experimental Platform

The experiments were based on an Ubuntu 20.04 operating system, an NVIDIA RTX A4000 GPU, and an Intel(R) Xeon(R) Silver 4210R CPU @ 2.39 GHz. The deep learning framework was PyTorch 1.12.1, the programming language was Python 3.8.16, and the GPU acceleration library was CUDA 11.4. The number of iterations for model training was set to 300 and the batch size to 16. Optimization used the SGD optimizer with momentum set to 0.937.
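For orientation only, the stock YOLOv8n baseline can be trained with these settings through the Ultralytics API roughly as follows; our modified YOLOv8-FAS modules are not part of the stock package, and 'ships.yaml' is a placeholder for the dataset configuration file:

```python
# Sketch of the baseline training setup; dataset path is a placeholder.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")          # stock YOLOv8n weights as the starting point
model.train(
    data="ships.yaml",              # placeholder dataset description file
    epochs=300,
    batch=16,
    imgsz=640,
    optimizer="SGD",
    momentum=0.937,
)
```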

4.3. Dataset

The experimental data used for model training in this paper were taken from actual river ship data collected by fixed-point cameras on the shore. Referring to the public Seaships dataset [39] and the evaluation index requirements, we extracted frames from the videos and labeled them, obtaining 3824 ship images. The dataset contains four ship target categories: passenger ship, yacht, bulk carrier, and general cargo ship. The characteristics of the dataset are as follows:
(1) The image backgrounds are highly complex and cluttered, including but not limited to nearshore buildings;
(2) The size differences between ship targets in the images are significant, and small targets are difficult to identify;
(3) Ships of similar categories can differ considerably in appearance, with the bulk carrier being the most complex.
Figure 9 shows example images from the datasets.
We used LabelImg 1.8.6 to label the datasets, with the training set accounting for 70% of the images, the validation set for 10%, and the test set for 20%. We used rectangular boxes to mark the ship objects. The label information includes the corresponding image name, ship category, bounding box position, etc., and was generated as XML files in PASCAL VOC format.
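A split in these proportions can be generated with a short script such as the hypothetical sketch below; the directory layout and file names are placeholders:

```python
import random
from pathlib import Path

random.seed(0)
images = sorted(Path("datasets/ships/images").glob("*.jpg"))  # placeholder path
random.shuffle(images)

n = len(images)
n_train, n_val = int(0.7 * n), int(0.1 * n)                   # 70% / 10% / 20% split
splits = {
    "train": images[:n_train],
    "val": images[n_train:n_train + n_val],
    "test": images[n_train + n_val:],
}
for name, files in splits.items():
    Path(f"datasets/ships/{name}.txt").write_text("\n".join(str(p) for p in files))
```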

4.4. Ablation Experiments

We conducted ablation experiments to verify the effectiveness and reliability of the improved detection model and to explore the specific contribution of each improvement strategy to its optimization. The models used in these experiments were trained on our private dataset, and experimental conditions such as equipment, environment, and the number of iterations were kept the same across all experiments. We selected the number of parameters and FLOPs as measures of model size and complexity, and mAP-0.5 as the measure of detection accuracy. The experimental results are shown in Table 1, where '+' indicates that the corresponding improvement strategy is applied.
The YOLOv8-FA model replaces the backbone of YOLOv8 with a feature extraction network that combines FasterNet and the CBAM attention mechanism. The number of model parameters is reduced by 0.4 × 10^6, the FLOPs are reduced by 1.0 × 10^9, and the mAP-0.5 is increased by 0.4 percentage points. These results indicate that the enhancement strategy applied to the backbone reduces the number of model parameters and the computational load while enhancing the feature extraction capability, resulting in improved detection accuracy. Incorporating the slim-neck strategy in the feature fusion network produces the YOLOv8-S model, which reduces the number of model parameters by 0.2 × 10^6 and the FLOPs by 0.8 × 10^9 while increasing mAP-0.5 by 0.2 percentage points. Consequently, the combination of GSConv and VoV-GSCSP optimizes the lightweight design of the detection network's neck, resulting in more effective feature fusion and enhanced feature extraction while also reducing the model size and computational complexity. Finally, introducing the MPDIoU loss, based on the minimum point distance, into YOLOv8 improves the detection accuracy by 0.2 percentage points.
In summary, the results of the ablation experiments show that the multiple improvements to the backbone and neck increase the model's detection accuracy while keeping it lightweight. On this basis, further optimizing the loss function effectively improves the model's accuracy and inference quality. These improvements culminate in the final YOLOv8-FAS detection model, which achieves both light weight and high detection accuracy.

4.5. Validation of the Improved Model

In order to verify the effectiveness of the YOLOv8-FAS algorithm proposed in this paper, it was compared with the original YOLOv8 model. This experiment used the same datasets for YOLOv8-FAS and the traditional YOLOv8 model, keeping the same parameters; the input image size was 640 × 640, the number of epochs was 100, the batch size was 16, and the other parameters were the same as well. In order to measure the weight and detection accuracy of the models more objectively, we examined and compared multiple indicators. The detection results of the improved YOLOv8-FAS model and the original model YOLOv8 on our datasets are shown in Table 2.
From the data comparison in Table 2, it can be seen that the detection accuracy of the YOLOv8-FAS model is clearly improved compared with the traditional YOLOv8: the mAP-0.5 is increased by 0.9 percentage points and the mAP-0.5:0.95 by 3.7 percentage points.
At the same time, the YOLOv8-FAS model further reduces the computation and parameter count of the already lightweight YOLOv8n model. Compared with the original model, the FLOPs of YOLOv8-FAS are reduced from 8.1 × 10^9 to 6.3 × 10^9 and the number of parameters by 20%. This light weight is notable: YOLOv8-FAS enhances the accuracy of ship detection while meeting the requirements of a lightweight design, making it hardware-friendly and facilitating subsequent application of the detection results.
Figure 10 compares the P-R curves before and after the improvements to the YOLOv8 algorithm. The P-R curve reflects a model's performance, with P standing for precision and R for recall; the point where P = R is the Break-Even Point (BEP). The larger the area under the P-R curve and the larger the value of the balance point, the better the performance of the learner. The P-R curve of YOLOv8-FAS encloses a larger area with the two coordinate axes, and its BEP is closer to the point (1,1). Based on these comparisons, it can be concluded that the improved YOLOv8-FAS ship detection model exhibits better overall performance.
Figure 11 shows the real-time detection results of the YOLOv8 model before and after improvement for ships in different situations, together with the predicted bounding boxes, ship types, and confidence scores. We set the IoU threshold to 0.7. Figure 11a,b shows the detection results of YOLOv8 and YOLOv8-FAS for two ships of different types with similar appearances, one of which is partially occluded. The results show that while the original model can correctly identify the two ships, the position of the detection box is not accurate enough for the partially occluded ship. The improved model correctly identifies the two ships with accurate localization, and its confidence score for ship detection is 7% higher than that of the original model. Figure 11c,d shows the detection results of YOLOv8 and YOLOv8-FAS for severely occluded ships. The original YOLOv8 model fails to detect the occluded general cargo ship, while the improved YOLOv8-FAS model successfully identifies all ships, providing accurate category and location information. Figure 11e,f shows the respective detection results of YOLOv8 and YOLOv8-FAS for small and incomplete objects in the image. Both models accurately identify the small bulk carrier and the incomplete general cargo ship; however, the original model suffers from false positives, misidentifying a bridge as a yacht due to interference from coastal buildings in the background. A similar problem exists in Figure 11g, where the original YOLOv8 model is disturbed by the shoreline and falsely detects a bulk carrier in the background. YOLOv8-FAS avoids both of these problems; as shown in Figure 11h, it accurately identifies the number, types, and locations of multiple ships with high confidence scores. The detection results in Figure 11 demonstrate that the proposed YOLOv8-FAS model achieves a light model weight with high detection accuracy under varied circumstances, and that it can significantly reduce the rates of missed and false ship detections. Overall, YOLOv8-FAS exhibits superior performance on the ship dataset.
In order to further assess the model’s performance, we conducted experiments on the Seaships dataset [39], with the improved model obtaining a mean average precision of 98.9%. The specific results are shown in Table 3.
After optimizing the detection network, we fed the more accurate detection results into the ByteTrack tracker, which relies on the detector’s accuracy, for real-time ship tracking tests. We considered a variety of monitoring situations, including partial occlusion, scale change, multiple targets, and camera movement. Selected test result images are shown in Figure 12. The tracking results shown in the figure include the number of tracking targets, the ID assigned to each target, the category of the tracking targets, and the confidence level.
The test results show that the frame rate of the model when tracking ships in video exceeds 60 frames per second. Compared with the 25 frames per second of the input video, the object tracking model proposed in this paper fully meets the needs of real-time ship monitoring. Even in cases of occlusion, interference from the water surface, incomplete display of a ship, or an uncertain number of ship types, the corresponding ships can be accurately positioned and tracked. The video tracking results provide a real-time display of target IDs, ship categories, and ship tracking accuracy, with the multiple object tracking precision (MOTP) exceeding 88%. MOTP is an evaluation metric used to measure the accuracy of multiple object tracking algorithms. It averages the localization error between all matched targets and their corresponding predicted positions over a video sequence, analyzing the positions of targets across consecutive frames to evaluate the performance of the tracking algorithm. Nevertheless, there is still room for improvement in detecting and tracking ships. Considering both the real-time performance and the accuracy of tracking, this model is well-suited for real-time ship monitoring and tracking applications. Although it performed well on the ship dataset, further evaluation and validation are needed to determine its performance in additional scenarios.
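For reference, MOTP is commonly defined as

MOTP = \frac{\sum_{t,i} d_{t,i}}{\sum_{t} c_t}

where d_{t,i} is the localization error of matched object i in frame t and c_t is the number of matched objects in frame t; in overlap-based implementations the error term is the bounding-box overlap of each match, so higher values indicate better localization.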
In summary, our optimized YOLOv8-FAS model further reduces the number of parameters and the amount of computation, making it very friendly to hardware devices with limited memory and computing resources. The proposed improvements increase the model's detection accuracy while reducing its size, and greatly reduce both its missed detection rate and its false positive rate. Integrating YOLOv8-FAS with the high-performance ByteTrack tracker yields excellent tracking results, meeting the practical engineering demands of real-time ship monitoring. Therefore, the proposed model holds significant practical value in real-time ship monitoring tasks.

5. Conclusions

In this paper, we have proposed a lightweight approach called YOLOv8-FAS for real-time ship monitoring. First, an efficient FasterNet module coupled with an attention mechanism was integrated into the backbone network for feature extraction, achieving a lightweight model with enhanced feature extraction capabilities. Second, the lightweight convolution method GSConv and a one-shot aggregation module were introduced in the feature enhancement and fusion stage to construct an efficient neck network, further improving detection speed and accuracy. In addition, we introduced MPDIoU, a loss function based on the minimum point distance and the geometric characteristics of bounding boxes, which leads to faster convergence and more accurate regression results. Finally, the advanced ByteTrack tracker was introduced to accomplish real-time ship detection and tracking. Compared to the conventional lightweight YOLOv8n detection network, YOLOv8-FAS reduces computational complexity from 8.1 × 10^9 to 6.3 × 10^9 FLOPs and reduces the model parameters by 20%, to only 2.4 × 10^6. YOLOv8-FAS achieves a detection precision of 98.50% in terms of mAP-0.5, an improvement of 0.9 percentage points, and a 3.7 percentage point increase in mAP-0.5:0.95. The real-time frame rate for detection-based ship tracking surpasses 60 frames/s, significantly exceeding the typical video input frame rate of 25 frames/s. The model provides real-time object IDs, ship types, positions, and counts, and maintains a multi-object tracking precision of over 88%. The verification results on the datasets described in this paper show that YOLOv8-FAS has good overall performance, achieving an effective balance between light weight and high precision. It accurately performs real-time ship detection and tracking, and can be deployed on devices with limited memory and computational resources. In future research, we intend to further optimize the object detection and tracking models to enhance their simplicity, speed, and efficiency, and to deploy them on resource-constrained devices such as unmanned surface ships.

Author Contributions

Conceptualization, B.X. and W.W.; Methodology, B.X. and W.W.; formal analysis, B.X., W.W. and J.Q.; data curation, C.P. and Q.L.; software, C.P.; writing—original draft preparation, B.X. and W.W.; writing—review and editing, W.W.; supervision, B.X., J.Q. and C.P.; project administration, B.X. and Q.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Shanghai Science and Technology Committee (STCSM) Local Universities Capacity-Building Project (No. 22010502200).

Data Availability Statement

The data are available on request.

Acknowledgments

The authors would like to express their gratitude for the support of the Fishery Engineering and Equipment Innovation Team of Shanghai High-Level Local University and Daishan County Transportation Bureau.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bauwens, J. Datasharing in Inland Navigation. In PIANC Smart Rivers 2022: Green Waterways and Sustainable Navigations; Springer Nature: Singapore, 2023; pp. 1353–1356. [Google Scholar]
  2. Wu, Z.; Woo, S.H.; Lai, P.L.; Chen, X. The economic impact of inland ports on regional development: Evidence from the Yangtze River region. Transp. Policy 2022, 127, 80–91. [Google Scholar] [CrossRef]
  3. Zhou, J.; Liu, W.; Wu, J. Strategies for High Quality Development of Smart Inland Shipping in Zhejiang Province Based on “Four-Port Linkage”. In PIANC Smart Rivers 2022: Green Waterways and Sustainable Navigations; Springer Nature: Singapore, 2023; pp. 1409–1418. [Google Scholar]
  4. Zhang, J.; Wan, C.; He, A.; Zhang, D.; Soares, C.G. A two-stage black-spot identification model for inland waterway transportation. Reliab. Eng. Syst. Saf. 2021, 213, 107677. [Google Scholar] [CrossRef]
  5. Deo, N.; Trivedi, M.M. Convolutional social pooling for vehicle trajectory prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1468–1476. [Google Scholar]
  6. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Part I 14. Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  7. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  8. Lin, Z.; Ji, K.; Leng, X.; Kuang, G. Squeeze and excitation rank faster R-CNN for ship detection in SAR images. IEEE Geosci. Remote Sens. Lett. 2018, 16, 751–755. [Google Scholar] [CrossRef]
  9. Zhang, X.; Wang, H.; Xu, C.; Lv, Y.; Fu, C.; Xiao, H.; He, Y. A Lightweight Feature Optimizing Network for Ship Detection in SAR Image. IEEE Access 2019, 7, 141662–141678. [Google Scholar] [CrossRef]
  10. Jie, Y.; Leonidas, L.; Mumtaz, F.; Ali, M. Ship Detection and Tracking in Inland Waterways Using Improved YOLOv3 and Deep SORT. Symmetry 2021, 13, 308. [Google Scholar] [CrossRef]
  11. Li, J.; Xu, C.; Su, H.; Gao, L.; Wang, T. Deep learning for SAR ship detection: Past, present and future. Remote. Sens. 2022, 14, 2712. [Google Scholar] [CrossRef]
  12. Xing, Z.; Ren, J.; Fan, X.; Zhang, Y. S-DETR: A Transformer Model for Real-Time Detection of Marine Ships. J. Mar. Sci. Eng. 2023, 11, 696. [Google Scholar] [CrossRef]
  13. Er, M.J.; Zhang, Y.; Chen, J.; Gao, W. Ship detection with deep learning: A survey. Artif. Intell. Rev. 2023, 56, 11825–11865. [Google Scholar] [CrossRef]
  14. Yun, J.; Jiang, D.; Liu, Y.; Sun, Y.; Tao, B.; Kong, J.; Tian, J.; Tong, X.; Xu, M.; Fang, Z. Real-time target detection method based on lightweight convolutional neural network. Front. Bioeng. Biotechnol. 2022, 10, 861286. [Google Scholar] [CrossRef]
  15. Jiao, L.; Zhang, F.; Liu, F.; Yang, S.; Li, L.; Feng, Z.; Qu, R. A survey of deep learning-based object detection. IEEE Access 2019, 7, 128837–128868. [Google Scholar] [CrossRef]
  16. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  17. Wang, C.Y.; Liao, H.Y.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 13–19 June 2020; pp. 390–391. [Google Scholar]
  18. Wang, C.Y.; Liao, H.Y.M.; Yeh, I.H. Designing Network Design Strategies Through Gradient Path Analysis. arXiv 2022, arXiv:2211.04800. [Google Scholar]
  19. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar] [CrossRef]
  20. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768. [Google Scholar]
  21. Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. Adv. Neural Inf. Process. Syst. 2020, 33, 21002–21012. [Google Scholar]
  22. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar]
  23. Loey, M.; Manogaran, G.; Taha, M.H.N.; Khalifa, N.E.M. Fighting against COVID-19: A novel deep learning model based on YOLO-v2 with ResNet-50 for medical face mask detection. Sustain. Cities Soc. 2020, 65, 102600. [Google Scholar] [CrossRef] [PubMed]
  24. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  25. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  26. Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; Kwon, Y.; Michael, K.; Fang, J.; Yifu, Z.; Wong, C.; Montes, D.; et al. Ultralytics/Yolov5: V7. 0-Yolov5 Sota Realtime Instance Segmentation. Zenodo. 2022. Available online: https://ui.adsabs.harvard.edu/abs/2022zndo...7347926J/abstract (accessed on 22 November 2022).
  27. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  28. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  29. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  30. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Wey, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  31. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589. [Google Scholar]
  32. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar]
  33. Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-neck by GSConv: A better design paradigm of detector architectures for autonomous vehicles. arXiv 2022, arXiv:2206.02424. [Google Scholar]
  34. Chen, J.; Kao, S.H.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.H. Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 12021–12031. [Google Scholar]
  35. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  36. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  37. Tian, Z.; Shen, C.; Chen, H.; He, T. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636. [Google Scholar]
  38. Siliang, M.; Yong, X. MPDIoU: A Loss for Efficient and Accurate Bounding Box Regression. arXiv 2023, arXiv:2307.07662. [Google Scholar]
  39. Shao, Z.; Wu, W.; Wang, Z.; Du, W.; Li, C. Seaships: A large-scale precisely annotated dataset for ship detection. IEEE Trans. Multimed. 2018, 20, 2593–2604. [Google Scholar] [CrossRef]
Figure 1. YOLOv8 model structure.
Figure 2. GSConv.
Figure 3. PConv.
Figure 4. CBAM.
Figure 5. YOLOv8-FAS model structure.
Figure 6. CSP-A module.
Figure 7. FasterNet block.
Figure 8. VoV-GSCSP.
Figure 9. Dataset examples.
Figure 10. P-R curves of the model before and after improvement.
Figure 11. Comparison of ship recognition images before and after model improvement.
Figure 12. Ship tracking.
Table 1. Effects of different improvement operations.

Model      | Backbone | Slim-neck | MPDIoU | Parameters/×10^6 | FLOPs/×10^9 | mAP0.5/%
YOLOv8     |          |           |        | 3.0              | 8.1         | 97.6
YOLOv8-FA  | +        |           |        | 2.6              | 7.1         | 98.0
YOLOv8-S   |          | +         |        | 2.8              | 7.3         | 97.8
YOLOv8-I   |          |           | +      | 3.0              | 8.1         | 97.8
YOLOv8-FAS | +        | +         | +      | 2.4              | 6.3         | 98.5
Table 2. Test results of the algorithm before and after improvement.

Detection Network | FLOPs     | Parameters | Precision/% | Recall/% | mAP0.5/% | mAP0.5:0.95/%
YOLOv8            | 8.1 × 10^9 | 3.0 × 10^6 | 98.0        | 94.4     | 97.6     | 81.2
YOLOv8-FAS        | 6.3 × 10^9 | 2.4 × 10^6 | 98.4        | 95.8     | 98.5     | 84.9
Table 3. Experiments on the Seaships public dataset.

Class              | Precision/% | Recall/% | mAP0.5/% | mAP0.5:0.95/%
all                | 97.6        | 97.5     | 98.9     | 78.8
ore carrier        | 100         | 96.8     | 99.4     | 81.0
passenger ship     | 94.4        | 96.7     | 97.9     | 74.2
general cargo ship | 96.5        | 96.7     | 98.3     | 74.6
bulk cargo carrier | 97.6        | 98.7     | 99.5     | 81.4
container ship     | 99.2        | 100      | 99.5     | 85.2
fishing boat       | 97.8        | 95.9     | 99.0     | 76.6

Xing, B.; Wang, W.; Qian, J.; Pan, C.; Le, Q. A Lightweight Model for Real-Time Monitoring of Ships. Electronics 2023, 12, 3804. https://doi.org/10.3390/electronics12183804