Improved Ship Detection with YOLOv8 Enhanced with MobileViT and GSConv

Abstract: In tasks that require ship detection and recognition, the irregular shapes of ships and complex backgrounds pose significant challenges. This paper presents an advanced extension of the YOLOv8 model to address these challenges. A lightweight visual transformer, MobileViTSF, is proposed and combined with the YOLOv8 model. To address the loss of semantic information that arises from inconsistent scales in the detection of small ships, a layer intended for the detection of small targets is introduced to improve the fusion of deep and shallow features. Furthermore, the traditional convolution (Conv) blocks are replaced with GSConv blocks, and a novel GSC2f block is designed for fewer model parameters and improved detection performance. Experiments on a benchmark dataset suggest that this new model can achieve significantly improved accuracy for ship detection with fewer model parameters and a reduced model size. A comparison with several other state-of-the-art methods shows that higher accuracy can be obtained for ship detection with this model. Moreover, this new model is suitable for edge computing devices, demonstrating practical application value.


Introduction
Ship target detection is beneficial for port management and maritime safety and has been extensively used in both the civil and military sectors. In the civilian sector, ship detection can be used for real-time monitoring of maritime traffic and also plays an important role in combating illegal fishing, pollution and smuggling. Meanwhile, in the military field, ship target detection can be used to determine whether a ship has crossed a border and to recognize other abnormal behaviors by detecting the ship's position, size and other information [1]. Currently, modern ships make extensive use of a wide range of sensors, including LIDAR (light detection and ranging), GPS (global positioning system) and AIS (automatic identification system). Through sensor fusion technology, they combine sensor data from different sources to accurately detect surrounding obstacles. However, with the development of inland transport modes and the increase in inland traffic density, traditional AIS for ships is facing challenges [2,3]. For example, although AIS can provide detailed information about nearby vessels, objects without transmitters, such as small boats and buoys, cannot be detected using AIS. In this case, it is more effective to use VTS (vessel traffic service) systems installed at coastal observatories or other key locations. VTS uses optical cameras to capture high-resolution images and video, allowing observers to remotely monitor vessel traffic in shipping lanes, ports, anchorages and critical sea areas. By monitoring and planning vessel movements in and out of busy waters, it reduces congestion and the risk of accidents and improves the safety and efficiency of shipping.
Ship target detection using computer vision is characterized by strong targeting, a large detection area and excellent performance in real-time applications. Currently, research on deep learning-based ship detection algorithms has made progress. Lee et al. [4] addressed the real-time detection problem of USVs (unmanned surface vehicles) and experimented on a self-constructed dataset with a proposed CNN (convolutional neural network) model; the proposed model was able to process 30 frames per second. Shao et al. [5] proposed a saliency-aware ship detection framework for images obtained with a network of surveillance cameras. The missed ship rate is reduced by extracting ship and background features via CNN, and salient region detection based on global contrast is utilized to segment coastlines and correct ship positions. Ting et al. [6] proposed a method based on YOLO (you only look once) v5 for ship detection. The method uses stacked networks for the extraction of features to reduce the model parameters. Li et al. [7] performed ship detection from visual images by resizing the preset anchors of YOLOv3 tiny, resizing the feature maps in the backbone and incorporating the CBAM (convolutional block attention module); a significant speedup was achieved. Xie et al. [8] developed a lightweight channel attention residual component and integrated it with other lightweight methods into the original YOLOv4 to remove a large number of parameters and improve the detection accuracy. In order to attain efficient and cost-effective ship target detection with rapid response, visible images, which are readily available and contain abundant information, can be utilized to extract and learn image features layer by layer. Moreover, deep learning can be employed to acquire features from an enormous amount of data, resulting in significant improvements [9,10].
In recent years, methods for target detection with deep learning models, such as YOLO, have achieved remarkable success in ship detection and recognition [11]. YOLOv8 is the most recently developed YOLO model; it is an efficient and fast approach that can perform target detection in one stage. It thus has the potential for applications that require ship detection in real time [12]. In this paper, we propose an improved YOLOv8 model that can detect and recognize ships both accurately and efficiently; the proposed model reduces the hardware cost of the application and facilitates its adoption. The major contributions of this work can be summarized as follows.
Firstly, in the backbone network of YOLOv8, a vision transformer structure called MobileViTSF (mobile-friendly vision transformer shuffle function) is proposed. A channel shuffle block is added to the global representation block of MobileViT (mobile-friendly vision transformer) [13,14] to improve the ability of the transformer encoder to accurately obtain semantic information, and some shortcut connections are removed to reduce operations.
Secondly, the P2 feature map detection layer is added to the neck network of YOLOv8 for better detection of small target ships, and the number of detection heads is increased to four. We follow the slim-neck [15] structure and use GSConv to further reduce the number of parameters; the detection accuracy can be maintained with reduced model complexity.
Thirdly, the performance of the proposed model is evaluated using the SeaShips dataset. Testing results show that the proposed method can perform ship detection with improved accuracy and satisfy lightweight requirements. We also compare it with other state-of-the-art methods and explore relevant scenarios to test the advantages of the proposed approach.
The remainder of this paper is organized as follows. Section 2 reviews conventional approaches for target detection, algorithms that use CNN for ship detection, and methods that detect ships with a visual transformer. Section 3 provides a description of the YOLOv8 model and the improvements proposed for the backbone and neck structures. Section 4 outlines the dataset and experimental details. Section 5 presents and analyzes the testing results. Finally, Section 6 summarizes and concludes this paper.

Algorithms Based on Feature Extraction
In 2004, Lowe [16] proposed SIFT (scale invariant feature transform) as an image descriptor based on image matching. In 2005, Dalal et al. [17] developed the HOG (histogram of oriented gradients) algorithm, which computes histograms of gradient directions in local regions for feature generation. Subsequently, Felzenszwalb et al. [18] developed the DPM (deformable part models) approach, where image features are handled with their excitation templates. The location of a target is determined based on the excitation distribution. However, target detection often requires the prediction of numerous redundant bounding boxes. To address this difficulty, Neubeck et al. [19] proposed the NMS (non-maximum suppression) algorithm for the elimination of redundant boxes. Approaches that detect objects with deep learning models have extensively utilized this idea. Traditional approaches for the detection of objects have various limitations and cannot provide satisfactory descriptions of crucial image features.
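The idea behind NMS can be illustrated with a minimal greedy sketch in plain Python. This is a generic textbook variant, not the specific algorithm of [19]; the box format `(x1, y1, x2, y2)`, the function names and the threshold of 0.5 are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score and keep a box only if
    it does not overlap an already kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one separate box (hypothetical values).
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # indices of the boxes that survive suppression
```

In this example the second box overlaps the first with an IoU of about 0.82, so only the higher-scoring one of the pair is kept.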

Algorithms Based on CNN
Mainstream approaches for the detection of targets with deep learning include approaches that detect targets in two stages and those that perform the detection in one stage. For example, Girshick et al. [20] designed the R-CNN (region-based convolutional neural network) algorithm for target detection. R-CNN employs a selective search algorithm to select candidate regions and then crops each candidate region and feeds it into a CNN to extract the features. Finally, a box that bounds the target can be determined with an SVM (support vector machine). In 2015, Girshick et al. [21] developed the Fast R-CNN model. This model performs a max-pooling operation on the ROI (region of interest), eliminating the cropping step; a classifier based on SVM and a bounding box predictor are then merged and trained based on a single loss function to speed up the detection process. Subsequently, Ren et al. [22] improved the Fast R-CNN model. In this work, RPN (region proposal network) was developed based on Fast R-CNN. RPN is able to simultaneously locate the bounding box of a candidate region and determine whether it contains a target, thus obtaining candidate regions with high confidence. In 2019, Pang et al. [23] proposed Libra R-CNN. It introduces a sampling strategy based on category weights, which balances the samples of each category in the training process. It decomposes the target detection task into a two-stage task including both regression and classification. Methods that detect targets in two stages often have the disadvantages of higher model complexity and increased computational effort because they usually need to obtain the candidate regions that may contain a target first and then perform the detection.
Therefore, one-stage target detection methods, with their fast computational speed, lightweight design and suitability for deployment, are widely used in industry. In 2016, Redmon et al. [24] proposed YOLO, where targets are detected by solving a regression problem. Specifically, the image is divided into a grid of cells, and the target's bounding box and category are predicted for each cell. YOLOv2 [25] introduces anchor boxes and applies batch normalization, which helps stabilize the training process and accelerate convergence. The YOLOv3 [26] network is one of the most classical target detection networks, which combines the advantages of ResNet (residual network) [27] and FPN (feature pyramid network) [28] structures. YOLOv4 [29] employs the more powerful BottleneckCSP (Bottleneck with Cross Stage Partial) in the backbone, introduces the advanced CIoU (complete intersection over union) loss function [30] for multiscale training and testing, and also uses data augmentation techniques. YOLOv5 continues to improve backbone performance using the C3 (BottleneckCSP with three convolutions) architecture. Users can set up automatic search and optimization of hyperparameters. A multi-scale detection strategy is used in the inference phase, allowing the use of four scales of detection heads. YOLOX [31] decouples the classification and localization heads, which means that the model independently predicts category labels and bounding box coordinates. An anchor-free approach is also used, which no longer relies on fixed anchor boxes but instead directly locates a target and determines its type with the network.
Moreover, recent research has developed several deep learning-based approaches that can detect targets with less computation time. In 2020, Google proposed EfficientDet [32]. Using BiFPN (bidirectional feature pyramid network), multi-scale feature fusion can be performed easily and quickly. In addition, a compound scaling method is proposed that simultaneously scales the backbone, the BiFPN, and the class and bounding box prediction networks. Cui et al. [33] proposed a dynamic module, which automatically selects different numbers of proposals for an input image based on its complexity; the model efficiency is improved to enhance the speed of two-stage target detection and instance segmentation. Jung et al. [34] used the ELU (exponential linear unit) activation function in a new version of YOLOv5, which speeds up the training process and achieves higher detection accuracy on the test set. Zhao et al. [35] proposed a spatial and channel attention-based data adaptive magnitude method to improve the adaptability in binary object detection. By improving the 1-bit convolution to achieve higher capability for representation, the network performance is significantly improved with few parameters.

Algorithms Based on Transformer
A transformer consists of an encoder and a decoder: the encoder generates a high-dimensional vector as the encoding result of an input sequence, and the decoder processes the encoder output and provides a target sequence as the result of decoding. The transformer [36] is a deep learning model proposed in 2017 by Vaswani et al. A transformer processes sequence data with the attention mechanism, which was originally designed for natural language processing. It enables a model to concentrate on different regions in an input sequence during sequence generation. This has led to applications in image processing. In 2020, Carion et al. [37] developed a novel framework, DETR (detection transformer), for the detection of targets, which uses a CNN to extract features and then obtains prediction results from a transformer without NMS post-processing steps, a priori knowledge or constraints such as anchors. Dosovitskiy et al. [38] proposed ViT (vision transformer), which divides an input image into several patches, converts the patches into a sequence, and then adds a token for classification; the sequence is processed by positional encoding and embedding layers to enable information fusion across different patches. ViT does not use CNN frameworks and constructs a visual classifier purely from the transformer encoder. In 2021, Liu et al. [39] designed the swin transformer, a hierarchical transformer architecture that computes self-attention within local windows and shifts the windows between layers to connect them, which enables efficient image processing. In 2022, Li et al. [40] proposed a backbone network, ViTDet (vision transformer detection), using a non-hierarchical structure. ViTDet does not use the FPN structure commonly used in CNN target detection algorithms but extracts features by fusing different feature maps. It adjusts the feature map size by up-sampling and down-sampling on the last layer to obtain scale-invariant features.

YOLOv8 Algorithm
The YOLOv8 model includes four major parts: input, backbone, neck and head (Figure 1). In practical applications, the image is scaled to meet the size requirements of the input. Data augmentation methods such as mosaic and mixup can also be used in this step. Convolutions in the backbone down-sample the input to extract features, and each convolution is followed by batch normalization and the SiLU (sigmoid linear unit) activation function. YOLOv8 uses the C2f (CSPBottleneck with 2 convolutions) block to further extract features, which references the E-ELAN (extended efficient layer aggregation network) structure of YOLOv7 [41], enriching the model gradient flow with more branching connections that cross layers to improve detection results. The shortcut connection is enabled for C2f blocks in the backbone and disabled for C2f blocks in the neck. The SPPF (spatial pyramid pooling fast) block serves as the end of the backbone and uses three max-pooling layers to collectively process features at various scales to enhance the network's capability for feature abstraction. FPN and PAN (path aggregation network) [42] structures are utilized by the neck network to fuse information from feature maps of various sizes and pass them to the head. YOLOv8 uses a decoupled detection head to compute losses separately for regression and classification through two parallel branches of convolutions. Each branch is thus able to concentrate on its own task. As a result, the model can generate detection results with improved accuracy.
YOLOv5 has preset anchors to help with positive and negative sample assignment. It has three loss functions: a regression loss obtained via CIoU, a confidence loss and a class loss based on BCE (binary cross entropy). YOLOv8 is anchor-free and uses the task-aligned assigner [43] to choose positive samples with scores calculated from a weighted combination of classification and regression quality, as shown in Equation (1).
$$t = s^{\alpha} \times u^{\beta} \tag{1}$$
where $s$ is the predicted value obtained based on the category of the label; $u$ is the IoU obtained with the predicted box and labeled box; $\alpha$ and $\beta$ are weight values. The class loss for YOLOv8 remains the BCE loss, as shown in Equation (2).
$$L_{BCE} = -\left[ y_n \log x_n + (1 - y_n) \log (1 - x_n) \right] \tag{2}$$
where $x_n$ denotes the predicted value; $y_n$ denotes the labeled value; $n$ denotes the sample number.
There are two types of regression loss, CIoU loss and DFL (distribution focal loss) [44]. CIoU loss is calculated with Equations (3)-(5).
$$L_{CIoU} = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v \tag{3}$$
$$v = \frac{4}{\pi^{2}} \left( \arctan \frac{w^{gt}}{h^{gt}} - \arctan \frac{w}{h} \right)^{2} \tag{4}$$
$$\alpha = \frac{v}{(1 - IoU) + v} \tag{5}$$
where $\alpha$ is the parameter used for a trade-off; $v$ is the parameter that evaluates the consistency of the aspect ratio, with $w$, $h$ and $w^{gt}$, $h^{gt}$ the widths and heights of the predicted box and the labeled box. $b$ and $b^{gt}$ represent the centers of the predicted box and the labeled box, respectively. $\rho$ represents the Euclidean distance that separates the two center points, and $c$ represents the length of the diagonal of the minimum enclosing rectangle that contains both the predicted box and the labeled box. DFL works by enabling the model to rapidly concentrate on locations near the target. It uses the distance from a point within the labeled box to the four edges of the predicted box as the value for regression. DFL assigns lower loss values to locations around the target location $y$ to allow the network to quickly focus on pixels in the neighborhood of the target location. DFL is calculated based on Equation (6).
$$DFL(S_i, S_{i+1}) = -\left( (y_{i+1} - y) \log S_i + (y - y_i) \log S_{i+1} \right) \tag{6}$$
where $S_i = \frac{y_{i+1} - y}{y_{i+1} - y_i}$ and $S_{i+1} = \frac{y - y_i}{y_{i+1} - y_i}$; $y$ is the labeled location of the target to be detected; $y_i$ and $y_{i+1}$ are the discrete locations immediately to the left and right of $y$ ($y_i \le y \le y_{i+1}$).
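The quantities described above (the task-alignment score, the BCE class loss, the CIoU loss and DFL) can be sketched in plain Python for a single prediction. This is a minimal illustration under stated assumptions, not the training code used in the paper: the function names are hypothetical, boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the defaults `alpha=0.5`, `beta=6.0` are assumed values for the weights of the alignment score, which the text does not specify.

```python
import math


def alignment_score(s, u, alpha=0.5, beta=6.0):
    """Task-alignment score t = s^alpha * u^beta (weights are assumptions)."""
    return (s ** alpha) * (u ** beta)


def bce(x, y, eps=1e-12):
    """Binary cross entropy for one prediction x in (0, 1) and label y."""
    x = min(max(x, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(x) + (1.0 - y) * math.log(1.0 - x))


def ciou_loss(box, gt):
    """CIoU loss: 1 - IoU + rho^2 / c^2 + alpha * v."""
    w, h = box[2] - box[0], box[3] - box[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    iw = max(0.0, min(box[2], gt[2]) - max(box[0], gt[0]))
    ih = max(0.0, min(box[3], gt[3]) - max(box[1], gt[1]))
    inter = iw * ih
    iou = inter / (w * h + wg * hg - inter)
    # Squared distance between the two box centers.
    rho2 = ((box[0] + box[2] - gt[0] - gt[2]) / 2.0) ** 2 \
         + ((box[1] + box[3] - gt[1] - gt[3]) / 2.0) ** 2
    # Squared diagonal of the smallest rectangle enclosing both boxes.
    cw = max(box[2], gt[2]) - min(box[0], gt[0])
    ch = max(box[3], gt[3]) - min(box[1], gt[1])
    c2 = cw ** 2 + ch ** 2
    # Aspect-ratio consistency term and its trade-off weight.
    v = (4.0 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(w / h)) ** 2
    alpha = v / ((1.0 - iou) + v) if v > 0 else 0.0
    return 1.0 - iou + rho2 / c2 + alpha * v


def dfl(y, y_i, y_i1, s_i, s_i1):
    """Distribution focal loss for a target y lying between y_i and y_i1."""
    return -((y_i1 - y) * math.log(s_i) + (y - y_i) * math.log(s_i1))


perfect = ciou_loss((0, 0, 10, 10), (0, 0, 10, 10))
shifted = ciou_loss((2, 2, 12, 12), (0, 0, 10, 10))
best_dfl = dfl(2.3, 2, 3, 0.7, 0.3)   # probabilities matching the target
other_dfl = dfl(2.3, 2, 3, 0.5, 0.5)  # probabilities that ignore the target
print(perfect, shifted, best_dfl, other_dfl)
```

A perfectly matching box gives a CIoU loss of zero, and DFL is smallest when the two probabilities are proportional to how close the target lies to each neighboring location.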
We opted for YOLOv8 for several reasons. First, the YOLO family has consistently exhibited strong performance in target detection tasks, making it a natural choice for our ship target detection problem. YOLOv8, in particular, offers real-time detection capabilities, which are of paramount importance in applications pertaining to public safety and emergency response. This has been corroborated by the successful track record of previous iterations of YOLO across diverse target detection scenarios. Second, YOLOv8 is a well-established methodology supported by a thriving user community, which provides readily accessible implementation resources. Third, YOLOv8 is easy to deploy, with multiple versions available.

Improved YOLOv8
In ship detection and monitoring scenarios, the YOLOv8 algorithm still suffers from insufficient accuracy and a model that is not compact enough. In this paper, the lightest variant, YOLOv8n, is chosen as the baseline network for improvement. Figure 2 shows the improved network structure. The proposed model replaces the backbone network used for feature extraction with a combination of MobileViTSF and MobileNetv2 [45]. MobileViTSF is a novel visual transformer structure that we propose based on MobileViT and MobileViTv3. MobileNetv2 is a classical lightweight backbone structure that uses depthwise separable convolution as a down-sampling module. For the neck network used for feature fusion, we add a P2 layer that has greater resolution and, correspondingly, increase the number of detection heads. We use GSConv and depthwise convolution in the bottleneck structure of the C2f block to reduce the computational effort.

MobileViTSF Block
MobileViT is a hybrid CNN-ViT strategy. The general ViT structure is as follows: first, a number of patches are obtained based on a partition of the input image; each individual patch is then mapped into a one-dimensional vector via a linear transformation, which is regarded as a token. Then, position information with learnable parameters is added, and the tokens are processed by a stack of transformer layers; finally, the prediction output is obtained by a fully connected layer, and the target is classified according to the class token. The MobileViT structure is shown in Figure 3. For a given input feature map $X \in \mathbb{R}^{H \times W \times C}$, its local spatial information is encoded by a convolutional layer with a 3 × 3 convolutional kernel, and then a projection is applied to map the tensor to a high d-dimensional space with a 1 × 1 convolutional layer. At this point, the feature map is $X_L \in \mathbb{R}^{H \times W \times d}$. Global feature modeling is then performed through the transformer structure; in order to enable MobileViT to obtain global representations that contain spatial inductive bias, it unfolds $X_L$ into N non-overlapping flat patches $X_U \in \mathbb{R}^{P \times N \times d}$, where $P = w \cdot h$ is the number of pixels in a patch, $N = (H \cdot W)/P$ is the number of patches, h is the patch height, and w is the patch width. For each position $p \in \{1, \ldots, P\}$, $X_U(p)$ collects the pixel features at the p-th position of every patch, and the inter-patch relationship is encoded by the transformer to obtain the global feature sequence $X_G \in \mathbb{R}^{P \times N \times d}$, as shown in Equation (7).
$$X_G(p) = \mathrm{Transformer}(X_U(p)), \quad 1 \le p \le P \tag{7}$$
Unlike the original transformer, MobileViT retains information on both the patch order and the spatial order of the pixels a patch contains (Figure 4). Therefore, we can fold $X_G \in \mathbb{R}^{P \times N \times d}$ to obtain $X_F \in \mathbb{R}^{H \times W \times d}$. $X_F$ is then projected back to the original low C-dimensional space using a convolution kernel of 1 × 1 and concatenated with the input $X$ via a shortcut branch. These concatenated feature maps are then fused with another 3 × 3 convolutional layer for the final output.
Note that when the unfolded feature map $X_L$ is spread into a single sequence and input into the transformer, self-attention relates each pixel in the feature map to all of the other pixels, so that the amount of computation is W · H · d. In MobileViT, patches are first generated based on a partition of the input feature map, and then the self-attention is calculated only for pixels in the same location of each individual patch, as shown in Figure 4. Here, a patch is represented by a cell surrounded by black lines, while a pixel is represented by a cell surrounded by gray lines. A transformer is utilized to associate the red pixel in the center with the blue pixels. Because the information of their neighboring pixels has already been encoded into the blue pixels by convolution, the transformer is able to incorporate the pixel information of all patches in Figure 4. In this case, the amount of computation is reduced, because each pixel attends only to the N = (H · W)/P pixels at the same position in the other patches. Since $X_U(p)$ uses convolution to encode localized information from within a 3 × 3 region, and $X_G(p)$ encodes global information for the p-th position of each patch, any pixel included in $X_G$ contains information obtained based on all pixels from $X$.
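The unfold and fold steps and the effect of restricting attention to same-position pixels can be sketched in plain Python. This is an illustrative sketch rather than the MobileViT implementation: the nested-list layout, the function names and the 4 × 4 example are assumptions, and the pair counts below simply count pixel-to-pixel interactions instead of reproducing the cost expressions in the text.

```python
def unfold(feature, h, w):
    """Rearrange an H x W map into P x N: x_u[p][n] is the pixel at
    position p inside patch n (P = h * w pixels per patch, N patches)."""
    H, W = len(feature), len(feature[0])
    assert H % h == 0 and W % w == 0, "map must tile evenly into patches"
    n_cols = W // w
    P, N = h * w, (H // h) * n_cols
    x_u = [[None] * N for _ in range(P)]
    for r in range(H):
        for c in range(W):
            p = (r % h) * w + (c % w)            # position inside the patch
            n = (r // h) * n_cols + (c // w)     # index of the patch
            x_u[p][n] = feature[r][c]
    return x_u


def fold(x_u, H, W, h, w):
    """Inverse of unfold: put every pixel back at its spatial location."""
    n_cols = W // w
    feature = [[None] * W for _ in range(H)]
    for p, row in enumerate(x_u):
        for n, value in enumerate(row):
            r = (n // n_cols) * h + p // w
            c = (n % n_cols) * w + p % w
            feature[r][c] = value
    return feature


H, W, h, w = 4, 4, 2, 2
# Each "pixel" stores its own coordinates so the rearrangement is visible.
feature = [[(r, c) for c in range(W)] for r in range(H)]
x_u = unfold(feature, h, w)
P, N = len(x_u), len(x_u[0])

full_pairs = (H * W) ** 2   # every pixel attends to every pixel
patch_pairs = P * N ** 2    # attention only among same-position pixels
print(x_u[0], full_pairs, patch_pairs)
```

Because the patch order and the pixel order are both preserved, folding the unfolded map recovers the original layout, and in this toy setting the number of attended pixel pairs shrinks by a factor of P.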
Our MobileViTSF is a simple and effective modification of MobileViT, as shown in Figure 3c. Its biggest modification is the addition of a channel shuffle block in the global representation block. We expect the shuffle operation applied to the transformer output to compensate for the degradation of detection performance caused by depthwise separable convolution, while improving the expression of features. Following MobileViTv3, we use a depthwise convolutional layer to replace the convolutional layer in the local representation block. We also use 1 × 1 convolutional layers instead of 3 × 3 convolutional layers to simplify the structure of the fusion block. Compared with MobileViT, our method maintains high detection performance with fewer FLOPs and parameters.

Small Target Detection Layer
Because some targets are small and YOLOv8 has a large down-sampling rate, obtaining feature information for small targets from deeper feature maps is difficult, so the original YOLOv8 model has poor detection ability for small targets. The original model has a detection scale of 80 × 80 for small targets, and the receptive field of each grid cell is 8 × 8 pixels. In cases where the target in the input image has a height and width no larger than 8 pixels, the original network has difficulties recognizing the feature information of the target in the grid [46]. Therefore, as Figure 2 shows, the proposed model contains a small target detection layer. A 160 × 160 layer is added to the original network, which includes a complementary fusion feature layer as well as an additional detection head, in order to obtain more accurate feature representation and semantic information for small targets. Firstly, the feature layer with a scale of 80 × 80 (P3) continues to be stacked in the neck, and a feature layer with deep semantic feature information of small targets is constructed after the up-sampling process. It is then concatenated with the 160 × 160 (P2) feature layer produced by MobileNetv2 in the fourth layer of the backbone, to supplement and enhance the expression capability of the fusion feature layer for the location information and semantic features of small targets. Then, the 160 × 160 fusion feature layer is passed through the C2f block to improve the expression capability for the semantic features and location information of small targets. After the C2f block, the fusion feature layer with a scale of 160 × 160 is sent to the next Conv (convolution) block and to an additional detection head.
The addition of the 160 × 160 scale layer allows the feature information of small targets to continue to be passed along the down-sampling path to the other three scale feature layers, thus strengthening the capability of the model for feature fusion and improving the accuracy of small target detection. Introducing an additional decoupled head expands the detection range for small ships. The improvement in detection accuracy as well as range allows the network to more accurately recognize ships in the river channel.
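The relation between the detection scales and the size of the region covered by one grid cell can be checked with a few lines of Python. The sketch assumes a 640 × 640 input, which is consistent with the 80 × 80 grid at a stride of 8 mentioned above but is not stated explicitly in the text; the stride values for P4 and P5 and the helper name are likewise assumptions.

```python
import math


def cells_covered(target_size, stride):
    """Grid cells spanned along one axis by a target of the given size
    (in pixels) on a feature map with the given stride."""
    return max(1, math.ceil(target_size / stride))


input_size = 640  # assumed input resolution
strides = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}

# Grid resolution of every detection head for the assumed input size.
grids = {name: input_size // s for name, s in strides.items()}

# An 8-pixel-wide ship spans one cell at P3 but two cells at P2.
span_p3 = cells_covered(8, strides["P3"])
span_p2 = cells_covered(8, strides["P2"])
print(grids, span_p3, span_p2)
```

With the extra P2 head, a target that previously fell inside a single P3 cell is described by several cells, which is the intuition behind the added 160 × 160 layer.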

GSConv
Slim-neck was first applied to vision systems for unmanned vehicles. Its core idea is to keep the excellent and reliable backbone while slimming the neck, which leads to a reduced model size and maintains the detection accuracy. The structure of GSConv in Slim-neck is shown in Figure 5; it incorporates the lightweight ideas of GhostNet [47] and ShuffleNetv2 [48]. GhostNet mainly addresses the problem that the output of standard convolution usually has many similar feature maps, which brings redundancy to the computation. Instead of a single standard convolution, ghost convolution from GhostNet first uses standard convolution to obtain the first part of the output. This part is then processed by depthwise convolution to obtain several similar feature maps as the second part. Finally, the two parts are concatenated together as the output feature map. ShuffleNetv2 mainly addresses the fact that the channels of the input feature maps are processed separately in depthwise separable convolution, so that no information is exchanged between the channels. Channel shuffle rearranges the feature maps along the channel dimension, allowing the information in different channels to interact at a low computational cost. In this paper, as shown in Figure 2, the GSC2f structure, a C2f block built from GSConv, is further designed on the basis of GSConv. Specifically, GSConv blocks are used instead of the Conv blocks in the neck network, and the original C2f blocks are replaced with the GSC2f blocks for reduced computational complexity.
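The parameter saving and the shuffle step described above can be illustrated with a small Python sketch. It is a rough estimate under stated assumptions, not the Slim-neck implementation: bias and normalization parameters are ignored, the depthwise kernel size of 5 is an assumed default, the shuffle is the generic ShuffleNet-style interleaving, and all function names are hypothetical.

```python
def conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def gsconv_params(c_in, c_out, k, dw_k=5):
    """Weights of a GSConv-style block: a standard convolution produces
    half of the output channels, and a depthwise convolution (one
    dw_k x dw_k filter per channel) produces the other half."""
    half = c_out // 2
    return conv_params(c_in, half, k) + half * dw_k * dw_k


def channel_shuffle(channels, groups=2):
    """Interleave channels from each group so that later layers see a mix
    of the standard-convolution and depthwise halves."""
    assert len(channels) % groups == 0
    per_group = len(channels) // groups
    return [channels[g * per_group + i]
            for i in range(per_group) for g in range(groups)]


standard = conv_params(64, 128, 3)
slim = gsconv_params(64, 128, 3)
shuffled = channel_shuffle(list(range(8)))
print(standard, slim, shuffled)
```

For this example configuration the GSConv-style block needs roughly half of the weights of the standard convolution, and the shuffle alternates channels from the two halves.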

Experiments

Dataset Production
The SeaShips dataset was captured using surveillance cameras near the Island of Hengqin in Zhuhai, Guangdong Province, China. Its public part consists of 7000 annotated images covering 6 types of ships: passenger ships, container ships, bulk cargo ships, general cargo ships, ore carriers and fishing boats. A range of ship monitoring situations is covered: different hull sections, proportions, viewing angles and lighting in complex environments and with different levels of occlusion. In addition, we collected and labeled 500 images of the same categories and combined them with SeaShips to form a new dataset; Figure 6 shows the number of instances and the size of ground truths. A total of 80% of the dataset was used for training, and the remaining 20% was used for validation. Figure 7 gives examples for each category in SeaShips.
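The 80/20 split can be sketched as follows. The paper does not state how the split was drawn, so the random shuffle, the fixed seed and the image identifiers below are assumptions used only for illustration.

```python
import random


def split_dataset(items, train_fraction=0.8, seed=0):
    """Shuffle the items reproducibly and split them into train and validation lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded shuffle (assumed strategy)
    n_train = int(round(len(shuffled) * train_fraction))
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical identifiers: 7000 public SeaShips images plus 500 collected images.
image_ids = [f"img_{i:05d}" for i in range(7000 + 500)]
train_ids, val_ids = split_dataset(image_ids)
```

Under these assumptions the combined set of 7500 images yields 6000 training images and 1500 validation images.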

Experimental Configuration
The hardware and software environments in the experiment were configured as follows. An NVIDIA RTX 3090 GPU (made by the NVIDIA Corporation, Santa Clara, CA, USA) with 24 GB of video memory was used along with a 12-core Intel(R) Xeon(R) Platinum 8255C processor (made by the Intel Corporation, Santa Clara, CA, USA) and 45 GB of RAM (made by Kingston, Fountain Valley, CA, USA) for the computation tasks in testing. The runtime environment was 64-bit Ubuntu 18.04, configured with CUDA 11.1, PyTorch 1.9.0 and Python 3.8.
The following parameters were used for the experiments. The optimizer was SGD with a learning rate of 0.01, a momentum of 0.937 and a weight decay of 0.0005. To balance the number of samples across types, the mixup and mosaic augmentation was set to 0.1. Due to hardware limitations, the batch size was set to 8, and the number of training epochs was 300.
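To make the role of these optimizer settings concrete, the sketch below applies one SGD update with momentum and weight decay to a single scalar weight. It follows the common formulation in which weight decay is added to the gradient before the momentum buffer is updated; the absence of Nesterov momentum, warmup and learning rate scheduling is an assumption, and the dictionary keys and function name are illustrative rather than taken from the training code.

```python
# Values reported in the text; the key names are illustrative.
HYPERPARAMS = {
    "lr": 0.01,
    "momentum": 0.937,
    "weight_decay": 0.0005,
    "batch_size": 8,
    "epochs": 300,
}


def sgd_step(weight, grad, velocity, lr, momentum, weight_decay):
    """One plain SGD update with momentum and L2 weight decay (no Nesterov)."""
    grad = grad + weight_decay * weight      # weight decay acts as an extra gradient term
    velocity = momentum * velocity + grad    # momentum buffer accumulates past gradients
    weight = weight - lr * velocity          # step against the accumulated direction
    return weight, velocity


new_weight, new_velocity = sgd_step(
    weight=1.0,
    grad=0.5,
    velocity=0.0,
    lr=HYPERPARAMS["lr"],
    momentum=HYPERPARAMS["momentum"],
    weight_decay=HYPERPARAMS["weight_decay"],
)
```

Starting from a weight of 1.0 with a gradient of 0.5 and an empty momentum buffer, the effective gradient is 0.5005 and the weight moves to about 0.994995.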

Model Evaluation Indicators
Evaluation indicators are essential for assessing model performance. To examine the effectiveness of the proposed approach, this paper uses precision, recall, FLOPs (floating-point operations), mAP (mean average precision), model size and FPS (frames per second) as measures of algorithm performance. Precision and recall are calculated with Equations (8) and (9):

Precision = TP / (TP + FP)    (8)

Recall = TP / (TP + FN)    (9)

where FN, FP and TP represent the numbers of false negatives, false positives and true positives, respectively. The mAP metric builds on precision and recall, handles multiple object classes and defines positive predictions using IoU (intersection over union). For a given IoU threshold, it calculates the mean of the precision values obtained at different recall levels. IoU is a measure of the similarity of two sets and is commonly used in computer vision and image processing to quantify the degree of overlap between two bounding boxes (or two regions) A and B:

IoU = |A ∩ B| / |A ∪ B|

The AP (average precision) for a particular category is obtained by ranking the predictions of the model and calculating the area under the curve that plots precision on the vertical axis against recall on the horizontal axis in a Cartesian coordinate system.
The value of n is 10. By using mAP0.5 and mAP0.5:0.95, we evaluate the capability of the model to detect ships accurately at various IoU thresholds. In addition, the proposed model is characterized by its number of parameters and its FLOPs, which represent the amount of computation needed by the model and measure its complexity. The model size reflects the number of parameters the model contains. For real-world applications, we also consider FPS.
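The indicators above can be sketched in a few lines of plain Python. This is a minimal illustration, not the evaluation code used in the paper: the box format (x1, y1, x2, y2), the all-point interpolation used for AP and the function names are assumptions. The threshold list follows the usual convention that mAP0.5:0.95 averages over IoU thresholds from 0.5 to 0.95 in steps of 0.05.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of ground-truth objects that are found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(recalls, precisions):
    """Area under the precision-recall curve (all-point interpolation, an assumption).

    `recalls` must be sorted in ascending order, with matching `precisions`.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    for i in range(len(p) - 2, -1, -1):   # make precision monotonically non-increasing
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(per_class_ap):
    """Mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)


# IoU thresholds conventionally averaged for mAP0.5:0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]
```

For example, two 2 × 2 boxes offset by one unit in each direction overlap in a 1 × 1 region, so their IoU is 1/7, which would not count as a match at the 0.5 threshold.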

Ablation Experiments
To test the effectiveness of the improvements proposed in this paper, ablation experiments were performed with the same training strategy and hyperparameters. As can be seen from Table 1, when only the backbone is replaced, the FLOPs and model size decrease significantly, while the model performance decreases slightly. When only the neck structure is changed by adding a large-scale detection layer, the detection accuracy rises significantly, as does the amount of computation. When only the lightweight GSConv and GSC2f blocks are used, mAP0.5:0.95 rises slightly, and the FLOPs and model size decrease. When the backbone is replaced and the lightweight neck is also used, the computation and size of the model decrease substantially while mAP increases. Compared to the original YOLOv8, the final model with all three improvements combined achieves improvements of 0.9% and 4.5% in mAP0.5 and mAP0.5:0.95, respectively, while the GFLOPs (giga floating-point operations) decrease by 51.9% and the model size decreases by 41.9%.

Comparison Experiments
This part compares the results of the proposed approach with those of several other state-of-the-art lightweight target detection methods on the SeaShips dataset. Table 2 shows that the proposed approach outperforms the other tested methods in terms of mAP, amount of computation and model size. Table 3 compares the mAP of YOLOv8n and our method for each ship category. The proposed algorithm improves mAP0.5 and mAP0.5:0.95 by 4.4% and 12.5% compared to YOLOv7-tiny, by 6.7% and 5.7% compared to YOLOX-s, by 0.8% and 3.8% compared to YOLOv6, and by 1.1% and 8% compared to the 2021 release of YOLOv5 (version 6.0). We also compare against a YOLOv5 variant that incorporates a P2 layer; our model improves mAP0.5 and mAP0.5:0.95 over it by 0.6% and 1.6%, respectively. Our model also outperforms the earlier tiny models of YOLOv3 and YOLOv4.
Compared to the original Slim-neck, our method improves mAP0.5 and mAP0.5:0.95 by 0.8% and 4.8%, respectively. Compared to the minimalist-structured VanillaNet, the proposed model improves mAP0.5 and mAP0.5:0.95 by 2% and 9.2%. Our algorithm also outperforms the architectures in which MobileViT or MobileViTv3 is added to YOLOv8. Our model has the smallest FLOPs, and its model size is only slightly larger than those of the MobileViT and MobileViTv3 variants. Due to the addition of a large-scale detection layer, some detection speed is sacrificed, but the FPS on the GPU is still over 60, which is sufficient for real-time viewing by a human observer. This comparison suggests that the proposed model is computationally efficient and has higher detection accuracy.
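The FPS figure quoted above is a throughput measurement. A minimal way to obtain such a number is sketched below; the `infer` callable and the dummy frames are hypothetical stand-ins for a real detector and real images, and the warm-up count is an arbitrary choice. When timing a model on a GPU, the device would also need to be synchronized before reading the clock, which this CPU-only sketch does not do.

```python
import time


def measure_fps(infer, frames, warmup=3):
    """Return frames processed per second by `infer` over `frames`."""
    for frame in frames[:warmup]:          # warm-up runs are excluded from timing
        infer(frame)
    start = time.perf_counter()
    for frame in frames:
        infer(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")


def dummy_infer(frame):
    """Hypothetical stand-in for a detector's forward pass."""
    return sum(frame)


fps = measure_fps(dummy_infer, [[1, 2, 3]] * 100)
```

Averaging over many frames after a warm-up phase gives a more stable estimate than timing a single image.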

Experimental Analysis
The performance of the proposed model is also compared with that of the YOLOv8n model. Figure 8 shows a comparison of the recognition results for small ship targets, where the heat map was generated using Grad-CAM (gradient-weighted class activation mapping) [50]. The original YOLOv8n did not detect ships that were far away from the camera or small ships that appeared between larger ships. As the heat map demonstrates, our method incorporates a large-scale small target detection layer. As a result, higher weights are generated for these small targets after shallow features are fused. These small ship targets can thus be detected using the proposed approach.

We also show the results of target detection in other scenarios. YOLOv8n tends to produce missed and false detections when images with more complex backgrounds are processed, especially blurred images in which different types of ships overlap with one another. For example, Figure 9 shows a case where ships overlap, and YOLOv8n recognizes the overlapping part of the large ship as a false target. YOLOv8n may also generate false detection results under foggy weather and complex backgrounds. In contrast, the algorithm proposed in this paper incorporates the visual transformer MobileViTSF and a large-scale small target detection layer, which can locate the target with high accuracy and effectively reduce the rate of false detections. Therefore, the proposed approach achieves higher detection accuracy and detection speed while maintaining a smaller size, which can meet the practical needs of real-time detection.

Conclusions
This paper proposes a new YOLOv8 detection model to support port management and vessel monitoring. To improve the reliability of the model for the detection of ship targets in river traffic situations, a MobileViTSF block is added to the backbone network. Using the global information learning capability of the visual transformer, this block captures the most discriminative regions in complex river background images, making the model focus on the target rather than the background. To enhance the ability of the model to accurately obtain features for small target vessels, we redesigned the neck network by adding a shallow feature fusion layer of 160 × 160 size and a corresponding detection head. To make the network lightweight while ensuring detection effectiveness, GSConv is employed and the C2f block in YOLOv8 is redesigned, which leads to a significantly reduced model size and computational complexity. The improved model improves mAP0.5 and mAP0.5:0.95 by 0.9% and 5.5%, respectively. Compared to the original model, the GFLOPs are reduced by 51.9% and the model size is reduced by 41.9%, so improved detection performance is achieved at a lower computational cost. The testing results show that the proposed method outperforms several state-of-the-art target detection models in detection accuracy. It can meet the requirements of reliable, accurate and fast target detection for ship identification and surveillance.
However, the proposed method still has some shortcomings. For example, when several large ships overlap with one another, the detection accuracy of the proposed model is adversely affected, and the FPS still has room for improvement. In future research, the proposed model will be further improved to achieve higher detection accuracy and detection speed. Considering the practical value of this approach in real-time applications, the proposed model can also be ported to mobile edge platforms (such as mobile phones and the NVIDIA Jetson series).
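The 160 × 160 fusion layer mentioned above can be related to the other detection scales with a short calculation. The sketch assumes a 640 × 640 input and detection strides of 4, 8, 16 and 32; neither value is stated in this section, and they are chosen because a stride of 4 on a 640-pixel input gives the 160 × 160 map described in the text.

```python
def feature_map_sizes(input_size, strides):
    """Spatial size of the feature map at each downsampling stride."""
    return {stride: input_size // stride for stride in strides}


# Assumed input resolution and strides (illustrative values only).
sizes = feature_map_sizes(640, [4, 8, 16, 32])
cells = {stride: size * size for stride, size in sizes.items()}
```

Under these assumptions the added shallow layer has four times as many grid cells as the next finest scale, which is consistent with the extra computation and the better small-target localization reported for the large-scale detection layer.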

Figure 2. Improved YOLOv8 network structure.

Figure 4. Local representation and global representation schema. A patch is represented by a cell surrounded by black lines, while a pixel is represented by a cell surrounded by gray lines. In a general ViT, the unfolded feature map is spread as a sequence and input into the transformer, and self-attention relates each pixel in the feature map to all other pixels. In MobileViT, patches are first generated by partitioning the input feature map, and self-attention is then calculated only among pixels at the same location in each patch, which reduces the amount of computation. The transformer associates the red pixel in the center with the blue pixels; because the blue pixels have already encoded the information of their neighboring pixels through convolution, the transformer is able to incorporate the pixel information of all patches.

Figure 5. Schematic diagram of the different convolution processes and channel shuffle.

Figure 6. SeaShips dataset details: (a) the instance number for each category in the SeaShips dataset; (b) the size of the GT (ground truth) box for each instance; (c) the coordinates of the midpoint of each GT box; (d) the height and width of each GT box.

Electronics 2023, 12, x FOR PEER REVIEW 11 of 17

Figure 7. Examples for all categories of ships in the dataset: (a) an example of ore carriers; (b) an example of bulk cargo carriers; (c) an example of general cargo ships; (d) an example of fishing boats; (e) an example of container ships; (f) an example of passenger ships.

Figure 8. Comparison of heatmaps for small target detection. The tested image on the right contains a Chinese marker generated by the imaging system to indicate the date when the image was obtained. (a) YOLOv8n; (b) ours.

Figure 9. Comparison of model detection results. All tested images contain two Chinese markers (one at the top and the other at the bottom of the image) generated by the imaging system to show information on when and where the image was obtained. (a) YOLOv8n; (b) ours.

Table 1. Results of the ablation experiment, where a '√' indicates that the corresponding technique is used to construct the model for detection.

Table 2. Results of the comparison experiments with other models. A number in bold shows the best performance achieved for each performance index.

Table 3. mAP of YOLOv8n and our model for each category.