1. Introduction
Shipping is a vital link in the global trade network. As the scale of international trade continues to expand, ship traffic has grown markedly, a trend that is particularly prominent in major ports and nearshore waters. This growth, in turn, places higher demands on the safety of navigation in nearshore waters [1]. Unlike the open sea, the traffic environment in nearshore waters is more complex, featuring narrow waterways and large numbers of ships gathered at ports. Ship types are diverse and differ greatly in speed and maneuverability; without timely warning of dangerous situations, ship–ship collisions, ship–bridge collisions, and other major traffic accidents may occur, especially at night or under poor visibility. In this context, the performance of target detection methods directly affects navigation safety, and accurate ship detection in complex nearshore scenarios is therefore of great practical significance.
At present, sensor technologies applied to ship detection fall into four main categories: remote sensing image-, navigation radar-, visible light vision-, and infrared vision-based detection. Satellite remote sensing imagery offers a wide field of view, a high vantage point, and fast data acquisition; however, delayed signal transmission and relatively low image resolution make it unsuitable for real-time, accurate ship detection [2,3,4]. Navigation radar enables all-weather detection, but targets appear as points or bands, making it difficult to recognize ship scale and category information [5,6,7]. Visible light detection provides high resolution and rich target feature information, but it is strongly affected by the environment: targets are difficult to distinguish at night or in bad weather, and the detection effect degrades severely under such conditions [8,9,10]. Infrared imagery has somewhat lower resolution than visible light imagery; however, because imaging is based on the infrared radiation emitted by the object itself, passive detection can be realized by capturing the temperature difference between the target and the background. Infrared sensing is therefore naturally sensitive to targets with heat sources (e.g., a ship's engine or personnel body temperature): targets can be identified by their thermal signatures, while the sea surface forms a stable low-temperature background that contrasts strongly with the target [11,12,13]. Infrared vision thus allows environmental information to be acquired around the clock, providing ideal conditions for ship detection in complex nearshore scenarios.
Infrared target detection algorithms can be broadly categorized into two groups: traditional detection algorithms and deep learning-based detection algorithms. Traditional algorithms have been studied extensively. For example, Pokrajac et al. [14] detected moving objects in infrared videos by suppressing background noise through dimensionality-reducing decomposition of spatio-temporal blocks. Man et al. [15] constructed an infrared single-frame detection method that extracts image blocks and searches for similar blocks, achieving good results for small infrared targets under complex background interference. Khare et al. [16] proposed a background modeling method that detects targets from statistical changes in infrared pixels. Liu et al. [17] proposed a non-convex tensor low-rank approximation (NTLA) method for small infrared target detection, which achieves accurate background estimation in complex scenes. Although these traditional algorithms have achieved certain results, they rely on separating background and target in the image and suffer from limited feature extraction and biased classification, leading to insufficient generalization across infrared scenes. In recent years, the rapid progress of deep learning has greatly advanced target detection research. Convolutional neural networks, with their strong feature learning ability, not only simplify the complex process of hand-crafted feature design but also focus effectively on the target itself while filtering out background interference; as a result, both accuracy and generalization have improved significantly, addressing many problems faced by traditional approaches. Deep learning-based detectors fall into two main types: two-stage models, represented by the R-CNN series [18,19]; and single-stage models, represented by the YOLO series [20]. Two-stage algorithms first extract candidate regions and then classify and recognize them; they are accurate but exhibit poor real-time performance due to their heavy computation. Single-stage algorithms skip the candidate-region stage and directly predict category probabilities and position coordinates, greatly improving detection speed. For example, Firdiantika et al. [21] applied the CA attention mechanism to a single-stage network, which worked well for small-target detection. Zhang et al. [22] proposed YOLO-IR-Free, an anchor-free detection head that simplifies anchor selection and adjustment, accelerates model convergence, and better supports real-time detection. Bao et al. [23] designed dual feature extraction channels for infrared and visible images based on YOLO; this compensates for the lack of texture features in infrared images, unifies infrared and visible features, and reduces the impact of redundant information on detection accuracy. Zhang et al. [24] developed an efficient lightweight convolutional neural network and designed a new bi-directional weighted feature pyramid network (BWFPN) that integrates multi-scale features, improving detection accuracy while maintaining computational efficiency. Li et al. [25] proposed an improved YOLO-FIR algorithm in which the shallow CSP module is extended and iterated in the feature extraction stage, an improved SK attention module is used to increase robustness, and the detection head structure is refined, significantly increasing the detection accuracy of small targets in infrared imagery.
The YOLO algorithms possess unique performance advantages for infrared scenes. However, most deep learning-based infrared target detection research has focused on unilateral optimization of the network structure; although such methods can achieve good results in some fields, unsolved problems remain for ship detection in infrared imagery given the characteristics of nearshore scenes. In recent years, some scholars have conducted relevant research on infrared ship detection using deep learning methods. For example, Liu et al. [26] proposed MAFF-DETR, which has advantages in detecting small infrared ship targets but cannot resolve missed detections under severe target occlusion. Wang et al. [27] optimized YOLOv5 by combining it with the PP-LCNet backbone to design a lightweight detection algorithm, which effectively reduces the computational load of the model at the expense of some detection performance in complex nearshore contexts.
However, analysis of the imaging characteristics of infrared technology and the distribution of nearshore vessels indicates that the following challenges remain in the detection process: (1) Compared with visible light, infrared images have lower resolution, blurring the detailed features of ships; moreover, backgrounds near the coastline such as vegetation and rocks exhibit infrared radiation characteristics (temperature, texture, etc.) that are easily confused with ship targets, and this blurring and confusion often lead to false detections. (2) Compared with imagery captured toward the far sea, the high density of nearshore ships and intensive berthing in harbor areas cause mutual occlusion between ships, and between ships and buildings or terminal facilities, so that some ship features are missing and detection becomes more difficult. (3) The wide variety of ship types results in large size differences between ships and complex image backgrounds, making it difficult to achieve the expected detection results and placing higher demands on multi-scale detection. Therefore, when applying deep learning-based target detection algorithms to nearshore ship detection in complex scenes, they should be optimized to improve infrared image quality, focus on key features, and enhance multi-scale detection performance, so as to provide better conditions for accurate ship detection.
To address the above problems, this study proposes an algorithm (called CGSE-YOLOv5s) for the ship target detection task in complex nearshore scenes. The contributions of this paper are summarized as follows:
We propose a new method for detecting ship targets in nearshore waters. The approach involves enhancing the infrared image and optimizing the network structure separately, then combining them to improve the ship detection performance in complex nearshore scenes.
We perform infrared image enhancement by combining Contrast Limited Adaptive Histogram Equalization with Gaussian Filter (CLAHE-GF), apply it to the input layer of the model, and compare the practical effectiveness of different data enhancement methods.
We integrate the Swin Transformer coding structure into the C3 module of the neck layer and verify its effect in terms of model performance improvement.
We introduce the Efficient Channel Attention (ECA) mechanism and integrate it in front of each detection head, additionally comparing the impacts of different attention mechanisms on model performance.
We compare the performance improvement of the proposed model with respect to the original YOLOv5s model.
We add and replace improvement modules from different architectures to compare their effects on the proposed model.
We use the same infrared nearshore vessel dataset to compare the proposed model with several other target detection models.
Our experimental results show that the method proposed in this study exhibits excellent performance in the nearshore ship detection task.
The remaining sections of the paper are structured as follows:
Section 2 details CGSE-YOLOv5s;
Section 3 presents the dataset and the various experiments;
Section 4 discusses the results;
Section 5 concludes the paper.
3. Experiment and Result Analysis
3.1. Experimental Dataset
The dataset used to train and test the improved model was mainly sourced from the InfiRay Infrared Open Source Database [41], an infrared image dataset for maritime ship target detection, supplemented with 977 infrared images collected at several ports and terminals in Dalian, China, to cover nearshore scenarios. The overall dataset contained 9379 infrared images of various types of ships and their corresponding labels, covering seven ship categories of different sizes: liner, bulk carrier, warship, sailboat, canoe, container ship, and fishing boat. The images were mostly captured at nearshore ports with large numbers of docked ships, matching the practical application scenario of the experiment. In total, 33,281 ships were annotated, with liners accounting for 4.8%, bulk carriers 7.0%, warships 9.2%, sailboats 22.8%, canoes 18.1%, container ships 2.5%, and fishing boats 35.6%.
Figure 8 summarizes the number of labels for each ship category. To ensure the accuracy and reliability of the experimental data, before the experiments we used the labeling annotation tool to re-label the 977 unlabeled images and those images in the public dataset whose labels did not match the actual ship category, corrected and adjusted label formatting errors in the public dataset annotations, and converted the VOC-format xml files provided with the public dataset into the txt files required for model training. In this conversion, the diagonal corner coordinates of each labeled box (x_min, y_min, x_max, y_max) were converted into the corresponding center position together with the box width and height. Finally, the dataset was divided into training, validation, and test sets in a 7:2:1 ratio.
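For illustration, a minimal sketch of this VOC-to-YOLO label conversion is given below; the class order and XML field names are assumptions, not the exact script used in this study.

```python
# Hypothetical sketch of the VOC xml -> YOLO txt conversion described above;
# class order and field names are assumptions, not the authors' exact script.
import xml.etree.ElementTree as ET
from pathlib import Path

CLASSES = ["liner", "bulk carrier", "warship", "sailboat",
           "canoe", "container ship", "fishing boat"]  # assumed class order

def voc_to_yolo(xml_path: str, txt_path: str) -> None:
    root = ET.parse(xml_path).getroot()
    img_w = float(root.find("size/width").text)
    img_h = float(root.find("size/height").text)
    lines = []
    for obj in root.iter("object"):
        name = obj.find("name").text
        if name not in CLASSES:
            continue  # skip labels that do not match a known ship category
        box = obj.find("bndbox")
        xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
        xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
        # diagonal corners -> normalized center, width, and height
        xc = (xmin + xmax) / 2.0 / img_w
        yc = (ymin + ymax) / 2.0 / img_h
        bw = (xmax - xmin) / img_w
        bh = (ymax - ymin) / img_h
        lines.append(f"{CLASSES.index(name)} {xc:.6f} {yc:.6f} {bw:.6f} {bh:.6f}")
    Path(txt_path).write_text("\n".join(lines))
```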
3.2. Experimental Environment
The infrared image data were acquired with InfiRay's FT II 384 series alarm-type long-range monitoring thermal imaging module. The sensor is based on a 12 μm ceramic detector, supports black-hot and white-hot display as well as 18 pseudo-color modes, and provides a resolution of 640 × 512, a frame rate of 50 Hz, and a response band of 8–14 μm.
The experiment was run on the Windows 10 operating system. The experimental machine was configured with an Intel (R) Core (TM) i7-1160G7 processor (Intel, Santa Clara, CA, USA), 16 GB of RAM, and an NVIDIA GeForce RTX 4090 (24 GB) graphics card (Nvidia, Santa Clara, CA, USA). The programming language was Python (version 3.10.8), and the deep learning framework was PyTorch (version 2.1.2).
Several key hyperparameters were set in the experiment: the number of training rounds (epochs) was 150, batch size was 16, the initial value of the learning rate (lr0) was 0.01, the optimization algorithm (optimizer) adopted was the SGD optimizer, the momentum setting (momentum) was 0.937, and the weight decay value (weight_decay) was 0.0005.
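As a minimal sketch of these settings (assuming a PyTorch model instance named `model`; the parameter-group handling of the actual YOLOv5 training script is omitted), the optimizer could be constructed as follows:

```python
# Minimal sketch of the SGD settings listed above (PyTorch).
# Assumption: `model` is the detection network; YOLOv5's per-group weight
# decay handling is omitted here for brevity.
import torch

def build_optimizer(model: torch.nn.Module) -> torch.optim.SGD:
    return torch.optim.SGD(
        model.parameters(),
        lr=0.01,              # initial learning rate (lr0)
        momentum=0.937,       # momentum
        weight_decay=0.0005,  # weight_decay
        nesterov=True,        # assumption: Nesterov momentum as in YOLOv5 defaults
    )

# epochs = 150 and batch_size = 16 are passed to the training loop separately.
```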
3.3. Evaluation Metrics
In order to demonstrate the contribution of CGSE-YOLOv5s to performance in the nearshore ship detection task, we adopted commonly used target detection metrics, including precision (P), recall (R), and mean average precision (mAP@0.5, mAP@0.5:0.95), as the evaluation metrics for the experiments. Some of our experiments also report FPS (frames per second), the number of model parameters, and GFLOPs (giga floating-point operations) in order to evaluate model complexity and inference speed.
These metrics are defined as follows:

\[
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN},
\]
\[
AP = \int_{0}^{1} P(R)\, dR, \qquad mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i,
\]

where TP (True Positive) is the number of positive samples correctly identified, FP (False Positive) is the number of negative samples misclassified as positive, FN (False Negative) is the number of positive samples missed, AP (Average Precision) is the area under the Precision–Recall curve, and N is the total number of target categories.
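A minimal NumPy sketch of these definitions is given below; it uses plain trapezoidal integration rather than the interpolated AP of the actual evaluation code, and the input arrays are assumed to come from the detector's matching step.

```python
# Minimal sketch of the metric definitions above (NumPy). The interpolation
# used by the real mAP evaluation is simplified to trapezoidal integration.
import numpy as np

def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    p = tp / (tp + fp) if (tp + fp) else 0.0   # P = TP / (TP + FP)
    r = tp / (tp + fn) if (tp + fn) else 0.0   # R = TP / (TP + FN)
    return p, r

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    order = np.argsort(recall)                 # sort points by recall
    r, p = recall[order], precision[order]
    # AP = area under the P-R curve (trapezoidal rule)
    return float(np.sum(np.diff(r) * (p[1:] + p[:-1]) / 2.0))

def mean_average_precision(ap_per_class: list[float]) -> float:
    return float(np.mean(ap_per_class))        # mAP = (1/N) * sum(AP_i)
```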
3.4. Comparison of Infrared Image Enhancement Algorithms
We used a combined CLAHE and GF strategy to address the insufficient target texture information caused by infrared imaging characteristics and the excessive background noise caused by the climatic conditions of nearshore waters. To verify that fusing the two algorithms enables the model to extract target features more effectively and improves performance, we combined different image enhancement algorithms with the original YOLOv5s model and tested each variant separately. Because the parameter settings of the CLAHE-GF module strongly affect detection, we also carried out ablation experiments on the block (tile) size and the Gaussian kernel size of CLAHE-GF on the dataset used in this paper; the comparison results are shown in Table 1. It is worth emphasizing that we adopted a progressive tuning strategy for CLAHE during training of the improved model: a stronger contrast limiting parameter (clipLimit = 3.0) was used in the early phase of training (epochs 1–50) and was gradually weakened to clipLimit = 2.0 in the middle and late phases (epochs 50–150), so that the model is exposed to samples with different enhancement strengths and its robustness is better ensured. The experimental results show that the CLAHE-GF (16, 11) fusion strategy adopted in this paper achieved higher detection performance than any of the other image enhancement algorithms, confirming that the image enhancement method with this parameter selection achieves better detection effectiveness.
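A hedged OpenCV sketch of this CLAHE-GF preprocessing is shown below; the order of the two operations is an assumption consistent with the parameters reported above (16 × 16 tiles, an 11 × 11 Gaussian kernel, and the clipLimit schedule of 3.0 → 2.0).

```python
# Sketch of the CLAHE-GF enhancement with the reported settings; applying
# CLAHE before Gaussian filtering is an assumption about the pipeline order.
import cv2
import numpy as np

def clahe_gf(gray: np.ndarray, epoch: int,
             tile: int = 16, ksize: int = 11) -> np.ndarray:
    clip = 3.0 if epoch <= 50 else 2.0  # progressive clipLimit schedule
    clahe = cv2.createCLAHE(clipLimit=clip, tileGridSize=(tile, tile))
    enhanced = clahe.apply(gray)        # contrast-limited local equalization
    # Gaussian filtering suppresses noise amplified by the equalization step
    return cv2.GaussianBlur(enhanced, (ksize, ksize), 0)
```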
3.5. Comparison Experiments on Attention Mechanisms
In order to compare the ECA attention mechanism used in this study with other attention mechanisms, we added four attention mechanisms (SE, EMA, CBAM, and ECA) at the same location (i.e., in front of the detection heads) and compared the mean average precision and model parameters. The results are shown in Table 2, and the detection results for the different attention mechanisms are shown in Figure 9.
Due to infrared sensing characteristics, infrared images collected in nearshore scenes often show ships overlapping with dock facilities in the background, resulting in partially missing ship features. The integrated ECA mechanism and the detection head structure adopted in this paper enable the model to reduce its focus on the background region and enhance its focus on the defective region, thus improving detection capability. In Figure 9(1), YOLOv5s-SE, YOLOv5s-EMA, and YOLOv5s-CBAM all consistently misidentify buildings along the shore as sailboats, whereas YOLOv5s-ECA avoids confusing ships with shore objects and accurately detects the correct ship locations. In Figure 9(2), YOLOv5s-SE, YOLOv5s-EMA, and YOLOv5s-CBAM fail to correctly identify a canoe at a relatively long distance, while YOLOv5s-ECA distinguishes the correct ship type. In addition, the results in Table 2 show that introducing the ECA attention mechanism yields a more significant improvement in mAP@0.5 and mAP@0.5:0.95 than the other attention mechanisms. The ECA attention mechanism therefore provides the best detection performance for nearshore infrared scenes in this study.
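For reference, a minimal PyTorch sketch of an ECA block as it might be placed in front of a detection head is given below; the 1D-convolution kernel size is an assumption rather than the value used in this study.

```python
# Minimal ECA (Efficient Channel Attention) sketch; kernel size k = 3 is an
# assumption, not necessarily the setting used in CGSE-YOLOv5s.
import torch
import torch.nn as nn

class ECA(nn.Module):
    def __init__(self, k: int = 3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # global average pooling per channel
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        y = self.pool(x).view(b, 1, c)        # (B, 1, C) channel descriptor
        y = self.sigmoid(self.conv(y))        # local cross-channel interaction
        return x * y.view(b, c, 1, 1)         # reweight the feature maps
```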
3.6. Network Structure Ablation Experiment
In order to measure the impact of the improved modules of different architectures on detection performance, we designed experiments that sequentially add or replace the improved network's input layer, feature fusion network, and attention mechanism. The results of the network structure ablation experiment are provided in Table 3. Introducing the CLAHE-GF data enhancement strategy first improved precision by 0.8%, while recall did not change significantly. Introducing the Swin Transformer module then compensated for the model's weak positive-example regression and increased recall by 1.7%; however, owing to the layer-stacking design of the Swin Transformer module, replacing C3 with C3STR increased GFLOPs by 18.7 and decreased FPS by 50. This processing speed remains acceptable, although the gain in detection accuracy sacrifices some computational efficiency. Finally, introducing the ECA attention mechanism allowed the model to further exploit channel information. The results show that the improved model increases precision, recall, and mAP by 1.2%, 2.5%, and 1.3%, respectively, compared to the YOLOv5s model.
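To illustrate why the Swin-style C3STR replacement raises GFLOPs yet remains tractable, the sketch below shows the standard window-partition step that restricts self-attention to local windows; this is a generic illustration, not the exact C3STR code.

```python
# Generic window-partition step used by Swin-style attention (illustrative,
# not the exact C3STR implementation). Attention is computed within each
# ws x ws window instead of over the whole feature map.
import torch

def window_partition(x: torch.Tensor, ws: int) -> torch.Tensor:
    """(B, H, W, C) -> (num_windows * B, ws, ws, C); H and W are assumed
    to be divisible by the window size ws."""
    b, h, w, c = x.shape
    x = x.view(b, h // ws, ws, w // ws, ws, c)
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, ws, ws, c)
```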
3.7. Comparison with Other Detection Algorithms
In order to verify that the improved algorithm performs relatively well in detecting ships of different categories and scales in nearshore scenarios, it was comprehensively compared with the mainstream YOLO series algorithms and with other mainstream algorithms (including Faster R-CNN and DETR [42], both using ResNet50 as the backbone network) on the same dataset. The experimental results are given in Table 4.
Within the YOLO series, the YOLOv7, YOLOv8n, YOLOv8s, YOLOv10, and YOLOv12 models were selected for comparison, and all experiments were based on the dataset described earlier in this paper. Compared with the other YOLO baselines, the core advantage of YOLOv5s lies in its lightweight architecture and speed, and its FPS on our dataset is remarkable. The table also shows that the detection accuracy for sailboats, canoes, and fishing boats is generally lower than for other ship types; in our dataset these three types are more numerous in complex nearshore scenes and are more difficult to detect. However, for small ship targets such as fishing boats and canoes, YOLOv5s outperforms the other YOLO baselines. Choosing YOLOv5s as the baseline of this study therefore offered advantages in both real-time performance and accuracy for nearshore target detection. As shown in Table 4, the FPS of the proposed CGSE-YOLOv5s reaches 162; although this is lower than some baselines, its balance between speed and accuracy far exceeds that of the other baseline models. In the comparison with the other families of detectors, Faster R-CNN (R50) and DETR (R50) lag significantly behind CGSE-YOLOv5s in both detection speed and accuracy on the dataset used in this study. In summary, CGSE-YOLOv5s demonstrates practical application potential for infrared ship detection. Comparing the AP values of the seven vessel types, the proposed CGSE-YOLOv5s shows the most significant improvement on these categories, and its detection performance for almost all vessel types is superior to that of the other models. One reason is that newer models such as YOLOv10 and YOLOv12 introduce more complex convolutional modules and neck designs that demand higher-quality and larger datasets; owing to the low image quality caused by infrared imaging and the complexity of nearshore scenes, the available data for complex infrared nearshore scenes are insufficient to support training such models, so they cannot exert their advantages. Finally, comparing mAP values, YOLOv7, YOLOv8n, YOLOv8s, YOLOv10, YOLOv12, Faster R-CNN (R50), and DETR (R50) fall short of CGSE-YOLOv5s by 2.1%, 1.9%, 1.4%, 1.7%, 1.2%, 7.4%, and 2.9%, respectively, demonstrating the excellent performance of the proposed model for ship detection in complex nearshore scenes.
3.8. Comparison of Model Performance
In order to verify the overall improvement of the proposed method on model performance, we compared the CGSE-YOLOv5s model with the original YOLOv5s model. The comparison results are shown in Table 5. As the table shows, the number of parameters of CGSE-YOLOv5s changes negligibly relative to the original model, the model complexity approximately doubles, and the detection accuracy is clearly improved. This accuracy gain increases the computation of the improved model to some extent and in turn affects its FPS; however, although the processing frame rate of the new model is lower than that of the baseline YOLOv5s, its comprehensive performance remains excellent compared with other mainstream models. Meanwhile, Figure 10 compares the training curves of the improved model (CGSE-YOLOv5s) and the original YOLOv5s model for precision (P), recall (R), and mean average precision (mAP@0.5). The curves show that the three indicators of CGSE-YOLOv5s converge quickly within the first 30 epochs, gradually stabilize after 90 epochs, and reach a smooth state by 150 epochs, indicating that the model trains robustly.
We plotted P–R curves (with recall on the x-axis and precision on the y-axis) to compare the performance of the original YOLOv5s model with the CGSE-YOLOv5s model, as shown in Figure 11. The P–R curve shows the trade-off between precision and recall, and a larger area enclosed by a curve indicates higher model performance. Against the nearshore multi-scale ship background, detecting small-scale ship targets is usually more difficult. As can be seen from Figure 11a, the three types of small-scale ships (sailboat, canoe, and fishing boat, represented by the red, purple, and pink curves, respectively) enclose smaller areas under the curve than the other ship types. In Figure 11b, the region enclosed by the same three curves is noticeably expanded toward the upper right, indicating that the improved model clearly enhances detection performance on small-scale ship targets. The P–R comparison shows that CGSE-YOLOv5s outperforms the basic YOLOv5s model for nearshore vessel detection, mainly reflected in higher AP values for most individual vessel categories and a higher overall mAP. This suggests that the improvements in CGSE-YOLOv5s effectively increase the model's ability to accurately detect different types of vessels in nearshore infrared scenes.
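Per-class P–R curves like those in Figure 11 can be reproduced with a short matplotlib sketch such as the following; `curves` is an assumed mapping from class name to (recall, precision) arrays exported by the evaluation step.

```python
# Hedged sketch for plotting per-class P-R curves; `curves` maps a class
# name to its (recall, precision) arrays and is assumed to be precomputed.
import matplotlib.pyplot as plt

def plot_pr_curves(curves: dict[str, tuple[list[float], list[float]]]) -> None:
    for name, (recall, precision) in curves.items():
        plt.plot(recall, precision, label=name)  # one curve per ship class
    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.title("Precision-Recall curves")
    plt.legend()
    plt.show()
```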
3.9. Comparison Experiment for Visualization of Test Results
In order to compare the detection performance of the YOLOv5s and CGSE-YOLOv5s algorithms more intuitively, we selected a variety of infrared nearshore scenes with complex backgrounds and other interference for visualization and comparison. The detection results of the two algorithms are shown in Figure 12, where (a) shows the results of YOLOv5s and (b) shows the results of CGSE-YOLOv5s.
The figure shows the ship distributions in nearshore water scenes presenting complex scenarios such as dense, multi-scale objects, mutual occlusion, insufficient image resolution, and interference from rear buildings. The comparison results show that the original YOLOv5s algorithm has difficulty in detecting small and occluded targets in complex nearshore water scenes, often leading to target loss and false detection. In contrast, the CGSE-YOLOv5s algorithm effectively suppresses background interference in different nearshore scenarios, enabling it not only to identify multiple ship targets that were undetected or falsely detected by the original model, but also to achieve high confidence results. Our algorithm adopts the CLAHE-GF traditional image enhancement fusion strategy to refine the edge feature information of infrared images, accurately identify targets of different scales by integrating C3STR to capture multi-scale features, and enhance the ability to capture ship features through the ECA mechanism, thereby providing excellent performance for ship detection. Comprehensive comparative analysis verified that the CGSE-YOLOv5s algorithm proposed in this study can effectively reduce the false detection rate and omission rate of infrared ship targets in complex nearshore scenes, providing excellent performance in terms of detection accuracy.
4. Discussion
The CGSE-YOLOv5s model proposed in this paper can be applied to target detection tasks involving images of ships in nearshore waters with complex backgrounds, while maintaining excellent detection performance in terms of both accuracy and speed. Unlike other research that improves only the network structure, CGSE-YOLOv5s combines image enhancement with an improved convolutional neural network structure. The proposed algorithm first prioritizes infrared image quality, improving the contrast between target and background and reducing the effects of low resolution, before optimizing the network structure. CGSE-YOLOv5s incorporates the Swin Transformer module, whose self-attention mechanism extracts more adequate global information. Given the widespread use of attention mechanisms and their influence on network performance, we tested a variety of attention mechanisms and compared them with the ECA mechanism; the ECA mechanism proved more suitable for the infrared scenes targeted here, obtaining superior results to the other mechanisms. Owing to the superior detection performance of CGSE-YOLOv5s compared with other models, the developed technology expands the potential of ship detection applications in complex nearshore scenarios.
However, despite the considerable improvement in detection accuracy in nearshore scenarios achieved by the proposed model, we have not yet investigated similar tasks in other scenarios, such as the detection of small targets at long distances; we will explore more effective methods for various ship detection scenarios in future work. In addition, although infrared image enhancement based on traditional algorithms achieves certain results, it still has limitations such as poor generalization across datasets. In the future, we will apply recent neural network-based image enhancement mechanisms, taking model lightweighting requirements into account, in order to train the model more effectively.