4.1. Ablation Experiments
The accuracy evaluation results of the ablation experiments are shown in Table 6, and the loss changes during training of the ES-YOLO model are shown in Figure 7.
Figure 7 shows the training-set and validation-set loss of the ES-YOLO model over 100 epochs. Both losses decrease steadily, and the validation loss levels off in the later epochs, indicating that the model trains normally and converges.
- (1)
EACSA
In the EACSA module, channel attention performs adaptive recalibration of channel features through a two-layer convolution structure, and the reduction ratio controls how strongly the channel dimension is compressed, balancing model complexity against feature expressiveness. The experimental results are shown in Table 7.
The experimental results show that the reduction ratio has a significant impact on detection performance. When the reduction ratio is 4, more feature information is retained, but inter-channel redundancy is high and the attention distribution is diffuse. When the reduction ratio is 16 or 32, channel information is over-compressed and key semantic features are lost, decreasing detection accuracy. On balance, with a reduction ratio of 8 the model achieves 91.54% precision, 70.77% recall, 82.74% mAP, and 80% F1, the best overall performance. Moderate channel compression not only reduces the computational burden but also sharpens the attention module's focus on key features, making the model more stable in small-target detection and complex-background scenes. A reduction ratio of 8 is therefore the best balance between detection accuracy and model complexity for the EACSA module.
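To make the role of the reduction ratio concrete, here is a minimal PyTorch sketch of the channel attention just described: a two-layer 1×1 convolution bottleneck whose hidden width is channels/reduction, applied to fused global average- and max-pooled descriptors. The class and argument names are illustrative, not the paper's released code.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE/CBAM-style channel attention with a two-layer 1x1 conv bottleneck.

    `reduction` controls how aggressively channels are compressed in the
    hidden layer; reduction=8 is the setting reported as the best balance.
    """
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)   # global average pooling
        self.max_pool = nn.AdaptiveMaxPool2d(1)   # global max pooling
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Fuse average- and max-pooled descriptors through the shared
        # bottleneck, then rescale the input channels by the learned weights.
        w = self.sigmoid(self.fc(self.avg_pool(x)) + self.fc(self.max_pool(x)))
        return x * w
```

Sharing the bottleneck between the two pooled descriptors keeps the added parameter count small, which is why moderate compression can save computation with little accuracy cost.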
In order to verify the influence of the EACSA module on model performance, the module is added to the base model for comparative experiments; the results are shown in Table 6. After introducing EACSA, the overall detection performance improves: precision rises to 94.01%, recall to 72.38%, mAP to 84%, and F1 to 82%. Precision increases by 2.9%, indicating that the false detection rate in complex backgrounds is significantly reduced; mAP increases by 0.83%, showing that overall detection accuracy is optimized; and recall increases by 1.6%, reflecting the model's enhanced ability to detect weak or low-contrast targets. This shows that the EACSA module plays a positive role in feature extraction and salient-region focusing.
The performance gain stems from EACSA's joint attention mechanism over the channel and spatial dimensions. Channel attention fuses global average pooling and maximum pooling to adaptively allocate channel weights, strengthening semantically salient features and suppressing irrelevant background information, which reduces false detections and improves precision. Spatial attention captures salient regions in the two-dimensional plane, allowing the model to localize ship edges and contours more accurately against complex sea backgrounds, which improves recall and F1. In addition, the Sobel-convolution edge-aware mechanism embedded in EACSA further strengthens the gradient response at target boundaries and improves adaptability to blurred contours, uneven illumination, and wave interference. This synergy of edge enhancement and attention focusing gives the model stronger environmental robustness while maintaining high detection accuracy.
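The spatial and edge-aware parts can be sketched in the same hedged spirit. Since the exact wiring inside EACSA is not given in the text, the sketch below combines a CBAM-style spatial attention with fixed Sobel kernels as one plausible realization; all names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SobelEdge(nn.Module):
    """Fixed Sobel filters that strengthen gradient responses at ship boundaries."""
    def __init__(self):
        super().__init__()
        gx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        gy = gx.t()
        # Shape (2, 1, 3, 3): horizontal and vertical gradient kernels.
        self.register_buffer("kernel", torch.stack([gx, gy]).unsqueeze(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        m = x.mean(dim=1, keepdim=True)            # collapse channels first
        g = F.conv2d(m, self.kernel, padding=1)    # (B, 2, H, W) gradients
        return g.abs().sum(dim=1, keepdim=True)    # edge-magnitude map

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention augmented with a Sobel edge map."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.edge = SobelEdge()
        self.conv = nn.Conv2d(3, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        # Stack average, max, and edge cues, then predict one attention map.
        attn = self.sigmoid(self.conv(torch.cat([avg, mx, self.edge(x)], dim=1)))
        return x * attn
```

Feeding the edge map into the attention convolution is one way the Sobel branch could bias the spatial weights toward hull contours, consistent with the behavior described above.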
- (2)
LSCD
To investigate the impact of nonlinear activations on the proposed LSCD architecture, five commonly used activation functions (LeakyReLU, GeLU, PReLU, SiLU, and ReLU) were evaluated under identical network configurations and training settings. The results are summarized in Table 8. Among them, ReLU achieves the highest detection accuracy, with precision, recall, mAP, and F1 reaching 92.53%, 70.46%, 83.56%, and 80%, respectively. PReLU and GeLU obtain comparable performance but remain slightly inferior to ReLU on all metrics, while SiLU and LeakyReLU show a noticeable decline in both recall and mAP.
Although SiLU is the default activation in YOLOv7, its smooth nonlinear compression weakens the response at object boundaries, resulting in insufficient preservation of fine-grained features for small ships. In contrast, the hard-threshold property of ReLU introduces stronger activation sparsity, effectively suppressing sea surface noise and enhancing the feature contrast between ships and background regions. This behavior is particularly beneficial in the down-sampling stage, where edge-aware decoupling is employed. Moreover, the residual and multi-branch fusion structure of LSCD mitigates the neuron inactivation issue commonly associated with ReLU, ensuring stable gradient flow during optimization.
Overall, the results indicate that activation selection is highly architecture dependent. In our LSCD framework, ReLU provides a more favorable trade-off between gradient sparsity, edge preservation, and convergence stability, leading to superior detection performance in complex maritime scenes.
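As a sketch of how such an ablation can be wired, the nonlinearity can be made a swappable component of the convolution block; the helper below is illustrative (names and defaults are ours, not the released code).

```python
import torch.nn as nn

# The five activations evaluated in the LSCD ablation, mapped to PyTorch modules.
ACTIVATIONS = {
    "relu": lambda: nn.ReLU(inplace=True),
    "leakyrelu": lambda: nn.LeakyReLU(0.1, inplace=True),
    "gelu": nn.GELU,
    "prelu": nn.PReLU,
    "silu": lambda: nn.SiLU(inplace=True),  # YOLOv7's default activation
}

def conv_bn_act(c_in: int, c_out: int, k: int = 3, s: int = 1, act: str = "relu"):
    """Conv-BN-activation block with a swappable nonlinearity."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        ACTIVATIONS[act](),
    )

block = conv_bn_act(64, 128, act="relu")  # e.g., the best-performing setting
```

Swapping the `act` argument reruns the same configuration with only the nonlinearity changed, matching the controlled comparison in Table 8.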
In order to evaluate the influence of the LSCD module on model performance, LSCD is introduced into the base model for a comparison test. The experimental results are shown in Table 6. When only the LSCD module is introduced, the precision, recall, mAP, and F1 of the model increase to 93.74%, 71.46%, 83.71%, and 81%, respectively. The LSCD module reduces the computational cost of feature down-sampling while maintaining accuracy, thereby improving detection efficiency. It introduces a feature mapping scheme that decouples space and channel, so that the down-sampling process can attend to spatial structure and semantic information separately, avoiding the information loss caused by feature coupling in conventional convolutional down-sampling. At the same time, the lightweight internal design of LSCD improves the accuracy of feature selection while reducing parameters and computation, which makes the model perform better in small-ship detection and complex-background suppression.
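To make the decoupling idea concrete, below is a hypothetical two-branch down-sampling block. The actual LSCD layout is not reproduced in this text, so the branch composition and channel split are assumptions, and the residual connections mentioned earlier are omitted for brevity.

```python
import torch
import torch.nn as nn

class LSCDDownsample(nn.Module):
    """Sketch of a lightweight spatial/channel-decoupled down-sampling block.

    One branch uses a strided depthwise conv to capture spatial structure;
    the other uses pooling plus a pointwise conv to preserve channel
    semantics. The branches are fused by concatenation (layout assumed).
    """
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        assert c_out % 2 == 0, "c_out is split evenly across the two branches"
        # Spatial branch: strided depthwise conv + pointwise projection.
        self.spatial = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, stride=2, padding=1, groups=c_in, bias=False),
            nn.Conv2d(c_in, c_out // 2, 1, bias=False),
            nn.BatchNorm2d(c_out // 2),
        )
        # Channel branch: parameter-free pooling + pointwise semantic mixing.
        self.channel = nn.Sequential(
            nn.MaxPool2d(2, 2),
            nn.Conv2d(c_in, c_out // 2, 1, bias=False),
            nn.BatchNorm2d(c_out // 2),
        )
        self.act = nn.ReLU(inplace=True)  # ReLU, per the ablation in Table 8

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(torch.cat([self.spatial(x), self.channel(x)], dim=1))
```

Depthwise and pointwise convolutions keep the parameter count well below that of a dense strided convolution, which is consistent with the lightweight design claim above.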
In addition, the stable contribution of LSCD can be further verified from its combinations with other modules. When LSCD and EACSA are used together, the mAP of the model improves by about 0.63% over the baseline, indicating that the two are complementary in the feature extraction and down-sampling stages. When LSCD is combined with the multi-scale structure, recall improves significantly, indicating that LSCD effectively enhances the retention of features at different scales.
In summary, the LSCD module significantly improves the quality of information expressed in the feature down-sampling stage while keeping the overall model size unchanged, strengthening feature retention for small ships and suppression of complex backgrounds. It is an essential component of the ES-YOLO framework for efficient feature extraction and performance improvement.
- (3)
Multi-scale
In order to verify the influence of the multi-scale structure on model performance, this paper introduces the multi-scale structure into the benchmark model for comparison tests. The experimental results are shown in Table 6. The precision, recall, mAP, and F1 of the model reach 91.03%, 72.62%, 84.23%, and 81%, respectively. Precision decreases slightly relative to the baseline, while recall and mAP improve markedly; mAP in particular rises by about 1.1%, indicating that the multi-scale structure substantially strengthens detection of objects at different scales. The results show that the multi-scale structure effectively improves the detection of small objects. Its core idea is to let the model obtain high-level semantic information and low-level detail features simultaneously through parallel multi-scale feature pathways, improving the diversity and complementarity of the feature space. Compared with traditional single-path feature transfer, this effectively alleviates the loss of spatial resolution in high-level features after repeated down-sampling and the insufficient semantic expression of low-level features. The improvement in recall indicates that the model is more sensitive to fine-grained targets such as small ships, and the overall improvement in mAP reflects that multi-scale information fusion enhances robustness in complex scenes. The slight drop in precision is likely caused by feature redundancy or noise introduced during scale fusion, but the F1 value remains higher than that of the baseline model, indicating that the multi-scale structure improves the comprehensiveness and robustness of detection while maintaining overall accuracy.
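A minimal sketch of such a parallel pathway, assuming a standard lateral-connection fusion of a low-level (fine) and a high-level (coarse) feature map; the concrete pathway in ES-YOLO may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Fuse low-level detail features with up-sampled high-level semantics.

    Both inputs are projected to a common channel width, aligned spatially,
    and merged so that fine resolution and strong semantics both reach the
    detection head.
    """
    def __init__(self, c_low: int, c_high: int, c_out: int):
        super().__init__()
        self.lateral_low = nn.Conv2d(c_low, c_out, 1, bias=False)
        self.lateral_high = nn.Conv2d(c_high, c_out, 1, bias=False)
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * c_out, c_out, 3, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        # Up-sample the coarse semantic map to the fine feature resolution.
        high_up = F.interpolate(self.lateral_high(high),
                                size=low.shape[-2:], mode="nearest")
        return self.fuse(torch.cat([self.lateral_low(low), high_up], dim=1))
```

Concatenation-plus-convolution fusion lets the network learn how much weight to give each scale, which is one plausible source of the recall gain reported above.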
In addition, to illustrate more intuitively how each module improves model performance, this paper conducts a visual analysis. As shown in Figure 8, (a) shows the ground-truth annotations, and (b)–(h) show the detection results after removing or adding different modules; red boxes mark detections and yellow boxes mark missed detections. The figure shows that when a key module is missing, the model is prone to incomplete detection boxes and missed detections in complex backgrounds or densely berthed areas. As the improved modules are introduced step by step, the detection results move closer to the ground-truth annotations, covering ship targets across scales and complex environments and markedly reducing missed detections.
Figure 9 presents the heatmap visualization results of ES-YOLO and the baseline model. High-activation regions correspond to ship bodies and boundary structures, indicating strong feature responses, whereas low-activation regions mainly appear in sea clutter and non-target areas. This visualization confirms that the designed modules not only enhance the representation of ship-related features but also effectively suppress background responses, thereby improving detection robustness in complex maritime scenes.
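Figure 9's visualization can be reproduced in spirit with a simple activation-map overlay. The paper does not state the exact method, so the sketch below (function name and normalization are ours) just averages the absolute channel activations of a chosen feature map and rescales them to [0, 1]:

```python
import torch
import torch.nn.functional as F

def activation_heatmap(feature_map: torch.Tensor, size=(640, 640)) -> torch.Tensor:
    """Collapse a (B, C, H, W) feature map into a per-image heatmap in [0, 1].

    A simple stand-in for a Figure 9-style visualization: absolute channel
    activations are averaged, up-sampled to the input resolution, and
    min-max normalized so high values mark strong feature responses.
    """
    heat = feature_map.abs().mean(dim=1, keepdim=True)            # (B, 1, h, w)
    heat = F.interpolate(heat, size=size, mode="bilinear", align_corners=False)
    flat = heat.flatten(1)
    mn = flat.min(dim=1).values.view(-1, 1, 1, 1)
    mx = flat.max(dim=1).values.view(-1, 1, 1, 1)
    return ((heat - mn) / (mx - mn + 1e-6)).squeeze(1)            # (B, H, W)
```

The resulting map can be colorized and alpha-blended with the input image, so high activations over ship bodies and low activations over sea clutter become directly visible.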
4.2. Comparative Experiments
In order to prove the effectiveness of the ES-YOLO method, it is compared with four other object detection algorithms: Faster R-CNN, RetinaNet, YOLOv5, and YOLOv8. The detection performance on the TJShip dataset is shown in Table 9.
As shown in Table 9, the proposed ES-YOLO shows significant advantages across all performance metrics. Its precision, recall, mAP, and F1 reach 94.16%, 73.89%, 84.92%, and 82%, respectively, the highest ship recognition accuracy among the compared methods. Faster R-CNN achieves the lowest recognition accuracy: as a two-stage detector, its complex structure and slow inference are ill-suited to efficient detection of multi-scale ships against complex backgrounds. Compared with YOLOv5 and YOLOv8, which achieve relatively high accuracy, ES-YOLO increases precision by 1.42% and recall by 7.04%; owing to their limited feature fusion and context modeling, these baselines still fall short on small and densely berthed ships in complex port scenes. ES-YOLO greatly strengthens feature extraction through the EACSA module and markedly improves small-ship recall through the multi-scale structure. Although RetinaNet detects some ships with high precision, its recall is only 65.92%, indicating that a substantial proportion of real ships go undetected.
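As a quick consistency check on these figures, F1 is the harmonic mean of precision and recall: F1 = 2PR/(P + R) = (2 × 0.9416 × 0.7389)/(0.9416 + 0.7389) ≈ 0.828, consistent with the reported F1 of 82% up to the rounding used in the table.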
In order to visually compare the detection performance of the various algorithms, this paper carries out a visual analysis; the results are shown in Figure 10.
Figure 10a–e correspond to the visualization of the detection results of Faster R-CNN, RetinaNet, YOLOv5, YOLOv8, and ES-YOLO, respectively.
Figure 10a shows that Faster R-CNN produces a large number of false and missed detections, further illustrating that the complex two-stage pipeline is poorly suited to ship detection in complex scenes. Figure 10b shows that RetinaNet achieves good detection accuracy on ships with unoccluded backgrounds and relatively complete hulls, but still misses many targets. YOLOv5 and YOLOv8 strike a reasonable balance between detection accuracy and recall, while the proposed ES-YOLO accurately identifies multi-scale ship targets against complex backgrounds, improves the detection of small ships, and achieves the best overall detection performance.
In addition to the quantitative comparison, we further verify the robustness of ES-YOLO under challenging maritime scenarios. Although no dedicated quantitative subset is provided for extreme-density or weather-specific evaluation, Figure 11 shows that the proposed method remains robust under dense berth scenes, wake interference, and low-contrast illumination conditions.
These visual examples provide complementary evidence that ES-YOLO maintains consistent detection behavior in real-world degraded environments.
4.3. Generalization Experiments
According to the quantitative results in Table 10 and the mAP comparison shown in Figure 12, clear performance differences can be observed among the detectors in port-scene ship detection. Overall, one-stage detectors achieve high accuracy while maintaining real-time inference capabilities. YOLOv8 and ES-YOLO exhibit particularly strong performance, achieving near-saturated accuracy across almost all categories, with mAP values of 97.50% and 97.83%, respectively. Notably, ES-YOLO achieves nearly perfect accuracy for bulk cargo and ore carriers, indicating that its enhanced feature representation is highly effective at capturing large ship structures while suppressing ocean-surface background interference.
In comparison, Faster R-CNN demonstrates stable performance, but its accuracy decreases for small-scale categories such as fishing boats and passenger ships. This suggests that the traditional two-stage framework still has limitations in localizing fine-grained targets under complex port backgrounds. YOLOv5 shows balanced performance overall, but its accuracy remains lower than the more advanced YOLOv8 family.
A prominent observation is the significant performance drop of RetinaNet in the ore carrier category, where the AP falls to only 0.22 (22%), resulting in a relatively low overall mAP of 74.66%. This indicates that RetinaNet's feature pyramid struggles with the large scale variations and elongated ship structures commonly found in port environments.
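This sensitivity follows directly from how mAP is computed: per-class AP values are averaged with equal weight, so a single weak class is costly. As a purely illustrative example (not the actual Table 10 values), with five classes in which four score 0.88 AP and one scores 0.22, mAP = (4 × 0.88 + 0.22)/5 ≈ 74.8%, the same order as RetinaNet's reported 74.66%.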
In summary, the experimental results demonstrate that advanced one-stage detectors (ES-YOLO) significantly outperform both traditional two-stage models and earlier one-stage approaches in port-scene ship detection. Their superior robustness to multi-scale targets, densely berthed ship clusters, and visually complex maritime backgrounds make them more suitable for practical deployment in real-world port monitoring systems.