Enhanced YOLOv7 Based on Channel Attention Mechanism for Nearshore Ship Detection

Zhu, Qingyun; Zhang, Zhen; Mu, Ruizhe

doi:10.3390/electronics14091739

by Qingyun Zhu ¹,*, Zhen Zhang ¹ and Ruizhe Mu ²

¹ School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China

² School of Automation and Electrical Engineering, Tianjin University of Technology and Education, Tianjin 300222, China

* Author to whom correspondence should be addressed.

Electronics 2025, 14(9), 1739; https://doi.org/10.3390/electronics14091739

Submission received: 6 March 2025 / Revised: 15 April 2025 / Accepted: 15 April 2025 / Published: 24 April 2025

(This article belongs to the Special Issue Intelligent Systems in Industry 4.0)

Abstract

Nearshore ship detection is an important task in marine monitoring, playing a significant role in navigation safety and in combating smuggling. The continuous research and development of Synthetic Aperture Radar (SAR) technology is of great importance in military and maritime security fields and also has great potential in civilian fields, such as disaster emergency response, marine resource monitoring, and environmental protection. Because nearshore ship datasets contain few samples, they cannot meet the demand of existing deep learning algorithms for large quantities of training data, which limits recognition accuracy. At the same time, man-made features such as buildings can cause significant interference in SAR imaging, making it more difficult to distinguish ships from the background. Ship target images are also strongly affected by speckle noise, posing additional challenges to data-driven recognition methods. Therefore, we used a Concurrent Single-Image GAN (ConSinGAN) to generate high-quality synthetic samples, re-labeled them, and merged them with the nearshore images extracted from the SAR-Ship dataset before dividing the dataset. Experimental analysis showed that the ship recognition model trained with the augmented images achieved a 4.66% increase in precision, a 3.68% increase in recall, and a 3.24% increase in average precision (AP) at an Intersection over Union (IoU) threshold of 0.5. Subsequently, an enhanced YOLOv7 algorithm (YOLOv7 + ESE) incorporating channel-wise information fusion was developed by integrating the Squeeze-and-Excitation (SE) channel attention mechanism into the YOLOv7 architecture. Comparative experiments showed that the proposed algorithm improved precision by 0.36%, recall by 0.52%, and average precision (AP@0.5) by 0.65% compared to the baseline model. This optimized architecture enables accurate detection of nearshore ship targets in SAR imagery.

Keywords:

nearshore ship detection; synthetic aperture radar (SAR); concurrent single-image GAN (ConSinGAN); squeeze-and-excitation (SE); YOLOv7 algorithm; average precision (AP)

1. Introduction

Synthetic Aperture Radar (SAR) [1] is an active microwave imaging device capable of operating under all weather and lighting conditions, day and night, making it crucial for monitoring ship targets in both military and commercial sectors [2]. Consequently, ship detection based on SAR imagery has become a focal point of research. Traditional methods for ship detection in SAR imagery primarily include the Constant False Alarm Rate (CFAR) algorithm [3] and feature extraction-based algorithms [4]. In nearshore areas, ship detection must contend with the dense arrangement of ships, as well as numerous man-made targets in terrestrial environments, such as buildings, whose characteristics are similar to those of ships, posing significant difficulties and challenges for precise target detection. Looking ahead, with the development of intelligent maritime surveillance systems, SAR-based nearshore ship target recognition technology will become an indispensable component thereof.

Remote sensing imagery is a key technological means of acquiring surface information through aerial or satellite platforms. It provides significant assistance to researchers in scientific research, but its drawbacks are becoming increasingly apparent. On one hand, remote sensing imagery has high requirements for photographic equipment; on the other hand, there is a scarcity of labeled raw image samples, which necessitates a substantial amount of manual time for annotation. Data augmentation, as an effective means to address the aforementioned issues, is widely applied by researchers [5]. Depending on the principles of image generation, data augmentation can be divided into methods based on traditional image processing and methods based on neural networks. Data augmentation based on image processing mainly includes geometric transformation models [6], noise perturbation models [7], and photometric conversion models [8]. Data augmentation based on neural networks primarily utilizes image information for data expansion, mainly divided into Variational Auto Encoding (VAE) models [9] and Generative Adversarial Network (GAN) models [10]. The research team from Southwest Jiaotong University [11] constructed a position-aware conditional generative adversarial network (PCGAN) to generate high-quality and robust SAR ship imagery, thereby enhancing ship detection accuracy. This approach demonstrated the effectiveness of adversarial generation strategies in addressing data scarcity challenges, while highlighting the potential application of GANs in maritime target detection through synthetic SAR data augmentation.

In recent years, the rapid development of deep learning technology has led to its widespread application in the field of image recognition, achieving significant results. Many researchers have begun to apply deep learning technology to the task of SAR ship recognition [12,13]. Deep learning-based object detection is divided into two major categories: region proposal-based methods, such as the R-CNN [14,15,16] series, and regression-based methods, such as SSD [17], the YOLO [18,19,20,21] series, and RetinaNet [22]. In deep learning technology, Convolutional Neural Networks (CNNs) [23] can automatically extract features through hierarchical learning, making full use of image information, breaking the limitations of traditional methods, and demonstrating strong robustness in various complex environments. For example, a team from the Academy of Military Science [24] proposed a feature reconstruction network combined with an attention mechanism for ship target detection. This method reduced noise interference caused by dense targets by introducing attentional rotation anchor boxes and enhanced the precision of features through a feature reconstruction module. A team from Nanjing Agricultural University [25] applied CIoU-Loss and DIoU-NMS to the original model, reducing the interference of background information on target detection and improving the model’s detection performance in occlusion scenarios. These studies indicate that deep learning technology has great potential and application value in the field of ship recognition.

Attention mechanisms [26], which have emerged as a core technology in recent years driven by advances in deep learning, have been widely applied in fields such as natural language processing, computer vision, speech recognition, and statistical learning. Inspired by human cognitive processes, these mechanisms abstract the biological concept of attention into machine learning, where their essence lies in intelligently allocating limited computational resources to prioritize critical information, akin to how humans “focus on the important and ignore the trivial”. In the domain of object detection, attention mechanisms enhance performance by dynamically allocating feature weights, enabling models to focus on key regions of an image, much like human vision. Typical applications include integrating channel attention modules into backbone networks to enhance target feature representation [27] and incorporating cross-scale attention mechanisms into feature pyramids to improve small object detection [28], among others. These approaches not only boost detection accuracy but also effectively address challenges such as occlusion and scale variations in complex scenarios.

This paper addresses the small-sample learning issue of SAR nearshore ship datasets characterized by a scarcity of samples and poor diversity. By employing a Concurrent Single-Image GAN (ConSinGAN), which includes the generator and discriminator network’s game theory and multi-scale feature extraction mechanisms, high-quality synthetic samples are generated to enhance data diversity and ensure the consistency of generated samples at different resolutions. This approach strengthens the anti-overfitting capability of the ship recognition model. To improve the detection performance of nearshore ship targets against complex backgrounds, structural improvements are made to the YOLOv7 model. The Squeeze-and-Excitation (SE) [29] module is utilized, which leverages global information to enhance informative positive sample features while suppressing unimportant negative sample features. By integrating the SE module with the backbone network layer of the YOLOv7 model, the model’s feature extraction capability for nearshore ships is enhanced. Finally, through comparative experiments with various attention modules and the analysis of detection results for ship targets in SAR images under complex environments, the practicality and effectiveness of the proposed algorithm in detecting nearshore ship targets in complex backgrounds of SAR images are validated.

2. Related Work

2.1. Data Enhancement

The Generative Adversarial Network (GAN) was proposed by Goodfellow and colleagues in 2014 [10]. A GAN consists of a generator network and a discriminator network; the generator combines original images with random noise to generate synthetic samples, aiming to produce images indistinguishable from real ones. The discriminator helps the generator optimize its generation capabilities by classifying real and generated images as true or fake. With the rapid development of deep learning, GAN has been widely applied in the field of data augmentation, leading to the derivation of various types of GAN models, such as Wasserstein GAN [30], Progressive GAN [31], and Conditional GAN [32]. However, the aforementioned data augmentation methods are only applicable when there is an abundance of training samples and are not suitable for small-sample or single-sample data generation. Currently, only a few models can achieve training based on a single original image or perform transfer learning between small samples, among which Single-Image GAN (SinGAN) [33] and Concurrent Single-Image GAN (ConSinGAN) [34] are relatively outstanding performers. The team from Liaoning Technical University [35] utilized the ConSinGAN model to generate images of varying resolutions through scale transformation, thereby increasing the volume of image data. The team from Central South University for Nationalities [36] tackled the problem of limited image samples by enhancing images with the ConSinGAN model, thereby augmenting the training sample set and enhancing the stability of image detection tasks.

The ConSinGAN model takes a single real image as input and, during training, scales the image proportionally to serve as input for the discriminator, thereby supporting training across multiple iterative stages, which is why it is called the Concurrent Single-Image GAN (ConSinGAN). As shown in Figure 1, the model consists of N proportionally scaled generators, each corresponding to a training stage.

ConSinGAN is trained iteratively in multiple stages. To achieve end-to-end image generation, each stage of ConSinGAN takes the raw features from the previous stage as input, unlike conventional iterative generators, which pass intermediate generated images between stages. This design avoids the overfitting associated with feeding generated intermediate images to subsequent generators, which can prevent the model from generating images with rich feature representations and thereby cause mode collapse.

In the later stages of training, the dimensions of the random noise vector increase as the training progresses, leading to a corresponding increase in the size of the generated images. Throughout this process, the size of the generated images remains consistent with the proportionally scaled single input image. Concurrently, as the image size increases, the model’s convolutional receptive field decreases, which is more conducive to extracting shallow-level image information. However, in each stage, the number of feature channels and the size of the feature maps remain constant, with only the image resolution changing.

For the discriminator, the parameter count remains the same at each stage. At any given stage n, the discriminator weights from the previous stage are employed. The optimization process aims to minimize the adversarial loss and the reconstruction loss, ensuring robust and accurate image generation:

L = \min_{G_{n}} \max_{D_{n}} L_{adv}(G_{n}, D_{n}) + \alpha L_{rec}(G_{n})    (1)

L_{adv}(G_{n}, D_{n}) employs WGAN-GP for the adversarial loss, and the reconstruction loss is used to enhance training stability (α = 10 for all experiments). For the reconstruction loss, the generator G_{n} takes the downsampled version (x_{0}) of the original image (x_{n}) as input and reconstructs the image at stage n:

L_{rec}(G_{n}) = \| G_{n}(x_{0}) - x_{n} \|_{2}^{2}    (2)

In each stage, the discriminator is trained identically, with both generated and authentic images as inputs, and is optimized to maximize the objective function L_{adv}.
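As a rough illustration of Equations (1) and (2), the following sketch computes the squared-L2 reconstruction loss and the combined generator objective for a single stage. Images are represented as flat lists of pixel values, and `adv_loss` is a placeholder number standing in for the WGAN-GP term; both are simplifying assumptions made for this example rather than the authors' implementation, and the function names are hypothetical.

```python
def reconstruction_loss(generated, target):
    """Squared L2 distance between G_n(x_0) and the stage-n image x_n (Eq. 2)."""
    if len(generated) != len(target):
        raise ValueError("images must have the same number of pixels")
    return sum((g - t) ** 2 for g, t in zip(generated, target))


def generator_objective(adv_loss, rec_loss, alpha=10.0):
    """Stage-n generator loss: adversarial term plus alpha-weighted reconstruction term (Eq. 1)."""
    return adv_loss + alpha * rec_loss


# Hypothetical 4-pixel example (values chosen only for illustration).
x_n = [0.2, 0.4, 0.6, 0.8]      # real image at stage n
g_out = [0.1, 0.4, 0.7, 0.8]    # generator output G_n(x_0)
rec = reconstruction_loss(g_out, x_n)
total = generator_objective(adv_loss=-0.5, rec_loss=rec)
```

With α = 10, a small pixel-wise error already contributes noticeably to the total, which is consistent with the stabilizing role the text attributes to the reconstruction term.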

2.2. Object Detection Model

Currently popular detection algorithms include NanoDet [37] and DETR [38]. NanoDet is computationally fast, but its detection accuracy is relatively low. DETR offers high detection accuracy, but it has a slow inference speed and a large computational cost. The YOLOv7 [39] model maintains both high detection accuracy and high inference speed and also improves small-object detection accuracy, which better matches the requirements of nearshore ship detection in our work. Although later algorithms derived from YOLOv7 have improved performance indicators, YOLOv7 retains advantages as a basis for algorithm improvement [40]: adding the same attention module to YOLOv7 has achieved better results in F1 score and confidence threshold than adding it to subsequent versions. Based on this, the YOLOv7 model was used for research and algorithm improvement in this work.

YOLOv7 is a single-stage object detection algorithm developed based on YOLOv4 [41] and YOLOv5, effectively balancing detection speed and accuracy, providing strong technical support for detecting nearshore ship targets in complex environments. The object detection process relies on the YOLOv7 network model, which includes four main levels: the input layer, the backbone network (Backbone), the neck network (Neck), and the output detection layer. Specifically, the input layer is responsible for receiving image data; the backbone network is used for extracting image features; the neck network further processes and fuses features; finally, the output detection layer is responsible for generating the detection results of the targets. The network structure design of YOLOv7 is shown in Figure 2, aimed at improving the efficiency and effectiveness of detection.

As shown in Figure 2, the input layer (Input) resizes the incoming three-channel raw images to a uniform size of 640 × 640 pixels and applies Mosaic data augmentation. This technique increases the variability and complexity of the dataset by randomly scaling, cropping, and arranging four distinct images into a single composite image.

The backbone network layer is primarily constituted by the Extended Efficient Layer Aggregation Network (E-ELAN) [42] and Max Pooling (MP) [43] modules, which synergistically engage in deep convolution to extract multi-scale feature information from the imagery.

The neck network (Neck), also referred to as the feature fusion layer, combines the SPPCSPC [44] module, which integrates Spatial Pyramid Pooling (SPP) with Cross-Stage Partial Connections (CSPC), with Conv-BN-SiLU (CBS) blocks, an ELAN variant termed ELAN-W, and upsampling operations. These components fuse feature maps across different scales and produce feature maps at three scales (small, medium, and large) for the subsequent detection layer.

The output detection layer (Head) assumes the responsibility of processing the extracted features, refining the predicted anchor box coordinates, classifying categories, and ascertaining confidence levels. It subsequently yields the conclusive detection outcomes following the application of Non-Maximum Suppression (NMS), thereby ensuring the precision and dependability of the detection results.
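The Non-Maximum Suppression step mentioned above can be sketched as follows. This is a minimal greedy version for a single class, assuming boxes in `(x1, y1, x2, y2)` corner format and an IoU threshold of 0.5; the threshold, the box format, and the function names are assumptions for illustration and are not taken from the YOLOv7 code base.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, discard boxes that overlap it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Retain only candidates whose overlap with the kept box is at or below the threshold.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


# Hypothetical detections: two heavily overlapping boxes and one separate box.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # indices of the boxes that survive suppression
```

In dense nearshore scenes the threshold matters: a low value can merge adjacent ships into one detection, while a high value leaves duplicate boxes on the same ship.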

2.3. Attention Mechanism

The Squeeze-and-Excitation Network (SENet), a representative channel attention mechanism, won the image classification task of the 2017 ImageNet competition. This network models the importance of different channels by capturing the interrelationships between convolutional layer features, using global information to enhance informative positive sample features while suppressing unimportant negative sample features, thereby optimizing network performance. The SE attention mechanism significantly improves the performance of existing networks with only a slight increase in computational load. SE first reduces the feature map to a feature vector through a squeeze operation and then learns a weight for each channel through an excitation operation; these weights are subsequently used to rescale the channels of the original feature map, achieving dynamic feature recalibration.

The SE [29] attention mechanism primarily enhances key features while suppressing less important ones by dynamically adjusting the weights of feature channels. This is achieved by performing three key steps on convolutional feature maps: Squeeze, Excitation, and feature reweighting, as shown in Figure 3.

The Squeeze step, aimed at addressing channel dependencies, employs global average pooling across channels to compress global spatial information into a channel descriptor. Specifically, it reduces a feature map of size W × H × C to a 1 × 1 × C feature vector Z, where each of the C channels is compressed into a single value. This results in channel-level statistics Z that incorporate contextual information, thereby mitigating the issue of channel dependencies. This process is defined as follows:

z_{c} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_{c}(i, j)    (3)

where z_{c} is the c-th element of Z and u_{c}(i, j) is the value of the c-th channel of the input feature map at position (i, j).

The Excitation step consists of two fully connected layers. The first fully connected layer compresses the C channels into C/r channels to reduce computational complexity, followed by a ReLU non-linear activation layer. The second fully connected layer restores the number of channels back to C, and a Sigmoid activation is applied to generate the weights s. The resulting s has dimensions 1 × 1 × C, where each element represents the importance weight for the corresponding channel in the feature map. The parameter r denotes the compression ratio.

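s = \sigma(g(z, W)) = \sigma(W_{2} \delta(W_{1} z))    (4)

where δ denotes the ReLU activation, σ denotes the Sigmoid activation, and W_{1} and W_{2} are the weights of the first and second fully connected layers, respectively.

Feature reweighting uses the weights obtained from the Excitation step to recalibrate each channel of the original input feature maps. By weighting specific channel features, it enhances or suppresses them, ultimately producing the output \tilde{X}:

\tilde{X}_{c} = s_{c} u_{c}    (5)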

Due to its plug-and-play nature, the SE module is widely applied across various network architectures and significantly enhances model performance. Through this attention mechanism, networks can focus more on features that are critical for object recognition tasks while ignoring unimportant information, thereby improving feature representation and model recognition accuracy.
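The three steps of Equations (3) to (5) can be traced with a small dependency-free sketch. It operates on nested lists of shape [C][H][W], omits the bias terms of the fully connected layers, and uses hand-picked toy weights; these are assumptions for readability, and a real model would use a tensor library with learned weights.

```python
import math


def se_forward(u, w1, w2):
    """Apply Squeeze-and-Excitation to feature maps u of shape [C][H][W].

    w1 has shape [C // r][C] and w2 has shape [C][C // r]; biases are omitted.
    Returns the recalibrated feature maps and the channel weights s.
    """
    # Squeeze (Eq. 3): global average pooling yields one value per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in u]
    # Excitation (Eq. 4): fully connected -> ReLU -> fully connected -> Sigmoid.
    hidden = [max(0.0, sum(w * zc for w, zc in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden)))) for row in w2]
    # Reweighting (Eq. 5): scale every element of channel c by s_c.
    out = [[[s_c * v for v in row] for row in ch] for s_c, ch in zip(s, u)]
    return out, s


# Hypothetical toy input: C = 2 channels of size 2 x 2, reduction ratio r = 2.
u = [[[1.0, 1.0], [1.0, 1.0]],
     [[2.0, 2.0], [2.0, 2.0]]]
w1 = [[1.0, -1.0]]      # maps C = 2 channels to C / r = 1 hidden unit
w2 = [[1.0], [0.0]]     # maps the hidden unit back to C = 2 channels
x_tilde, s = se_forward(u, w1, w2)
```

Because the Sigmoid output lies in (0, 1), each channel is attenuated by its own factor and the relative weighting between channels carries the attention signal.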

3. Materials and Methods

3.1. Experimental Setup and Parameter Metrics

All experiments were conducted on a server with the following hardware setup: NVIDIA GeForce RTX 3090 GPUs, 64 GB of RAM, and an Intel X5 CPU. The software environment consisted of Ubuntu 18.04, Python 3.9, PyTorch 2.2.1, and CUDA 12.1.

The ConSinGAN model, during the data augmentation training process for SAR nearshore ship images, employed a learning rate of 0.3, with each image undergoing 10 training epochs and 2000 iterations per epoch, utilizing the Adam optimizer.

In assessing the object detection performance of the YOLOv7 model, this study employed five core metrics: precision, recall, average precision (AP), mean average precision (mAP), and F1 score (F1-Score). The calculation methods for these metrics are as follows:

Precision = \frac{TP}{TP + FP}    (6)

Recall = \frac{TP}{TP + FN}    (7)

AP = \int_{0}^{1} P \, dR    (8)

mAP = \frac{\sum_{i=1}^{N} AP_{i}}{N}    (9)

F1\text{-}Score = \frac{2 \times Precision \times Recall}{Precision + Recall}    (10)

In this context: True Positives (TP) denote the number of instances where the model correctly identifies positive samples, False Positives (FP) represent the instances where the model erroneously classifies negative samples as positive, and False Negatives (FN) indicate the cases where the model mistakenly labels positive samples as negative. Precision is the likelihood that a detection box marked as a positive class by the model is indeed a positive class, while recall is the probability that a positive sample is accurately predicted among all actual positive samples. Mean average precision (mAP) is derived from the average of average precision (AP) values across all classes, serving as a critical metric for evaluating the performance of the object detection network model. AP, in this case, refers to the mean precision at various levels of recall. The total number of classes is represented by N. Since this experiment involved a single target class, namely, ships, the evaluation metric was simplified to AP. An elevated AP value signifies superior detection capabilities of the model. The F1 score ranges between 0 and 1, with an ideal value of 1, indicating that the model has perfect precision and recall. A high F1 score reflects a good balance between precision and recall, and it is commonly used in tasks that involve imbalanced datasets or require a trade-off between accuracy and completeness.
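Equations (6) to (8) and (10) can be checked numerically with a short sketch. The AP function below approximates the integral in Equation (8) with the trapezoidal rule over a few sampled points of the precision-recall curve; this is a simplification chosen for clarity and may differ from the interpolation scheme used by the YOLOv7 evaluation code. All counts and curve points are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from detection counts (Eqs. 6, 7, 10)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def average_precision(recalls, precisions):
    """Approximate AP (Eq. 8) as the trapezoidal area under the P-R curve.

    The points must be sorted by increasing recall.
    """
    ap = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        ap += width * (precisions[i] + precisions[i - 1]) / 2.0
    return ap


# Hypothetical counts: 8 correct detections, 2 false alarms, 2 missed ships.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)

# Hypothetical three-point precision-recall curve.
ap = average_precision([0.0, 0.5, 1.0], [1.0, 0.8, 0.6])
```

With a single class, as in this work, mAP in Equation (9) reduces to this one AP value.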

In the experiments, we adhered to the principle of controlled variables. Specifically, the number of training epochs was set to 400, the batch size was 16, the input image size was 640 × 640 pixels, the optimizer was SGD, and the number of warm-up epochs was 3.

3.2. Dataset Creation

The SAR-Ship dataset [45] consists of 102 images from the Chinese Gaofen-3 (GF-3) satellite and 108 Sentinel-1 images. These images were processed into 43,819 image patches of 256 × 256 pixels, each containing one or more ship targets, totaling 59,535 annotated ship instances. The spatial resolutions of the images include 3 m, 5 m, 8 m, 10 m, 22 m, and 25 m. The annotation files are in XML format, recording the position and category information of each ship within the image patches. The dataset includes a variety of offshore and nearshore images. For this paper, the nearshore images were filtered from the SAR-Ship dataset, yielding a total of 3644 images.

During the construction of the dataset, annotating images with bounding boxes is a critical step. This study employed the LabelImg annotation tool for marking sample images, with annotation files adhering to the YOLO format. Figure 4a depicts an unannotated image, while Figure 4b shows an image annotated in the YOLO format. The annotation file is a text file in which each line contains five parameters in the format {label, x-coordinate, y-coordinate, width, height}. The label is the class identifier of the object within the bounding box; the x-coordinate and y-coordinate give the center of the bounding box; and the width and height give the width and height of the bounding box, respectively.
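To make the five-parameter annotation format concrete, the sketch below parses one label line and converts it to pixel corner coordinates. It assumes the usual YOLO convention in which the center and size are stored as fractions of the image width and height; the example line, the image size, and the function name are hypothetical.

```python
def yolo_to_pixel_box(line, img_w, img_h):
    """Convert one YOLO label line to (label, x_min, y_min, x_max, y_max) in pixels."""
    parts = line.split()
    if len(parts) != 5:
        raise ValueError("expected: label x_center y_center width height")
    label = int(parts[0])
    xc, yc, w, h = (float(p) for p in parts[1:])
    # Assumption: coordinates are normalized to [0, 1] relative to the image size.
    x_min = (xc - w / 2) * img_w
    y_min = (yc - h / 2) * img_h
    x_max = (xc + w / 2) * img_w
    y_max = (yc + h / 2) * img_h
    return label, x_min, y_min, x_max, y_max


# Hypothetical annotation for a 256 x 256 patch with a single class (0 = ship).
box = yolo_to_pixel_box("0 0.5 0.5 0.25 0.5", 256, 256)
```

Storing normalized values keeps the labels valid when patches are resized, for example from 256 × 256 to the 640 × 640 network input.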

3.3. Improvement of Backbone Layer Based on SE

For the task of ship target detection under complex nearshore backgrounds, the YOLOv7 algorithm was adopted and optimized. We strengthened the model’s backbone network for more efficient feature extraction and integrated the SE module with the backbone network to enhance the detection capability of nearshore ships in complex backgrounds.

Given that the primary function of the SE (Squeeze-and-Excitation) module is to reinforce important channel features, it was integrated into the Backbone module, particularly at the position responsible for feature fusion. ELAN (Efficient Layer Aggregation Network) improves feature extraction capabilities through depthwise separable convolutions and progressive feature aggregation, effectively enhancing feature representation. In the YOLOv7 model, the feature map output by the ELAN module contains crucial semantic information at the current layer. The SE module can dynamically adjust channel weights to enhance target-relevant features by boosting key features and suppressing noise or irrelevant features, thereby improving the overall performance and accuracy of the network. This improvement not only strengthens feature discriminability but also aids the network in better recognizing and localizing targets in object detection tasks.

The YOLOv7 model’s Backbone network is composed of key components such as CBS layers, ELAN layers, and MP-1 layers. In YOLOv7 + SE, we introduced the SE module after the four ELAN structures within the Backbone network, as shown in Figure 5. By doing so, we enhanced the feature representation capability at the intermediate stages of the Backbone network.

3.4. YOLOv7 Model Integrating ELAN and SE

The ELAN module is an efficient network design that captures richer feature sets by managing the shortest and longest gradient paths. It consists of two branches: one adjusts the number of channels using a 1 × 1 convolution, and the other combines a 1 × 1 convolution with four 3 × 3 convolutional modules to adjust the number of channels and then extract features. These features are eventually merged. The SE module was integrated into the last CBS of ELAN to form CBS-SE, replacing the conventional CBS module. The improved ELAN network structure is shown in Figure 6.

By combining the ELAN module with the SE attention mechanism, an enhanced network structure, referred to as the ESE network in this paper, was formed. The core task of this network was to extract multi-level features from input images and integrate them to obtain comprehensive ship images, thereby improving the detection accuracy of ship targets.

4. Results

4.1. Data Augmentation Experiment Results

After training, this experiment utilized 300 original dataset images to generate 3000 nearshore ship images. Figure 7 illustrates the image generation process at different stages, revealing that ConSinGAN introduced more detailed variations at each stage of the nearshore ship image generation.

In the experiment, p represented different training stages ranging from 1 to 10, indicating that ConSinGAN iteratively refined a single image through multiple iterations, gradually generating clearer images with increasing resolution.

Figure 8 displays the final generated images, which, by introducing random noise, exhibited noticeable detail differences compared to the original images, thereby enhancing the diversity of the nearshore ship images produced by ConSinGAN.

In image generation tasks, the Fréchet Inception Distance (FID) [46] is a metric used to evaluate the quality of images generated by models. A lower score correlates strongly with higher-quality images, and the smaller the value, the higher the “realism” of the generated images. The FID metric is the squared Wasserstein distance between two multidimensional Gaussian distributions and is given by the following formula:

FID = \| \mu_{real} - \mu_{G} \|_{2}^{2} + Tr(\Sigma_{real} + \Sigma_{G} - 2 (\Sigma_{real} \Sigma_{G})^{\frac{1}{2}})    (11)

where \mu_{real} and \mu_{G} represent the means of the features of the real and generated images, respectively, and \Sigma_{real} and \Sigma_{G} represent the corresponding covariance matrices.

Since the images were generated from single samples, the Single Image FID (SIFID) was used as the evaluation metric. The reported value is the average over multiple images generated by the model after successful training:

SIFID = \frac{1}{n} \sum_{i=1}^{n} \left[ \| \mu_{real}^{(i)} - \mu_{G}^{(i)} \|_{2}^{2} + Tr(\Sigma_{real}^{(i)} + \Sigma_{G}^{(i)} - 2 (\Sigma_{real}^{(i)} \Sigma_{G}^{(i)})^{\frac{1}{2}}) \right]    (12)

where n is the number of real and generated image pairs compared.

The statistical distributions of the real and generated images were compared one-to-one, and the resulting values were summed and averaged to obtain the SIFID. The SIFID computed between the original SAR nearshore ship training images and the 3000 generated images was 25.24, indicating that the generated images were of high quality while retaining some variability.
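The structure of Equations (11) and (12) can be illustrated in the special case of diagonal covariance matrices, where the matrix square root reduces to element-wise square roots. This restriction is an assumption made to keep the sketch dependency-free; practical FID and SIFID implementations use full covariance matrices of Inception features. The statistics below are invented for the example.

```python
import math


def fid_diagonal(mu_real, var_real, mu_gen, var_gen):
    """FID between two Gaussians with diagonal covariances (special case of Eq. 11).

    For diagonal matrices, Tr(S_r + S_g - 2 (S_r S_g)^(1/2)) equals the sum of
    v_r + v_g - 2 * sqrt(v_r * v_g) over the feature dimensions.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_real, mu_gen))
    trace_term = sum(
        vr + vg - 2.0 * math.sqrt(vr * vg) for vr, vg in zip(var_real, var_gen)
    )
    return mean_term + trace_term


def sifid(pairs):
    """Average the per-pair distance over n real/generated pairs (Eq. 12)."""
    return sum(fid_diagonal(*p) for p in pairs) / len(pairs)


# Hypothetical 2-dimensional feature statistics: (mu_real, var_real, mu_gen, var_gen).
pair_a = ([0.0, 0.0], [1.0, 4.0], [1.0, 0.0], [1.0, 1.0])
pair_b = ([0.5, 0.5], [2.0, 2.0], [0.5, 0.5], [2.0, 2.0])  # identical statistics
score_a = fid_diagonal(*pair_a)
score = sifid([pair_a, pair_b])
```

Identical statistics give a distance of zero, so lower values indicate generated images whose feature distribution is closer to that of the real image.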

To validate the effectiveness of the work, 3000 generated nearshore ship images were annotated and merged with the original SAR-Ship dataset containing 3644 nearshore images. The combined dataset was then used to train the original YOLOv7 model, with training results compared using precision, recall, and average precision (AP@0.5) as metrics. The specific experimental results are shown in Table 1.

In the experimental results table, “original” refers to the original dataset, and “fake” represents the generated dataset. From the experimental results, it can be observed that precision, recall, and AP@0.5 increased to varying degrees. Compared to the original nearshore dataset, the model training precision increased by 4.66%, recall increased by 3.68%, AP@0.5 increased by 3.24%, AP@0.5:0.95 increased by 3.11%, and the F1 score increased by 3.17%. The comparison between the original nearshore data and “original + 3000 fakes” is illustrated in the data curves shown in Figure 9, Figure 10 and Figure 11.

Figure 12 presents some representative SAR images with ship detection results. The rectangular boxes in the images indicate the predicted boundaries of the ships by the algorithm, and the numbers above the boxes represent the confidence level that the target is a ship. These results demonstrate the effectiveness of GAN data augmentation in ship detection tasks. It can be observed from the images that GAN data augmentation improved the detection performance for complex-background nearshore ship SAR images.

4.2. Experimental Results of Attention Mechanism Ablation

To evaluate the enhancement effects of the SE attention mechanism on the YOLOv7 network, we also implemented improvements to the YOLOv7 network based on the CBAM, CA, and ECA attention mechanisms for comparison. After training the modified networks, we assessed the performance of each attention mechanism. The specific experimental results are presented in Table 2, which compares the performance of the original model with those incorporating the CBAM, CA, SE, and ECA attention mechanisms.

From the experimental results, it is evident that adding the CBAM or ECA attention mechanism improved precision but lowered the other metrics overall. Similarly, adding the CA attention mechanism improved recall but lowered the other metrics overall. In contrast, the SE attention mechanism placed greater emphasis on the ship’s features: despite a decrease in precision, recall improved by 2.47%, AP@0.5 increased by 0.09%, AP@0.5:0.95 increased by 0.14%, and the F1 score increased by 0.22%.

4.3. Experimental Results of ESE Network Ablation

The YOLOv7 + ESE and YOLOv7 + SE models were trained separately on the dataset, and the two attention mechanism-based models were then evaluated. The detailed experimental results are presented in Table 3, which compares the performance of the YOLOv7 + ESE and YOLOv7 + SE models.

The experimental results show that fusing the SE and ELAN modules improved precision by 0.36%, recall by 0.52%, and AP@0.5 by 0.65% compared to the original YOLOv7 model. This indicates that incorporating the SE attention mechanism into ELAN handled complex nearshore scenarios effectively, enhanced the features of nearshore ships in SAR images, and improved detection by capturing more ship-related information. In Figure 13, Figure 14 and Figure 15, the red solid line represents the YOLOv7 + ESE model. After integrating the SE module with the ELAN module, both precision and AP@0.5 improved.

Figure 16 presents the detection results on the test set using the SE and ESE modules. Rectangular boxes indicate the ship boundaries predicted by the algorithm, and the numbers above the boxes represent the confidence levels of the detected ships. Compared to Figure 4a and Figure 16a, the confidence level improved in Figure 4b and Figure 16b. The YOLOv7 + ESE model demonstrated enhanced feature acquisition capabilities for nearshore ship SAR images with complex backgrounds. These results confirmed the effectiveness of the YOLOv7 + ESE model in ship detection tasks.

5. Discussion

5.1. Decrease in Precision After Adding SE

The experimental results of the Backbone layer improvement based on the SE module showed a decrease in precision. This was mainly because the SE module efficiently captured the dependencies of global channel features and had a strong ability to model long-range contextual information. However, this could also amplify some irrelevant features in the background, such as sea waves, dock structures, and occluding objects in nearshore environments, leading to an increase in false positives (FPs). Moreover, inserting the SE attention mechanism module directly after the ELAN structure in the Backbone network maintained the original efficiency and structure of ELAN but altered the overall model structure, preventing the full utilization of the SE module’s advantages and thus increasing the false detection rate. As shown in Figure 17, after adding the SE attention mechanism, some coastal backgrounds were mistakenly detected as ships. Therefore, the ESE network structure was formed by integrating the SE module with the ELAN module.
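
To make the mechanism discussed above concrete, the following is a minimal pure-Python sketch of a squeeze-and-excitation block: global average pooling per channel (squeeze), two fully connected layers with ReLU and sigmoid (excitation), and per-channel rescaling. The function name, the toy weights, and the omission of bias terms are assumptions made for illustration; this is not the authors' implementation.

```python
import math


def se_block(feature_map, w1, w2):
    """Apply a squeeze-and-excitation gate to a feature map.

    feature_map: list of C channels, each an H x W list of lists.
    w1: (C // r) x C weights of the first fully connected layer.
    w2: C x (C // r) weights of the second fully connected layer.
    Bias terms are omitted for brevity.
    """
    # Squeeze: one global average per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]
    # Excitation: FC -> ReLU -> FC -> sigmoid gives one gate per channel.
    hidden = [max(0.0, sum(w * x for w, x in zip(row, z))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: every spatial position in a channel is multiplied by the same gate.
    scaled = [[[g * v for v in row] for row in ch] for ch, g in zip(feature_map, gates)]
    return scaled, gates


# Toy input: two 2 x 2 channels and hypothetical weights (reduction to 1 unit).
features = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[3.0, 3.0], [3.0, 3.0]],
]
w1 = [[0.5, 0.5]]
w2 = [[1.0], [-1.0]]

scaled, gates = se_block(features, w1, w2)
print([round(g, 4) for g in gates])
```

Because each gate is computed from a channel-wide average and applied uniformly across the channel, background responses in an emphasized channel are scaled by the same factor as ship responses, which is consistent with the increase in false positives described above.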

5.2. The Influence of Speckle Noise

The speckle noise in SAR images has a profound and complex impact on target detection. This multiplicative noise not only reduces the signal-to-noise ratio, so that weak targets are submerged in noise, but also damages the edges and texture features of targets, causing localization errors and false detections in detection models. The interference of speckle noise is more significant for small target detection and may lead to a large number of missed detections. In addition, the data distribution shift caused by noise can affect the generalization ability of the model, resulting in a decline in its performance in real-world scenarios.
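
The multiplicative character of speckle can be illustrated with a short simulation. The sketch below assumes the commonly used fully developed speckle model for multi-look intensity images, in which the observed intensity is the clean intensity multiplied by unit-mean gamma-distributed noise with shape equal to the number of looks. This model and all names in the code are assumptions for illustration and are not taken from the experiments in this paper.

```python
import random
import statistics


def add_speckle(reflectivity, looks, rng):
    """Multiply each clean intensity by unit-mean gamma speckle (L-look model)."""
    return [r * rng.gammavariate(looks, 1.0 / looks) for r in reflectivity]


rng = random.Random(0)          # fixed seed for reproducibility
clean = [1.0] * 20000           # homogeneous patch with constant intensity

cv_by_looks = {}
for looks in (1, 4):
    noisy = add_speckle(clean, looks, rng)
    mean = statistics.fmean(noisy)
    # Coefficient of variation: noise standard deviation relative to the mean.
    cv_by_looks[looks] = statistics.pstdev(noisy) / mean
    print(f"looks={looks}: mean={mean:.3f}, coefficient of variation={cv_by_looks[looks]:.3f}")
```

Under this model the noise standard deviation grows in proportion to the local intensity, and the coefficient of variation is approximately one over the square root of the number of looks, so single-look images are the most strongly affected.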

The SE attention mechanism helps to extract ship features in situations where low-level features of background and ships are too similar, especially under the influence of speckle noise. This leads to an increase in recall rate and a reduction in missed detections. As shown in Figure 18, compared to the original model, the model with the added SE attention mechanism showed an enhanced ability to detect ships, being able to detect nearshore ships that the original model failed to detect. This demonstrates the effectiveness of the SE attention mechanism in improving detection performance in SAR images with speckle noise.

While the SE module can enhance the model’s robustness to speckle noise in SAR images, its ability to completely eliminate such noise is limited due to the multiplicative nature of the noise and its deep coupling with the signal. To address these challenges, future research can explore solutions from multiple dimensions: In the preprocessing stage, deep learning-based denoising methods have demonstrated superior performance compared to traditional filtering techniques, while self-supervised denoising can alleviate the problem of insufficient clean samples in real-world data. In data augmentation strategies, synthesizing data through precise modeling of speckle noise characteristics or converting optical images using domain adaptation techniques can effectively expand the training dataset. In loss function design, introducing noise-aware mechanisms and contrastive learning can enhance the model’s discriminative ability in noisy environments. Finally, in the postprocessing stage, optimizing the non-maximum suppression (NMS) algorithm or fusing multi-temporal data can further improve the stability of detection results.
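
For reference, the baseline that such postprocessing work would start from is standard greedy NMS, sketched below in plain Python. The box format (x1, y1, x2, y2), the function names, and the example boxes are assumptions for illustration; the sketch does not represent any optimized variant proposed here.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap any already kept box above the threshold.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical detections: two overlapping boxes on one ship and one separate ship.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
```

In dense nearshore scenes, a fixed threshold of this kind can suppress true detections of adjacent ships, which is one reason the NMS step is a candidate for further optimization.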

6. Conclusions

This paper addressed the difficulty of acquiring SAR nearshore ship images and the resulting shortage of training samples. Using ConSinGAN single-image generation, we augmented the nearshore ship images and produced a large number of varied and effective nearshore ship samples. The training results validated the effectiveness of the generated images for nearshore ship detection.

Densely packed nearshore ships and numerous man-made targets on land, such as buildings, significantly interfere with the detection of nearshore ship targets. To address this, we proposed an improved YOLOv7 algorithm based on the SE attention mechanism. We also verified the effectiveness of the algorithm and compared the performance of various attention mechanisms. Among them, the SE attention mechanism showed a more significant overall performance advantage, increasing the object detection accuracy without affecting the model’s portability, which was more conducive to the model’s recognition of ship targets in complex nearshore backgrounds.

Author Contributions

Conceptualization, Q.Z. and R.M.; methodology, Q.Z.; software, R.M.; writing, Q.Z. and R.M.; supervision, Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All data included in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

Figure 1. ConSinGAN model framework.

Figure 2. YOLOv7 network structure design.

Figure 3. SE Network structure.

Figure 4. Sample image annotation figures: (a) unannotated image; (b) annotated image.

Figure 5. YOLOv7 + SE network architecture.

Figure 6. Improved ELAN network structure.

Figure 7. Generated images at different stages.

Figure 8. Image generation results. (a) Original image. (b) Generated image.

Figure 9. Comparison of accuracy.

Figure 10. Comparison of recall rates.

Figure 11. AP@0.5 comparison chart.

Figure 12. Detection results: (a) original; (b) original + 3000 fakes.

Figure 13. Comparison of accuracy.

Figure 14. Comparison of recall rates.

Figure 15. AP@0.5 comparison chart.

Figure 16. Detection results: (a) YOLOv7 + ESE; (b) YOLOv7 + SE.

Figure 17. Comparison of results. (a) Original dataset labels. (b) YOLOv7 + SE detection results.

Figure 18. Comparison of results: (a) Original dataset labels. (b) YOLOv7 detection results. (c) YOLOv7 + SE detection results.

Table 1. Experimental results with GAN-generated images.

Dataset	Precision/%	Recall/%	AP@0.5/%	AP@0.5:0.95/%	F1 Score
Original	82.87	83.64	87.8	56	83.25
Original + 3000 fakes	87.53	87.32	91.04	59.11	87.42

Table 2. Comparison of attention mechanisms’ performance.

Modules	Precision/%	Recall/%	AP@0.5/%	AP@0.5:0.95/%	F1 Score
YOLOv7	87.53	87.32	91.04	59.11	87.42
YOLOv7 + SE	85.59	89.79	91.13	59.25	87.64
YOLOv7 + CBAM	87.78	84.88	90.55	58.81	86.3
YOLOv7 + ECA	87.7	86.19	90.35	58.66	86.94
YOLOv7 + CA	85.16	88.01	90.3	58.82	86.56

Table 3. Model performance comparison.

Modules	Precision/%	Recall/%	AP@0.5/%	F1 Score
YOLOv7	87.53	87.32	91.04	87.42
YOLOv7 + SE	85.59	89.79	91.13	87.64
YOLOv7 + ESE	87.89	87.84	91.69	87.86

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
