GDS-YOLOv7: A High-Performance Model for Water-Surface Obstacle Detection Using Optimized Receptive Field and Attention Mechanisms
Abstract
1. Introduction
- (1)
- To mitigate the limitations caused by insufficient feature representation for small- and medium-scale objects, this paper proposes a novel receptive-field enlargement module, Ghost Spatial Pyramid Pooling Cross Stage Partial Connections (GhostSPPCSPC), and introduces an attention mechanism. By shrinking the pooling kernels and reducing the model parameters, this approach enhances the model’s multi-scale target detection capability.
- (2)
- Depthwise Separable Convolution (DSC) [19] is introduced to replace some of the traditional convolutional layers in the baseline model, significantly reducing the network’s parameter count and improving computational speed while keeping the accompanying loss of precision small.
- (3)
- To mitigate feature degradation caused by the absence of global contextual cues in convolution operations, the Spatial–Channel Synergetic Attention (SCSA) mechanism is introduced. Applied to the Efficient Layer Aggregation Network (ELAN) module in the backbone, it strengthens the network’s representational capacity, improving model accuracy while reducing parameters.
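As a quick illustration of why the DSC of contribution (2) shrinks the network, the sketch below compares the weight count of a standard convolution with its depthwise-separable factorization (a per-channel spatial convolution followed by a 1 × 1 pointwise convolution). The layer sizes used here (256 channels, 3 × 3 kernel) are illustrative and not taken from the model.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    # standard convolution weight tensor: c_out x c_in x k x k
    return c_out * c_in * k * k

def dsc_params(c_in: int, c_out: int, k: int) -> int:
    # depthwise stage: one k x k filter per input channel (c_in x k x k)
    # pointwise stage: c_out x c_in x 1 x 1
    return c_in * k * k + c_in * c_out

# example: a 256 -> 256 channel layer with a 3 x 3 kernel
std = standard_conv_params(256, 256, 3)   # 589,824 weights
dsc = dsc_params(256, 256, 3)             # 67,840 weights
```

The ratio is roughly 1/k² + 1/c_out, so for a 3 × 3 kernel DSC needs under an eighth of the standard layer’s parameters, which is the source of the speedup described above.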
2. Materials and Methods
2.1. Fundamental Concepts of the YOLOv7 Structure
2.2. YOLOv7 Model Improvement
2.2.1. Spatial Pyramid Pooling Module Based on GhostConv (GhostSPPCSPC)
- (1)
- The Conv module is substituted with GhostConv, a neural-network structure optimized for efficiency and particularly suited to deployment on mobile and edge computing devices [28]. The core idea of GhostConv is to cut computational resource consumption through an efficient feature-reuse strategy, enabling the model to achieve competitive accuracy with markedly reduced architectural complexity. The computational process of GhostConv is shown in Figure 3. GhostConv first applies a standard convolution to derive preliminary (intrinsic) features from the input. Cheap linear operations are then performed on these features to increase the channel count, while identity mappings run in parallel to preserve the initial feature representations. The final output is generated by concatenating the two sets of feature maps. This approach reduces the demand for computational resources while largely preserving the model’s performance.
- (2)
- Introduction of SimAM, a simple, parameter-free attention module with an efficient structure [29]. SimAM assigns three-dimensional attention weights to the feature maps without increasing the network’s parameter count, enabling the model to prioritize informative spatial features and compensating for the limited feature cues of small obstacles on the water surface. SimAM derives these weights from an energy function motivated by neuroscience principles, given in Equation (1).
- (3)
- The three MaxPool kernels in the SPPCSPC module are optimized by reducing the original 5 × 5, 9 × 9, and 13 × 13 pooling kernels to 3 × 3, 5 × 5, and 9 × 9, respectively. Besides lowering the module’s parameter count and computational cost, the smaller kernels better capture the details of small objects, markedly improving small-object detection. However, the adjustment may also sacrifice global contextual information for large objects, hindering the integration of larger-region cues and potentially weakening the model’s grasp of background information around large targets. Given the characteristics of the dataset used in this study and the specific requirements of the detection targets, the smaller pooling kernels suit the needs of this paper, minimizing spatial information loss and thus improving detection accuracy.
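The feature-reuse idea of GhostConv in item (1) can be sketched in a few lines. This is a minimal NumPy illustration, not the module’s actual implementation: a 1 × 1 primary convolution stands in for the standard convolution, and per-channel scalars stand in for the cheap linear operations (the real module uses depthwise k × k convolutions for those).

```python
import numpy as np

def ghost_conv(x, w_primary, w_cheap):
    """Toy GhostConv on a (C_in, H, W) feature map.

    w_primary: (M, C_in) weights of a 1x1 primary convolution (intrinsic features)
    w_cheap:   (M,) per-channel scalars standing in for the cheap linear ops
    """
    # primary 1x1 convolution -> M intrinsic feature maps
    intrinsic = np.einsum('mc,chw->mhw', w_primary, x)
    # cheap linear operation on each intrinsic map -> M "ghost" maps
    ghosts = w_cheap[:, None, None] * intrinsic
    # the identity branch keeps the intrinsic maps; concatenation yields 2M channels
    return np.concatenate([intrinsic, ghosts], axis=0)
```

Only the small primary convolution carries a full weight matrix; the ghost half of the output is produced by far cheaper per-map operations, which is where the resource savings come from.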
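For item (2), the SimAM weighting can be sketched as below. This follows the published SimAM formulation [29] (the inverse of the minimal energy passed through a sigmoid gate); the regularizer value `lam` is the SimAM paper’s default and is an assumption here, since Equation (1) of this article is not reproduced above.

```python
import numpy as np

def simam(x, lam=1e-4):
    """Parameter-free SimAM gating for a (C, H, W) feature map."""
    n = x.shape[1] * x.shape[2] - 1
    mu = x.mean(axis=(1, 2), keepdims=True)           # per-channel mean
    d = (x - mu) ** 2                                  # squared deviation per neuron
    var = d.sum(axis=(1, 2), keepdims=True) / n        # per-channel variance estimate
    # inverse of the minimal energy: neurons far from the mean get larger weights
    e_inv = d / (4.0 * (var + lam)) + 0.5
    return x * (1.0 / (1.0 + np.exp(-e_inv)))          # sigmoid gating
```

Note that no learned weights appear anywhere, which is why the module adds attention without increasing the parameter count.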
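The pooling change in item (3) is easiest to see with stride-1, same-padding max pooling of the kind used in SPPCSPC’s parallel branches. The NumPy sketch below is illustrative only: it stacks an identity branch with the three reduced kernel sizes, showing how a small activation is spread over a 3 × 3, 5 × 5, or 9 × 9 neighborhood.

```python
import numpy as np

def maxpool_same(x, k):
    # stride-1 max pooling with 'same' padding on an (H, W) map
    p = k // 2
    xp = np.pad(x, p, mode='constant', constant_values=-np.inf)
    win = np.lib.stride_tricks.sliding_window_view(xp, (k, k))
    return win.max(axis=(-2, -1))

def spp_branches(x, kernels=(3, 5, 9)):
    # identity branch plus one stride-1 max-pool branch per kernel size,
    # stacked along a channel axis as in the SPP family of modules
    return np.stack([x] + [maxpool_same(x, k) for k in kernels])
```

A single peak on a 5 × 5 map is spread over 9 cells by the 3 × 3 branch but over the whole map by the 5 × 5 and 9 × 9 branches, which is the intuition behind the trade-off between small-object detail and large-object context discussed above.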
2.2.2. Depthwise Separable Convolution
2.2.3. Integrating the SCSA Attention Mechanism
- (1)
- Spatial and Channel Decomposition: SMSA decomposes the given input along the height and width dimensions. Applying global average pooling along each spatial dimension yields two unidirectional one-dimensional sequence structures, X_h and X_w. Simultaneously, in order to capture different spatial distributions and contextual relationships, the feature set is divided into K equally sized independent sub-features, each with C/K channels. In this paper, the default value of K is set to 4. The decomposition process of the sub-features is as follows:
- (2)
- Lightweight Convolution Strategy Across Non-Intersecting Sub-Features: After the cross-channel grouping of the feature set, convolution operations with kernel sizes of 3, 5, 7, and 9 are applied to the corresponding sub-features to learn the distinct semantic spatial structures within each one. This optimizes the continuity of feature representation and reduces the representation discrepancy across semantic layers. Furthermore, SMSA employs lightweight shared convolutions to address the limited receptive field caused by one-dimensional convolutions. The information extraction process is defined as follows:
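The decomposition in step (1) can be sketched as follows. This is a minimal NumPy illustration under the description above (global average pooling along each spatial axis, then K = 4 channel groups); the variable names are ours, not from the SCSA implementation.

```python
import numpy as np

def smsa_decompose(x, k_groups=4):
    """Sketch of the SMSA decomposition for a (C, H, W) input."""
    # pool the width away -> a height-wise sequence per channel, shape (C, H)
    x_h = x.mean(axis=2)
    # pool the height away -> a width-wise sequence per channel, shape (C, W)
    x_w = x.mean(axis=1)
    # split channels into K independent sub-features of C // K channels each
    return np.split(x_h, k_groups, axis=0), np.split(x_w, k_groups, axis=0)
```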
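The multi-scale convolution of step (2) can likewise be sketched: each sub-feature receives a one-dimensional depthwise convolution with its own kernel size (3, 5, 7, or 9). As an assumption for illustration, a simple box filter stands in for the learned shared kernels.

```python
import numpy as np

def dw_conv1d_same(seq, kernel):
    # depthwise 1-D convolution with 'same' padding on a (channels, length) array
    return np.stack([np.convolve(ch, kernel, mode='same') for ch in seq])

def multi_scale_branch(sub_feats, kernel_sizes=(3, 5, 7, 9)):
    # each sub-feature gets its own kernel size, so each group aggregates
    # context at a different semantic scale before re-concatenation
    out = []
    for sub, k in zip(sub_feats, kernel_sizes):
        out.append(dw_conv1d_same(sub, np.ones(k) / k))
    return np.concatenate(out, axis=0)
```

Applying progressively larger 1-D kernels to disjoint channel groups is what keeps this branch lightweight while still covering several receptive-field sizes.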
2.3. Experimental Environment and Parameter Setting
2.4. Evaluation Methods
3. Results and Discussion
3.1. Comparison of Attention Mechanism Fusion
3.2. Ablation Experiment
3.3. Comparison of Model Experiments Before and After Improvement
3.4. Comparison of the Improved Baseline Model with Other Network Models
4. Conclusions
- (1)
- On top of the baseline model, improvements were made by enhancing the SPPCSPC module, introducing the DSC module, and adding the SCSA module. Precision (P), recall (R), and mAP@0.5 were improved by 4.3%, 6.9%, and 4.9%, respectively, demonstrating the effectiveness of the improvements.
- (2)
- The proposed method yields a slightly lower mAP@0.5 than the YOLOv8 and YOLOv9 models. However, it achieves higher precision (P) than both and meets the accuracy and real-time requirements of water-surface detection.
- (3)
- Adding more modules to the model is not always better. Excessive additions may lead to a decrease in some model metrics (such as R and mAP@0.5).
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Alamoush, A.S.; Ölçer, A.I. Maritime Autonomous Surface Ships: Architecture for Autonomous Navigation Systems. J. Mar. Sci. Eng. 2025, 13, 122. [Google Scholar] [CrossRef]
- Guo, S.Y.; Zhang, X.G.; Zheng, Y.S.; Du, Y.Q. An Autonomous Path Planning Model for Unmanned Ships Based on Deep Reinforcement Learning. J. Mar. Sci. Eng. 2020, 20, 426. [Google Scholar] [CrossRef] [PubMed]
- Chen, Z.; Liu, C.; Filaretov, V.F.; Yukhimets, D. Multi-scale ship detection algorithm based on YOLOv7 for complex scene SAR images. Remote Sens. 2023, 15, 2071. [Google Scholar] [CrossRef]
- Ren, S.Q.; He, K.M.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
- Yang, Z.; Lan, X.; Wang, H. Comparative Analysis of YOLO Series Algorithms for UAV-Based Highway Distress Inspection: Performance and Application Insights. Sensors 2025, 25, 1475. [Google Scholar] [CrossRef] [PubMed]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
- Gan, L.; Yan, Z.; Zhang, L.; Liu, K.; Zheng, Y.; Zhou, C.; Shu, Y. Ship path planning based on safety potential field in inland rivers. Ocean Eng. 2022, 260, 111928. [Google Scholar] [CrossRef]
- Ning, Y.; Zhao, L.; Zhang, C.; Yuan, Z. STD-Yolov5: A ship-type detection model based on improved Yolov5. Ships Offshore Struct. 2024, 19, 66–75. [Google Scholar] [CrossRef]
- Yang, S.; Wei, S.; Wei, L.; Shuai, W.; Yang, Z. Review of research on information fusion of shipborne radar and AIS. Ship Sci. Technol. 2021, 43, 167–171. [Google Scholar]
- Qi, L.L.; Gao, J.L. Small Object Detection Based on Improved YOLOv7. Comput. Eng. 2023, 49, 41–48. [Google Scholar]
- Hao, K.; Wang, K.; Wang, B.B. Lightweight Underwater Biological Detection Algorithm Based on Improved Mobilenet-YOLOv3. J. Zhejiang Univ. (Eng. Sci.) 2022, 56, 1622–1632. [Google Scholar]
- Tang, Y.S.; Zhang, Y.; Xiao, J.R.; Cao, Y.; Yu, Z.J. An Enhanced Shuffle Attention with Context Decoupling Head with Wise IoU Loss for SAR Ship Detection. Remote Sens. 2024, 16, 4128. [Google Scholar] [CrossRef]
- Sun, Z.; Leng, X.; Zhang, X.; Zhou, Z.; Xiong, B.; Ji, K.; Kuang, G. Arbitrary-Direction SAR Ship Detection Method for Multiscale Imbalance. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5208921. [Google Scholar] [CrossRef]
- Zhang, X.; Zhang, S.; Sun, Z.; Liu, C.; Sun, Y.; Ji, K.; Kuang, G. Cross-sensor SAR image target detection based on dynamic feature discrimination and center-aware calibration. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5209417–5209433. [Google Scholar] [CrossRef]
- Chang, S.; Deng, Y.K.; Zhang, Y.Y.; Zhao, Q.C.; Wang, R.; Zhang, K. An advanced scheme for range ambiguity suppression of spaceborne SAR based on blind source separation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5230112–5230123. [Google Scholar] [CrossRef]
- Zhang, M.H.; Xu, S.B.; Song, W.; He, Q.; Wei, Q.M. Lightweight Underwater Object Detection Based on YOLO v4 and Multi-Scale Attentional Feature Fusion. Remote Sens. 2021, 13, 4706. [Google Scholar] [CrossRef]
- Li, Z.Z.; Ren, H.X.; Yang, X.; Wang, D.; Sun, J. LWS-YOLOv7: A Lightweight Water-Surface Object-Detection Model. J. Mar. Sci. Eng. 2024, 12, 861. [Google Scholar] [CrossRef]
- Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.X.; Wang, W.J.; Zhu, Y.K.; Pang, R.M.; Vasudevan, V. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
- Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
- Zhang, Y.; Sun, Y.P.; Wang, Z.; Jiang, Y. YOLOv7-RAR for urban vehicle detection. Sensors 2023, 23, 1801. [Google Scholar] [CrossRef] [PubMed]
- He, K.M.; Zhang, X.Y.; Ren, S.Q.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
- Lee, Y.; Hwang, J.-W.; Lee, S.; Bae, Y.; Park, J. An Energy and GPU-Computation Efficient Backbone Network for Real-Time Object Detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 752–760. [Google Scholar]
- Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. Scaled-yolov4: Scaling cross stage partial network. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13029–13038. [Google Scholar]
- Liu, S.; Qi, L.; Qin, H.F.; Shi, J.P.; Jia, J.Y. Path aggregation network for instance segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
- Hu, M.; Li, Y.; Fang, L.; Wang, S.J. A2-FPN: Attention aggregation based feature pyramid network for instance segmentation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 15343–15352. [Google Scholar]
- Han, K.; Wang, Y.H.; Tian, Q.; Guo, J.Y.; Xu, C.J.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589. [Google Scholar]
- Yang, L.; Zhang, R.-Y.; Li, L.; Xie, X. SimAM: A Simple, Parameter-Free Attention Module for Convolutional Neural Networks. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 11863–11874. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Model | P/% | R/% | Params/M | mAP@0.5/% |
---|---|---|---|---|
SPPCSPC (13 × 13, 9 × 9, 5 × 5) | 79.10 | 73.80 | 37.28 | 73.20 |
SPPCSPC (9 × 9, 5 × 5, 3 × 3) | 79.10 | 75.80 | 37.28 | 73.00 |
GhostSPPCSPC (13 × 13, 9 × 9, 5 × 5) | 78.10 | 76.20 | 32.81 | 73.37 |
GhostSPPCSPC (9 × 9, 5 × 5, 3 × 3) | 80.10 | 71.70 | 32.81 | 74.00 |
Name | Configuration | Parameter | Parameter Values |
---|---|---|---|
GPU | RTX 4060Ti | Image size/pixel | 640 × 640 |
CPU | Core(TM) i7-13700KF | Learning rate | 0.01 |
Batch-Size | 8 | Optimizer | SGD |
Model | Params/M | mAP@0.5/% | FPS/(frame·s−1) |
---|---|---|---|
- | 36.57 | 73.20 | 128.21 |
CBAM | 37.92 | 79.30 | 131.58 |
MSAM | 37.93 | 81.40 | 84.03 |
SCSA | 35.24 | 81.00 | 149.25 |
Network Model | SCSA | GhostSPPCSPC | DSC | P/% | R/% | GFLOPS | mAP@0.5/% |
---|---|---|---|---|---|---|---|
Baseline model | | | | 79.10 | 73.80 | 103.4 | 73.20 |
1 | √ | | | 81.10 | 86.00 | 93.4 | 81.00 |
2 | | √ | | 80.10 | 71.70 | 102.4 | 74.00 |
3 | | | √ | 81.90 | 75.10 | 90.1 | 74.60 |
4 | | √ | √ | 80.50 | 77.10 | 90.4 | 75.80 |
5 | √ | √ | √ | 83.40 | 80.70 | 79.8 | 78.10 |
Network Model | Input Size/Pixel | mAP@0.5/% | P/% | Params/M | R/% |
---|---|---|---|---|---|
Faster-RCNN | 640 × 640 | 19.17 | 35.79 | 137.099 | 21.31 |
SSD | 640 × 640 | 62.23 | 69.80 | 26.285 | 40.38 |
YOLOv7 | 640 × 640 | 73.20 | 79.10 | 37.28 | 73.80 |
YOLOv7-tiny | 640 × 640 | 77.80 | 79.50 | 25.06 | 82.00 |
YOLO-NAS | 640 × 640 | 84.43 | 25.84 | 19.03 | 94.95 |
YOLOv8 | 640 × 640 | 82.00 | 78.20 | 30.10 | 79.80 |
LWS-YOLOv7 [18] | 640 × 640 | 61.30 | 72.10 | 34.77 | 64.60 |
YOLOv9 | 640 × 640 | 79.80 | 75.90 | 19.74 | 79.40 |
YOLOv10 | 640 × 640 | 76.70 | 74.80 | 27.01 | 78.20 |
GDS-YOLOv7 | 640 × 640 | 78.10 | 83.40 | 31.36 | 80.70 |
© 2025 by the authors. Published by MDPI on behalf of the International Society for Photogrammetry and Remote Sensing. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Yang, X.; Huang, L.; Ke, F.; Liu, C.; Yang, R.; Xie, S. GDS-YOLOv7: A High-Performance Model for Water-Surface Obstacle Detection Using Optimized Receptive Field and Attention Mechanisms. ISPRS Int. J. Geo-Inf. 2025, 14, 238. https://doi.org/10.3390/ijgi14070238