YOLO-ELR: A High-Precision Lightweight Object Detection Model in Marine Environment

Yuan, Jianping; Wan, Lei

doi:10.3390/jmse14110998


by Jianping Yuan ¹,²,* and Lei Wan

¹ College of Shipbuilding Engineering, Harbin Engineering University, Harbin 150006, China

² Guangdong Provincial Key Laboratory of Intelligent Equipment for South China Sea Marine Ranching, Guangdong Ocean University, Zhanjiang 524088, China

* Author to whom correspondence should be addressed.

J. Mar. Sci. Eng. 2026, 14(11), 998; https://doi.org/10.3390/jmse14110998 (registering DOI)

Submission received: 28 April 2026 / Revised: 16 May 2026 / Accepted: 18 May 2026 / Published: 28 May 2026

(This article belongs to the Section Ocean Engineering)


Abstract

Target detection in complex marine environments is a core technology for marine environmental monitoring, underwater search and rescue, and vessel collision avoidance. However, traditional detection methods struggle to accurately identify low-pixel marine targets (e.g., small vessels, buoys) while balancing detection accuracy and computational efficiency. To address the low accuracy and high computational cost of object detection in such environments, this paper proposes YOLO-ELR, a lightweight model based on the YOLOv11 framework, designed to identify object categories and directional positions with enhanced precision while optimizing resource efficiency. The Efficient Multi-Branch Scale and Light Adaptive-weight downsampling (EMBSLaw) backbone network enhances multi-scale object detection in complex scenes by dynamically adjusting feature contributions through adaptive weight computation, while maintaining a lightweight architecture. To reduce the parameter count and computational complexity, a novel lightweight spatial multi-branch detector, the Lightweight Shared Convolutional Separate BN Detection head (LSCSBD), is introduced. Furthermore, the RepGhostCSPELAN and Efficient Multi-Scale Conv (RGCEL_EMSC) module, which integrates multiple neural network designs, is proposed to improve detection accuracy and precision. Experimental results demonstrate that the YOLO-ELR model achieves an mAP@50 of 84.87%, surpassing the baseline YOLOv11 by 3.94% while reducing parameters by 35% and GFLOPs by 7.56%, which validates its effectiveness in balancing detection accuracy and computational efficiency.

Keywords:

YOLO; object detection; feature fusion; network design

1. Introduction

1.1. Research Background

Object detection is an important branch of computer vision that involves identifying and locating one or more target objects in an image or video frame and classifying them into predefined categories. Early object detection relied on handcrafted features and traditional machine learning algorithms [1,2]. However, in complex scenarios these methods often fail to robustly capture discriminative object characteristics, leading to suboptimal detection accuracy. With the rapid development of deep learning, machine vision-based detection methods emerged [3], and they are now widely employed for target detection in marine environments.

1.2. Current Development Status of High-Precision Detection Models

The accuracy of target detection is of paramount importance in computer vision applications, particularly in maritime environments, as it directly influences system reliability, operational safety, and decision-making efficacy. High detection precision ensures robust identification and localization of objects (e.g., ships, obstacles, or floating debris) under challenging conditions such as adverse weather, low visibility, or cluttered backgrounds. Accurate detection minimizes false positives and false negatives, thereby enhancing situational awareness, preventing collisions, and optimizing autonomous navigation. Furthermore, precision improvements contribute to computational efficiency by reducing redundant processing and enabling real-time performance in resource-constrained systems. Consequently, advancing detection accuracy remains a critical research focus to ensure the effectiveness of marine surveillance, security, and automation technologies.

Ref. [4] introduced YOLO-FaceV2, an enhanced object detection framework incorporating a Slide Weight Function (SWF) to address sample imbalance between easy and hard cases, along with a Separation and Enhancement Attention Module (SEAM) to improve feature discrimination. This approach effectively mitigates accuracy degradation caused by mask occlusion in facial detection. However, limitations persist in small object detection accuracy, accompanied by significant rates of false positives or missed detections in complex scenarios. Ref. [5] developed YOLO-ACN, an enhanced YOLOv3-based framework integrating attention mechanisms, CIoU loss, Soft-NMS, and depthwise separable convolutions, to address challenges in small and occluded object detection. This architecture achieves balanced improvements in both detection accuracy and inference efficiency compared to baseline models. Ref. [6] proposed YOLO-SASE, an enhanced detection algorithm integrating the YOLO framework with SRGAN, which employs super-resolution reconstructed images as input. The method incorporates a SASE module, an SPP module, and a multi-level receptive field architecture to optimize detection performance for small infrared targets in complex backgrounds. Ref. [7] proposed YOLOSR-IST, a deep learning-based detection method integrating super-resolution, which enhances YOLOv5 by incorporating CoordinateAttention into the backbone, introducing the high-resolution feature map P2 during feature fusion, and replacing the C3 module’s bottleneck layer with SwinTransformerBlock in the network head. This framework addresses small object detection challenges in infrared remote sensing imagery. Ref. [8] proposed MD-YOLO, a multi-scale detection framework that addresses low accuracy in small target detection by incorporating DenseNet modules and an Adaptive Attention Module (AAM) into the feature extraction component to enhance feature representation utilization, and by introducing a multi-scale intensive architecture to optimize detection performance for challenging small targets. However, while these methods improve detection accuracy, they simultaneously increase model parameters and computational complexity during inference. Ref. [9] developed an improved Faster R-CNN backbone network that enhances the expressive capability of the receptive fields of each network layer and used it to detect algae, coastal fish, scallops, sea stars, and water plants. Ref. [10] proposed an improved SSD object detection algorithm based on dense convolutional networks and feature fusion to enhance the model’s feature extraction capability.

1.3. Current Development Status of Lightweight Detection Models

The lightweight design of the YOLO target detection model is of paramount importance, as it substantially enhances computational efficiency, minimizes hardware demands and energy consumption, and maintains high detection accuracy, thereby rendering it well suited for deployment on edge devices, real-time applications, and large-scale systems. Ref. [11] proposed an enhanced YOLOv3-based method for small target detection in driving scenarios by replicating backbone network layers to construct auxiliary subnetworks, implementing attention mechanisms for feature fusion between the primary and auxiliary networks to suppress irrelevant feature channels, and optimizing processing efficiency through this hierarchical architecture. Ref. [12] developed Fire-YOLO, an EfficientNet-enhanced detection model for small targets in forest fire imagery, by implementing three-dimensional expansion of the feature extraction network and optimizing computational efficiency to accelerate detection speed. Ref. [13] enhanced the YOLOv5s model by replacing the CIoU loss function with Alpha-IoU to address slow gradient convergence during small-target image training. Ref. [14] developed BGF-YOLOv10, a lightweight object detection algorithm based on an enhanced YOLOv10n that addresses small object omissions by integrating BoTNet, C2f/C3 modules, and a small object detection head in the backbone network, alongside inserting a PatchExpandingLayer in the neck network. However, the above methods tend to sacrifice detection accuracy in the process of making the model lightweight. Ref. [15] analyzed the YOLOv11 architecture and explored its innovations, such as the C3k2 blocks, SPPF, and C2PSA components, which enhanced multi-dimensional model performance. Ref. [16] conducted a comparative analysis of YOLOv11, YOLOv10, YOLOv9, and YOLOv8, demonstrating YOLOv11’s superior object detection accuracy and efficient processing speed.

Although the YOLO object detection algorithm has achieved notable advancements in handling scene blur and small target detection, its accuracy remains constrained by inherent limitations, and it fails to achieve high precision and computational efficiency simultaneously in real-time applications. Based on the above analysis, this paper proposes a new object detection method, YOLO-ELR, based on the YOLOv11 object detection model. The main contributions of this paper are as follows:

(1): To achieve model lightweighting, this study proposes EMBSLaw, a novel backbone network integrating BiFPN and MAF-YOLO with an adaptive weight downsampling module, and introduces LSCSBD, a spatial multi-branch detector. EMBSLaw enables lightweight multi-scale feature fusion, efficient convolution/upsampling, adaptive downsampling, and global heterogeneous kernel selection, enhancing input adaptability through dynamic feature-weight adjustments while minimizing computational parameters. LSCSBD optimizes flexibility and parameter efficiency via spatial feature processing and branched detection, prioritizing lightweight performance across both architectures.
(2): To enhance model performance in object detection, this study introduces the RGCEL_EMSC neural network module, which integrates reparameterization, Ghost modules, cross-stage partial (CSP) connections, EfficientNet components, and local attention networks (LANs) within a multi-scale feature fusion framework. The module optimizes feature extraction and detection accuracy through cyclic convolution, grouped convolution, spatial pyramid pooling, multi-scale fusion, and enriched contextual information aggregation.

The remainder of this paper is structured as follows: Section 2 provides an in-depth exploration of the related technologies essential for target detection in complex marine environments, establishing the necessary context and background. Section 3 delves into the intricacies of the proposed method, elucidating its design and implementation in detail. Section 4 presents the experimental setup, methodology, and results, accompanied by a thorough analysis and discussion. Finally, Section 5 encapsulates the key findings and contributions of the research, concluding the paper with insights and potential future directions.

2. Related Works

2.1. Marine Environmental Target Detection

Rapid advancements in science and technology have driven the widespread adoption of optoelectronic and radar technologies across military, civil, and aerospace domains. Optoelectronic imaging delivers detailed visual data, while radar detection provides spatial information such as target localization [17]. The application of single technologies often faces limitations: optoelectronic imaging is highly susceptible to lighting and weather conditions [18], while radar tracking may suffer from electromagnetic interference. Consequently, integrating radar detection with intelligent optoelectronic image detection has emerged as a critical approach to enhance target identification accuracy and reliability [19]. Future efforts could integrate radar data to verify target presence, mitigate environmental interference (e.g., ocean waves), and optimize path planning through multi-modal validation.

Marine environment object detection algorithms are usually divided into one-stage and two-stage algorithms. Compared to two-stage algorithms [20,21,22], one-stage approaches [23,24,25] eliminated the need for region proposal generation and post-processing, achieving a better balance between accuracy and speed. The YOLO series exemplifies one-stage object detection, directly predicting bounding boxes via convolutional neural networks (CNNs) to rapidly and accurately infer target categories and probabilities. Ref. [26] integrated the CBAM attention module into the backbone network to enhance focus on targets, improving marine object detection and identification. Ref. [27] proposed YOLO-Fish, a deep learning-based fish detection model: YOLO-Fish-1 modifies YOLOv3 by optimizing upsampling stride to reduce false positives for small fish, while YOLO-Fish-2 incorporates a spatial pyramid pooling module to boost detection robustness in dynamic environments. Ref. [28] introduced a deformable attention module (HDA) combined with an Enhanced Spatial Pyramid Pooling Fast (ESPPF) module to strengthen spatial feature extraction and prioritize critical regions for underwater object detection. Ref. [29] developed an enhanced maritime object detection approach based on the lightweight YOLOv8 architecture. They integrated a multi-scale cross-axis attention mechanism into the backbone network, implemented rapid spatial pyramid pooling, and introduced a refocused convolutional layer to strengthen global feature extraction and boost prediction accuracy. Separately, Ref. [30] proposed a compact underwater detection method using MobileNetV2, combining depthwise separable convolutions with an attention fusion module (AFFM) to reduce model parameters and size while improving detection precision.

2.2. Motivation for Selecting YOLOv11

Real-time object detectors include SSD, Faster R-CNN, EfficientDet, Transformer-based detectors (e.g., DETR), and YOLO-series models. Among them, two-stage detectors (R-CNN series) have high accuracy but suffer from low speed and high computational power requirements. SSD is lightweight but lacks accuracy for small targets in cluttered marine environments. Transformer-based detectors require substantial computation and are not suitable for edge deployment. YOLOv11 provides the optimal balance between speed, accuracy, structural lightness, and multi-scale capability, making it the most appropriate baseline for marine edge devices with limited resources. Therefore, YOLOv11 is selected as the baseline.

YOLOv11 is a state-of-the-art one-stage detector. Its architecture includes a backbone with C3K2 blocks, an SPPF module, a C2PSA attention component, a neck for feature fusion, and a decoupled detection head. This structure balances efficiency and precision, forming the foundation of our modified model.

2.3. Dynamic Adjustment of Adaptive Weights

Adaptive weight dynamic adjustment is a technique that automatically optimizes weight parameters in response to shifts in system states or environmental conditions. Widely utilized in machine learning, optimization algorithms, and control systems, its core principle involves real-time weight adaptation using input data or system feedback to enhance performance. This mechanism improves system robustness against external disturbances and uncertainties. By dynamically adjusting weights, systems gain adaptability to evolving data or environments, thereby boosting overall efficiency, minimizing reliance on manual parameter tuning, and lowering operational complexity. Ref. [31] introduced an adaptive multi-scale YOLO (AMYOLO) algorithm incorporating a Multi-grained Adaptive Feature Enhancement Module (MAEM). This method leverages grouped weighting and multi-level adaptive weighting mechanisms to strengthen fine-grained detail extraction and enhance the precision of multi-scale and global feature representations. Ref. [32] developed an adaptive detection head incorporating an Adaptive Global Feature Aggregation and Weighting (AGFAR) module. This framework addresses adversarial attacks involving imperceptible perturbations (i.e., adversarial samples) added to input data to induce incorrect model predictions. By adaptively modulating layer-wise gradient information across the network, AGFAR strengthens the model’s robustness against such perturbations, improving its resilience to adversarial interference. Ref. [33] introduced a novel adaptive weight feature detection framework based on YOLOv8. This framework incorporates an Adaptive Weight Feature Pyramid Network (AWFPN) to improve multi-scale feature semantic fusion, along with an Adaptive Weight Feature Extraction Module (AWFEM) designed to strengthen underwater object detection. 
By adaptively capturing discriminative and task-relevant information, the method enhances feature representation for accurate target identification in turbid underwater environments. Ref. [34] introduced a novel residual module utilizing adaptive weighted learning. Through end-to-end training, this approach dynamically updates the weight values of both primary and residual channels, significantly boosting the network’s generalization capability. It further strengthens the PPB module’s capacity to comprehensively predict image parameters via enhanced feature representation.

2.4. Multi-Scale Feature Fusion

In deep learning-based object detection, multi-scale feature fusion is critical for enhancing image perception, model robustness, and information preservation. Low-level feature maps typically retain localized details (e.g., edges, textures), while high-level features encode abstract semantics. Integrating multi-scale features enables models to address noise and occlusion more effectively while mitigating information loss caused by downsampling. Feature pyramid networks (FPNs) are standard approaches for multi-scale fusion, and their optimization for effective feature aggregation remains pivotal to performance improvement. Numerous studies have adapted and optimized such architectures for marine environmental object detection tasks. Ref. [35] optimized the multi-scale feature fusion in Path Aggregation Network (PANet) to prioritize critical resolution features. Additionally, they refined the confidence loss function across detection layers, prioritizing high-quality positive anchor boxes during training. This dual enhancement significantly improves target detection accuracy, particularly for objects with varying scales and complexities. Ref. [36] enhanced the YOLOv8 detector by integrating InceptionNeXt blocks into the backbone to strengthen feature extraction. The method introduces a separate enhanced attention module (SEAM) to address overlapping targets and combines normalized Wasserstein distance (NWD) loss with the original CIoU loss in a weighted manner, optimizing small-target detection performance. Ref. [37] redesigned YOLOv8’s neck structure using an AFPN-based feature fusion network and integrated an LSK attention mechanism to dynamically adapt the receptive field size according to ship scales. This design enhances focus on salient features of offshore ships during multi-level feature aggregation. Ref. [38] introduced a novel residual attention module (R-AM) integrated into YOLOv10’s backbone network. 
The neck incorporates a bidirectional feature pyramid with adaptive feature fusion, strengthening the fusion of deep-layer semantic features and shallow-layer localization information while prioritizing biological target detail awareness during feature extraction.

2.5. Branch Detection

In object detection algorithms, multi-branch detection serves as a core mechanism for achieving efficient and precise localization. By leveraging multi-scale feature maps, this approach ensures simultaneous detection of objects across varying sizes, improving detection accuracy and computational efficiency, enabling real-time processing while meeting the demands of resource-constrained applications. Ref. [39] introduced a YOLOv8-based framework with a dual-branch architecture, leveraging infrared and visible light image complementarity for object detection. A bidirectional pyramid feature fusion (Bi-Fusion) structure is proposed to integrate multi-modal features, mitigate feature redundancy-induced errors, and enhance fine-grained feature extraction for small-object detection tasks. Ref. [40] proposed MBFormer-YOLO, a YOLO-based framework integrating a multi-branch backbone and adaptive spatial feature fusion detection head to achieve high-precision detection of small infrared targets. The method designed the MetaFormer-based multi-branch structure MBFormer, leveraging multi-branch architectures to compensate for shape-related feature deficiencies and enhance local saliency. Ref. [41] proposed a dual-branch infrared long-range object detection model. The framework integrates a contour feature extraction branch and a multi-level weighted feature fusion method to enhance the network’s capability in identifying distant targets, combining contour features with original features to strengthen target representation.

3. Methodology

This section systematically introduces the proposed YOLO-ELR model. Section 3.1 outlines the overall framework, followed by detailed descriptions of its core modules in Section 3.2, Section 3.3 and Section 3.4, namely the EMBSLaw backbone network, the RGCEL_EMSC convolutional module, and the LSCSBD spatial multi-branch detector.

3.1. YOLO-ELR Overall Network

This study adopts YOLOv11 as the baseline model—a widely used marine object detector balancing accuracy–speed tradeoffs—while introducing three novel components. These innovations collectively form the YOLO-ELR framework, whose comprehensive architecture is depicted in Figure 1.

The proposed EMBSLaw backbone integrates Lawds adaptive downsampling for feature map compression while maintaining critical information, coupled with Efficient Upsampling Convolution Block (EUCB) upsampling that employs efficient convolution operations to minimize computational costs. RGCEL_EMSC modules enable enhanced multi-scale feature fusion, complemented by LSCSBD multi-branch detection heads that optimize target identification and spatial localization through parallel processing pathways.

The YOLO-ELR network is divided hierarchically into three sequential functional stages: the backbone for feature extraction, the neck for feature fusion, and the head for detection prediction. The backbone stage, spanning from the top left to the bottom left of the network structure diagram, adopts the EMBSLaw backbone, which integrates the Lawds adaptive weight downsampling module, the EUCB efficient upsampling unit, C3K2 basic building blocks, the C2PSA attention module, and the SPPF spatial pyramid pooling layer to implement multi-scale feature extraction and adaptive downsampling, thereby retaining critical target information while reducing computational complexity. The neck stage, in the middle, employs multiple RGCEL_EMSC modules to enhance multi-scale feature fusion and representation learning; convolutional projection, feature concatenation, and fusion nodes are arranged sequentially to aggregate low-level detail features and high-level semantic features. The head stage, on the right, deploys the LSCSBD lightweight spatial multi-branch detector with parallel detection branches, in which multiple Detect_LSCSBD prediction heads simultaneously perform category classification and bounding box regression, enabling accurate localization and recognition of multi-scale marine objects.

3.2. YOLO-ELR Overall Backbone Network

The EMBSLaw backbone network is a unified architecture for efficient feature extraction and neural fusion in object detection. It integrates embedded modules, a multi-scale Spatial Feature Pyramid Network (SFPN), adaptive-weighted downsampling, local attention mechanisms, and depthwise separable convolutions. This configuration enhances feature discriminability, detection performance, and localized feature awareness while optimizing computational and memory overhead. The architectural implementation is detailed in Figure 2. The main contents are as follows:

(1): The Trident network framework, incorporating multi-scale convolution modules and adaptive kernel selection mechanisms, demonstrates that receptive field size critically impacts detection performance: larger fields enhance detection of sizable objects, while compact fields improve small-target sensitivity. Accordingly, our FPN design implements scale-adaptive convolutional kernels across hierarchical feature layers, enabling progressive acquisition of multi-range spatial context through field-size modulation.
(2): Building upon BiFPN’s multi-scale feature fusion framework, we replace concatenation with addition operations to reduce parameter and computational costs, while enabling adaptive weight allocation across hierarchical features through self-calibrated importance evaluation of multi-scale representations.
(3): The local attention mechanism enables dynamic spatial focusing on critical regions within input feature maps, amplifying local feature discriminability. Through self-adaptive weighting of attention maps, the architecture optimizes utilization of salient features, thereby strengthening object detection precision and localization accuracy.
(4): Adaptive-weighted downsampling optimizes feature extraction by dynamically adjusting channel-wise significance allocation across feature hierarchies. This technique enables computational efficiency through selective feature prioritization while maintaining representational fidelity by adaptively preserving critical spatial-semantic characteristics during resolution reduction.
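
The parameter saving described in item (2) can be illustrated with a rough count. The sketch below assumes, purely for illustration, that each fusion node is followed by a single k × k convolution without bias; the function name `fusion_conv_params` and the channel sizes are hypothetical and are not taken from the YOLO-ELR implementation.

```python
def fusion_conv_params(num_inputs: int, channels: int, out_channels: int,
                       kernel: int = 1, mode: str = "add") -> int:
    """Weight count of the convolution that follows a fusion node (no bias).

    Concatenation stacks the inputs along the channel axis, so the following
    convolution sees num_inputs * channels input channels. Element-wise
    addition keeps the channel count at `channels`.
    """
    if mode == "concat":
        in_channels = num_inputs * channels
    elif mode == "add":
        in_channels = channels
    else:
        raise ValueError(f"unknown fusion mode: {mode}")
    return in_channels * out_channels * kernel * kernel


# Hypothetical node that fuses three 256-channel maps into 256 channels.
concat_params = fusion_conv_params(3, 256, 256, mode="concat")
add_params = fusion_conv_params(3, 256, 256, mode="add")
print(concat_params, add_params)  # 196608 65536
```

Under this assumption the convolution after an additive node needs one third of the weights of its concatenation counterpart when three inputs are fused; the actual saving in the full network depends on layer widths that are not given in this section.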

In the input feature path (top section), input feature maps are first fed into C3K2 blocks for initial feature extraction, followed by convolutional (Conv) projection and an SPPF (Spatial Pyramid Pooling Fast) layer to effectively expand the receptive field and capture multi-scale contextual information. The middle section consists of a multi-scale fusion branch, which incorporates node_mode fusion nodes, Lawds adaptive weight downsampling modules, and EUCB efficient upsampling units; notably, features with different resolutions are fused through element-wise addition rather than concatenation, contributing to parameter reduction while maintaining feature integrity. The bottom section is a local attention and downsampling path, which implements grouped convolution, adaptive weight calculation, and channel-wise feature recalibration, where the Lawds module dynamically assigns weights to spatial regions during the downsampling process to retain detailed information of small targets. Finally, the output interface (right section) transmits the fused multi-scale features to the subsequent neck and head modules, thereby providing robust support for accurate object detection in complex marine scenes.

The Spatial Feature Pyramid Network (SFPN) achieves multi-scale feature fusion through optimized convolutional layers and activation functions that integrate embedded modules and attention mechanisms. During fusion, feature maps of varying resolutions are dynamically weighted via additive or averaging operations, where contributions are adaptively calibrated based on hierarchical significance.

As illustrated in Figure 3, the structural components of the target downsampling and feature fusion module in the YOLOv11 object detection network, analyzed in accordance with their spatial distribution, are as follows: In the feature conditioning branch (top left), input features first undergo AvgPool (3 × 3) operation to preserve local contextual information, followed by a 1 × 1 convolution to maintain the original channel dimensions without dimensionality change. The middle part is a spatial partitioning unit, where the feature map is reshaped from the dimension (C, 2H, 2W) to (C, H, W, 4), thereby dividing each spatial position into four distinct sub-regions for fine-grained feature processing. The top-right section implements adaptive weight calculation, where attention weights are generated through Softmax normalization, a critical step to ensure that the sum of weights at each spatial position equals 1, enabling effective feature recalibration. The bottom section consists of a downsampling convolution, utilizing a 3 × 3 convolution with a stride of 2 to output 4 × C channels, which corresponds to the four channel-wise partitions generated by the spatial partitioning unit. Finally, the right section achieves weighted fusion output: the downsampled features are element-wise multiplied by the learned attention weights and aggregated along the last dimension to generate the final downsampling output, which effectively balances the efficiency of feature processing and the discriminability of feature representations for robust target detection.

Lawds is a module that combines an attention mechanism with depthwise separable convolution for feature extraction and adaptive weight downsampling. Rather than downsampling with max pooling or average pooling, it first computes a weight for each element, then normalizes the weights with softmax, and finally multiplies each element by its adaptive weight and sums the results. The attention weight A within the Lawds module is computed according to the following formulation:

$$A_{i} = \mathrm{Softmax}\left(\mathrm{Conv}_{1 \times 1}\left(\mathrm{AvgPool}_{3 \times 3}\left(P_{in}\right)\right)\right) \tag{1}$$

The down-sampled feature map Y is derived through the following computational formula:

$$Y_{i} = \mathrm{Conv}_{3 \times 3,\ \mathrm{stride} = 2}\left(P_{in}\right) \tag{2}$$

The output of each feature map can be calculated by weighted fusion of multiple input feature maps, and the output of the feature map is:

$$P_{out} = \mathrm{Conv}\left(\frac{\sum_{i} P_{in,i} \cdot W_{i}}{\sum_{i} W_{i} + \varepsilon}\right) \tag{3}$$

where $P_{in,i}$ is the $i$-th input feature map, $W_{i}$ is a learned weight used for weighted fusion that is adjusted automatically through training, and $\varepsilon$ is a small constant for numerical stability.
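
The normalized weighted fusion in Equation (3) can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' code: the trailing convolution is omitted, feature maps are single-channel nested lists, the helper name `weighted_fusion` is hypothetical, and clamping the weights to be non-negative is an assumption borrowed from BiFPN-style fusion that the text does not state.

```python
def weighted_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped 2-D maps as sum_i(W_i * P_i) / (sum_i(W_i) + eps)."""
    if len(features) != len(weights):
        raise ValueError("one weight per feature map is required")
    # Assumption: weights are kept non-negative (ReLU), as in BiFPN.
    w = [max(0.0, wi) for wi in weights]
    norm = sum(w) + eps
    rows, cols = len(features[0]), len(features[0][0])
    return [[sum(wi * f[r][c] for wi, f in zip(w, features)) / norm
             for c in range(cols)]
            for r in range(rows)]


p1 = [[1.0, 2.0], [3.0, 4.0]]
p2 = [[5.0, 6.0], [7.0, 8.0]]
# eps=0.0 only to make the arithmetic easy to check by hand.
fused = weighted_fusion([p1, p2], weights=[1.0, 3.0], eps=0.0)
print(fused)  # [[4.0, 5.0], [6.0, 7.0]]
```

With weights 1 and 3, each output value is a 1:3 blend of the two inputs, so the map with the larger learned weight dominates the result.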

The calculation formula for multi-scale feature fusion is

$$Z = \sum_{i = 1}^{4} A_{i} \odot Y_{i} \tag{4}$$

Here, $A_{i}$ denotes the attention weight, $Y_{i}$ corresponds to the downsampled feature map, and the operator ⊙ signifies element-wise multiplication.
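
The data flow behind Equations (1) and (4) can be traced on a toy single-channel example. The sketch below is an illustration under stated assumptions, not the authors' implementation: the outputs of the attention branch and of the stride-2 convolution are taken as given inputs, the four values per output position are assumed to come from each 2 × 2 neighbourhood (the text only gives the shapes (C, 2H, 2W) and (C, H, W, 4)), and all function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a short list."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def partition_2x2(fmap):
    """Rearrange a (2H, 2W) map into (H, W, 4): one 4-vector per 2x2 block."""
    h, w = len(fmap) // 2, len(fmap[0]) // 2
    return [[[fmap[2 * r][2 * c], fmap[2 * r][2 * c + 1],
              fmap[2 * r + 1][2 * c], fmap[2 * r + 1][2 * c + 1]]
             for c in range(w)]
            for r in range(h)]


def lawds_weighted_sum(logits, y):
    """Z[r][c] = sum_i A_i[r][c] * Y_i[r][c] (Eq. 4) for a single channel.

    logits: (2H, 2W) attention-branch output before Softmax (Eq. 1).
    y:      (H, W, 4) stride-2 convolution output (Eq. 2), already grouped
            into four values per output position.
    """
    out = []
    for r, row in enumerate(partition_2x2(logits)):
        out_row = []
        for c, block in enumerate(row):
            a = softmax(block)  # four weights that sum to 1
            out_row.append(sum(ai * yi for ai, yi in zip(a, y[r][c])))
        out.append(out_row)
    return out


# Toy input: 2x2 logits give a single output position (H = W = 1).
logits = [[0.0, 0.0],
          [0.0, 0.0]]
y = [[[1.0, 2.0, 3.0, 4.0]]]
z = lawds_weighted_sum(logits, y)
print(z)  # [[2.5]]
```

Equal logits reduce the operation to a plain average of the four candidates; unequal logits let the module emphasize whichever sub-position the attention branch scores highest, which is how detail from small targets can survive the resolution reduction.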

Through the combination of embedded modules and multi-scale feature pyramid networks, the EMBSLaw backbone network can significantly improve the model’s detection performance on targets of different sizes, helping the model better identify and locate targets. The feature fusion mechanism and the local attention mechanism enable the model to dynamically adjust its attention to different features, enhancing robustness to noise and changes in data distribution. The design of the embedded modules focuses on lightness and efficiency. The weighted downsampling mechanism enhances computational efficiency in feature extraction by eliminating redundant operations, reducing both computational complexity and parameter requirements while preserving model performance.

3.3. Multi-Channel Branching and Multi-Scale Feature Fusion

The RGCEL_EMSC module integrates design principles from CSPNet (Cross-Stage Partial Network) and ELAN (Efficient Layer Aggregation Network), leveraging RGCSPELAN’s cyclic convolutions, grouped convolutions, and spatial pyramid pooling. This architecture strengthens multi-scale feature fusion and contextual information integration, achieving an optimal balance between efficient object detection and lightweight computational design. The specific content is as follows:

(1): Inspired by the concept in GhostNet—which identifies significant redundancy in intermediate feature maps of conventional CNNs—this method employs low-cost operations to generate a portion of redundant features, effectively minimizing computational and parametric overhead while maintaining model performance.
(2): The conventional bottleneck architecture employed in YOLOv5 and YOLOv11 has been discarded. To compensate for performance degradation caused by eliminating residual connections, RepConv is strategically implemented on the gradient circulation branch, enhancing feature extraction capability and gradient propagation efficiency. Crucially, RepConv’s structural reparameterization allows seamless layer fusion during inference, simultaneously addressing performance maintenance and computational efficiency in a unified framework.
(3): The output channel count of RepConv can be modulated through a scaling factor, enabling adaptive adjustment of feature extraction granularity to accommodate both compact and large-scale model architectures.
(4): The module optimizes gradient flow pathways and enhances gradient propagation efficiency, thereby strengthening the network’s learning capacity. This design improves multi-scale feature fusion and extraction to boost model performance, while enabling more effective capture of cross-scale target characteristics and contextual information.

As depicted in Figure 4, the model enhances temporal modeling capacity through iterative convolution operations across multi-scale temporal blocks. The cyclic convolution and grouped fusion mechanism, a key component of the YOLOv11 object detection network that enhances multi-scale context modeling for marine targets, consists of five sequential functional segments arranged from left to right. In the input projection (left segment), input features with 8C channels are projected through 1 × 1 convolution while maintaining the original channel count to ensure feature consistency. The middle-left segment implements channel splitting, where the feature map is divided into five groups with channel configurations of 4C, C, C, C, and C, respectively, to enable differentiated feature processing tailored to distinct feature characteristics. The middle segment comprises RepConv and convolution branches, where four small-channel branches sequentially adopt RepConv (structural reparameterization) and 3 × 3 convolution, with both input and output channels set to 2C to facilitate efficient gradient propagation and enhance feature representation capability. In the middle-right segment, feature aggregation is achieved by concatenating all intermediate outputs along the channel dimension to integrate multi-branch feature information. Finally, the right segment incorporates a residual connection, through which the concatenated features are fused with the original input features, effectively preserving primitive feature information and stabilizing the network training process.

The RGCEL_EMSC module augments multi-scale feature integration and contextual awareness through a synergistic combination of cyclic convolution, grouped convolution, spatial pyramid pooling (SPP), localized attention mechanisms, and the C3k2-EMSC architecture, achieving substantial performance gains in multi-scale target detection. The integrated SPP and attention framework enables dynamic feature prioritization through context-aware adaptation, enhancing model robustness against noise perturbations and distribution shifts. Concurrently, the cyclic convolution operators and C3k2-EMSC topology enforce computational efficiency via parameter-sharing group operations and recursive computation patterns, maintaining near-linear parameter growth while delivering nonlinear performance improvements. This balanced design achieves optimal accuracy-efficiency equilibrium through coordinated spatial-channel refinement and adaptive context modeling.

3.4. YOLO-ELR Spatial Multi-Branch Detection Head

The LSCSBD (Lightweight Spatial-Contextual Split Branch Detection) head, as illustrated in Figure 5, implements an efficient multi-branch architecture optimized for target detection through NAS-FPN-inspired design principles. To address feature distribution discrepancies across network hierarchies while maintaining computational efficiency, we deploy independent BatchNorm (BN) operations with parameter-shared convolutional layers. This avoids both the inference latency introduced by GN and the deviation of BN’s running (sliding) mean in shared-parameter scenarios. The configuration preserves essential normalization benefits while enhancing spatial-contextual modeling through parallel processing branches that synergistically optimize detection accuracy for multi-scale targets via hierarchical feature refinement and large-receptive-field context aggregation.

As revealed in Figure 6, the multi-scale prediction module of the YOLOv11 object detection network, which is critical for accurate multi-scale marine object detection, consists of five functional components arranged in a structured layout, with the overall structure analyzed from left to right and top to bottom: The multi-scale input interface (left section) receives three hierarchical feature maps (P3, P4, P5) transmitted from the neck module, where P3, P4, and P5 correspond to small, medium, and large targets respectively, laying a foundation for multi-scale object recognition. The top section implements initial feature transformation, where each input feature map undergoes non-shared channel projection through a 1 × 1 Conv_GN (Group Normalization) layer to flexibly adjust feature dimensions and optimize feature representation. The middle section is a parameter-shared convolution branch, comprising three parallel 3 × 3 Conv_BN (Batch Normalization) layers that share convolution parameters to reduce computational overhead while retaining independent BN statistics, thereby effectively avoiding feature distribution shift during network training. The right section consists of prediction separation branches, which are divided into three task-specific branches: Conv_cls for category classification, Conv_reg for bounding box regression, and Conv_mask for spatial feature enhancement; each branch is equipped with a scale factor to adapt to the varying resolutions of input features and improve prediction accuracy. Finally, the output layer (bottom section) fuses all prediction results and outputs the integrated outcomes, enabling precise detection of multi-scale marine objects in complex marine environments.
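To make the parameter saving from the shared 3 × 3 convolution concrete, the following Python sketch counts learnable parameters for one convolution stage applied to three feature levels, once with a separate convolution per level and once with a single shared convolution plus independent BN layers. The function names, the bias-free convolution, and the channel width of 64 used in the note below are assumptions chosen for illustration and are not taken from the paper.

```python
def conv_params(c_in, c_out, kernel):
    """Learnable weights of a bias-free 2D convolution."""
    return c_in * c_out * kernel * kernel


def bn_params(channels):
    """Learnable BN parameters: one gamma and one beta per channel."""
    return 2 * channels


def stage_params(channels, num_levels, shared_conv):
    """Parameters of one 3 x 3 Conv + BN stage applied to several levels.

    The convolution is counted once when shared, once per level otherwise.
    BN is always counted per level, so each level keeps its own statistics.
    """
    num_convs = 1 if shared_conv else num_levels
    return (num_convs * conv_params(channels, channels, 3)
            + num_levels * bn_params(channels))
```

Under these assumptions, three levels with 64 channels need 110,976 parameters without sharing and 37,248 with a shared convolution; the independent BN layers account for only 384 of them.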

The BN layer computes the mean and variance of all elements of a minibatch input feature z^{(i)}, subtracts the mean and divides by the standard deviation, and finally applies an affine transformation with the learnable parameters γ and β to obtain the BN output. The mean μ of each batch along the channel is computed as follows:

μ = \frac{1}{m} \sum_{i = 1}^{m} z^{(i)}

(5)

where m denotes the number of samples in the minibatch, and z^{(i)} represents the activation value of the i-th sample. The variance σ^{2} of each batch along the channel is computed as follows:

σ^{2} = \frac{1}{m} \sum_{i = 1}^{m} {(z^{(i)} - μ)}^{2}

(6)

The normalization applied to each sample is as follows:

z_{Norm}^{(i)} = \frac{z^{(i)} - μ}{\sqrt{σ^{2} + ε}}

(7)

where ε is a small constant that prevents division by zero and ensures numerical stability during normalization. Two learnable reconstruction parameters, γ and β, are then introduced so that the network can learn to restore the feature distribution that the original network would have learned. The scaling and translation step is as follows:

{\tilde{z}}^{(i)} = γ z_{Norm}^{(i)} + β

(8)
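Equations (5) to (8) can be traced step by step with the short Python sketch below, which normalizes the activations of a single channel over a minibatch. The function name `batch_norm` and the default values of `gamma`, `beta`, and `eps` are assumptions made for illustration; running statistics for inference are not modeled.

```python
import math


def batch_norm(z, gamma=1.0, beta=0.0, eps=1e-5):
    """Apply training-mode batch normalization to one channel.

    z     -- list of m activation values, one per sample in the minibatch
    gamma -- learnable scale (Equation 8)
    beta  -- learnable shift (Equation 8)
    eps   -- small constant added to the variance (Equation 7)
    """
    m = len(z)
    mu = sum(z) / m                                # Equation (5)
    var = sum((v - mu) ** 2 for v in z) / m        # Equation (6)
    z_norm = [(v - mu) / math.sqrt(var + eps) for v in z]  # Equation (7)
    return [gamma * v + beta for v in z_norm]      # Equation (8)
```

With the default `gamma` and `beta`, the output has approximately zero mean and unit variance; other values rescale and shift that normalized distribution.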

Group Normalization (GN) addresses Batch Normalization’s sensitivity to small batch sizes by partitioning input channels into groups and performing intra-group normalization, eliminating batch dimension dependence. This approach is particularly advantageous for memory-intensive tasks like high-resolution image processing, as it maintains stable activation statistics through channel-wise grouping while circumventing BN’s batch-statistics instability and GPU memory constraints associated with large feature maps. For a feature map with C channels, height H, and width W that is split into G groups, the statistics of each group are computed over the following number of elements per sample:

(C / G) \times H \times W

(9)

Group Normalization (GN) achieves batch-agnostic standardization by partitioning the channels of each sample’s feature map into G groups (each containing C/G channels) and computing the mean and standard deviation exclusively within each group. This decouples statistical estimation from the batch size: each channel subset is standardized independently with its own group statistics, which removes the dependence on the batch dimension while maintaining stable activation distributions. The group-wise mean and standard deviation are:

μ_{n g} (x) = \frac{1}{(C / G) H W} \sum_{c = g C / G + 1}^{(g + 1) C / G} \sum_{h = 1}^{H} \sum_{w = 1}^{W} x_{n c h w}

(10)

σ_{n g} (x) = \sqrt{\frac{1}{(C / G) H W} \sum_{c = g C / G + 1}^{(g + 1) C / G} \sum_{h = 1}^{H} \sum_{w = 1}^{W} {(x_{n c h w} - μ_{n g} (x))}^{2} + ε}

(11)

where n indexes the sample, g = 0, …, G − 1 indexes the group, and the channel index c runs over the C/G channels of group g.
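The group-wise statistics of Equations (10) and (11) can be sketched for a single sample as follows. The function name `group_norm`, the nested-list layout [C][H][W], and the default `eps` are assumptions for illustration; the learnable per-channel scale and shift that usually follow the normalization are left out.

```python
import math


def group_norm(x, num_groups, eps=1e-5):
    """Normalize one sample of shape [C][H][W] group by group.

    Each group of C / num_groups consecutive channels is normalized with
    its own mean (Equation 10) and standard deviation (Equation 11), so
    the result does not depend on other samples in the batch.
    """
    channels = len(x)
    if channels % num_groups != 0:
        raise ValueError("C must be divisible by the number of groups")
    size = channels // num_groups
    out = []
    for g in range(num_groups):
        group = x[g * size:(g + 1) * size]
        values = [v for ch in group for row in ch for v in row]
        mu = sum(values) / len(values)
        sigma = math.sqrt(
            sum((v - mu) ** 2 for v in values) / len(values) + eps
        )
        for ch in group:
            out.append([[(v - mu) / sigma for v in row] for row in ch])
    return out
```

Because the statistics are taken per sample and per group, the function gives the same result for a sample regardless of the batch it belongs to, which is the property that makes GN insensitive to batch size.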

The LSCSBD spatial multi-branch detection head employs hierarchical feature fusion through multi-scale feature pyramids to enhance multi-scale object detection, integrates spatial attention modules and spatial pyramid pooling layers to strengthen positional-shape perception, and implements parallel processing branches (scale-specific and task-oriented) with parameter-shared/independent configurations to optimize adaptability. This architecture systematically improves detection robustness by synergizing cross-scale contextual information, spatial-semantic awareness enhancement, and specialized feature processing pathways.

4. Experiments

4.1. Data Description

The dataset presented in this study is constructed by curating samples from two publicly available Kaggle sources: Boat Types Recognition https://www.kaggle.com/datasets/Clorichel/boat-types-recognition (accessed on 16 January 2024) containing 1500 images across 9 marine categories (buoys, cruise ships, ferries, cargo ships, gondolas, inflatable boats, kayaks, paper boats, sailboats), and Boat vs Sea Images Dataset https://www.kaggle.com/datasets/waqasahmedbasharat/boat-vs-sea-images-dataset (accessed on 16 January 2024) comprising 1000 ship/sea scene images. We selected four representative classes (buoys, ships, cruise ships, and military vessels) from these sources, where military vessels refer to typical surface combat ships in maritime scenes, and added a wave category to construct the DroneTargetDataset (DTA). The finalized DTA contains 3659 annotated images spanning five discriminative classes: waves, buoys, ships, cruise ships, and warships, with a standardized split of 70% (2561 images) for training, 10% (366 images) for validation, and 20% (732 images) for testing. Dataset composition and class distribution are illustrated in Figure 6. To evaluate the generalization capability of the YOLO-ELR object detection model, this study conducts experiments on the Pascal VOC (07+12) dataset https://aistudio.baidu.com/datasetdetail/128395 (accessed on 16 January 2024).
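The reported split sizes can be reproduced from the 3659 images and the 70/10/20 percentages with a few lines of Python. The rounding rule used here (round the training and validation counts to the nearest integer and assign the remainder to the test set) is an assumption that happens to match the reported numbers; the paper does not state how the split was computed, and the function name `split_counts` is hypothetical.

```python
def split_counts(total, train_pct=70, val_pct=10):
    """Return (train, val, test) image counts for a percentage split.

    Train and validation counts are rounded to the nearest integer using
    integer arithmetic; the test set receives the remaining images.
    """
    train = (total * train_pct + 50) // 100
    val = (total * val_pct + 50) // 100
    test = total - train - val
    return train, val, test
```

For `split_counts(3659)` this yields 2561, 366, and 732 images, which agrees with the figures given for the DTA dataset.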

4.2. Training Setting

Our experimental setup aligns with YOLOv8’s training protocol, implemented on a Windows 11 workstation equipped with an Intel Core i5-12490F CPU, PRIME H610M-D DDR4 motherboard, and NVIDIA GeForce RTX 4060 Ti GPU. The framework utilizes PyTorch 2.0.0 with CUDA 12.1 acceleration. Optimization employs stochastic gradient descent (SGD) with cosine annealing learning rate scheduling, initialized at 0.001. Training configurations maintain 640 × 640 input resolution across 1000 epochs (batch size = 32), with Mosaic augmentation persistently activated. All unspecified hyperparameters strictly follow baseline specifications to ensure methodological consistency. Hardware-specific implementation details include full-stack acceleration through NVIDIA’s TensorRT integration and PyTorch’s AMP (Automatic Mixed Precision) for memory optimization.
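The training settings above can be collected in a small configuration, together with a sketch of a cosine annealing schedule. The dictionary keys, the function name `cosine_annealing_lr`, and the minimum learning rate of 0 are assumptions for illustration; the paper specifies only the initial learning rate of 0.001, so the exact schedule used in the experiments may differ.

```python
import math

# Values taken from the training description; the key names are hypothetical.
TRAIN_CONFIG = {
    "optimizer": "SGD",
    "initial_lr": 0.001,
    "image_size": 640,
    "epochs": 1000,
    "batch_size": 32,
    "mosaic": True,
    "mixed_precision": True,
}


def cosine_annealing_lr(epoch, total_epochs=1000, lr0=0.001, lr_min=0.0):
    """Learning rate after `epoch` epochs of cosine annealing without restarts.

    lr_min is an assumed floor; the paper does not report a final value.
    """
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr0 - lr_min) * (1.0 + cosine)
```

Under these assumptions the learning rate starts at 0.001, reaches half of that value at epoch 500, and decays smoothly to the floor at epoch 1000.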

4.3. Evaluation Metrics

The performance of our proposed method is evaluated using five standard metrics: average precision (AP), precision (P), recall (R), mean average precision (mAP), and F1-Score. Precision (P) quantifies the proportion of correctly predicted positive instances among all model-predicted positives. Recall (R) measures the fraction of true positive instances successfully identified by the model. The F1-Score, defined as the harmonic mean of precision and recall, balances these two metrics. mAP extends AP by averaging the per-class AP values across all categories, providing a comprehensive assessment of multi-class detection robustness. Precision and recall are defined as follows:

P = \frac{TP}{TP + FP}

(12)

Within the confusion matrix framework, true positives (TP) represent the count of correctly identified positive instances (i.e., objects both detected and accurately classified), while false positives (FP) denote instances erroneously detected as positive by the model (e.g., background regions or misclassified objects).

R = \frac{TP}{TP + FN}

(13)

False negatives (FN) quantify instances where the model fails to detect true positive targets. The evaluation metrics mean average precision (mAP) and F1-Score are formally defined as:

AP = \int_{0}^{1} P (R) dR

(14)

mAP = \frac{1}{N} \sum_{i = 1}^{N} {AP}_{i}

(15)

mAP@50 denotes the mean average precision computed at an Intersection-over-Union (IoU) threshold of 0.5, a standard criterion for evaluating localization accuracy in object detection tasks. This metric reflects the model’s ability to balance precise bounding box alignment (IoU ≥ 0.5) with classification correctness across all classes [42].

F_{1} = 2 \times \frac{P \times R}{P + R}

(16)

The F1-Score quantifies model performance within a [0, 1] range, with higher values indicating superior precision-recall balance. For the YOLOv11 architecture, this metric evaluates detection efficacy in image detection tasks by harmonizing two critical aspects: precision (minimizing false positives) and recall (minimizing false negatives). An F1-Score approaching 1 signifies robust alignment between accurate localization (correct target identification) and comprehensive coverage (minimal missed detections), reflecting optimal model calibration for real-world deployment.
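The metrics in Equations (12) to (16) can be sketched in plain Python as shown below. The function names are hypothetical, and the trapezoidal rule used to approximate the integral in Equation (14) is an assumption: detection toolkits often use interpolated precision instead, so absolute AP values can differ slightly between implementations.

```python
def precision(tp, fp):
    """Equation (12): share of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Equation (13): share of ground-truth positives that are found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Equation (16): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def average_precision(recalls, precisions):
    """Equation (14): area under the P(R) curve via the trapezoidal rule."""
    points = sorted(zip(recalls, precisions))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


def mean_average_precision(ap_values):
    """Equation (15): mean of the per-class AP values."""
    return sum(ap_values) / len(ap_values)
```

For example, 8 true positives, 2 false positives, and 8 false negatives give a precision of 0.8, a recall of 0.5, and an F1-Score of 8/13, which shows how a low recall pulls the F1-Score well below the precision.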

4.4. Ablation Experiment and Visualization Results Analysis

4.4.1. YOLO-ELR Ablation Experiment

To evaluate the impact of individual components on the performance of the YOLO-ELR object detection model, we conducted comprehensive ablation studies and quantitatively analyzed the corresponding performance metrics. Results are summarized in Table 1, demonstrating the incremental impact of each module on detection accuracy and robustness. Here, A represents the YOLOv11 basic network, B represents the EMBSLaw network, C represents the RGCEL_EMSC component, and D represents the LSCSBD detection head.

The YOLO-ELR model builds upon the YOLOv11-Baseline architecture (AP: 79.7%, AR: 56.9%, mAP@50: 80.93%, parameters: 2,583,127, Model File Size: 5.3 MB) by introducing the EMBSLaw module as its core component. This module incorporates adaptive downsampling and efficient upsampling to enhance model efficiency, reducing both parameter count (1,833,528) and computational load while improving performance (AP: 82.7% [+3.0%], AR: 59.2% [+2.3%], mAP@50: 83.23% [+2.3%], Model File Size: 4.1 MB).

Further optimization is achieved through the RGCEL_EMSC module, which employs grouped circular convolution for multi-scale feature fusion, boosting detection accuracy (AP: 83.0% [+3.3%], AR: 60.1% [+3.2%], mAP@50: 83.71% [+2.78%]) with only 1,785,464 parameters—a 30.9% reduction compared to the baseline.

The spatial branch detector (LSCSBD), a critical component of YOLO-ELR, enhances localization and recognition by sharing convolution parameters across scales while keeping the BN layers independent. Integrating this module into the YOLOv11+EMBSLaw+RGCEL_EMSC framework yields an AP of 83.9% (+4.2%), AR of 62.9% (+6.0%), and mAP@50 of 84.87% (+3.94%), while further compressing the model to 1,678,769 parameters.
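The parameter reductions and accuracy gains quoted in this ablation can be checked with simple arithmetic. The sketch below uses only the parameter counts and mAP@50 values reported above; the helper name `reduction_pct` and the stage labels are hypothetical. Note that the accuracy gains are differences in percentage points, and that the reduction for the EMBSLaw stage is computed here rather than reported in the text.

```python
BASELINE_PARAMS = 2_583_127  # YOLOv11 baseline

# Parameter counts after cumulatively adding each module.
STAGE_PARAMS = {
    "+EMBSLaw": 1_833_528,
    "+EMBSLaw+RGCEL_EMSC": 1_785_464,
    "+EMBSLaw+RGCEL_EMSC+LSCSBD": 1_678_769,
}


def reduction_pct(baseline, current):
    """Relative parameter reduction with respect to the baseline, in percent."""
    return 100.0 * (baseline - current) / baseline


def summarize():
    """Return the parameter reduction of every stage, rounded to one decimal."""
    return {
        name: round(reduction_pct(BASELINE_PARAMS, count), 1)
        for name, count in STAGE_PARAMS.items()
    }
```

Calling `summarize()` gives reductions of about 29.0%, 30.9%, and 35.0% for the three stages, consistent with the 30.9% and 35% figures stated in the paper.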

4.4.2. Visual Feature Heatmap

To empirically validate the proposed model’s perceptual capability for object features, this work employs the hierarchical class activation map (LayerCAM) method [43] to generate feature heatmaps, enabling intuitive visualization of region-specific feature activation patterns. LayerCAM enhances heatmap generation by aggregating multi-layer feature maps, contrasting with Grad-CAM’s reliance on final-layer features. This method hierarchically weights activations from distinct network levels to synthesize fine-grained visualizations, preserving target-specific details while suppressing background noise. Its core innovation lies in a layer-wise weighting mechanism, where feature maps from multiple scales are dynamically fused based on their spatial saliency. Comparative heatmaps of YOLO-ELR and the baseline model (Figure 7) demonstrate LayerCAM’s superior ability to localize multi-scale object features with precision.

The visualization results further indicate that YOLO-ELR extracts key features and captures global information better than the baseline. As illustrated in Figure 8, the proposed detector retains its focus on object centers while exhibiting lower activation intensity (dimmer regions) in critical object areas, indicating enhanced adaptability to complex backgrounds and improved holistic object understanding.

4.4.3. Effective Receptive Field

The receptive field, defined as the region in the input image influencing a specific feature map activation [44], quantifies the spatial correspondence between a feature map point and its input area. Its calculation depends on the convolutional kernel’s size, stride, and padding parameters. The relationship is formalized as:

R F_{l} = R F_{l - 1} + (k_{l} - 1) \times \prod_{i = 1}^{l - 1} s_{i}

(17)

where R F_{l} denotes the receptive field at layer l, k_{l} the kernel size, and s_{i} the stride at layer i. This hierarchical aggregation ensures progressive contextual integration across network layers.
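The recursion in Equation (17) can be evaluated layer by layer, as in the following Python sketch. The function name `receptive_field` and the (kernel size, stride) layer description are hypothetical; the sketch assumes a starting value R F_{0} = 1 for a single input pixel and a dilation of 1 in every layer.

```python
def receptive_field(layers):
    """Theoretical receptive field after a stack of convolution layers.

    layers -- list of (kernel_size, stride) tuples, ordered from input to output

    Implements RF_l = RF_{l-1} + (k_l - 1) * prod_{i<l} s_i with RF_0 = 1.
    """
    rf = 1          # receptive field of a single input pixel
    stride_prod = 1  # product of the strides of all previous layers
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * stride_prod
        stride_prod *= stride
    return rf
```

Two stacked 3 × 3 convolutions with stride 1 cover a 5 × 5 input region, whereas the stack (3, 2), (3, 2), (3, 1) already covers 15 × 15 pixels, which illustrates how strided layers enlarge the receptive field of deeper layers.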

The receptive field defines the spatial extent of an input image influencing feature map activations, where larger fields capture global context and smaller fields prioritize local details. As network depth increases, the receptive field expands hierarchically: shallow layers emphasize fine-grained local patterns, while deeper layers integrate broader semantic information. This progressive contextual aggregation aligns with the hierarchical feature extraction paradigm in CNNs. Comparative receptive field analyses of the baseline and improved models, illustrated in Figure 8, validate the enhanced global-local tradeoff achieved through architectural optimization.

A larger receptive field enables networks to integrate broader spatial contexts, enhancing global information comprehension. This facilitates precise identification of holistic object structures in tasks like image detection, surpassing reliance on isolated local features. Furthermore, expanded receptive fields strengthen contextual awareness critical for semantic segmentation and object detection, improving model accuracy and robustness by resolving ambiguities through spatial-semantic relationships. Additionally, such architectures mitigate reliance on data preprocessing (e.g., cropping, scaling) by autonomously capturing rich multi-scale features, thereby reducing sensitivity to input variations.

4.4.4. Object Detection System

To streamline marine environmental target detection, this study designed and implemented YOLO-ELR, a real-time detection system capable of identifying targets in single images, image batches, and video streams. The system architecture, optimized for operational efficiency and multi-modal adaptability, integrates dynamic inference pipelines to balance speed and accuracy. Its user interface, designed for intuitive interaction with visual analytics, is presented in Figure 9, highlighting workflow modularity and real-time performance metrics.

The system supports dynamic selection of pre-trained weights and models for cross-dataset target detection, enabling identification of target categories, quantities, and center-oriented positional judgments. Users can manually adjust confidence thresholds or selectively identify specific classes via category filters, with per-target bounding box details accessible through ID-based queries. Real-time detection workflows allow [pause/continue] control, while detected images and coordinate data can be saved to a default directory via the [Save] function. After detection, the [End] button terminates ongoing detection processes without exiting the system; the system is closed completely via the [Exit System] interface button (Figure 10).

The system incorporates multi-modal visualization capabilities for heterogeneous object detection tasks, as demonstrated in Figure 11. This feature enables comparative analysis of detection variants (e.g., class-specific, rotated bounding boxes) through interactive visualization panels, supporting granular inspection of spatial-semantic relationships across detection paradigms.

4.5. Analysis of the Results of Comparative Experiment and Generalization Experiment

4.5.1. Analysis of Generalization Experiment Results

The Pascal VOC (07+12) dataset https://aistudio.baidu.com/datasetdetail/128395 (accessed on 16 January 2024), accessible via Baidu AI Studio, is a consolidated benchmark combining the Pascal VOC 2007 http://host.robots.ox.ac.uk/pascal/VOC/voc2007/ (accessed on 16 January 2024) and Pascal VOC 2012 datasets http://host.robots.ox.ac.uk/pascal/VOC/voc2012/ (accessed on 16 January 2024). This comprehensive collection encompasses annotations for 20 distinct object categories. Specifically, the Pascal VOC 2007 subset contains:

Training set: 2501 images (6301 objects);
Validation set: 2510 images (6307 objects);
Test set: 4952 images (12,032 objects).

The Pascal VOC 2012 subset includes:

Training set: 5717 images (13,609 objects);
Validation set: 5823 images (13,841 objects).

Quantitative comparisons of detection performance metrics are presented in Table 2 and Table 3, with qualitative visual analyses provided in Figure 12 and Figure 13. The comparative results show that the YOLO-ELR model achieves superior performance on the Pascal VOC (07+12) dataset, with a precision of 71.18%, recall of 63.18%, average precision (AP) of 67%, and mAP@0.5 of 70.49%. Notably, the model sustains this leading accuracy across objects of varying scales while retaining an efficient architecture with only 1.7 million parameters and a compact Model File Size of 3.6 MB, demonstrating excellent lightweight characteristics.

Experimental results demonstrate that the YOLO-ELR object detection model achieves an optimal balance between accuracy and computational efficiency across both the DTA and Pascal VOC (07+12) datasets, validating its strong generalization capability.

4.5.2. Analysis of Comparative Experimental Results

To comprehensively evaluate the overall performance of the proposed model, this paper conducts a comparative experiment with several mainstream object detection models such as YOLOv10, YOLOv9, YOLOv8, YOLOv6, YOLOv5-P6, YOLOv5, Faster R-CNN and YOLO-Damo. All comparison models are retrained on the same DTA marine dataset with identical input size, epochs, batch size, and augmentation for fair benchmarking.

Table 4 and Table 5 summarize the evaluation metrics for the YOLO-ELR object detection model alongside twelve alternative architectures. Benchmarked against YOLO-series variants and other state-of-the-art models, YOLO-ELR achieves superior accuracy while maintaining significantly fewer parameters and lightweight computational demands (GFLOPs), positioning it as a leading solution among the thirteen evaluated frameworks. The comparison chart of visualization results is shown in Figure 14. The YOLO-ELR object detection model improves recognition accuracy while maintaining robust detection integrity and comprehensive coverage. This balanced performance ensures reliable identification across diverse object categories while preserving complete spatial detection capabilities.

In terms of detection accuracy, the proposed YOLO-ELR model achieves an average precision (AP) of 83.9%, an average recall (AR) of 62.9%, and a mean average precision at IoU threshold 0.5 (mAP@50) of 84.87%, demonstrating significant advantages over competing models in object localization and recognition while ensuring robust detection performance. In terms of lightweight design, YOLO-ELR has the smallest model file size and the fewest parameters of all evaluated architectures. Although its computational complexity slightly exceeds that of YOLOv5-n and YOLOv5-P6n, it remains markedly lower than that of the other models, as illustrated in Figure 15 and Figure 16.

5. Conclusions

The EMBSLaw backbone network utilizes efficient up/downsampling methods to reduce computational load and parameter requirements. To enhance object detection and localization, the YOLO-ELR model replaces the conventional detection head with the LSCSBD spatial multi-branch detection head, which shares convolution parameters while keeping the BN layers independent in order to minimize parameters during inference. For improved accuracy, the RGCEL_EMSC module replaces standard convolutional modules, enhancing feature extraction and fusion capabilities. The YOLO-ELR detection model achieves significant performance improvements with an AP of 83.9% (+4.2%), AR of 62.9% (+6.0%), and mAP@50 of 84.87% (+3.94%) over baseline metrics, while simultaneously reducing model parameters by 35% and computational cost by 7.56% in GFLOPs. Comparative evaluations demonstrate that YOLO-ELR maintains superior accuracy relative to both one-stage detectors (YOLO series) and two-stage architectures (e.g., Faster R-CNN), while achieving optimal model efficiency. The framework exhibits robust performance under challenging conditions including complex scenes and ambiguous target features, establishing its effectiveness for practical deployment scenarios.

While YOLO-ELR significantly enhances object detection accuracy and reduces computational load and parameter requirements in challenging scenarios, its performance remains constrained in highly complex environments and certain object categories, partly due to limitations in dataset completeness. Additionally, its inference speed currently falls short of higher performance benchmarks. Future work will focus on leveraging more diverse and comprehensive datasets to improve robustness, optimizing detection speed without compromising accuracy or lightweight efficiency, and developing a systematic framework to visually compare results across models. We also aim to extend and refine the proposed algorithms for broader detection tasks, such as instance segmentation and keypoint detection, ensuring versatility alongside precision and efficiency.

Author Contributions

Conceptualization, J.Y. and L.W.; methodology, J.Y.; software, L.W.; validation, J.Y. and L.W.; formal analysis, L.W.; investigation, J.Y.; resources, J.Y.; data curation, J.Y.; writing—original draft preparation, J.Y.; writing—review and editing, L.W.; visualization, J.Y.; supervision, J.Y.; project administration, J.Y.; funding acquisition, J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Shandong Provincial Scientific Research Project: Research on High-Precision Ship Trajectory Control and Energy-Saving Technology under Complex Sea Conditions (2025TSGCCZZB0210).

Data Availability Statement

The datasets generated and analyzed during the current study are not publicly available because the associated project has not yet been completed, but they are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05); IEEE: New York, NY, USA, 2005; Volume 1, pp. 886–893.
Papageorgiou, C.; Poggio, T. A trainable system for object detection. Int. J. Comput. Vis. 2000, 38, 15–33.
Voulodimos, A.; Doulamis, N.; Doulamis, A.; Protopapadakis, E. Deep learning for computer vision: A brief review. Comput. Intell. Neurosci. 2018, 2018, 7068349.
Yu, Z.; Huang, H.; Chen, W.; Su, Y.; Liu, Y.; Wang, X. Yolo-facev2: A scale and occlusion aware face detector. arXiv 2022, arXiv:2208.02019.
Li, Y.; Li, S.; Du, H.; Chen, L.; Zhang, D.; Li, Y. YOLO-ACN: Focusing on small target and occluded object detection. IEEE Access 2020, 8, 227288–227303.
Zhou, X.; Jiang, L.; Hu, C.; Lei, S.; Zhang, T.; Mou, X. YOLO-SASE: An improved YOLO algorithm for the small targets detection in complex backgrounds. Sensors 2022, 22, 4600.
Betti, A.; Tucci, M. YOLO-S: A lightweight and accurate YOLO-like network for small target detection in aerial imagery. Sensors 2023, 23, 1865.
Tian, Y.; Wang, S.; Li, E.; Yang, G.; Liang, Z.; Tan, M. MD-YOLO: Multi-scale Dense YOLO for small target pest detection. Comput. Electron. Agric. 2023, 213, 108233.
Wang, H.; Xiao, N. Underwater object detection method based on improved faster RCNN. Appl. Sci. 2023, 13, 2746.
Zhai, S.P.; Shang, D.R.; Wang, S.H.; Dong, S.S. DF-SSD: An improved SSD object detection algorithm based on DenseNet and feature fusion. IEEE Access 2020, 8, 24344–24357.
Xu, Q.; Lin, R.; Yue, H.; Huang, H.; Yang, Y.; Yao, Z. Research on small target detection in driving scenarios based on improved yolo network. IEEE Access 2020, 8, 27574–27583.
Zhao, L.; Zhi, L.; Zhao, C.; Zheng, W. Fire-YOLO: A small target object detection method for fire inspection. Sustainability 2022, 14, 4930.
Chang, Y.; Li, D.; Gao, Y.; Su, Y.; Jia, X. An improved YOLO model for UAV fuzzy small target image detection. Appl. Sci. 2023, 13, 5409.
Mei, J.; Zhu, W. BGF-YOLOv10: Small object detection algorithm from unmanned aerial vehicle perspective based on improved YOLOv10. Sensors 2024, 24, 6911.
Khanam, R.; Hussain, M. Yolov11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725.
Sapkota, R.; Meng, Z.; Churuvija, M.; Du, X.; Ma, Z.; Karkee, M. Comprehensive performance evaluation of yolo11, yolov10, yolov9 and yolov8 on detecting and counting fruitlet in complex orchard environments. arXiv 2024, arXiv:2407.12040.
Feng, Q.; Li, J.; He, Q. Photoelectric Measurement and Sensing: New Technology and Applications. Sensors 2023, 23, 8584.
Ruizhong, R. Analysis and prospect of modern atmospheric optics and its applications in optoelectronic engineering. Infrared Laser Eng. 2022, 51, 20210818-1.
Yu, Q.; Wang, B.; Su, Y. Object detection-tracking algorithm for unmanned surface vehicles based on a radar-photoelectric system. IEEE Access 2021, 9, 57529–57541.
Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2018; pp. 6154–6162.
He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: New York, NY, USA, 2017; pp. 2961–2969.
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: New York, NY, USA, 2017; pp. 2980–2988.
Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2017; pp. 7263–7271.
Li, H.; Deng, L.; Yang, C.; Liu, J.; Gu, Z. Enhanced YOLO v3 tiny network for real-time ship detection from visual image. IEEE Access 2021, 9, 16692–16706.
Al Muksit, A.; Hasan, F.; Emon, M.F.H.B.; Haque, M.R.; Anwary, A.R.; Shatabda, S. YOLO-Fish: A robust fish detection model to detect fish in realistic underwater environment. Ecol. Inform. 2022, 72, 101847.
Feng, J.; Jin, T. CEH-YOLO: A composite enhanced YOLO-based model for underwater object detection. Ecol. Inform. 2024, 82, 102758.
Yu, C.; Yin, H.; Rong, C.; Zhao, J.; Liang, X.; Li, R.; Mo, X. YOLO-MRS: An efficient deep learning-based maritime object detection method for unmanned surface vehicles. Appl. Ocean Res. 2024, 153, 104240.
Zhang, M.; Xu, S.; Song, W.; He, Q.; Wei, Q. Lightweight underwater object detection based on yolo v4 and multi-scale attentional feature fusion. Remote Sens. 2021, 13, 4706. [Google Scholar] [CrossRef]
Yuan, M.; Meng, H.; Wu, J. AM YOLO: Adaptive multi-scale YOLO for ship instance segmentation. J. Real.-Time Image Process. 2024, 21, 100. [Google Scholar] [CrossRef]
Guo, Z.; He, X.; Yang, Y.; Qing, L.; Chen, H. DAG-YOLO: A context-feature adaptive fusion rotating detection network in remote sensing images. ACM Trans. Multimed. Comput. Commun. Appl. 2024, 20, 1–24. [Google Scholar] [CrossRef]
Guo, Q.; Wang, Y.; Zhang, Y.; Qin, H.; Qi, H.; Jiang, Y. AWF-YOLO: Enhanced Underwater Object Detection with Adaptive Weighted Feature Pyramid Network; OAE Publishing Inc.: Alhambra, CA, USA, 2023. [Google Scholar]
Hui, Y.; Wang, J.; Li, B. WSA-YOLO: Weak-supervised and adaptive object detection in the low-light environment for YOLOV7. IEEE Trans. Instrum. Meas. 2024, 73, 2507012. [Google Scholar] [CrossRef]
Lei, F.; Tang, F.; Li, S. Underwater target detection algorithm based on improved YOLOv5. J. Mar. Sci. Eng. 2022, 10, 310. [Google Scholar] [CrossRef]
Jia, R.; Lv, B.; Chen, J.; Liu, H.; Cao, L.; Liu, M. Underwater object detection in marine ranching based on improved YOLOv8. J. Mar. Sci. Eng. 2023, 12, 55. [Google Scholar] [CrossRef]
Wang, S.; Li, Y.; Qiao, S. ALF-YOLO: Enhanced YOLOv8 based on multiscale attention feature fusion for ship detection. Ocean Eng. 2024, 308, 118233. [Google Scholar] [CrossRef]
Wang, J.; Mai, R. Um-Yolov10: An Underwater Object Detection algorithm for Marine Environment Based on Yolov10 Model. Fishes 2025, 10, 173. [Google Scholar] [CrossRef]
Tian, D.; Yan, X.; Zhou, D.; Wang, C.; Zhang, W. Iv-yolo: A lightweight dual-branch object detection network. Sensors 2024, 24, 6181. [Google Scholar] [CrossRef] [PubMed]
Luo, X.; Luo, S.; Chen, M.; Zhao, G.; He, C.; Wu, H. MBFormer-YOLO: Multi-Branch Adaptive Spatial Feature Detection Network for Small Infrared Object Detection. IEEE Sens. J. 2024, 24, 19517–19530. [Google Scholar] [CrossRef]
Jing, J.; Jia, B.; Huang, B.; Liu, L.; Yang, X. YOLO-D: Dual-branch infrared distant target detection based on multi-level weighted feature fusion. In Proceedings of the International Conference on Neural Information Processing; Springer: Singapore, 2023; pp. 140–151. [Google Scholar]
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
Jiang, P.; Zhang, C.; Hou, Q.; Cheng, M.; Wei, Y. LayerCAM: Exploring hierarchical class activation maps for localization. IEEE Trans. Image Process. 2021, 30, 5875–5888. [Google Scholar] [CrossRef] [PubMed]
Ding, X.; Zhang, X.; Han, J.; Ding, G. Scaling up your kernels to 31x31: Revisiting large kernel design in cnns. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2022; pp. 11963–11975. [Google Scholar]

Figure 1. YOLO-ELR network structure diagram.

Figure 2. EMBS backbone network structure diagram.

Figure 3. Law downsampling module structure diagram.

Figure 4. Schematic diagram of RGCEL_EMSC module structure.

Figure 5. LSCSBD module network structure diagram.

Figure 6. Partial images of the DTA dataset.

Figure 7. Feature heatmap comparison.

Figure 8. Effective receptive field comparison. (a) YOLOv11-BaseLine; (b) YOLOv11+EMBSLaw; (c) YOLOv11+EMBSLaw+RGCEL_EMSC; (d) YOLO-ELR.

Figure 9. UI of marine target detection system.

Figure 10. UI of marine target detection system.

Figure 11. Object detection visual result diagram.

Figure 12. PR values of different models on the PascalVOC (07+12) dataset.

Figure 13. Comparison of different models on the PascalVOC (07+12) dataset: Model File Size, Parameters, GFLOPs and F1-Score. (a) Comparison chart of Model File Size and mAP50; (b) Comparison chart of Parameters and mAP50; (c) Comparison chart of GFLOPs and mAP50; (d) Comparison chart of F1-Score and mAP50.

Figure 14. Visual result comparison chart.

Figure 15. Visual result comparison chart. (a) Model File Size versus mAP50; (b) Parameters versus mAP50; (c) F1-Score versus mAP50; (d) GFLOPs versus mAP50.

Figure 16. A comparative analysis of parameters and performance metrics across various models.

Table 1. Ablation results for the YOLO-ELR model.

A	B	C	D	AP (%)	AR (%)	F1 (%)	mAP@50 (%)	mAP@75 (%)	mAP@50-95 (%)	GFLOPs	Parameters	Model File Size
✓				79.7	56.9	78.75	80.93	52.57	49.84	6.6	2,583,127	5.3 MB
✓	✓			82.7	59.2	78.85	83.23	52.38	50.00	6.3	1,833,528	4.1 MB
✓	✓	✓		83.0	60.1	79.61	83.71	53.49	49.69	6.3	1,785,464	4.0 MB
✓	✓	✓	✓	83.9	62.9	80.51	84.87	53.95	50.90	6.1	1,678,769	3.8 MB

✓: Indicates that the corresponding module is included in the model.
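
The headline gains can be re-derived from Table 1. The following is a minimal sketch, assuming the first row is the YOLOv11 baseline and the last row is the full YOLO-ELR model; the dictionary keys and the `relative_reduction` helper are hypothetical names introduced for illustration only. Because the table reports rounded values, the computed percentages may differ slightly from figures derived from unrounded measurements (the rounded GFLOPs, for example, give a reduction of roughly 7.6%).

```python
# Values copied from the first (baseline) and last (full model) rows of Table 1.
# Key names are illustrative, not taken from the paper.
baseline = {"map50": 80.93, "gflops": 6.6, "params": 2_583_127, "size_mb": 5.3}
yolo_elr = {"map50": 84.87, "gflops": 6.1, "params": 1_678_769, "size_mb": 3.8}


def relative_reduction(before: float, after: float) -> float:
    """Return the percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100.0


# Absolute gain in mAP@50, in percentage points.
map_gain = yolo_elr["map50"] - baseline["map50"]

# Relative savings in model cost.
param_reduction = relative_reduction(baseline["params"], yolo_elr["params"])
gflops_reduction = relative_reduction(baseline["gflops"], yolo_elr["gflops"])
size_reduction = relative_reduction(baseline["size_mb"], yolo_elr["size_mb"])

print(f"mAP@50 gain:          {map_gain:.2f} percentage points")
print(f"Parameter reduction:  {param_reduction:.2f} %")
print(f"GFLOPs reduction:     {gflops_reduction:.2f} %")
print(f"File size reduction:  {size_reduction:.2f} %")
```

Note that the mAP@50 gain is an absolute difference in percentage points, whereas the parameter, GFLOPs and file-size figures are relative to the baseline.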

Table 2. Results of YOLO-ELR and other models on the PascalVOC 07+12 dataset.

Model	Size	P (%)	R (%)	F1 (%)	mAP@50 (%)	mAP@75 (%)	mAP@50-95 (%)	Parameters	Model File Size (MB)	GFLOPs
YOLO-ELR	640 × 640	71.18	63.18	66.44	70.49	55.12	50.22	1,679,744	3.6	6.1
YOLOv11-n	640 × 640	70.59	61.14	65.07	68.44	52.07	47.76	2,586,052	5.2	6.6
YOLOv10-n	640 × 640	65.91	59.04	61.82	64.22	48.48	44.27	2,269,068	5.5	6.5
YOLOv8-n	640 × 640	70.76	60.87	64.96	68.17	51.93	47.14	2,688,268	5.3	5.5
YOLOv6-n	640 × 640	67.02	59.92	62.60	65.68	50.35	45.99	4,157,004	8.1	8.3
YOLOv5-n	640 × 640	65.72	58.33	61.09	64.26	45.63	42.66	2,185,564	4.4	4.5

Table 3. AP and AR results of YOLO-ELR and other models on the PascalVOC 07+12 dataset.

Model	Size	AP@50 (%)	AP@75 (%)	AP@50-95 (%)	${AR}_{S}$ (%)	${AR}_{M}$ (%)	${AR}_{L}$ (%)	${AR}_{All}$ (%)
YOLO-ELR	640 × 640	67.0	49.4	45.4	55.0	62.2	73.6	68.3
YOLOv11-n	640 × 640	65.0	46.4	43.4	44.1	59.9	73.2	66.9
YOLOv10-n	640 × 640	60.9	44.1	40.4	47.9	60.9	73.4	67.6
YOLOv8-n	640 × 640	64.7	46.9	43.0	47.0	60.1	72.6	66.6
YOLOv6-n	640 × 640	61.7	44.4	41.1	44.8	59.1	72.6	66.2
YOLOv5-n	640 × 640	59.3	40.8	37.9	48.1	58.7	69.4	64.3

Table 4. Performance metrics of YOLO-ELR and the comparison models.

Model	Size	P (%)	R (%)	F1 (%)	mAP@50 (%)	mAP@75 (%)	mAP@50-95 (%)	GFLOPs	Parameters	Model File Size (MB)
YOLO-ELR	640 × 640	81.81	77.16	80.51	84.87	53.95	50.90	6.1	1,678,769	3.8
YOLOv11-s	640 × 640	81.44	66.58	72.92	77.65	52.44	49.45	21.3	9,414,735	18.4
YOLOv10-n	640 × 640	83.39	77.38	79.84	82.04	53.73	50.61	6.5	2,266,143	5.6
YOLOv8-s	640 × 640	81.84	75.88	78.33	81.15	56.02	52.31	23.4	9,829,599	19.1
YOLOv8-n	640 × 640	79.30	71.82	74.98	76.94	51.65	48.62	6.8	2,685,343	5.5
YOLOv6-s	640 × 640	75.93	77.31	76.27	79.07	53.05	50.80	42.8	15,977,119	30.8
YOLOv6-n	640 × 640	79.05	76.18	76.29	78.51	55.53	50.41	11.5	4,155,519	8.3
YOLOv5-P6s	640 × 640	78.11	73.98	75.42	77.86	53.12	49.68	19.1	13,437,892	26.1
YOLOv5-P6n	640 × 640	76.02	75.84	75.28	80.11	53.24	51.62	5.9	3,677,044	7.5
YOLOv5-n	640 × 640	78.08	78.53	77.68	81.64	54.71	50.06	5.8	2,182,639	4.5
YOLOv5-s	640 × 640	84.99	76.60	80.37	81.98	56.66	52.35	18.7	7,815,551	15.32
Faster-RCNN	640 × 640	81.61	77.09	79.87	82.52	52.11	49.33	118.6	60,514,962	114.38
YOLO-Damo	640 × 640	81.06	76.96	78.89	81.47	51.04	48.77	97.3	42,591,276	32.53
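
One way to read the accuracy and cost trade-off in Table 4 is to ask which models are not beaten on both GFLOPs and mAP@50 at once. The sketch below is an illustrative check, not part of the authors' evaluation: the `pareto_front` helper and the `table4` dictionary are hypothetical names, and the values are copied from the GFLOPs and mAP@50 columns above.

```python
# (GFLOPs, mAP@50 in %) for each model, copied from Table 4.
table4 = {
    "YOLO-ELR": (6.1, 84.87),
    "YOLOv11-s": (21.3, 77.65),
    "YOLOv10-n": (6.5, 82.04),
    "YOLOv8-s": (23.4, 81.15),
    "YOLOv8-n": (6.8, 76.94),
    "YOLOv6-s": (42.8, 79.07),
    "YOLOv6-n": (11.5, 78.51),
    "YOLOv5-P6s": (19.1, 77.86),
    "YOLOv5-P6n": (5.9, 80.11),
    "YOLOv5-n": (5.8, 81.64),
    "YOLOv5-s": (18.7, 81.98),
    "Faster-RCNN": (118.6, 82.52),
    "YOLO-Damo": (97.3, 81.47),
}


def pareto_front(models: dict[str, tuple[float, float]]) -> list[str]:
    """Return models not dominated by any other model.

    Model X dominates model Y when X needs no more GFLOPs and reaches
    no lower mAP@50 than Y, and is strictly better on at least one of the two.
    """
    front = []
    for name, (flops, m_ap) in models.items():
        dominated = any(
            f <= flops and m >= m_ap and (f < flops or m > m_ap)
            for other, (f, m) in models.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    # Order the surviving models from cheapest to most expensive.
    return sorted(front, key=lambda n: models[n][0])


front = pareto_front(table4)
print(front)
```

Under this criterion, a model stays on the front only if no competitor is both at least as cheap and at least as accurate; the same helper could be applied to the Parameters or Model File Size columns.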

Table 5. AP and AR metrics of YOLO-ELR and the comparison models.

Model	Size	AP@50 (%)	AP@75 (%)	AP@50-95 (%)	${AR}_{S}$ (%)	${AR}_{M}$ (%)	${AR}_{L}$ (%)	${AR}_{All}$ (%)
YOLO-ELR	640 × 640	83.9	52.3	49.6	47.6	39.7	63.7	62.9
YOLOv11-s	640 × 640	76.2	49.8	47.3	47.4	38.2	62.5	57.8
YOLOv10-n	640 × 640	80.8	52.6	48.9	46.8	37.6	64.5	59.7
YOLOv8-s	640 × 640	78.7	52.0	48.8	47.3	37.3	61.7	57.0
YOLOv8-n	640 × 640	76.0	50.3	47.5	45.0	37.1	63.7	58.8
YOLOv6-s	640 × 640	78.1	51.3	49.3	46.0	45.1	65.7	61.0
YOLOv6-n	640 × 640	77.2	53.2	48.4	47.3	41.5	64.0	59.5
YOLOv5-P6s	640 × 640	77.0	51.5	48.3	51.0	44.4	64.9	60.9
YOLOv5-P6n	640 × 640	79.0	51.1	49.9	50.2	40.4	63.5	59.4
YOLOv5-n	640 × 640	80.6	52.8	48.6	51.1	43.5	63.1	59.1
YOLOv5-s	640 × 640	79.7	53.2	49.5	46.5	37.5	62.5	58.0
Faster-RCNN	640 × 640	82.1	50.8	48.8	46.8	39.1	62.6	59.9
YOLO-Damo	640 × 640	81.06	53.44	49.97	47.5	39.6	63.3	62.1

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yuan, J.; Wan, L. YOLO-ELR: A High-Precision Lightweight Object Detection Model in Marine Environment. J. Mar. Sci. Eng. 2026, 14, 998. https://doi.org/10.3390/jmse14110998

