Article

CIMB-YOLOv8: A Lightweight Remote Sensing Object Detection Network Based on Contextual Information and Multiple Branches

Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan 430072, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(13), 2657; https://doi.org/10.3390/electronics14132657
Submission received: 21 May 2025 / Revised: 11 June 2025 / Accepted: 20 June 2025 / Published: 30 June 2025

Abstract

A lightweight YOLOv8 variant, CIMB-YOLOv8, is proposed to address challenges in remote sensing object detection, such as complex backgrounds and multi-scale targets. The method enhances detection accuracy while reducing computational costs through two key innovations: (1) contextual multi-branch fusion, which integrates a space-to-depth multi-branch pyramid (SMP) to capture rich contextual features, improving small-target detection by 1.2% on DIOR; and (2) a lightweight architecture, which employs a Lightweight GroupNorm Detail-enhance Detection (LGDD) head with shared convolution, reducing parameters by 14% compared to YOLOv8n. Extensive experiments on the DIOR, DOTA, and NWPU VHR-10 datasets demonstrate the model's superiority, achieving 68.8% mAP on DOTA (+0.7% over YOLOv8n) and 82.9% mAP on NWPU VHR-10 (+1.7%). The model runs at 118.7 FPS on an NVIDIA RTX 3090 GPU, making it well-suited for real-time applications on resource-constrained devices. These results highlight its practical value for remote sensing scenarios requiring high-precision, lightweight detection.

1. Introduction

Remote sensing target detection is an important foundation for the application of remote sensing technology in land use, urban planning, and natural disaster monitoring [1]. The principle is to use computer vision algorithms to classify and detect targets in remote sensing images, determine their precise locations, and extract their feature information [2]. Unlike natural images, remote sensing images are characterized by fragmented distribution of target information, diverse and complex backgrounds, and significant differences in target scales, which further increase the difficulty of detection. Traditional detection methods, such as Histogram of Oriented Gradients (HOG) [3] and Scale-Invariant Feature Transform (SIFT) [4], can only recognize specific single-class objects, are effective only in limited scenarios, and degrade markedly when the imaging environment changes.
Advances in artificial intelligence have led to the extensive application of deep learning techniques in the field of computer vision. In the domain of object detection, convolutional neural networks (CNNs) have exhibited remarkable performance, primarily attributed to their hierarchical network architectures and robust feature representation capabilities. Current CNN-based object detection algorithms are mainly divided into two types: one-stage and two-stage methods. The latter, including Fast R-CNN [5], Faster R-CNN [6], and Cascade R-CNN [7], first generate candidate regions (e.g., via a region proposal network, RPN) and classify them as background or target regions; the proposals are then fed to the detection head and mapped to the appropriate positions on the feature map for the final classification and regression. The most influential one-stage object detection models are SSD [8], RetinaNet [9], and the YOLO series [10,11,12,13,14,15]; unlike the R-CNN family, these methods directly regress object positions on the feature map, transforming the localization problem into a regression problem.
The above methods are mainly designed for natural images. Detection in remotely sensed imagery, however, focuses on different aspects than natural image detection, owing to its distinctive top-down viewpoint and the considerable imaging distances involved. In remote sensing imagery analysis, background and contextual information often play an auxiliary role in target recognition, as they may contain clues that help distinguish targets. For natural scene images, however, the same types of information may be considered interference, because they may not be directly related to the target task and may even introduce noise or confusing signals, thereby reducing recognition accuracy. This inherent characteristic of remote sensing data makes object detection in such scenarios more challenging than conventional detection tasks. Shi et al. [16] addressed small-object detection in remote sensing images by introducing a cross-layer attentional fusion module and a weighted multi-receptive-field atrous spatial pyramid pooling module to mitigate deep feature loss and background interference. Xu and Wu [17] developed an efficient anchor-free remote sensing target detector based on YOLO, which enables high-precision detection of small targets through an improved CJAM and other feature extraction modules, as well as lightweight auxiliary networks and Swin Transformers. Tang et al. [18] adopted a combined module to enhance channel information and added a new detection head to develop the HIC-YOLOv5 algorithm. Li et al. [19] enhanced the YOLOv8 network using dual-channel feature fusion and BiFPN [20] to improve small-object detection performance; they also replaced some C2f (CSP bottleneck with two convolutions) modules with the GhostblockV2 [21] structure to minimize feature loss during network transmission.
While the detection accuracy for small objects has seen notable improvement in recent algorithms, the marginal cost in model parameters relative to the accuracy gained is rarely considered, which limits practical deployment. Therefore, we propose a small-object detection algorithm based on the scale characteristics of small objects in high-resolution remote sensing images and drone aerial images, as well as the structural characteristics of the baseline algorithm. The contributions of this study can be summarized as follows:
(1) A resource-efficient feature pyramid named SMP is proposed; it uses SPDConv instead of strided convolutional and pooling layers, allowing the P3 layer to obtain features richer in small-target information. The module is designed to efficiently capture feature representations spanning global-to-local hierarchies, integrating contextual information with localized semantic cues through the designed Omni Cross Stage Partial (OCSP) module for feature integration, ultimately improving small-target detection performance.
(2) We propose a lightweight detection head called LGDD (Lightweight GroupNorm Detail-enhance Detection) that employs three key innovations for efficient object detection. First, it utilizes shared convolution to dramatically reduce the parameter count. Second, it incorporates a Scale layer to normalize feature responses across different detection heads, effectively addressing scale inconsistency in target detection. Third, it introduces DEConv (Detail-Enhanced Convolution), an operation that enhances detail capture in two phases: during training, it integrates prior knowledge into standard convolutional layers to boost representation capacity, and during inference, it is transformed into a regular convolution via reparameterization. This dual-phase design improves generalization without introducing additional parameters or computational overhead at inference, maintaining the model's lightweight architecture.
The rest of this article is organized as follows: Section 2 reviews the current state of research on the YOLOv8 algorithm and feature pyramids. Section 3 introduces the CIMB-YOLO model, and Section 4 presents the datasets used and the corresponding experiments and analysis. Finally, conclusions are drawn in Section 5.

2. Related Work

2.1. YOLOv8 Series Development

YOLOv8 introduces new features and improvements based on YOLOv5, further enhancing performance and flexibility. The main workflow of the YOLOv8 model is detailed below.
The first step is image preprocessing. YOLOv8 uses a preset input size of 640 × 640, so images of other sizes must be resized accordingly. The model uses letterbox scaling to prevent the distortion caused by naive resizing: the image is scaled proportionally and the remaining area is filled with a background color, maintaining the original aspect ratio while adapting the image to the model's input size. The images are then normalized to eliminate the influence of different feature scales; this typically involves scaling pixel values to the range [0, 1] or standardizing them based on the statistical properties of the dataset, which accelerates training and improves stability.
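A minimal sketch of the letterbox preprocessing described above, assuming OpenCV and NumPy; the grey padding value of 114 is a common YOLO convention and is an assumption here, not taken from the paper.

```python
import cv2
import numpy as np

def letterbox(image, new_size=640, pad_value=114):
    """Resize with unchanged aspect ratio, padding the borders to new_size x new_size."""
    h, w = image.shape[:2]
    scale = min(new_size / h, new_size / w)                     # keep aspect ratio
    resized = cv2.resize(image, (int(round(w * scale)), int(round(h * scale))))
    pad_h, pad_w = new_size - resized.shape[0], new_size - resized.shape[1]
    top, left = pad_h // 2, pad_w // 2
    padded = cv2.copyMakeBorder(resized, top, pad_h - top, left, pad_w - left,
                                cv2.BORDER_CONSTANT, value=(pad_value,) * 3)
    return padded.astype(np.float32) / 255.0                    # normalize to [0, 1]
```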
The next step is model inference. In typical object detection frameworks, the backbone network serves as the primary component for extracting hierarchical features from input images. YOLOv8 uses CSPDarknet53 as its backbone, a deep convolutional neural network that employs residual and CSP connections to improve feature extraction efficiency; a series of convolutional layers and C2f modules gradually extract high-level image features. The neck fuses feature maps from the backbone into multi-scale representations, enhancing the model's capability for multi-scale object detection. YOLOv8 adopts a PAN-FPN structure similar to that of YOLOv5, which fuses multi-scale feature maps via bidirectional pathways: a top-down pathway propagates strong semantic context, while a bottom-up pathway preserves fine-grained localization details. In the neck, feature fusion is performed with the Spatial Pyramid Pooling Fast (SPPF) structure and PANet so that the model can more effectively detect targets of different scales. The detection head transforms the fused feature maps into final detection results, including object classification and bounding box regression. YOLOv8 adopts a decoupled head structure that separates the regression branch from the classification branch, a design that improves convergence speed and detection performance. The feature maps fused in the neck are sent to the head to predict target categories and bounding boxes; the head generates detection layers at multiple scales through a series of convolution and upsampling operations, where each detection layer includes a regression branch and a classification branch. Predictions with confidence below a preset threshold are removed to improve detection accuracy, and Non-Maximum Suppression (NMS) is then applied to the remaining predictions to remove overlapping detection boxes and retain the best ones. Lastly, the final detection result, including each target's category and bounding box, is output. Figure 1 and Figure 2 show the workflow and algorithm architecture of YOLOv8, respectively.
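The post-processing step (confidence filtering followed by NMS) can be sketched as follows, assuming PyTorch and torchvision; the thresholds and the dummy `boxes`/`scores` tensors are illustrative, not the paper's settings.

```python
import torch
from torchvision.ops import nms

def postprocess(boxes, scores, conf_thres=0.25, iou_thres=0.45):
    """Filter low-confidence predictions, then apply NMS to remove overlapping boxes."""
    keep = scores > conf_thres                  # confidence threshold
    boxes, scores = boxes[keep], scores[keep]
    kept = nms(boxes, scores, iou_thres)        # indices of boxes surviving NMS
    return boxes[kept], scores[kept]

# usage with random dummy predictions in (x1, y1, x2, y2) format
boxes = torch.rand(100, 4) * 640
boxes[:, 2:] += boxes[:, :2]                    # ensure x2 > x1 and y2 > y1
scores = torch.rand(100)
final_boxes, final_scores = postprocess(boxes, scores)
```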

2.2. Small Target Detection Challenges and Methods

Many researchers have applied the YOLO series to remote sensing object detection (RSOD) tasks, and the main improvements currently focus on the PAN-FPN structure of YOLOv8, as shown in Figure 3. Wan et al. [22] proposed the YOLO-HR network, which integrates a multilayer FPN and a hybrid attention module to improve accuracy in RSOD. Wang et al. [23] introduced the UAV-YOLOv8 network specifically for UAV scenarios, utilizing the BiFormer [24] module to optimize the network, replacing the regression loss function with WIoU [25], and enhancing detection speed with specially designed convolutional modules. The LAR-YOLOv8 [26] algorithm includes a dual-branch architecture with an attention mechanism and an attention-guided bidirectional FPN, which effectively enhances its detection capability. These improvements have increased detection accuracy in complex environments and for small objects.

2.3. Lightweight Strategy

Current methods mainly focus on lightweighting the YOLO architecture. Chen et al. [27] designed a C2f_RVB module aiming to minimize the model parameter count while enhancing the representational power of deep features. Chen et al. [28] designed a TDD head that enhances feature interaction between the classification and regression tasks through a task alignment mechanism and shared convolution, reducing model parameters and computational complexity. Chung et al. [29] designed a convolutional attention module to improve the efficiency of feature extraction while reducing computational overhead. While these methods effectively reduce model parameters, they often incur a trade-off with detection accuracy. Thus, the core research challenge remains: how to minimize the parameter count without compromising, or even while improving, model performance.

3. Proposed Method

3.1. Overview

We made improvements and optimizations to YOLOv8; Figure 4 presents the resulting RSOD network framework, CIMB-YOLOv8. First, we integrated SMP to improve accuracy without significantly increasing computational complexity or inference time, fusing the features rich in small-target information obtained with SPDConv [30]. Then, based on the CSP [31] and Omni-Kernel [32] concepts, we designed a new module, OCSP, for feature integration. Finally, we developed LGDD to make the model lightweight, reducing the number of parameters through shared convolution and improving accuracy with GroupNorm convolution (GNConv) and DEConv [33].

3.2. Space-to-Depth Multi-Branch Pyramid

Detecting small targets using the conventional P3, P4, and P5 feature layers remains challenging due to insufficient fine-grained detail. While adding a high-resolution P2 layer can enhance small-object detection, this approach introduces significant drawbacks, including increased computational complexity and prolonged post-processing time. This motivates the development of new, effective feature pyramids for small-target detection. Based on the original PAN-FPN, we propose a space-to-depth multi-branch pyramid. Unlike the traditional approach, we use the P2 feature layer to obtain features rich in small-target information with SPDConv and fuse them with P3. SPDConv eliminates strided and pooling operations in CNN downsampling through a two-stage process: (1) a space-to-depth transformation that reorganizes the feature map, followed by (2) a non-strided convolution for dimension reduction. This approach maintains critical spatial information while performing effective downsampling, preserving discriminative features that are typically lost in conventional subsampling. For example, an intermediate feature map X of size $S \times S \times C_1$ is sliced into a sequence of sub-feature maps:
$$
\begin{aligned}
f_{0,0} &= X[0:S:scale,\ 0:S:scale], \quad f_{1,0} = X[1:S:scale,\ 0:S:scale], \quad \ldots, \quad f_{scale-1,0} = X[scale-1:S:scale,\ 0:S:scale],\\
f_{0,1} &= X[0:S:scale,\ 1:S:scale], \quad f_{1,1} = X[1:S:scale,\ 1:S:scale], \quad \ldots, \quad f_{scale-1,1} = X[scale-1:S:scale,\ 1:S:scale],\\
&\ \ \vdots\\
f_{0,scale-1} &= X[0:S:scale,\ scale-1:S:scale], \quad \ldots, \quad f_{scale-1,scale-1} = X[scale-1:S:scale,\ scale-1:S:scale].
\end{aligned}
$$
Generally speaking, given any original feature map X, sub-map $f_{x,y}$ consists of all entries $X(i, j)$ for which $i + x$ and $j + y$ are divisible by $scale$. Figure 5 shows an example with $scale = 2$, where we obtain four sub-maps, $f_{0,0}$, $f_{1,0}$, $f_{0,1}$, and $f_{1,1}$, each of shape $(\tfrac{S}{2}, \tfrac{S}{2}, C_1)$. Next, we concatenate these sub-feature maps along the channel dimension to obtain a feature map $X'$ whose spatial dimensions are reduced by a factor of $scale$ and whose channel dimension is increased by a factor of $scale^2$. In other words, SPDConv transforms the feature map $X(S, S, C_1)$ into an intermediate feature map $X'(\tfrac{S}{scale}, \tfrac{S}{scale}, scale^{2} C_1)$.
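A minimal sketch of the space-to-depth step with scale = 2, assuming PyTorch; the 3 × 3 non-strided convolution and the channel counts in the usage line are illustrative choices, not the exact configuration used in SMP.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-depth slicing followed by a non-strided convolution (sketch, scale = 2)."""
    def __init__(self, c_in, c_out, scale=2):
        super().__init__()
        self.scale = scale
        self.conv = nn.Conv2d(c_in * scale * scale, c_out, kernel_size=3, padding=1)

    def forward(self, x):
        s = self.scale
        # gather the scale*scale interleaved sub-maps and stack them along channels
        subs = [x[..., i::s, j::s] for i in range(s) for j in range(s)]
        x = torch.cat(subs, dim=1)             # (B, s^2 * C, S/s, S/s)
        return self.conv(x)                    # reduce channels without discarding pixels

x = torch.randn(1, 64, 80, 80)
print(SPDConv(64, 128)(x).shape)               # torch.Size([1, 128, 40, 40])
```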
Then, OCSP was designed for feature integration based on the concepts of CSP and Omni-Kernel. The schematic diagram of its structure is shown in Figure 6, where Split represents the operation of channel separation, DConv stands for depthwise convolution, GAP stands for global average pooling, and FFT and IFFT denote fast Fourier transform and its inverse operation, respectively. We allocate 25% of the channels to the Omni-Kernel (OK) module by default to avoid the problem of excessive computation. This module employs three specialized branches: (1) a large-scale branch that models medium-to-long-range dependencies to address large-scale degradation, (2) a local branch that captures fine-grained pixel-level details for small-scale degradation mitigation, and (3) a global branch that establishes comprehensive semantic understanding to resolve image-wide degradation patterns. This multi-branch design enables hierarchical feature learning across different spatial scales.
In the large branch, a cheap depthwise convolution with a kernel size of K × K is applied to obtain a large receptive field. We also use 1 × K and K × 1 depthwise convolutions in parallel with the square convolution to harvest contextual information about stripe-shaped structures. The module is placed at the bottleneck position to avoid the significant computational overhead that large-kernel convolution would otherwise introduce. The peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) metrics increase as the kernel size is enlarged from K = 3 to K = 63 [32]; we ultimately chose K = 31 for the large branch to balance performance and complexity.
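The large branch can be sketched as three parallel depthwise convolutions whose outputs are summed, assuming PyTorch; K = 31 follows the choice stated above, while the residual addition and the summation of the branch outputs are assumptions for illustration rather than the exact Omni-Kernel implementation.

```python
import torch
import torch.nn as nn

class LargeBranch(nn.Module):
    """Parallel K x K, 1 x K and K x 1 depthwise convolutions (sketch, K = 31)."""
    def __init__(self, channels, k=31):
        super().__init__()
        self.square = nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
        self.row = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels)
        self.col = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels)

    def forward(self, x):
        # large square receptive field plus horizontal/vertical stripe context
        return x + self.square(x) + self.row(x) + self.col(x)

print(LargeBranch(32)(torch.randn(1, 32, 40, 40)).shape)   # torch.Size([1, 32, 40, 40])
```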
The global branch consists of a dual-domain channel attention module (DCAM) and a frequency-based spatial attention module (FSAM). First, given an input feature $X_{Global} \in \mathbb{R}^{C \times H \times W}$, the DCAM performs frequency channel attention (FCA) on $X_{Global}$:
$$X_{FCA} = \mathrm{IF}\big(\mathrm{F}(X_{Global}) \otimes W_{1\times 1}^{FCA}(GAP(X_{Global}))\big).$$
Above, $\mathrm{F}$ and $\mathrm{IF}$ are the fast Fourier transform and its inverse, respectively; $X_{FCA}$, $W_{1\times 1}$, and $GAP$ represent the output of FCA, a $1 \times 1$ convolutional layer, and global average pooling, respectively; and $\otimes$ represents the multiplication operation. These operations effectively refine the global features of the image. After being globally modulated in the spectral domain, the obtained features are further fed into the spatial channel attention module:
$$X_{DCAM} = X_{FCA} \otimes W_{1\times 1}^{SCA}(GAP(X_{FCA})).$$
$X_{DCAM}$ is the output of the DCAM. The DCAM only enhances dual-domain features in a coarse-grained channel manner. Then, the frequency-based attention module is applied in the spatial dimension to refine the spectrum at a fine-grained level, which is formally represented as
$$X_{FSAM} = \mathrm{IF}\big(\mathrm{F}(W_{1\times 1}^{1}(X_{DCAM})) \otimes W_{1\times 1}^{2}(X_{DCAM})\big).$$
$X_{FSAM}$ is the output of the FSAM. In this way, the model captures the frequency components needed for high-quality image reconstruction. In addition to the large and global branches, which capture large-scale receptive fields, a very simple but effective local branch was designed for local signal modulation using a $1 \times 1$ depthwise convolutional layer, whose effectiveness has been demonstrated in [34].
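A minimal sketch of the FCA step defined above, assuming PyTorch; real-valued channel weights from the 1 × 1 layer modulate the complex spectrum, and keeping only the real part after the inverse transform is an assumption of this sketch rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class FrequencyChannelAttention(nn.Module):
    """Channel-wise modulation of the Fourier spectrum (sketch of the FCA step)."""
    def __init__(self, channels):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.w = nn.Conv2d(channels, channels, kernel_size=1)    # plays the role of W_1x1

    def forward(self, x):
        attn = self.w(self.gap(x))                        # (B, C, 1, 1) channel weights
        spec = torch.fft.fft2(x, dim=(-2, -1))            # F(X)
        spec = spec * attn                                # spectral modulation per channel
        return torch.fft.ifft2(spec, dim=(-2, -1)).real   # IF(...), keep the real part

print(FrequencyChannelAttention(16)(torch.randn(2, 16, 32, 32)).shape)
```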

3.3. Lightweight GroupNorm Detail-Enhance Detection

At present, improvements to the YOLOv8 detection head either increase its parameter count to improve accuracy or sacrifice accuracy to make the model lightweight. We build on a lightweight detection head and improve it with DEConv; its structure is shown in Figure 7. We further use GNConv, which has been shown to be effective in improving accuracy [35], instead of regular convolution. The number of parameters is significantly reduced by using shared convolution: the proposed architecture reuses convolutional kernels across different branches and hierarchical levels, which substantially reduces model complexity. This design is augmented with multi-task supervision and specialized structural components that enable the shared features to adapt to diverse scene requirements, maintaining detection accuracy despite model compression. Furthermore, we introduce a Scale layer to dynamically adjust feature magnitudes, effectively resolving target-scale inconsistencies across different detection heads. DEConv comprises five convolutional layers (vanilla convolution (VC), central difference convolution (CDC), angular difference convolution (ADC), horizontal difference convolution (HDC), and vertical difference convolution (VDC)); vanilla convolution captures intensity-level information, while the difference convolutions enhance gradient-level information. Finally, given an input feature $F_{in}$, DEConv uses a reparameterization technique so that the output $F_{out}$ can be produced by a regular convolutional layer at the same computational cost and inference time:
$$F_{out} = \mathrm{DEConv}(F_{in}) = \sum_{i=1}^{5} F_{in} * K_i = F_{in} * \Big(\sum_{i=1}^{5} K_i\Big) = F_{in} * K_{cvt}$$
Above, $\mathrm{DEConv}(\cdot)$ represents the operation of the proposed DEConv; $K_i$ ($i = 1, \dots, 5$) represents the kernels of VC, CDC, ADC, HDC, and VDC; $*$ represents the convolution operation; and $K_{cvt}$ represents the transformed kernel, which combines the parallel convolution operations together.
Figure 8 visually illustrates the process of reparameterization. In the backward-propagation stage, the chain rule of gradient propagation is employed to update the kernel weights of the five parallel convolution operations independently. During the forward-propagation phase, these kernel weights remain fixed, and the transformed kernel weights are computed by element-wise addition at corresponding positions.
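The kernel-merging step can be sketched as follows, assuming PyTorch and five parallel 3 × 3 convolutions of identical shape; this is a minimal illustration of the re-parameterization identity above, not the exact DEConv implementation (which derives the difference-convolution kernels differently).

```python
import torch
import torch.nn as nn

def merge_parallel_convs(convs):
    """Fuse a list of nn.Conv2d layers with identical shapes into a single conv."""
    fused = nn.Conv2d(convs[0].in_channels, convs[0].out_channels,
                      convs[0].kernel_size, padding=convs[0].padding, bias=True)
    with torch.no_grad():
        fused.weight.copy_(sum(c.weight for c in convs))            # element-wise kernel sum
        fused.bias.copy_(sum(c.bias if c.bias is not None else 0 for c in convs))
    return fused

# usage: summing the five branch outputs equals one convolution with the summed kernel
convs = [nn.Conv2d(16, 16, 3, padding=1) for _ in range(5)]
x = torch.randn(1, 16, 32, 32)
train_out = sum(c(x) for c in convs)          # training-time parallel branches
fused = merge_parallel_convs(convs)           # inference-time single convolution
assert torch.allclose(train_out, fused(x), atol=1e-4)
```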

4. Experimental Results

4.1. Experimental Setup

We used an Intel Xeon Platinum 8352V CPU (Intel Corporation, Santa Clara, CA, USA) and an NVIDIA GeForce RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA), with 6 TB of memory and 32 GB of video memory. The operating system was Windows 11; Python 3.9.7 and PyTorch 1.11.0 formed the deep learning environment, and CUDA 11.3 was used to accelerate inference. Training was performed in batches, with each dataset divided into multiple batches and the batch size set to 32 or 16 depending on the dataset size. The learning rate was 0.01, the weight decay was 0.0005, the number of training epochs was 300, the optimizer was stochastic gradient descent (SGD), and the input image size was uniformly 640 × 640.
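For reference, the stated hyperparameters map roughly onto the Ultralytics training interface as sketched below; the dataset YAML path is hypothetical, and this is a sketch of an equivalent configuration rather than the authors' exact training script.

```python
from ultralytics import YOLO

# Hyperparameters follow the values reported above; "dior.yaml" is a hypothetical dataset config.
model = YOLO("yolov8n.yaml")
model.train(
    data="dior.yaml",        # hypothetical dataset YAML
    imgsz=640,
    epochs=300,
    batch=16,                # 16 or 32 depending on dataset size
    optimizer="SGD",
    lr0=0.01,
    weight_decay=0.0005,
)
```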
The dataset was partitioned into training (80%), validation (10%), and test (10%) sets following an 8:1:1 ratio. To ensure representative sampling, we conducted the following: (1) maintained balanced class distributions across all splits, (2) guaranteed sample independence through randomized index shuffling before partitioning, and (3) rigorously separated the data usage, employing the training set for model optimization, the validation set for hyperparameter tuning, and reserving the test set exclusively for final performance assessment.
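A sketch of the randomized 8:1:1 index split described above; the per-class balancing step is omitted for brevity, and `image_paths` is a hypothetical list of dataset samples.

```python
import random

def split_dataset(image_paths, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle indices, then cut them into train/val/test partitions."""
    idx = list(range(len(image_paths)))
    random.Random(seed).shuffle(idx)              # randomized index shuffling
    n_train = int(ratios[0] * len(idx))
    n_val = int(ratios[1] * len(idx))
    train = [image_paths[i] for i in idx[:n_train]]
    val = [image_paths[i] for i in idx[n_train:n_train + n_val]]
    test = [image_paths[i] for i in idx[n_train + n_val:]]
    return train, val, test
```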

4.2. Experimental Datasets

We validated the method on three datasets: DIOR [36], DOTA [37], and NWPU VHR-10 [38].
DIOR: The DIOR dataset is a large-scale benchmark dataset proposed by Northwestern Polytechnical University for object detection in optical remote sensing images. It contains 23,463 remote sensing images and 190,288 instances, divided into 20 categories.
DOTA: The DOTA dataset is an aerial image dataset jointly developed by Wuhan University and Huazhong University of Science and Technology. It contains 2806 remote sensing images with a total of 188,282 instances, divided into 15 categories.
NWPU VHR-10: The NWPU VHR-10 dataset was constructed by Northwestern Polytechnical University and consists of 650 positive examples and 150 negative examples (backgrounds); the latter do not contain any given object class, while the positive examples contain at least 1 instance, resulting in a total of 3651 target instances. It includes 10 categories. Figure 9 shows example images of the selected datasets.

4.3. Evaluation Metrics

We used mAP as the evaluation metric for the experiments; it is calculated based on average precision (AP), Precision (P), and Recall (R). P represents the proportion of correctly predicted positive instances among all instances predicted as positive, calculated as
$$P = \frac{TP}{TP + FP}$$
R represents the proportion of correctly predicted positive instances among all actual positive instances, calculated as
$$R = \frac{TP}{TP + FN}$$
In the above context, TP denotes the number of positive instances correctly predicted as positive, FP denotes the number of negative instances incorrectly predicted as positive (false positives), and FN denotes the number of positive instances incorrectly predicted as negative (false negatives).
During training, the trade-off between P and R is plotted as a P-R (Precision-Recall) curve, and AP is the area under the P-R curve for each target category. It is calculated as
$$AP = \int_{0}^{1} P(R)\, dR$$
mAP is the mean of the AP values over all target categories, calculated as
$$mAP = \frac{1}{n} \sum_{i=1}^{n} AP_i$$
Above, n represents the number of detected categories and i indexes the i-th category.
FPS (frames per second) was calculated by averaging the inference time over 1000 forward passes on an NVIDIA GeForce RTX 3090 GPU, with an input image size of 640 × 640.
$$FPS = \frac{1000}{TotalTime}$$
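FPS can be measured as sketched below, assuming a CUDA-capable PyTorch model; the warm-up count and explicit synchronization are assumptions of this sketch, added so that asynchronous kernel launches are not counted.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, n=1000, size=640, device="cuda"):
    """Average inference time over n forward passes on a dummy 640x640 input."""
    model.eval().to(device)
    x = torch.randn(1, 3, size, size, device=device)
    for _ in range(10):                  # warm-up iterations, excluded from timing
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n):
        model(x)
    torch.cuda.synchronize()
    total_time = time.time() - start     # TotalTime in seconds for n passes
    return n / total_time
```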

4.4. Analysis of Experimental Results

We compared CIMB-YOLO with its baseline, YOLOv8, to demonstrate the effectiveness of our model and recorded the experimental results in Table 1, Table 2 and Table 3. To maintain consistency with established YOLO benchmarks, we evaluate our model using both mAP@50 (single IoU threshold of 0.5) and mAP@50:95 (averaged across IoU thresholds from 0.5 to 0.95 at 0.05 intervals). Compared to the baseline, our method demonstrates consistent improvements across all metrics: +1.2%, +0.7%, and +1.7% in mAP@50 and +1.4%, +0.9%, and +0.5% in mAP@50:95, while simultaneously achieving a 14% reduction in parameter count. At the same time, the FPS remains within an acceptable range, ensuring real-time performance and demonstrating the superiority of our method.
Table 4, Table 5 and Table 6 show the performance of the baseline model and CIMB-YOLO in each category (the performance fluctuations of some small categories in the NWPU VHR-10 dataset, such as Basketball Court and Tennis Court, are mainly due to insufficient sample size). In almost all categories, compared with the baseline, CIMB-YOLOv8 shows higher precision, recall, and mAP@50. Figure 10 compares the P-R curves of CIMB-YOLO and YOLOv8n on the DIOR dataset, indicating that our model achieves a significant improvement in performance, particularly in categories such as ships, vehicles, helicopters, dams, and ports, which contain large numbers of dense small targets. The experimental results fully demonstrate the effectiveness of SMP. The SPDConv module downsamples the feature map without losing learnable information, fully utilizing the features of small targets. The OK module effectively learns feature representations from the global to the local level through its global, large, and local branches. At the same time, the LGDD detection head makes the entire model lightweight without losing accuracy. The CIMB-YOLOv8 algorithm effectively reduces false alarms and missed detections, greatly improving the detection accuracy for dense small targets in remote sensing images. The experimental results show that the algorithm captures complex details well and significantly improves detection accuracy in complex scenes.
Table 7 presents a comparative analysis of the proposed CIMB-YOLO algorithm and current mainstream object detection algorithms on the DIOR dataset. As shown in the table, compared with the two-stage Faster R-CNN algorithm, CIMB-YOLO performs better in mAP@50: detection accuracy is improved by 9.7%, the number of parameters is much lower, and the FPS is greatly improved. This suggests that the CIMB-YOLO algorithm achieves notable improvements in both detection accuracy and operational efficiency, rendering it better suited for deployment on resource-constrained devices. Compared with the one-stage detection algorithm SSD, there are significant improvements in mAP@50, parameter count, and FPS. Compared with other versions of YOLO and the newer DConvTransLGA model, our model also has significant advantages. This indicates that the proposed CIMB-YOLO model simultaneously improves accuracy and makes the model lightweight.
Figure 11 shows an example of inference by CIMB-YOLO on the DIOR dataset. It can be clearly seen from the figure that the confidence of our model when detecting targets is mostly around 0.9. Figure 12 shows the corresponding heatmaps, where the first row is the output of CIMB-YOLO and the second row is that of the YOLOv8 baseline model. The heatmap visualization reveals distinct performance characteristics of our model compared to the baseline. Three key observations emerge: (1) the darker coloration in high-value regions confirms stronger activation magnitudes in our model; (2) the spatial concentration of these high-value responses demonstrates precise spatial attention to target features; and (3) this focused activation pattern effectively suppresses background noise. These visual patterns not only validate our architectural design principles but also provide empirical evidence for the model's enhanced feature discrimination capability. Figure 13 shows the detection results of CIMB-YOLO on the DOTA and NWPU VHR-10 datasets.

4.5. Ablation Study

We conducted additional ablation experiments to evaluate the effectiveness and impact of integrating SMP and LGDD into the YOLOv8 model, with each module's inclusion indicated by a Y. The evaluation indicators include mAP, parameter count, FLOPs, and FPS, and the results are shown in Table 8. In Experiment 1, we integrated the SMP module to better capture important feature information. In Experiment 2, we introduced the LGDD module; the results show that it reduced computational complexity by 20% and the parameter count by about 23% without compromising accuracy, while keeping FPS essentially unchanged, demonstrating its effectiveness. In Experiment 3, we combined the two modules to compensate for the SMP module's increase in parameter count and computational complexity. Compared with YOLOv8, our model improved accuracy by 1.2% while reducing the parameter count by 14%. These ablation experiments demonstrate the importance of each module within the CIMB-YOLO framework, highlighting their complementarity and effectiveness in enhancing YOLOv8's performance.

5. Discussion and Conclusions

In the context of the increasing demand for lightweight models and precise object detection in remote sensing images, our method provides a feasible solution that balances accuracy and computational constraints. In this study, we propose CIMB-YOLO, a lightweight RSOD network designed to address these challenges. First, SMP was developed as an improvement on the original PAN-FPN; it extracts small-target-rich features for fusion and efficiently learns hierarchical feature representations across global-to-local scales, thereby enhancing small-target detection performance. Second, we designed a novel detection head, LGDD, to make the model lightweight and better suited to deployment requirements. Shared convolution allowed us to significantly reduce the number of parameters, which is especially important for implementation on resource-limited devices, and DEConv improves accuracy by enhancing the detail capture ability of the detection head. The experimental results show that the CIMB-YOLO algorithm performs well on the DIOR, DOTA, and NWPU VHR-10 datasets in terms of mAP@50: the recognition rates of the CIMB-YOLOv8 algorithm are 85.3%, 68.8%, and 82.9%, respectively, which are 1.2%, 0.7%, and 1.7% higher than those of the baseline YOLOv8 algorithm. The 14% reduction in model parameters is achieved through shared convolution, which eliminates redundant multi-scale convolutions while maintaining feature diversity. Compared with current mainstream algorithms, CIMB-YOLO outperforms other object detection methods in detection performance while having fewer parameters and lower computational complexity, which fully attests to the effectiveness of the algorithm.
While demonstrating promising results, this study has several limitations that merit discussion. First, although the SMP module enhances small-target detection capability, the 1.4% improvement in mAP@50:95, only slightly larger than the 1.2% gain in mAP@50, suggests that feature discrimination at higher IoU thresholds remains suboptimal. Second, while the LGDD head achieves a 14% parameter reduction through shared convolutions, its real-time inference efficiency on resource-constrained edge devices (<1 GB memory) remains unverified. Third, the current single-modal design limits applicability in the multi-modal sensing scenarios that increasingly dominate modern remote sensing applications (e.g., SAR-optical fusion for all-weather monitoring).
To advance this research, we propose two key development directions:
1. Hierarchical Feature Enhancement: We will design a multi-stage feature refinement network with adaptive receptive field control to improve localization precision across IoU thresholds, specifically targeting a >5% increase in mAP@50:95 performance. This module will incorporate boundary-aware attention and scale-adaptive feature fusion to address current limitations in high-IoU detection.
2. Ultra-Efficient Deployment Optimization: Building upon our parameter-efficient design, we will investigate hybrid compression techniques combining (i) quantization-aware training (8-bit fixed-point), (ii) attention-guided pruning, and (iii) neural architecture search to achieve sub-50MB memory footprint while maintaining <1% accuracy drop, enabling deployment on next-generation IoT edge devices.

Author Contributions

Methodology, Y.Z.; Data curation, S.L.; Writing—original draft, Y.Z.; Writing—review & editing, R.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Belcore, E.; Piras, M.; Pezzoli, A. Land Cover Classification from Very High-Resolution UAS Data for Flood Risk Mapping. Sensors 2022, 22, 5622. [Google Scholar] [CrossRef] [PubMed]
  2. Xu, S.; Song, L.; Yin, J.; Chen, Q.; Zhan, T.; Huang, W. MFFCI–YOLOv8: A Lightweight Remote Sensing Object Detection Network Based on Multiscale Features Fusion and Context Information. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 19743–19755. [Google Scholar] [CrossRef]
  3. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893. [Google Scholar] [CrossRef]
  4. Lowe, D. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, 20–27 September 1999; Volume 2, pp. 1150–1157. [Google Scholar] [CrossRef]
  5. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar] [CrossRef]
  6. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  7. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving Into High Quality Object Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar] [CrossRef]
  8. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  9. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2999–3007. [Google Scholar] [CrossRef]
  10. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
  11. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar] [CrossRef]
  12. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  13. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. Scaled-YOLOv4: Scaling Cross Stage Partial Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13029–13038. [Google Scholar]
  14. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. [Google Scholar] [CrossRef]
  15. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar] [CrossRef]
  16. Shi, W.; Zhang, S.; Zhang, S. CAW-YOLO: Cross-Layer Fusion and Weighted Receptive Field-Based YOLO for Small Object Detection in Remote Sensing. CMES—Comput. Model. Eng. Sci. 2024, 139, 3209–3231. [Google Scholar] [CrossRef]
  17. Xu, D.; Wu, Y. An Efficient Detector with Auxiliary Network for Remote Sensing Object Detection. Electronics 2023, 12, 4448. [Google Scholar] [CrossRef]
  18. Tang, S.; Zhang, S.; Fang, Y. HIC-YOLOv5: Improved YOLOv5 For Small Object Detection. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 6614–6619. [Google Scholar] [CrossRef]
  19. Li, Y.; Fan, Q.; Huang, H.; Han, Z.; Gu, Q. A modified YOLOv8 detection network for UAV aerial image recognition. Drones 2023, 7, 304. [Google Scholar] [CrossRef]
  20. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  21. Tang, Y.; Han, K.; Guo, J.; Xu, C.; Xu, C.; Wang, Y. GhostNetv2: Enhance cheap operation with long-range attention. Adv. Neural Inf. Process. Syst. 2022, 35, 9969–9982. [Google Scholar]
  22. Wan, D.; Lu, R.; Wang, S.; Shen, S.; Xu, T.; Lang, X. Yolo-hr: Improved yolov5 for object detection in high-resolution optical remote sensing images. Remote Sens. 2023, 15, 614. [Google Scholar] [CrossRef]
  23. Wang, G.; Chen, Y.; An, P.; Hong, H.; Hu, J.; Huang, T. UAV-YOLOv8: A small-object-detection model based on improved YOLOv8 for UAV aerial photography scenarios. Sensors 2023, 23, 7190. [Google Scholar] [CrossRef] [PubMed]
  24. Zhu, L.; Wang, X.; Ke, Z.; Zhang, W.; Lau, R.W. Biformer: Vision transformer with bi-level routing attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 10323–10333. [Google Scholar]
  25. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding box regression loss with dynamic focusing mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
  26. Yi, H.; Liu, B.; Zhao, B.; Liu, E. Small object detection algorithm based on improved YOLOv8 for remote sensing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 17, 1734–1747. [Google Scholar] [CrossRef]
  27. Chen, Z.; Feng, J.; Zhu, X.; Wang, B. YOLOv8-OCHD: A Lightweight Wood Surface Defect Detection Method Based on Improved YOLOv8. IEEE Access 2025, 13, 84435–84450. [Google Scholar] [CrossRef]
  28. Chen, Y.; Liu, Z. DFTD-YOLO: Lightweight Multi-Target Detection From Unmanned Aerial Vehicle Viewpoints. IEEE Access 2025, 13, 24672–24680. [Google Scholar] [CrossRef]
  29. Chung, M.A.; Chai, S.Y.; Hsieh, M.C.; Lin, C.W.; Chen, K.X.; Huang, S.J.; Zhang, J.H. YOLO-LSD: A Lightweight Object Detection Model for Small Targets at Long Distances to Secure Pedestrian Safety. IEEE Access 2025, 13, 83061–83070. [Google Scholar] [CrossRef]
  30. Sunkara, R.; Luo, T. No more strided convolutions or pooling: A new CNN building block for low-resolution images and small objects. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Grenoble, France, 19–23 September 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 443–459. [Google Scholar]
  31. Liu, W.; Hasan, I.; Liao, S. Center and Scale Prediction: Anchor-free Approach for Pedestrian and Face Detection. arXiv 2021, arXiv:1904.02948. [Google Scholar] [CrossRef]
  32. Cui, Y.; Ren, W.; Knoll, A. Omni-Kernel Network for Image Restoration. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 1426–1434. [Google Scholar]
  33. Chen, Z.; He, Z.; Lu, Z.M. DEA-Net: Single image dehazing based on detail-enhanced convolution and content-guided attention. IEEE Trans. Image Process. 2024, 33, 1002–1015. [Google Scholar] [CrossRef]
  34. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H. Restormer: Efficient Transformer for High-Resolution Image Restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 21–24 June 2022; pp. 5728–5739. [Google Scholar]
  35. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: A Simple and Strong Anchor-Free Object Detector. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 1922–1933. [Google Scholar]
  36. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  37. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-scale Dataset for Object Detection in Aerial Images. arXiv 2019, arXiv:1711.10398. [Google Scholar] [CrossRef]
  38. Cheng, G.; Zhou, P.; Han, J. Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2016, 54, 7405–7415. [Google Scholar] [CrossRef]
Figure 1. YOLOv8 workflow.
Figure 2. YOLOv8 algorithm framework. Detailed structures of the C2f, SPPF, and Conv modules (top-right).
Figure 3. PAN-FPN architecture.
Figure 4. CIMB-YOLO architecture.
Figure 5. Illustration of SPDConv with scale = 2.
Figure 6. The architecture of OCSP. FFT and IFFT denote fast Fourier transform and its inverse operation, respectively.
Figure 7. Structure of LGDD.
Figure 8. The process of the re-parameterization technique.
Figure 9. Example images of selected datasets.
Figure 10. Comparison of P-R curves.
Figure 11. Detection examples of CIMB-YOLO on the DIOR dataset.
Figure 12. Visualization of test results on the DIOR dataset.
Figure 13. Test results on the DOTA and NWPU VHR-10 datasets.
Table 1. Comparative experiment on the DIOR dataset.

Model      Precision  Recall  mAP@50  mAP@50:95  FLOPs  Param  FPS
YOLOv8n    87.4       78.2    84.1    61.6       8.1G   3.0M   144.7
CIMB-YOLO  88.0       80.1    85.3    63.0       10.3G  2.6M   118.7
Table 2. Comparative experiment on the DOTA dataset.

Model      Precision  Recall  mAP@50  mAP@50:95  FLOPs  Param  FPS
YOLOv8n    74.7       63.2    68.1    44.6       8.1G   3.0M   146.1
CIMB-YOLO  76.3       66.7    68.8    45.5       10.3G  2.6M   116.0
Table 3. Comparative experiment on the NWPU VHR-10 dataset.

Model      Precision  Recall  mAP@50  mAP@50:95  FLOPs  Param  FPS
YOLOv8n    81.5       75.0    81.2    51.2       8.1G   3.0M   149.3
CIMB-YOLO  86.0       78.4    82.9    51.7       10.3G  2.6M   114.5
Table 4. Comparison of YOLOv8n and CIMB-YOLO on the DIOR dataset across each category (Precision / Recall / mAP@50).

Category                 YOLOv8n               CIMB-YOLO
Airplane                 95.7 / 89.0 / 92.4    95.1 / 91.0 / 93.7
Airport                  83.2 / 80.1 / 85.3    84.9 / 85.8 / 89.9
Baseball field           95.9 / 90.0 / 94.3    95.0 / 93.1 / 95.9
Basketball court         94.9 / 85.7 / 90.6    93.8 / 87.6 / 92.1
Bridge                   77.9 / 45.8 / 55.8    77.7 / 48.4 / 59.4
Chimney                  96.4 / 85.6 / 90.6    96.2 / 85.3 / 91.4
Dam                      77.7 / 72.8 / 81.8    83.6 / 77.7 / 81.8
Expressway service area  91.3 / 90.5 / 95.9    94.7 / 94.7 / 96.2
Expressway toll station  92.4 / 70.8 / 82.6    94.7 / 76.3 / 86.0
Golf course              82.6 / 84.4 / 88.4    83.3 / 83.8 / 89.3
Ground track field       83.3 / 82.0 / 88.4    85.1 / 86.2 / 89.3
Harbor                   72.8 / 67.8 / 69.6    71.0 / 67.4 / 68.6
Overpass                 83.7 / 61.4 / 70.3    83.2 / 61.7 / 71.9
Ship                     86.1 / 87.0 / 85.8    84.6 / 84.6 / 85.2
Stadium                  92.4 / 91.3 / 96.3    93.8 / 94.0 / 95.7
Storage tank             93.2 / 78.8 / 86.8    93.5 / 80.2 / 87.4
Tennis court             95.2 / 93.3 / 96.4    95.1 / 94.7 / 96.9
Train station            68.3 / 69.4 / 70.5    69.8 / 69.4 / 73.0
Vehicle                  89.5 / 48.1 / 63.8    88.5 / 50.1 / 65.6
Wind mill                95.6 / 88.3 / 94.2    95.6 / 91.0 / 96.0
Table 5. Comparison of YOLOv8n and CIMB-YOLO on the DOTA dataset across each category (Precision / Recall / mAP@50).

Category            YOLOv8n               CIMB-YOLO
Small vehicle       76.3 / 63.2 / 68.1    59.3 / 73.9 / 68.9
Large vehicle       82.3 / 80.9 / 85.1    78.7 / 83.7 / 85.4
Plane               93.3 / 86.1 / 91.1    93.3 / 86.6 / 91.5
Storage tank        93.0 / 51.1 / 68.3    93.4 / 56.2 / 71.3
Ship                91.9 / 83.1 / 88.7    90.0 / 85.2 / 89.2
Harbor              82.0 / 80.9 / 84.3    81.6 / 82.0 / 83.6
Ground track field  69.8 / 52.2 / 59.3    68.9 / 57.3 / 63.5
Soccer ball field   67.4 / 50.4 / 54.5    64.3 / 54.4 / 54.5
Tennis court        94.0 / 88.8 / 93.7    94.6 / 88.8 / 93.7
Swimming pool       67.1 / 59.9 / 58.6    67.7 / 65.9 / 63.0
Baseball diamond    82.8 / 64.0 / 71.6    78.7 / 70.2 / 74.9
Roundabout          84.5 / 44.1 / 53.3    74.3 / 47.5 / 52.5
Basketball court    69.9 / 47.5 / 53.1    81.5 / 55.0 / 62.5
Bridge              68.8 / 34.0 / 41.7    62.5 / 41.7 / 43.1
Helicopter          31.4 / 51.5 / 34.3    33.7 / 56.0 / 50.9
Table 6. Comparison of YOLOv8n and CIMB-YOLO on the NWPU VHR-10 dataset across each category (Precision / Recall / mAP@50).

Category            YOLOv8n               CIMB-YOLO
Airplane            91.8 / 99.2 / 99.3    91.7 / 98.3 / 98.9
Ship                79.1 / 75.8 / 83.5    83.8 / 80.5 / 89.4
Storage tank        96.0 / 92.6 / 95.0    93.9 / 88.1 / 93.7
Baseball diamond    90.1 / 96.6 / 97.9    96.8 / 96.6 / 98.0
Tennis court        82.7 / 63.1 / 69.8    81.0 / 66.7 / 68.6
Basketball court    47.5 / 30.0 / 41.7    77.5 / 46.0 / 55.5
Ground track field  93.6 / 96.0 / 98.8    90.9 / 100 / 99.2
Harbor              83.8 / 93.8 / 95.6    87.0 / 96.9 / 97.6
Bridge              68.3 / 38.2 / 54.1    70.5 / 41.2 / 47.4
Vehicle             82.2 / 65.3 / 76.2    86.7 / 69.8 / 80.2
Table 7. The experimental results of different detectors on the DIOR dataset.

Model               mAP@50  Param   FPS    Precision  Recall
Faster R-CNN [6]    75.6    230M    10.6   78.2       72.1
SSD [8]             49.9    30M     20.8   65.3       45.6
YOLOv4 [13]         80.3    64M     53.4   82.7       77.8
YOLOv5              83.4    7M      76.8   86.1       80.2
YOLOv7 [15]         83.9    39M     20     85.3       81.7
YOLOv7-tiny         78.8    6M      278    81.4       75.3
YOLOv8n             84.1    3M      144.7  87.4       78.2
YOLOv8s             84.4    6M      87.1   87.1       81.5
YOLOv8m             85.3    22M     40     87.9       82.1
YOLOv8l             85.7    45M     22.6   88.3       82.8
YOLOv8x             85.8    80M     9.8    88.5       82.9
DConvTransLGA [27]  61.3    29.55M  16.7   70.2       58.4
CIMB-YOLO           85.3    2.6M    118.7  88.0       80.1
Table 8. Ablation experiment.

Methods  SMP  LGDD  mAP@50  FLOPs  Param  FPS    Precision  Recall
YOLOv8n  -    -     84.1    8.1G   3.0M   144.7  87.4       78.2
1        Y    -     85.8    11.8G  3.3M   119.8  87.6       78.2
2        -    Y     84.1    6.5G   2.3M   140.5  86.1       80.0
3        Y    Y     85.3    10.3G  2.6M   118.7  88.0       80.1
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
