Article

Real-Time Lightweight Vehicle Object Detection via Layer-Adaptive Model Pruning

1 School of Automotive Applications and Rail Transit, Hefei Technology College, Hefei 230012, China
2 Automotive and Transportation School, Tianjin University of Technology and Education, Tianjin 300355, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(21), 4149; https://doi.org/10.3390/electronics14214149
Submission received: 11 September 2025 / Revised: 15 October 2025 / Accepted: 17 October 2025 / Published: 23 October 2025
(This article belongs to the Special Issue Deep Learning-Based Object Detection and Tracking)

Abstract

With the rapid advancement in autonomous driving technology, vehicle object detection has become a crucial component of perception systems, where accuracy and inference speed directly influence driving safety. To address the limitations of existing lightweight detection models in small-object perception and deployment efficiency, this study proposes an enhanced YOLOv8n-based framework, termed YOLOv8n-ALM. The proposed model integrates Mixed Local Channel Attention (MLCA), a Task-Aligned Dynamic Detection Head (TADDH), and Layer-Adaptive Magnitude-based Pruning (LAMP). Specifically, MLCA enhances the representation of salient regions, TADDH aligns classification and regression tasks while leveraging DCNv2 for improved spatial adaptability, and LAMP compresses the network to accelerate inference. Experiments conducted on the KITTI dataset demonstrate that YOLOv8n-ALM improves mAP@0.5 by 2.2% and precision by 5.8%, while reducing parameters by 65.33% and computational load by 29.63%. These results underscore the proposed method’s capability to achieve real-time, compact, and accurate vehicle detection, demonstrating strong potential for deployment in intelligent vehicles and embedded systems.

1. Introduction

1.1. Research Background and Motivation

With the rapid advancement in intelligent driving technology, object detection—an essential component of environmental perception systems—plays a crucial role in vehicle path planning, obstacle avoidance, and overall driving safety. In particular, forward object detection serves as a core task in autonomous driving, as its performance directly determines the system’s ability to interpret complex road environments. Real-world traffic scenarios, characterized by the coexistence of vehicles, pedestrians, and non-motorized participants amid multi-scale variations, dynamic conditions, and frequent occlusions, impose stringent demands on detection models for higher precision, efficiency, and robustness.
Early object detection methods primarily relied on traditional computer vision techniques, typically utilizing handcrafted image features such as Histogram of Oriented Gradients (HOG) [1], Scale-Invariant Feature Transform (SIFT) [2], and Haar-like features [3], in combination with classifiers like Support Vector Machines (SVMs) [4] and ensemble regression trees. Wei et al. integrated HOG and Haar features for vehicle detection, while Papageorgiou et al. introduced a sliding-window framework with trainable detectors. The Deformable Part Model (DPM) [5], proposed by Felzenszwalb et al., also served as a representative approach for a period. However, these methods generally suffer from limited generalization, weak robustness, and slow detection speed, making them inadequate for handling complex object recognition tasks in real-world traffic environments.
In recent years, fueled by the rapid progress of deep learning and the widespread adoption of convolutional neural networks (CNNs), object detection has achieved remarkable breakthroughs. Two-stage detectors, represented by the R-CNN family, pioneered the integration of region proposal mechanisms with deep feature extraction, significantly improving detection accuracy. Subsequent algorithms, including Fast R-CNN [6], Faster R-CNN [7], and Mask R-CNN [8], have further advanced the field by maintaining high accuracy while enhancing inference efficiency. Among these, Faster R-CNN introduced the Region Proposal Network (RPN) to eliminate redundant computations, whereas Mask R-CNN incorporated an additional instance segmentation branch to achieve more fine-grained localization.
Although two-stage methods achieve outstanding accuracy, their complex architectures and high computational overhead make them impractical for applications requiring real-time performance, such as autonomous driving. In contrast, single-stage detection algorithms have attracted extensive attention for their end-to-end modeling and superior inference speed. The YOLO (You Only Look Once) series, as a representative example, reformulates object detection as a regression problem, enabling simultaneous classification and localization of multiple objects in a single forward pass, thereby substantially improving detection efficiency. YOLOv3 introduced multi-scale feature fusion, while YOLOv4 enhanced feature representation through CSPNet [9] and PANet [10]. Building upon these advances, YOLOv5 further streamlined the architecture and improved deployment adaptability, leading to its widespread adoption in industrial and transportation applications.
To address the limitations of the YOLO series in small-object detection, feature fusion, and model lightweighting, numerous researchers have proposed targeted enhancements across various YOLO architectures. Despite these advancements, existing YOLOv8-based improvement strategies still encounter several unresolved challenges: (1) difficulty in effectively capturing fine-grained small-object features in complex traffic environments; (2) limited adaptability of current model compression approaches, which often rely on fixed pruning ratios and struggle to maintain an optimal balance between efficiency and accuracy; and (3) insufficient interaction and spatial alignment between the classification and regression branches in detection heads, which restricts overall detection consistency.
To address the aforementioned challenges, this study adopts YOLOv8n as the baseline detection framework and proposes a lightweight enhanced algorithm, termed YOLOv8n-ALM, which integrates Mixed Local Channel Attention (MLCA) [11], a Task-Aligned Dynamic Detection Head (TADDH) [12], and Layer-Adaptive Magnitude-based Pruning (LAMP) [13].
The main contributions of this work are summarized as follows:
Theoretical contributions:
  • We design a Mixed Local Channel Attention (MLCA) mechanism that integrates local structural features with global contextual semantics, thereby enhancing small-object representation under multi-scale variations.
  • We construct a Task-Aligned Dynamic Detection Head (TADDH) that employs Deformable Convolutional Networks v2 (DCNv2) to achieve adaptive spatial alignment and facilitate cross-task feature interaction between classification and regression.
  • We propose a Layer-Adaptive Magnitude-based Pruning (LAMP) strategy that performs global channel ranking and adaptive structural compression, effectively balancing model efficiency and accuracy preservation.
Practical contributions:
  • The proposed YOLOv8n-ALM achieves a 65.3% reduction in parameters and a 29.6% decrease in computational load, while improving mAP@0.5 by 2.2%, thereby validating its effectiveness for real-time deployment.
  • Real-vehicle experiments conducted on the PIX-Hooke autonomous driving platform demonstrate strong robustness, stability, and engineering feasibility under complex driving conditions.
  • For a detailed comparative analysis of representative methods, please refer to Section 1.2 and Table 1.

1.2. Related Works and Research Gaps

Recent studies have substantially advanced YOLO-based lightweight object detection models from three main perspectives: attention enhancement, multi-scale feature fusion, and model compression or pruning. Representative efforts include the integration of multi-scale fusion attention modules, coordinate attention mechanisms, and lightweight pruning strategies to achieve a balance between detection accuracy and computational efficiency. However, owing to variations in datasets, experimental settings, and evaluation metrics across these studies, direct numerical comparison of detection accuracy is not meaningful. Therefore, Table 1 provides a summary of representative methods, emphasizing their technical approaches and existing limitations, thereby clarifying the motivation and rationale behind this research.
As summarized in Table 1, existing YOLO-based approaches typically focus on optimizing a single aspect—such as feature fusion or model pruning—while overlooking the synergistic impact of these techniques on the balance between real-time performance and detection accuracy. The proposed YOLOv8n-ALM bridges this gap by integrating attention enhancement, task-aligned detection, and adaptive pruning into a unified framework, thereby enhancing robustness and deployment practicality in autonomous driving scenarios.

2. Development and Improvement of Forward Vehicle Detection Model

2.1. Network Architecture

YOLOv8n, the most lightweight model in the YOLOv8 series, is distinguished by its compact architecture and rapid inference speed, making it particularly suitable for real-time deployment on edge devices. Building upon this lightweight foundation, the present study introduces a series of architectural optimizations—illustrated in Figure 1—focusing on two primary aspects: (1) Replacing the original decoupled detection head with a Task-Aligned Dynamic Detection Head (TADDH). By incorporating dynamic convolution and deformable feature alignment mechanisms, the TADDH facilitates effective coordination between the classification and regression branches, thereby enhancing both localization accuracy and category recognition. (2) Integrating the Mixed Local Channel Attention (MLCA) module at the tail of the backbone network. This module strengthens the network’s ability to extract and emphasize salient features while suppressing redundant background information, ultimately improving overall detection performance. These structural refinements preserve the lightweight characteristics of YOLOv8n while significantly enhancing its robustness and applicability in complex traffic environments.
Figure 1. Overall architecture of the improved YOLOv8n-ALM. The backbone/neck feeds the Detection Head (TADDH) at three scales (Detect-S/M/L, strides 8/16/32; P3/P4/P5). Per-scale head details are shown in Figure 2, and the TADDH aligner used inside each branch is expanded in Figure 3.
Figure 2. Baseline YOLOv8n detection head (for reference). The three per-scale branches (Head-S/M/L) correspond one-to-one to Detect-S/M/L in Figure 1; the proposed TADDH module used inside each branch is detailed in Figure 3. cls.loss and bbox.loss are defined in Equations (1) and (2); the DFL term is omitted for clarity.
Figure 3. TADDH (task-aligned dynamic detection head) module. It takes multi-scale inputs (P3/P4/P5) from the neck and, inside each per-scale head branch (Head-S/M/L, Figure 2), produces task-aligned features for classification and regression via DCNv2-based spatial alignment and interaction.

2.2. Task-Aligned Dynamic Detection Head

As illustrated in Figure 1, the improved architecture adopts a three-scale detection head. Figure 2 depicts the baseline YOLOv8n detection head with per-scale branches (Head-S/M/L), corresponding to Detect-S/M/L in Figure 1, while Figure 3 further details the Task-Aligned Dynamic Detection Head (TADDH) module utilized within each per-scale branch to perform task alignment between classification and regression.
The original YOLOv8 detector employs a decoupled architecture [19], in which object classification and bounding-box regression tasks are modeled independently, effectively alleviating the task conflicts observed in earlier coupled detectors. Although this design improves detection accuracy, it also increases the number of parameters and computational complexity. Furthermore, the independent formulation of the classification and localization branches restricts deep feature interaction, thereby limiting the model’s ability to represent complex objects.
Before discussing Figure 2, we concisely define the two losses referenced in its caption to avoid ambiguity. Let P be the set of positive samples. For i ∈ P, z_i ∈ ℝ^C are the class logits, y_i ∈ {0, 1}^C is the one-hot target, b_i is the predicted box, and b_i* is the matched ground truth; σ(·) denotes the sigmoid function.
L_{\mathrm{cls}} = -\frac{1}{|P|} \sum_{i \in P} \sum_{c=1}^{C} \left[ y_{i,c} \log \sigma(z_{i,c}) + (1 - y_{i,c}) \log\bigl(1 - \sigma(z_{i,c})\bigr) \right] \quad (1)
L_{\mathrm{bbox}} = \frac{1}{|P|} \sum_{i \in P} \bigl( 1 - \mathrm{CIoU}(b_i, b_i^{*}) \bigr) \quad (2)
With these definitions in place, we now return to the baseline detection head shown in Figure 2, in which these loss terms are labeled.
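As a concrete illustration of Equations (1) and (2), the sketch below computes both losses for a batch of positive samples in PyTorch. It is a minimal, illustrative version, assuming (x1, y1, x2, y2) box format and using torchvision's CIoU loss; the tensor names are not taken from the YOLOv8 code base.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import complete_box_iou_loss  # returns 1 - CIoU

def detection_losses(cls_logits, cls_targets, pred_boxes, gt_boxes):
    """Illustrative computation of Eqs. (1) and (2) over the positive samples.

    cls_logits : (P, C) class logits z_i for the P positive samples
    cls_targets: (P, C) one-hot targets y_i
    pred_boxes : (P, 4) predicted boxes b_i in (x1, y1, x2, y2) format
    gt_boxes   : (P, 4) matched ground-truth boxes b_i*
    """
    # Eq. (1): per-sample BCE summed over classes, then averaged over positives
    bce = F.binary_cross_entropy_with_logits(cls_logits, cls_targets, reduction="none")
    l_cls = bce.sum(dim=1).mean()
    # Eq. (2): 1 - CIoU between each predicted box and its matched ground truth
    l_bbox = complete_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
    return l_cls, l_bbox

# usage with random stand-in tensors (4 positives, 3 classes)
l_cls, l_bbox = detection_losses(torch.randn(4, 3), torch.eye(3)[[0, 1, 2, 0]],
                                 torch.tensor([[0., 0., 10., 10.]] * 4),
                                 torch.tensor([[1., 1., 11., 11.]] * 4))
```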
To overcome these limitations, this study introduces a Task-Aligned Dynamic Detection Head (TADDH), as illustrated in Figure 3. The TADDH structure comprises two task-specific branches: conv-cls for classification and conv-reg for regression. It receives multi-scale feature maps (P3, P4, P5) from the YOLOv8 neck, which first pass through a shared convolutional module to extract common spatial features. The resulting feature representations are subsequently separated by a task-decomposition block into two branches—conv-cls and conv-reg.
The conv-cls branch applies a standard convolution followed by a lightweight channel attention module to adaptively reweight feature channels, enhance category-specific responses, and suppress background interference. The conv-reg branch employs dynamic convolution combined with DCNv2 [20], guided by a generator-mask and offset module, which generates spatial modulation parameters to adjust receptive fields and sampling positions. Both branches share the same input features and exchange contextual information through a task-interaction mechanism, thereby maintaining alignment between classification and regression.
Building upon this structure, DCNv2 is further integrated into the regression branch to enhance spatial alignment and adaptive sampling. The generator mask and offset module produce learnable parameters that guide the deformable convolution kernels to adjust their sampling positions according to object geometry, thereby improving localization precision and robustness under scale variations. The overall processing flow—shared convolution → task decomposition → task-specific branches (conv-cls & conv-reg) with channel attention/DCNv2 (+mask & offset) → cross-task interaction → detection output—is summarized in Figure 3. This sequential workflow illustrates the data flow and functional relationships among the modules, providing a clear understanding of how information propagates through the TADDH.
As illustrated in Figure 3, the TADDH architecture integrates dynamic convolution modules and deformable convolution (DCNv2), enabling classification and regression features to be modeled both independently and through shared information channels. This design effectively mitigates spatial misalignment introduced by task decoupling and enhances boundary perception, thereby achieving higher fitting accuracy in multi-scale object regression scenarios. Within the TADDH framework, the regression branch adopts DCNv2 (Deformable Convolutional Networks v2) for spatial alignment. DCNv2 extends standard convolution by introducing learnable offsets and modulation masks, allowing the convolution kernel to adapt its sampling region flexibly according to the target’s shape and position. This mechanism substantially improves the model’s adaptability to complex object structures while maintaining end-to-end training stability. Compared with conventional static convolutions, DCNv2 provides clear advantages in bounding-box fitting accuracy. The convolution operation of DCNv2 can be formally expressed as:
Y(p) = \sum_{k} W_k \cdot X(p + p_k + \Delta p_k) \cdot M_k \quad (3)
Here, Y(p) denotes the output feature response at position p, W_k represents the convolution kernel weights, p_k is the standard sampling position, Δp_k is the learned offset, and M_k is the learnable modulation mask. This formulation simultaneously adjusts the location and contribution of sampling points through offset and mask learning, thereby enabling a more flexible perception of target regions and enhancing the representational capacity of feature extraction.
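To make Equation (3) concrete, the sketch below applies a modulated deformable convolution with torchvision's deform_conv2d, where a small convolution predicts the offsets Δp_k and masks M_k from the input feature map. The layer sizes and the P3-level example shape are illustrative assumptions, not the exact TADDH configuration.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class ModulatedDeformConv(nn.Module):
    """DCNv2-style convolution: learnable offsets Δp_k and modulation masks M_k (Eq. 3)."""
    def __init__(self, in_ch, out_ch, k=3, padding=1):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_ch, in_ch, k, k))
        nn.init.kaiming_uniform_(self.weight, a=5 ** 0.5)
        # generator for offsets (2 values per sampling point) and masks (1 value per point)
        self.offset_mask = nn.Conv2d(in_ch, 3 * k * k, kernel_size=k, padding=padding)
        nn.init.zeros_(self.offset_mask.weight)
        nn.init.zeros_(self.offset_mask.bias)
        self.k, self.padding = k, padding

    def forward(self, x):
        om = self.offset_mask(x)
        offset = om[:, : 2 * self.k * self.k]          # Δp_k for every kernel position
        mask = torch.sigmoid(om[:, 2 * self.k * self.k :])  # modulation scalars M_k in (0, 1)
        return deform_conv2d(x, offset, self.weight, padding=self.padding, mask=mask)

# usage: align a P3-level feature map (batch 1, 256 channels, 80 x 80)
feat = torch.randn(1, 256, 80, 80)
aligned = ModulatedDeformConv(256, 256)(feat)  # -> (1, 256, 80, 80)
```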
Through this dynamic feature modeling process, the TADDH architecture enables the classification and regression branches to perform specialized feature extraction and optimization tailored to their respective task requirements. As a result, cross-task interference is effectively mitigated, leading to improved overall detection consistency and accuracy. Moreover, the feature alignment mechanism equips the model with stronger spatial adaptability and discriminative capability when handling vehicle targets exhibiting pronounced scale variations and complex shapes.
Compared with the original detection head (Figure 2), the TADDH enhances representational capacity and spatial alignment while preserving a lightweight structure. By incorporating deformable convolution and cross-task feature interaction, it achieves higher localization precision and better consistency between classification and regression tasks.

2.3. Mixed Local Channel Attention

To enhance the model’s ability to focus on critical regions, this study integrates a Mixed Local Channel Attention (MLCA) module into the YOLOv8n backbone. As illustrated in Figure 4, the MLCA fuses fine-grained local feature details with broader global contextual information. By combining local and global channel attention mechanisms, it guides the network to emphasize salient feature regions while suppressing irrelevant background responses, thereby improving overall detection performance.
The MLCA module enhances local–channel interactions by dynamically weighting multi-level features. Specifically, let X ∈ ℝ^(C×H×W) denote the input feature map, where C, H, and W represent the number of channels, height, and width, respectively. The MLCA module performs channel-wise attention and local feature fusion along the channel (C) dimension, as illustrated in Figure 4.
First, local average pooling (LAP) and global average pooling (GAP) are applied to the input feature map to extract local structural details and global statistical characteristics, respectively. The outputs from both pooling paths are subsequently processed with one-dimensional convolution (Conv1D) for channel compression and reshaping, preparing them for the following fusion operations. Next, the feature map from the local path is fused with the original feature map through element-wise multiplication, explicitly reweighting key channels to emphasize informative features. The global path is then combined with the locally enhanced feature map via element-wise addition, introducing contextual information that strengthens the model’s overall semantic representation. Finally, the input of the UNAP module corresponds to the fused feature maps obtained after the local and global attention operations. Each feature map maintains the same channel dimension (C = 256) as the backbone output, while its spatial resolution has been reduced by pooling. The UNAP (Unpooling) operation restores these features to their original spatial resolution (H × W, e.g., 80 × 80 for the P3 level) while keeping the channel dimension constant. The output feature map is therefore aligned with the YOLOv8 backbone feature size, enabling direct integration with the subsequent detection head. This design preserves both local spatial detail and global contextual information, ensuring consistency between attention-enhanced features and the base network representations.
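The sketch below mirrors the processing order just described (LAP/GAP → Conv1D along the channel dimension → multiplicative local reweighting → additive global context → restoration to H × W). The local pooling size and the Conv1D kernel size are illustrative assumptions, not values taken from the original MLCA implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLCA(nn.Module):
    """Mixed Local Channel Attention sketch: local + global channel attention, then unpooling."""
    def __init__(self, channels, local_size=5, k=3):
        super().__init__()
        self.local_size = local_size
        self.lap = nn.AdaptiveAvgPool2d(local_size)   # local average pooling (LAP)
        self.gap = nn.AdaptiveAvgPool2d(1)            # global average pooling (GAP)
        # 1-D convolutions act along the channel dimension (ECA-style)
        self.conv_local = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.conv_global = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        b, c, h, w = x.shape
        # local path: (B, C, s, s) -> per-cell 1-D conv across channels
        local = self.lap(x).flatten(2).transpose(1, 2).reshape(-1, 1, c)      # (B*s*s, 1, C)
        local = torch.sigmoid(self.conv_local(local))
        local = local.reshape(b, -1, c).transpose(1, 2).reshape(b, c, self.local_size, self.local_size)
        # global path: (B, C, 1, 1) -> 1-D conv across channels
        glob = self.gap(x).flatten(2).transpose(1, 2)                         # (B, 1, C)
        glob = torch.sigmoid(self.conv_global(glob)).transpose(1, 2).unsqueeze(-1)  # (B, C, 1, 1)
        # restore the pooled local attention map to the input resolution (unpooling step)
        local_up = F.interpolate(local, size=(h, w), mode="nearest")
        # multiplicative local reweighting, then additive global context (broadcast over H x W)
        return x * local_up + glob
```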
Compared with traditional attention mechanisms such as SE and CBAM, the proposed MLCA achieves more efficient and comprehensive fusion of local structural information and global contextual semantics within a lightweight architecture. It enhances feature selection and spatial response through a multiplicative attention mechanism, while its one-dimensional convolution design compresses the channel dimension without disrupting the spatial structure. This design enables the model to exhibit stronger feature representation and discriminative capability, particularly in scenarios involving dense small objects and complex backgrounds. Experimental results demonstrate that integrating MLCA leads to substantial improvements in detection accuracy and model stability, with only a negligible increase in parameters, making it highly suitable for deployment in computationally constrained real-world applications.

2.4. Pruning Operation

Convolutional neural networks (CNNs) have demonstrated remarkable performance in various vision tasks such as object detection. However, their high computational cost, slow inference speed, and substantial redundancy limit their applicability in resource-constrained environments, such as in-vehicle terminals. To address these limitations, this study employs Layer-Adaptive Magnitude-based Pruning (LAMP) to perform global channel pruning on the enhanced YOLOv8n model, thereby reducing model size and accelerating inference. Unlike conventional fixed-ratio pruning strategies, LAMP computes importance scores for channel weights in each layer, conducts global channel ranking and selection across the entire network, and adaptively determines the retention ratio for each layer. This approach effectively preserves critical feature information while minimizing accuracy degradation. The scoring function is defined as follows:
\mathrm{score}(u; W) = \frac{W(u)^2}{\sum_{v \ge u} W(v)^2} \quad (4)
Here, u and v denote channel indices sorted in ascending order of weight magnitude, and W represents the channel weight vector; the denominator therefore sums the squared weights of all channels whose magnitude is no smaller than that of channel u, so the largest-magnitude connection in each layer receives a score of 1. The computed importance scores from Equation (4) are used to rank channels within each layer. Channels with lower scores are regarded as less significant and are progressively pruned according to the global pruning ratio defined in the LAMP strategy. In this manner, Equation (4) directly guides the adaptive pruning process by determining which channels are preserved or removed.
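To make Equation (4) concrete, the following sketch computes LAMP scores for flattened weight tensors and selects a global keep mask. It is a simplified per-weight illustration of the layer-adaptive scoring idea under an assumed global sparsity level, not the exact channel-level procedure used in this study.

```python
import torch

def lamp_scores(weight: torch.Tensor) -> torch.Tensor:
    """Per-weight LAMP scores (Eq. 4) for one layer's weight tensor."""
    w2 = weight.detach().flatten() ** 2
    sorted_w2, order = torch.sort(w2)                      # ascending magnitude
    # denominator: sum of squared weights with magnitude >= the current one (suffix sums)
    suffix = torch.flip(torch.cumsum(torch.flip(sorted_w2, [0]), 0), [0])
    scores_sorted = sorted_w2 / suffix.clamp_min(1e-12)    # largest weight gets score 1
    scores = torch.empty_like(scores_sorted)
    scores[order] = scores_sorted                          # undo the sort
    return scores.view_as(weight)

def global_prune_mask(weights: dict, sparsity: float = 0.5) -> dict:
    """Rank all scores globally and keep the top (1 - sparsity) fraction of weights."""
    all_scores = torch.cat([lamp_scores(w).flatten() for w in weights.values()])
    threshold = torch.quantile(all_scores, sparsity)
    return {name: lamp_scores(w) > threshold for name, w in weights.items()}

# usage with hypothetical layer weights
layers = {"conv1": torch.randn(16, 3, 3, 3), "conv2": torch.randn(32, 16, 3, 3)}
masks = global_prune_mask(layers, sparsity=0.6)
```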
Compared with traditional pruning strategies, LAMP offers several distinct advantages: (1) It introduces no additional hyperparameters, as pruning is performed through basic tensor operations, ensuring computational simplicity. By retaining at least one connection with a score of 1 in each layer, it guarantees structural stability and prevents layers from being completely emptied. (2) It incurs negligible computational overhead, making it suitable for large-scale model compression. (3) It enables global channel pruning by automatically removing the lowest-scoring channels through adaptive sparsity constraints, thereby reducing redundancy across the network. In summary, the LAMP-based pruning mechanism achieves substantial reductions in both parameter count and computational complexity while maintaining model performance, providing an efficient and practical solution for structural compression.

3. Dataset and Evaluation Metrics

3.1. Data Augmentation

To evaluate the performance of the proposed model, this study employs the KITTI dataset, a widely used benchmark in autonomous driving research [21]. The dataset contains several representative traffic object categories—Car, Pedestrian, and Cyclist—and is characterized by complex backgrounds and multi-scale object distributions, providing both realism and high difficulty. To account for real-world variability such as illumination changes, occlusions, and diverse viewing angles, multiple data augmentation strategies are applied to the training set to enhance sample diversity and improve model robustness and generalization. These augmentations include brightness adjustment, image flipping, rotation, color jittering, random cropping, and noise perturbation. Examples of the augmented samples are shown in Figure 5. In total, the dataset comprises 7481 images, divided into training, validation, and test sets in an 8:1:1 ratio, covering a broad range of representative traffic scenarios.
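A minimal augmentation pipeline covering the listed transformations is sketched below with torchvision. The parameter ranges are illustrative assumptions, and bounding-box-aware variants would be required for actual detection training, since the boxes must be transformed together with the images.

```python
import torch
from torchvision import transforms

# illustrative ranges only; real detection training must also transform the boxes
augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.2, saturation=0.2, hue=0.05),  # brightness / color jitter
    transforms.RandomHorizontalFlip(p=0.5),                                          # image flipping
    transforms.RandomRotation(degrees=10),                                           # small rotations
    transforms.RandomResizedCrop(size=640, scale=(0.8, 1.0)),                        # random cropping
    transforms.ToTensor(),
    transforms.Lambda(lambda x: (x + 0.02 * torch.randn_like(x)).clamp(0.0, 1.0)),   # Gaussian noise perturbation
])
```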

3.2. Evaluation Criteria for Network Models

This study employs widely used evaluation metrics for object detection, including mAP@0.5, mAP@0.5:0.95, Precision (P), Recall (R), the number of parameters (Params), model size, and giga floating-point operations (GFLOPs). Among these metrics, mAP represents the mean average precision across all categories and serves as a comprehensive indicator of overall detection performance, whereas GFLOPs measures computational complexity. The corresponding calculation formulas for these metrics are provided in Equations (5)–(8).
P = \frac{TP}{TP + FP} \quad (5)

R = \frac{TP}{TP + FN} \quad (6)

AP = \int_{0}^{1} P \, dR \quad (7)
In practice, the integral in Equation (7) is implemented as a discrete summation over precision–recall (PR) points derived from the detection results. The average precision (AP) value is computed by interpolating precision at uniformly sampled recall thresholds, in accordance with the standard PASCAL VOC evaluation protocol.
mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i \quad (8)
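Following the discrete interpolation described above, a minimal AP computation over 11 uniformly sampled recall thresholds (the classic PASCAL VOC protocol) could look as follows; the recall and precision arrays are assumed to be precomputed from detections sorted by descending confidence.

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """11-point interpolated AP (Eq. 7): mean of the max precision at recall >= t."""
    ap = 0.0
    for t in np.linspace(0.0, 1.0, 11):
        mask = recall >= t
        p = precision[mask].max() if mask.any() else 0.0
        ap += p / 11.0
    return ap

def mean_average_precision(per_class_pr: dict) -> float:
    """mAP (Eq. 8): average AP over the N categories."""
    return float(np.mean([average_precision(r, p) for r, p in per_class_pr.values()]))
```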

4. Experiments and Results Analysis

To ensure experimental reproducibility and fairness in performance evaluation, all training and pruning experiments were conducted within a unified software and hardware environment. The corresponding experimental platform specifications are summarized in Table 2. During training, the Stochastic Gradient Descent (SGD) optimizer was employed with an input image size of 640 × 640. In the pruning phase, the LAMP method was applied to perform global channel pruning. Detailed parameter configurations are presented in Table 3.
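Under the configuration summarized in Tables 2 and 3, a training run with the Ultralytics API could be launched as sketched below. The dataset YAML path and the use of the baseline model YAML are assumptions specific to this illustration; the MLCA/TADDH variant would start from a custom model definition.

```python
from ultralytics import YOLO

# train with the Table 3 settings (SGD, 640 x 640, 150 epochs, batch 32, weight decay 0.0005)
model = YOLO("yolov8n.yaml")          # baseline config; the improved variant would use a custom YAML
model.train(
    data="kitti.yaml",                # hypothetical dataset description file (8:1:1 split)
    epochs=150,
    imgsz=640,
    batch=32,
    optimizer="SGD",
    weight_decay=0.0005,
)
```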

4.1. Attention Mechanism Comparison Experiment

To evaluate the practical effectiveness of the proposed MLCA attention mechanism in object detection tasks, this study integrates several attention modules—CAFM [22], SimAM [23], ECA [24], MPCA [25], and MLCA—into the backbone of the YOLOv8n model. Comparative experiments were conducted under consistent training configurations, and the results are summarized in Table 4.
To provide a more intuitive illustration of the impact of different attention mechanisms on various performance metrics, performance trend curves were plotted based on the data presented in Table 4, as shown in Figure 6. The results indicate that all attention modules improve model accuracy to varying degrees, with the MLCA module achieving the best performance in both subfigures—mAP@0.5 (a) and mAP@0.5:0.95 (b)—thereby further validating the effectiveness of its structural design.
As shown in Figure 6 and Table 4, the MLCA module delivers the most substantial overall improvement in detection performance without increasing the number of parameters, raising YOLOv8n’s mAP@0.5 by 1.1 percentage points and mAP@0.5:0.95 by 0.7 percentage points. These results demonstrate that MLCA not only provides strong channel and spatial feature fusion capabilities but also exhibits considerable potential for broader applications.

4.2. Ablation Experiment

To comprehensively assess the individual contributions of each module to detection performance, four experimental configurations were designed. These configurations progressively incorporated the MLCA, TADDH, and LAMP pruning strategies to evaluate their respective impacts on model accuracy, computational complexity, and inference efficiency. The corresponding results are summarized in Table 5.
It can be observed that incorporating MLCA increased the mAP by 1.1 percentage points, while the addition of TADDH further improved it to 88.0%. After applying LAMP for global channel pruning, the number of parameters was reduced to 1.04 million, and the inference cost (GFLOPs) decreased by more than 50%, with only negligible variations in accuracy. These results highlight the significant lightweighting effect of the proposed approach and validate the effectiveness of the structural optimization strategy.
To visually demonstrate the compression effect of the LAMP pruning strategy on the YOLOv8n network architecture, this study plots the variation in channel counts across layers before and after pruning, as shown in Figure 7.
As illustrated in Figure 7, the LAMP pruning mechanism adaptively removes channels based on their importance across different layers. Critical layers—such as the intermediate trunk layers and the detection head—preserve a larger number of channels, whereas shallow and redundant feature layers are substantially compressed. This approach effectively reduces the overall parameter count and computational complexity of the model. By performing pruning in an adaptive, demand-driven manner, the strategy achieves an optimal balance between model performance and lightweight design, offering greater flexibility than conventional fixed-ratio pruning methods.
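The per-layer channel statistics plotted in Figure 7 can be gathered with a short inspection loop like the one below, comparing the Conv2d output channels of the dense and pruned models. The checkpoint file names are placeholders used only for this illustration.

```python
import torch.nn as nn
from ultralytics import YOLO

def conv_channels(model: nn.Module) -> list:
    """Output-channel count of every Conv2d layer, in definition order."""
    return [m.out_channels for m in model.modules() if isinstance(m, nn.Conv2d)]

# hypothetical checkpoint paths for the dense and LAMP-pruned models
dense = conv_channels(YOLO("yolov8n_alm_dense.pt").model)
pruned = conv_channels(YOLO("yolov8n_alm_pruned.pt").model)
for i, (before, after) in enumerate(zip(dense, pruned)):
    print(f"layer {i:3d}: {before:4d} -> {after:4d} channels")
```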
To more clearly illustrate the trade-offs between accuracy and computational complexity under different module combinations, a performance comparison diagram is presented in Figure 8. The observed trends are consistent with the results in Table 5, indicating that the combination of MLCA and TADDH achieves the best detection performance. Meanwhile, the application of pruning significantly reduces model complexity while largely preserving accuracy, thereby confirming the practicality and effectiveness of the proposed module integration strategy.
To evaluate the changes in detection accuracy on the validation set after model pruning, Figure 9 illustrates the trends of mAP@0.5 (a) and mAP@0.5:0.95 (b) as the number of training iterations increases. In the early training phase, the curves exhibit noticeable fluctuations and occasional “jump-like” increases. This behavior occurs because the model’s feature representation capability is temporarily weakened after pruning, requiring a period of adaptation and recovery through subsequent training cycles—an expected stage in the retraining convergence process. As training progresses, both metrics gradually stabilize and converge, demonstrating that the model maintains strong accuracy recovery and generalization capability even after lightweight compression.
The loss functions adopted in this study follow the standard YOLOv8 implementation. Specifically, the bounding box regression loss (bbox.loss) is computed using the Complete IoU (CIoU) formulation to measure both spatial overlap and distance consistency between predicted and ground-truth boxes. The classification loss (cls.loss) employs the binary cross-entropy (BCE) function to optimize object category prediction, whereas the distributional focal loss (dfl.loss) refines bounding box localization by modeling the quality distribution of predicted coordinates. These components are jointly optimized during training to achieve a balanced trade-off between localization precision and classification accuracy.
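For completeness, the distributional focal loss mentioned above can be written as in the sketch below, where each box coordinate is modeled as a discrete distribution over integer bins and the loss interpolates the cross-entropy between the two bins adjacent to the continuous target. This follows the standard generalized-focal-loss formulation rather than any project-specific code.

```python
import torch
import torch.nn.functional as F

def dfl_loss(pred_dist: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Distribution focal loss sketch.

    pred_dist : (N, reg_max + 1) logits over discrete bins for one box coordinate
    target    : (N,) continuous targets in [0, reg_max]
    """
    tl = target.floor().long()                                  # left (lower) bin index
    tr = (tl + 1).clamp(max=pred_dist.size(1) - 1)              # right (upper) bin index
    wl = tr.float() - target                                    # weight toward the left bin
    wr = 1.0 - wl                                               # weight toward the right bin
    loss = (F.cross_entropy(pred_dist, tl, reduction="none") * wl
            + F.cross_entropy(pred_dist, tr, reduction="none") * wr)
    return loss.mean()
```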
To analyze the convergence and stability of the model during training, this study plots the variation curves of box_loss, cls_loss, and dfl_loss, as shown in Figure 10. In the early training phase, the substantial structural adjustments introduced by pruning operations cause periodic fluctuations and non-linear decreases in loss values. This behavior primarily reflects the network’s process of readjusting its parameters and feature response structures. After several training epochs, all loss metrics gradually stabilize, indicating that the model has completed structural adaptation and achieved overall training stability.
To illustrate the detection performance of the improved model under diverse scenarios, representative test images were selected for qualitative analysis. Each sample includes the original image, the corresponding heatmap, and the final bounding box predictions, as shown in Figure 11. Specifically, subfigures (a) and (b) show the original images, (c) and (d) display the heatmaps of detected targets, and (e) and (f) highlight the precision of the final detection results. These examples encompass a range of challenging conditions, including complex backgrounds, densely distributed objects, and varying illumination levels.
The visualization results confirm that the improved model accurately localizes targets across diverse scenarios. The heatmap response regions are clearly concentrated, and the predicted bounding box contours are complete, well-defined, and closely aligned with the actual object positions. These results demonstrate the model’s strong feature representation and spatial perception capabilities, further validating its adaptability and robustness in real-world applications.

4.3. Comparison Experiment

To comprehensively evaluate the detection performance and deployment practicality of the proposed YOLOv8n-ALM model, several representative mainstream object detection algorithms were selected for comparison. These include the two-stage detector Faster R-CNN, the anchor-free architecture TOOD-R50 [26], and multiple lightweight single-stage detectors such as YOLOv5n, YOLOv8n, and its variants (BiFPN, LSCD [27], EfficientHead [28]), as well as the latest YOLOv10n–YOLOv12n series. Comparative experiments were conducted on the KITTI dataset under consistent input dimensions and data partitions.
All models were evaluated using detection accuracy (AP@0.5) across three representative traffic object categories—Car, Pedestrian, and Cyclist—along with additional comparisons on overall detection accuracy (mAP@0.5), computational complexity (GFLOPs), and parameter count (Params). Together, these metrics provide a comprehensive assessment of a model’s deployment value in practical applications from the perspectives of accuracy, computational efficiency, and model compactness. The experimental results, summarized in Table 6, demonstrate that YOLOv8n-ALM achieves efficient detection of small and multiple objects while maintaining lightweight characteristics.
As shown in Table 6, the proposed YOLOv8n-ALM achieves an mAP@0.5 of 87.0%, surpassing all comparison models. Meanwhile, its parameter count is only 1.04 million, and its computational complexity is 5.7 GFLOPs, both significantly lower than those of most lightweight detectors. These results demonstrate that the proposed method achieves an effective balance between detection accuracy and model compactness, highlighting its strong feasibility for practical deployment.
To more intuitively illustrate the overall differences in detection performance, a bar chart comparing the mAP@0.5 values of each model is presented in Figure 12. It can be observed that YOLOv8n-ALM achieves the highest detection accuracy while maintaining the lowest computational complexity and parameter count, thereby demonstrating the efficiency and practical value of the proposed approach.
To further analyze detection performance across different target categories, the AP@0.5 values for Car, Pedestrian, and Cyclist were calculated for each model, and the results are illustrated in the comparison bar chart shown in Figure 13. The findings reveal that YOLOv8n-ALM achieved the highest precision—96.5% for Car and 86.7% for Cyclist. Although its performance on the Pedestrian category was slightly lower than that of TOOD-R50, it remained at a comparatively high level, reflecting an overall balanced performance. These results further validate the adaptability and robustness of the proposed method in multi-category detection tasks, particularly in scenarios involving small targets and complex backgrounds.
To further verify the deployability of the proposed model, its real-time performance was evaluated on a replayed video stream (batch size = 1, input = 640 × 640). As shown in Table 7, the proposed YOLOv8n-ALM achieves an average latency of 27.36 ms (≈36.55 FPS) with P95 = 33.38 ms, outperforming the baseline YOLOv8n (29.98 ms, 33.36 FPS, P95 = 35.60 ms). The model-only latency decreases from 10.68 ms to 6.89 ms, primarily due to faster inference (8.63 → 4.73 ms). These results confirm that the proposed improvements significantly enhance inference efficiency while maintaining strong real-time performance.
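The latency statistics in Table 7 can be reproduced in spirit with the timing loop sketched below (batch size 1, 640 × 640 input). The checkpoint path, the random stand-in frames used in place of the replayed video stream, the warm-up length, and the CUDA synchronization point are illustrative choices.

```python
import time
import numpy as np
import torch
from ultralytics import YOLO

model = YOLO("yolov8n_alm.pt")                           # hypothetical pruned checkpoint
frames = [torch.rand(3, 640, 640) for _ in range(200)]   # stand-in for a replayed video stream

latencies = []
for i, frame in enumerate(frames):
    t0 = time.perf_counter()
    model.predict(frame.unsqueeze(0), imgsz=640, verbose=False)
    if torch.cuda.is_available():
        torch.cuda.synchronize()                         # include GPU completion in the timing
    if i >= 20:                                          # skip warm-up iterations
        latencies.append((time.perf_counter() - t0) * 1000.0)

lat = np.array(latencies)
print(f"mean {lat.mean():.2f} ms | P95 {np.percentile(lat, 95):.2f} ms | FPS {1000.0 / lat.mean():.2f}")
```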
Furthermore, to better contextualize the proposed method within the contemporary YOLO family, a comparison with the latest YOLOv10–YOLOv12 variants is conducted. While YOLOv10–YOLOv12 primarily focus on architectural refinements—such as decoupled detection heads, optimized label assignment strategies, and re-parameterized lightweight structures—the proposed YOLOv8n-ALM adopts a module-level enhancement strategy that integrates attention fusion (MLCA), task-aligned dynamic detection (TADDH), and adaptive pruning (LAMP). This design strengthens small-object perception and deployment efficiency without altering the core YOLOv8n framework, achieving a favorable balance between accuracy and computational complexity compared with other recent YOLO models.

4.4. Real-World Onboard Vehicle Video Detection Experiments

To further validate the effectiveness of the improved model in real driving environments, typical road video data were collected using a dashcam installed in an actual vehicle. Static images were extracted from these videos through frame sampling and used as test samples. Both the original YOLOv8n model and the improved YOLOv8n-ALM model were applied to perform object detection on the extracted frames, generating visualized comparison results. As shown in Figure 14, the two models were evaluated in terms of recognition capability, target localization accuracy, and response performance under real-world driving conditions.
To further validate the robustness of the proposed model, real-vehicle testing was conducted on the PIX-Hooke open-source autonomous driving development platform, as illustrated in Figure 15. This platform integrates perception, decision-making, and control modules, providing a complete system architecture for autonomous driving research. The test vehicle is powered by a 72 V lead-acid battery and equipped with high-precision steering, braking, and propulsion systems, ensuring stable operation under outdoor conditions. The system operates on Ubuntu 18.04, running an Intel Core i7-8700 CPU (PIX-Hooke, Tianjin, China) and an NVIDIA RTX 2080 GPU (PIX-Hooke, Tianjin, China), which together support high-performance deep learning inference tasks. Furthermore, the platform is equipped with multiple perception sensors, including LiDAR, RGB cameras, and forward-facing visible-light cameras, enabling comprehensive acquisition of multimodal environmental data.
During the experiment, image samples from various representative traffic scenarios were collected using the onboard camera and portable computing terminal of the PIX-Hooke platform. These data were utilized to compare and evaluate the detection performance of the original YOLOv8n model against the improved YOLOv8n-ALM model. The improved model was deployed on the platform to perform real-time object detection, result visualization, and output generation based on inputs from the vehicle’s sensors. As shown in Figure 16, the results further verify the model’s operational stability and engineering feasibility in real-world deployment environments. Moreover, during real-vehicle testing, the proposed YOLOv8n-ALM model achieved an average inference speed of approximately 30 frames per second (FPS) on the PIX-Hooke platform, demonstrating stable real-time detection performance without noticeable latency or frame loss.

5. Conclusions

This study addresses the challenges of achieving high accuracy and efficiency in forward object detection for autonomous driving scenarios and proposes an improved lightweight algorithm, YOLOv8n-ALM, which integrates the Mixed Local Channel Attention (MLCA) mechanism, the Task-Aligned Dynamic Detection Head (TADDH), and the Layer-Adaptive Magnitude-based Pruning (LAMP) strategy. While preserving a compact model structure, the proposed approach significantly enhances small-object perception, detection accuracy, and inference speed. Experimental results on the KITTI dataset demonstrate that the improved model outperforms mainstream methods across key metrics, including mAP@0.5, multiple AP values, and inference frame rate. Moreover, relative to the YOLOv8n baseline, the number of parameters and computational load are reduced by approximately 65% and 30%, respectively, demonstrating a strong lightweight capability. Real-vehicle experiments further validate the model's detection effectiveness and deployment stability in complex traffic environments, underscoring its substantial engineering practicality and potential for real-world autonomous driving applications.
Nevertheless, this study still has certain limitations. The proposed YOLOv8n-ALM model has not yet been extensively validated under extreme occlusion, adverse weather, or nighttime driving conditions, where visual noise and illumination variations may affect detection stability. In future work, we plan to extend the framework by integrating multi-sensor fusion (e.g., LiDAR and infrared vision) and domain adaptation techniques to enhance robustness across diverse driving environments. Moreover, lightweight optimization and deployment on low-power embedded platforms will be further explored to improve scalability and adaptability for real-world autonomous driving applications.

Author Contributions

Conceptualization, Y.Z. and J.Z.; methodology, Y.Z.; software, Y.Z. and J.Z.; validation, Y.Z. and W.K.; formal analysis, C.W.; investigation, G.L.; resources, C.W. and F.D.; data curation, G.L.; writing—original draft preparation, Y.Z.; writing—review and editing, J.Z.; visualization, W.K.; supervision, C.W.; project administration, J.Z.; funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Research on new energy intelligent connected vehicle control faced on traffic scene, grant number NO. 2024CXYTD07; Natural Science Research Project of Higher Education Institutions in Anhui Province: Research on Magnetic Field Analysis of Rotor Permanent Magnet Fracture Fault in Permanent Magnet Synchronous Motor for Pure Electric Vehicles, grant number NO. 2024AH051637; Anhui Province Natural Science Research Key Project in 2022 Nonlinear Wind-induced Vibration Analysis and Design Parameter Optimization of Catenary Based on Beam Theory, grant number NO. 2022AH052237.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MLCA: Mixed Local Channel Attention
TADDH: Task-Aligned Dynamic Detection Head
LAMP: Layer-Adaptive Magnitude-based Pruning
DCNv2: Deformable Convolutional Networks v2
UNAP: Un-pooling (restoration of pooled attention maps to the original spatial resolution)
GAP: Global Average Pooling
KITTI: Karlsruhe Institute of Technology and Toyota Technological Institute Dataset
YOLOv8n-ALM: YOLOv8-Attention-Lightweight-Multihead

References

  1. Pang, Y.; Yuan, Y.; Li, X.; Pan, J. Efficient HOG human detection. Signal Process. 2011, 91, 773–781.
  2. Zheng, L.; Yang, Y.; Tian, Q. SIFT meets CNN: A decade survey of instance retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1224–1244.
  3. Xu, G.; Liao, W.; Zhang, X.; Li, C.; He, X.; Wu, X. Haar wavelet downsampling: A simple but effective downsampling module for semantic segmentation. Pattern Recognit. 2023, 143, 109819.
  4. Kurani, A.; Doshi, P.; Vakharia, A.; Shah, M. A comprehensive comparative study of artificial neural networks (ANN) and support vector machines (SVM) on stock forecasting. Ann. Data Sci. 2023, 10, 183–208.
  5. Lu, C.; Zhou, Y.; Bao, F.; Chen, J.; Li, C.; Zhu, J. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. Mach. Intell. Res. 2025, 22, 730–751.
  6. Xu, X.; Zhao, M.; Shi, P.; Ren, R.; He, X.; Wei, X.; Yang, H. Crack detection and comparison study based on Faster R-CNN and Mask R-CNN. Sensors 2022, 22, 1215.
  7. Xu, J.; Ren, H.; Cai, S.; Zhang, X. An improved Faster R-CNN algorithm for assisted detection of lung nodules. Comput. Biol. Med. 2023, 153, 106470.
  8. Hassan, E.; El-Rashidy, N. Mask R-CNN models. Nile J. Commun. Comput. Sci. 2022, 3, 17–27.
  9. Jiang, X.; Meng, L.; Chen, X.; Xu, Y.; Wu, D. CSP-Net: Common spatial pattern empowered neural networks for EEG-based motor imagery classification. Knowl.-Based Syst. 2024, 305, 112668.
  10. Wu, Y.; Yao, Q.; Fan, X.; Gong, M.; Ma, W.; Miao, Q. PANet: A point-attention based multi-scale feature fusion network for point cloud registration. IEEE Trans. Instrum. Meas. 2023, 72, 1–13.
  11. Cheng, Q.; Huang, H.; Xu, Y.; Zhou, Y.; Li, H.; Wang, Z. NWPU-Captions dataset and MLCA-Net for remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–19.
  12. Tang, P.; Ding, Z.Y.; Jiang, M.N.; Xu, W.K.; Lv, M. LBT-YOLO: A Lightweight Road Targeting Algorithm Based on Task-Aligned Dynamic Detection Heads; IEEE Access: Piscataway, NJ, USA, 2024.
  13. Vysogorets, A.; Kempe, J. Connectivity matters: Neural network pruning through the lens of effective sparsity. J. Mach. Learn. Res. 2023, 24, 1–23.
  14. Qiu, M.; Huang, L.; Tang, B.-H. ASFF-YOLOv5: Multielement detection method for road traffic in UAV images based on multiscale feature fusion. Remote Sens. 2022, 14, 3498.
  15. Wu, J.; Dong, J.; Nie, W.; Ye, Z. A lightweight YOLOv5 optimization of coordinate attention. Appl. Sci. 2023, 13, 1746.
  16. Jani, M.; Fayyad, J.; Al-Younes, Y.; Najjaran, H. Model compression methods for YOLOv5: A review. arXiv 2023, arXiv:2307.11904.
  17. Li, Z.; Wang, Y.; Chen, K.; Yu, Z. Channel-pruned YOLOv5-based deep learning approach for rapid and accurate outdoor obstacles detection. arXiv 2022, arXiv:2204.13699.
  18. Wang, J.; Chen, Y.; Dong, Z.; Gao, M. Improved YOLOv5 network for real-time multi-scale traffic sign detection. Neural Comput. Appl. 2023, 35, 7853–7865.
  19. Zhang, X.; Shen, T.; Xu, D. Object detection in remote sensing images based on an improved YOLOv8 algorithm. Laser Optoelectron. Prog. 2024, 61, 1028001.
  20. Yang, M.; Zhou, X.; Yang, F.; Zhou, M.; Wang, H. PIMnet: A quality enhancement network for compressed videos with prior information modulation. Signal Process. Image Commun. 2023, 117, 117005.
  21. Martins, N.A.; Cruz, L.A.D.S.; Lopes, F. Impact of LiDAR point cloud compression on 3D object detection evaluated on the KITTI dataset. EURASIP J. Image Video Process. 2024, 2024, 15.
  22. Liu, J.; Liu, D.; Zhu, L. CAF-RCNN: Multimodal 3D object detection with cross-attention. Int. J. Remote Sens. 2023, 44, 6131–6146.
  23. Xu, Y.; Du, W.; Deng, L.; Zhang, Y.; Wen, W. Ship target detection in SAR images based on SimAM attention YOLOv8. IET Commun. 2024, 18, 1428–1436.
  24. Ni, H.; Shi, Z.; Karungaru, S.; Lv, S.; Li, X.; Wang, X.; Zhang, J. Classification of typical pests and diseases of rice based on the ECA attention mechanism. Agriculture 2023, 13, 1066.
  25. Nguyen, T.V.; Horng, S.J.; Vu, D.T.; Chen, H.; Li, T. LAWNet: A lightweight attention-based deep learning model for wrist vein verification in smartphones using RGB images. IEEE Trans. Instrum. Meas. 2023, 72, 1–10.
  26. Wang, H.; Yu, Y.; Tang, Z. FDM-RTDETR: A Multi-scale Small Target Detection Algorithm; IEEE Access: Piscataway, NJ, USA, 2025.
  27. Yang, J.; Wu, S.; Gou, L.; Yu, H.; Lin, C.; Wang, J.; Wang, P.; Li, M.; Li, X. SCD: A stacked carton dataset for detection and segmentation. Sensors 2022, 22, 3617.
  28. Safa'a, S.S.; Mabrouk, T.F.; Tarabishi, R.A. An improved energy-efficient head election protocol for clustering techniques of wireless sensor networks. Egypt. Inform. J. 2021, 22, 439–445.
Figure 4. Structure of the proposed Mixed Local Channel Attention (MLCA) module. Here, X ∈ ℝ^(C×H×W) denotes the input feature map, where C, H, and W represent the number of channels, height, and width, respectively.
Figure 5. Examples of image samples before and after data augmentation. The transformations (e.g., brightness, rotation, flipping, noise) enhance dataset diversity and improve model robustness under varying lighting and occlusion conditions. (a) Original; (b) Brightness enhancement; (c) Flip conversion; (d) Color flicker; (e) Rotation; (f) Random cropping.
Figure 6. Comparison of detection metrics under different attention mechanisms. It visualizes how each attention design (CAFM, SimAM, ECA, MPCA, MLCA) influences precision, recall, and mAP, confirming the superiority of MLCA in balanced accuracy and stability.
Figure 7. Changes in Channel Counts Across Layers of the YOLOv8n Model Before and After LAMP Pruning. The y-axis (0–250) denotes the number of channels (per layer).
Figure 8. Quantitative Impact of Different Improvement Module Combinations on YOLOv8n Detection Performance.
Figure 9. mAP variation during retraining after applying LAMP pruning. The curve shows accuracy recovery and convergence stability, verifying the effectiveness of adaptive pruning and fine-tuning.
Figure 10. Training loss convergence curves (box_loss, cls_loss, dfl_loss) of the improved YOLOv8n-ALM. It reflects the model’s adaptation and stability during optimization after pruning and attention integration.
Figure 11. Visualization of detection results on representative test samples. The examples demonstrate improved localization precision, denser heatmap responses, and better target boundary definition using YOLOv8n-ALM.
Figure 12. Comparison of mAP@0.5 across mainstream detection models on the KITTI dataset. It verifies that YOLOv8n-ALM achieves the best accuracy–efficiency balance compared to other lightweight models.
Figure 13. Comparison of AP@0.5 across different target categories (Car, Pedestrian, Cyclist). It highlights that the proposed model maintains strong performance across diverse object scales and classes.
Figure 14. Visualization of detection results before and after model improvement: (a,d) original images; (b,e) detection results of the baseline YOLOv8n model; (c,f) detection results of the proposed YOLOv8n-ALM model. The improved model achieves higher confidence scores and more accurate bounding boxes for both vehicle and pedestrian targets.
Figure 15. The hardware composition and system structure of the PIX-Hooke autonomous driving platform.
Figure 16. Real-time detection visualization after deploying YOLOv8n-ALM to the vehicle platform. It verifies the real-world robustness, detection stability, and deployment feasibility of the proposed method.
Table 1. Summary of representative YOLO-based lightweight object detection methods and their research gaps.
Reference | Model | Main Idea | Research Gap
[14] | YOLOv5 + ASFF + Multi-scale Attention | Enhanced small-object fusion across scales | Increased computational load; difficult real-time deployment
[15] | YOLOv5 + Coordinate Attention | Improves spatial-channel feature representation | Weak task alignment between classification and regression
[16] | YOLOv5 + FPGM Pruning | Lightweight pruning for model compression | Fixed pruning ratio; reduced adaptability
[17] | YOLOv5 + BiFPN | Facilitates efficient multi-scale feature aggregation | Insufficient response to dense small targets
[18] | LBT-YOLO + TADDH | Task-aligned detection head for better cls-reg consistency | Lacks an adaptive attention and pruning mechanism
This work | YOLOv8n-ALM | Integrates mixed attention, dynamic alignment, and adaptive pruning | Addresses the above gaps under a unified framework
Table 2. Configuration of Experimental Hardware and Software Environment.
Category | Environmental Conditions
CPU | Intel Core i9-9900KF
GPU | NVIDIA RTX 3080
CUDA version | 12.0
Python | 3.9.12
PyTorch | 2.0.0
mmcv | 2.2.0
MMDetection | 3.3.0
Operating system | Ubuntu 22.04
Table 3. Experimental Parameters for Network Training and Pruning.
Parameter | Value
Epochs | 150
Batch Size | 32
Image Size | 640 × 640
Optimizer | SGD
Weight Decay | 0.0005
Pruning Method | LAMP
Pruning Type | Global pruning
Table 4. Performance Comparison of Different Attention Mechanisms in the YOLOv8n Model.
Model | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Params/M
Basic | 86.0 | 77.6 | 84.8 | 59.0 | 3.00
+CAFM | 88.3 | 76.2 | 84.8 | 58.7 | 3.35
+SimAM | 90.3 | 75.8 | 85.1 | 58.6 | 3.00
+ECA | 87.5 | 78.1 | 85.4 | 58.9 | 3.00
+MPCA | 85.8 | 77.0 | 84.1 | 58.8 | 3.33
+MLCA | 88.9 | 76.4 | 85.9 | 59.7 | 3.00
Table 5. Ablation experiment, “✓” denotes that the corresponding module is included in the configuration.
YOLOv8n | MLCA | TADDH | Prune | Model Size/MB | GFLOPs | Params/M | P (%) | R (%) | mAP@0.5 (%)
✓ |  |  |  | 6.2 | 8.1 | 3.00 | 86.0 | 77.6 | 84.8
✓ | ✓ |  |  | 6.2 | 8.1 | 3.00 | 88.9 | 76.4 | 85.9
✓ | ✓ | ✓ |  | 7.7 | 12.4 | 3.72 | 91.5 | 79.4 | 88.0
✓ | ✓ | ✓ | ✓ | 2.1 | 5.7 | 1.04 | 91.8 | 77.7 | 87.0
Table 6. Comparison experiment.
Model | Car AP@0.5 (%) | Pedestrian AP@0.5 (%) | Cyclist AP@0.5 (%) | mAP@0.5 (%) | GFLOPs | Params/M
YOLOv5n | 95.7 | 72.8 | 83.5 | 84.0 | 7.10 | 2.50
Faster R-CNN | 78.5 | 69.2 | 75.5 | 74.3 | 236 | 70.12
TOOD-R50 | 91.0 | 78.2 | 74.3 | 80.7 | 199 | 32.10
YOLOv8n-BiFPN | 95.3 | 73.7 | 81.7 | 83.6 | 7.10 | 1.99
YOLOv8n-LSCD | 96.1 | 76.4 | 84.7 | 85.7 | 6.5 | 2.36
YOLOv8n-EfficientHead | 96.0 | 75.9 | 82.7 | 84.8 | 8.1 | 3.83
YOLOv10n | 96.0 | 74.6 | 84.1 | 84.9 | 6.5 | 2.26
YOLOv11n | 95.7 | 74.0 | 84.0 | 84.5 | 6.3 | 2.58
YOLOv12n | 94.9 | 72.5 | 80.5 | 82.6 | 5.8 | 2.50
DETR-R50 | 74.1 | 45.0 | 51.2 | 58.3 | 39.0 | 84.1
YOLOv8n-ALM | 96.5 | 77.7 | 86.7 | 87.0 | 5.7 | 1.04
Table 7. Comparison of real-time performance between the baseline and improved models (batch size = 1, 640 × 640).
Model | Latency E2E Mean (ms) | P95 (ms) | Model-Only Mean (ms) | Pre (ms) | Infer (ms) | Post (ms) | FPS
YOLOv8n | 29.98 | 35.60 | 10.68 | 1.26 | 8.63 | 0.80 | 33.36
YOLOv8n-ALM | 27.36 | 33.38 | 6.89 | 1.30 | 4.73 | 0.86 | 36.55