Article

A Lightweight and Efficient Approach for Distracted Driving Detection Based on YOLOv8

1 Internet of Things Engineering, Wuxi University, Wuxi 214105, China
2 School of Information and Communications Engineering, Xi’an Jiaotong University, Xi’an 710049, China
3 School of Computer Science, University of Liverpool, Liverpool L69 3DR, UK
* Author to whom correspondence should be addressed.
Electronics 2026, 15(1), 34; https://doi.org/10.3390/electronics15010034
Submission received: 21 October 2025 / Revised: 16 November 2025 / Accepted: 4 December 2025 / Published: 22 December 2025

Abstract

To overcome the issues of excessive computation and resource usage in distracted driving detection systems, this study introduces a compact detection framework named YOLOv8s-FPNE, built upon the YOLOv8 architecture. The proposed model incorporates FasterNet, Partial Convolution (PConv) layers, a Normalized Attention Mechanism (NAM), and the Focal-EIoU loss to achieve an optimal trade-off between accuracy and efficiency. FasterNet together with PConv enhances feature extraction while reducing redundancy, NAM strengthens the model’s sensitivity to key spatial and channel information, and Focal-EIoU refines bounding-box regression, particularly for hard-to-detect samples. Experimental evaluations on a public distracted driving dataset show that YOLOv8s-FPNE reduces the number of parameters by 21.7% and the computational cost (FLOPs) by 23.6% relative to the original YOLOv8s, attaining an mAP@0.5 of 81.6%, which surpasses existing lightweight detection methods. Ablation analyses verify the contribution of each component, and comparative studies further confirm the advantages of NAM and Focal-EIoU. The results demonstrate that the proposed method provides a practical and efficient solution for real-time distracted driving detection on embedded and resource-limited platforms.

1. Introduction

As the modern automotive industry continues to thrive, cars have firmly established themselves as an essential mode of daily transportation. However, this progress has been accompanied by a growing concern for traffic safety, especially due to the alarming rise in traffic accidents caused by distracted driving. These accidents have emerged as a significant threat to road safety, even surpassing those caused by drunk driving and speeding. Distracted driving encompasses visual, cognitive, and manual distractions, all of which severely diminish the driver’s focus on the road, thereby increasing the likelihood of accidents. In response to this challenge, the effective detection of distracted driving behaviors has become a focal point of research. Traditional detection methods, which often rely on sensors such as eye trackers and head motion sensors, are hindered by their high costs and complex installation requirements, limiting their practical application.
In recent years, deep learning-based computer vision techniques have achieved remarkable advances in distracted driving detection. YOLO-based networks, known for their computational efficiency and strong detection performance, have become a dominant approach in this area. Nevertheless, existing methods still face certain limitations. For instance, the deep learning-based method proposed by Cao et al. [1] attains high detection accuracy but is encumbered by high model complexity, making it difficult to deploy on resource-limited embedded devices. Wei et al. [2] used an enhanced YOLOv4-tiny model to reduce computational complexity, albeit at the expense of detection accuracy. Chen et al. [3] improved the YOLOv5 model to boost detection accuracy, yet the resulting model retains a high parameter count and computational complexity and struggles to meet real-time requirements. Cai et al. [4] reviewed deep learning-based object detection algorithms and their applications, highlighting the need for further optimization in balancing lightweight design with accuracy. Existing comparisons also rarely consider recent multi-modal or gaze-integrated detection approaches, which could enhance distracted driving recognition beyond visual cues alone (e.g., combining gaze cues and step recognition for assembly assistance [5]). Although this paper focuses on visual detection using YOLOv8, such multi-modal extensions offer a promising direction for future research. To tackle these issues, this paper introduces a lightweight distracted driving detection method based on YOLOv8. By incorporating the FasterNet architecture, the PConv module, the NAM attention mechanism, and the Focal-EIoU loss function, the proposed method substantially reduces the model’s parameter count and computational complexity while preserving high detection accuracy. The primary contributions of this paper are as follows:
  • A lightweight distracted driving detection model based on YOLOv8, named YOLOv8s-FPNE, which significantly diminishes the model’s parameter size and computational complexity.
  • The integration of FasterNet, PConv, NAM attention mechanism, and Focal-EIoU loss function to enhance detection accuracy and efficiency.
  • A comprehensive series of experiments conducted on publicly available datasets to validate the effectiveness and superiority of the proposed method.

2. Introduction to YOLO

YOLOv8 is a recent evolution of the YOLO series of object detection algorithms. It builds on the strengths of its predecessors, such as fast inference and high detection accuracy, while introducing enhancements that improve performance in complex scenarios. YOLOv8 manages model complexity through depth and width factors, offering several versions (YOLOv8-N, YOLOv8-S, YOLOv8-M, YOLOv8-L, and YOLOv8-X) to accommodate varying computational capabilities.
The YOLOv8 architecture is composed of four main components: the input layer, the backbone network, the neck network, and the detection head. YOLOv8 accepts input images of fixed dimensions (typically 640 × 640 × 3) and performs preprocessing tasks such as resizing to standard dimensions and normalization. This ensures that the input data is compatible with the model’s structure and parameter settings, thereby maintaining both detection efficiency and accuracy.
YOLOv8 utilizes more efficient feature extraction networks, such as CSPNet (Cross-Stage Partial Network), to minimize redundant computations while preserving high detection accuracy. The backbone’s primary role is to extract multi-level feature maps through convolution operations. In the neck section, YOLOv8 employs classical architectures like FPN (Feature Pyramid Network) and PAN (Path Aggregation Network). These architectures enhance the network’s ability to detect objects of various sizes by fusing multi-scale features, ensuring robust detection performance even with small objects or complex backgrounds.
In the detection head, YOLOv8 introduces an anchor-free detection approach, dispensing with the traditional anchor-based mechanism to make detection more flexible and efficient. The head is responsible for predicting the locations and categories of objects based on the feature maps output by the neck, directly yielding the detection results.
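For orientation, the following minimal sketch shows how the stock YOLOv8s detector can be loaded and run through the Ultralytics Python API. The image path is a placeholder, and this is the unmodified baseline rather than the YOLOv8s-FPNE variant proposed later.

```python
from ultralytics import YOLO

# Load the pretrained small variant (the baseline that this work modifies).
model = YOLO("yolov8s.pt")

# Run anchor-free detection on a single image; "driver.jpg" is a placeholder path.
results = model.predict(source="driver.jpg", imgsz=640, conf=0.25)

for r in results:
    # Each result exposes the predicted boxes, confidences, and class indices.
    print(r.boxes.xyxy, r.boxes.conf, r.boxes.cls)
```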

3. YOLO Algorithm Improvements

To address the critical need for a low parameter count and low computational complexity in distracted driving detection, this paper presents an enhanced version of YOLOv8s with improvements in four key areas:
  • The FasterBlock [6] replaces the Bottleneck in C2f. It extracts spatial features efficiently by reducing redundant computation and memory access, significantly lowering the parameter count and computational complexity.
  • PConv is integrated into the detection head. By convolving only part of the input channels and combining the result with PWConv, the full feature map is exploited more effectively with reduced computational and memory overhead.
  • The NAM [7] attention mechanism is incorporated, applying a sparse weight penalty to the attention module. This makes the computation more efficient while preserving the model’s performance.
  • The Focal-EIoU [8] loss function replaces the CIoU loss of the original model, accelerating convergence and improving detection accuracy.
The network architecture of the improved YOLOv8s-FPNE model is depicted in Figure 1.

3.1. Integration of FasterNet

FasterNet represents a novel lightweight network architecture meticulously devised to enhance the efficiency of neural networks by curtailing redundant computations, as illustrated in Figure 2. This module demonstrates particular suitability for resource-constrained settings, such as edge computing apparatuses or embedded systems.
Conventional lightweight networks typically rely on depthwise convolutions (DWConv) or group convolutions (GConv) to extract spatial features. Although these operators reduce FLOPs (floating-point operations), they often incur additional memory-access overhead. MicroNet [9] goes further by decomposing and sparsifying the network to drive FLOPs down to extremely low levels; however, this comes at the cost of low computational efficiency, and the accompanying operations such as concatenation, shuffling, and pooling noticeably degrade the runtime of small models. In contrast, FasterNet introduces PConv to streamline the computational path, reducing both redundant computation and memory access while achieving strong performance on classification, detection, and segmentation tasks with lower latency and higher throughput.
FasterNet refines the computational route via the utilization of PConv. In comparison to traditional convolution operations, PConv significantly trims down computational overhead while safeguarding the quality of feature extraction. Traditional neural networks, especially deep convolutional neural networks (CNNs), ordinarily demand copious computational resources when handling images. Typical convolution operations necessitate processing data from every input channel and applying convolutions uniformly to all data, thereby engendering colossal computation requirements and memory access strain. In real-time application scenarios, such as distracted driving detection, it is of paramount importance to expeditiously analyze each frame of an image to guarantee the system’s real-time responsiveness. FasterNet’s design centers around minimizing superfluous computations without compromising accuracy, facilitating swift and efficient processing of input data.
FasterNet relies on PConv to cut computation. The key difference between PConv and a standard convolution is that PConv applies the convolution to only a subset of the input channels. A standard convolution processes every input channel in the same way, so computation is spent even on channels whose features contribute little to subsequent detection; PConv removes much of this redundancy by convolving selectively.
PConv exploits the redundancy that typically exists across feature-map channels: it applies a regular convolution to a contiguous subset of the channels for spatial feature extraction and passes the remaining channels through unchanged, leaving the subsequent pointwise convolution (PWConv) to propagate information across all channels. This reduces both the number of convolution operations and the frequency of memory accesses, improving the model’s computational efficiency.
The C2f module in YOLOv8, a variant of the Cross-Stage Partial design, serves as the feature extraction block in the original structure. C2f reduces redundant computation by sharing feature information across stages. Although it performs well in YOLOv8, it still carries a relatively high computational cost, especially with high-resolution inputs, which substantially increases the model’s FLOPs (floating-point operations).
In this work, the FasterBlock from FasterNet replaces the Bottleneck inside the C2f module, yielding the C2f-Faster structure. Its lightweight design and the efficient computation of PConv markedly reduce the model’s FLOPs while maintaining comparable or even better detection accuracy, and the lower memory-bandwidth demand allows the model to run efficiently on embedded devices and in edge-computing scenarios.
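To make the replacement concrete, the sketch below shows a simplified PConv and a FasterNet-style block in PyTorch, written under the assumptions of the FasterNet paper (a partial ratio of 1/4 and two pointwise convolutions with a residual connection). It is an illustrative approximation, not the exact C2f-Faster implementation used in this work.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution: spatially convolve only the first 1/n_div of the
    channels and pass the rest through unchanged (FasterNet-style sketch)."""
    def __init__(self, dim: int, n_div: int = 4, k: int = 3):
        super().__init__()
        self.dim_conv = dim // n_div
        self.dim_keep = dim - self.dim_conv
        self.conv = nn.Conv2d(self.dim_conv, self.dim_conv, k,
                              padding=k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.dim_conv, self.dim_keep], dim=1)
        return torch.cat((self.conv(x1), x2), dim=1)


class FasterBlock(nn.Module):
    """FasterNet-style block: PConv for spatial mixing, then two pointwise
    convolutions (expand/reduce) with a residual connection."""
    def __init__(self, dim: int, expansion: int = 2):
        super().__init__()
        hidden = dim * expansion
        self.pconv = PConv(dim)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, dim, 1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mlp(self.pconv(x))   # residual connection
```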

3.2. PConv

PConv (Partial Convolution) constitutes an optimized convolution operation methodology devised to abate redundant computations in convolutional neural networks, and it is especially apt for high-resolution image processing tasks such as object detection and distracted driving detection. In contrast to traditional convolution operations that execute identical computations across every input channel, PConv confines its convolution operations to a subset of input channels. This approach effectively trims down computation and memory bandwidth consumption while sustaining a high level of feature extraction proficiency.
When convolutional neural networks process images, they produce multi-channel feature maps whose channels are often highly similar, so applying the same convolution to every channel leads to substantial redundant computation. PConv avoids this by applying a standard convolution only to a subset of the input feature map’s channels, while the remaining channels are passed through unchanged and later mixed by pointwise convolutions.
In the YOLOv8s-FPNE model, PConv is predominantly employed in the detection head. The detection head shoulders the responsibility of classifying and localizing objects based on the extracted feature maps. Traditional detection heads customarily apply multiple convolutional layers to execute convolution calculations on all input channels. This practice incurs a hefty computational load, especially when grappling with high-resolution images. By downsizing the number of convolution operations, PConv significantly slashes FLOPs (Floating Point Operations) and empowers the model to conduct inference more expeditiously when processing high-resolution images.
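As a rough back-of-the-envelope illustration of the saving (the feature-map size and channel count below are hypothetical, not values from this paper), convolving only one quarter of the channels reduces the FLOPs of the spatial convolution to 1/16 of those of a dense convolution:

```python
def conv_flops(h: int, w: int, c_in: int, c_out: int, k: int) -> int:
    """Multiply-accumulate count (FLOPs, not FLOPS) of a dense k x k convolution."""
    return h * w * c_in * c_out * k * k

# Hypothetical head-level feature map: 80 x 80 spatial, 256 channels, 3 x 3 kernel.
h, w, c, k = 80, 80, 256, 3
full = conv_flops(h, w, c, c, k)                 # regular conv over all channels
partial = conv_flops(h, w, c // 4, c // 4, k)    # PConv over a quarter of the channels

print(f"dense: {full/1e9:.3f} GFLOPs, partial: {partial/1e9:.3f} GFLOPs, "
      f"ratio: {partial/full:.4f}")              # ratio = 1/16 = 0.0625
```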

3.3. NAM Attention

The Normalized Attention Mechanism (NAM) is a lightweight and efficient attention mechanism designed to sharpen a model’s focus on important features while keeping the computational load low. It has been widely used in computer vision tasks, particularly object detection, where attention allows the model to concentrate on the significant regions of an image and thereby improves detection accuracy.
In neural networks, convolution operations play a pivotal role in extracting spatial and channel features from input images. Nevertheless, a common drawback of these operations is that they generally treat all features equally. This egalitarian approach may impede the model’s ability to differentiate between primary and secondary features.
The essence of the attention mechanism is to assign distinct weights to features, thereby enabling the model to concentrate more effectively on critical areas. The NAM (Normalized Attention Mechanism) takes this concept a step further by integrating both channel and spatial attention (as illustrated in Figure 3). This combination empowers the model to dynamically adapt its attention across various regions and channels.
Channel attention is primarily employed to fine-tune the significance of different channels within the feature map. Meanwhile, spatial attention zeroes in on identifying the most crucial regions in the spatial dimension of the feature map.
Compared with traditional attention mechanisms such as CBAM (Convolutional Block Attention Module) and SE (Squeeze-and-Excitation), NAM strikes a better balance between computational cost and performance. Its key idea is to streamline the attention computation: the scaling factors of batch normalization are reused as importance weights and a sparsity penalty is applied to them, avoiding the extra fully connected or convolutional layers used by SE and CBAM. This removes redundant computation while preserving the model’s ability to focus on important features.
In the YOLOv8s-FPNE model, the NAM attention mechanism is seamlessly integrated into multiple network modules. By appending the NAM module after the convolution layers, the model can dynamically modulate its attention across different channels and spatial regions during feature extraction. As a result, YOLOv8s-FPNE can more precisely detect distracted driving behavior in complex scenarios. Especially under multi-task conditions or challenging lighting situations, NAM substantially boosts the model’s performance.
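A minimal sketch of NAM-style channel attention is shown below, assuming the formulation of the cited NAM paper in which the batch-normalization scaling factors serve as channel-importance weights; the layer placement and hyperparameters used in YOLOv8s-FPNE may differ.

```python
import torch
import torch.nn as nn

class NAMChannelAttention(nn.Module):
    """Sketch of NAM channel attention: reuse BatchNorm scaling factors as
    channel weights instead of extra FC/conv layers (assumed formulation)."""
    def __init__(self, channels: int):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.bn(x)
        # Normalize the absolute BN gammas so they act as per-channel importance.
        gamma = self.bn.weight.abs()
        weight = gamma / gamma.sum()
        out = out * weight.view(1, -1, 1, 1)
        return x * torch.sigmoid(out)   # gate the input by the attention map
```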

3.4. Loss Function

In the realm of object detection tasks, the loss function serves as a crucial determinant for assessing a model’s performance. Traditional YOLO models commonly adopt the CIoU (Complete Intersection over Union) as their loss function. This function is well-designed as it not only takes into account the Intersection over Union (IoU) between the predicted bounding boxes and the ground-truth boxes but also incorporates considerations regarding the distance between them and their aspect ratios. Nevertheless, CIoU is not without its limitations, particularly when confronted with complex scenarios. For instance, when dealing with high-quality samples and hard-to-detect objects, the model often exhibits a sluggish convergence speed and limited localization accuracy.
To surmount these challenges, this paper adopts the Focal-EIoU loss function. This loss integrates a focal weighting mechanism with the EIoU (Efficient IoU) loss, enabling the model to perform more effective feature learning and localization prediction on difficult samples.
CIoU enhances the accuracy of object detection by imposing penalties on the distance between the center points of the predicted and ground truth boxes, as well as their aspect ratios. However, it fails to fully address the issues associated with handling hard samples and the unequal weighting of high-quality samples. In the context of distracted driving detection tasks, CIoU demonstrates slow convergence and suboptimal localization precision in certain complex scenarios. Specifically, when attempting to detect fine-grained driver actions, such as holding a phone or looking up at the mirror, the errors tend to be relatively large.
The Focal-EIoU loss function leverages the Focal mechanism, which enables the model to concentrate more intently on hard-to-detect samples during the training process. By reducing the weight assigned to easy-to-detect samples, the model can allocate a greater proportion of its attention to challenging targets. Consequently, the model achieves a substantial improvement in detection accuracy when dealing with complex distracted driving behaviors, including multitask distractions and subtle actions.
EIoU builds upon CIoU: in addition to the IoU and the center-point distance, it replaces the aspect-ratio term with separate penalties on the differences in width and height between the predicted and ground-truth boxes. This allows the model to converge more rapidly and to maintain higher localization accuracy in demanding detection scenarios.
The calculation formula for the Focal-EIoU loss function is as follows:
$$ L_{\mathrm{Focal\text{-}EIoU}} = (1 - \mathrm{IoU}) + \alpha \, \frac{(x_{\mathrm{pred}} - x_{\mathrm{true}})^2 + (y_{\mathrm{pred}} - y_{\mathrm{true}})^2}{d_{\max}^2} + \gamma \, \mathrm{AspectRatioPenalty} $$
In this context, IoU represents the Intersection over Union between the predicted bounding boxes and the ground-truth boxes. Meanwhile, α and γ are the weight parameters that regulate the penalties related to the distance and aspect ratio. The Focal mechanism plays a crucial role by dynamically adjusting these weights, which allows the model to place greater emphasis on samples that are difficult to detect.
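The snippet below is a direct transcription of the loss as written above, not the reference Focal-EIoU implementation of [8]; the default values of alpha and gamma and the tensor layout are assumptions made for illustration.

```python
import torch

def focal_eiou_like_loss(iou: torch.Tensor,
                         pred_xy: torch.Tensor, true_xy: torch.Tensor,
                         d_max: torch.Tensor, aspect_penalty: torch.Tensor,
                         alpha: float = 1.0, gamma: float = 0.5) -> torch.Tensor:
    """Transcribes the formula above: (1 - IoU) + alpha * normalized center
    distance + gamma * aspect-ratio penalty; all inputs are per-box tensors."""
    center_dist = ((pred_xy - true_xy) ** 2).sum(dim=-1)     # squared (x, y) offset
    loss = (1.0 - iou) + alpha * center_dist / (d_max ** 2) + gamma * aspect_penalty
    return loss.mean()
```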
Within the YOLOv8s-FPNE model, the Focal-EIoU loss function takes the place of the original CIoU loss function. This substitution leads to significant improvements, enabling the model to exhibit superior convergence and stability when confronted with complex scenarios.
Particularly in the realm of distracted driving detection tasks, the Focal-EIoU loss function proves highly effective in enhancing the localization accuracy of subtle driver actions. For example, it can accurately recognize whether a driver is using a phone, looking upward, or turning their head, even under complex lighting conditions.

4. Experimental Setup

4.1. Experimental Environment

The input image resolution for this experiment was set to 640 × 480 pixels. Training parameters included a batch size of 16, 100 training epochs, and an initial learning rate of 0.001. Detailed experimental configurations are shown in Table 1.
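As a hedged sketch of how the listed settings map onto a training run with the Ultralytics API (the dataset configuration file name is a placeholder, and the authors’ actual training script may differ):

```python
from ultralytics import YOLO

# Build the small YOLOv8 variant; the modified YOLOv8s-FPNE would use a custom
# model definition instead of the stock one.
model = YOLO("yolov8s.yaml")

model.train(
    data="distracted_driving.yaml",  # placeholder dataset config
    epochs=100,                      # training epochs listed above
    batch=16,                        # batch size listed above
    lr0=0.001,                       # initial learning rate listed above
    imgsz=640,                       # input size (the text reports 640 x 480 images)
)
```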

4.2. Dataset

The dataset used in this research was collected from publicly available online sources. It comprises 5147 images covering three common distracted driving behaviors: drinking water, using a mobile phone, and smoking. The images were captured from a variety of angles, giving the dataset a rich range of viewpoints. Compared with the dataset used by Qian et al. [10], it offers greater diversity, broader generalization, and higher robustness. To reduce the risk of overfitting and make full use of the data, the dataset was partitioned into training, testing, and validation sets at a ratio of 7:2:1.
Specifically, the training set consists of 3602 images, which serve as the primary material for the model to learn the patterns and features associated with distracted driving behaviors. The test set, containing 1029 images, is employed to evaluate the model’s performance after training. The validation set, with 516 images, plays a crucial role in fine-tuning the model during the training process, ensuring its generalization ability.
All images in the dataset have a standardized resolution of 640 × 640. To ensure the accuracy of the data, annotations, including bounding boxes that precisely delineate the location of the distracted behavior in the image and class labels that identify the type of behavior, were painstakingly created using the LabelImg tool (version 1.8.6). This manual annotation process guarantees the reliability and quality of the dataset for subsequent model training and evaluation.
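The following sketch reproduces the 7:2:1 split described above; the directory layout and file extension are assumptions, and with 5147 images it yields the reported 3602/1029/516 partition.

```python
import random
from pathlib import Path

# Assumed directory layout; adjust to the actual dataset location.
images = sorted(Path("datasets/distracted_driving/images").glob("*.jpg"))
random.seed(0)            # fixed seed for a reproducible split
random.shuffle(images)

n = len(images)                        # 5147 in the dataset described above
n_train, n_test = int(0.7 * n), int(0.2 * n)
splits = {
    "train": images[:n_train],                   # 3602 images when n = 5147
    "test":  images[n_train:n_train + n_test],   # 1029 images
    "val":   images[n_train + n_test:],          # 516 images
}
for name, files in splits.items():
    Path(f"{name}.txt").write_text("\n".join(str(p) for p in files))
```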

4.3. Model Evaluation Metrics

To evaluate the lightweight YOLOv8-based model for distracted driving detection, several commonly used metrics were adopted to assess both accuracy and efficiency: mean Average Precision (mAP), the number of parameters, and the number of floating-point operations (FLOPs). The relevant formulas are as follows:
$$ \mathrm{Precision} = \frac{TP}{TP + FP} $$
$$ \mathrm{Recall} = \frac{TP}{TP + FN} $$
$$ \mathrm{AP} = \int_{0}^{1} p(r)\,dr $$
$$ \mathrm{mAP} = \frac{1}{n}\sum_{i=1}^{n} AP_i $$
where TP denotes true positives, FP false positives, FN false negatives, p precision, r recall, and $AP_i$ the average precision of class i. Higher mAP values indicate better overall detection performance. The number of parameters reflects the total count of learnable weights in the model, with fewer parameters indicating a more lightweight model suitable for deployment on resource-constrained devices. FLOPs quantify the computational complexity of the model, with lower values generally corresponding to faster inference.
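A small worked example of these definitions is given below; the detection counts and per-class AP values are illustrative only, not results from the experiments in this paper.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision and recall from raw detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

p, r = precision_recall(tp=152, fp=23, fn=48)        # -> P ~ 0.869, R = 0.760

# mAP@0.5 is the mean of the per-class average precisions (illustrative values).
ap_per_class = {"drinking": 0.83, "phone": 0.81, "smoking": 0.79}
map50 = sum(ap_per_class.values()) / len(ap_per_class)
print(f"precision={p:.3f} recall={r:.3f} mAP@0.5={map50:.3f}")
```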

4.4. Ablation Studies

To assess the influence of diverse improvement modules on detection accuracy, a comprehensive series of ablation experiments were carried out on the modified YOLOv8s-FPNE model. These experiments were designed to methodically analyze the contributions of each individual improvement module by progressively eliminating them. The setup and outcomes of the ablation study are summarized as follows:
Baseline Model: The unaltered YOLOv8s model served as the reference point for comparison.
Integration of the FasterNet Module (C2f-Faster): The Bottleneck inside the original C2f module was replaced with the FasterBlock, aiming to improve the model’s efficiency.
Inclusion of PConv Module: The PConv module was added to the detection head, with the intention of improving the model’s detection capabilities.
Incorporation of NAM Attention Mechanism: The NAM attention mechanism was embedded into each network block, enabling the model to better focus on relevant features.
Loss Function Replacement: The CIoU loss function was replaced with the Focal-EIoU loss function, with the expectation of optimizing the model’s learning process.
The experimental results, as shown in Table 2, indicate that the integration of the FasterNet module leads to a substantial reduction in the number of parameters and computational complexity. However, this comes at the cost of a slight decrease in the mean Average Precision (mAP). On the other hand, the introduction of the PConv module and the NAM attention mechanism both contribute to an improvement in mAP. The detailed comparison of mAP and loss across different models is presented in Figure 4. Among them, the NAM attention mechanism yields the most favorable results, achieving an mAP of 81.6% and a recall of 76.0%. Additionally, Figure 5 presents a performance comparison of the models in real-world detection tasks.

4.5. Comparison of Network Structures

To identify the optimal network structure for lightweight modeling, a comparative experiment was conducted with mainstream lightweight networks, including MobileNetV3 [11], ShuffleNetV2 [12], and C2f-Faster. Results are presented in Table 3.
The C2f-Faster structure achieved the best balance between parameter reduction and accuracy, making it suitable for distracted driving detection tasks.

4.6. Comparison of Various Loss Functions

To identify the optimal loss function, this study compares several mainstream loss functions, including CIoU Loss, Focal-EIoU Loss, EIoU Loss, GIoU Loss [13], DIoU Loss [14], SIoU Loss [15], and WIoU Loss [16]. The experimental results are summarized in Table 4.

4.7. Multiple Attention Mechanisms

Introducing an attention mechanism can improve detection accuracy without a large efficiency cost. To select the most suitable one, this study compares several lightweight attention mechanisms, including Triplet Attention [17], MLCA [18], iRMB [19], ACmix [20], ECA [21], NAM, SGE [22], and EMA [23]. The experimental results are shown in Table 5.
The results reveal clear differences among the variants in parameter count, computational complexity, and mean average precision (mAP). Triplet Attention increases both the parameter count and the complexity yet lowers the mAP, making it a poor fit for practical use. MLCA and iRMB incur excessively long training times and high computational cost, which makes them ill-suited for real-time detection tasks.
ACmix adds a certain amount of complexity but yields an mAP slightly below the baseline, offering no substantial benefit. ECA keeps the complexity and parameter count low, yet its mAP is also marginally below the baseline, indicating only modest effectiveness.
SGE and EMA both improve accuracy, but the NAM attention mechanism stands out: it reaches an mAP of 81.6%, clearly boosting detection accuracy without a noticeable increase in either the parameter count or the complexity. This outcome validates the effectiveness of NAM for distracted driving detection, and it is therefore adopted as the final attention mechanism.

4.8. Comparison of Parameters and Complexity

This section presents a comparison of YOLOv8s, YOLOv8s-FPNE, and several DETR-based detectors in terms of parameter count and computational complexity.
As summarized in Table 6, the baseline YOLOv8s model contains 11.13 M parameters and requires 28.4 GFLOPs, while the proposed YOLOv8s-FPNE achieves a substantial reduction to 8.72 M parameters and 21.7 GFLOPs, indicating a lighter and more efficient design.
In contrast, transformer-based detectors such as Vanilla DETR and Deformable DETR exhibit significantly higher computational costs, with 42.13 M/69.1 G and 37.46 M/46.3 G, respectively. Even the so-called Lightweight DETR still requires 13.62 M parameters and 24.6 GFLOPs, which remain higher than those of YOLOv8s-FPNE.
These results demonstrate that the proposed YOLOv8s-FPNE model achieves superior compactness and computational efficiency, making it particularly suitable for real-time deployment on resource-constrained platforms such as embedded systems and edge devices.

4.9. Comparison of Detection Accuracy

In terms of detection accuracy, the YOLOv8s-FPNE model attains a mAP@0.5 of 0.816, outperforming the baseline YOLOv8s (0.804) despite its lighter structure and reduced computational burden.
When compared with DETR-based architectures, Vanilla DETR achieves 0.795, Deformable DETR reaches 0.806, and Lightweight DETR obtains 0.792 in mAP@0.5.
These findings indicate that YOLOv8s-FPNE achieves a more favorable balance between accuracy and efficiency. While DETR variants exhibit strong representational power through self-attention mechanisms, their higher computational costs make them less practical for real-time or embedded applications.
Overall, the YOLOv8s-FPNE model delivers competitive accuracy with markedly lower complexity, highlighting its strong potential for deployment in real-world distracted driving detection systems.

4.10. Inference Speed Evaluation

To further verify the deployment feasibility of the proposed model, we evaluate the inference efficiency of YOLOv8s-FPNE under a realistic hardware configuration (AMD Ryzen 9 6900HX, RTX 3070 Ti Laptop GPU, PyTorch 2.1, CUDA 12.6). As shown in Table 7, the model achieves a median model-side inference latency of 9.0 ms per frame. When decoding and NMS are included, the end-to-end latency stabilizes at 13.7 ms, corresponding to approximately 73 FPS and thus comfortably maintaining real-time processing.
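A hedged sketch of how such model-side latency figures could be measured is given below; the warm-up length, iteration count, and input shape are assumptions, and the authors’ exact benchmarking protocol is not specified.

```python
import time
import torch

@torch.no_grad()
def median_latency_ms(model: torch.nn.Module, device: str = "cuda",
                      shape=(1, 3, 640, 640), iters: int = 200) -> float:
    """Median forward-pass latency in milliseconds for a loaded detector."""
    model.eval().to(device)
    x = torch.randn(*shape, device=device)
    for _ in range(20):                      # warm-up to stabilize clocks and caches
        model(x)
    torch.cuda.synchronize()
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        model(x)
        torch.cuda.synchronize()             # wait for the GPU work to finish
        times.append((time.perf_counter() - t0) * 1e3)
    return sorted(times)[len(times) // 2]
```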
Compared with the baseline YOLOv8s, whose model-side latency reaches 11.5 ms and whose end-to-end delay reaches 16.0 ms, YOLOv8s-FPNE delivers faster and more consistent inference while using fewer parameters and less computation, as summarized in Table 7. Although other ablation variants (e.g., the C2f-Faster or PConv-enhanced versions) show slightly lower model-only latency (8.6–9.0 ms), their overall detection performance remains inferior, as they do not achieve the accuracy improvement of the final integrated YOLOv8s-FPNE design.
In contrast, DETR-based detectors suffer from substantially higher computational latency due to multi-head self-attention and iterative bipartite matching. Under similar hardware conditions, Vanilla DETR typically requires 35–50 ms or more per frame, making it unsuitable for real-time, safety-critical driver-monitoring scenarios.
In summary, the key inference indicators presented in Table 7 demonstrate that YOLOv8s-FPNE achieves a more advantageous speed–accuracy trade-off than both the baseline and the ablation variants. Its sub-14 ms end-to-end latency, together with strong detection accuracy, confirms the model’s practical suitability for real-time and resource-constrained distracted-driving detection systems.

5. Discussion

This research presents a streamlined approach for distracted driving detection, leveraging the YOLOv8 framework. The proposed method integrates the FasterNet module, PConv, the NAM attention mechanism, and the Focal-EIoU loss function.
The experimental outcomes clearly demonstrate that the refined YOLOv8s-FPNE model achieves a substantial reduction in both the parameter count and computational complexity. Remarkably, it manages to preserve detection accuracy at the same time. This makes the model an ideal candidate for deployment on devices with limited resources, where efficiency and performance need to be carefully balanced.
Looking ahead, future research endeavors will focus on several key directions. Firstly, there will be a continued effort to optimize the model structure, aiming to enhance its internal architecture for better performance. In addition, improving detection accuracy will remain a top priority, with the exploration of advanced techniques to refine the model’s ability to accurately identify distracted driving behaviors. Furthermore, additional lightweight strategies will be investigated to further reduce the model’s resource requirements without sacrificing its effectiveness. Finally, the model will be applied in real-world driving scenarios. This real-world testing will not only provide broader validation of the model’s performance but also open up new avenues for its practical application in various driving contexts.

Author Contributions

Conceptualization, Methodology, and Software: S.G.; Validation, Formal Analysis, and Data Curation: S.G. and L.L.; Investigation and Resources: B.R. and F.L.; Writing—Original Draft Preparation: S.G.; Writing—Review and Editing: L.Z. and W.W.; Supervision and Project Administration: L.Z.; Funding Acquisition: L.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Natural Science Fundamental Research Project of Jiangsu Colleges and Universities, grant number: 22KJB170021.

Data Availability Statement

Restrictions apply to the availability of these data. Data were obtained from [State Farm Distracted Driver Detection] and are available at [https://www.kaggle.com/competitions/state-farm-distracted-driver-detection/leaderboard] (accessed on 12 June 2025) with the permission of [State Farm Distracted Driver Detection].

Acknowledgments

The authors would like to express their sincere gratitude to everyone who supported and encouraged them throughout this research. Their contributions have been invaluable to the completion of this study.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Cao, L.; Yang, S.; Ai, C.; Yan, J.; Li, X. Distracted driving behavior detection method based on deep learning. Automot. Technol. 2023, 6, 49–54. [Google Scholar] [CrossRef]
  2. Wei, Q.; Zhu, W.; Jiang, J.; Xie, X. Distracted driving behavior detection based on improved YOLOv4-tiny. J. Sichuan Univ. Sci. Eng. (Nat. Sci. Ed.) 2023, 36, 67–76. [Google Scholar]
  3. Chen, R.; Hu, C.; Hu, X.; Yang, L.; Zhang, J.; He, J. Driver distracted driving detection based on improved YOLOv5. J. Jilin Univ. (Eng. Technol. Ed.) 2024, 54, 959–968. [Google Scholar] [CrossRef]
  4. Cai, J.; Mao, Z.; Li, J.; Wu, X. A review of target detection algorithms and applications based on deep learning. Netw. Secur. Technol. Appl. 2023, 11, 41–45. [Google Scholar]
  5. Chen, H.; Zendehdel, N.; Leu, M.C.; Yin, Z. A gaze-driven manufacturing assembly assistant system with integrated step recognition, repetition analysis, and real-time feedback. Eng. Appl. Artif. Intell. 2025, 144, 110076. [Google Scholar] [CrossRef]
  6. Chen, J.; Kao, S.H.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.H.G. Run, don’t walk: Chasing higher FLOPS for faster neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 12021–12031. [Google Scholar]
  7. Liu, Y.; Shao, Z.; Teng, Y.; Hoffmann, N. NAM: Normalization-based attention module. arXiv 2021, arXiv:2111.12419. [Google Scholar] [CrossRef]
  8. Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157. [Google Scholar] [CrossRef]
  9. Li, Y.; Chen, Y.; Dai, X.; Chen, D.; Liu, M.; Yuan, L.; Vasconcelos, N. Micronet: Improving image recognition with extremely low flops. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 468–477. [Google Scholar]
  10. Qian, S.; Lei, Z.; Yuxiang, Z.; Yi, L.; Shihao, L. Lightweight distracted driving behavior detection method based on improved YOLOv8n. Electron. Meas. Technol. 2025, 47, 65–75. [Google Scholar]
  11. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Adam, H. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  12. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. Shufflenet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
  13. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 658–666. [Google Scholar]
  14. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar]
  15. Gevorgyan, Z. SIoU loss: More powerful learning for bounding box regression. arXiv 2022, arXiv:2205.12740. [Google Scholar] [CrossRef]
  16. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding box regression loss with dynamic focusing mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
  17. Misra, D.; Nalamada, T.; Arasanipalai, A.U.; Hou, Q. Rotate to attend: Convolutional triplet attention module. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 3139–3148. [Google Scholar]
  18. Wan, D.; Lu, R.; Shen, S.; Xu, T.; Lang, X.; Ren, Z. Mixed local channel attention for object detection. Eng. Appl. Artif. Intell. 2023, 123, 106442. [Google Scholar] [CrossRef]
  19. Zhang, J.; Li, X.; Li, J.; Liu, L.; Xue, Z.; Zhang, B.; Wang, C. Rethinking mobile block for efficient attention-based models. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 1389–1400. [Google Scholar]
  20. Pan, X.; Ge, C.; Lu, R.; Song, S.; Chen, G.; Huang, Z.; Huang, G. On the integration of self-attention and convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 815–825. [Google Scholar]
  21. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11534–11542. [Google Scholar]
  22. Li, X.; Hu, X.; Yang, J. Spatial group-wise enhance: Improving semantic feature learning in convolutional networks. arXiv 2019, arXiv:1905.09646. [Google Scholar] [CrossRef]
  23. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient multi-scale attention module with cross-spatial learning. In Proceedings of the ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
Figure 1. YOLOv8s-FPNE model’s network architecture.
Figure 2. FasterNet.
Figure 3. Normalized Attention Mechanism.
Figure 4. Comparison of mAP and loss across models in the ablation study.
Figure 5. Performance comparison of the models in real-world detection tasks.
Table 1. Experimental conditions.

Configuration | Details
CPU | AMD Ryzen 9 6900HX
GPU | NVIDIA GeForce RTX 3070 Ti Laptop
Memory | 16 GB
Deep learning framework | PyTorch 2.1.0
Python | 3.8
CUDA | 12.6
Table 2. Ablation experiment results.

YOLOv8s | C2f-Faster | PConv | Focal-EIoU | NAM | mAP@0.5 (%) | Params (M) | FLOPs (G) | Recall (%)
✓ | | | | | 80.4 | 11.13 | 28.4 | 77.0
✓ | ✓ | | | | 79.0 | 8.30 | 21.4 | 73.2
✓ | ✓ | ✓ | | | 79.8 | 8.72 | 21.7 | 72.6
✓ | ✓ | ✓ | ✓ | | 80.4 | 8.72 | 21.7 | 75.4
✓ | ✓ | ✓ | ✓ | ✓ | 81.6 | 8.71 | 21.7 | 76.0
Table 3. Comparative experiments of different networks.

Method | mAP@0.5 (%) | Params (M) | FLOPs (G)
YOLOv8s | 80.4 | 11.13 | 28.4
MobileNetV3 | 72.5 | 2.56 | 6.1
ShuffleNetV2 | 72.8 | 6.38 | 16.4
C2f-Faster | 79.0 | 8.30 | 21.4
Table 4. Comparison of model parameters, complexity, and mAP under different loss functions.

Loss Function | Params (M) | FLOPs (G) | mAP@0.5
CIoU | 8.71 | 21.7 | 0.799
Focal-EIoU | 8.71 | 21.7 | 0.804
EIoU | 8.71 | 21.7 | 0.792
GIoU | 8.71 | 21.7 | 0.797
DIoU | 8.71 | 21.7 | 0.794
SIoU | 8.71 | 21.7 | 0.801
WIoU | 8.71 | 21.7 | 0.797
Table 5. Comparison of different attention mechanisms.

Attention | mAP@0.5 (%) | Params (M) | FLOPs (G)
Triplet Attention | 79.4 | 8.71 | 21.8
MLCA | None | 8.738 | 67.5
iRMB | None | 10.11 | 63.1
ACmix | 79.6 | 9.55 | 22.4
ECA | 80.2 | 8.72 | 21.7
NAM | 81.6 | 8.72 | 21.7
SGE | 81.1 | 8.72 | 21.7
EMA | 80.9 | 8.81 | 22.8
Table 6. Model parameter comparison.

Name | Params (M) | FLOPs (G) | mAP@0.5 | mAP@0.5:0.95
YOLOv8s | 11.13 | 28.4 | 0.804 | 0.458
YOLOv8s-FPNE | 8.72 | 21.7 | 0.816 | 0.465
Vanilla DETR (ResNet-50) | 42.13 | 69.1 | 0.795 | 0.421
Deformable DETR (ResNet-50) | 37.46 | 46.3 | 0.806 | 0.435
Lightweight DETR | 13.62 | 24.6 | 0.792 | 0.412
Table 7. Key inference metrics of model variants.

Model Variant | Params (M) | FLOPs (G) | End-to-End Latency (ms) | Throughput (FPS)
YOLOv8s (baseline) | 11.13 | 28.4 | 16.0 | 62
+C2f-Faster | 8.30 | 21.4 | 13.1 | 76
+C2f-Faster + PConv | 8.72 | 21.7 | 13.4 | 74
+C2f-Faster + PConv + Focal-EIoU | 8.72 | 21.7 | 13.5 | 74
YOLOv8s-FPNE | 8.71 | 21.7 | 13.7 | 73
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
