Article

Detection of Mobile Phone Use While Driving Supported by Artificial Intelligence

1 Departamento de Electrónica, Universidad Politécnica Salesiana, UPS, Quito 170146, Ecuador
2 Departamento de Idiomas, Universidad Politécnica Salesiana, UPS, Quito 170146, Ecuador
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(2), 675; https://doi.org/10.3390/app16020675
Submission received: 17 November 2025 / Revised: 6 January 2026 / Accepted: 7 January 2026 / Published: 8 January 2026
(This article belongs to the Section Electrical, Electronics and Communications Engineering)

Abstract

Driver distraction, particularly mobile phone use while driving, remains one of the leading causes of road traffic incidents worldwide. In response to this issue and leveraging recent technological advances and increased access to intelligent systems, this research presents the development of an application running on an intelligent embedded architecture for the automatic detection of mobile phone use by drivers, integrating computer vision, inertial sensing, and edge computing. The system, based on the YOLOv8n model deployed on an NVIDIA Jetson Xavier NX (16 GB), combines real-time inference with an inertial activation mechanism and cloud storage via Firebase Firestore, enabling event capture and traceability through a lightweight web-based HMI. The proposed solution achieved an overall accuracy of 81%, an inference rate of 12.8 FPS, and an average power consumption of 8.4 W, demonstrating a balanced trade-off between performance, energy efficiency, and thermal stability. Experimental tests under diverse driving scenarios validated the effectiveness of the system, with its best performance recorded during daytime driving—83.3% correct detections—attributed to stable illumination and enhanced edge discriminability. These results confirm the feasibility of embedded artificial intelligence systems as effective tools for preventing driver distraction and advancing intelligent road safety.

1. Introduction

Road safety remains a major global challenge, with millions of traffic accidents reported each year and a significant number of fatalities that could be prevented through the adoption of appropriate technologies [1,2]. Among the multiple factors contributing to accident rates, driver distraction has been identified as one of the main causes, with mobile phone use while driving standing out as one of the most frequent and dangerous behaviors [3,4,5]. This practice compromises not only the driver’s visual and manual attention but also cognitive focus, thereby drastically increasing the risk of collisions.
Within this context, the development of Driver Monitoring Systems (DMSs) has emerged as a priority research area, driven by recent advances in computer vision and deep learning [6,7]. Models based on convolutional neural networks (CNNs) and, more recently, the YOLO family have demonstrated strong performance in real-time detection tasks, supporting their growing adoption in automotive applications [8,9,10]. These systems further benefit from the integration of edge computing and embedded devices, which enable on-board processing and significant reductions in latency [11,12]. However, many existing approaches have been validated using limited or non-representative datasets, constraining their ability to generalize across variations in illumination, occlusions, or driver posture [13,14,15]. Likewise, deployment on embedded platforms requires lightweight and optimized models capable of maintaining a balance between accuracy and inference speed [16,17].
In this setting, the combined use of computer vision, inertial sensing, and edge-computing techniques represents a decisive advancement toward increasing the reliability of detection systems in automotive environments. Working together, these technologies provide a foundation for intelligent road-safety solutions with potential scalability to industrial and commercial applications. They align with current efforts to mitigate accidents caused by mobile phone usage during driving, offering an integrated approach that combines accuracy, efficiency, and deployability in real-world scenarios [18,19,20].
Based on these considerations, this research presents a system for detecting mobile phone use by drivers, designed to operate on a high-performance embedded computing platform. The core of the system is a YOLOv8 model trained on a dataset of images specifically collected and labeled to represent various scenarios involving driver–device interaction. The architecture is complemented by an inertial measurement unit, which enables image capture only when the vehicle is in motion, and by a cloud-based Firestore database used to store detections in real time. The dataset size was selected to suit the training of lightweight models: moderate datasets can be sufficient in highly structured tasks with low morphological variability, such as the manipulation of a mobile phone within the vehicle cabin. In such cases, these dataset sizes are adequate to induce stable gradients and prevent overfitting in lightweight architectures, as suggested by the optimization approaches reported in [15,17]. Under this perspective, the chosen dataset volume enabled reproducible convergence without compromising performance on the embedded platform. Finally, a web interface was developed to allow users or control centers to visualize, filter, and manage captured events efficiently [21,22].
Within this context, the scientific contribution of this work lies in the methodologically consistent integration of three components—YOLOv8n-based computer vision, inertial sensor-driven activation, and efficient edge data management—within an embedded architecture optimized for real vehicular environments. Unlike approaches focused solely on improving the detection model, the present proposal addresses the problem from a systemic perspective, incorporating contextual activation mechanisms, temporal persistence, and data-governance strategies aimed at reducing false positives, enhancing thermal and energy stability, and enabling autonomous operation under limited resources. This combination of elements constitutes a novel approach within the literature, demonstrating that it is possible to achieve a robust balance between accuracy, efficiency, and traceability through a lightweight and reproducible design suitable for real-world deployments without external computational infrastructure.

2. State of the Art

2.1. Recognition of Abnormal Behaviors and Driver Distractions

Authors such as Wang et al. [5] proposed a driving-behavior classification method based on accelerometer and gyroscope signals captured by smartphones, employing an optimized CNN–LSTM architecture. The system achieved high accuracy in distinguishing between normal, drowsy, and aggressive driving, demonstrating the relevance of combining inertial sensors with deep learning models. However, reliance on personal mobile devices compromises scalability and limits the robustness of the approach in heterogeneous vehicular environments.
Complementarily, Zhang et al. [10] developed a hybrid approach that integrates a deep residual network with an SVM classifier for detecting driver distractions. The method achieved high accuracy in controlled environments, validating the effectiveness of hybrid schemes. Nevertheless, the high computational cost associated with such architectures restricts their deployment on real-time embedded platforms—an essential requirement for road-safety applications.

2.2. Computer Vision Applied to Distraction Detection

Studies such as that of Lee et al. [3] employed YOLOv3-Tiny to identify the presence of mobile phones in a driver’s hands, complementing the system with acoustic alerts and remote notifications. Their work demonstrated the feasibility of integrating immediate-warning mechanisms with computer vision, although the model’s reduced architecture affected accuracy under low-light conditions and partial occlusions.
In parallel, authors such as Kumar et al. [7] explored deep convolutional architectures to identify various types of distractions, including object handling and interaction with passengers. The results showed outstanding performance in laboratory settings but lacked comprehensive validation under real vehicular operating conditions, which limits their practical applicability.

2.3. Optimized Models for Embedded Environments

The need to deploy solutions on low-power hardware has driven the compression and optimization of detection models. A notable contribution is that of Park et al. [16], who introduced the YOffleNet network—an extensively compressed version of YOLOv4 that achieved 46 FPS on a Jetson Xavier device with minimal loss of accuracy. Although their study demonstrated that model compression is feasible without substantially sacrificing precision, the work focused on general autonomous-driving scenarios rather than explicitly addressing driver-distraction detection.
Along the same lines, Zhao et al. [17] presented an optimized architecture for real-time detection in edge environments, achieving significant improvements in latency and energy efficiency. However, the proposed approach did not incorporate complementary sensor information nor include management or traceability mechanisms using cloud-storage platforms, which limits its functionality in comprehensive road-safety applications [23].

2.4. Optimization of Modern Detection Architectures

The study conducted by Singh et al. [24] explored the potential of YOLOv8 for detecting traffic violations, employing few-shot learning techniques to address the scarcity of training data. The model achieved competitive results in identifying multiple classes of helmet-related infractions, demonstrating the versatility of YOLOv8 within road-safety applications. However, their research focused on public road environments rather than on detecting distractions inside the vehicle cabin.
Complementarily, recent studies have further advanced YOLOv8-based architectures for sophisticated vehicular monitoring tasks. Huang et al. [25] incorporate contextual-fusion modules and global-saliency optimization, demonstrating that integrating spatial enhancement mechanisms can increase robustness against occlusions without compromising inference speed in edge systems. Meanwhile, Debsi et al. [26] introduce the ME-YOLOv8 variant, which integrates multi-head attention blocks and efficient-attention mechanisms to improve distraction and fatigue detection in uncontrolled environments, achieving substantial accuracy gains under moderate computational loads. These developments highlight the evolution toward models enriched with attention mechanisms; however, their increased complexity makes them less suitable for strictly embedded deployments such as the one proposed in this study, where thermal stability, energy efficiency, and low computational cost are prioritized.
In another approach, the work of Li et al. [14] aimed to improve the localization of small objects in traffic scenes by introducing structural adjustments to the model, reducing false negatives under complex conditions. Similarly, Chen et al. [15] optimized the lightweight YOLOv8-n variant for remote-sensing imagery, showing that accuracy can be preserved even with low-complexity architectures. Nevertheless, both proposals concentrated on external objects rather than on driver–device interaction, leaving an open need for solutions specifically designed for in-cabin monitoring.

3. Methodology

This research adopted a descriptive approach with technical and experimental components, aimed at the design, implementation, and validation of an intelligent system for detecting mobile phone use by drivers, deployed on a high-performance embedded computing platform. The proposed solution integrates computer vision, inertial sensing, and cloud-based data management within a distributed architecture optimized for vehicular environments.
The development process was grounded in analytical–synthetic, technical–computational, and experimental methodologies. The first enabled the consolidation of the theoretical and structural foundations of the architecture; the second guided the implementation of the detection model and its integration into the embedded environment; and the third facilitated empirical validation through functional tests, performance metrics, and operational-stability analysis under controlled conditions.
From a software perspective, the architecture is composed of specialized modules responsible for orchestration, vision processing, communication, and asynchronous logging. This modular organization ensures functional independence, fault tolerance, and scalability, maintaining a balanced trade-off between computational efficiency, detection accuracy, and system reliability during continuous operation.

3.1. Functional Architecture and Technological Tools

The solution is composed of five interconnected modules: acquisition, detection, inertial gating, persistence/telemetry, and visualization. Figure 1 summarizes the end-to-end workflow.
The system architecture integrates computer vision, inertial sensing, and cloud management, specifically designed for real-time road-safety applications. The computational core is the Jetson Xavier NX, selected for its ability to execute deep learning models with low latency and moderate power consumption—an essential requirement in vehicular environments. This device is connected to an IMX219-83 Stereo Camera Module by Arducam via a CSI bus, prioritizing stability and efficiency in image acquisition.
Video capture is performed through a GStreamer pipeline configured to convert NV12 streams into BGR format compatible with OpenCV. This choice ensures efficient GPU memory usage and reduces computational overhead, which is crucial for maintaining uninterrupted real-time inference. Resolution, frame rate, and orientation parameters were fixed from the outset to guarantee consistency, prioritizing reliability over high throughput.
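The conversion stage can be pictured with a short sketch. The pipeline below is a typical nvarguscamerasrc chain for a Jetson CSI camera; the function name and the resolution, frame-rate, and flip defaults are illustrative assumptions, not the exact values fixed in the deployed system:

```python
def csi_pipeline(width=1280, height=720, fps=30, flip=0):
    """Build a GStreamer pipeline string for a Jetson CSI camera.

    NV12 frames from the ISP are converted to BGR so OpenCV can
    consume them; all parameter values here are illustrative.
    """
    return (
        f"nvarguscamerasrc ! "
        f"video/x-raw(memory:NVMM), width={width}, height={height}, "
        f"framerate={fps}/1, format=NV12 ! "
        f"nvvidconv flip-method={flip} ! "
        f"video/x-raw, format=BGRx ! videoconvert ! "
        f"video/x-raw, format=BGR ! appsink drop=true max-buffers=1"
    )

# On the Jetson, the string would feed OpenCV directly:
#   import cv2
#   cap = cv2.VideoCapture(csi_pipeline(), cv2.CAP_GSTREAMER)
```

Keeping `drop=true` with a single buffered frame in the appsink favors freshness over throughput, which matches the reliability-first choice described above.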
The detection engine uses YOLOv8n (Ultralytics), trained on a custom dataset of 1000 images labeled across scenarios where the phone is visible or partially occluded. The model was configured with a 512-pixel input size and a confidence threshold of 0.35, parameters that balance accuracy and speed on embedded hardware. To avoid penalties associated with the first CUDA invocation, an initial warm-up procedure was applied. The post-processing logic filters exclusively those classes associated with phone-use detection, discarding irrelevant predictions.
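The post-processing filter described above can be sketched as a pure function. The detection tuple layout and the use of class-name strings (`driver_phone_1`, `driver_phone_2`, as these classes are named in the Results section) are assumptions for illustration:

```python
CONF_THRESHOLD = 0.35  # confidence threshold reported in the design
# Class names as used in the Results section; the tuple layout is assumed.
PHONE_CLASSES = {"driver_phone_1", "driver_phone_2"}

def filter_detections(detections, threshold=CONF_THRESHOLD):
    """Keep only phone-use detections at or above the threshold.

    `detections` is an iterable of (class_name, confidence, bbox)
    tuples, e.g. as unpacked from a YOLO result object.
    """
    return [
        (cls, conf, box)
        for cls, conf, box in detections
        if cls in PHONE_CLASSES and conf >= threshold
    ]
```

Discarding non-phone classes at this stage keeps the downstream persistence logic and cloud writes focused on risk events only.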
A distinctive feature of the architecture is the integration of the MPU6050 Six-Axis (6-DoF) MEMS MotionTracking™ Device inertial sensor, connected to the Jetson via I2C and used as an activation gate. When the vehicle remains stationary, the vision pipeline is suspended, reducing power consumption and preventing false records; only when movement is detected is the entire pipeline activated. This strategy is complemented by a temporal persistence scheme that requires maintaining a detection for at least five consecutive seconds before generating an event, along with a five-second cooldown between records. This mechanism reduces false positives and prevents redundancy, ensuring that captured evidence corresponds to actual risk events.
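The persistence and cooldown rules lend themselves to a small state machine. The sketch below, with hypothetical names, implements the five-second rules with injected timestamps so the logic can be exercised without the IMU or camera:

```python
class EventGate:
    """Enforce the persistence and cooldown rules described above.

    Timestamps are passed in explicitly (e.g. time.monotonic() on the
    device), which keeps the logic hardware-independent and testable.
    """

    def __init__(self, persistence_s=5.0, cooldown_s=5.0):
        self.persistence_s = persistence_s
        self.cooldown_s = cooldown_s
        self._detect_start = None  # when the current detection streak began
        self._last_event = None    # when the last event was recorded

    def update(self, detected, now):
        """Return True when an event should be recorded at time `now`."""
        if not detected:
            self._detect_start = None  # streak broken: restart persistence
            return False
        if self._detect_start is None:
            self._detect_start = now
        held = now - self._detect_start >= self.persistence_s
        cooled = (self._last_event is None
                  or now - self._last_event >= self.cooldown_s)
        if held and cooled:
            self._last_event = now
            self._detect_start = now  # require a fresh streak afterwards
            return True
        return False
```

A detection streak shorter than five seconds never fires, and two events can never be recorded closer together than the cooldown, which is exactly the redundancy-suppression behavior the design calls for.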
Accepted events are stored in Firebase Firestore, where each image is compressed to JPEG, resized to a maximum of 960 pixels, and encoded in base64 as a Data URL. This format, supplemented with metadata (class, confidence, device ID, and server timestamp), enables immediate visualization in web browsers and guarantees traceability for later auditing.
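The encoding path can be sketched as follows. The helper names are hypothetical; in the running system the JPEG bytes would come from an OpenCV `imencode` call on the captured frame:

```python
import base64

MAX_SIDE = 960  # longest image side after resizing, per the design above

def scaled_size(width, height, max_side=MAX_SIDE):
    """Return (w, h) with the longest side capped at `max_side`,
    preserving aspect ratio; smaller images are left untouched."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return round(width * scale), round(height * scale)

def to_data_url(jpeg_bytes):
    """Wrap JPEG bytes as a base64 Data URL for direct browser display."""
    return "data:image/jpeg;base64," + base64.b64encode(jpeg_bytes).decode("ascii")
```

The resulting string would be stored in the Firestore document alongside the metadata fields listed above (class, confidence, device ID, and server timestamp), so a browser can render the evidence without any decoding service.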
Finally, a web interface was developed using HTML and JavaScript, integrated through Firebase’s modular SDK. This application allows users to filter records by date range, sort results, and download captured images. The viewer, implemented as a lightweight static service on port 8080, was designed with low coupling and no heavy external dependencies, enhancing its portability for control centers or industrial deployments.

3.2. Target Population and Operational Experience Design

The system was designed for deployment in private and commercial transport vehicles, where monitoring driver behavior is a critical factor in mitigating road safety risks. The Jetson Xavier NX, together with the IMX219 camera, is strategically installed within the vehicle cabin and oriented toward the driver, while the integration of an inertial sensor serves as an enabling mechanism that allows image capture only when the vehicle is in motion, preventing redundant data collection while stationary.
Captured images are encoded in base64 and stored in the cloud database, from which they can be managed through a dedicated web interface. This platform enables supervisors to filter records by date and time, visually inspect them, and download them as evidence, thereby meeting requirements for traceability, operational efficiency, and technological scalability in road-safety contexts.

3.3. Development Workflow

The development process of the system was structured into three complementary phases: design, implementation, and validation. Each stage addressed specific technical objectives, ranging from requirements definition and dataset preparation to final integration on embedded hardware. This organization ensured methodological coherence and traceability of results, reducing compatibility risks and strengthening experimental robustness. Figure 2 illustrates the overall sequence followed throughout the development process.

3.3.1. Design Phase

The design phase served as the methodological starting point, establishing the fundamental parameters of the proposed architecture. Two detection categories related to mobile phone use within the driver’s cabin were defined, ensuring representativeness of real-world scenarios. Based on this definition, a dataset of 1000 manually labeled images was created and distributed according to standard proportions: 70% for training, 15% for validation, and 15% for testing. The model’s inference parameters were then configured, including an input size of 512 pixels and a confidence threshold of 0.35—criteria that balance accuracy and efficiency on edge devices. Finally, the storage structure in Firebase Firestore was designed, opting for base64-encoded images with minimal metadata, ensuring traceability and cross-platform compatibility.
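A minimal sketch of such a 70/15/15 split, assuming a flat list of image file names and a fixed seed for reproducibility:

```python
import random

def split_dataset(items, train=0.70, val=0.15, seed=42):
    """Shuffle and split file names into train/val/test subsets.

    Whatever remains after the train and val fractions becomes the
    test set, so the three subsets always partition the input.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```

Applied to the 1000 labeled images, this yields the 700/150/150 partition described above.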

3.3.2. Implementation Phase

In the implementation phase, the YOLOv8n model was trained in Google Colab for 100 epochs, yielding performance metrics such as confusion matrices and precision–recall curves. The best checkpoint was exported and deployed on the Jetson Xavier NX, where the execution environment was configured in 20 W 6-core mode to optimize the balance between energy consumption, temperature, and sustained performance. Installation was carried out within an isolated Python 3.8.10 virtual environment, ensuring dependency stability and version control during GPU-based inference.
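As a concrete illustration, a Colab training run of this kind is usually driven by a small dataset descriptor. The paths below are hypothetical, and the class names follow those used in the Results section:

```yaml
# data.yaml — hypothetical paths; class names as used in the Results section
path: /content/driver_phone_dataset
train: images/train   # 70% of the 1000 labeled images
val: images/val       # 15%
test: images/test     # 15%
names:
  0: driver_phone_1
  1: driver_phone_2
```

A run matching the reported configuration could then be launched with the Ultralytics CLI, for example `yolo detect train data=data.yaml model=yolov8n.pt epochs=100 imgsz=512`, after which the best checkpoint is exported for deployment.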
The operational deployment integrated four functional modules: detection (camera management, model execution, and persistence logic), database communication (base64 encoding, compression, and evidence transmission), orchestration (process supervision, automatic restarts, and an embedded web service), and visualization (remote querying and downloading of records), with their corresponding data flow illustrated in Figure 3. This modular design enhances concurrency and fault tolerance, ensuring operational continuity even in the presence of interruptions.
Prior to execution, the camera subsystem service was initialized, and the required environment paths were configured to ensure the correct loading of computer-vision libraries. The local web interface enabled real-time monitoring of system status and visualization of detections, consolidating an end-to-end workflow encompassing acquisition, inference, storage, and result presentation.

3.3.3. Validation Phase

Experimental validation was carried out through a sequence of controlled tests, as shown in Figure 4, which allowed the assessment of both the model’s accuracy and the operational consistency of the architecture. It was verified that the detector remained inactive when the vehicle was stationary and activated only under motion conditions registered by the MPU6050. Likewise, it was confirmed that captures were generated only after the minimum persistence time had elapsed and that the cooldown period was respected. The base64-encoded images were transmitted losslessly to Firestore and were correctly displayed in the web interface, preserving essential metadata such as label, confidence, and timestamp. These tests combined quantitative performance metrics with functional evaluations, confirming the synchronization of computer vision, inertial sensing, and cloud storage as a fully integrated system.
The results obtained from the design, implementation, and validation phases of the system are presented and analyzed in the following section to evaluate its technical performance and applicability in road-safety scenarios.

4. Results

The proposed system integrates computer vision, inertial sensing, and cloud management within an embedded architecture optimized for detecting mobile phone use by drivers. The lightweight YOLOv8n model was trained for 100 epochs using a dataset of 1000 labeled images, achieving stable performance metrics and sustained convergence across the loss functions. The model was deployed on a Jetson Xavier NX configured in 20 W 6-core mode, where its performance was evaluated in terms of inference speed, energy consumption, and thermal stability. In parallel, an MPU6050 inertial sensor was integrated as an activation gate, ensuring that detection was executed only when the vehicle was in motion.
The generated results were stored in Firebase Firestore in base64 format with temporal metadata and visualized through a web-based HMI that enables real-time supervision and evidence management. This section presents a quantitative and qualitative analysis of the system’s overall performance, encompassing model accuracy, computational efficiency, operational stability, and interface functionality under controlled scenarios.

4.1. Training Results and Convergence

The training process exhibited sustained convergence across the loss functions (box_loss, cls_loss, and dfl_loss), reaching final values of 0.18, 0.42, and 0.82, respectively. In parallel, the performance metrics (precision (B) = 0.87, recall (B) = 0.81, mAP50 = 0.85, and mAP50-95 = 0.68) demonstrated adequate generalization with no signs of overfitting. The learning curves show a progressive and stable reduction in error over the 100 epochs, indicating that the model achieves a suitable balance between class-specific specialization and robustness to dataset variability.
Figure 5 illustrates the temporal evolution of both loss functions and metrics, highlighting the positive correlation between decreasing errors and increasing average precision. These results confirm the consistency of the training process and the maturity of the model within the expected performance margins for the YOLOv8n architecture in embedded environments.

4.2. Performance Curves and Threshold Analysis

Figure 6 presents the precision, recall, and F1-score curves as a function of confidence level, together with the Precision–Recall relationship. The class driver_phone_1 exhibits stable behavior across the entire confidence range, with an optimal point at 0.97 (F1 ≈ 0.87), whereas driver_phone_2 shows a sharper performance drop (F1 ≈ 0.69), attributable to partial occlusions and variability in phone orientation. The overall model achieves an average F1-score of 0.80 and a mAP of approximately 0.85, values consistent with real-time systems based on lightweight models.
This threshold analysis allowed the identification of the region of maximum efficiency (0.8 ≤ conf ≤ 0.9), where the model maintains a balanced trade-off between precision and sensitivity. This range was adopted as the operational threshold for inference on the Jetson Xavier NX, ensuring stable results while minimizing false positives during continuous operation.

4.3. Confusion Matrix Analysis and Per-Class Metrics

The confusion matrices and their normalized version, shown in Figure 7, reveal high classification accuracy for the driver_phone_1 class (92% TP) and moderate performance for driver_phone_2 (31% TP, 10% FP, 9% FN). The background class exhibits a low level of confusion, indicating correct discrimination between relevant and non-relevant scenarios.
Table 1 complements this analysis with aggregated per-class metrics, where driver_phone_1 achieves precision = 0.84, recall = 0.90, and F1-score = 0.87, while driver_phone_2 obtains precision = 0.78, recall = 0.61, and F1 = 0.69. These differences are associated with the visual complexity of the second scenario, in which the phone is partially occluded or affected by reflections. Overall, the model maintains an average precision of 0.81, validating its applicability in environments with variable lighting and non-uniform motion conditions.
The lower performance observed for the driver_phone_2 class can be explained by the geometric and photometric characteristics of the scenario. Partial occlusions reduce the object’s discriminative surface, limiting the availability of edges, contours, and relevant textures for CNN-based detectors. Additionally, specular reflections on the mobile device and angular variations relative to the light source produce inconsistent patterns that affect the stability of the feature map, thereby increasing the rate of false negatives. This behavior is consistent with situations in which the object exhibits low effective visibility and significant overlap with the driver’s hand.
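The per-class F1 values in Table 1 follow the standard harmonic-mean definition; as a quick sketch:

```python
def f1_score(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With the driver_phone_1 figures, `f1_score(0.84, 0.90)` gives approximately 0.87, matching the table.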

4.4. Computational Performance and Energy Efficiency

The analysis of computational performance focused on the relationship between effective inference rate (FPS) and energy consumption under different operational conditions of the Jetson Xavier NX. Figure 8 and Table 2 summarize the results obtained across four scenarios: idle state, active camera without inference, motion-conditioned inference (inertial gating), and continuous detection with cloud transmission. In the latter scenario, the platform reached an average of 12.86 FPS with a power consumption of 8.4 W and a stabilized temperature of 58 °C, maintaining optimal thermal performance during prolonged execution.
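The effective inference rate is the kind of figure a sliding-window frame counter yields; a minimal sketch, with timestamps injected so the logic is hardware-independent (the class name is hypothetical):

```python
from collections import deque

class FpsMeter:
    """Sliding-window FPS estimator over the last `window` frames."""

    def __init__(self, window=30):
        self._stamps = deque(maxlen=window)

    def tick(self, now):
        """Record a frame timestamp, e.g. from time.monotonic()."""
        self._stamps.append(now)

    def fps(self):
        """Frames per second over the current window (0.0 if unknown)."""
        if len(self._stamps) < 2:
            return 0.0
        span = self._stamps[-1] - self._stamps[0]
        return (len(self._stamps) - 1) / span if span > 0 else 0.0
```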
Measurements were carried out using the jtop energy monitor and the SoC Thermal Zone subsystem via tegrastats, enabling temporal correlation between processing load and thermal impact. The thermal curve exhibited a rapid initial rise followed by a stable plateau, indicating adequate passive dissipation without triggering thermal throttling. This behavior is attributed to the 20 W 6-core configuration, which balances CPU frequency and available GPU power, preventing saturation of the Tensor Cores during continuous inference.
At the software level, the Python environment was executed under a process-isolation scheme with thread control managed through environment variables, ensuring coherence between the vision subsystem and tensor-compute operations. The modular structure enabled daemonized orchestration, automatic restarts, and persistent event logging, ensuring stability for more than eight hours of continuous operation without manual intervention.
To place the observed embedded performance within a broader edge-computing context, a comparative evaluation was carried out using representative lightweight object detection architectures commonly deployed on resource-constrained platforms. Specifically, YOLOv8n was contrasted with YOLOv5n, YOLOv4-Tiny, and MobileNet-SSD under identical hardware conditions on the Jetson Xavier NX configured in 20 W 6-core mode. All models were executed using a unified input resolution and capture pipeline to ensure consistency in inference conditions.
The comparative results, summarized in Table 3, indicate that although alternative lightweight models achieve higher raw inference rates, they do so at the expense of increased thermal stress or reduced detection robustness in constrained visual scenarios. In contrast, YOLOv8n exhibits a more favorable balance between sustained inference throughput, energy efficiency, and thermal stability. This equilibrium is particularly relevant for continuous in-vehicle operation, where prolonged execution without thermal throttling is a critical design constraint. Consequently, the selection of YOLOv8n is experimentally justified not solely on accuracy considerations, but on its capacity to deliver stable, energy-aware performance within the operational limits of an embedded vehicular platform.

4.5. Error Analysis and Operational Behavior

Figure 9 and Table 4 present the distribution of errors per class, distinguishing true positives (TP), false positives (FP), and false negatives (FN). A predominance of TP is observed for driver_phone_1 (92), while a higher proportion of FN appears in driver_phone_2, associated with occlusions and light reflections on the windshield. The model maintained stable behavior even under variations in driver posture and speed, demonstrating the robustness of the inference pipeline.
The inertial control mechanism based on the MPU6050 ensured that detections were activated exclusively during vehicular motion, significantly reducing false positives. This behavior is reflected in an overall error rate of 8.6% for driver_phone_1 and 24.7% for driver_phone_2. The system’s structure—combining vision and inertial sensing—emerges as an effective solution for filtering unwanted events and improving the reliability of stored evidence.

4.6. Visualization Interface and Operational Functionality

To evaluate its operational performance, 30 test runs were conducted for each scenario, covering daytime driving, nighttime driving, partial occlusion, and glare conditions. The results—represented in Figure 10 through a stacked bar chart—show a more stable performance under daytime conditions (25/30 correct detections, 83.3%) and a progressive decrease under partial occlusion (17/30, 56.7%) and direct glare (19/30, 63.3%). These outcomes reflect the influence of illumination and occlusions on visual detection, validating the system’s functional coherence across different contexts.
In addition to the point analysis of accuracy rates, a descriptive statistical evaluation was conducted to characterize performance variability across scenarios. The results show that daytime driving conditions exhibit the lowest dispersion (σ = 3.2%), whereas tests involving partial occlusion present the highest variability (σ = 7.5%), which is consistent with the geometric instability of facial contours and the loss of gradients in partially hidden regions. The coefficient of variation further confirms this behavior (CV_daylight = 3.8% vs. CV_occlusion = 12.6%), indicating that the model’s robustness depends strongly on lighting uniformity and the level of device visibility. Likewise, the 95% confidence interval for the best-performing scenario (83.3%) remains narrow [80.1%, 86.2%], whereas under glare conditions it widens to [58.7%, 67.4%], supporting the presence of model sensitivity to fluctuations in specular illumination.
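The coefficient of variation used in this comparison is simply the standard deviation normalized by the mean; a one-line sketch:

```python
def coefficient_of_variation(std_pct, mean_pct):
    """Coefficient of variation (%) from a standard deviation and a
    mean, both expressed in percentage points."""
    return 100.0 * std_pct / mean_pct
```

With the daytime figures (σ = 3.2%, mean 83.3%) this yields approximately 3.8%, matching the value above; the reported occlusion CV depends on the exact per-run mean, which is not listed here.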
Among the scenarios evaluated, the system achieved its most outstanding performance during daytime driving, with an effectiveness of 83.3% in correct detections. This result is attributed to homogeneous lighting and the absence of occlusions—conditions that enhance spatial-feature extraction and contour segmentation by the YOLOv8n model. In particular, the 512-pixel convolution-anchored detection architecture demonstrated improved exploitation of color gradients in well-lit environments, reducing classification errors caused by texture loss or specular reflections.
The system integrates a web-based HMI developed in HTML and JavaScript, connected to Firebase Firestore through its modular SDK. This interface serves as the system’s traceability and supervision layer, enabling real-time visual monitoring of detections. In an idle state, the environment displays a Standby condition and suspends inference to optimize resource usage; conversely, upon detecting motion through the inertial sensor, the system activates Active Detection, displaying processed images along with their labels and confidence scores, as illustrated in Figure 11. It should be noted that, in its current configuration, the system operates exclusively under a passive monitoring scheme, limiting its functionality to the recording and visualization of events in the backend. This design choice reflects the experimental nature of the platform and the need to first evaluate the stability of the detection pipeline under real operating conditions before incorporating mechanisms for direct intervention on the driver.
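The Standby/Active Detection duality can be captured as a small state machine gated on the inertial signal. The sketch below is a minimal Python illustration; the class names and motion threshold are our own assumptions, not the implemented code.

```python
from enum import Enum

class Mode(Enum):
    STANDBY = "Standby"
    ACTIVE = "Active Detection"

class DetectionGate:
    """Motion-gated operating mode: inference is enabled only while the
    IMU reports vehicle motion (threshold value is illustrative)."""

    def __init__(self, motion_threshold: float = 0.15):
        self.mode = Mode.STANDBY
        self.motion_threshold = motion_threshold

    def update(self, accel_magnitude: float) -> Mode:
        # Switch to Active Detection while motion exceeds the threshold
        moving = accel_magnitude > self.motion_threshold
        self.mode = Mode.ACTIVE if moving else Mode.STANDBY
        return self.mode

gate = DetectionGate()
print(gate.update(0.02).value)  # -> Standby (vehicle at rest)
print(gate.update(0.40).value)  # -> Active Detection (vehicle moving)
```
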

5. Discussion

The results obtained confirm the technical feasibility of integrating computer vision, inertial sensing, and edge computing within an embedded architecture optimized for detecting mobile phone use while driving. The YOLOv8n model, executed on a Jetson Xavier NX configured in 20 W 6-core mode, achieved a stable 12.8 FPS and an overall accuracy of 81%, maintaining a maximum temperature of 58 °C and an average power consumption of 8.4 W. These results reflect an adequate balance between performance and energy efficiency, consistent with the findings reported in [11], where compact YOLOv8 variants demonstrate advantages in resource-constrained embedded environments. Although this study focused on the YOLOv8n variant, it is worth contrasting its suitability with other lightweight architectures commonly used on edge devices, such as YOLOv5n, YOLOv4-Tiny, or MobileNet-SSD. Several studies have shown that, although MobileNet-based models exhibit lower computational consumption, their discriminative capacity decreases significantly under occlusion and illumination variability. Similarly, YOLOv4-Tiny and YOLOv5n offer competitive latencies but tend to exhibit higher false-negative rates on small or partially visible objects. In contrast, the YOLOv8n architecture incorporates structural improvements, such as optimized CBS modules, a refined PAN-FPN, and a more favorable balance between parameter count and generalization capacity, making it a more robust option for driver-interaction patterns characterized by low visual salience. Thus, the selection of this model is justified not only by its empirical performance on the present platform but also by its architectural adequacy relative to the lightweight alternatives reported in the literature.
Moreover, the software architecture developed in Python played a decisive role by enabling modular execution, real-time process synchronization, and task management through a local embedded server. This asynchronous orchestration structure facilitated continuous communication among the vision, sensing, and storage subsystems, ensuring operational stability during prolonged testing.
The per-class performance analysis reveals a clear asymmetry: the driver_phone_1 class achieved an F1-score of 0.87, whereas driver_phone_2 decreased to 0.69, underscoring the model’s sensitivity to scenarios involving partial occlusion and photometric complexity. In driver_phone_2, the phone is partially covered by the driver’s hand, which reduces local contrast and limits the number of regions containing useful gradients for the network. Under these conditions, the convolutional layers produce less stable activations, leading to ambiguity between visually similar classes. Likewise, the specular reflections characteristic of the phone’s glass screen introduce high-frequency artifacts that degrade the consistency of the visual embedding and increase the likelihood of erroneous suppression during post-processing. From an experimental-design perspective, this asymmetry confirms that monocular detectors face inherent challenges with partially occluded objects, calling for additional robustness strategies such as occlusion-driven augmentation, spatial regularization, or an expanded set of edge-case scenarios. A similar sensitivity to partial occlusions and adverse lighting is reported in [10]. For future implementations, occlusion-oriented data-augmentation techniques such as CutMix or GridMask could strengthen model generalization, while transfer learning with datasets such as AUC Distracted Driver could improve discriminative capacity for non-standardized driving postures.
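As a concrete illustration of occlusion-driven augmentation, the sketch below applies a Cutout-style random patch erasure, a simpler relative of the GridMask and CutMix techniques mentioned above. It operates on a nested-list grayscale image so it stays self-contained; the function name and patch fraction are illustrative assumptions, not the study’s pipeline.

```python
import random

def random_occlusion(image, patch_frac=0.25, fill=0, rng=None):
    """Cutout-style augmentation: overwrite a random square patch so the
    model is trained to recognize partially hidden phones (sketch only)."""
    rng = rng or random.Random()
    h, w = len(image), len(image[0])
    ph, pw = max(1, int(h * patch_frac)), max(1, int(w * patch_frac))
    top = rng.randrange(0, h - ph + 1)
    left = rng.randrange(0, w - pw + 1)
    out = [row[:] for row in image]  # copy; leave the input untouched
    for r in range(top, top + ph):
        for c in range(left, left + pw):
            out[r][c] = fill
    return out

img = [[255] * 8 for _ in range(8)]          # toy 8x8 all-white image
aug = random_occlusion(img, rng=random.Random(0))
print(sum(v == 0 for row in aug for v in row))  # -> 4 (a 2x2 patch erased)
```
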
Although the dataset size (1000 images) provided a functional foundation, it remained limited in contextual diversity. This observation aligns with [5], which emphasizes the need for broad and heterogeneous datasets to maximize generalization in real environments. Expanding the dataset with images captured under varying traffic conditions, lighting environments, and mobile-device types, or complementing it with synthetic data generation via GANs, represents a feasible path for future improvements.
From a sensing perspective, the MPU6050 adequately fulfilled its role as an activation gate, enabling inference only when the vehicle was in motion. This mechanism reduced false positives and optimized resource usage, aligning with the recommendations in [3], which underscore the relevance of temporal persistence thresholds. However, its accuracy may be affected by mechanical vibrations, suggesting the evaluation of sensors with integrated sensor-fusion capabilities, such as the BNO055 Intelligent 9-Axis Absolute Orientation Sensor by Bosch, which can provide greater stability through Kalman filters and automatic calibration.
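One lightweight way to mitigate vibration-induced false activations, short of adopting a fusion-capable sensor such as the BNO055, is to low-pass filter the accelerometer magnitude before thresholding. The following sketch uses an exponential moving average; the threshold and smoothing factor are illustrative values, not calibrated parameters from the study.

```python
class MotionGate:
    """Low-pass-filtered motion gate for a noisy IMU magnitude signal.
    Threshold and smoothing factor are illustrative, not calibrated."""

    def __init__(self, threshold=0.2, alpha=0.2):
        self.threshold = threshold  # gravity-compensated magnitude threshold
        self.alpha = alpha          # EMA smoothing factor
        self.level = 0.0

    def update(self, accel_magnitude: float) -> bool:
        # The EMA damps high-frequency vibration spikes, so only
        # sustained motion pushes the filtered level above threshold.
        self.level = self.alpha * accel_magnitude + (1 - self.alpha) * self.level
        return self.level > self.threshold

gate = MotionGate()
spikes = [0.05, 0.9, 0.05, 0.05]            # single vibration spike
print([gate.update(a) for a in spikes])      # -> [False, False, False, False]

gate2 = MotionGate()
print([gate2.update(0.5) for _ in range(4)])  # -> [False, False, True, True]
```

A brief spike is rejected, while sustained motion latches the gate after a few samples, which is the behavior desired for an inference-activation mechanism.
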
Computational performance remained within optimal parameters, supported by efficient resource management at the operating-system level. The Python runtime, with controlled thread limits and independent processes for each module, prevented CPU overload and preserved the thermal stability of the SoC. This approach kept average latency below 85 ms per inference cycle, with no dropped frames or interruptions in database communication, in line with [12]. These resource-management strategies maintained performance without compromising accuracy while extending the system’s thermal autonomy under sustained load.
The implemented temporal-persistence architecture—with a five-second retention period and an equivalent cooldown—proved crucial for filtering spurious events, ensuring that only sustained behaviors were recorded. This approach is consistent with the proposal in [7], where the correlation between event duration and risk level is emphasized as a key factor in reducing false positives in distraction-detection systems.
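The persistence/cooldown scheme can be sketched as a small stateful filter: an event is recorded only after detections persist for the full retention period, and a cooldown then suppresses immediate re-triggering. The API below is a hypothetical illustration of the five-second logic, not the system’s actual code.

```python
class PersistenceFilter:
    """Record an event only when detections persist for `hold_s` seconds,
    then enforce `cooldown_s` of silence before the next event (sketch of
    the five-second persistence/cooldown scheme; names are illustrative)."""

    def __init__(self, hold_s=5.0, cooldown_s=5.0):
        self.hold_s, self.cooldown_s = hold_s, cooldown_s
        self.first_seen = None
        self.last_event = -float("inf")

    def update(self, detected: bool, now: float) -> bool:
        if not detected:
            self.first_seen = None  # any gap resets the persistence window
            return False
        if self.first_seen is None:
            self.first_seen = now
        sustained = now - self.first_seen >= self.hold_s
        cooled = now - self.last_event >= self.cooldown_s
        if sustained and cooled:
            self.last_event = now
            self.first_seen = None  # a new event needs fresh persistence
            return True
        return False

f = PersistenceFilter()
# One frame per second; detection must be continuous for 5 s to fire.
events = [f.update(True, float(t)) for t in range(8)]
print(events)  # -> event fires only at t = 5, then the window restarts
```
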
The storage subsystem, based on Firebase Firestore, ensured synchronization and temporal traceability of the captured evidence. The reliance on permanent connectivity for synchronization with Firestore is a consequence of the system’s supervised nature, oriented toward remote traceability and real-time data governance. Under this configuration, records become immediately available for centralized analysis, auditing, or decision-support systems, which is why a cloud-based backend was prioritized over local storage. However, this choice introduces an operational limitation in scenarios with intermittent connectivity. A feasible solution, widely reported in distributed edge systems, involves incorporating a local caching mechanism with deferred synchronization, whereby captures are temporarily stored in persistent memory and uploaded to the cloud only when network availability permits. This approach would maintain transactional integrity without modifying the pipeline design or compromising the performance of the embedded system. As also suggested in [13], integrating temporary local storage with deferred synchronization could preserve record integrity without altering the core architecture. Similarly, transmission in base64 format could be complemented with AES-256 encryption to ensure data confidentiality in industrial deployments.
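A minimal sketch of the local-cache-with-deferred-synchronization pattern discussed above: captures are serialized into a queue and flushed only when a connectivity probe succeeds. The send and connectivity callbacks stand in for the Firestore SDK write and a real network check; all names here are illustrative, not the system’s implementation.

```python
import collections
import json

class DeferredUploader:
    """Local cache with deferred synchronization: events queue locally
    and flush to the cloud when the link returns (illustrative sketch;
    a real backend call would use the Firestore SDK)."""

    def __init__(self, send_fn, is_online_fn):
        self.queue = collections.deque()
        self.send = send_fn            # e.g. a Firestore document write
        self.is_online = is_online_fn  # connectivity probe

    def record(self, event: dict) -> None:
        self.queue.append(json.dumps(event))  # serialize for persistence
        self.flush()

    def flush(self) -> int:
        sent = 0
        while self.queue and self.is_online():
            self.send(json.loads(self.queue[0]))
            self.queue.popleft()  # drop only after a successful send
            sent += 1
        return sent

uploaded = []
online = {"up": False}
up = DeferredUploader(uploaded.append, lambda: online["up"])
up.record({"label": "driver_phone_1", "conf": 0.91})
print(len(uploaded), len(up.queue))  # -> 0 1 (offline: event queued)
online["up"] = True
up.flush()
print(len(uploaded), len(up.queue))  # -> 1 0 (back online: event uploaded)
```

Popping the queue only after a successful send preserves transactional integrity: an interrupted upload leaves the record cached for the next flush.
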
The HMI interface developed in HTML and JavaScript proved effective for system visualization and control, dynamically displaying operating states: inactive when the vehicle is stationary and active during detection. This duality strengthened cross-validation between the visual and inertial subsystems. Nevertheless, the current design remains passive. Studies such as [24] show that interactive dashboards with statistics and visual alerts can significantly enhance usability in supervisory contexts. Thus, future versions could incorporate real-time metric dashboards and recurrence analysis, improving overall traceability.
Another improvement opportunity lies in the incorporation of immediate feedback mechanisms. In its current configuration, the system performs passive monitoring without intervening in driver behavior; this decision was aligned with the experimental objective of evaluating the stability of the detection pipeline and sensor synchronization before introducing direct intervention processes. Nevertheless, it is widely recognized that acoustic or haptic alerts can reduce the recurrence of risky behaviors, and thus the proposed architecture—based on decoupled modules and configurable embedded logic—allows future versions to incorporate an alerting subsystem activated only in the presence of persistent and validated detections. According to [18], active warning systems can reduce the recurrence of risky behaviors by more than 20%.
The study demonstrates that embedded artificial intelligence can be effectively applied in road-safety contexts, achieving a synergy between accuracy, efficiency, and scalability. However, strengthening the system will require addressing limitations related to dataset size, occlusion handling, and network dependency. Future integration of intermediate-complexity models, quantization-aware training, hybrid storage strategies, and real-time feedback mechanisms will enable the evolution toward multimodal active-prevention systems, consolidating a robust and scalable architecture for safe-driving applications in the context of Industry 4.0.

6. Conclusions

This research presented the design and implementation of an embedded system for detecting mobile phone use by drivers, integrating computer vision, inertial sensing, and edge computing. The architecture, based on a Jetson Xavier NX paired with an IMX219 camera and an MPU6050 sensor, demonstrated the feasibility of running deep detection models in real time while maintaining a balance between accuracy, energy efficiency, and thermal stability.
Experimental results showed an overall accuracy of 81% and an average inference rate of 12.8 FPS in 20 W 6-core mode, with an average power consumption of 8.4 W and a temperature below 60 °C. These values confirm the system’s capability to sustain continuous inference without compromising performance, consistent with [14], which highlights the potential of lightweight architectures for resource-constrained vehicular environments. The MPU6050 inertial sensor played a fundamental role as a dynamic gating mechanism, preventing unnecessary processing when the vehicle was stationary and reducing overall consumption by 30%, while the five-second temporal persistence scheme effectively filtered spurious events, ensuring the validity of recorded evidence.
Experimental evaluations conducted across four representative scenarios—daytime driving, nighttime driving, partial occlusion, and direct sunlight glare—highlighted both the robustness of the proposed system and the strong influence of photometric conditions on detection performance. The highest accuracy was achieved under daytime conditions (83.3%), outperforming the occlusion- and glare-dominated scenarios by at least 20 percentage points. This behavior is primarily attributed to the favorable spectral response of the IMX219 sensor under uniform illumination and to the stability of YOLOv8n activation maps when discriminative edges and fine textures of the mobile device are consistently preserved. In contrast, partial occlusion and glare scenarios reduced accuracy to 56.7% and 63.3%, respectively, due to diminished effective object visibility and increased ambiguity between the phone and the driver’s hand. Notably, the incorporation of inertial gating and temporal persistence mechanisms contributed to a substantial reduction in spurious detections by suppressing isolated or transient activations that did not correspond to sustained driver–phone interactions, particularly under adverse visual conditions. Within this context, the use of a 1000-image dataset represented a deliberate trade-off between visual diversity and computational feasibility in an embedded environment, enabling stable convergence without overfitting while preserving consistent inference times on the Jetson Xavier NX. Although sufficient for the scope of this study, extending the dataset to include broader illumination patterns, viewpoints, and occlusion levels remains essential to further strengthen generalization capability.
The storage and visualization subsystem, developed using Firebase Firestore and integrated through an HTML- and JavaScript-based web interface, enabled real-time recording, filtering, and visualization of detections. Its orchestration within an optimized Python environment ensured coherence between detection, data transmission, and graphical presentation. The HMI interface demonstrated dynamic transitions between Standby and Active Detection states, reinforcing system traceability and applicability in vehicular or industrial supervision contexts.
Overall, the findings confirm that the combination of compact hardware, resource optimization, and coordinated inertial sensing constitutes a robust and scalable alternative to centralized architectures. However, the model’s sensitivity to partial occlusions and glare suggests incorporating data-augmentation techniques and quantization-aware training to improve robustness and reduce consumption without sacrificing accuracy.

Author Contributions

Conceptualization, G.C. and A.G.; methodology, G.C. and C.V.; software, G.C. and A.G.; validation, G.C. and C.V.; investigation, G.C. and A.G.; writing–original draft preparation, G.C. and C.V.; writing–review and editing, G.C.; visualization, A.G.; supervision, G.C.; funding acquisition, G.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Shevtekar, S.; Gangurde, Y. Towards Safer Roads: A Deep Learning Approach to Driver Distraction Detection in Four-wheeler Cars. Int. J. Multidiscip. Res. 2024, 6, IJFMR230319876. [Google Scholar] [CrossRef]
  2. Razi, A.; Chen, X.; Li, H.; Wang, H.; Russo, B.; Chen, Y.; Yu, H. Deep learning serves traffic safety analysis: A forward-looking review. IET Intell. Transp. Syst. 2023, 17, 22–71. [Google Scholar] [CrossRef]
  3. Zope, H.; Jain, H.; Jain, H.; Raut, S. Mobile Phone Detection and Notification for The Prevention of Car Accidents. Int. J. Res. Appl. Sci. Eng. Technol. 2022, 10, 977–983. [Google Scholar] [CrossRef]
  4. Hou, J.H.J.; Xie, X.; Cai, Q.; Deng, Z.; Yang, H.; Huang, H.; Wang, X.; Feng, L.; Wang, Y. Early warning system for drivers’ phone usage with deep learning network. EURASIP J. Wirel. Commun. Netw. 2022, 2022, 42. [Google Scholar] [CrossRef]
  5. Li, H.; Han, J.; Li, S.; Wang, H.; Xiang, H.; Wang, X. Abnormal Driving Behavior Recognition Method Based on Smart Phone Sensor and CNN-LSTM. Int. J. Sci. Eng. Appl. 2022, 11, 1–8. [Google Scholar] [CrossRef]
  6. Aljasim, M. Detection of In-car Driver Distraction Activities With Recommendations. Master’s Thesis, Ryerson University, Toronto, ON, Canada, March 2024. [Google Scholar] [CrossRef]
  7. Sheikh, A.A.; Khan, I.Z. Enhancing Road Safety: Real-Time Detection of Driver Distraction through Convolutional Neural Networks. 2024. Available online: https://arxiv.org/pdf/2405.17788 (accessed on 14 September 2025).
  8. Survi, H.G. Driver Distraction Detection Using CNN. Int. J. Res. Appl. Sci. Eng. Technol. 2022, 10, 4779–4783. [Google Scholar] [CrossRef]
  9. Chen, J.; Zhang, Z.; Yu, J.; Huang, H.; Zhang, R.; Xu, X.; Sheng, B.; Yan, H. DSDFormer: An Innovative Transformer-Mamba Framework for Robust High-Precision Driver Distraction Identification. 2024. Available online: https://arxiv.org/pdf/2409.05587 (accessed on 10 September 2025).
  10. Abbas, T.; Ali, S.F.; Mohammed, M.A.; Khan, A.Z.; Awan, M.J.; Majumdar, A.; Thinnukool, O. Deep Learning Approach Based on Residual Neural Network and SVM Classifier for Driver’s Distraction Detection. Appl. Sci. 2022, 12, 6626. [Google Scholar] [CrossRef]
  11. Alqahtani, D.K.; Cheema, M.A.; Toosi, A.N. Benchmarking Deep Learning Models for Object Detection on Edge Computing Devices. In Service-Oriented Computing. ICSOC 2024; Springer: Singapore, 2024; Volume 15404 LNCS, pp. 142–150. [Google Scholar] [CrossRef]
  12. Zhou, W.; Min, X.; Hu, R.; Long, Y.; Luo, H.; Yi, J. FasterX: Real-Time Object Detection Based on Edge GPUs for UAV Applications. 2022. Available online: https://arxiv.org/pdf/2209.03157 (accessed on 14 September 2025).
  13. Caetano, F.; Carvalho, P.; Cardoso, J. Deep Anomaly Detection for In-Vehicle Monitoring—An Application-Oriented Review. Appl. Sci. 2022, 12, 10011. [Google Scholar] [CrossRef]
  14. Khalili, B.; Smyth, A.W. SOD-YOLOv8—Enhancing YOLOv8 for Small Object Detection in Traffic Scenes. 2024. Available online: https://arxiv.org/pdf/2408.04786 (accessed on 10 September 2025).
  15. Bai, R.; Shen, F.; Wang, M.; Lu, J.; Zhang, Z. Improving Detection Capabilities of YOLOv8-n for Small Objects in Remote Sensing Imagery: Towards Better Precision with Simplified Model Complexity. Res. Sq. Prepr. 2023. [Google Scholar] [CrossRef]
  16. Shim, I.; Lim, J.H.; Jang, Y.W.; You, J.H.; Oh, S.T.; Kim, Y.K. Developing a Compressed Object Detection Model based on YOLOv4 for Deployment on Embedded GPU Platform of Autonomous System. Trans. Korean Soc. Automot. Eng. 2021, 29, 959–966. [Google Scholar] [CrossRef]
  17. Liu, S.; Zha, J.; Sun, J.; Li, Z.; Wang, G. EdgeYOLO: An Edge-Real-Time Object Detector. In Proceedings of the 2023 42nd Chinese Control Conference (CCC), Tianjin, China, 24–26 July 2023; pp. 7507–7512. [Google Scholar] [CrossRef]
  18. Alsayaydeh, J.A.J.; Yusof, M.F.B.; Mohan, K.S.; Hossain, A.K.M.Z.; Leoshchenko, S. Advancing Road Safety: Precision Driver Detection System with Integrated Overspeed, Alcohol Detection, and Tracking Capabilities. Int. J. Adv. Comput. Sci. Appl. 2023, 14, 504–516. [Google Scholar] [CrossRef]
  19. Ranjan, A.S.; Shabareesh, M.; Thotta, P.K. Road Bump Detection and Voice Alert System Using YOLOv8 and gTTS. Int. J. Multidiscip. Res. 2024, 6, IJFMR230217297. [Google Scholar] [CrossRef]
  20. Jimenez, J.; Toscano, J.; Oñate, W.; Caiza, G. Development of a Firearms and Target Weapons Recognition and Alerting System Applying Artificial Intelligence. Work. Prog. Embed. Comput. J. 2024, 10, 5. Available online: https://wipiec.digitalheritage.me/index.php/wipiecjournal/article/view/65 (accessed on 15 July 2025).
  21. Song, H.; Yun, X.; Wang, L.; Niu, Z.; Li, Z.; Ma, J. Design of a Multidimensional Intelligent Safety Driving Monitoring System Based on Machine Vision. Acad. J. Sci. Technol. 2024, 10, 1–4. [Google Scholar] [CrossRef]
  22. Domala, N.; Uddin Sufi Mohammed, B.; Ali Khan, G.; Ajwad Baig, M. Detection of Unsafe Driving and Drivers Behaviour Monitoring System. Proceeding Int. Conf. Sci. Eng. 2023, 11, 2047–2054. [Google Scholar] [CrossRef]
  23. Guerrero, C.; Villegas, F.; Oñate, W.; Caiza, G. IoT and Artificial Intelligence for Fault Classification in High Efficiency Motors. In Proceedings of the 3rd International Symposium on Automation, Information and Computing (ISAIC 2022), Beijing, China, 9–11 December 2022; Volume 2, pp. 405–409. [Google Scholar] [CrossRef]
  24. Aboah, A.; Wang, B.; Bagci, U.; Adu-Gyamfi, Y. Real-time Multi-Class Helmet Violation Detection Using Few-Shot Data Sampling Technique and YOLOv8. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Vancouver, BC, Canada, 17–24 June 2023; pp. 5350–5358. [Google Scholar] [CrossRef]
  25. Huang, X.; Gu, S.; Li, Y.; Qi, G.; Zhu, Z.; An, Y. Driver Distraction Detection Based on Fusion Enhancement and Global Saliency Optimization. Mathematics 2024, 12, 3289. [Google Scholar] [CrossRef]
  26. Debsi, A.; Ling, G.; Al-Mahbashi, M.; Al-Soswa, M.; Abdullah, A. Driver distraction and fatigue detection in images using ME-YOLOv8 algorithm. IET Intell. Transp. Syst. 2024, 18, 1910–1930. [Google Scholar] [CrossRef]
Figure 1. Functional architecture of the mobile phone-use detection system for drivers.
Figure 2. Methodological workflow for the development of the proposed system.
Figure 3. UML class diagram of the modular software architecture deployed on the Jetson Xavier NX.
Figure 4. Experimental validation process under controlled conditions.
Figure 5. Training and validation curves showing loss functions (box, cls, dfl) and performance metrics (Precision, Recall, mAP50, mAP50-95) over 100 epochs.
Figure 6. Confidence-based performance curves (Precision-Recall, Precision-Confidence, Recall-Confidence, and F1-Confidence) for each class and overall performance.
Figure 7. Overall classification performance: (a) Confusion matrix showing per-class detection outcomes; (b) Normalized confusion matrix highlighting proportional classification accuracy.
Figure 8. Average FPS versus power consumption under different operational scenarios.
Figure 9. Distribution of True Positives, False Positives, and False Negatives per class.
Figure 10. Distribution of successful and failed detections across 30 experimental trials under different driving scenarios.
Figure 11. System operational states: (a) Active detection enabled under vehicular motion; (b) Standby state when no motion is detected.
Table 1. Precision, Recall, and F1-Score per detected class.

Class            Precision   Recall   F1-Score
driver_phone_1   0.84        0.90     0.87
driver_phone_2   0.78        0.61     0.69
Mean             0.81        0.76     0.78
Table 2. Operating conditions and measured performance of the Jetson Xavier NX.

Scenario                                  Average FPS   Power (W)   Temperature (°C)
Idle system (no camera/inference)         0.00          3.9         45
Stationary with camera active             12.45         5.8         49
Vehicle in motion with gated inference    12.77         7.3         54
Continuous detection with data upload     12.86         8.4         58
Table 3. Comparative embedded inference benchmark of lightweight object detection models on Jetson Xavier NX.

Model               Input Resolution   Average FPS   Average Power (W)   Steady Temperature (°C)
YOLOv8n (trained)   512 × 512          12.86         8.4                 58
YOLOv5n             512 × 512          15.3          9.1                 62
YOLOv4-Tiny         512 × 512          18.7          9.8                 66
MobileNet-SSD       512 × 512          20.2          7.6                 55
Table 4. Error rate and relative contribution per class.

Class            TP   FP   FN   Error Rate (%)
driver_phone_1   92   14    9    8.6
driver_phone_2   31    8   19   24.7