Article

Processing and Integration of Multimodal Image Data Supporting the Detection of Behaviors Related to Reduced Concentration Level of Motor Vehicle Users

by Anton Smoliński *, Paweł Forczmański and Adam Nowosielski
Faculty of Computer Science and Information Technology, West Pomeranian University of Technology, Żołnierska 52 Str., 71-210 Szczecin, Poland
* Author to whom correspondence should be addressed.
Electronics 2024, 13(13), 2457; https://doi.org/10.3390/electronics13132457
Submission received: 24 May 2024 / Revised: 20 June 2024 / Accepted: 21 June 2024 / Published: 23 June 2024
(This article belongs to the Special Issue Advancement on Smart Vehicles and Smart Travel)

Abstract

This paper introduces a comprehensive framework for the detection of behaviors indicative of reduced concentration levels among motor vehicle operators, leveraging multimodal image data. By integrating dedicated deep learning models, our approach systematically analyzes RGB images, depth maps, and thermal imagery to identify signs of driver drowsiness and distraction. Our novel contribution includes utilizing state-of-the-art convolutional neural networks (CNNs) and bidirectional long short-term memory (Bi-LSTM) networks for effective feature extraction and classification across diverse distraction scenarios. Additionally, we explore various data fusion techniques, demonstrating their impact on detection accuracy. The significance of this work lies in its potential to enhance road safety by providing more reliable and efficient tools for the real-time monitoring of driver attentiveness, thereby reducing the risk of accidents caused by distraction and fatigue. The proposed methods are thoroughly evaluated on a multimodal benchmark dataset, and the results demonstrate their suitability for the development of safety-enhancing technologies for vehicular environments. The primary challenge addressed in this study is the detection of driver states independently of lighting conditions. Our solution employs multimodal data integration, encompassing RGB, thermal, and depth images, to ensure robust and accurate monitoring regardless of external lighting variations.

1. Introduction

Advanced driver assistance systems (ADASs) encompass various technologies designed to enhance driving safety and comfort. Several main categories of these systems can be distinguished. The first comprises warning systems, such as lane departure warning, blind spot monitoring, and collision detection alerts. The second group consists of driver assistance systems that actively intervene in vehicle operation, including adaptive cruise control, parking assistance, and lane-keeping assistance. Another category includes passenger protection systems, comprising advanced airbag systems, crash sensors, and electronic stability control. Image recognition systems that detect traffic signs, preceding vehicles, cyclists, pedestrians, and other obstacles play a major role here as well (e.g., [1,2]). Furthermore, automated event detection can surpass human perception and senses in accuracy and efficiency (e.g., [3]). Comfort is further enhanced by advanced infotainment systems utilizing natural human–machine interaction methods like voice commands with intelligent assistants or gesture control interfaces [4]. Finally, driver monitoring systems that detect fatigue and distraction play an increasingly vital role, allowing for intervention and alerting the driver to potential hazards. As these technologies continue to advance, ADASs are expected to further improve vehicle safety while alleviating the workload on drivers.
Driver distraction represents a significant threat to road safety, undermining the driver’s ability to maintain attention, make timely decisions, and react appropriately to road conditions. The World Health Organization highlights that approximately 1.35 million people lose their lives annually due to road traffic accidents, with human error, particularly distraction, being a predominant cause in 90% of these incidents [5]. This stark statistic underscores the imperative to develop and implement effective real-time methods for detecting and preventing driver distraction to mitigate these risks.
The complexity of human physiology and behavior necessitates a comprehensive approach to accurately discern the multifaceted nature of driver distraction. Recent studies have utilized a variety of physiological signals, such as heart rate, heart rate variability, body temperature, respiratory rate, and blood oxygen concentration, to gauge an individual’s level of attention and fatigue [6]. These physiological markers offer a window into the driver’s state, providing critical data that can be used to enhance the safety and efficiency of both civilian and military vehicular operations. Driver fatigue detection is also performed through vision-based monitoring systems, where typically the visible light spectrum is used. Among the myriad of methodologies explored in this domain, deep learning emerges as a particularly promising avenue. By analyzing complex patterns in image and video data captured by onboard cameras, deep learning algorithms can identify subtle cues indicative of distraction, such as changes in head pose, facial expressions, eye movements, and hand gestures [7,8]. This approach, part of the broader field of artificial intelligence, excels in extracting meaningful features from vast datasets, enabling the development of highly accurate models for recognizing various forms of distraction.
However, applying deep learning to the challenge of driver distraction detection is not without its obstacles. The diversity of distraction types, ranging from cognitive and emotional to manual and visual distractions, presents a significant classification challenge. Furthermore, the variability of driving scenarios, the limited computational resources available in embedded systems within vehicles, and concerns regarding drivers’ privacy complicate the design and optimization of deep learning models suitable for this context [9,10]. Due to widely varying and changing lighting conditions, spectra beyond visible light are increasingly being employed [11,12]. By leveraging depth maps, infrared, or other wavelengths, these vision systems can more robustly detect indications of fatigue and drowsiness across different environmental conditions.
Addressing these challenges, our article proposes a novel framework that utilizes multimodal video sequences to detect driver distraction. By integrating RGB images, depth maps, and thermal imaging, our approach captures a comprehensive set of visual information that reflects the driver’s current state. We employ advanced deep neural network architectures to analyze these data, extracting high-level features that enable the classification of driver behaviors into several distinct categories of distraction. Our framework’s novelty lies in its ability to process and analyze data from multiple modalities simultaneously, leveraging the strengths of each to achieve superior detection accuracy [13,14].
Our main contributions to the field of driver distraction detection through this research are as follows:
  • We propose a novel, comprehensive framework for detecting driver distraction using deep-learning-based classification of multimodal video sequences. Our framework is effective at identifying a wide array of typical distraction cues, such as yawning, rubbing eyes, and head drooping. It can be especially useful in variable lighting conditions, as the model can interpret visual information captured regardless of the illumination. We validate our approach using extensive experimental data, demonstrating its robustness and reliability across diverse scenarios.
  • We leverage advanced deep neural network architectures for robust feature extraction and data integration from multimodal inputs, including RGB images, depth maps, and thermal imagery. Our architecture innovatively combines parallel artificial neural networks for RGB, depth, and thermal data, enriched with fusion layers at various processing stages. Through a detailed exploration of data integration strategies, we determine the most effective approach for analyzing multimodal video sequences. We examine late fusion, which combines outputs from individual networks processing RGB, depth, and thermal data; early fusion, which integrates these data types at the input stage; and a unified approach that processes all data types through a common network. Our findings elucidate the comparative advantages of each strategy in exploiting the complementary nature of multimodal data for enhanced distraction detection.
Our research objectives are as follows: first, to identify the most effective feature extractor for driver distraction detection across various modalities; second, to determine the optimal classifier architecture that can accurately classify distraction cues from multimodal inputs; third, to find the best data fusion method for integrating information from RGB images, depth maps, and thermal imagery to improve detection accuracy; and finally, to address the practical problem of detecting driver distraction under variable lighting conditions, ensuring that our framework is robust and reliable in real-world scenarios.
The rest of this article is structured as follows: Section 2 delves into the existing literature on driver distraction detection using deep learning, establishing the context for our study. Section 3 provides a detailed exposition of our proposed framework, explaining our approach’s technical underpinnings and innovative aspects. Section 4 describes the experimental setup and the capture environment, detailing the technologies and methodologies employed in data collection. Section 5 presents a comprehensive analysis of our experimental findings, showcasing the efficacy of our framework. Finally, Section 6 summarizes the key insights derived from our research, outlines its implications for the field, and suggests directions for future work.

2. Related Works

Exploring behaviors related to reduced concentration levels of motor vehicle users through multimodal image data processing has become a significant area of research [15]. This section reviews key advancements and methodologies in this domain, focusing on a holistic approach to driver state and behavior analysis and applications in related fields such as e-learning [16]. It must be noted that, despite our focus on vision systems, there are several other techniques for detecting drowsiness based on sensors, such as EEG [17,18,19], gyroscope data of head motion [18], heart rate [20], and others. Since we rely on visual information captured in different spectra, no cooperation is required from the user, e.g., wearing additional body sensors or touching dedicated areas on the steering wheel. Hence, this approach is fully transparent and non-intrusive to the user.
Applying deep learning to driver distraction detection has not been without challenges. The primary issues include the diversity of distraction types and the variability of driving scenarios, which make the task of developing universally effective models complex. Innovative solutions, such as adaptive algorithms and real-time processing capabilities, have been introduced to overcome these hurdles, significantly advancing the field’s state of the art [10,21,22]. The implementation of IoT-assisted systems using U-Net-based architectures exemplifies such advancements [23].
Deep learning models have revolutionized the field of driver distraction detection, offering substantial improvements in accuracy and reliability. Recent studies employing onboard camera feeds have reported detection accuracies in the range of 90% to 95%, highlighting the efficacy of these models in capturing and understanding complex driver behaviors [24,25,26]. The success of these methods underscores deep learning’s transformative potential in enhancing road safety. Innovations such as leveraging computer vision and eye-blink analyses and utilizing novel deep convolutional neural network models have shown great promise [19,27,28].
Establishing benchmark collections and conducting comparative studies have played a pivotal role in propelling the research on driver distraction and fatigue detection. These initiatives have been instrumental in providing a standardized dataset for validating the effectiveness of detection systems, particularly under challenging conditions. Such efforts have facilitated the comparison of different methodologies and highlighted areas requiring further improvement [3,29]. Research integrating sensor data and utilizing semantic learning approaches to analyze driver fatigue more accurately are notable contributions to this area [30,31].
The exploration of emerging technologies, such as thermal imaging and specialized hardware like night-vision systems, has opened new avenues for enhancing driver distraction detection. These technologies offer insights into physiological changes that may indicate reduced attention levels, offering a complementary perspective to conventional visual analysis [32,33]. Advances such as facial recognition and machine learning models tailored for driver identification and drowsiness detection further enhance driver safety [34,35,36].
Deep learning’s application extends beyond the analysis of driver behavior to encompass various aspects of automotive safety, including vehicle detection and pedestrian safety. The adaptability and versatility of deep learning technologies signify their vast potential in advancing the capabilities of advanced driver assistance systems (ADASs), marking a significant leap toward safer driving environments [37,38].
The utilization of automated fatigue detection technologies is expanding into domains beyond automotive safety, such as e-learning. Maintaining engagement poses a considerable challenge in settings where direct instructor–student interactions are limited. Innovative solutions like the E-Pod system employ visual cues for the real-time monitoring of student engagement levels during e-learning sessions. By analyzing indicators such as head pose, eye gaze, and facial expressions, the system assesses attention levels and prompts interventions to re-engage students, demonstrating the broader applicability of these technologies in educational contexts [16].
The above example demonstrates that fatigue detection is being considered not only for drivers but also in other fields. Another instance is found in the article [39], which monitors fatigue among fixed-position staff. Here, eye closure time, percentage of eyelid closure (PERCLOS) value, frequency, and number of blinks are examined. The material is recorded using a standard visual camera. Similar fatigue indicators based on eye observations are utilized in [40], where surveillance personnel, who must maintain vigilance for extended periods, are monitored for fatigue. The article proposes an interesting approach of adapting interfaces for workers under fatigue states to sustain focus and engagement.
To provide a comprehensive evaluation of our approach, we compare it with other state-of-the-art methods for driver distraction detection. Table 1 summarizes these methods, detailing the specific techniques used, the databases employed, and the reported accuracy rates. This comparison highlights the diverse strategies and performance levels achieved by different research efforts in this field.

3. Method Description

In this section, we present our method, detailing its key components, utilized data streams, model architecture, and training/validation processes. Our aim is to provide a comprehensive understanding of our research methodology.
The general architecture of our fatigue detection framework is depicted in Figure 1. The processing consists of several key stages:
  • Input: This stage involves collecting multimodal data, including RGB images, depth maps, and thermal images.
  • Extractor: Various feature extractors, such as ResNet50, InceptionV3, and VGG16, are employed to extract meaningful features from the input data.
  • Classification: The extracted features are then classified using models such as CNNs, LSTM, and bidirectional-LSTM to detect signs of fatigue.
  • Result: Finally, the classification results are processed to determine the driver’s state of alertness.
Additionally, our framework incorporates different data fusion techniques (Fusion) at various stages to enhance the detection accuracy. The fusion can occur at the input level, feature extraction level, or classification level, allowing the model to leverage complementary information from multiple data sources.

3.1. Framework Components

Our framework (the processing flow) consists of several elementary components (sub-networks) connected with each other following established rules. These components include data input, feature extraction sub-network(s), feature classification sub-network(s), and data fusion. Convolutional neural networks used at the feature extraction stage include several options, i.e.,
  • VGGNet: Known for its specific signature pattern (Conv-Conv-Conv-Pool), VGGNet simplifies the architecture by using uniform configurations, which helps in extracting hierarchical feature representations.
  • Inception: Utilizes inception modules that allow the model to be heavily optimized by incorporating multiple convolutional operations within the same module. This approach enhances the model’s ability to capture spatial features at different scales.
  • ResNet: Incorporates residual connections, which enable the network to be much deeper without suffering from the vanishing gradient problem. This results in improved accuracy and the ability to learn more complex features.
At the classification stage, the analysis of sequential and temporal data is performed with the help of one of the following:
  • Convolutional Neural Network (CNN): Consisting of standard convolution and fully connected layers, this approach is effective for spatial feature analysis.
  • Long Short-Term Memory (LSTM) networks: A specialized form of recurrent neural networks (RNNs) designed to handle temporal sequences of data. LSTM networks are capable of learning long-term dependencies and are particularly suited for time-series analysis.
  • Bidirectional-LSTM (Bi-LSTM) networks: Enhance the LSTM approach by processing data in both forward and reverse temporal directions, providing a more comprehensive analysis of the temporal data.
The data fusion stage is crucial for integrating the multimodal data and can vary depending on the specific architecture used. It typically involves the following:
  • Standard fully connected layers: Used to combine features extracted from different modalities.
  • Concatenation layers: These layers merge features from different sub-networks at various levels, allowing the model to leverage complementary information.
  • Averaging layers: These layers average the features from different modalities to provide a unified representation.
In summary, the methodological advancements in LSTM [42] and its variants present a robust framework for analyzing video sequences, offering nuanced insights into driver behavior and distraction detection. By leveraging the temporal sequencing capabilities of LSTM networks and the spatial analysis prowess of CNNs working with multi-channel inputs, our proposed model stands at the forefront of video-based monitoring systems, promising significant contributions to automotive safety research.
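Returning to the fusion options listed above, the following is a minimal sketch of the three layer types in Keras (the framework used in Section 5); the 256-dimensional feature vectors are illustrative placeholders for the outputs of the modality-specific sub-networks, not the actual feature sizes used in our experiments.

```python
from tensorflow.keras import layers, Input, Model

# Placeholder feature vectors from three modality-specific sub-networks (size is illustrative).
f_rgb, f_depth, f_thermal = (Input(shape=(256,)) for _ in range(3))

merged  = layers.Concatenate()([f_rgb, f_depth, f_thermal])   # concatenation layer: merge features side by side
unified = layers.Average()([f_rgb, f_depth, f_thermal])       # averaging layer: unified representation
fused   = layers.Dense(256, activation="relu")(merged)        # fully connected combination of the merged features

fusion_blocks = Model([f_rgb, f_depth, f_thermal], [fused, unified])
```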

3.2. Input Data Streams

The capture environment is designed to take advantage of the capabilities of three different types of imaging technology, each of which has been selected for its unique contribution to the overall data collection process:
  • RGB Camera: Serves as the cornerstone for capturing high-resolution color imagery. It provides detailed visual information about the scene, including the appearance, attire, and facial features of individuals.
  • Depth Camera: Utilizes depth-sensing technology to measure the distance between the camera and objects within the scene. These data are crucial for constructing three-dimensional models of subjects, enabling precise estimations of human size, shape, and posture and facilitating their effective segmentation from the background.
  • Thermal Camera: Detects infrared emissions from objects, translating them into temperature maps. This capability is particularly valuable in low-visibility conditions, where conventional RGB cameras may fail, providing critical data on heat distribution and physiological responses.
Deploying RGB, depth, and thermal sensors in capture environments significantly enhances our ability to detect and analyze human behaviors.

3.3. Model Architecture

We created a model parameterized by three elements, namely an image modality (or multiple modalities), a feature extractor, and a classifier. A backbone (feature extractor) is placed at the front of the network, producing a reduced representation of the video sequence; the model is built using an end-to-end methodology.
The input data can include up to two RGB streams from cameras at different locations and viewpoints, a depth map, and a thermal imaging stream. A detailed description of the data streams is provided later in the paper.
In our study on driver distraction detection, we utilized established feature extractors (ResNet50, VGG16, VGG19, InceptV3) without final fully connected layers in conjunction with advanced classifiers (LSTM, bidirectional LSTM, CNNs) to analyze temporal sequences in video data. The combination of these well-recognized technologies was instrumental in dissecting the intricate dynamics of driver behaviors captured in video clips.
These extractors were chosen due to their proven effectiveness in similar applications, characterized by their low computational overhead and high accuracy as reported in the literature. An important aspect is the inference speed, which ranges from 4–6 ms per frame on a GPU (Nvidia Tesla A100) and 40–80 ms on a CPU (AMD Epyc), making these models practically significant for real-time applications. By leveraging these established models, we ensure a robust and efficient framework for driver distraction detection.
In our research, we do not fully train the feature extractors (backbones); instead, we rely on their pre-trained weights. The backbones were cut before the last classification layer and merged with the subsequent classification sub-network. The input and output sizes of all backbones for a single-modality frame are presented in Table 2.
The classifier works on this reduced representation and outputs a class number for action type (five classes representing distraction type and an extra class for natural behavior). An exemplary visualization of the classifier employing bidirectional-LSTM is presented below in Figure 2. This final model employs two bidirectional-LSTM layers, several regularization layers, and a fully connected layer. The experimental results show that this configuration performed significantly better than others (e.g., employing single bidirectional-LSTM, single LSTM, and custom CNN layers).
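The following is a minimal sketch of this single-modality pipeline in TensorFlow/Keras (the environment used in Section 5), assuming a frozen ResNet50 backbone; the LSTM width, dropout rate, and optimizer are illustrative assumptions, while the 224 × 224 input, the frozen pre-trained backbone, and the six output classes follow the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_FRAMES, NUM_CLASSES = 25, 6  # 25 frames per clip; 5 distraction classes + natural behavior

# Frozen ResNet50 backbone cut before its classification layer (include_top=False).
backbone = tf.keras.applications.ResNet50(include_top=False, weights="imagenet",
                                          pooling="avg", input_shape=(224, 224, 3))
backbone.trainable = False  # pre-trained weights are not fine-tuned

frames = layers.Input(shape=(NUM_FRAMES, 224, 224, 3))           # sequence of frames
features = layers.TimeDistributed(backbone)(frames)              # per-frame 2048-D features
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(features)
x = layers.Dropout(0.3)(x)                                        # regularization layer (illustrative rate)
x = layers.Bidirectional(layers.LSTM(128))(x)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)      # fully connected output layer

model = Model(frames, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```

The same skeleton can be reused for the VGG and Inception backbones by swapping the application model and the input resolution.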
We investigated several architectures related to data fusion and input data treatment; they are depicted in the following figures. Wherever ‘Extractor’ appears, it denotes one of the backbone feature extraction networks listed above. ‘NN’ denotes a neural network classifier (bidirectional-LSTM in our case), ‘Fusion’ denotes the merging of inputs for further processing, and ‘Result’ represents the output classification result.
The first approach (unimodal), taken as a baseline, consists of the independent processing of individual data streams (see Figure 3). It was introduced for comparison purposes only and does not utilize any form of fusion. It employs two bidirectional-LSTM layers as the core of the classifier. The results should be analyzed independently for each modality. Its purpose was to select the minimal number of inputs (three out of four) for further exploitation, mainly in terms of selecting between RGB1 and RGB2 input streams.
The second approach does not distinguish between modalities and transfers sequences to a common feature extractor. Then, the results are classified using a two-layer bidirectional-LSTM sub-network. This was introduced to investigate if all modalities can be successfully classified using the same, universal feature extractor. This architecture is presented in Figure 4.
The third approach employs an early fusion strategy based on concatenating the features extracted by independent feature extractors. Then, the results are passed to the classifier based on a two-layer bidirectional-LSTM sub-network. It is depicted in Figure 5.
The fourth approach employs late fusion based on the results of three independent feature extractors working with individual modalities and four independent bidirectional-LSTM-based classifiers (with 128 units each). They output their classification results (softmax-based activation layer), which are then averaged. It is presented in Figure 6.
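The following is a sketch of how the early and late fusion variants can be wired in the same Keras setting; three modality branches are shown for brevity (e.g., RGB1, depth, and thermal), the replication of depth and thermal data to three channels is an assumption made to fit ImageNet backbones, and only the 128-unit Bi-LSTM width and the averaged softmax outputs follow the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, Input, Model

FRAMES, SIZE, CLASSES = 25, 224, 6

def make_extractor():
    """Frozen ResNet50 backbone without its classification head (one instance per modality)."""
    net = tf.keras.applications.ResNet50(include_top=False, weights="imagenet",
                                         pooling="avg", input_shape=(SIZE, SIZE, 3))
    net.trainable = False
    return net

inputs = [Input(shape=(FRAMES, SIZE, SIZE, 3)) for _ in range(3)]   # e.g., RGB1, depth, thermal
features = [layers.TimeDistributed(make_extractor())(x) for x in inputs]

# Early fusion: concatenate per-frame features, then one shared Bi-LSTM classifier.
early = layers.Bidirectional(layers.LSTM(128))(layers.Concatenate()(features))
early_model = Model(inputs, layers.Dense(CLASSES, activation="softmax")(early))

# Late fusion: an independent Bi-LSTM classifier per modality; softmax outputs are averaged.
heads = [layers.Dense(CLASSES, activation="softmax")(layers.Bidirectional(layers.LSTM(128))(f))
         for f in features]
late_model = Model(inputs, layers.Average()(heads))
```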

3.4. Training and Validation Process

The training and validation of neural networks in our study were designed to ensure the robustness and reliability of the fatigue detection model. We employed a comprehensive approach incorporating advanced optimization techniques, appropriate learning rates, and specific stopping criteria to optimize the learning process and validate the model’s performance. The benchmark set was decomposed into the training part (75%) and the testing/validation part (25%) randomly. The parameters of the training procedure, including optimizer settings, learning rate, and stopping criteria, are detailed in Table 3.
The training of the neural networks was conducted using an early stopping mechanism, which halted the training process if there was no improvement in the loss metric for 1500 iterations. Additionally, a ‘Reduce Learning Rate on Plateau’ strategy was employed, which reduced the learning rate by a factor of 0.25 if no improvement was observed for 200 iterations, with a minimum learning rate set at 0.001. These stopping criteria were implemented to prevent overfitting and ensure that the training process was efficient and effective. By carefully monitoring the loss metric—specifically, categorical cross-entropy—we ensured that the training ceased at an optimal point, preventing unnecessary computations without compromising the model’s performance. Lower values of the loss metric indicate better model performance and were a critical measure for determining the stopping point during training.
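The stopping criteria above can be expressed as Keras callbacks, as in the sketch below; mapping the text’s “iterations” to Keras epochs and monitoring the training loss are assumptions, as is the choice of optimizer.

```python
from tensorflow.keras import callbacks

# Stopping criteria quoted in the text; 'iterations' are treated as Keras epochs (an assumption),
# and the training loss is monitored (the text does not state whether training or validation loss was used).
stop_rules = [
    callbacks.EarlyStopping(monitor="loss", patience=1500, restore_best_weights=True),
    callbacks.ReduceLROnPlateau(monitor="loss", factor=0.25, patience=200, min_lr=0.001),
]

# Illustrative usage with the categorical cross-entropy loss and the limits named in Section 5:
# model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=2500, batch_size=128, callbacks=stop_rules)
```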

4. Data Preparation

Data preparation is crucial for ensuring the accuracy and reliability of our analyses. In this section, we delve into the essential steps involved in preparing our dataset for analysis. We begin by examining the capture environment, followed by detailing the data collection and pre-processing procedures. Subsequently, we discuss techniques for frame cropping, event tagging, and video segmentation, essential for refining raw data into analyzable formats. Lastly, we summarize the characteristics of the prepared dataset.

4.1. Capture Environment

The capture environment utilizes an array of sensors to record and analyze scenes, focusing on detecting human behaviors. This multi-sensor approach is instrumental in creating a dataset that captures the complexity and nuance of human actions in diverse conditions. For an in-depth exploration of the data collection process, the research protocol, and the characteristics of participants involved in the recordings, readers are referred to a dedicated study by Małecki et al. [29], which exclusively addresses these aspects.

4.2. Data Collection and Pre-Processing

Proper preparation of data is crucial for effective machine learning applications. The initial recordings, as detailed by Małecki et al. [29], contained long sequences for each participant across various modalities. This section outlines our systematic approach to adjusting these data for machine learning processing. The dataset used in our study consists of multi-spectral recordings from 44 subjects with diverse demographics. Table 4 and Table 5 provide a detailed summary of the dataset, including the physical characteristics of the subjects. Notably, subjects wearing glasses participated in the study twice, which accounts for the difference between the total number of recordings and the demographics.
In total, 52 video sequences (of four modalities each) were initially recorded, from which 44 were deemed complete and suitable for further analysis. This selection process was critical in maintaining the integrity of our study, particularly in evaluating the impact of factors like eyewear on the detection of fatigue. Exemplary frames from benchmark sequences are presented in Figure 7.
Initially, the sequences were sorted by individual participants, ensuring that each person was represented by four videos in different modalities. Since the recording of the benchmark videos was controlled manually, each file varied in length, and there were small shifts in the start times between different recordings (graphically depicted in Figure 8). It was also discovered that one of the cameras operated at a different frame rate from the others. Consequently, the first step involved standardizing the frames per second (FPS) parameter to 50 for all recordings and shifting the start times to eliminate discrepancies.
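The paper does not state which tools were used for this step, so the following is only an illustrative sketch of resampling a recording to 50 FPS and dropping an initial offset, using OpenCV; it loads the whole clip into memory for simplicity.

```python
import cv2

def resample_to_50fps(src_path, dst_path, start_offset_s=0.0, target_fps=50.0):
    """Re-time a recording to a common frame rate and skip an initial offset (illustrative)."""
    cap = cv2.VideoCapture(src_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS)
    size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)), int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
    writer = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"), target_fps, size)

    frames = []                       # whole clip in memory; acceptable for a sketch
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()

    duration = len(frames) / src_fps - start_offset_s
    for i in range(int(duration * target_fps)):
        t = start_offset_s + i / target_fps                     # output timestamp in source time
        src_idx = min(int(round(t * src_fps)), len(frames) - 1)
        writer.write(frames[src_idx])                           # nearest-frame resampling
    writer.release()
```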

4.3. Frame Cropping

Due to the different natures of the imaging systems, the subject’s head and upper body occupy different areas within the frames. To reduce the impact of this variability on recognition, we applied automatic cropping based on region of interest (ROI) estimation. Heatmaps were created to define the ROI by overlaying successive frames for each recording. This process highlighted areas of frequent movement, allowing us to identify the space within which the recorded person moved. Four heatmaps were generated, one for each modality, to establish identical ROI proportions. These areas were then scaled to a uniform resolution, ensuring that the essential features of each recording were maintained and comparable across modalities. The process is detailed in Figure 9, with ‘Before’ and ‘After’ images depicted at the top and bottom, respectively, accompanied by a heatmap for each modality to determine the ROI and scaling proportions.
The final spatial resolution of each video frame is 224 × 224 pixels, matching the input format expected by the backbone networks at the extraction stage (VGGNet and ResNet); for the InceptV3 backbone, frames were scaled to 299 × 299 pixels.
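A minimal sketch of the heatmap-based ROI estimation and cropping described above is given below, assuming grayscale NumPy frames from one modality; the motion threshold and the use of inter-frame differences as the overlay statistic are illustrative assumptions.

```python
import numpy as np
import cv2

def estimate_roi(frames, motion_thresh=15):
    """Accumulate inter-frame differences into a heatmap and return its bounding box (the ROI)."""
    heat = np.zeros(frames[0].shape[:2], dtype=np.float32)
    for prev, curr in zip(frames[:-1], frames[1:]):
        diff = cv2.absdiff(curr, prev)
        heat += (diff > motion_thresh).astype(np.float32)   # count pixels with frequent movement
    ys, xs = np.nonzero(heat > 0)
    return ys.min(), ys.max(), xs.min(), xs.max()           # top, bottom, left, right

def crop_and_scale(frame, roi, size=224):
    """Crop a frame to the modality's ROI and rescale it to the backbone input resolution."""
    top, bottom, left, right = roi
    return cv2.resize(frame[top:bottom + 1, left:right + 1], (size, size))
```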

4.4. Event Tagging and Video Segmentation

Event tagging was conducted on the RGB1 recording, where an operator manually marked the start and end of one of the five monitored events:
  • Yawning without covering the mouth (Figure 10)—Videos capturing subjects yawning openly, which is a common sign of tiredness;
  • Yawning with the mouth covered (Figure 11)—This class includes videos where subjects cover their mouths while yawning, often considered a social behavior to mask tiredness;
  • Unnatural blinking (Figure 12)—Recordings of subjects blinking more frequently than usual, which can indicate the early stages of fatigue;
  • Head drooping (Figure 13)—This category features subjects whose heads drop in a nodding motion, indicating severe drowsiness or loss of alertness;
  • Rubbing the eyes (Figure 14)—Subjects are shown rubbing their eyes, a behavior typically associated with fatigue and eye strain.
An additional class was also prepared, which included short 3-second-long recordings where the individual behaved naturally. These recordings were taken from the initial segments of the videos before the onset of fatigue-simulating sequences. This class serves as a control to enhance the neural network’s ability to discriminate between normal and fatigue-related behaviors.

4.5. Frames Selection

To limit computational complexity, the model does not analyze every frame of each video clip. Hence, we propose three alternative frame selection techniques to optimally capture action dynamics (sketched in code after the list), focusing on the following:
  • A fixed number of frames with a fixed interval, taken from the beginning of the sequence;
  • A fixed number of frames with variable intervals (depending on the video sequence length);
  • A variable number of frames with fixed intervals, depending on the sequence duration.
In this way, an input to the model contains only a reduced number of frames.
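The three selection strategies can be written as simple frame-index generators, as in the sketch below; the default counts and intervals are illustrative, since only the 25-frame, variable-interval setting is specified later (Section 5).

```python
def fixed_count_fixed_interval(total_frames, count=25, interval=2):
    """A fixed number of frames at a fixed interval, taken from the start of the sequence."""
    return [min(i * interval, total_frames - 1) for i in range(count)]

def fixed_count_variable_interval(total_frames, count=25):
    """A fixed number of frames with an interval derived from the sequence length."""
    interval = max(total_frames // count, 1)
    return list(range(0, total_frames, interval))[:count]

def variable_count_fixed_interval(total_frames, interval=5):
    """As many frames as the sequence duration allows, sampled at a fixed interval."""
    return list(range(0, total_frames, interval))
```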

4.6. Dataset Characteristics Summary

An important feature of our dataset is its multimodal nature, incorporating three distinct types of sensory data: RGB (visual), depth, and thermal. The RGB data comprise recordings from two separate cameras positioned at different angles, effectively providing two distinct visual perspectives for each recorded sequence.
Each sequence consists of four recordings: two from the differently positioned RGB cameras, one depth recording, and one thermal recording. This approach ensures a comprehensive capture of each event, leveraging the unique advantages offered by each modality. The consistency of resolution, aspect ratio, ROI, and FPS across all modalities and recordings facilitates a cohesive and systematic analysis. The event timestamps (Section 4.4) were stored in CSV files indicating the frame numbers for the start and end of each event. Given the prior synchronization of the videos, each of the four clips (one per modality) was segmented according to the markings from this file, thereby creating a set of 10–40 short recordings depicting a specific sequence for a particular person in each modality. The total number of video clips in the set, the mean number of frames, and the standard deviation of the frame count, considering their class assignment, are presented in Table 6.
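A hedged sketch of this segmentation step is shown below, assuming a CSV layout with label, start-frame, and end-frame columns; the exact file format is not given in the paper.

```python
import csv

def segment_events(frames, csv_path):
    """Split a synchronized frame list into short event clips using tagged frame ranges."""
    clips = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):  # assumed columns: label, start_frame, end_frame
            start, end = int(row["start_frame"]), int(row["end_frame"])
            clips.append((row["label"], frames[start:end + 1]))
    return clips
```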
The typical method for evaluating a model’s performance involves dividing the dataset into training, validation, and testing subsets. We assumed that the bias present in the entire dataset would also be reflected in these subsets. However, since our benchmark collection does not contain a large number of examples, we only used a division into training and validation sets. Testing a model on a very small subset would lead to unreliable results. Given that the bias is consistent across the dataset, this approach is sufficient for assessing our model’s performance. Since all data were collected under the same conditions, the performance on a test set partitioned similarly should yield results comparable to those obtained for the validation set.

5. Experimental Results

In this section, we present the outcomes of our experiments. They were performed in the Python environment (ver. 3.11.4) with TensorFlow/Keras (ver. 2.13.0). The hardware used in our calculations was a SuperMicro AS-4124GS-TNR-based virtualized platform equipped with two AMD Epyc 7H12 processors and eight MIG-partitioned Nvidia A100 40 GB GPUs. Due to the limitations of the virtual environment, the batch size was reduced to 128 and the maximal number of epochs to 2500. The number of frames extracted from each sequence was set to 25, with variable intervals calculated based on video length and frame rate.
We present results for individual modalities, followed by insights derived from our unified approach. Additionally, we compare the outcomes of different fusion techniques. Finally, we discuss these results comprehensively, exploring their implications and possibilities for further research. Since the RGB1 stream gave a slightly better accuracy (than RGB2), it was taken as an input to fusion models.

5.1. Results for Individual Modalities

The effectiveness of our model (see Table 7) varied across different sensory data modalities, with training and validation accuracies illustrating the distinct contributions of RGB, depth, and thermal inputs to the detection process. Notably, RGB data demonstrated the highest validation accuracies, underscoring their importance in visual recognition tasks.
Based on these results, we selected the RGB1 stream as the input to the multi-modal models.

5.2. Results for Unified Approach

Our exploration into a unified approach, which processes all modal inputs as equivalent, yielded promising results, emphasizing the potential of holistic data analysis in improving detection accuracy (see Table 8). However, it highlighted the challenge of effectively integrating heterogeneous data types without losing unique modality-specific information.

5.3. Results of Both Fusion Approaches

We assessed two data fusion techniques: early fusion and late fusion. Early fusion amalgamates input data before classification, aiming to exploit the combined strength of all modalities from the onset. Late fusion, conversely, merges the decisions from models trained independently on each data type, harnessing the specific advantages of each modality for a more nuanced analysis. The late fusion approach demonstrated superior validation accuracy, validating its effectiveness in leveraging the complementary nature of multimodal data for enhanced driver distraction detection. The idea of feeding multiple classifiers (e.g., CNN- or LSTM-based) with different modality data is similar to the one presented in [43] and proved its validity in many different scenarios. The results are shown in Table 9.

5.4. Discussion of the Results

During our experimentation, it was evident that the choice of feature extractors, classifiers, and data integration methods significantly influenced the performance of the fatigue detection framework. While the best performance was achieved using ResNet50 as the feature extractor, bidirectional-LSTM as the classifier, and late fusion for integrating modalities—resulting in a validation accuracy of 0.897—the weakest set of parameters resulted in an accuracy below 0.7.
Specifically, when InceptV3 was employed as the feature extractor, coupled with a Bi-LSTM classifier, and treating all modalities as equivalent inputs with a fixed interval and a constant number of frames for video processing, the framework managed a validation accuracy of only 0.644. This stark contrast in performance underscores the critical importance of carefully selecting the combination of technological approaches for processing and analyzing multimodal image data.
The disparity between the highest and lowest validation accuracies—from 0.644 to 0.897—highlights the potential of optimizing deep learning configurations for enhanced accuracy and the pitfalls of inadequate parameter selection.
The comprehensive analysis conducted in this study underscores the effectiveness of data fusion techniques in enhancing the performance of fatigue detection systems. Notably, the late fusion approach, which integrates the outputs from separately trained models for each modality at the decision level, emerged as the superior strategy. This method, achieving a validation accuracy of 0.897 as shown in Table 9, demonstrates its capacity to preserve and leverage the distinct characteristics of each data type, facilitating a more refined interpretation of the multimodal information.
In contrast, despite its robust performance, the early fusion technique slightly lags behind in validation accuracy (0.871 as per Table 9). This suggests that merging modalities at the input stage might lead to a premature amalgamation of features, potentially diluting each modality’s unique signals before the model can fully exploit them.
The direct comparison of fusion methods with individual modalities’ performances further validates the superiority of integrated approaches. As detailed in Table 7, models trained on single modalities consistently show lower accuracy than those utilizing fused data. This disparity is particularly evident in RGB1 and RGB2 inputs, where even the highest validation accuracy for a single modality (RGB1 at 0.891) falls short of the benchmarks set by fusion models.
Moreover, the unified approach, which treats all inputs as equivalent (without distinguishing their origin), demonstrates the potential of such processing by achieving a notable validation accuracy of 0.864 (refer to Table 8). However, it still does not match the precision afforded by the late fusion technique.
Ultimately, these findings articulate a clear preference for late fusion in complex behavior detection tasks. This method capitalizes on the inherent strengths of each data source and aligns with the nuanced requirements of accurate fatigue detection. The study propels the discourse on multimodal data processing forward, advocating for late fusion as a pivotal strategy in developing more sophisticated and accurate detection systems.
The results of our study highlight the significant impact of feature extractors, classifiers, and data fusion methods on the accuracy of fatigue detection. Our approach achieved a validation accuracy of 0.897. This performance is competitive with, and in some cases exceeds, other state-of-the-art methods.
For instance, the method by Majeed et al. [30] achieved a 96.69% accuracy using the mouth aspect ratio (MAR) for yawning detection, while Safarov et al. [35] reported over 96% accuracy with eye-blinking rate analysis. Our results, with a more comprehensive multimodal approach, demonstrate the advantages of integrating different data types.
Additionally, the hybrid approach by Ansari et al. [22], which combines postural and vehicle features, achieved a 99.60% accuracy. Although this method shows high accuracy, our vision-based system offers simplicity by not requiring additional sensors.
Table 10 and Table 11 summarize the experimental results for different configurations of our framework, highlighting the performance metrics such as recall, precision, and accuracy for various combinations of data types, feature extractors, and classifiers.

6. Conclusions

This research focused on enhancing the detection of behaviors associated with reduced concentration levels among motor vehicle users by leveraging multimodal image data. Through the integration of advanced deep learning models and a comprehensive exploration of various data fusion techniques, significant insights into the effectiveness of these methods in recognizing driver distractions were uncovered. The investigation revealed that while both early and late fusion approaches significantly improve classification accuracy compared to models trained on single modalities, the late fusion approach, in particular, stands out for its superior performance.
Our findings underscore the importance of choosing the right fusion strategy when dealing with complex, multimodal datasets. The late fusion technique, by allowing individual models to independently interpret the data before making a collective decision, provides a more nuanced analysis that capitalizes on the unique strengths of each data type. This approach resulted in the highest validation accuracy and highlighted the potential for more sophisticated and accurate detection systems that can operate effectively in real-world conditions.
The configuration highlighted in this article, selected from all tested combinations, yielded the best results and aligned with the prevailing trend in effectively fusing data for distraction detection. While our study primarily focused on determining which strategy for integrating different modalities was the most effective, we acknowledge that using a broader array of feature extractors, especially newer and potentially superior models, could further enhance our understanding and detection capabilities. The feature extractors that we chose served to validate the observed trend, yet an intriguing extension of this research would involve experimenting with a diverse set of feature extractors to find an optimal parameter set. This would undoubtedly help to ascertain the extent to which the detection of events can be improved with the best possible configuration of parameters.
Moreover, the research contributes to the broader discourse on the application of deep learning in automotive safety by demonstrating the feasibility of implementing these advanced computational techniques on embedded systems. The successful deployment of our framework on such platforms paves the way for the real-time, onboard detection of driver fatigue and distraction, which could have profound implications for road safety.
As we look to the future, the groundwork laid by this study opens several avenues for further exploration. Investigating additional modalities, refining the models to operate more efficiently on constrained hardware, and exploring the impact of different environmental factors on detection accuracy all represent fruitful directions for subsequent research. By continuing to build on the foundation established here, the goal of creating highly reliable systems capable of preventing accidents caused by driver distraction and fatigue moves ever closer to realization.
In conclusion, the research presented in this paper represents a significant step forward in mitigating the risks associated with driver distraction and fatigue. By harnessing the power of multimodal data and deep learning, we move closer to developing systems that enhance vehicle safety and save lives.

Author Contributions

Conceptualization, A.S., P.F. and A.N.; methodology, P.F.; software, A.S. and P.F.; validation, A.S., P.F. and A.N.; formal analysis, P.F. and A.N.; investigation, A.S. and P.F.; resources, A.S., P.F. and A.N.; data curation, A.S. and P.F.; writing—original draft preparation, A.S.; writing—review and editing, A.S., P.F. and A.N.; visualization, A.S.; supervision, P.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Written informed consent has been obtained from all subjects to publish this paper.

Data Availability Statement

The original data presented in the study are openly available at http://www.cvlab.zut.edu.pl (accessed on 20 June 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Frejlichowski, D.; Mikołajczak, P. A System for Automatic Town Sign Recognition for Driver Assistance Systems. In Proceedings of the Computer Vision and Graphics International Conference, ICCVG 2018, Warsaw, Poland, 17–19 September 2018; Chmielewski, L.J., Kozera, R., Orłowski, A., Wojciechowski, K., Bruckstein, A.M., Petkov, N., Eds.; Springer: Cham, Switzerland, 2018; pp. 115–124.
  2. Tumas, P.; Nowosielski, A.; Serackis, A. Pedestrian Detection in Severe Weather Conditions. IEEE Access 2020, 8, 62775–62784.
  3. Nowosielski, A.; Małecki, K.; Forczmański, P.; Smoliński, A. Pedestrian Detection in Severe Lighting Conditions: Comparative Study of Human Performance vs Thermal-Imaging-Based Automatic System. In Progress in Computer Recognition Systems; CORES 2019. Advances in Intelligent Systems and Computing; Burduk, R., Kurzynski, M., Wozniak, M., Eds.; Springer: Cham, Switzerland, 2020; Volume 977, pp. 174–183.
  4. Małecki, K.; Nowosielski, A.; Kowalicki, M. Gesture-Based User Interface for Vehicle On-Board System: A Questionnaire and Research Approach. Appl. Sci. 2020, 10, 6620.
  5. World Health Organization. Global Status Report on Road Safety; World Health Organization: Geneva, Switzerland, 2018.
  6. Guo, W.; Di, C.; Long, L. Research on Fatigue Detection Method of Equipment Operators Based on Multi-Source Physiological Signals. J. Physics Conf. Ser. 2021, 1982, 012067.
  7. Alkinani, M.H.; Khan, W.Z.; Arshad, Q. Detecting Human Driver Inattentive and Aggressive Driving Behavior Using Deep Learning: Recent Advances, Requirements and Open Challenges. IEEE Access 2020, 8, 105008–105030.
  8. Survi, H.G. Driver Distraction Detection Using CNN. Int. J. Res. Appl. Sci. Eng. Technol. 2022, 10, 4779–4783.
  9. Khan, M.Z.; Khan, M.U.G.; Irshad, O.; Iqbal, R. Deep Learning and Blockchain Fusion for Detecting Driver’s Behavior in Smart Vehicles. Internet Technol. Lett. 2019, 3, e119.
  10. Hatay, E.; Ma, J.; Sun, H.; Fang, J.; Gao, Z.; Yu, H. Learning to Detect Phone-related Pedestrian Distracted Behaviors with Synthetic Data. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, 16–25 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 2975–2983.
  11. Małecki, K.; Nowosielski, A.; Forczmański, P. Multispectral Data Acquisition in the Assessment of Driver’s Fatigue. In Proceedings of the Smart Solutions in Today’s Transport. TST 2017. Communications in Computer and Information Science, Katowice-Ustroń, Poland, 5–8 April 2017; Mikulski, J., Ed.; Springer: Cham, Switzerland, 2017; Volume 715, pp. 320–332.
  12. Cyganek, B.; Gruszczyński, S. Hybrid computer vision system for drivers’ eye recognition and fatigue monitoring. Neurocomputing 2014, 126, 78–94.
  13. Yadav, A.; Sharma, N.; Yadav, Y.; Choudhary, J.; Hore, P. Machine Learning Based Classifier Model for Autonomous Distracted Driver Detection and Prevention. Int. J. Adv. Res. Ideas Innov. Technol. 2018, 4, 606–608.
  14. Fasanmade, A.; Aliyu, S.; He, Y.; Al-Bayatti, A.; Sharif, M.S.; Alfakeeh, A.S. Context-Aware Driver Distraction Severity Classification Using LSTM Network. In Proceedings of the 2019 International Conference on Computing, Electronics & Communications Engineering (iCCECE), London, UK, 22–23 August 2019.
  15. Fu, S.; Yang, Z.; Ma, Y.; Li, Z.; Xu, L.; Zhou, H. Advancements in the Intelligent Detection of Driver Fatigue and Distraction: A Comprehensive Review. Appl. Sci. 2024, 14, 3016.
  16. Tennakoon, S.; Wickramaarachchi, T.; Weerakotuwa, R.; Sulochana, P.; Karunasena, A.; Piyawardana, V. E-Pod: E-learning System for Improving Student Engagement in Asynchronous Mode. In Proceedings of the 2021 International Conference on Engineering and Emerging Technologies (ICEET), Istanbul, Turkey, 27–28 October 2021; pp. 1–6.
  17. Li, G.; Lee, B.L.; Chung, W.Y. Smartwatch-Based Wearable EEG System for Driver Drowsiness Detection. IEEE Sensors J. 2015, 15, 7169–7180.
  18. Karuppusamy, N.S.; Kang, B.Y. Multimodal System to Detect Driver Fatigue Using EEG, Gyroscope, and Image Processing. IEEE Access 2020, 8, 129645–129667.
  19. Beles, H.; Vesselenyi, T.; Rus, A.; Mitran, T.; Scurt, F.B.; Tolea, B.A. Driver Drowsiness Multi-Method Detection for Vehicles with Autonomous Driving Functions. Sensors 2024, 24, 1541.
  20. Du, G.; Li, T.; Li, C.; Liu, P.X.; Li, D. Vision-Based Fatigue Driving Recognition Method Integrating Heart Rate and Facial Features. IEEE Trans. Intell. Transp. Syst. 2021, 22, 3089–3100.
  21. Gumaei, A.; Al-Rakhami, M.S.; Hassan, M.; Alamri, A.; Alhussein, M.A.; Razzaque, M.; Fortino, G. A deep learning-based driver distraction identification framework over edge cloud. Neural Comput. Appl. 2020, 1433–3058.
  22. Ansari, S.; Du, H.; Naghdy, F.; Hoshu, A.A.; Stirling, D. A Semantic Hybrid Temporal Approach for Detecting Driver Mental Fatigue. Safety 2024, 10, 9.
  23. Das, S.; Pratihar, S.; Pradhan, B.; Jhaveri, R.H.; Benedetto, F. IoT-Assisted Automatic Driver Drowsiness Detection through Facial Movement Analysis Using Deep Learning and a U-Net-Based Architecture. Information 2024, 15, 30.
  24. Xing, Y.; Lv, C.; Wang, H.; Cao, D.; Velenis, E.; Wang, F. Driver Activity Recognition for Intelligent Vehicles: A Deep Learning Approach. IEEE Trans. Veh. Technol. 2019, 68, 5379–5390.
  25. Wagner, B.; Taffner, F.; Karaca, S.; Karge, L. Vision Based Detection of Driver Cell Phone Usage and Food Consumption. IEEE Trans. Intell. Transp. Syst. 2021, 23, 4257–4266.
  26. Alotaibi, M.; Alotaibi, B. Distracted Driver Classification Using Deep Learning. Signal Image Video Process. 2020, 14, 617–624.
  27. Díaz-Santos, S.; Cigala-Álvarez, Ó.; Gonzalez-Sosa, E.; Caballero-Gil, P.; Caballero-Gil, C. Driver Identification and Detection of Drowsiness while Driving. Appl. Sci. 2024, 14, 2603.
  28. Dong, B.T.; Lin, H.Y.; Chang, C.C. Driver Fatigue and Distracted Driving Detection Using Random Forest and Convolutional Neural Network. Appl. Sci. 2022, 12, 8674.
  29. Małecki, K.; Forczmański, P.; Nowosielski, A.; Smoliński, A.; Ozga, D. A New Benchmark Collection for Driver Fatigue Research Based on Thermal, Depth Map and Visible Light Imagery. In Progress in Computer Recognition Systems; CORES 2019. Advances in Intelligent Systems and Computing; Burduk, R., Kurzynski, M., Wozniak, M., Eds.; Springer: Cham, Switzerland, 2020; Volume 977, pp. 295–304.
  30. Majeed, F.; Shafique, U.; Safran, M.; Alfarhood, S.; Ashraf, I. Detection of Drowsiness among Drivers Using Novel Deep Convolutional Neural Network Model. Sensors 2023, 23, 8741.
  31. Ahmed, M.I.B.; Alabdulkarem, H.; Alomair, F.; Aldossary, D.; Alahmari, M.; Alhumaidan, M.; Alrassan, S.; Rahman, A.; Youldash, M.; Zaman, G. A Deep-Learning Approach to Driver Drowsiness Detection. Safety 2023, 9, 65.
  32. Forczmański, P.; Smoliński, A. Supporting Driver Physical State Estimation by Means of Thermal Image Processing. In Proceedings of the Computational Science—ICCS 2021. Lecture Notes in Computer Science, Krakow, Poland, 16–18 June 2021; Paszynski, M., Kranzlmüller, D., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M., Eds.; Springer: Cham, Switzerland; Volume 12746, pp. 149–163.
  33. Nowosielski, A.; Małecki, K.; Forczmański, P.; Smoliński, A.; Krzywicki, K. Embedded Night-Vision System for Pedestrian Detection. IEEE Sensors J. 2020, 20, 9293–9304.
  34. Zain, Z.M.; Roseli, M.S.; Abdullah, N.A. Enhancing Driver Safety: Real-Time Eye Detection for Drowsiness Prevention Driver Assistance Systems. Eng. Proc. 2023, 46, 39.
  35. Safarov, F.; Akhmedov, F.; Abdusalomov, A.B.; Nasimov, R.; Cho, Y.I. Real-Time Deep Learning-Based Drowsiness Detection: Leveraging Computer-Vision and Eye-Blink Analyses for Enhanced Road Safety. Sensors 2023, 23, 6459.
  36. Qin, B.; Qian, J.; Xin, Y.; Liu, B.; Dong, Y. Distracted Driver Detection Based on a CNN with Decreasing Filter Size. IEEE Trans. Intell. Transp. Syst. 2021, 23, 6922–6933.
  37. Forczmański, P.; Nowosielski, A. Deep Learning Approach to Detection of Preceding Vehicle in Advanced Driver Assistance. In Proceedings of the Challenge of Transport Telematics. TST 2016. Communications in Computer and Information Science, Katowice-Ustroń, Poland, 16–19 March 2016; Mikulski, J., Ed.; Springer: Cham, Switzerland, 2016; Volume 640, pp. 293–304.
  38. Yang, E.; Yi, O. Enhancing Road Safety: Deep Learning-Based Intelligent Driver Drowsiness Detection for Advanced Driver-Assistance Systems. Electronics 2024, 13, 708.
  39. Nie, B.; Huang, X.; Chen, Y.; Li, A.; Zhang, R.; Huang, J. Experimental study on visual detection for fatigue of fixed-position staff. Appl. Ergon. 2017, 65, 1–11.
  40. Yao, Z.; Zhou, X.; Qin, H.; Xiao, W. Monitoring Interface Design Based on Real-time Fatigue Detection. In Proceedings of the 2020 4th International Conference on Big Data and Internet of Things, BDIOT ’20, New York, NY, USA, 12–14 June 2020; pp. 49–53.
  41. Savaş, B.K.; Becerikli, Y. Real Time Driver Fatigue Detection System Based on Multi-Task ConNN. IEEE Access 2020, 8, 12491–12498.
  42. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
  43. Opałka, S.; Szajerman, D.; Wojciechowski, A. LSTM multichannel neural networks in mental task classification. COMPEL—Int. J. Comput. Math. Electr. Electron. Eng. 2019, 38, 1204–1213.
Figure 1. Overview of the proposed method for fatigue detection. The figure illustrates data flow through various stages: input, feature extraction, classification, and result generation, with possible fusion points indicated at different stages.
Figure 2. Exemplary structure of bidirectional-LSTM-based classifier.
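As a rough illustration of the classifier sketched in Figure 2, the snippet below assembles a bidirectional LSTM over sequences of per-frame feature vectors in Keras. The sequence length, layer widths, dropout rate, and feature dimension (25,088, i.e., a flattened VGG16 output) are illustrative assumptions rather than the exact configuration reported in the paper.

```python
# Minimal sketch of a Bi-LSTM classifier over per-frame CNN features.
# Layer sizes, dropout and sequence length are illustrative assumptions.
from tensorflow.keras import layers, models

SEQ_LEN = 150        # assumed frames per clip
FEAT_DIM = 25088     # flattened VGG16 output (7 x 7 x 512), see Table 2
NUM_CLASSES = 6      # behavior classes listed in Table 6

def build_bilstm_classifier() -> models.Model:
    inputs = layers.Input(shape=(SEQ_LEN, FEAT_DIM))
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(inputs)
    x = layers.Bidirectional(layers.LSTM(64))(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return models.Model(inputs, outputs)

model = build_bilstm_classifier()
model.summary()
```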
Figure 3. Unimodal architecture.
Figure 4. Unified architecture.
Figure 5. Early fusion architecture.
Figure 6. Late fusion architecture.
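To make the difference between Figures 5 and 6 concrete, the sketch below contrasts one plausible reading of the two strategies: early fusion concatenates the per-frame feature vectors of all four modalities before a single sequence classifier, whereas late fusion averages the class probabilities of separately trained unimodal classifiers. The concatenation and probability-averaging operators, as well as all layer sizes, are assumptions for illustration; the paper's exact fusion operators may differ.

```python
# Schematic early vs. late fusion over four modalities (RGB1, RGB2, depth, thermal).
# Feature dimension, layer sizes and the averaging rule are illustrative assumptions.
import numpy as np
from tensorflow.keras import layers, models

SEQ_LEN, FEAT_DIM, NUM_CLASSES, NUM_MODALITIES = 150, 2048, 6, 4

def early_fusion_model() -> models.Model:
    # One input per modality; features are concatenated frame-wise and classified jointly.
    inputs = [layers.Input(shape=(SEQ_LEN, FEAT_DIM)) for _ in range(NUM_MODALITIES)]
    fused = layers.Concatenate(axis=-1)(inputs)
    x = layers.Bidirectional(layers.LSTM(128))(fused)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return models.Model(inputs, outputs)

def late_fusion_predict(unimodal_models, modality_batches):
    # Each unimodal classifier votes with its softmax output; the average is the fused decision.
    probs = [m.predict(x, verbose=0) for m, x in zip(unimodal_models, modality_batches)]
    return np.mean(probs, axis=0).argmax(axis=1)
```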
Figure 7. Selected frames from the developed database, illustrating the variety of data modalities. From left: RGB images from one perspective (RGB1), thermal imagery, RGB images from the other perspective (RGB2), and point cloud representation.
Figure 8. Visualizing event shifts across different modalities.
Figure 9. ROI cropping process: ‘Before’ depicted at the top, ‘After’ illustrated at the bottom, accompanied by a heatmap for each modality to determine the ROI and scaling proportions.
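The ROI determination in Figure 9 relies on per-modality heatmaps. As a minimal sketch of that idea (with an assumed activity measure, quantile threshold, and output size, not the parameters used for the published dataset), one could derive the crop window from a temporal-activity heatmap as follows.

```python
# Hypothetical heatmap-driven ROI cropping for one modality.
# Activity measure, quantile threshold and output size are illustrative assumptions.
import cv2
import numpy as np

def crop_by_activity(frames: np.ndarray, out_size=(224, 224), quantile=0.90) -> np.ndarray:
    """frames: (n, h, w) grayscale stack from a single modality."""
    # Temporal activity heatmap: mean absolute difference between consecutive frames.
    heatmap = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=0)
    mask = heatmap >= np.quantile(heatmap, quantile)
    ys, xs = np.nonzero(mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    # Crop every frame to the bounding box of the most active region, then rescale.
    return np.stack([cv2.resize(f[y0:y1, x0:x1], out_size) for f in frames])
```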
Figure 10. Frames of yawning without mouth covering.
Figure 10. Frames of yawning without mouth covering.
Electronics 13 02457 g010
Figure 11. Frames of yawning with mouth covering.
Figure 12. Frames of unnatural blinking.
Figure 13. Frames of head drooping.
Figure 14. Frames of eye rubbing.
Table 1. Comparison of state-of-the-art methods for driver distraction detection.

| Study | Approach | Database | Accuracy |
| --- | --- | --- | --- |
| Majeed et al., 2023 [30] | Mouth aspect ratio (MAR) | YawDD dataset | 96.69% |
| Safarov et al., 2023 [35] | Eye-blinking rate, mouth states | Own | 96% |
| Ansari et al., 2024 [22] | Head and chest postures, vehicle features | Own | 99.60% |
| Ahmed et al., 2023 [31] | Eyes and mouth states | Dataset from Kaggle, 2900 images | 97% |
| Das et al., 2024 [23] | Percentage of eye closure (PERCLOS) and mouth states | Dataset from Kaggle, 3144 images | 98.80% |
| Yang et al., 2024 [38] | Mouth states | YawDD dataset | 97.05% |
| Díaz et al., 2024 [27] | IoT sensors, eyes states | DataFlair, approx. 7000 photographs | 100% |
| Li et al., 2015 [17] | EEG signal | Own | 91.92% |
| Karuppusamy et al., 2020 [18] | EEG, gyroscope, eyes and mouth states | Own, 4000 images | 93.91% |
| Savas et al., 2020 [41] | Eyes and mouth states | YawDD, NthuDDD datasets | 98.81% |
| Du et al., 2020 [20] | Heart rate, eyes and mouth states | Public dataset and own data | 94.74% |
Table 2. Resulting feature vector sizes for employed feature extractor backbones.

| Extractor | VGGNet16 | VGGNet19 | ResNet50 | InceptV3 |
| --- | --- | --- | --- | --- |
| Input size | 224 × 224 | 224 × 224 | 224 × 224 | 299 × 299 |
| Output size | 25,088 | 25,088 | 100,352 | 51,200 |
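The sizes in Table 2 follow from flattening the last convolutional feature map of each ImageNet-pretrained backbone: for a 224 × 224 input, VGG16/VGG19 produce 7 × 7 × 512 = 25,088 values and ResNet50 produces 7 × 7 × 2048 = 100,352, while the InceptionV3 figure depends on the exact input/pooling configuration and is taken from the table as given. A minimal Keras sketch of such per-frame extraction, assuming VGG16 with `include_top=False`, is shown below.

```python
# Sketch of per-frame feature extraction with an ImageNet-pretrained backbone.
# Using VGG16 here is an assumption; other backbones are swapped in the same way.
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input

backbone = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

def extract_features(frames: np.ndarray) -> np.ndarray:
    """frames: (n_frames, 224, 224, 3) RGB array with values in [0, 255]."""
    feats = backbone.predict(preprocess_input(frames.astype("float32")), verbose=0)
    return feats.reshape(len(frames), -1)   # (n_frames, 25088) for VGG16

dummy = (np.random.rand(4, 224, 224, 3) * 255).astype("float32")
print(extract_features(dummy).shape)        # (4, 25088)
```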
Table 3. Parameters of the training procedure.

| Parameter | Value |
| --- | --- |
| Optimizer | ADAM (with AMS Gradient) |
| Learning rate | 0.01 |
| Beta 1 | 0.9 |
| Beta 2 | 0.999 |
| Weight decay | 0.001 |
| Loss metric | categorical cross-entropy |
| Early stopping | 1500 iterations |
| Reduce LR on plateau | 200 iterations, factor 0.25, min lr 0.001 |
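The values in Table 3 map directly onto a standard Keras training setup; the sketch below is one such configuration, with the caveat that treating the "iterations" of the early-stopping and plateau criteria as epoch counts (rather than batches) is an assumption, and the `weight_decay` argument requires a recent Keras/TensorFlow release.

```python
# A possible Keras realization of the hyper-parameters listed in Table 3.
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.01,
    beta_1=0.9,
    beta_2=0.999,
    amsgrad=True,          # the "AMS Gradient" variant of ADAM
    weight_decay=0.001,    # available in recent Keras/TensorFlow versions
)

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=1500,
                                     restore_best_weights=True),
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", patience=200,
                                         factor=0.25, min_lr=0.001),
]

# model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_data, validation_data=val_data, callbacks=callbacks, epochs=...)
```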
Table 4. Details of the dataset: demographics and physical features.

| Number of Recordings | Glasses | Beard or Mustache | Long Hair | Male | Female |
| --- | --- | --- | --- | --- | --- |
| 44 | 8 | 5 | 8 | 28 | 9 |
Table 5. Details of the dataset: modality and video parameters.

| Modality | Resolution | Bit Depth | Frame Rate | Color Space | Compression |
| --- | --- | --- | --- | --- | --- |
| RGB1 | 1920 × 1080 | 24 b | 25 fps | YUV420 | H.264 |
| RGB2 | 1920 × 1080 | 24 b | 30 fps | YUV420 | H.264 |
| Thermal | 640 × 480 | 16 b | 30 fps | Gray | MJPEG |
| Depth | 640 × 480 | 16 b | 30 fps | Depth | MJPEG |
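Because RGB1 is recorded at 25 fps while the remaining streams run at 30 fps, frame-level use of all modalities requires some temporal alignment. The snippet below shows one simple nearest-timestamp mapping; it only illustrates the problem and is not the synchronization procedure used when the benchmark was recorded.

```python
# Illustrative nearest-timestamp alignment of a 25 fps stream to a 30 fps reference.
import numpy as np

def align_to_reference(src_fps: float, ref_fps: float, n_ref_frames: int) -> np.ndarray:
    """For each reference frame, return the index of the temporally closest source frame."""
    ref_times = np.arange(n_ref_frames) / ref_fps
    return np.rint(ref_times * src_fps).astype(int)

# Map 30 fps thermal/depth frame indices onto the 25 fps RGB1 stream:
print(align_to_reference(src_fps=25.0, ref_fps=30.0, n_ref_frames=10))
# Some RGB1 frames are repeated to keep the streams in step.
```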
Table 6. Characteristics of short video clips related to specific behavior.

| Behavior | No. Clips | Mean Frame No. | Std. Dev. Frame No. |
| --- | --- | --- | --- |
| Neutral | 176 | 150 | 0 |
| Yawning without covering the mouth | 100 | 118.2 | 41.2 |
| Yawning with covering the mouth | 105 | 121.0 | 49.5 |
| Unnatural blinking | 83 | 71.6 | 33.0 |
| Head drooping | 390 | 116.5 | 62.8 |
| Rubbing the eyes | 122 | 121.6 | 48.5 |
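Since clip lengths vary considerably around their means, the per-frame feature sequences must be brought to a common length (or masked) before batch training; the fixed 150-frame target and zero padding below are illustrative assumptions only.

```python
# Pad or truncate variable-length feature sequences to a common length.
# The 150-frame target (the neutral-clip length) is an assumed choice.
import numpy as np

def pad_or_truncate(seq: np.ndarray, target_len: int = 150) -> np.ndarray:
    """seq: (n_frames, feat_dim) array of per-frame features."""
    if len(seq) >= target_len:
        return seq[:target_len]
    pad = np.zeros((target_len - len(seq), seq.shape[1]), dtype=seq.dtype)
    return np.concatenate([seq, pad], axis=0)

batch = np.stack([pad_or_truncate(np.random.rand(n, 64)) for n in (72, 118, 150, 210)])
print(batch.shape)   # (4, 150, 64)
```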
Table 7. Accuracy for individual modalities (training/validation) across different feature extractor models.

| Extractor | RGB1 Train | RGB1 Validate | RGB2 Train | RGB2 Validate | Depth Train | Depth Validate | Thermal Train | Thermal Validate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ResNet50 | 0.982 | 0.891 | 0.984 | 0.863 | 0.986 | 0.781 | 0.982 | 0.837 |
| VGG16 | 0.999 | 0.870 | 0.997 | 0.836 | 0.997 | 0.708 | 0.974 | 0.797 |
| VGG19 | 0.994 | 0.818 | 0.994 | 0.809 | 0.996 | 0.682 | 0.977 | 0.781 |
| InceptV3 | 0.999 | 0.901 | 0.999 | 0.844 | 0.994 | 0.781 | 0.983 | 0.740 |
Table 8. Accuracy for the unified approach for all input modalities considering different feature extraction models.

| Extractor | Train | Validate |
| --- | --- | --- |
| ResNet50 | 0.999 | 0.864 |
| VGG16 | 0.998 | 0.835 |
| VGG19 | 0.999 | 0.825 |
| InceptV3 | 0.995 | 0.789 |
Table 9. Accuracy for early and late fusion methods across different feature extraction models.

| Extractor | Early Fusion Train | Early Fusion Validate | Late Fusion Train | Late Fusion Validate |
| --- | --- | --- | --- | --- |
| ResNet50 | 0.999 | 0.871 | 1.000 | 0.897 |
| VGG16 | 0.984 | 0.896 | 1.000 | 0.884 |
| VGG19 | 1.000 | 0.854 | 0.996 | 0.875 |
| InceptV3 | 1.000 | 0.800 | 0.996 | 0.839 |
Table 10. Summary of experimental results for early and late fusion.

| Fusion | Extractor | Precision | Recall | Accuracy |
| --- | --- | --- | --- | --- |
| Early | ResNet50 | 0.8496 | 0.8357 | 0.871 |
| Early | VGG16 | 0.8746 | 0.8604 | 0.896 |
| Early | VGG19 | 0.8243 | 0.8232 | 0.854 |
| Early | InceptV3 | 0.7522 | 0.7457 | 0.800 |
| Late | ResNet50 | 0.8742 | 0.8786 | 0.897 |
| Late | VGG16 | 0.8608 | 0.8577 | 0.884 |
| Late | VGG19 | 0.8490 | 0.8516 | 0.875 |
| Late | InceptV3 | 0.8190 | 0.7991 | 0.839 |
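For completeness, precision and recall figures of the kind reported in Tables 10 and 11 can be reproduced from predicted and ground-truth labels as below; macro averaging over the six behavior classes is an assumption here, since the averaging scheme is not restated in the tables.

```python
# Accuracy, precision and recall for a multi-class problem (macro averaging assumed).
from sklearn.metrics import accuracy_score, precision_score, recall_score

def summarize(y_true, y_pred) -> dict:
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
    }

print(summarize([0, 1, 2, 2, 5, 3], [0, 1, 2, 4, 5, 3]))
```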
Table 11. Summary of experimental results for different data types.

| Extractor | Modality | Precision | Recall | Accuracy |
| --- | --- | --- | --- | --- |
| ResNet50 | RGB1 | 0.8804 | 0.8501 | 0.891 |
| ResNet50 | RGB2 | 0.8262 | 0.8242 | 0.863 |
| ResNet50 | Depth | 0.7583 | 0.7241 | 0.781 |
| ResNet50 | Thermal | 0.7850 | 0.7208 | 0.837 |
| ResNet50 | All | 0.8249 | 0.8221 | 0.864 |
| VGG16 | RGB1 | 0.8387 | 0.8272 | 0.870 |
| VGG16 | RGB2 | 0.7912 | 0.7816 | 0.836 |
| VGG16 | Depth | 0.6921 | 0.6607 | 0.708 |
| VGG16 | Thermal | 0.7898 | 0.7104 | 0.797 |
| VGG16 | All | 0.7981 | 0.7797 | 0.835 |
| VGG19 | RGB1 | 0.7880 | 0.7765 | 0.818 |
| VGG19 | RGB2 | 0.7664 | 0.7603 | 0.818 |
| VGG19 | Depth | 0.6603 | 0.6369 | 0.682 |
| VGG19 | Thermal | 0.7734 | 0.6931 | 0.781 |
| VGG19 | All | 0.7814 | 0.7736 | 0.825 |
| InceptV3 | RGB1 | 0.8889 | 0.8620 | 0.901 |
| InceptV3 | RGB2 | 0.8079 | 0.7918 | 0.844 |
| InceptV3 | Depth | 0.7606 | 0.7230 | 0.781 |
| InceptV3 | Thermal | 0.7107 | 0.6367 | 0.740 |
| InceptV3 | All | 0.7377 | 0.7112 | 0.789 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
