DriverMVT: In-Cabin Dataset for Driver Monitoring Including Video and Vehicle Telemetry Information

Developing a driver monitoring system that can assess the driver’s state is a prerequisite for, and a key to, improving road safety. With the success of deep learning, such systems can achieve high accuracy if corresponding high-quality datasets are available. In this paper, we introduce DriverMVT (Driver Monitoring dataset with Videos and Telemetry). The dataset contains information about the driver’s head pose, heart rate, and behaviour inside the cabin, such as drowsiness and an unfastened seat belt. This dataset can be used to train and evaluate deep learning models that estimate the driver’s health state, mental state, concentration level, and activity in the cabin. Systems that alert the driver in case of drowsiness or distraction can reduce the number of accidents and increase safety on the road. The dataset contains 1506 videos of 9 different drivers (7 males and 2 females), with a total of 5119k frames and a total duration of over 36 h. In addition, we evaluated the dataset with the multi-task temporal shift convolutional attention network (MTTS-CAN) algorithm. The mean absolute error of the algorithm on our dataset is 16.375 heartbeats per minute.


Introduction
Road accidents cause the death of hundreds of thousands of people every year. According to the World Health Organization, they rank among the top ten causes of death in low- and middle-income countries [1], because they affect not only drivers and passengers but also pedestrians. Human error is the main reason for most of these accidents. To eliminate the human factor, considerable attention has been devoted to developing automated vehicles that are fully operated by Artificial Intelligence (AI).
As automated vehicles spread around the world, driving will become a shared activity between the human and the machine, which generates demand for systems that can evaluate the driver's state and his/her ability to take control of the vehicle at any moment.
Developing a driver monitoring system that can estimate the driver's state has recently drawn researchers' attention. These systems aim to increase the safety level on the roads by alerting the driver. They include:
1. Detection of the driver's vital signs like heart rate, blood pressure, oxygen saturation, and respiratory rate.
2. Detection of the driver's mental state like fatigue.
3. Measurement of the driver's attention and concentration levels.
4. Detection of the driver's activity inside the cabin.
Over the last decades, researchers have investigated drivers' behaviours to estimate the crash risk using naturalistic driving data like speed, acceleration, and braking. The data were collected using Global Positioning System (GPS) and On Board Diagnostics (OBD) [2], accelerometers [3], and smartphones [4] to identify risky and abnormal driving events and evaluate the crash risk. The authors of [5] developed a driver assessment and recommendation system to evaluate individual driving performance and improve traffic safety. They used features like the trip distance and duration, the average and maximum speed, and the number of hard brake and speed-up events with a Gaussian mixture model-universal background model and the maximum likelihood method to capture the driver signature. The authors of [6] developed a driving behavior-based relative risk evaluation model using a non-parametric optimization method that takes into consideration the frequency and the severity level of the different risky driving behaviors.
Researchers [7][8][9] have studied driver behaviour factors such as road traffic violations, lapses, failure to maintain a safe gap, errors related to visual perception failure, and others. Different methods were used to evaluate and prioritize the significant driver behavior factors related to road safety. Paper [7] designed an analytic hierarchy process with best-worst method (AHP-BWM) model to evaluate driver behavior factors within a three-level hierarchical structure. Paper [8] combined the best-worst method with triangular fuzzy sets as a supporting tool for ranking and prioritizing the critical driver behavior criteria. Paper [9] applied the Pythagorean Fuzzy Analytic Hierarchy Process to assess and prioritize the critical driver behavior criteria, arranged in a hierarchical model based on data gathered from observed driver groups in Budapest. This evaluation is valuable for making drivers aware of individual traffic risks, and it may assist in the implementation of effective local road safety policies.
Researchers have developed different methods to detect driver fatigue. Some of these methods depend on detecting biological signals like the heart rate [10,11], while others depend on physical features like the face and eyes [12,13].
In this paper, we present an annotated dataset, DriverMVT (Driver Monitoring dataset with Videos and Telemetry), for monitoring the driver inside the vehicle cabin. This dataset can be used to train and evaluate deep learning models that estimate the driver's state, such as fatigue, distraction, or poor health. Developing models that detect such critical behaviour and alert the driver can prevent many accidents and increase safety on the road.
The rest of the paper is organized as follows: A review of the methods and datasets used for driver monitoring is presented in Section 2. Section 3 contains detailed information about our proposed dataset and how to use it. Section 4 shows the experiments for data evaluation. Finally, the conclusion is presented in Section 5.

Related Work
In this section, we present a small overview of the methods and datasets used for driver monitoring. The authors of paper [14] introduced a diverse benchmark with 2000 video sequences and over 650,000 frames that contain normal, critical, and accidental situations together in each video sequence. The dataset is for the scenes outside the vehicle. The researchers are answering the following question: Can we predict a driving accident if we know the driver's attention level?
The authors of paper [15] proposed a dataset called DrivFace that contains image sequences of subjects driving in real scenarios. The dataset consists of 606 samples with a resolution of 640 × 480 pixels, acquired from 4 drivers (2 women and 2 men) with different facial features like glasses and beards. This dataset is annotated with head pose angles and the view direction. The authors also proposed a method to estimate the attention level from the head pose angles.
The authors of paper [16] introduced the MPIIGaze dataset, which contains 213,659 images collected from 15 participants during natural everyday computer use over more than three months, with corresponding ground-truth gaze positions. The dataset has a large variability in appearance and illumination, but it was not recorded in real driving scenarios. The main purpose of the dataset is to estimate the gaze angle from a monocular camera in order to determine the attention level.
The authors of paper [17] introduced DriveAHead dataset, which contains more than 10 h of infrared (IR) and depth images of drivers' head poses taken in real driving situations. The dataset provides frame-by-frame head pose labels obtained from a motion-capture system, as well as annotations about occlusions of the driver's face. The dataset was collected from 20 persons (4 females and 16 males) using Kinect v2.
The authors of paper [18] introduced a dataset collected from 14 young people (11 females, 3 males) who performed three successive experiments (each lasting 10 min) in conditions of increasing sleep deprivation induced by acute, prolonged waking. The dataset contains different types of data (images, signals, etc.) and aims to help researchers in the field of drowsiness monitoring, but it was not recorded inside the car cabin.
The authors of paper [19] introduced a dataset that consists of videos of drivers performing actions related to different driving scenarios. The dataset was acquired from 35 participants (10 females, 25 males) in different lighting conditions according to the time the session was recorded (morning or afternoon), at different speeds, and both in simulations and in real scenarios. The dataset was recorded using 3 Intel RealSense Depth Camera D400-Series devices placed in different locations to capture the face, the body, and the hands of the driver.
The authors of paper [20] published a DMD dataset, consisting of videos of the drivers performing distraction actions in an automated driving scenario. The dataset contains over 9.6 million frames of people recorded using 5 near-infrared cameras in different perspectives, and 3 channels from a side camera (RGB, depth, IR).
The authors of paper [21] proposed the Multimodal Spontaneous Expression Heart Rate (MMSE-HR) dataset, which is composed of videos and associated information about the heart rate and the blood pressure. The dataset was collected from 140 participants (58 males and 82 females) of different ages and ethnicities. The data were acquired from different face sensors (high-resolution 3D dynamic imaging, high-resolution 2D video, and thermal sensing) and contact sensors (electrical skin conductivity, respiration, blood pressure, and heart rate).
In contrast to our DriverMVT dataset, most of the datasets found in the literature concentrate on a particular task like head pose, gaze angles, action classification, or drowsiness. Our dataset provides detailed and diverse information that makes it useful for a wider range of driver-related tasks. It provides frame-by-frame annotation of driver health indices like the heart rate, the mental state like fatigue, and the head pose estimation, along with driver activities. In addition, our dataset was recorded in a real environment while subjects were driving home or to work. The dataset is diverse in terms of lighting conditions and speed. Table 1 shows a comparison between the available datasets and our dataset.

Dataset
In this section, an overview of the dataset is presented. Section 3.1 addresses the methodology used for collecting the proposed dataset, Section 3.2 provides the description of the dataset, and Section 3.3 presents an exploratory analysis of the dataset.

Collection Methodology
In this section, we introduce the collection methodology. In Section 3.1.1, we describe the devices used for data collection, while in Section 3.1.2, we describe the acquiring process.

Collection Devices
The dataset was collected using different camera types: a USB camera produced by ELP (see Figure 1), a Samsung Galaxy S10 camera, and a Samsung Galaxy S20 camera. The USB camera's sensor is the OV7725, a single-chip VGA camera with an image processor. The lens size is 1/4 inch with a view angle of 30-150 degrees, and the sensor incorporates a 640 × 480 image array operating at a frame rate of 30 fps. The USB camera also has a high-speed USB 2.0 interface module. For the smartphones, videos were recorded with a resolution of 1080 × 1920 and a frame rate of 60 fps. For the heart rate recording, we used a Xiaomi Mi Band 3. This is not a medical device, but it estimates the heart rate precisely enough for the tasks mentioned in the paper.

Data Collection
The dataset was acquired from 9 drivers of different ages and genders (2 females and 7 males), with a total of 5119k frames and a total duration of over 36 h, under different conditions of car speed and light. We included drivers with different facial features (with/without beard, with/without mustache, long/short hair, etc.). Table 2 presents the demographic data of the participants. The drivers are all from St. Petersburg, Russia. We chose the participants to be diverse and balanced with regard to facial features and ages.
The videos were recorded and saved with the exact date and time, while the metadata was saved to the database with additional information like the user id, the measurement time, and the time when the ride started. This additional information is used later for synchronization, as shown in Section 3.3.2. Figure 2 shows the scheme of acquiring the information.

Data Description
The dataset consists of 1506 videos of drivers inside the vehicle cabin and is divided into three sub-categories (see Figure 3):
• Imprecise synchronization: this category contains videos with a mean length of 1 min and metadata for each video. Each video is annotated frame by frame, but the synchronization between the video and the metadata is not precise, with a maximum delay of 1 s.
• Precise synchronization and heart rate information: this category contains videos with a mean length of 30 min and metadata for each video. Each video is annotated frame by frame with perfect synchronization. This category also contains information about the driver's heart rate.
• Precise synchronization and no heart rate information: this category contains videos with a mean length of 30 min and frame-by-frame annotation for each video. The synchronization between the video and the metadata is precise.
For each video, the metadata is given in a CSV file. The file contains general information about the video (see Table 3): the geographic coordinates (latitude, longitude, and altitude); the starting time of the driving trip as a Unix timestamp in milliseconds; the date time (Unix timestamp in milliseconds), which describes the time of recording the video; the car speed; the light level and illuminance; the head pose angles (roll, pitch, and yaw) calculated using the method in paper [22]; the motion sensor data (accelerometer, gyroscope, and magnetometer data); the mouth openness ratio; the seat belt state, which indicates whether the belt is fastened or not [23]; and the heart rate measured using the Xiaomi Mi Band 5 smart watch.

Data Distribution
In this section, we present an exploratory analysis of the proposed dataset. Section 3.3.1 shows information about the metadata, like the data type and the number of missing values in each column. In addition, a visualization of the distribution of data like heart rate and speed is presented. In Section 3.3.2, the synchronization method between the videos and the metadata is explained.

Data Exploration
In this section we provide a basic understanding of the dataset by showing the statistics and the distribution of the data. Table 4 shows information about the metadata of the driver video and HR information. The table shows that there is some missing information in the face_mouth, head_pose, and heart_rate columns. Face_mouth column is calculated based on the Faceboxes framework.
The head_pose column is calculated based on the image processing approach discussed in paper [22]. Some frames do not have suitable exposure, and in some cases the driver's head cannot be determined. In these cases, some values in this column can be missing. The heart_rate column contains the data from the Xiaomi Mi Band 3. Since not all the drivers used the device, some of its values are missing as well. The dangerous state column has values when a critical event like fatigue occurs; otherwise, the state is considered to be normal. Figure 5 shows the distribution of the data according to speed. Around 29% of our dataset was recorded when the car was not moving (for example, when the driver stopped at a traffic light).
Figure 5. (a) The distribution of the data according to speed. (b) Distribution for data with speed > 0.
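The kind of exploration described above can be sketched in a few lines of Python. This is only an illustration: the column names heart_rate, head_pose, and face_mouth are taken from the text, while the frame and speed column names, the cell formats, and the sample rows are assumptions made for the example and are not the actual file layout.

```python
import csv
import io

# Hypothetical sample in the spirit of the per-video metadata file.
# Real files contain more columns; the values here are invented.
SAMPLE_CSV = """frame,speed,heart_rate,head_pose,face_mouth
0,0.0,72,"[1.0, -2.0, 0.5]",0.10
1,0.0,,"[1.1, -2.1, 0.4]",0.12
2,12.5,75,,
3,30.0,,"[0.9, -1.8, 0.6]",0.35
"""


def summarize(csv_file):
    """Return (missing values per column, share of rows with zero speed)."""
    rows = list(csv.DictReader(csv_file))
    columns = rows[0].keys() if rows else []
    # An empty cell is treated as a missing value.
    missing = {c: sum(1 for r in rows if r[c] == "") for c in columns}
    stationary = sum(
        1 for r in rows if r["speed"] != "" and float(r["speed"]) == 0.0
    )
    share = stationary / len(rows) if rows else 0.0
    return missing, share


missing, share = summarize(io.StringIO(SAMPLE_CSV))
print(missing)
print(f"share of rows with zero speed: {share:.0%}")
```

For a real metadata file, the io.StringIO object would be replaced by an open file handle.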

Data Synchronization
As mentioned earlier, the name of each video file represents the recording start time of the video, given either as an exact Unix timestamp in milliseconds or as a date and time with a resolution of seconds. The metadata were saved in the database using Unix timestamps. To synchronize the metadata with the exact video frame, we used Equation (1):

$$\mathrm{frame} = \frac{\mathrm{datetime} - \mathrm{video\_recording\_time}}{1000} \times \mathrm{framerate} \qquad (1)$$

where frame represents the frame number that is described by the metadata, datetime represents the Unix timestamp in ms of the metadata, video_recording_time represents the Unix timestamp in ms of the start of the video recording, and framerate represents the video frame rate. This way, the videos named by Unix timestamp are perfectly synchronized, while the videos named by date and time are shifted. The maximum difference is 1 s or 10-60 frames. For efficient usage of the data, we performed the synchronization for the whole dataset. Each video is annotated frame by frame, and the metadata is saved in a CSV file that contains the frame number along with the additional information.
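The synchronization step can be illustrated with a short Python sketch. It assumes that Equation (1) converts the millisecond offset between the metadata timestamp and the start of the recording into seconds and multiplies it by the frame rate, as the definitions above imply. The function name, the truncation to an integer frame index, and the sample timestamps are assumptions made for the example.

```python
def metadata_to_frame(datetime_ms, video_recording_time_ms, framerate):
    """Map a metadata timestamp (ms) to a frame index of the video.

    datetime_ms: Unix timestamp of the metadata record, in milliseconds.
    video_recording_time_ms: Unix timestamp of the recording start, in ms.
    framerate: video frame rate in frames per second.
    """
    elapsed_s = (datetime_ms - video_recording_time_ms) / 1000.0
    # Truncating to an integer index is an assumption of this sketch.
    return int(elapsed_s * framerate)


# Hypothetical recording start and a metadata record taken 2.5 s later.
video_start_ms = 1_600_000_000_000
print(metadata_to_frame(video_start_ms + 2_500, video_start_ms, 30.0))  # 75
```

A start time known only to the second can be off by up to 1 s, which corresponds to up to framerate frames of shift in the computed index.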

Data Evaluation
To validate our dataset, we carried out experiments with the multi-task temporal shift convolutional attention network (MTTS-CAN) [24], one of the state-of-the-art algorithms in heart rate estimation. The architecture is presented in Figure 8.
We tested the algorithm on the subset of our dataset that contains heart rate information. This subset consists of 12 videos. MTTS-CAN showed a mean absolute error of 16.375 heartbeats per minute and a root mean square error of 19.495, which is considered a high error. In addition, we carried out a separate experiment to evaluate the respiratory rate. We used our algorithm proposed in paper [25] to detect the respiratory rate when the car speed is zero or close to zero. We ran experiments with this method on the presented dataset and concluded that the respiratory rate can be measured when the vehicle speed is less than 3 km/h. The algorithm can be summarized in the following steps:

1. Estimate the position of the chest keypoint using the OpenPose human pose estimation model.
2. Clean the displacement signal using filtering and detrending, then count the number of peaks/troughs in a time window of one minute.

Figure 9 shows the algorithm scheme.
Figure 9. Respiratory rate algorithm scheme [25].
Figure 10 shows the heart rate signal produced by MTTS-CAN and the respiratory rate signal.
Figure 10. The predicted heart rate and respiratory rate signals for a video of a driver inside the car, with a vehicle speed of zero and a heart rate of 112. (a) The heart rate signal produced by MTTS-CAN. The predicted pulse is 89 BPM. (b) The respiratory rate signal produced by our algorithm. The predicted RR is 30 BPM.
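The signal-processing part of the respiratory rate steps (cleaning the displacement signal and counting peaks within one minute) can be sketched as follows. This is not the implementation from paper [25]: the moving-average detrending and smoothing, their window lengths, and all function names are stand-ins chosen for illustration, and the extraction of the chest keypoint with OpenPose is not shown. The input is a synthetic displacement signal rather than dataset data.

```python
import math


def moving_average(signal, window):
    """Centered moving average; the window shrinks at the edges."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def count_peaks(signal):
    """Count local maxima that lie above zero."""
    return sum(
        1
        for i in range(1, len(signal) - 1)
        if signal[i] > 0 and signal[i - 1] < signal[i] >= signal[i + 1]
    )


def respiratory_rate(displacement, fps, window_s=60.0):
    """Estimate breaths per minute from a chest displacement signal."""
    samples = displacement[: int(window_s * fps)]
    if not samples:
        raise ValueError("empty displacement signal")
    # Detrend: subtract a slow moving average (about 6 s) to remove drift.
    trend = moving_average(samples, int(6 * fps) | 1)
    detrended = [s - t for s, t in zip(samples, trend)]
    # Filter: a short moving average (about 0.5 s) suppresses jitter.
    smoothed = moving_average(detrended, int(0.5 * fps) | 1)
    duration_min = len(samples) / fps / 60.0
    return count_peaks(smoothed) / duration_min


# Synthetic chest displacement: 15 breaths per minute plus a slow drift.
fps = 30.0
signal = [
    math.sin(2 * math.pi * 0.25 * (i / fps)) + 0.01 * (i / fps)
    for i in range(int(60 * fps))
]
rate = respiratory_rate(signal, fps)
print(f"estimated respiratory rate: {rate:.1f} breaths per minute")
```

A production pipeline would more likely use a band-pass filter tuned to the plausible breathing band, but the counting logic would stay the same.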
In our experiment, we divided the videos into three classes depending on the heart rate, then used our algorithm to calculate the respiratory rate of the driver. Table 5 shows the mean respiratory rate for each class. As the table shows, there is a direct relationship between the heart rate and the respiratory rate: as the heart rate increases, the respiratory rate increases as well.

Conclusions
In this paper, we introduced a new extensive, diverse dataset called DriverMVT, designed to allow researchers to develop contactless real-time monitoring systems. The dataset contains 1506 videos collected using a monocular camera from 9 subjects in real driving scenarios, with a total of 5119k frames and a total duration of over 36 h. For each video, the dataset contains the following time-synchronized information: geographic coordinates, speed, acceleration, light conditions, magnetic orientation, angular velocity, driver head pose, driver mouth openness ratio, driver heart rate, and driver actions. The dataset can be used to train and evaluate models for detecting drowsiness/fatigue, detecting distraction based on the head pose information, and predicting the driver's heart rate to assess the driver's health state. These models can reduce accidents and increase safety on the road. In addition, we evaluated the dataset with the MTTS-CAN algorithm. The mean absolute error of the algorithm on our dataset is 16.375 heartbeats per minute. We hope that other researchers will use our dataset in other innovative ways. The main goal is that the research and models based on the DriverMVT dataset will one day help save lives on the roads.