1. Introduction
Distracted and drowsy driving leads to thousands of deaths and injuries each year [1,2]. Over the decades, many research studies have applied artificial intelligence to reduce fatal crashes caused by driver distraction and drowsiness. Despite these efforts, driver monitoring systems have not become widespread, because many studies rely on indirect characteristics (e.g., steering wheel movement, the standard deviation of lane position), intrusive wearable equipment, or high-performance processors. These research studies are categorized into vehicle-based, behavioral, and physiological measurements [3].
Many automakers and researchers have adopted vehicle-based measurement and have implemented driver monitoring systems based on it because it delivers real-time performance, non-intrusive operation, and adaptability [4,5]. However, vehicle-based measurement is not specific to distraction and drowsiness, because variations in in-vehicle data can also result from impaired driving, individual driving styles, road geometry, etc. [3,6]. The experiment in [7] showed a case in which the driving pattern did not change appreciably even though the driver was drowsy. Furthermore, since vehicle-based measurement typically recognizes drowsiness only after it has reached its final stage, warnings to prevent an accident may be generated too late [8].
The driver's physiological signals during driving are characteristic features closely related to recognizing distraction and drowsiness. Driver monitoring systems using physiological measurement provide high accuracy and can be implemented in real time [9]. Despite these merits, such systems are difficult to commercialize because the equipment required to collect physiological signals is usually bulky, expensive, and intrusive [10]. Several wearable devices, such as bands, watches, and headbands, can capture physiological signals, but they are not reliable sources for recognizing driver distraction and drowsiness. For these reasons, driver monitoring systems based on physiological measurement remain difficult to commercialize.
Driver monitoring systems using behavioral measurement have been proposed as a compromise and have been actively researched for decades. Most of these studies have used hand-crafted features and machine-learning classifiers. However, hand-crafted features are usually low-level features and handle variations in object appearance, such as pose, illumination, and orientation, poorly [11,12]. It is also difficult to select appropriate features and classifiers according to the dataset's characteristics [11]. After the celebrated victory of AlexNet at the ILSVRC2012 (ImageNet Large Scale Visual Recognition Challenge 2012) [13], deep learning, in which the network learns features itself, has been actively researched and has achieved excellent classification performance in various fields. As shown in Table 1, many recent studies of driver monitoring systems using behavioral measurement have used deep-learning algorithms.
However, since deep-learning algorithms require substantial memory and computation, most studies on deep-learning-based driver monitoring systems utilize a high-performance CPU or GPU. Furthermore, most studies adopt a hierarchical structure that performs face detection, eye or mouth detection, and driver status (distraction or drowsiness) recognition in sequence. Not only does a hierarchical structure require a higher-specification CPU or GPU, but the driver status may also be recognized improperly if the driver's face is not detected or is detected incorrectly. Although tracking algorithms have been used to address this problem, there is as yet no definitive solution for accurately detecting and tracking the driver's face, so face detection and tracking can degrade both the speed and accuracy of the system.
In [14], Park et al. proposed a method without a hierarchical structure and showed that the driver's drowsiness could be recognized directly from the input image without a significant decrease in performance. Although the AlexNet and VGG [15] models used in [14] have relatively shallow layers, their many parameters require considerable memory. In the experiment of [16], AlexNet achieved 4.5 Frames Per Second (FPS) using five Raspberry Pis, and VGG achieved 0.5 FPS using nine Raspberry Pis. This shows that driver monitoring systems based on deep-learning models such as the AlexNet, VGG, and FlowNet [17] used in [14] had difficulty achieving real-time performance.
To solve this problem, we propose a Multi-Task Mobilenets (MT-Mobilenets)-based lightweight driver monitoring system that uses a resource sharing device, such as the driver's mobile phone. Our previous work [18] showed that the driver's facial behaviors related to driver status recognition can be recognized without relying on face detection and tracking by using Mobilenets, which is smaller and faster than AlexNet, VGG, and FlowNet. MT-Mobilenets improves upon the Mobilenets of that previous research: it uses a single Mobilenets, without face detection and tracking, to independently recognize facial behaviors such as head direction, eye closure, and mouth opening. We defined the driver status, such as distraction and drowsiness, in terms of the facial behaviors recognized in the image. Thus, it is possible to recognize both distraction and drowsiness with one MT-Mobilenets, unlike most other studies, which focus on recognizing either distraction or drowsiness. On a Raspberry Pi, a Single-Board Computer (SBC), however, the FPS of MT-Mobilenets alone is still too low to recognize the driver status accurately. To alleviate this problem, we propose a lightweight driver monitoring system that uses a resource sharing device in the vehicle. Resource sharing devices connected to the proposed system share their computing resources, from which the system obtains additional recognition results. The MT-Mobilenets-based lightweight driver monitoring system thereby increases FPS and improves accuracy by using in-vehicle resource sharing devices, such as the mobile devices of the driver or passengers.
The content of this article is organized into the following sections.
Section 2 describes the dataset.
Section 3 describes the proposed lightweight driver monitoring system.
Section 4 presents and analyzes our experiments. Finally,
Section 5 presents our conclusions and outlines future work.
3. Lightweight Driver Monitoring System Configuration
The proposed system consisted of the main block implemented on the in-vehicle SBC, and sub-device blocks implemented on the in-vehicle resource sharing devices such as the driver’s mobile device, as shown in
Figure 9. The main block included a server engine module, a task offloader module, and a facial behavior recognition module. The server engine module performed image acquisition, image transmission, and driver status recognition. The task offloader module managed the schedule among the facial behavior recognition modules; it delivered images from the server engine module to the facial behavior recognition module in the main block and delivered encoded images to the sub-device blocks. Each sub-device block included a client engine module and a facial behavior recognition module. The client engine module decoded the images received from the main block and delivered the decoded images to the facial behavior recognition module in the sub-device block. The facial behavior recognition module recognized facial behaviors, such as head direction, eye closure, and mouth opening; the modules in the sub-device blocks performed the same functions as the one in the main block. The sub-device blocks recognized facial behaviors in the images that were not processed by the main block and transferred the recognition results to the main block. The main block accumulated the facial behavior recognition results from itself and the sub-device blocks and recognized the driver status on the in-vehicle SBC.
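As a rough illustration of this message flow, the sketch below models the recognition result that each block returns to the main block and the main block's accumulation step. The field names (`timestamp`, `head_direction`, `eye_closure`, `mouth_opening`) and the sorting policy are our own assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class FacialBehaviorResult:
    """One recognition result returned to the main block (field names assumed)."""
    timestamp: float        # capture time of the processed image
    head_direction: str     # e.g., "front", "left", "right"
    eye_closure: bool       # True if the eyes are closed
    mouth_opening: bool     # True if the mouth is open

class MainBlockAccumulator:
    """Accumulates results from the main and sub-device blocks in time order."""
    def __init__(self):
        self.results: list[FacialBehaviorResult] = []

    def on_result(self, result: FacialBehaviorResult) -> None:
        # Results may arrive out of order from different devices, so keep
        # the accumulated list sorted by capture timestamp before the
        # driver status is derived from the time series.
        self.results.append(result)
        self.results.sort(key=lambda r: r.timestamp)
```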
Since Wi-Fi uses the ISM band, communication performance may be affected by the surrounding environment. Interference can be classified into signals received from outside the vehicle and signals generated inside it. Interference from outside the vehicle does not greatly influence communication, because the vehicle's iron and steel structure attenuates external signals. To minimize the influence of contention inside the vehicle, we used high-priority IEEE 802.11e QoS (Quality of Service). Nevertheless, to minimize delays under high traffic, the scheduler was designed to account for the transmission delay time by selecting the image to transmit using a communication state prediction model [36] when receiving a request from a sub-device block.
3.1. Server Engine Module
The server engine module performed image acquisition, image transmission, and driver status recognition. The server engine consisted of grabber and recognizer components. When the grabber component received a request from the task offloader module, it transmitted the image acquired from the camera together with the image's timestamp. To minimize transmission delays over Wi-Fi, the grabber component preprocessed the images through center cropping, resizing, and normalizing. The recognizer component accumulated facial behavior recognition results and recognized the driver status from the accumulated results. In the real-time implementation, the server engine module operated whenever it received a request or behavior recognition results, as shown in Figure 10a.
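A minimal sketch of the grabber's preprocessing steps (center cropping, resizing, and normalizing) might look as follows; the target size of 224 pixels and the [0, 1] normalization range are assumptions, since the text does not fix them here.

```python
import cv2
import numpy as np

def preprocess(frame: np.ndarray, size: int = 224) -> np.ndarray:
    """Center-crop to a square, resize, and normalize (parameters assumed)."""
    h, w = frame.shape[:2]
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    cropped = frame[top:top + side, left:left + side]   # center crop
    resized = cv2.resize(cropped, (size, size))          # resize
    return resized.astype(np.float32) / 255.0            # normalize to [0, 1]
```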
For the evaluation, we needed datasets containing both images and ground truth. The server engine module needed to transmit the next image to be processed and to recognize the driver status while accounting for the processing time of the subsequent modules. In the experiment, the server engine module worked as shown in
Figure 10b. The grabber component loaded the saved images sequentially, regardless of requests from the task offloader module, and transmitted the most recently loaded image whenever it received a request from the task offloader module. It loaded each image at a time interval calculated by Equation (5):

$d_i = (T_{i+1} - T_i) - t_{load}$ (5)

In Equation (5), $d_i$ denotes the delay before loading the next image, $T_i$ denotes the timestamp of the $i$-th image, and $t_{load}$ denotes the time taken to load the current image. The facial behaviors that were not recognized in unprocessed images were assigned the last results received from the task offloader module. From the time-series data of the facial behavior recognition results, the recognizer component recognized the driver status.
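For illustration, the grabber's pacing in Equation (5) could be implemented as below; the function and variable names follow the reconstructed equation and are our own, not the authors' code.

```python
import time

def replay_images(paths, timestamps, load_image):
    """Replay saved images at the intervals given by Equation (5) (illustrative)."""
    for i, path in enumerate(paths):
        t_start = time.monotonic()
        image = load_image(path)                 # load the current image
        t_load = time.monotonic() - t_start      # time taken to load it
        yield timestamps[i], image
        if i + 1 < len(timestamps):
            # d_i = (T_{i+1} - T_i) - t_load: sleep for the remaining
            # inter-frame gap so replay matches the capture timing.
            delay = (timestamps[i + 1] - timestamps[i]) - t_load
            if delay > 0:
                time.sleep(delay)
```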
3.2. Task Offloader Module
The task offloader module contained transfer and encoder components to deliver images; by default, it had one transfer component. When a resource sharing device in the vehicle created a sub-device block, the device connected to the main block via Wi-Fi, and the task offloader module of the main block assigned an ID to the device and created a new corresponding transfer component. Each transfer component received a preprocessed image from the server engine module, along with the image's timestamp. The first transfer component transmitted the image to the facial behavior recognition module of the main block and received the recognition results from it. For the other transfer components, the encoder component encoded the image into byte format, and the transfer component delivered the encoded image over a socket to its sub-device block via Wi-Fi. Whenever the transfer components received facial behavior recognition results, they delivered both the recognition results and the timestamp to the server engine module.
Figure 11 shows the procedure of the proposed system.
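As a hedged sketch of the encoder and transfer components described above, an image could be JPEG-encoded to bytes and sent over a TCP socket with a simple length prefix; this wire format is our assumption, not one specified in the paper.

```python
import socket
import struct
import cv2
import numpy as np

def send_image(sock: socket.socket, image: np.ndarray, timestamp: float) -> None:
    """Encode an 8-bit BGR image to bytes and send it with its timestamp
    (framing assumed: 8-byte timestamp + 4-byte length + JPEG payload)."""
    ok, buf = cv2.imencode(".jpg", image)   # encode the image in byte format
    if not ok:
        raise ValueError("JPEG encoding failed")
    payload = buf.tobytes()
    sock.sendall(struct.pack("!dI", timestamp, len(payload)) + payload)
```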
The task offloader module consisted of a scheduler component, an encoder component, and transfer components, with the scheduler component managing the schedule among the transfer components. The scheduler component scheduled the facial behavior recognition modules of the main block and the sub-device blocks so that they did not process adjacent images, in the following steps: First, we calculated the average time $\bar{t}$ to recognize the driver status, measured from a previous image up to the current image in the main block. Then, the average time was divided by the number of transfers ($L$). The delay ($d_j$) was calculated by multiplying $\bar{t}/L$ by the transfer's number $j$, as shown in Equation (6):

$d_j = (\bar{t}/L) \times j$ (6)

Here, $\bar{t}$ represents the processing time from making a request to performing recognition, as shown in Figure 11a. Finally, the task offloader module requested the next image after the delay $d_j$ and delivered the image through the $j$-th transfer.
3.3. Client Engine Module
The client engine module had transfer and decoder components. The transfer component delivered encoded images from the main block to the decoder component. The decoder component decoded each received image and delivered the decoded image to the facial behavior recognition module of the sub-device block. The transfer component then received the recognition results from the facial behavior recognition module and delivered them to the main block.
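On the sub-device side, the decoder component's counterpart to the earlier send sketch could read the framed message and decode the JPEG bytes back into an image; again, the framing is an assumption of ours.

```python
import socket
import struct
import cv2
import numpy as np

def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from the socket."""
    data = bytearray()
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError("socket closed")
        data.extend(chunk)
    return bytes(data)

def receive_image(sock: socket.socket):
    """Receive one (timestamp, image) pair framed as in the send sketch."""
    timestamp, length = struct.unpack("!dI", recv_exact(sock, 12))
    payload = recv_exact(sock, length)
    image = cv2.imdecode(np.frombuffer(payload, dtype=np.uint8), cv2.IMREAD_COLOR)
    return timestamp, image
```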
3.4. Facial Behavior Recognition Module
The facial behavior recognition modules recognized facial behaviors, such as head direction, eye closure, and mouth opening, from the driver image. The modules of the main block and the sub-device blocks performed identical functions. A CNN inference component within the facial behavior recognition module used a previously trained CNN model; after loading the pre-trained model once, this component performed only model inference. The facial behavior recognition modules transferred the recognition results of the CNN inference component to their corresponding transfer components.
We used MT-Mobilenets to recognize facial behaviors such as head direction, eye closure, and mouth opening. Figure 12 describes the structure of MT-Mobilenets. MT-Mobilenets recognizes facial behaviors directly from the image, without face detection and tracking. Using MT-Mobilenets has the following advantages over traditional driver monitoring methods: First, MT-Mobilenets is driven by Mobilenets, which shows relatively low computational latency and high accuracy in mobile environments. In addition, MT-Mobilenets does not depend on whether the driver's face, eyes, and mouth are precisely detected. Finally, facial behaviors including head direction, eye closure, and mouth opening can be recognized simultaneously from one shared Mobilenets base.
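As a hedged sketch of this multi-task structure, a shared MobileNet base can feed three softmax heads; the input size, class counts, and layer configuration below are assumptions based on the description, not the authors' exact model.

```python
import tensorflow as tf

def build_mt_mobilenets(input_shape=(224, 224, 3)) -> tf.keras.Model:
    """Shared MobileNet base with three softmax heads (a sketch, not the
    authors' exact architecture)."""
    base = tf.keras.applications.MobileNet(
        input_shape=input_shape, include_top=False, weights=None)
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
    # One softmax regression per facial behavior; class counts are assumed.
    head_dir = tf.keras.layers.Dense(3, activation="softmax", name="head_direction")(x)
    eye = tf.keras.layers.Dense(2, activation="softmax", name="eye_closure")(x)
    mouth = tf.keras.layers.Dense(2, activation="softmax", name="mouth_opening")(x)
    return tf.keras.Model(inputs=base.input, outputs=[head_dir, eye, mouth])
```

Because the three heads share one base network, a single forward pass yields all three facial behaviors, which is what removes the need for a hierarchical detection pipeline.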
5. Conclusions
In this paper, we used Multi-Task Mobilenets (MT-Mobilenets), which has a multi-task classifier. The three softmax regressions that compose the multi-task classifier simultaneously recognize facial behaviors, including head direction, eye closure, and mouth opening, from the shared Mobilenets base. Since MT-Mobilenets was trained with images and facial behavior labels, without face detection and tracking, the facial behaviors could be recognized directly from the images. The driver status, such as distraction, fatigue, and drowsiness, was recognized from the duration of the facial behaviors and PERCLOS. Without face detection and tracking, MT-Mobilenets recognized the driver status at an average of 2.146 FPS on a single Raspberry Pi. However, the number of facial behavior results obtained per second on the single Raspberry Pi was insufficient to recognize the driver status accurately. To solve this problem, we proposed a lightweight driver status recognition system on the SBC based on MT-Mobilenets. For most processed frames, the single Raspberry Pi processed at most one frame per 0.5 s, the minimum unit time of the driver status definition. The proposed system with one additional device processed more than one frame per 0.5 s, and for half of the processed frames it processed more than two frames per 0.5 s. Our experiments showed that the proposed system provided higher performance than the single Raspberry Pi.
The experiments showed that, to achieve reasonable driver status recognition accuracy on an SBC such as the Raspberry Pi, images needed to be processed more than twice within the minimum unit time of the driver status definition. This requirement was satisfied by using one additional device in the proposed Raspberry Pi-based system. The proposed system showed an average accuracy 11.0 percentage points higher than that of the single Raspberry Pi, as well as higher accuracy than Panasonic's aftermarket product released in 2017.
We expect that a more accurate pre-trained facial behavior recognition model would allow the proposed system to recognize the driver status more accurately. We plan to improve the accuracy of driver status recognition in the future by training MT-Mobilenets on a larger driver dataset. While this experiment used Raspberry Pis as the resource sharing devices in the vehicle, we plan to verify the proposed system with real in-vehicle devices, such as the driver's mobile phone.