ICONet: A Lightweight Network with Greater Environmental Adaptivity

With the increasing popularity of artificial intelligence, deep learning has been applied to various fields, especially computer vision. Since artificial intelligence is migrating from the cloud to the edge, deep learning nowadays should be edge-oriented and adaptive to complex environments. Aiming at these goals, this paper proposes ICONet (illumination condition optimized network). Based on the OTSU segmentation algorithm and the fuzzy c-means clustering algorithm, the illumination condition classification subnet increases the environmental adaptivity of our network. The reduced time complexity and optimized size of our convolutional neural network (CNN) model enable the implementation of ICONet on edge devices. In the field of fatigue driving, we test the performance of ICONet on YawDD and self-collected datasets. Our network achieves a general accuracy of 98.56%, and our models are about 590 kilobytes. Compared to other proposed networks, ICONet shows significant success and superiority. Applying ICONet to fatigue driving detection helps address the symmetry between the need for edge-oriented detection under complex illumination conditions and the scarcity of related approaches.


Introduction
Considering that 14-20% of traffic accidents are caused by fatigue driving [1], fatigue driving detection addresses an urgent need and has high research significance. An effective fatigue driving detection approach will significantly reduce the consequent traffic accidents.
At present, there are mainly three types of methods to monitor fatigue driving. The first type of method is based on vehicle parameters. These methods focus on the rotation speed of the steering wheel, the change of the offset angle, and the changing frequency of the pedal [2]. The second type is based on physiological characteristics. These methods distinguish the driver's mental state based on the driver's multiple physiological characteristics [3], including blood pressure, pulse, heart rate, EMG (electromyography) signal, and EEG (electroencephalogram) signal. The involved algorithms include logistic regression, support vector machine, the k-nearest neighbor classifier [4,5], and artificial neural network [6]. The third type is based on computer vision. The driver's driving behavior data is collected by a camera installed in the vehicle. These methods are based on relevant algorithms to comprehensively analyze the video data. They mainly focus on the eyes and mouth of the driver [7][8][9].
There are some noteworthy drawbacks to the first two types of methods. The methods based on vehicle parameters require many installed sensors, which increases the cost. Meanwhile, the sensing and processing lag of the related sensors may prevent a real-time result. The methods based on physiological characteristics require sensors attached to the driver and are therefore intrusive.
The main contributions of this paper are as follows:
1. We propose an illumination condition classification subnet based on the OTSU segmentation algorithm and the fuzzy c-means clustering algorithm. This subnet classifies input pictures into three types: normal daylight, weak daylight, and night.
2. We propose a convolutional neural network subnet based on Haar-like features, the AdaBoost algorithm, and a modified LeNet-5 network model. This subnet focuses on drivers' face extraction and behavior classification.
3. We design the two subnets in ICONet with high modularity. Not limited to fatigue driving detection, ICONet can be applied to other classification problems under various illumination conditions.
The rest of this paper is organized as follows. Section 2 describes the related work. Section 3 introduces the methods used in the ICONet. Section 4 describes the experimental results. Section 5 provides a discussion. Section 6 concludes the whole paper.

Related Work
In this section, we explore the current research on fatigue driving detection in the field of deep learning, with regard to environmental adaptivity and edge orientation.

Deep Learning Approaches Considering Environmental Adaptivity
In the aspect of environmental adaptivity, the interference of changing illumination conditions must be considered. Fatigue driving usually occurs at dusk or at night, when the illumination condition is weak daylight or night.
Ma, Chau et al. [11] presented a convolutional three-stream network architecture, which integrated current-infrared-frame-based spatial information and achieved an accuracy of 94.68%. Hao et al. [12] presented a parallel convolutional neural network (CNN). The proposed method was based on the different detection characteristics of the same image, and the CNN was used to automatically complete the feature learning. It was claimed to be highly robust to complex environments. Villanueva et al. [13] used a deep learning algorithm. The proposed system used images captured by the camera to detect patterns in the driver's facial features (eyes closed, nodding/head tilt, and yawning). A deep neural network named SqueezeNet was used for faster model development and retraining. An alarm was raised when drowsy driving was detected. Garcia et al. [14] presented a non-intrusive approach including three stages. The first stage was face detection and normalization. The second stage performed pupil position detection and characterization. The final stage calculated the percentage of eyelid closure (PERCLOS) based on closed-eye information. Songkroh et al. [15] used both the vehicle speed and the driver's behavior to analyze and determine the risk level of the driver. The proposed system included facial image preprocessing, facial feature detection, feature classification, and an analysis module. The risk alert yielded an accuracy of 86.30% at any vehicle speed. Yu et al. [16] proposed a condition-adaptive representation learning framework, in which spatio-temporal representations and estimated scene conditions were merged to enhance the discriminative power. Spatio-temporal representation learning extracted features that could simultaneously describe motions and appearances. Scene condition understanding classified the scene conditions related to various situations. Memon et al. [17] built a non-intrusive constant-monitoring framework based on OpenCV. Ma et al. [18] presented a two-stream CNN structure focusing on the night situation and achieved an accuracy of 91.57%.
Despite some good results among the methods considering environmental adaptivity, these networks have high computational complexity and cannot be applied to resource-limited edge devices.

Deep Learning Approaches Considering Edge-Orientation
In the aspect of edge-oriented methods, most previous works targeted embedded systems such as the Raspberry Pi and Android devices such as smartphones.
On embedded systems, Gu et al. [19] proposed a convolutional neural network model with multi-scale pooling (MSP-Net) and implemented it on an NVIDIA JETSON TX2 development board. On Raspberry Pi, Sharan et al. [20] proposed a driver fatigue system based on eye states using a convolutional neural network. Ghazal et al. [21] used CNN to perform embedded fatigue detection and achieved an accuracy of 95%. Their proposed approach included video signal spatial processing and deep convolutional neural network classification.
On Android devices including smartphones, Xu et al. [22] presented the Sober-Drive system based on a neural network and achieved an accuracy of 90%. Dasgupta et al. [23] proposed a three-stage drowsiness detection with an accuracy of 93.33%. The three stages included PERCLOS calculation, the voiced to the unvoiced ratio, and touch response, which could generally detect drowsy driving and subsequently raised an alarm. Galarza et al. [24] proposed a surveillance system for real-time driver drowsiness detection and with an accuracy of 93.37%. Jabbar et al. [25] proposed a real-time driver drowsiness detection system. Their saved CNN model was within 75 kilobytes and had an accuracy of 83%.
These edge-oriented methods and systems can effectively perform fatigue driving detection and have been realized on edge devices. However, they are not adaptive to various environmental conditions and have relatively low accuracy. Several seconds of fatigue driving may directly cause a fatal traffic accident. An optimized method must detect fatigue driving with high accuracy under any environmental condition.

Methods
The structure of the illumination condition optimized network is shown in Figure 1. ICONet includes two subnets. The first subnet classifies illumination conditions, and the second subnet classifies related behaviors based on a modified LeNet-5 CNN model. The final stage of ICONet is a comprehensive judgement based on the output results of the two subnets. The whole network is designed for greater environmental adaptivity and implementation on edge devices. For the first goal, our idea focuses on an effective classification. An optimized network is expected to accurately classify images under different illumination conditions and then call the corresponding pre-trained model. Each illumination condition type corresponds to a CNN model, which correlates with the symmetry concept. Instead of involving image correction, directly calling the pre-trained models under various conditions reduces the real-time computing load, which is suitable for edge devices. For the second goal, considering the limited resources on edge devices, we optimize the CNN network structure, remove unnecessary layers, and reduce the convolutional kernel size. The network parameters are carefully adjusted to guarantee high accuracy when applied to fatigue driving detection.
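The dispatch idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the model file paths and the function name `select_model_path` are hypothetical, standing in for the three pre-trained per-condition CNN models.

```python
# Hypothetical model paths -- one pre-trained CNN model per
# illumination condition type, mirroring the per-condition
# symmetry described above (paths are illustrative only).
MODEL_PATHS = {
    "normal_daylight": "models/eye_mouth_day.h5",
    "weak_daylight":   "models/eye_mouth_dusk.h5",
    "night":           "models/eye_mouth_night.h5",
}

def select_model_path(condition):
    """Dispatch to the pre-trained model matching the classified
    illumination condition, instead of correcting the image."""
    if condition not in MODEL_PATHS:
        raise ValueError(f"unknown illumination condition: {condition}")
    return MODEL_PATHS[condition]
```

Because the heavy per-condition training happens offline, the on-device work at inference time reduces to this constant-time lookup plus a single forward pass.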

Illumination Condition Classification Subnet
This subnet is based on the OTSU segmentation algorithm and the fuzzy c-means clustering algorithm.

OTSU Segmentation Algorithm
The OTSU segmentation algorithm [26] is used to determine the image binary segmentation threshold value. Based on the one-dimensional histogram of the gray image, it selects the best threshold value and uses this threshold value to divide the entire image into two parts, including the target and the background. The optimal threshold value can maximize the variance between the two parts of the image. The detail of the OTSU segmentation algorithm is described as follows.
Consider a threshold gray value Thre that splits the pixels of an image into two groups: C1 (pixels with gray value < Thre) and C2 (pixels with gray value ≥ Thre). The average gray values of the C1 and C2 pixels are m1 and m2, respectively. The average global gray value of the image is mG. The probabilities of a pixel being classified into C1 or C2 are

p1 = (1/n) Σ_{i=0}^{Thre−1} n_i,  p2 = 1 − p1,

where n_i is the number of pixels with gray value i and n is the total number of pixels.
The variance between these two classes is

σ² = p1 (m1 − mG)² + p2 (m2 − mG)².

By substituting mG = p1 m1 + p2 m2, we have

σ² = p1 p2 (m1 − m2)².

The optimized threshold value obtains the maximum variance σ² among all 256 gray values. The whole process of the OTSU segmentation algorithm is described in Algorithm 1.

Algorithm 1 OTSU segmentation algorithm
1. σ²_max ← 0
2. for Thre = 0 to 255 do
3.   Calculate p1, p2, m1, m2, mG
4.   Calculate σ²(Thre) = p1 p2 (m1 − m2)²
5.   if σ²(Thre) > σ²_max then
6.     σ²_max ← σ²(Thre); Thre* ← Thre
7.   end if
8. end for
9. return Thre*, σ²_max
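Algorithm 1 can be sketched directly from the formulas above. The following is a minimal NumPy sketch (the function name `otsu_threshold` is our own); it exhaustively evaluates σ² = p1·p2·(m1 − m2)² for every candidate threshold.

```python
import numpy as np

def otsu_threshold(gray):
    """Search the gray value that maximizes the between-class
    variance sigma^2 = p1 * p2 * (m1 - m2)^2 (Algorithm 1)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()                  # gray-level probabilities
    m_g = np.dot(np.arange(256), prob)        # global mean gray value
    best_thre, best_var = 0, 0.0
    for thre in range(1, 256):
        p1 = prob[:thre].sum()                # P(pixel in C1)
        p2 = 1.0 - p1
        if p1 == 0 or p2 == 0:
            continue                          # one class is empty
        m1 = np.dot(np.arange(thre), prob[:thre]) / p1
        m2 = (m_g - p1 * m1) / p2             # from m_g = p1*m1 + p2*m2
        var = p1 * p2 * (m1 - m2) ** 2
        if var > best_var:
            best_var, best_thre = var, thre
    return best_thre, best_var
```

On a strongly bimodal image the returned threshold falls between the two modes, which is the behavior the illumination classification subnet relies on.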

Fuzzy C-Means Clustering Algorithm
The fuzzy c-means clustering algorithm [27] introduces fuzzy theory to provide more flexible clustering results than normal hard clustering. The algorithm assigns a weight to each pair of object and cluster, indicating the degree to which the object belongs to the cluster. Pixels in the image are divided into many disjoint sets based on the characteristic distance of each pixel, which reflects the similarity among all the pixels. Consider X = {x_1, ..., x_n} as a set of n objects, and V = {v_1, ..., v_k} as the set of centers of the k clusters. U is a k × n partition matrix, where u_ij is the membership degree of a sample x_j to the cluster center v_i. All cluster centers and the memberships of the pixels can be obtained by minimizing the target function

J(U, V) = Σ_{i=1}^{k} Σ_{j=1}^{n} u_ij^m ||x_j − v_i||²,

where n represents the number of samples, k represents the number of clusters, and the fuzzy weight index m ∈ (1, +∞); m is usually set to 2.
The membership degrees are initially randomly assigned. The cluster centers and memberships in each following iteration are calculated based on Equations (13) and (14):

v_i = Σ_{j=1}^{n} u_ij^m x_j / Σ_{j=1}^{n} u_ij^m, (13)

u_ij = 1 / Σ_{c=1}^{k} ( ||x_j − v_i|| / ||x_j − v_c|| )^{2/(m−1)}. (14)

The iterative process stops if and only if

max_{i,j} | u_ij^{(t+1)} − u_ij^{(t)} | < ε,

where ε > 0 is preassigned. The whole process of fuzzy c-means clustering is described in Algorithm 2.
Algorithm 2 Fuzzy c-means clustering algorithm
1. Initialize the membership matrix U randomly
2. repeat
3.   Update the cluster centers by Equation (13)
4.   Update the membership degrees by Equation (14)
5. until the maximum change of U is smaller than ε
6. return U, V

After performing a large number of tests, we observed that the OTSU threshold value, the ratio of the OTSU threshold to the average gray value, and the minimum of the target function reflect the illumination condition of an image.
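Algorithm 2 can be sketched as the alternating updates of Equations (13) and (14). This is a minimal NumPy sketch under our own naming (`fuzzy_c_means`), not the paper's code; it also returns the final value of the target function, which the subnet uses as an illumination feature.

```python
import numpy as np

def fuzzy_c_means(X, k, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Alternate center updates (Eq. 13) and membership updates
    (Eq. 14) until U changes by less than eps (Algorithm 2)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((k, n))
    U /= U.sum(axis=0)                        # memberships sum to 1 per sample
    for _ in range(max_iter):
        um = U ** m
        V = um @ X / um.sum(axis=1, keepdims=True)          # Eq. (13)
        # distances d[i, j] = ||x_j - v_i|| (tiny offset avoids /0)
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) + 1e-12
        U_new = 1.0 / (d ** (2.0 / (m - 1.0)))              # Eq. (14)
        U_new /= U_new.sum(axis=0)
        converged = np.abs(U_new - U).max() < eps
        U = U_new
        if converged:
            break
    J = (U ** m * d ** 2).sum()               # minimum of the target function
    return V, U, J
```

For well-separated data the recovered centers land on the underlying groups, and J is small; for a poorly lit, low-contrast image the attainable minimum of J behaves differently, which is what makes it usable as a classification feature.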

Convolutional Neural Network Classification Subnet
When applied in the field of fatigue driving detection, this subnet involves face detection and extraction of the region of interest at the beginning. Face detection in our model is based on Haar-like features and the AdaBoost algorithm [28]. The Haar-like feature is a simple rectangular feature in the face detection system, defined as the difference of the pixel gray-value sums in adjacent rectangular areas of an image. The rectangular feature can reflect the gray changes of the local features of the detected object. The introduction of integral images accelerates the feature acquisition speed of the detector. The basic idea of the AdaBoost algorithm is to combine many weak classifiers into a strong classifier with strong classification ability. Then, several strong classifiers are connected in series to complete image retrieval.
We modified the traditional LeNet-5 model, and the structure of the proposed CNN is shown in Figure 2. For the input image, our model uses thirty-two convolution kernels of 5 × 5 size, one pooling layer of 2 × 2 size, sixteen convolution kernels of 5 × 5 size, one pooling layer of 2 × 2 size, and three fully connected layers.

There is a SoftMax layer after the last fully connected layer. The SoftMax function is defined in Equation (16):

s_i = e^{z_i} / Σ_j e^{z_j}, (16)

where z_i is the i-th output of the last fully connected layer.
The loss function based on SoftMax cross-entropy is defined in Equation (17):

L = − Σ_i y_i log(s_i), (17)
where y_i represents the label, and s_i represents the predicted probability. We performed regularization in the loss function to improve the generalization ability of the model and avoid over-fitting problems [29]. Indicators that reflect the complexity of the model are added to the loss function. If the loss function describing the performance of the model on the training data is J(θ), then the optimized target function is J(θ) + λR(w), where R(w) represents the complexity of the model and λ represents the proportion of the model's complexity loss in the total loss. θ represents all the parameters in the neural network, including the weights w and the bias terms b. The L2-norm regularization formula [30] used in this paper is shown in Equation (18):

R(w) = ||w||²₂ = Σ_i w_i². (18)
The time complexity (in floating-point operations, FLOPs) of the convolutional layers in a CNN model is defined in Equation (19):

Time ∼ O( Σ_{l=1}^{D} M_l² · K_l² · C_{l−1} · C_l ), (19)

where D is the depth of the network; l indexes the l-th convolution layer; M_l is the side length of the output feature map of each convolution kernel; K_l is the size of the convolution kernel; and C_l is the number of output channels of the l-th convolution layer.
As for a fully connected layer, consider that the dimension of the input data is (N, D), the weight dimension of the hidden layer is (D, out), and the dimension of the output data is (N, out). Then, the time complexity (FLOPs) of a fully connected layer in a CNN model is defined in Equation (20):

FLOPs = 2 × N × D × out, (20)

counting one multiplication and one addition per weight. We compare the time complexity of ICONet with other approaches in Section 4.3.2.
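As a worked illustration of Equations (19) and (20), the sketch below counts the operations of single layers. The helper names `conv_flops` and `fc_flops`, and the sample layer sizes used in the usage comments (a 24 × 24 output feature map, assuming "same" padding on the 24 × 24 input), are our own assumptions for illustration, not figures from the paper's Table 4.

```python
def conv_flops(m_out, k, c_in, c_out):
    """Per-layer term of Eq. (19): M^2 * K^2 * C_{l-1} * C_l
    multiplications over the output feature map."""
    return m_out ** 2 * k ** 2 * c_in * c_out

def fc_flops(n, d, out):
    """Eq. (20): one multiply and one add per weight, for a batch
    of n input vectors of dimension d mapped to `out` units."""
    return 2 * n * d * out

# Illustrative use (assumed sizes): first conv layer of the modified
# LeNet-5 -- 24x24 grayscale input, 32 kernels of size 5x5.
first_conv = conv_flops(24, 5, 1, 32)   # 460,800 multiplications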

Comprehensive Judgement
Applied to the field of fatigue driving detection, the comprehensive judgement stage in ICONet performs fatigue judgement. Fatigue is closely related to the frequencies of eye closure and yawning.
PERCLOS [31] calculates the ratio of eye-closure frames within a certain period and then infers the driver's eye closure frequency. PERCLOS can be calculated by

f_PERCLOS = (number of closed-eye frames) / (total number of frames in the period).

Similar to PERCLOS, FOM (frequency of mouth) [32] calculates the ratio of open-mouth frames within a certain period and then infers the driver's yawn frequency. FOM is calculated by

f_FOM = (number of open-mouth frames) / (total number of frames in the period).
For an input video, a certain frequency range is required for accurately calculating the fatigue parameters. After a certain frame is detected by ICONet, according to the first-in-first-out principle, the latest result is added to the first place of the queue and the last value of the queue is removed. The total number of frames in the queue remains constant. In Figure 3, "1" represents a closed mouth or eye, and "0" represents an opened mouth or eye. Combining the frequency queue with the PERCLOS and FOM parameters, we can judge whether a driver is fatigued.
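The fixed-length FIFO queue described above can be sketched as follows. The class name `FatigueQueue` is ours; the flag convention follows Figure 3, with 1 marking the fatigue-relevant state (closed eyes for PERCLOS, open mouth for FOM).

```python
from collections import deque

class FatigueQueue:
    """Fixed-length FIFO of per-frame detection results; the ratio
    of positive frames in the window gives PERCLOS (eye queue) or
    FOM (mouth queue)."""
    def __init__(self, size=100):
        # pre-filled with zeros so the window length is constant
        self.frames = deque([0] * size, maxlen=size)

    def push(self, flag):
        # newest result enters the first place; with maxlen set,
        # the oldest value falls off the other end automatically
        self.frames.appendleft(flag)

    def ratio(self):
        return sum(self.frames) / len(self.frames)
```

One such queue per cue (eyes, mouth) is enough; each new frame costs O(1), which matters on an edge device processing 30 frames per second.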

Experimental Results
All experiments were conducted on a computer with Intel(R) Core™ i7-10750H CPU @2.6GHz, 16.0GB RAM, NVIDIA GeForce RTX 2060, and Windows 10. The algorithms were developed in Python 3.6 via OpenCV 3.3.1 and TensorFlow 1.13.1.

YawDD Dataset
The YawDD dataset [33] includes two video sets of drivers' in-car behaviors, involving fatigue driving behaviors. The drivers include males and females, with and without glasses, and from different races. In the 322 videos of the first video set, the camera is installed under the front mirror of the car. In the 29 videos of the second video set, the camera is installed on the dashboard.


CEW (Closed Eyes in the Wild) Dataset
The CEW dataset [34] is about eye-state detection under normal daylight. It includes 2423 subjects, 1192 of whom have their eyes closed. In total, there are 2462 images with open eyes and 2384 images with closed eyes.

Self-Collected Dataset
Our network focuses on environments with complex illumination conditions; however, the YawDD and CEW datasets cover driving behaviors only under normal daylight. There are currently no open-access datasets focusing on environments with different illumination conditions. Thus, we collected our own dataset with an infrared camera, including fatigue driving behaviors under weak daylight and at night.
The self-collected data includes 15 drivers. Each driver has 1 or 2 videos, depending on whether they wear glasses. Normal driving and yawning behaviors are involved in each video. The videos are captured at 30 frames per second with a frame resolution of 1920 × 1080.

Illumination Condition Classification Subnet
We have mentioned in Section 3.1 that four threshold values can represent the illumination condition of a picture: the OTSU threshold value, the average gray value, the ratio of the OTSU threshold value to the average gray value, and the minimum of the target function in fuzzy c-means clustering. We classified the illumination condition types of pictures into normal daylight, weak daylight, and night. Based on YawDD and the self-collected dataset, we captured the videos according to a certain frame rate. We randomly picked 70% of the pictures to calculate the threshold ranges. The remaining 30% were used to verify the accuracy. The distribution of the related parameters in the set of pictures used for calculation is shown in Figure 4.

To test the accuracy of the method, we performed it on the test datasets mentioned in Section 4.1. The results are shown in Table 1. According to Table 1, the proposed illumination condition classification subnet achieves a general accuracy of 98.31% when classifying the various illumination conditions. This subnet works as a pre-classification stage and leads to a well-directed behavior classification in the next subnet.

Experimental Results
In the CNN classification subnet, we first performed face detection and region of interest extraction based on the AdaBoost algorithm and Haar-like features. The process is shown in Figure 5. In the classification process, the related CNN models are pre-trained. We used the CEW and self-collected datasets to train the eye and mouth models based on the proposed CNN model. The ratio of the training set to the testing set was 70% to 30%.

In the training process, our model first resizes the input picture to 24 × 24, then optimizes the model based on the stochastic gradient descent (SGD) method and updates the parameters of the neural network. The batch size of the neural network is 120 and the learning rate is 0.001. Based on the CEW dataset, we trained the eye models under different illumination conditions. The mouth models were trained based on the YawDD and self-collected datasets. The models' loss and accuracy during training and testing are shown in Figures 6 and 7.

Besides, we mixed the datasets under different illumination conditions and trained a hybrid model. The involvement of this hybrid model works as an ablation study. We expected our subnet to achieve higher accuracy than the hybrid model, which would demonstrate the necessity and effectiveness of the illumination condition classification subnet.
Under the circumstances of inputting pictures captured from a video, we compared the accuracy of this hybrid model with the models involving illumination condition classification. The result is shown in Table 2. According to Table 2, the classification of illumination conditions yields a 5.23% accuracy advantage. In other words, the involvement of the first subnet guarantees at least one more correct behavior classification in every twenty detections. It should be noted that an accident may occur after only several seconds of fatigue driving. A more accurate driving behavior classification will result in an earlier notification if the proposed model is implemented on vehicular devices, which will reduce the possibility of a tragic accident.

Comparison with Other Approaches
In addition to the ablation study, we performed comparisons between the proposed ICONet and other approaches. Since the training of all the models involves the self-collected dataset, we first performed a comparison of the eye models on the public CEW dataset, in order to strengthen the persuasiveness and demonstrate the superior ability of our proposed network. As introduced in Section 4.1.2, the CEW dataset only includes eye images under normal daylight. The comparative results are shown in Figure 8 and Table 3.

Table 3. Accuracy comparison of testing on the CEW dataset.

Author or Network    Year    Accuracy
Sharma [35]          2018    97.80%
Sharan [20]          2019    96.56%
ICONet               2020    98.30%

We reproduced the networks proposed by Sharma [35] and Sharan [20]. According to Figure 8 and Table 3, our network earns a superior accuracy of at least 0.5%. The results indicate that, regardless of the illumination condition classification, simply utilizing the second subnet of our network already obtains a better result. This proves that our improvements to the traditional LeNet-5 framework are effective and essential.

In the literature [12,13], the authors noted that their networks could be applied under different illumination conditions. To prove the superiority of the proposed ICONet, we reproduced several models from these two works and compared them with ICONet. The results are shown in Figure 9.
Figure 9 compares the change of the test and training accuracies among the three networks. It can be observed that the previous approaches have a high training cost, while ICONet requires approximately 100 steps to reach a stable accuracy. When comparing the final stable accuracy, the results demonstrate that although the previous approaches can be utilized under various illumination conditions, they are not designed with environmental adaptivity superior to that of ICONet.
Verifying the proposed network with high accuracy on a computer or a server is not the final step. Instead, resource-limited vehicular devices are closer to the driver in a real scenario. Aiming at network implementation on edge devices, we compare the model size and time complexity (FLOPs) of ICONet with other approaches. A network with lower time complexity and a smaller model size performs better on edge devices, comparatively without serious lag. The time complexity (FLOPs) is calculated based on Equations (19) and (20), and the results are shown in Table 4. According to Table 4, ICONet reduces convolutional-layer time complexity by at least 17.04% and fully-connected-layer time complexity by at least 84.13%. Working as a lightweight network, ICONet can be loaded on edge devices.
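Equations (19) and (20) are not reproduced in this excerpt; the sketch below uses the standard FLOPs counts for convolutional and fully connected layers, which they most plausibly correspond to. The layer sizes in the example are illustrative LeNet-5-style values, not ICONet's actual configuration.

```python
def conv_flops(h_out, w_out, c_in, c_out, k):
    """FLOPs of one convolutional layer: output positions times
    kernel area times input channels times output channels."""
    return h_out * w_out * k * k * c_in * c_out

def fc_flops(n_in, n_out):
    """FLOPs of one fully connected layer: one multiply per weight."""
    return n_in * n_out

# Example: a 5x5 convolution with 6 output channels producing a 24x24
# feature map from a single-channel input, and a 120 -> 84 dense layer.
print(conv_flops(24, 24, 1, 6, 5))  # 86400
print(fc_flops(120, 84))            # 10080
```

Summing these per-layer counts over a network gives the totals compared in Table 4.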

Comprehensive Judgement
The illumination condition classification subnet and the convolutional neural network subnet can only provide the classification results of mouth and eyes behavior under various illuminations. The comprehensive judgement stage is designed to combine these classification results and determine whether a driver is in fatigue driving. This determination is based on the PERCLOS and FOM threshold values. The process is shown in Figure 10.

In the process of fatigue driving detection, focusing on only one characteristic is inaccurate. For example, a driver may blink with high frequency or close the eyes for a long time under intense illumination. Our approach combines the characteristics of both the eyes and the mouth. Figure 11 shows the eye and mouth results during yawning, where "1" represents eye or mouth closing and "0" represents eye or mouth opening. From the 30th to the 50th captured frame, the driver can be considered to be fatigue driving. According to related literature [23] and our experiments, we set the threshold values as f_PERCLOS_Threshold = 0.25 and f_FOM_Threshold = 0.2. A driver can be considered fatigue driving if f_PERCLOS ≥ 0.25 or f_FOM ≥ 0.2, i.e., in 100 continuously captured frames, there are at least 25 frames with closed eyes or at least 20 frames with an open mouth. Based on these threshold values, we performed tests on the YawDD and self-collected datasets.
The results are shown in Figure 12. Figure 12 demonstrates that, given the high accuracy of the previous two subnets, ICONet can effectively judge fatigue driving.
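The threshold rule above can be sketched as a small window-based check. The function name, the sliding-window handling, and the input convention (here, 1 means eye closed or mouth open per frame) are illustrative assumptions rather than the paper's implementation.

```python
PERCLOS_THRESHOLD = 0.25  # min fraction of closed-eye frames
FOM_THRESHOLD = 0.2       # min fraction of open-mouth frames
WINDOW = 100              # continuously captured frames per judgement

def is_fatigued(eye_closed, mouth_open):
    """Comprehensive judgement over the most recent WINDOW frames.

    eye_closed, mouth_open: per-frame 0/1 flags as produced by the
    CNN classification subnet (1 = eye closed / mouth open).
    """
    eyes = eye_closed[-WINDOW:]
    mouth = mouth_open[-WINDOW:]
    perclos = sum(eyes) / len(eyes)
    fom = sum(mouth) / len(mouth)
    return perclos >= PERCLOS_THRESHOLD or fom >= FOM_THRESHOLD

# 25 closed-eye frames out of 100 already meets the PERCLOS criterion.
print(is_fatigued([1] * 25 + [0] * 75, [0] * 100))  # True
```

Either criterion alone triggers a fatigue judgement, matching the "f_PERCLOS ≥ 0.25 or f_FOM ≥ 0.2" rule stated above.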

Discussion
In the field of deep learning-based fatigue driving detection, previous works have achieved some significant success. However, with the migration of artificial intelligence from the cloud to the edge, fatigue driving detection is required to be effectively loaded on edge devices and have high accuracy under various environments. Most previous works mainly focus on one aspect and perform undesirably when being applied to other aspects.
Aiming at a greater environmental adaptivity, some pieces of literature design complicated network structures [12,13], while other approaches attempt to obtain a desirable result based on a large training set. The proposed illumination condition classification subnet in this paper is based on traditional image processing algorithms. The experimental results prove the effectiveness of our framework.
Aiming at implementation on edge devices, we modified the LeNet-5 model, one of the most lightweight CNN frameworks. The convolutional layers and kernels are optimized to obtain a lower time complexity. The experimental results demonstrate the comparative compactness of ICONet.
In addition to the presented results, it should be noted that ICONet is designed to be a universal network, working as a general solution to classification problems under various illumination condition environments. Besides fatigue driving detection, it has the potential to be applied to other fields, including but not limited to traffic classification, human activity classification, and classification problems in medical science and agriculture.
In the illumination condition classification subnet, we focused on natural illumination conditions and classified the input pictures into normal daylight, weak daylight, and night. However, complex illumination conditions may also involve different luminances and locations of the light source. Especially at night, sudden intense light may affect the classification result. Under these factors, the subnet may fail to classify well. In future work, we will focus on the classification of unnatural illumination conditions.
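For context, the subnet's three-way decision can be caricatured as a cut-off on the mean gray level of a frame. The fixed cut-off values below are loud assumptions standing in for the paper's OTSU segmentation and fuzzy c-means clustering stages, chosen only to illustrate the three output classes.

```python
import numpy as np

# Illustrative cut-offs on mean gray level (0-255); these are assumptions,
# not values derived from the paper's OTSU / fuzzy c-means pipeline.
WEAK_DAYLIGHT_CUTOFF = 60
NORMAL_DAYLIGHT_CUTOFF = 120

def classify_illumination(gray_frame):
    """Classify a grayscale frame into night / weak daylight / normal daylight."""
    mean_level = float(np.mean(gray_frame))
    if mean_level < WEAK_DAYLIGHT_CUTOFF:
        return "night"
    if mean_level < NORMAL_DAYLIGHT_CUTOFF:
        return "weak daylight"
    return "normal daylight"

print(classify_illumination(np.full((480, 640), 30)))   # night
print(classify_illumination(np.full((480, 640), 200)))  # normal daylight
```

A fixed cut-off like this is exactly what fails under sudden unnatural lighting, which motivates the clustering-based approach and the future work described above.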
In the convolutional neural network classification subnet, the model focuses on a single person, the driver. However, more than one person may be present in real scenarios, including the copilot, passengers, and people outside the vehicle. The subnet may fail to correctly detect the driver's face under these circumstances. We will incorporate additional image processing algorithms in our future work.
Despite the mentioned limitations, ICONet provides a reference for designing a multi-subnet framework. With the sharp increase of edge devices, ICONet works as an attempt towards the future development of edge-oriented deep learning.

Conclusions
Artificial intelligence is migrating from cloud to edge, and deep learning is required to be edge-oriented and adaptive to complex environments. In this paper, we proposed an illumination condition optimized network (ICONet) and applied it to fatigue driving detection. Based on the OTSU segmentation algorithm and the fuzzy c-means clustering algorithm, the illumination condition classification subnet classifies pictures taken under normal daylight, weak daylight, and night. After face detection and extraction of the region of interest, the CNN classification subnet provides the classification results of the eyes and mouth based on the modified LeNet-5 model. According to indicators including PERCLOS and FOM, ICONet can comprehensively judge fatigue driving. ICONet achieves a general accuracy of 98.56%, and its time complexity is reduced by at least 17.04% compared to previous work. The size of all the CNN models is about 590 kilobytes. Experimental results demonstrate the feasibility of applying ICONet on edge devices under various illumination conditions in fatigue driving detection.
In our future work, besides solving the mentioned limitations, we will transplant the ICONet to the Android platform and test it on onboard devices. Additionally, we will add more driving behaviors and further optimize our model to improve its environmental adaptivity.
Author Contributions: Conceptualization, methodology, software, validation, formal analysis, investigation, data curation, visualization, writing-original draft preparation, Y.H. and Z.F.; resources, writing-review and editing, supervision, project administration, funding acquisition, W.H. and Y.L. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.