Deep Learning-Based Gaze Detection System for Automobile Drivers Using a NIR Camera Sensor

A paradigm shift is required to prevent the increasing number of automobile accident deaths, which are mostly due to the inattentive behavior of drivers. Knowledge of the gaze region can provide valuable information regarding a driver's point of attention. Accurate and inexpensive gaze classification systems in cars can improve driving safety. However, monitoring real-time driving behaviors and conditions presents some challenges: dizziness due to long drives, extreme lighting variations, reflections on glasses, and occlusions. Past studies on gaze detection in cars have been chiefly based on head movements. The margin of error in gaze detection increases when drivers gaze at objects by moving their eyes without moving their heads. To solve this problem, a pupil center corneal reflection (PCCR)-based method has been considered. However, the error in detecting the pupil center and corneal reflection center increases in a car environment due to changes in ambient light, reflections on the surface of glasses, and motion and optical blurring of the captured eye image. In addition, existing PCCR-based methods require initial user calibration, which is difficult to perform in a car environment. To address this issue, we propose a deep learning-based gaze detection method using a near-infrared (NIR) camera sensor that considers driver head and eye movement and does not require any initial user calibration. The proposed system is evaluated on our self-constructed database as well as on the open Columbia gaze dataset (CAVE-DB). The proposed method demonstrated greater accuracy than previous gaze classification methods.


Introduction
Traffic accidents are mainly caused by a diminished driver vigilance level and gaze distraction from the road [1,2]. Driver distraction is the main source of attention divergence from the roadway and can pose serious dangers to the lives of drivers, passengers, and pedestrians. According to the United States Department of Transportation, 3179 people were killed and 431,000 injured in 2014 due to distracted drivers [3]. Any activity that diverts driver attention from the primary task of driving can lead to distracted driving. It can happen for many reasons, but the most common are using a smartphone, controlling the radio, eating and drinking, and operating a global positioning system (GPS). According to the National Highway Traffic Safety Administration (NHTSA), the risk of auto wrecks increases threefold when drivers use their smartphones while driving [4]. Using a smartphone causes the longest periods of drivers taking their eyes off the road (EOR). In short, smartphone use can be a cause of driver distraction, and driver gaze detection technology can play a pivotal role in helping to avoid auto accidents. The classification of driver gaze attention is an area of increasing relevance in the pursuit of accident reduction.
Current road safety measures are approaching a level of maturity with the passage of time. One of the major contributions to this is the development of advanced driver assistance systems (ADAS) that can monitor driver attention and send alerts to improve road safety and avoid unsafe driving. Real-time estimation of driver gaze could be coupled with an alerting system to enhance the effectiveness of the ADAS [5]. However, these real-time systems are faced with many challenges for obtaining reliable EOR estimation and classification of the gaze zones. Some significant challenges include: varying illumination conditions; considerable variation in pupil and corneal reflection (CR) due to driver head position and eye movements; variations in physical features that may differ due to gender, skin color, ethnicity, and age; providing consistent accuracy for people wearing glasses or contact lenses; and designing a system for a calibration-free environment. Some of the previous studies have proven to be good under specific conditions, but they have limitations in actual car environments.
To overcome the limitations of previous systems and address the above-mentioned challenges, we propose a near-infrared (NIR) camera sensor-based gaze classification system for car environments using a convolutional neural network (CNN). It is an important issue as this research area has many applications. The proposed system can be used for reliable EOR estimation and ADAS. It uses state-of-the-art deep-learning techniques to solve gaze tracking in an unconstrained environment.
The remainder of this paper is organized as follows. In Section 2, we discuss in detail the previous studies on gaze detection. In Section 3, the contributions of our research are explained. Our proposed method and an overview of its working methodology are explained in Section 4. The experimental setup and results are presented in Section 5. Section 6 presents our conclusions and discusses some ideas for future work.

Related Works
Several studies have been conducted on gaze classification systems [6][7][8][9]. Gaze classification can be broadly categorized into indoor desktop environments and outdoor vehicle environments. The former can be further divided into wearable device-based methods and non-wearable device-based methods. Wearable device-based methods use a camera and illuminator mounted on the subject's head in the form of a helmet or a pair of glasses [10][11][12][13][14][15]. In [13,14], a mouse and a wheelchair are controlled by a head-mounted wearable eye-tracking system. Galante et al. proposed a gaze-based interaction system that lets patients with cerebral palsy use communication boards on a 2D display; it uses a head-mounted device with two cameras for eye tracking and frontal viewing [15]. In wearable systems, the problem of absolute head position is easily avoided because the wearable device moves along with the head. However, wearing the devices for long periods of time is inconvenient for the user. To address this issue, non-wearable device-based methods use remote gaze-tracking devices such as cameras and illuminators to acquire face or eye images for gaze tracking [16][17][18][19][20]. Su et al. proposed a gaze-tracking system based on a visible light web camera. In this system, a face is detected on the basis of skin color, luminance, chrominance, and edges, and the eyes are tracked to control the mouse [16]. A gaze-tracking system for controlling applications such as spelling programs or games was proposed by Magee et al. [17]. A remote gaze detection method that uses wide- and narrow-view cameras as an interface for smart TVs was proposed by Lee et al. [18]. In addition, a typical example of the non-wearable eye-tracking method is the PCCR-based method [19,20]. One major advantage of PCCR-based methods is that they require no complicated geometrical knowledge about lighting, monitors, cameras, or eyes.
The user convenience of non-wearable device-based methods is higher than that of wearable gaze-tracking methods, but initial user calibration or camera calibration is required to map the camera, monitor, and user's eye coordinates. In addition, these studies have focused only on indoor desktop environments with small monitors. In this study, we analyze the applicability of PCCR-based methods in a vehicle environment, in which the user's head rotation is larger than in desktop monitor environments and initial user calibration is difficult to perform.
The second category includes outdoor vehicle environments to classify the driver's gaze position and their behavior while driving. Rough gaze position based on driver head orientation is usually acceptable in driver behavior analyzing systems. Gaze zone estimators are being used to generate the probability of driver attention position. Outdoor vehicle environments for gaze classification can be further divided into two categories: multiple camera-based methods and single camera-based methods.
In past research, multiple camera-based methods were mostly used for the outdoor vehicle environment [21][22][23][24]. When dealing with more peripheral gaze directions or large gaze coverage, multiple cameras may be the most suitable solution. Ahlstrom et al. [21] used multi-camera eye trackers in the car environment. They installed two hidden cameras, one at the A-pillar and one behind the center console of the car, to cover forward-facing eye gazes, and investigated the usefulness of a real-time distraction detection algorithm named AttenD. The performance, reliability, and accuracy of AttenD are directly dependent on eye-tracking quality. In addition, they defined the field relevant for driving (FRD) excluding the right-side mirror, although drivers often gaze at the right-side mirror while driving.
Liang et al. detected driver distraction by using eye motion data in a support vector machine (SVM) model [22]. They compared it with a logistic regression model and found that the SVM model performed better in identifying distraction. However, wearing glasses or eye make-up can adversely affect the accuracy of this system. An initial calibration of 5 to 15 min is also required, which can be time-consuming and annoying for drivers. The concept of a distributed camera framework for gaze estimation was introduced by Tawari et al. [23]. They tracked facial landmarks and performed correspondence matching in 3D face images. A random forest classifier in combination with the proposed feature set was used for zone estimation. Since a visible light camera is used instead of a NIR camera, the method is greatly influenced by external light conditions. Although the accuracy of the driver's gaze estimation was high, only eight frontal gaze regions were considered. Later they proposed [24] that head pose for gaze classification can be estimated effectively from facial landmarks and their 3D correspondences by using the pose from orthography and scaling (POS) algorithm [25]. They subsequently used a constrained local model (CLM) to extract and analyze the head pose and its dynamics in the multi-camera system [26]. Although multiple camera-based methods show high gaze estimation accuracy, the processing time is increased by the images from multiple cameras.
Considering this issue, single camera-based methods have been researched [27][28][29][30][31]. Lee et al. used an SVM to estimate driver gaze zones from their pitch and yaw cues [27]. The camera resolution and illuminator power of this system [27] are low, so the driver's pupil center cannot be detected. Therefore, they estimated the driver's gaze position only by measuring head rotation (not eye rotation), and obtained the experimental data by instructing drivers to rotate their heads intentionally and sufficiently. If a driver only moves the eyes to gaze at some position without head rotation (which is often the case while driving), their method cannot detect the driver's gaze position. Vicente et al. [28] proposed a supervised descent method (SDM) using a scale invariant feature transform (SIFT) descriptor to express face shape, which provides a representation that is robust to illumination. For eye pose estimation, facial feature landmarks are used for eye alignment to locate the eye region. An advantage is that the method is not significantly affected by changes in camera position. However, the driver's wide head rotation and the use of thick glasses can decrease its performance. In addition, the accuracy of the gaze tracker may be limited because the pupil center position in the driver's image can be confused with the iris center position in daylight when a NIR illuminator is not used [32]. In [29,30], skin regions were detected based on a pre-trained skin color predicate using the method of Kjeldsen et al. [33]. In the detected image, the eye region is classified as a non-skin region. If a non-skin region is detected above the lip corners, it becomes the most probable eye region. A small window is searched within the determined eye region to locate the pupil as the position with the lowest pixel value, and the eye is tracked using the optical flow algorithm of [34]. Finally, assuming that the eyes are aligned, they estimated the driver's gaze by modeling head movements using the position of both eyes, the back of the head just behind both eyes, and the center of the back of the head. In this approach, it is not necessary to measure the distance from the driver's head to the camera, but this is because only the direction of the eyes is detected. When the driver's head and eyes rotate in opposite directions, or when the head does not rotate and only the eyes move, tracking accuracy decreases. In addition, since the iris center position is detected instead of the pupil center position, there is a limit to how far the accuracy of eye tracking can be improved.
Fridman et al. combined the histogram of oriented gradients (HOG) with a linear SVM classifier to find the face region, and classified feature vectors into gaze zones with a random forest classifier [31]. Other previous studies use Purkinje images [35] or detect facial feature points [36] to estimate gaze. Purkinje images (PI) are the light reflections generated on the cornea and crystalline lens of the human eye [37]. By analyzing the movements of these reflections (especially the 1st and 4th PI), it is possible to identify the direction of eye rotation and determine gaze. However, the study in [35] did not evaluate gaze detection accuracy in a vehicular environment. Fridman et al. [36] used facial feature points to find the iris and binarized it to estimate the area being gazed at. However, their accuracy was not high, because the iris could be detected in only 61.6% of the test images. There were also limits to enhancing the accuracy of gaze detection because the center of the iris, and not the pupil, was detected. Choi et al. [38] detected driver faces with a Haar feature face detector and used a CNN to categorize gaze zones, but they considered only eight gaze regions. Vora et al. proposed a CNN-based method of driver gaze estimation, but only a small number of gaze regions (six) was considered in this research [39]. Therefore, the detailed gaze position of the driver cannot be detected. Fu et al. proposed an automatic calibration method for the driver's head orientation using a single camera [40]. However, their calibration method requires the driver to gaze at several positions, such as the side mirrors, the rear-view mirror, the instrument board, and different zones in the windshield, as calibration points, which inconveniences the driver in an actual car environment. In addition, only 12 gaze zones were considered in their research.
Other categories of gaze detection methods, such as regression-based methods, have been studied including appearance-based gaze estimation via uncalibrated gaze pattern recovery and adaptive linear regression for appearance-based gaze estimation [41,42].
Ghosh et al. [43] proposed using eye detection and tracking to monitor driver vigilance. However, their method classified open or closed eyes instead of detecting the driver's gaze position. In addition, the camera angle is small, which restricts the movement of the driver's head and inconveniences the driver. García et al. [44] proposed a non-intrusive approach for drowsiness detection. Cyganek et al. proposed a hybrid visual system for monitoring the driver's states of fatigue, sleepiness, and inattention based on eye recognition, using a custom setup of visible light and NIR cameras and a cascade of two classifiers [45]. Chen et al. proposed a method for detecting alertness and drowsiness by fusing electroencephalogram (EEG) signals and eyelid movement measured by electrooculography (EOG) [46]. However, as with the research in [43], these methods [44][45][46] only recognized the alertness or drowsiness status of the driver by classifying open or closed eyes (or by physiological signals), and did not detect the driver's gaze position. Kaddouhi et al. proposed an eye detection method based on the Viola and Jones method, corner points, the Shi-Tomasi detector, K-means, and eye template matching [47]. This research addressed only eye detection, and the driver's gaze position was not detected.
Previous research [48][49][50][51] investigated drivers' visual strategies, the distribution of fixation points, driving performance, and gaze behavior through on-road experiments or driving simulators. This research focused on analyzing the driver's visual characteristics while driving rather than proposing new gaze detection methods.
Considering the limitations of existing studies, we investigated a method for driver gaze classification in the car environment using deep CNNs. Table 1 summarizes the comparison between the proposed method and existing methods of gaze classification in the vehicle environment.

Category | Methods | Advantage | Disadvantage
Multiple camera-based [21][22][23][24] | Multiple cameras are used to classify driver's gaze region | - Possibility of invisible eye region is reduced; - Reliability is higher when information from multiple cameras is combined | - The processing time is increased by the images of multiple cameras; - Difficulties in applying to actual vehicular environment due to complicated, time-consuming calibration [21,22]
Single camera-based | Using SVM [27], SDM with SIFT [28], skin color [29,30], HOG, SVM, random forest classifier [31], Purkinje image [35], facial feature points [36], and particle filtering [40] | Computational complexity is lower than multiple camera-based system | Higher possibility of invisible eye regions or disappearance of pupil and CR, which can negatively affect the reliability of system
 | Regression-based method [41,42] | Continuous gaze position can be obtained instead of discrete one |
 | Using one CNN based on driver's face image [38,39] | - Accurate detection of pupil and CR is not required |

The Contributions of Our Research
Our research has contributed in the following four ways compared to previous works.

• Although there exists only one previous study that used a shallow CNN (AlexNet) for gaze estimation in the car environment [38], it used only one CNN with the whole face image as input to estimate a small number of gaze regions (8 regions). Because the accurate gaze position of the driver cannot be detected based on only these 8 regions, we increased the number of gaze regions to 17, as shown in Figure 3a. This increases the classification complexity of the gaze estimator, and one shallow CNN using one input face image cannot achieve good gaze estimation accuracy, as shown in Tables 10 and 13. Therefore, we use three deep CNNs that take the images of the left eye, right eye, and face as input, respectively, and combine the outputs of these three CNNs by score-level fusion, which yields higher gaze estimation accuracy without initial driver calibration.
• … [52] to enable them to have comparisons with our database and model.

Overview of Proposed Method
An overview of the proposed system is shown in Figure 1. After the NIR camera of our system (Figure 2) captures image frames of the driver's frontal view, 68 facial landmarks are detected by the Dlib facial feature tracker [53] (steps (1) and (2) of Figure 1; details are given in Section 4.2). Then, region-of-interest (ROI) images of the face, left eye, and right eye are obtained based on the positions of the corresponding facial landmarks (step (3) of Figure 1). The brightness of each ROI image is normalized based on the mean of all its pixel values to improve system performance and reduce the effect of lighting. In step (4) of Figure 1, three sets of feature values are extracted from the face, left eye, and right eye ROI images by three CNNs, respectively (details are given in Section 4.4). Each set of feature values is then normalized, and three distances are calculated from the three sets of feature values (step (5) of Figure 1). Here, each distance is calculated between the input set of feature values and that of each gaze zone. Finally, our system classifies the driver's gaze zone based on score fusion of the three distances (details are given in Section 4.4.4).
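The brightness normalization and the distance-based decision at the end of this pipeline can be illustrated with a minimal sketch. It rests on several assumptions that this section does not state: brightness is normalized by shifting each ROI to a common mean, each gaze zone is represented by a single reference feature vector, distances are Euclidean, and fusion is a weighted sum. The function names, the target mean of 128, and the equal weights are all hypothetical; the rule actually used is the one described in Section 4.4.4. The feature normalization step is omitted.

```python
import math

def normalize_brightness(roi, target_mean=128.0):
    """Shift all pixels so the ROI mean equals target_mean (assumed form)."""
    pixels = [p for row in roi for p in row]
    offset = target_mean - sum(pixels) / len(pixels)
    # Clip to the valid 8-bit range after shifting.
    return [[min(255.0, max(0.0, p + offset)) for p in row] for row in roi]

def zone_distances(features, zone_refs):
    """Euclidean distance from a feature vector to each zone's reference vector."""
    return [math.dist(features, ref) for ref in zone_refs]

def fuse_and_classify(dist_face, dist_left, dist_right,
                      weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted-sum fusion (hypothetical rule) of three per-zone distance lists.

    Returns the index of the gaze zone with the smallest fused distance.
    """
    w_face, w_left, w_right = weights
    fused = [w_face * f + w_left * l + w_right * r
             for f, l, r in zip(dist_face, dist_left, dist_right)]
    return min(range(len(fused)), key=fused.__getitem__)
```

In this sketch a smaller fused distance means a better match, so the predicted gaze zone is the index of the minimum.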

As shown in Figure 2, our device consists of a NIR camera and an illuminator of six NIR light-emitting diodes (LEDs). Because the device is small (8.8 cm × 4.3 cm × 4.0 cm), it can be installed in the vicinity of the dashboard, as shown in Figure 2, and can continuously track the driver's gaze without obscuring the dashboard. The NIR illuminator was placed to the left of the camera and helped capture the driver's facial image without being influenced by changing ambient light. NIR LEDs with a wavelength of 850 nm, which is only slightly visible to the driver's eye, were used to prevent uncomfortable situations, such as the driver being blinded or fatigued by the light while driving, and to make the boundary of the pupil distinguishable. A zoom lens (focal length of 9 mm) was attached to the camera to capture enlarged facial images of the driver. An 850-nm band pass filter (BPF) was also mounted on the camera's lens to minimize interference due to sunlight [54]. Power was supplied to the device by a laptop computer through two universal serial bus (USB) lines, one connected to the camera and the other to the illuminator. The images captured by the web camera of our gaze-tracking device are successively transmitted to the laptop computer via the USB interface line. The characteristics of the camera and the illuminator of our gaze detection system are shown in Table 2.

Detecting Facial Landmarks by Dlib Facial Feature Tracker
In our research, 68 facial landmarks are detected by the Dlib facial feature tracker [53]. Facial landmarks are used to localize and represent salient regions of the face, such as the eyes, eyebrows, nose, mouth, and jawline. They can be applied to various tasks such as face alignment, face swapping, and blink detection. In our case, we use them to extract the face, left eye, and right eye areas. The procedure for obtaining facial landmarks consists of localizing the face in the image and detecting the main facial structures in the face ROI. The face can be localized in various ways, such as with a Haar cascade detector, a HOG and linear SVM-based detector, or deep learning-based algorithms. In any case, the main purpose is to find the face-bounding box. Once the face is localized by its bounding box, the next step is to detect the key facial structures in the face area. The Dlib facial feature tracker is used to estimate 68 (x, y)-coordinates that map to the facial structures of the face. The indices of the 68 facial landmark coordinates are shown in Figure 4.
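As a rough illustration of how the three ROIs could be derived once the 68 landmarks are available, the sketch below computes axis-aligned bounding boxes from a list of landmark coordinates. It assumes the landmarks have already been obtained and are ordered as in Figure 4, with indices 36 to 41 for the left eye and 42 to 47 for the right eye. The margin parameter, the function names, and the choice of the box around all landmarks as the face ROI are hypothetical and are not taken from the paper.

```python
def bounding_box(points, margin=0):
    """Axis-aligned box (x_min, y_min, x_max, y_max) around points, grown by margin."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs) - margin, min(ys) - margin,
            max(xs) + margin, max(ys) + margin)

def extract_rois(landmarks, eye_margin=5):
    """Return face and eye boxes from 68 (x, y) landmarks ordered as in Figure 4."""
    if len(landmarks) != 68:
        raise ValueError("expected 68 landmarks")
    return {
        # Assumption: the face ROI is the box around all landmarks.
        "face": bounding_box(landmarks),
        # Landmark indices 36-41 (left eye) and 42-47 (right eye).
        "left_eye": bounding_box(landmarks[36:42], eye_margin),
        "right_eye": bounding_box(landmarks[42:48], eye_margin),
    }
```

The returned boxes would then be used to crop the captured frame; in practice they may also need clamping to the image bounds.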


Calculating PCCR Vector for Left and Right Eye
To show the effectiveness of our proposed method, we compared a method that considers the PCCR vector (scheme 2) with our proposed method, which does not consider the PCCR vector (scheme 1). For this purpose, we calculated the PCCR vectors of the left and right eyes. There have been various previous studies on eye, pupil, and CR detection [56,57]. In our research, within the captured eye ROI, which is defined based on facial landmarks 36~41 (for the left eye) and 42~47 (for the right eye) of Figure 4, the pupil center and CR center are detected as follows [7]. As the first step, histogram stretching is performed within the eye ROI. The stretched image is then binarized to distinguish the pupil from the non-pupil regions as well as the CR from the non-CR regions. Morphological processing and component labeling are applied to the binarized image to find the largest region. Then, based on the boundary identified using the Canny edge detector, the outer boundary of the pupil is detected in the image by using the convex hull algorithm. By subtracting the overlapping area (of this boundary and the binarized CR region) from the outer boundary, it is possible to find the pupil boundary that is not distorted by the CR. Finally, the pupil center is accurately detected by performing ellipse fitting based on this boundary.
To detect the CR, a search region centered on the detected pupil center is defined. Image binarization is performed on this search region to distinguish the CR and non-CR regions, after which component labeling is performed, because non-CR areas with the same pixel brightness as the CR can still exist in the region. The region closest to the detected pupil center is then designated as the CR region, and the geometric center of the designated CR is determined to be the CR center. An example of the detected pupil and CR regions is shown in Figure 5.
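The CR selection step can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the search region is a small grayscale image given as a list of rows, histogram stretching is applied to the search region before thresholding, the threshold of 200 is arbitrary, and components are 4-connected. All function names are hypothetical.

```python
from collections import deque

def stretch_histogram(img):
    """Linearly rescale pixel values to the full 0-255 range."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:
        return [[0 for _ in row] for row in img]
    return [[(p - lo) * 255 // (hi - lo) for p in row] for row in img]

def label_components(mask):
    """4-connected component labeling; returns one pixel list per component."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not seen[y][x]:
                seen[y][x] = True
                queue, pixels = deque([(x, y)]), []
                while queue:
                    cx, cy = queue.popleft()
                    pixels.append((cx, cy))
                    for nx, ny in ((cx + 1, cy), (cx - 1, cy),
                                   (cx, cy + 1), (cx, cy - 1)):
                        if (0 <= nx < w and 0 <= ny < h
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
                components.append(pixels)
    return components

def centroid(pixels):
    """Geometric center (x, y) of a list of pixel coordinates."""
    n = len(pixels)
    return (sum(x for x, _ in pixels) / n, sum(y for _, y in pixels) / n)

def find_cr_center(search_region, pupil_center, threshold=200):
    """Centroid of the bright component closest to pupil_center, or None."""
    stretched = stretch_histogram(search_region)
    mask = [[p >= threshold for p in row] for row in stretched]
    centers = [centroid(c) for c in label_components(mask)]
    if not centers:
        return None
    px, py = pupil_center
    return min(centers, key=lambda c: (c[0] - px) ** 2 + (c[1] - py) ** 2)
```

Choosing the component nearest to the pupil center is what rejects other bright spots, such as reflections on glasses, that survive binarization.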


Calculating PCCR Vector for Left and Right Eye
For showing the effectiveness of our proposed method, we have made comparison of a method considering PCCR vector (scheme 2) with our proposed method without considering PCCR (scheme 1). For this purpose, we calculated PCCR vector from left and right eyes. There have been various previous studies on eye, pupil, and CR detection [56,57]. In our research, within the captured eye ROI, defined based on the facial landmarks of 36~41 (for left eye) and those of 42~47 (for right eye) of Figure 4, the pupil center and CR center are detected as follows [7]. As the first step, histogram stretching is performed within the eye ROI. Then, the image subjected to histogram stretching goes through image binarization. This is intended to distinguish the pupil from the non-pupil regions as well as the CR from the non-CR regions. Morphological processing and component labeling are used on the binarized image to find the largest region. Then, based on the boundary identified using the canny edge detector, the outer boundary of the pupil is detected in the image by using the convex hull algorithm. By subtracting the overlapping area (of this boundary and the binarized CR region) from the outer boundary, it is possible to find the pupil boundary that is not distorted by the CR. Finally, the pupil center is accurately detected by performing ellipse fitting based on this boundary.
A search region centered on the detected pupil center is defined to detect the CR. Image binarization is performed on this search region to distinguish the CR and non-CR regions, after which component labeling is performed because areas with the same pixel brightness as the CR can still exist in the region. The region closest to the detected pupil center is then designated as the CR region, and the geometric center of the designated CR is determined to be the CR center. An example of the detected pupil and CR regions is shown in Figure 5.
Based on the two detected centers of the pupil and CR, the PCCR vector is calculated. The PCCR vector is most commonly used to calculate gaze position [7,56]. Figure 6 shows a typical pupil-corneal reflection setup. The visual axis angle is calculated by tracking the position of the pupil center relative to the CR, which is technically known as the "glint". Assuming that the positions of the camera and light source are fixed, that the eye is a sphere that rotates only around its center, and that the position of the CR does not move with the eye rotation, the CR (glint) is suitable as a reference point. Therefore, the 2D PCCR vectors of the left and right eyes are calculated as $\vec{v}_L$ and $\vec{v}_R$, respectively, as shown in Equation (1) and Figure 6.
$$\vec{v}_L = (x_{Lp} - x_{Lg},\; y_{Lp} - y_{Lg}), \qquad \vec{v}_R = (x_{Rp} - x_{Rg},\; y_{Rp} - y_{Rg}) \tag{1}$$
where $(x_{Lp}, y_{Lp})$ and $(x_{Rp}, y_{Rp})$ are the pupil centers of the left and right eye images, respectively, whereas $(x_{Lg}, y_{Lg})$ and $(x_{Rg}, y_{Rg})$ are the CR (glint) centers of the left and right eye images, respectively.
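As a minimal sketch of the PCCR vector computation, the following Python snippet subtracts the glint (CR) center from the pupil center for each eye. The function name `pccr_vector`, the sample coordinates, and the sign convention (pupil minus glint) are assumptions made for illustration and are not taken from the original implementation.

```python
from typing import Tuple

Point = Tuple[float, float]


def pccr_vector(pupil_center: Point, glint_center: Point) -> Point:
    """Return the 2D vector pointing from the glint (CR) center to the pupil center.

    The direction (pupil minus glint) is an assumed convention.
    """
    return (pupil_center[0] - glint_center[0],
            pupil_center[1] - glint_center[1])


# Hypothetical pixel coordinates inside the left and right eye ROIs.
left_vector = pccr_vector(pupil_center=(112.0, 80.5), glint_center=(108.0, 84.0))
right_vector = pccr_vector(pupil_center=(60.0, 41.0), glint_center=(63.5, 44.0))
print(left_vector, right_vector)
```

Because the glint is treated as a fixed reference point, the vector changes mainly with eye rotation rather than with small shifts of the eye ROI in the image.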

Gaze Feature Extraction
To extract gaze features, we use three inputs obtained from the driver's image: the face, left eye, and right eye. These inputs are obtained from the 68 facial landmarks detected as shown in Figure 4. A margin of 50 pixels is added around the outer landmarks of the face, left eye, and right eye to crop the respective ROI images. The cropped images are resized to 224 × 224 pixels using bilinear interpolation and are used as inputs to three CNNs as shown in Figure 7 (scheme 1).
In this research, we used the original VGG-face network (comprising 16 layers) that was trained for face recognition [58]. The VGG-face 16 model was trained using approximately 2.6 million face images obtained from 2622 people. The structure of the VGG-face 16 model is similar to the VGG-Net 16 architecture [59], and its accuracy was evaluated on the Labeled Faces in the Wild (LFW) [60] and YouTube Faces (YTF) [61] databases. We fine-tuned this VGG-face 16 model with our training data. Detailed explanations of the training and testing data are given in Section 5.1. With the fine-tuned VGG-face model, we extracted features from the response of the second-to-last fully connected layer (FCL) (Fc7 of Table 3), which yields a 4096-dimensional descriptor. After extracting three sets of 4096 features from the face, left eye, and right eye images, the final gaze zone among the 17 zones of Figure 3a is determined based on the minimum distance; detailed explanations are included in Section 4.4.4.
As the next scheme (scheme 2 of Figure 8), we also considered the PCCR vector for gaze zone classification because the PCCR vector has been widely used in previous studies on gaze detection. We then compared the performance of scheme 1 of Figure 7 with that of scheme 2 of Figure 8 for driver gaze classification. In scheme 2, after extracting three sets of 4096 features from the face, left eye, and right eye images together with two additional PCCR vectors from the left and right eyes, the final gaze zone among the 17 zones of Figure 3a is determined based on the minimum distance; detailed explanations are also included in Section 4.4.4.
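The ROI cropping step can be sketched as follows. The snippet computes a bounding box around a set of landmark points, expands it by the 50-pixel margin, and clips it to the image. The function name `roi_box`, the inclusive-coordinate convention, the clipping behavior, and the sample landmark coordinates are assumptions for illustration; resizing the crop to 224 × 224 pixels would be left to an image library.

```python
import math
from typing import Sequence, Tuple


def roi_box(landmarks: Sequence[Tuple[float, float]],
            image_width: int,
            image_height: int,
            margin: int = 50) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the landmark bounding box plus a margin.

    Coordinates are inclusive pixel indices clipped to the image bounds.
    """
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    left = max(math.floor(min(xs)) - margin, 0)
    top = max(math.floor(min(ys)) - margin, 0)
    right = min(math.ceil(max(xs)) + margin, image_width - 1)
    bottom = min(math.ceil(max(ys)) + margin, image_height - 1)
    return left, top, right, bottom


# Hypothetical eye landmarks in a 1600 x 1200 image.
eye_landmarks = [(700, 500), (730, 490), (760, 520)]
print(roi_box(eye_landmarks, image_width=1600, image_height=1200))
```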

CNN (VGG-Face 16) Structure
The CNN structure is represented in Figure 9 and detailed in Table 3. It comprises 13 convolutional layers, 5 pooling layers, and 3 fully connected layers (FCLs). In the first convolutional layer, 64 filters of size 3 × 3 are applied to the input of 224 × 224 × 3, where 224 × 224 × 3 represents the width, height, and number of channels, respectively. From that, a feature map of 224 × 224 × 64 is obtained. This can be calculated based on the following criterion: output width (or height) = (input width (or height) − filter width (or height) + 2 × the padding number)/the stride number + 1 [62]. For example, in Table 3, the input width, filter width, padding number, and stride number are 224, 3, 1, and 1, respectively. Therefore, the output width of the feature map by convolution is calculated as 224 (= (224 − 3 + 2 × 1)/1 + 1). In general, the output feature map for standard convolution based on stride one and padding is obtained by [63]:
$$O_{k,l,n} = \sum_{i,j,m} K_{i,j,m,n} \cdot I_{k+i-1,\,l+j-1,\,m} \tag{2}$$
where $I_{k+i-1,\,l+j-1,\,m}$ is the input feature map of size $S_F \times S_F \times P$, $S_F$ is the width and height of the square input feature map, and $P$ is the number of input channels (input depth). $O_{k,l,n}$ is the output feature map of size $T_F \times T_F \times Q$, $T_F$ is the spatial width and height of the square output feature map, and $Q$ is the number of output channels (output depth). $K_{i,j,m,n}$ is the convolution kernel of size $S_K \times S_K \times P \times Q$, and $S_K$ is the spatial dimension of the convolution kernel. From that, standard convolutions have the following computational cost:
$$S_K \cdot S_K \cdot P \cdot Q \cdot S_F \cdot S_F \tag{3}$$
Based on Equation (3), the computational cost depends multiplicatively on the kernel size $S_K \times S_K$, the number of input channels $P$, the number of output channels $Q$, and the input feature map size $S_F \times S_F$ [63].
The rectified linear unit (ReLU) layer is based on the function shown in Equation (4) [64,65].
$$y = \max(0, x) \tag{4}$$
In Equation (4), x and y are the input and output values, respectively. This function reduces the vanishing gradient problem [66] that may occur when a hyperbolic tangent or sigmoid function is used in back-propagation for training. In addition, it has a faster processing speed than other non-linear activation functions. After passing through the ReLU layer (ReLU-1_1 of Table 3), the feature map obtained through the second convolutional layer is once again passed through a ReLU layer (ReLU-1_2 of Table 3) before passing through the max pooling layer (Pool-1 of Table 3). Here, the second convolutional layer maintains the feature map size of 224 × 224 × 64 with a filter of size 3 × 3, padding of 1 × 1, and stride of 1 × 1, as in the first convolutional layer.
The max pooling layer selects the maximum value among the values within the filter range, which performs a kind of subsampling. Note that after ReLU-1_2 of Table 3, the feature map size is 224 × 224 × 64. By using the max pooling layer (Pool-1) with a kernel size of 2 × 2 and a stride of 2 × 2, the feature map size is reduced to 1/4 (112 × 112 × 64) because there is no overlapping area for the filter.
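The size and cost arithmetic above can be checked with a short Python sketch. The helper names (`output_size`, `standard_conv_cost`, `relu`) are hypothetical; the numeric settings (3 × 3 filters with padding 1 and stride 1, 2 × 2 pooling with stride 2) are those stated in the text.

```python
def output_size(input_size: int, filter_size: int, padding: int, stride: int) -> int:
    """Output width (or height) of a convolution or pooling layer."""
    return (input_size - filter_size + 2 * padding) // stride + 1


def standard_conv_cost(kernel_size: int, in_channels: int,
                       out_channels: int, feature_size: int) -> int:
    """Multiply count of a standard convolution: S_K * S_K * P * Q * S_F * S_F."""
    return (kernel_size * kernel_size * in_channels * out_channels
            * feature_size * feature_size)


def relu(x: float) -> float:
    """Rectified linear unit: y = max(0, x)."""
    return max(0.0, x)


# First convolutional layer: 224 x 224 x 3 input, 64 filters of 3 x 3.
conv_width = output_size(224, filter_size=3, padding=1, stride=1)
print("conv output width:", conv_width)
print("first-layer cost:", standard_conv_cost(3, 3, 64, 224))

# Five 2 x 2 max pooling layers with stride 2 (the convolutions keep the size).
width = conv_width
for _ in range(5):
    width = output_size(width, filter_size=2, padding=0, stride=2)
print("width after five pooling layers:", width)
```

Since the 3 × 3 convolutions with padding 1 preserve the spatial size, only the five pooling layers shrink the feature map, halving the width and height each time.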

FCLs of CNN
Once the input image has passed through the 13 convolutional layers, 13 ReLU layers, and 5 pooling layers, a feature map of 7 × 7 × 512 is obtained. This feature map then passes through three additional FCLs, after which feature maps of 4096 × 1, 4096 × 1, and 17 × 1 are obtained, respectively, as shown in Table 3. In this study, we designed the classification system for the driver's gaze region with a CNN. Because the number of gaze zones is 17 as shown in Figure 3a, the output layer of Table 3 is 17 × 1.
In the third FCL, the softmax function is used as shown in Equation (5) [67].
$$\sigma(\mathbf{r})_i = \frac{e^{r_i}}{\sum_{j} e^{r_j}} \tag{5}$$
Here, $\mathbf{r}$ is the array of output neurons; the probability of belonging to the ith class is obtained by dividing the exponential of the ith element by the summation of the exponentials of all the elements.
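A minimal Python sketch of the softmax function follows. Subtracting the maximum before exponentiation is a common numerical-stability step added here as an assumption; it does not change the result. The function name and sample values are illustrative only.

```python
import math
from typing import List, Sequence


def softmax(r: Sequence[float]) -> List[float]:
    """Convert raw output-neuron values into class probabilities."""
    peak = max(r)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(value - peak) for value in r]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw outputs for three of the gaze zones.
probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities)
```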
CNN-based recognition systems frequently show low recognition accuracy on testing data because of over-fitting to the training data. To address this issue, we used data augmentation and dropout methods [68,69]. A detailed description of the experimental data generated by data augmentation is given in Section 5.1. For the dropout method, we adopted a dropout probability of 50% to randomly disconnect the links between the previous layer and the current layer in the first and second FCLs.

Classifying Gaze Zones by Score Fusion of Three Distances
As explained in Section 4.4.1, after extracting three separate feature sets (three sets of 4096 features) from the face, left eye, and right eye images (scheme 1), we normalized them by min-max scaling. With the training data, the three (normalized) feature sets were saved in advance for each gaze zone of Figure 3a. Then, three Euclidean distances are calculated between the three feature sets of the input and the three saved for each gaze zone. These three distances are combined by score level fusion into one final score (distance), and the gaze zone whose final score is the smallest among the 17 zones of Figure 3a is determined as the driver's gaze region. As explained in Section 4.4.1, scheme 2 uses five feature sets (those from the face, left eye, and right eye images together with the two additional PCCR vectors from the left and right eyes) instead of three.
For the score level fusion, the performances of the weighted SUM and weighted PRODUCT rules of Equations (6) and (7) were compared, and the optimal weights were selected with training data. Detailed explanations are given in Section 5.3.1.
$$WS = \sum_{i=1}^{m} w_i d_i \tag{6}$$
$$WP = \prod_{i=1}^{m} d_i^{\,w_i} \tag{7}$$
where m is 3 in the case of scheme 1 and 5 in the case of scheme 2, as shown in Figures 7 and 8. In scheme 1, i of 1, 2, and 3 denotes the left eye, right eye, and face, respectively, as shown in Figure 7. In scheme 2, i of 1–5 denotes the left eye, right eye, face, PCCR vector of the left eye, and PCCR vector of the right eye, respectively, as shown in Figure 8. WS and WP are the scores by the weighted SUM and weighted PRODUCT rules, respectively, $d_i$ is the Euclidean distance obtained from the input, and $w_i$ is the weight.
The 17 outputs of the output layer in Figure 9 represent the 17 gaze regions of Figure 3a. If one CNN were used for gaze estimation, these 17 outputs could be used to detect the final gaze position. However, in our research, we combine the gaze estimation results of three CNNs (scheme 1 of Figure 7) or of three CNNs with two PCCR vectors (scheme 2 of Figure 8). In these cases, the 17 outputs from one CNN cannot be combined with other outputs or PCCR vectors. Therefore, we use the 4096 features from Fc7 of Table 3 to obtain the combined Euclidean distance by score level fusion, by which the final gaze position can be detected as shown in Figures 7 and 8.
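The distance fusion and minimum-distance decision can be sketched as below. This is an illustrative sketch under several assumptions: one stored feature template per zone and per input is assumed (the text does not specify how the saved feature sets are aggregated), the weighted PRODUCT rule is taken as the product of distances raised to their weights, and the tiny two-dimensional features stand in for the 4096-dimensional Fc7 descriptors. All function and variable names are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def weighted_sum(distances: Sequence[float], weights: Sequence[float]) -> float:
    return sum(w * d for w, d in zip(weights, distances))


def weighted_product(distances: Sequence[float], weights: Sequence[float]) -> float:
    return math.prod(d ** w for w, d in zip(weights, distances))


def classify_zone(inputs: List[Vector],
                  zone_templates: Dict[int, List[Vector]],
                  weights: Sequence[float],
                  rule: Callable[[Sequence[float], Sequence[float]], float] = weighted_sum) -> int:
    """Return the zone whose fused distance to the input features is smallest."""
    scores = {}
    for zone, templates in zone_templates.items():
        distances = [euclidean(f, t) for f, t in zip(inputs, templates)]
        scores[zone] = rule(distances, weights)
    return min(scores, key=scores.get)


# Order of feature sets: left eye, right eye, face (scheme 1).
templates = {
    1: [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
    2: [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],
}
sample = [[0.9, 1.0], [0.8, 1.1], [0.2, 0.1]]
print(classify_zone(sample, templates, weights=[0.5, 0.4, 0.1]))
```

With the face weighted lightly, the two eye distances dominate the fused score, so the sample is assigned to zone 2 even though its face features lie closer to zone 1.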

Experimental Data and Environment
In this research, we collected our own database (DDGC-DB1) for the driver's gaze classification system in the car environment, as shown in Figure 10. It was obtained with the experimental setup shown in Figure 2. Most previous driving databases are not open access for academic research because they were prepared by auto manufacturers. The Chinese Academy of Sciences pose, expression, accessories, and lighting (CAS-PEAL) database is very popular and widely used for baseline evaluation of gaze estimation or face recognition under various pose and lighting factors [70]. However, this database was collected in a laboratory instead of an actual car environment, so various factors present in cars are not reflected in it. Another database is the RobeSafe driver monitoring video (RS-DMV) dataset [71]. However, this database does not fit our purpose because ground-truth gaze positions are not provided, and without this information we cannot evaluate the accuracy of our gaze detection method. Therefore, we collected our own database (DDGC-DB1).
As shown in Figure 3, 17 spots (gaze zones) were designated for the experiment, and each driver stared at each spot five times. Data were collected from 20 drivers, including 3 wearing glasses. The image size is 1600 × 1200 pixels with 3 channels. When the participants were staring at each spot, they were told to act normally, as if they were actually driving; they were not restrained to one position or given any special instructions to act in an unnatural manner. Asking the participants to stare accurately at the 17 designated spots while actually driving would have carried a risk of car accidents. Instead, this study obtained images at various locations (from roads in daylight to a parking garage) in a real vehicle (SM5 New Impression by Renault Samsung [72]) with its power on but in park, to create an environment as similar as possible to actual driving (including factors such as car vibration and external light). Moreover, to understand the influence of various kinds of external light on driver gaze detection, test data were acquired at different times of the day: in the morning, in the afternoon, and at night. From our database, we obtained the images (of 224 × 224 pixels) of the face, left eye, and right eye for CNN training and testing, as shown in Figure 10.
The research by Lee et al. [27] used 18 gaze zones, with one additional zone (the upper position of region 6 of Figure 3) compared to the 17 zones in our research. However, gazing at this additional zone does not occur frequently while driving [48–51], and previous studies did not use this additional zone for experiments either [24,36,38–40]. Based on that, we did not use this additional zone and performed the experiments with the data in which drivers gazed at the 17 positions of Figure 3.
In our experiment, we performed two-fold cross validation for training and testing. For that, we randomly divided our database into two subsets of face, left eye, and right eye images, as shown in Table 4. Then, as explained in Section 4.4.3, data augmentation was applied to the training data to avoid the overfitting problem as follows. The rectangular ROI defined for each of the face, left eye, and right eye is shifted by 1 pixel in the left, right, up, and down directions relative to its original coordinates. Together with the original ROI, five images are thus obtained from each face, left eye, and right eye in a single original image, as shown in Figure 11. The original data were used for testing, whereas the augmented data were used only for training, as shown in Table 4.
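The shift-based augmentation can be sketched as a list of five crop rectangles per ROI: the original position plus one-pixel shifts to the left, right, up, and down. The function name `augmented_boxes`, the (left, top, right, bottom) box convention, and the sample coordinates are assumptions for illustration.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def augmented_boxes(box: Box, shift: int = 1) -> List[Box]:
    """Return the original ROI and its copies shifted left, right, up, and down."""
    left, top, right, bottom = box
    offsets = [(0, 0), (-shift, 0), (shift, 0), (0, -shift), (0, shift)]
    return [(left + dx, top + dy, right + dx, bottom + dy) for dx, dy in offsets]


# Hypothetical eye ROI; each returned box would be cropped from the same image.
for crop in augmented_boxes((650, 440, 810, 570)):
    print(crop)
```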
For the CNN training and testing, we used a desktop computer with an Intel® Core™ (Santa Clara, CA, USA) i7-3770K CPU @ 3.50 GHz, 16 GB memory, and an NVIDIA GeForce GTX 1070 (1920 CUDA cores and 8 GB memory) graphics card [73].

Training of CNN Model
The stochastic gradient descent (SGD) method was used for CNN training [76]. The SGD method is a derivative-based method of finding the optimal weights that minimize the difference between the desired output and the calculated output. Unlike the gradient descent (GD) method, in the SGD method the number of training samples divided by the mini-batch size is defined as the number of iterations, and one epoch is completed when training has been performed for all of these iterations. The weights are updated as shown in Equations (8) and (9) [69].
$$v_{i+1} = m \cdot v_i - d \cdot \eta \cdot w_i - \eta \cdot \left\langle \frac{\partial Q}{\partial w} \Big|_{w_i} \right\rangle_{D_i} \tag{8}$$
$$w_{i+1} = w_i + v_{i+1} \tag{9}$$
where $w_i$ is the weight to be learnt at the ith iteration, $m$ is the momentum, $v_i$ is the momentum variable, $d$ is the weight decay, and $\eta$ is the learning rate. $\left\langle \frac{\partial Q}{\partial w} \big|_{w_i} \right\rangle_{D_i}$ is the average over the ith batch $D_i$ of the derivative of the objective with respect to $w$, evaluated at $w_i$.
In our experiment, the training was performed for a predefined epoch count of 16, based on the maximum training count of about 13,048. The values of $m$, $d$, and $\eta$ in Equations (8) and (9) were set to 0.9, 0.0005, and 0.00001, respectively, with a batch size of 20. Figure 12 shows the relationship between loss and training accuracy during training on the sub-databases of face, left eye, and right eye. The x-axis represents the number of epochs, the left y-axis represents the loss, and the right y-axis represents the training accuracy. The loss depends on the learning rate and batch size. When the learning rate is low, the loss decreases slowly and almost linearly. When the learning rate is high, the loss decreases sharply but can fluctuate abruptly, so that it may stall without reaching the optimal CNN model. In this experiment, we used the models whose loss curves were close to 0 (0%) and whose training accuracies were close to 1 (100%), as shown in Figure 12.
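A single SGD update with momentum and weight decay can be sketched in plain Python as follows. The update form (velocity from momentum, a weight decay term scaled by the learning rate, and the batch-averaged gradient, followed by adding the velocity to the weights) follows the description of the symbols above; the function name and the toy weights and gradients are assumptions for illustration, while the default hyperparameters are the values stated in the text.

```python
from typing import List, Sequence, Tuple


def sgd_momentum_step(weights: Sequence[float],
                      velocity: Sequence[float],
                      batch_gradient: Sequence[float],
                      momentum: float = 0.9,
                      weight_decay: float = 0.0005,
                      learning_rate: float = 0.00001) -> Tuple[List[float], List[float]]:
    """Apply one update: v <- m*v - d*lr*w - lr*grad, then w <- w + v."""
    new_velocity = [
        momentum * v - weight_decay * learning_rate * w - learning_rate * g
        for w, v, g in zip(weights, velocity, batch_gradient)
    ]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


# Toy example: two weights, zero initial velocity, hypothetical batch-averaged gradient.
w, v = [1.0, -2.0], [0.0, 0.0]
w, v = sgd_momentum_step(w, v, batch_gradient=[2.0, -4.0])
print(w, v)
```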

Comparison of Weighted SUM and Weighted PRODUCT Method
In our research, the accuracy of gaze estimation was measured based on the strictly correct estimation rate (SCER) and the loosely correct estimation rate (LCER). SCER is the number of strictly correct frames divided by the total number of frames, where a strictly correct frame is one in which the estimated gaze zone is the same as the ground-truth gaze zone. LCER is the number of loosely correct frames divided by the total number of frames, where a loosely correct frame is one in which the estimated gaze zone is either the ground-truth gaze zone or one of its surrounding zones. For example, when a driver looked at zone 10 of Figure 3a in a test image, the SCER counted the estimation as correct only when the minimum distance fell on position 10. On the other hand, the LCER counted it as correct when the minimum distance fell on either position 10 or one of the positions in its vicinity (6 through 14).
As explained in Section 4.4.4, the Euclidean distances (scores) of the inputs are combined by the weighted SUM or weighted PRODUCT rule. The optimal weights for these rules were experimentally determined with training data. With the weighted SUM rule in scheme 1, the optimal weights for the face, left eye, and right eye were 0.1, 0.5, and 0.4, respectively, with average SCER and LCER values of 92.8% and 99.6%, respectively. With the weighted SUM rule in scheme 2, the optimal weights for the face, left eye, right eye, PCCR vector of the left eye, and PCCR vector of the right eye were 0.085, 0.495, 0.4, 0.011, and 0.009, respectively, with average SCER and LCER values of 64.8% and 91.1%, respectively. With the weighted PRODUCT rule in scheme 1, the optimal weights for the face, left eye, and right eye were 0.1, 0.5, and 0.4, respectively, with average SCER and LCER values of 90.8% and 99.1%, respectively. With the weighted PRODUCT rule in scheme 2, the average SCER and LCER values were 65.7% and 90.4%, respectively, with optimal weights of 0.09, 0.49, 0.4, 0.01, and 0.01 for the face, left eye, right eye, PCCR vector of the left eye, and PCCR vector of the right eye, respectively.
Because other parts of the face, such as the lips, can change with facial expression even when the driver gazes at the same zone, a lower weight was determined for the face than for the left and right eyes. As shown in Figure 3b, the head rotation in the right direction is more severe than that in the left direction. Therefore, the left eye can be better observed by our gaze-tracking camera (installed in the vicinity of the dashboard, as shown in Figure 2) than the right eye, and more gaze information can be obtained from the left eye for the driver gaze classification system. Consequently, a larger weight was determined for the left eye than for the right eye. In addition, because the weighted SUM rule outperformed the weighted PRODUCT rule in scheme 1, we use the weighted SUM rule in our research.
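The two accuracy measures can be sketched as follows. The neighbor map is a hypothetical, partial stand-in for the zone layout of Figure 3a: only zone 10 is filled in, using the vicinity stated in the text (positions 6 through 14), and the predicted and ground-truth sequences are made-up sample data.

```python
from typing import Dict, Sequence, Set, Tuple


def scer_lcer(predicted: Sequence[int],
              ground_truth: Sequence[int],
              neighbors: Dict[int, Set[int]]) -> Tuple[float, float]:
    """Return (SCER, LCER) as fractions of the total number of frames."""
    total = len(ground_truth)
    strict = sum(p == g for p, g in zip(predicted, ground_truth))
    loose = sum(p == g or p in neighbors.get(g, set())
                for p, g in zip(predicted, ground_truth))
    return strict / total, loose / total


# Hypothetical partial neighbor map: surrounding zones of zone 10 only.
zone_neighbors = {10: {6, 7, 8, 9, 11, 12, 13, 14}}
truth = [10, 10, 10, 10]
estimates = [10, 9, 3, 10]
scer, lcer = scer_lcer(estimates, truth, zone_neighbors)
print(f"SCER = {scer:.2%}, LCER = {lcer:.2%}")
```

In this sample, the estimate of zone 9 counts only toward LCER and the estimate of zone 3 counts toward neither measure, so LCER is always at least as large as SCER.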

Comparison of Weighted SUM and Weighted PRODUCT Method
In our research, the accuracy of gaze estimation was measured by the strictly correct estimation rate (SCER) and the loosely correct estimation rate (LCER). SCER is the number of strictly correct frames divided by the total number of frames, where a strictly correct frame is one in which the estimated gaze zone equals the ground-truth gaze zone. LCER is the number of loosely correct frames divided by the total number of frames, where a loosely correct frame is one in which the estimated gaze zone is either the ground-truth gaze zone or one of its surrounding zones. For example, when a driver looked at zone 10 of Figure 3a in a test image, SCER counted the estimation as correct only when the minimum distance fell on position 10. LCER, on the other hand, counted it as correct when the minimum distance fell on position 10 or on one of the positions in its vicinity (6 through 14).
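As a minimal sketch of how these two rates can be computed from per-frame labels, the following Python snippet counts strictly and loosely correct frames. The function name `estimation_rates` and the `NEIGHBORS` map are hypothetical. The map lists only zone 10, reading "6 through 14" in the text as zones 6 to 14; the actual adjacency of all 17 zones follows from Figure 3a.

```python
from typing import Dict, Sequence, Set, Tuple


def estimation_rates(actual: Sequence[int],
                     predicted: Sequence[int],
                     neighbors: Dict[int, Set[int]]) -> Tuple[float, float]:
    """Return (SCER, LCER) in percent for per-frame gaze zone labels."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("label sequences must be non-empty and equal in length")
    strict = 0
    loose = 0
    for a, p in zip(actual, predicted):
        if p == a:
            # A strictly correct frame is also loosely correct.
            strict += 1
            loose += 1
        elif p in neighbors.get(a, set()):
            loose += 1
    total = len(actual)
    return 100.0 * strict / total, 100.0 * loose / total


# Hypothetical adjacency for zone 10 only (assumption, see lead-in).
NEIGHBORS = {10: {6, 7, 8, 9, 11, 12, 13, 14}}

# Four test frames in which the driver looked at zone 10.
actual = [10, 10, 10, 10]
predicted = [10, 9, 14, 3]
scer, lcer = estimation_rates(actual, predicted, NEIGHBORS)
print(f"SCER = {scer:.1f}%, LCER = {lcer:.1f}%")
```

In this toy input, one frame is strictly correct and two more fall on neighboring zones, so LCER is necessarily at least as large as SCER.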
As explained in Section 4.4.4, the Euclidean distances (scores) of the inputs are combined by the weighted SUM or weighted PRODUCT rule. The optimal weights for these rules were determined experimentally with training data. With the weighted SUM rule in scheme 1, the optimal weights for face, left eye, and right eye are 0.1, 0.5, and 0.4, respectively, giving average SCER and LCER values of 92.8% and 99.6%. With the weighted SUM rule in scheme 2, the optimal weights for face, left eye, right eye, PCCR vector of left eye, and PCCR vector of right eye are 0.085, 0.495, 0.4, 0.011, and 0.009, respectively, giving average SCER and LCER values of 64.8% and 91.1%. With the weighted PRODUCT rule in scheme 1, the optimal weights for face, left eye, and right eye are 0.1, 0.5, and 0.4, respectively, giving average SCER and LCER values of 90.8% and 99.1%. With the weighted PRODUCT rule in scheme 2, the optimal weights for face, left eye, right eye, PCCR vector of left eye, and PCCR vector of right eye are 0.09, 0.49, 0.4, 0.01, and 0.01, respectively, giving average SCER and LCER values of 65.7% and 90.4%.
Because other parts of the face, such as the lips, can change with facial expression even when the driver gazes at the same zone, a lower weight was determined for the face than for the left and right eyes. As shown in Figure 3b, the head rotation in the right direction is more severe than that in the left direction. Therefore, the left eye can be better observed than the right eye by our gaze-tracking camera (installed in the vicinity of the dashboard, as shown in Figure 2), and more gaze information can be obtained from the left eye for the driver gaze classification system. Consequently, a larger weight was determined for the left eye than for the right eye. In addition, because the weighted SUM rule gave higher SCER and LCER than the weighted PRODUCT rule in scheme 1 and higher LCER in scheme 2, we use the weighted SUM rule in our research.
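The score-level fusion described above can be sketched as follows. The exact formulas are given in Section 4.4.4 and are not reproduced here, so this sketch assumes that the weighted SUM rule is the sum of w_i · d_i and that the weighted PRODUCT rule is the product of d_i raised to w_i. The function names and the distance values are hypothetical; only the scheme 1 weights (0.1, 0.5, 0.4) come from the text.

```python
import math
from typing import Callable, Dict, Sequence


def weighted_sum(distances: Sequence[float], weights: Sequence[float]) -> float:
    """Assumed weighted SUM rule: sum of w_i * d_i."""
    return sum(w * d for w, d in zip(weights, distances))


def weighted_product(distances: Sequence[float], weights: Sequence[float]) -> float:
    """Assumed weighted PRODUCT rule: product of d_i ** w_i."""
    return math.prod(d ** w for w, d in zip(weights, distances))


def classify(zone_distances: Dict[int, Sequence[float]],
             weights: Sequence[float],
             rule: Callable[[Sequence[float], Sequence[float]], float]) -> int:
    """Pick the gaze zone whose fused distance is the smallest."""
    return min(zone_distances, key=lambda zone: rule(zone_distances[zone], weights))


# Scheme 1 weights for (face, left eye, right eye), as reported in the text.
SCHEME1_WEIGHTS = (0.1, 0.5, 0.4)

# Made-up Euclidean distances of one test frame to three candidate zones.
distances = {
    9: (0.8, 0.6, 0.7),
    10: (0.9, 0.2, 0.3),
    11: (0.5, 0.7, 0.6),
}

print(classify(distances, SCHEME1_WEIGHTS, weighted_sum))
print(classify(distances, SCHEME1_WEIGHTS, weighted_product))
```

With these made-up distances, both rules select zone 10 because the two eye distances dominate the fused score, even though the face distance alone would favor zone 11.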

Comparison of Scheme 1 and Scheme 2
As the next experiment, we compared the gaze estimation accuracies of schemes 1 and 2 of Figures 7 and 8. We analyzed the testing results through two metrics: the confusion matrix and the estimation rates SCER and LCER. First, we discuss the results obtained through the confusion matrix, a popular metric for classification problems on a set of test data for which the true values are known. We first analyzed the results without PCCR vectors, i.e., scheme 1 explained in Figure 7. Because the results were collected from two-fold cross validation, Table 5 shows the classification accuracy averaged over the two folds (scheme 1). "Actual" and "Predicted" mean the ground-truth and estimated gaze zones, respectively. The confusion matrix of scheme 1 shows that almost all the gaze regions have a high level of accuracy. Although the distances between gaze regions are small and the number of gaze regions is large, our proposed method demonstrated high accuracy of gaze estimation. Next, we obtained the results using scheme 2 for the driver's gaze classification, in which the PCCR vectors from the left and right eyes are combined with the face, left eye, and right eye images. For comparison, these results are also presented as a confusion matrix: Table 6 shows the classification accuracy averaged over the two-fold cross validation with the PCCR vector (scheme 2).
The confusion matrix of scheme 2 shows that the accuracy of the driver's gaze classification in the car environment was degraded when the PCCR vector was considered (scheme 2), compared to scheme 1. This indicates that errors in detecting the pupil center and corneal reflection cause errors in the PCCR vector, which decrease the accuracy of gaze estimation.
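To make the averaged confusion matrices concrete, the following sketch builds a row-normalized confusion matrix per fold and averages the two folds element-wise. Whether Tables 5 and 6 were averaged this way or by pooling the frames of both folds is not stated, so the element-wise average is an assumption; the function names and the three-zone toy labels are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def confusion_matrix(actual: Sequence[int],
                     predicted: Sequence[int],
                     num_zones: int) -> Matrix:
    """Row-normalized confusion matrix in percent; rows are actual zones (1-based)."""
    counts = [[0] * num_zones for _ in range(num_zones)]
    for a, p in zip(actual, predicted):
        counts[a - 1][p - 1] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows without any test frame are reported as all zeros.
        matrix.append([100.0 * c / total if total else 0.0 for c in row])
    return matrix


def average_matrices(first: Matrix, second: Matrix) -> Matrix:
    """Element-wise mean of two equally sized matrices (assumed averaging rule)."""
    return [[(x + y) / 2.0 for x, y in zip(r1, r2)]
            for r1, r2 in zip(first, second)]


# Toy labels for two folds with three gaze zones.
fold1 = confusion_matrix([1, 1, 2, 2, 3, 3], [1, 1, 2, 3, 3, 3], num_zones=3)
fold2 = confusion_matrix([1, 1, 2, 2, 3, 3], [1, 2, 2, 2, 3, 3], num_zones=3)
average = average_matrices(fold1, fold2)
for row in average:
    print(row)
```

The diagonal entries of such a matrix are the per-zone strictly correct rates, which is why a strong diagonal corresponds to a high SCER.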
Table 5. Average confusion matrix of scheme 1 from two-fold cross validation (actual versus predicted gaze regions R1 through R17).
To further verify our results for scheme 1 and scheme 2, we used another classification metric, the estimation rate, categorized into SCER and LCER. The estimation rates measured using the proposed method without PCCR vectors (scheme 1) and with PCCR vectors (scheme 2) are shown in Tables 7 and 8, respectively. Table 7 shows the SCER and LCER results for each gaze region without using the PCCR vector (scheme 1).
Note that the average detection rate using SCER is 92.8% and that using LCER is 99.6%. Even when one of the driver's eyes is occluded by severe head rotation (gaze regions 1, 5, and 16 of Figure 3a), our system shows high accuracy of gaze estimation because it uses the information of the whole face, as shown in Figure 7. Table 8 shows the SCER and LCER results using the PCCR vector (scheme 2).
Note that the average detection rate with the PCCR vector (scheme 2) is 64.8% using SCER and 91.1% using LCER. These accuracies are lower than those without the PCCR vector (scheme 1). Therefore, we found that it is difficult to use the PCCR vector in the outdoor environment with one camera and without driver calibration. The result also shows that errors in detecting the pupil center and corneal reflection cause errors in the PCCR vector, which reduce the accuracy of gaze estimation.
In Figure 13, we show some examples of gaze zones correctly classified in terms of SCER by our proposed method (scheme 1). As shown in this figure, although the images are collected from different people with different head and eye directions, our proposed method can classify the gaze zone with a high level of accuracy.

Comparison with Previous Method
In the next experiment, we compared the performance of our proposed method with a previous method [38], in which the AlexNet CNN model [69] was used to detect eight gaze zones. AlexNet consists of five convolutional layers and three fully connected layers. In the first convolutional layer, AlexNet uses 96 filters of size 11 × 11 × 3, and it uses a local response normalization (LRN) layer after a ReLU layer. The weights in each layer were initialized to random values drawn from a Gaussian distribution with a mean of zero and a standard deviation of 0.01 [69]. For this comparison, we applied the previous method to our 17 gaze zones. The previous method [38] considers only the face as an input, whereas our method (scheme 1) uses three inputs, i.e., face, left eye, and right eye images. We used the same training and testing data from two-fold cross validation for a fair comparison. As in the comparison between scheme 1 and scheme 2, we adopted the same metrics: the confusion matrix and the estimation rates SCER and LCER. Table 9 shows the average confusion matrix with testing data by the previous method [38] from two-fold cross validation.
The results show that the accuracies of the previous method [38] are lower than those of our method when it is tested on DDGC-DB1, covering all 17 gaze regions. As shown in Table 9, the highest accuracy achieved by the previous method based on AlexNet is 72.6% at gaze region 14. That is lower than the accuracy of our method, which is above 90% in most gaze regions, as shown in Table 7. In Table 10, we compare the accuracies of our method and the previous method [38] in terms of SCER and LCER.
Note that with the previous method [38], the average SCER and LCER are 64.3% and 87.2%, respectively. These are lower than the average SCER and LCER obtained with the proposed method: 92.8% and 99.6%, respectively. Hence, the proposed driver gaze classification of scheme 1 of Figure 7 achieved higher accuracy than the previous method based on AlexNet.
Figure 14. Incorrectly detected gaze zones with our system. Left, middle, and right figures respectively show the cases that the driver looks at gaze zones of (a) 1, 7, and 14; (b) 15, 2, and 8; (c) 9, 3, and 16.
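For reference, the size of the first AlexNet convolutional layer and the Gaussian weight initialization described for the previous method can be checked with a few lines of Python. This is only an illustrative sketch using the standard library, not the training code used for the comparison; the function names and the random seed are assumptions.

```python
import random
from typing import List, Tuple


def conv_param_count(num_filters: int, kernel_h: int, kernel_w: int,
                     in_channels: int) -> Tuple[int, int]:
    """Return (number of weights, number of biases) of a convolutional layer."""
    weights = num_filters * kernel_h * kernel_w * in_channels
    return weights, num_filters


def gaussian_init(count: int, mean: float = 0.0, std: float = 0.01,
                  seed: int = 0) -> List[float]:
    """Draw `count` initial weights from a Gaussian with the given mean and std."""
    rng = random.Random(seed)  # fixed seed (assumption) for repeatability
    return [rng.gauss(mean, std) for _ in range(count)]


# First AlexNet convolutional layer: 96 filters of size 11 x 11 x 3.
num_weights, num_biases = conv_param_count(96, 11, 11, 3)
initial_weights = gaussian_init(num_weights)
sample_mean = sum(initial_weights) / len(initial_weights)
print(num_weights, num_biases)
print(f"sample mean of initial weights: {sample_mean:.5f}")
```

The layer has 34,848 weights and 96 biases, and the sample mean of the drawn weights stays close to zero, as expected for a zero-mean initialization.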

Comparison with Open Database
In the next experiment, we compared the accuracies of our method with those of the previous method [38] on the open Columbia gaze dataset CAVE-DB [77]. It is a large gaze database of 56 people with 5880 images over varying head poses and gaze directions. There are 105 gaze directions: 5 head poses with 21 gaze directions per head pose. Excluding the images of severe gaze direction, we chose 13 gaze direction images for all people, considering the driver's gaze in the car environment of Figure 3. Examples of images with gaze zones are shown in Figure 15.
We obtained augmented data from the selected data to make a fair comparison. Then we performed two-fold cross validation as in the experiments with DDGC-DB1. As in those experiments, augmented data were used only for training and original data were used for testing. Results are again summarized in the form of a confusion matrix and the estimation rates SCER and LCER. Table 11 shows the average confusion matrix obtained from the first- and second-fold cross validation by the previous method [38] on the open database, and Table 12 shows the same for the proposed method. The results show that our proposed method achieved better accuracy on CAVE-DB than the previous method [38]. As can be seen in Table 12, the highest accuracy achieved by our method is 88.9% at gaze region 12. That is much higher than that of the previous method, i.e., 70.1% at gaze region 12, as shown in Table 11. We further compared the accuracies of the previous method [38] and the proposed method through the average estimation rates SCER and LCER from two-fold cross validation, as shown in Table 13. Note that with CAVE-DB, the average SCER and LCER of the proposed method are 77.7% and 96.3%, respectively. These are higher than those of the previous method, i.e., 53.1% and 88.7%, respectively. Hence, the proposed driver gaze classification of scheme 1 of Figure 7 achieved higher accuracy than the previous method [38] based on AlexNet on CAVE-DB.
In Figure 16, we show some examples of gaze zones correctly classified in terms of SCER by our proposed method (scheme 1). As shown in this figure, although the images are collected from different people with different head and eye directions, our proposed method can classify the gaze zone with a high level of accuracy. In Figure 17, we show some examples of gaze zones incorrectly classified in terms of SCER by our proposed method (scheme 1). These errors are caused by the variation of head and eye rotations even when gazing at the same zone (for example, compare the left figures of Figures 16b and 17b for zone 2). Another cause of errors is incorrect detection of facial landmarks due to the face fixture used in CAVE-DB, as shown in the center figure of Figure 17c.
Figure 17. Incorrectly detected gaze zones with our system on CAVE-DB. Left, middle, and right figures respectively show the cases that the user looks at gaze zones of (a) 1, 4, and 7; (b) 2, 5, and 8; (c) 3, 6, and 9 of Figure 15.
As the last experiment, we measured the processing speed of our method on DDGC-DB1 and CAVE-DB. The average processing time was 12.72 ms per image on DDGC-DB1 and 11.21 ms per image on CAVE-DB. Thus, our system can operate at a speed of 78.6 (1000/12.72) to 89.2 (1000/11.21) frames per second.
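The conversion from per-image processing time to frame rate is a simple reciprocal, sketched below with the two measured times from the text. The function name `frames_per_second` is hypothetical.

```python
def frames_per_second(ms_per_image: float) -> float:
    """Convert an average processing time in milliseconds to frames per second."""
    if ms_per_image <= 0:
        raise ValueError("processing time must be positive")
    return 1000.0 / ms_per_image


# Average processing times reported for the two databases.
for name, ms in (("DDGC-DB1", 12.72), ("CAVE-DB", 11.21)):
    print(f"{name}: {frames_per_second(ms):.1f} frames per second")
```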

Effect of the Errors in Facial Landmark Detection on the Accuracies of Gaze Detection
We checked how resistant our method is to potential errors in facial landmark detection. In our method, the regions of the two eyes and the face for the CNN input of Figure 7 are determined by the positions of facial landmarks, as shown in Figure 1 (step (3)). Therefore, errors in facial landmark detection cause errors in these input regions. We measured the accuracies of our gaze detection method according to the errors in facial landmark positions on the DDGC-DB1 database. As shown in Table 14, when the errors are less than ±8 pixels in the X- and Y-axes, the accuracy degradation of our method is very small (0.9% in SCER and 0.5% in LCER compared to the accuracies with no detection error in facial landmarks). However, when the errors are equal to or larger than ±8 pixels, the accuracy degradation increases (larger than about 3% in SCER and larger than about 2.1% in LCER from ±8 to ±10 pixels). Thus, our method is resistant to errors of up to ±7 pixels in the X- and Y-axes in facial landmark detection.
Table 14. SCER and LCER of our method according to the errors in facial landmark detection by Dlib facial feature tracker (X-axis of 0 and Y-axis of 0 mean no detection error in facial landmarks).

Eye Safety
We measured the level of danger posed by our NIR illuminator. For that, we evaluated the eye safety of the NIR illuminator based on the threshold limit values (TLV) of the American Conference of Governmental Industrial Hygienists (ACGIH) [78,79]. The ACGIH exposure limit for infrared radiation is defined by the following equation; for exposures greater than 1000 s, irradiance must instead be limited to less than 10 mW/cm² [78,79]:
$$\sum_{\lambda = 700\,\mathrm{nm}}^{3000\,\mathrm{nm}} E_{\lambda} \cdot \Delta\lambda \le 1.8\, t^{-3/4}\ \mathrm{W/cm^2} \qquad (10)$$
where λ represents the wavelength of incident light, the summation is over the λ range where the light level is significant, E_λ represents the irradiance into the eye in W/cm², and t represents the exposure time in seconds. In the proposed system, the exposure time t of the NIR illuminator was a maximum of 900 s (time-out), after which the NIR illuminator is automatically turned off for 0.1 s. Then it is turned on for 900 s again, and this procedure of turning on and off is repeated. Thus, based on Equation (10), the maximum ACGIH exposure limit for infrared radiation is about 10.95 mW/cm² (1.8 × 900^(−3/4) W/cm²). The experimental results showed that the infrared radiation power of our NIR illuminator (0.53 mW/cm²) was much less than the limit, so the proposed system met the safety requirements.
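The limit quoted above can be reproduced numerically. The sketch below evaluates the right-hand side of Equation (10) for the 900 s time-out and compares it with the measured irradiance of 0.53 mW/cm². The function name is hypothetical, and applying the formula to every exposure of 1000 s or less is an assumption based on the text stating only the rule for longer exposures.

```python
def acgih_ir_limit_mw_per_cm2(exposure_s: float) -> float:
    """Infrared exposure limit in mW/cm^2 for an exposure time in seconds."""
    if exposure_s <= 0:
        raise ValueError("exposure time must be positive")
    if exposure_s > 1000.0:
        # Fixed limit stated in the text for exposures greater than 1000 s.
        return 10.0
    # Equation (10): 1.8 * t^(-3/4) W/cm^2, converted to mW/cm^2.
    return 1.8 * exposure_s ** -0.75 * 1000.0


MEASURED_MW_PER_CM2 = 0.53   # irradiance of the NIR illuminator from the text
TIME_OUT_S = 900.0           # maximum continuous exposure from the text

limit = acgih_ir_limit_mw_per_cm2(TIME_OUT_S)
print(f"limit at {TIME_OUT_S:.0f} s: {limit:.2f} mW/cm^2")
print("within limit:", MEASURED_MW_PER_CM2 < limit)
```

Because the limit decreases as the exposure time grows, the 900 s time-out is the worst case within one on/off cycle of the illuminator.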

Conclusions
In this study, we proposed a CNN-based method of driver gaze classification in the vehicular environment. For driver gaze classification, face, left eye, and right eye images are obtained from the input image based on the ROI defined by facial landmarks from the Dlib facial feature tracker. We fine-tuned a pre-trained CNN model (VGG-face network) separately for the cropped images of the face, left eye, and right eye to obtain the required gaze features from the fully connected layer of the network. Three distances based on the obtained features are combined to produce the final classification result. The impact of the PCCR vector on gaze classification was also studied: we compared the performance of the proposed CNN-based gaze classification method with and without the PCCR vector, and the results verified that driver gaze classification without the PCCR vector is more suitable in terms of accuracy. We also compared the accuracies of our method with those of a previous method. Evaluations were also performed on the open CAVE-DB, and they confirmed that our method outperformed the previous method. Based on the processing time, our system can operate at a speed of 78.6 to 89.2 frames per second.
As shown in Figure 4, the Dlib facial feature tracker cannot detect the positions of the pupil and iris. Therefore, when the driver gazes at a position close to our gaze-tracking camera by eye movement only (without head movement), a method using only the facial landmarks from the Dlib facial feature tracker cannot detect the accurate gaze position. To solve this problem, the pupil center and corneal reflection position are detected by the method outlined in Section 4.3, and the PCCR vector was used for scheme 2 of Figure 8. However, the accuracy of scheme 2 is lower than that of scheme 1, which does not use the PCCR vector, as shown in Tables 7 and 8.
We used a NIR camera and illuminator in order to exploit the movement of the pupil within the eye region (iris region) for more accurate gaze estimation. However, our method can also be applied to images from a visible light camera without an additional illuminator, as shown by the experiments with the open Columbia gaze dataset CAVE-DB [77] explained in Section 5.3.4. In the case of severe head and eye rotation, which causes one of the two eyes to disappear from the captured image, the error of gaze estimation can increase; this is a limitation of our research. It can be addressed by using multiple cameras, but that can also increase the processing time. In future work, we will investigate a solution to this problem that uses multiple cameras while maintaining a fast processing speed. In addition, we will examine the effect of image resolution, blurring level, and severe occlusion of the face image on the accuracy of the gaze estimator.