Article
Peer-Review Record

Convolutional Two-Stream Network Using Multi-Facial Feature Fusion for Driver Fatigue Detection

Future Internet 2019, 11(5), 115; https://doi.org/10.3390/fi11050115
by Weihuang Liu, Jinhao Qian, Zengwei Yao, Xintao Jiao and Jiahui Pan *
Reviewer 1: Anonymous
Submission received: 23 February 2019 / Revised: 19 April 2019 / Accepted: 29 April 2019 / Published: 14 May 2019
(This article belongs to the Special Issue on the Future of Intelligent Human-Computer Interface)

Round 1

Reviewer 1 Report

How many classes are there in this classification problem? What is the distribution of the classes in the dataset?

If the data is unbalanced, is accuracy an appropriate evaluation metric?

Please include other metrics as well, for example, the (averaged) F1-score, etc.

Please take the features before the softmax layer as inputs and test them with other classifiers, including k-NN, an enhanced k-NN algorithm (DOI: 10.1109/THMS.2015.2453203), SVM (DOI: 10.1145/1961189.1961199), and Random Forest or XGBoost. The hyper-parameters for those classifiers should be chosen carefully, e.g., using cross-validation and grid search.

A major revision is needed for this paper.

Author Response

The authors are grateful to the first reviewer for the insightful comments and constructive suggestions. In light of your comments and suggestions, the paper has been revised. Please see our point-by-point responses below.

1. How many classes are there in this classification problem? What is the distribution of the classes in the dataset?

Response: There are five classes in our classification problem: normal, drowsiness, nodding, talking, and yawning. The distribution of these classes in the dataset is approximately 5:9:2:5:3. According to your comment, we have added some content to the Methods section. Please see the following paragraph extracted from the present version.

“We trained our models on the training dataset with 5-fold cross-validation and used the evaluation dataset for testing. Images were extracted from the videos at a rate of one frame in every three and labeled into five driver states: normal, drowsiness, nodding, talking, and yawning; the distribution of these classes in the dataset was approximately 5:9:2:5:3.”

(p. 9, Section 3.2 Experiment, the first paragraph)

2. If the data is unbalanced, is accuracy an appropriate evaluation metric?

3. Please include other metrics as well, for example, the (averaged) F1-score, etc.

Response: Accuracy alone is not an appropriate evaluation metric for an unbalanced dataset. According to your suggestions, we have added the F1-score as one of the evaluation metrics. Please see the following paragraph extracted from the present version.

Considering the problem of unbalanced data, we added the F1-score as an evaluation metric. Table 3 shows the detailed predictions for the different states using the GFDN model. From Table 3 we obtain the precision and recall rates shown in Table 4; based on these, the averaged F1-score is calculated to be 0.9688.

(p. 9, Section 3.2 Experiment, the fourth paragraph)

Table 3. Drowsiness detection details for different states of the NTHU-DDD dataset in GFDN (rows: real state; columns: predicted state).

| Real \ Predicted | normal | drowsiness | nodding | talking | yawning |
|------------------|--------|------------|---------|---------|---------|
| normal           | 9643   | 269        | 24      | 331     | 24      |
| drowsiness       | 164    | 21234      | 112     | 12      | 121     |
| nodding          | 11     | 183        | 5833    | 23      | 6       |
| talking          | 204    | 9          | 39      | 12614   | 38      |
| yawning          | 15     | 167        | 19      | 19      | 9970    |

Table 4. Precision rate and recall rate (%) for different states.

|                | normal | drowsiness | nodding | talking | yawning |
|----------------|--------|------------|---------|---------|---------|
| Precision rate | 96.07  | 97.12      | 96.78   | 97.03   | 98.13   |
| Recall rate    | 93.70  | 98.11      | 96.31   | 97.75   | 97.84   |
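For reference, the rates in Table 4 and the reported F1-score follow directly from the confusion matrix in Table 3; a minimal NumPy sketch (the macro averaging shown here is our assumption, though it reproduces the reported 0.9688):

```python
import numpy as np

# Confusion matrix from Table 3 (rows: real states, columns: predicted states,
# in the order normal, drowsiness, nodding, talking, yawning).
cm = np.array([
    [9643,   269,   24,   331,   24],
    [ 164, 21234,  112,    12,  121],
    [  11,   183, 5833,    23,    6],
    [ 204,     9,   39, 12614,   38],
    [  15,   167,   19,    19, 9970],
])

tp = np.diag(cm).astype(float)
precision = tp / cm.sum(axis=0)  # per-class precision (Table 4, first row)
recall = tp / cm.sum(axis=1)     # per-class recall (Table 4, second row)
f1_per_class = 2 * precision * recall / (precision + recall)
print(f1_per_class.mean())       # macro-averaged F1 ~= 0.9688
```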

 

4. Please take the features before the softmax layer as inputs and test them with other classifiers, including k-NN, an enhanced k-NN algorithm (DOI: 10.1109/THMS.2015.2453203), SVM (DOI: 10.1145/1961189.1961199), and Random Forest or XGBoost. The hyper-parameters for those classifiers should be chosen carefully, e.g., using cross-validation and grid search.

Response: According to your suggestions, we have added some content to the Methods and Results sections. Specifically, we took the features before the softmax layer as inputs, tested them with other classifiers, including k-NN, SVM, and Random Forest, and compared their performance with softmax. We selected the hyper-parameters using cross-validation and grid search. Please see the following paragraphs extracted from the present version.

All of the layer weights were randomly initialized. We chose the hyper-parameters using grid search. The network was trained using batch gradient descent with a batch size of 128 and a dropout rate of 0.2. An initial learning rate of 0.1 was used with the Adadelta optimizer. Training was stopped when the validation loss did not improve for 50 iterations; the model was trained for around 230 iterations. The results are shown in Tables 1 and 2.

(p. 9, Section 3.2 Experiment, the first paragraph)
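For illustration, the training configuration described above might be expressed as follows in Keras (a sketch under the stated hyper-parameters; `model` and the data arrays are placeholders, not the authors' code):

```python
import tensorflow as tf

# Adadelta with an initial learning rate of 0.1, batch size 128, and early
# stopping when validation loss fails to improve for 50 iterations.
model.compile(optimizer=tf.keras.optimizers.Adadelta(learning_rate=0.1),
              loss="categorical_crossentropy", metrics=["accuracy"])
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=50,
                                              restore_best_weights=True)
# epochs=300 is an upper bound; the response reports training stopped at ~230.
model.fit(x_train, y_train, batch_size=128, epochs=300,
          validation_data=(x_val, y_val), callbacks=[early_stop])
```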

We show another result in Table 2. We took the features before the softmax layer of the GFDN model as inputs and tested them with other classifiers, including KNN, SVM, and Random Forest, comparing their performance with softmax. As shown below, the accuracy of each derived model drops slightly.

(p. 9, Section 3.2 Experiment, the third paragraph)
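A sketch of this procedure, assuming a trained Keras model (here named `gfdn`, a placeholder) and scikit-learn; the SVM parameter grid is illustrative:

```python
from tensorflow.keras import Model
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Use activations of the layer just before softmax as fixed feature vectors.
feature_extractor = Model(inputs=gfdn.input, outputs=gfdn.layers[-2].output)
features = feature_extractor.predict(x_train)  # x_train: placeholder inputs

# Grid-search SVM hyper-parameters with 5-fold cross-validation.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(features, y_train)  # y_train: integer class labels (placeholder)
```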

Table 2. Drowsiness detection accuracies (%) for different states of the NTHU-DDD dataset in our model and derived models.

| State      | Ours-KNN | Ours-SVM | Ours-RF | Ours-2 |
|------------|----------|----------|---------|--------|
| normal     | 92.86    | 93.96    | 94.61   | 93.70  |
| drowsiness | 98.02    | 98.05    | 97.49   | 98.11  |
| nodding    | 96.00    | 96.51    | 96.21   | 96.31  |
| talking    | 97.37    | 97.18    | 96.83   | 97.75  |
| yawning    | 97.19    | 97.44    | 97.25   | 97.84  |
| Average    | 96.67    | 96.98    | 96.70   | 97.06  |

 


Author Response File: Author Response.pdf

Reviewer 2 Report

General description of the work:

 

In the present work, a driver fatigue detection algorithm using two-stream network models with multi-facial features is presented. The algorithm consists of four parts: positioning the mouth and eyes with multi-task cascaded convolutional networks; extracting static features from partial facial images; extracting dynamic features from partial facial optical flow; and combining static and dynamic features in a two-stream neural network to make the classification. The main characteristic of the contribution is the combination of a two-stream network and multi-facial features for driver fatigue detection. Two-stream networks can combine static and dynamic image information, while partial facial images as network inputs focus on fatigue-related information. Gamma correction is applied to enhance image contrast and ensure reliable decisions by the CNNs.
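As an aside, the gamma correction mentioned here can be implemented with a simple lookup table; a minimal NumPy sketch in which the gamma value is illustrative, not the paper's:

```python
import numpy as np

def gamma_correct(image: np.ndarray, gamma: float = 1.5) -> np.ndarray:
    """Apply gamma correction to an 8-bit image to enhance contrast."""
    # Map each intensity 0..255 through the power law once, then index.
    table = ((np.arange(256) / 255.0) ** (1.0 / gamma) * 255).astype(np.uint8)
    return table[image]
```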

 

 

Remarks:

 

The sizes of all arrays and matrices in Figs. 2, 3, 4 and 8 have to be defined, described, and explained beforehand.

 

English in the entire paper has to be corrected.

For example:

 

Page 3, rows 97 – 101, correct English.

Page 4, row 104: write uses.

Row 126: write “of term”.

Write: left part of lips and right part of lips instead of left lips and right lips.

Row 139: missing “of”.

Row 147, write it improves.

Page 5, row 152: remove “of”

 

Although the problem considered and the solutions suggested by the authors are of great practical significance, the paper needs thorough revision.


Author Response

Reviewer 2:

The authors are grateful to the second reviewer for the insightful comments and constructive suggestions. In light of your comments and suggestions, the paper has been revised. Please see our point-by-point responses below.

1. The sizes of all arrays and matrices in Figs. 2, 3, 4 and 8 have to be defined, described, and explained beforehand.

Response: According to your suggestions, we have added definitions, descriptions, and corresponding explanations of the sizes of all arrays and matrices in Figs. 2, 3, 4 and 8. Please see the following paragraphs extracted from the present version.

Proposal Network (P-Net) (shown in Fig. 2): The main function of this network is to obtain candidate windows and bounding-box regression vectors in the face area. It uses the bounding-box regression to calibrate the candidate windows and then merges highly overlapping candidates by non-maximum suppression (NMS). All input samples are first resized to 12×12×3, and the P-Net output is obtained by 1×1 convolution kernels with three different output channels. The P-Net output is divided into three parts: (1) face classification: the probability that the input image is a face; (2) bounding box: the position of the bounding rectangle; (3) facial landmark localization: the five key points of the input face sample.

(p. 3, Section 2.1 Face detection and key area positioning, the fourth paragraph)


Fig. 2. P-Net Network structure.
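The non-maximum suppression step used by P-Net (and the later stages) is standard greedy NMS; a minimal NumPy sketch, not the exact MTCNN implementation:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring boxes and
    drop candidates that overlap a kept box too heavily.
    boxes: (N, 4) array of [x1, y1, x2, y2]; returns indices of kept boxes."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the current box with all remaining candidates.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]
    return keep
```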

Refine Network (R-Net) (shown in Fig. 3): This network also removes false positive regions through bounding-box regression and non-maximum suppression. However, since it has one more fully connected layer than P-Net, it suppresses false positives more effectively. All input samples are first resized to 24×24×3, and the R-Net output is obtained through a fully connected layer. The R-Net output is divided into three parts: (1) face classification: the probability that the input image is a face; (2) bounding box: the position of the bounding rectangle; (3) facial landmark localization: the five key points of the input face sample.

(p. 4, Section 2.1 Face detection and key area positioning, the fifth paragraph)


Fig. 3. R-Net Network structure.

Output Network (O-Net) (shown in Fig. 4): This network has one more convolutional layer than R-Net, so its results are finer. It works similarly to R-Net but supervises the face area and outputs five coordinates representing the left eye, the right eye, the nose, and the left and right parts of the lips, respectively. All input samples are first resized to 48×48×3, and the O-Net output is obtained through a fully connected layer. The O-Net output is divided into three parts: (1) face classification: the probability that the input image is a face; (2) bounding box: the position of the bounding rectangle; (3) facial landmark localization: the five key points of the input face sample.

(p. 4, Section 2.1 Face detection and key area positioning, the sixth paragraph)


Fig. 4. O-Net Network structure.

The fatigue detection network, as shown in Fig. 8, includes four sub-networks. The input image to each sub-network is first resized to 50×50×3. The first sub-network extracts features from the optical flow of the left eye; the second extracts features from the left eye image; the third extracts features from the optical flow of the mouth; and the fourth extracts features from the mouth image. The mouth and eye areas obtained after detection and cropping, together with the optical flow computed over those areas, are fed into the four sub-networks. After several layers of convolution and pooling, the left eye sub-network and the left eye optical flow sub-network are first fused to obtain combined left eye regional features, while the mouth sub-network and the mouth optical flow sub-network are merged to obtain combined mouth regional features. To capture the characteristics of the global region, we merge these two fused streams and feed the re-integrated features into the fully connected layers. Finally, the data are passed to the softmax layer for classification, yielding a 1×5 vector that represents the probability of each class. To avoid over-fitting, L2 regularization is added at each convolutional layer, and dropout is applied at each fully connected layer.

(p. 15, Section 2.4 Fatigue detection, the fifth paragraph)
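A minimal Keras-style sketch of the four-sub-network, two-stage fusion described above; the layer widths, kernel sizes, and dense-layer size are illustrative assumptions, not the paper's exact configuration:

```python
from tensorflow.keras import layers, regularizers, Model

def branch(name):
    """One sub-network: a small conv/pool stack over a 50x50x3 input."""
    inp = layers.Input(shape=(50, 50, 3), name=name)
    x = inp
    for filters in (32, 64):  # illustrative widths, not the paper's exact ones
        x = layers.Conv2D(filters, 3, activation="relu",
                          kernel_regularizer=regularizers.l2(1e-4))(x)  # L2 at each conv
        x = layers.MaxPooling2D()(x)
    return inp, layers.Flatten()(x)

# Four sub-networks: eye image, eye optical flow, mouth image, mouth optical flow.
eye_in, eye = branch("left_eye")
eye_flow_in, eye_flow = branch("left_eye_flow")
mouth_in, mouth = branch("mouth")
mouth_flow_in, mouth_flow = branch("mouth_flow")

# First-stage fusion: static + dynamic features per facial region.
eye_region = layers.concatenate([eye, eye_flow])
mouth_region = layers.concatenate([mouth, mouth_flow])

# Second-stage fusion: global features, dropout at the fully connected layer.
x = layers.concatenate([eye_region, mouth_region])
x = layers.Dense(256, activation="relu")(x)
x = layers.Dropout(0.2)(x)
out = layers.Dense(5, activation="softmax")(x)  # 1x5 class-probability vector

model = Model([eye_in, eye_flow_in, mouth_in, mouth_flow_in], out)
```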

 

2. English in the entire paper has to be corrected.

Response: We have improved the English as you suggested. Specifically, after completing the technical revisions, we sent the manuscript to a professional editing company for English improvement.


Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Did you ensure the class distribution in each fold in your 5-fold CV experiments follow the class distribution of the whole dataset?

The classic k-NN was included in your experiments together with RF and SVM; however, its counterpart (the enhanced k-NN method in DOI: 10.1109/THMS.2015.2453203) was missing. Please include that enhanced k-NN algorithm as well, since its performance is expected to be similar to RF and SVM. Parameter k may need to be tuned through CV.

Figures 2, 3, and 4 are of low quality. They should be in a vector image format.

The references are not in a standard format. Please remove [J], [C], \\[C]

Ref [32] can be removed since it is not a scientific paper.

Author Response

Reviewer 1:

The authors are grateful to the first reviewer for the insightful comments and constructive suggestions. In light of your comments and suggestions, the paper has been revised. Please see our point-by-point responses below.

1. Did you ensure the class distribution in each fold in your 5-fold CV experiments follow the class distribution of the whole dataset?

Response: We trained our models on the training dataset with stratified 5-fold cross-validation, which ensures that each class is approximately equally represented in each fold. Furthermore, we have added the corresponding reference [32] to the current manuscript. Please see the following paragraph extracted from the present version.

“We trained our models on the training dataset with stratified 5-fold cross-validation [32], where the folds were chosen such that each fold had nearly the same class distribution as the original dataset, and used the evaluation dataset for testing.”

(p. 9, Section 3.2 Experiment, the first paragraph)
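For reference, stratified folds of this kind can be produced with scikit-learn's StratifiedKFold; a minimal sketch in which `X` and `y` are placeholders for the extracted frames' features and their five-state labels:

```python
from sklearn.model_selection import StratifiedKFold

# X: feature array (n_samples, n_features); y: the five driver-state labels.
# Both are placeholders for the paper's actual data.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    # Each fold preserves the ~5:9:2:5:3 class distribution of the full set.
```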

 

2. The classic k-NN was included in your experiments together with RF and SVM; however, its counterpart (the enhanced k-NN method in DOI: 10.1109/THMS.2015.2453203) was missing. Please include that enhanced k-NN algorithm as well, since its performance is expected to be similar to RF and SVM. Parameter k may need to be tuned through CV.

Response: According to your suggestions, we have added the enhanced k-NN method and compared its performance with that of the other classifiers. Parameter k was tuned through cross-validation and finally chosen to be 3. Furthermore, we have added the corresponding references [33-35] to the current manuscript. Please see the following paragraph extracted from the present version.

“We show another result in Table 2. We took the features before the softmax layer of the GFDN model as inputs and tested them with other classifiers. We chose and tuned four classification algorithms: k-Nearest Neighbors (KNN) [33], Centroid Displacement-Based k-Nearest Neighbors (CDNN) [34], support vector machine (SVM) [35], and random forest (RF) [36]. Parameter k was tuned through cross-validation and chosen to be 5 for KNN and 3 for CDNN. As shown below, the accuracy of each derived model drops slightly.”

Table 2. Drowsiness detection accuracies (%) for different states of the NTHU-DDD dataset in GFDN and derived models.

| State      | GFDN-KNN | GFDN-CDNN | GFDN-SVM | GFDN-RF | GFDN  |
|------------|----------|-----------|----------|---------|-------|
| normal     | 92.86    | 93.42     | 93.96    | 94.61   | 93.70 |
| drowsiness | 98.02    | 97.93     | 98.05    | 97.49   | 98.11 |
| nodding    | 96.00    | 96.11     | 96.51    | 96.21   | 96.31 |
| talking    | 97.37    | 97.02     | 97.18    | 96.83   | 97.75 |
| yawning    | 97.19    | 97.21     | 97.44    | 97.25   | 97.84 |
| Average    | 96.67    | 96.78     | 96.98    | 96.70   | 97.06 |

(p. 9, Section 3.2 Experiment, the third paragraph)
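For reference, tuning k through cross-validation can be sketched as below for the classic KNN (scikit-learn ships no CDNN implementation, so the enhanced variant would need a custom estimator; `features` and `labels` are placeholders for the pre-softmax features and driver-state labels):

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

scores = {}
for k in range(1, 16, 2):  # odd candidate values; the range is illustrative
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, features, labels, cv=5).mean()
best_k = max(scores, key=scores.get)  # k = 5 was selected for KNN in the paper
```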

 

3. Figures 2,3,4 are in low quality. They should be in a vector image format.

Response: According to your suggestions, we have uploaded Figures 2, 3 and 4 in vector image format. Please see the following figure captions from the present version.

Fig. 2. P-Net Network structure.

(p. 4, Section 2.1 Face detection and key area positioning, the fourth paragraph)

Fig. 3. R-Net Network structure.

(p. 4, Section 2.1 Face detection and key area positioning, the fifth paragraph)

Fig. 4. O-Net Network structure.

(p. 4, Section 2.1 Face detection and key area positioning, the sixth paragraph)

 

4. The references are not in a standard format. Please remove [J], [C], \\[C]

Response: According to your suggestions, we have corrected the format of all the references.

 

5. Ref [32] can be removed since it is not a scientific paper.

Response: According to your suggestions, we have removed reference [32].



Reviewer 2 Report

The authors have corrected the paper according to the reviewer's requirements.

Author Response

Thank you!

Round 3

Reviewer 1 Report

I am OK with the response from the authors.
