A Lightweight Attention-Based Network towards Distracted Driving Behavior Recognition
Round 1
Reviewer 1 Report
The article is devoted to developing a new lightweight attention-based neural network (LWANet) for the problem of image classification. The study's relevance is confirmed by the fact that distracted driving is currently a global problem leading to traffic accidents with fatal outcomes and injuries. Therefore, this article addresses this problem and proposes a new lightweight attention-based network for image classification problems. To reduce the computation cost and the number of trainable parameters, the authors propose replacing the standard convolution layers with depthwise separable convolutions and optimizing the classic VGG16 architecture, reducing the trainable parameters by 98.16%. Drawing on the attention mechanism in cognitive science, a lightweight inverted residual attention module is proposed to mimic human attention, extract more specific features, and improve overall accuracy. When tested on two public datasets with only 1.22 million trainable parameters and a model file size of 4.68 MB, the quantitative experimental results showed that the proposed neural network architecture provides state-of-the-art overall performance in deep learning-based distracted behavior recognition.
Despite the high quality of the article, some shortcomings need to be corrected.
- It is recommended to expand the abstract with obtained numerical results.
- The section Proposed Method also contains state-of-the-art methods. It is recommended to separate them from the one proposed by the authors.
- The purpose of choosing such neural network architecture should be described.
- The datasets used for the experimental study should be described in more detail.
- The conclusions section should include numerical results obtained by the authors.
- The scientific novelty of the paper should be highlighted.
In summary of my comments, I recommend that the manuscript be accepted after minor revision.
Author Response
Response to Reviewer 1 Comments
We greatly appreciate the Reviewer’s valuable comments and concerns, which have led us to significantly improve the quality of our manuscript. We have intensively revised our paper and rewritten some parts of it. We hope this makes our work easier to understand and our contributions, novelties, and key techniques more clearly highlighted. These revisions are marked using the “Track Changes” function in Word. Below are our point-by-point responses to the Reviewer’s comments.
Point 1: It is recommended to expand the abstract with obtained numerical results.
We greatly appreciate this valuable comment. Expanding the abstract with the obtained numerical results summarizes the work of this paper more clearly. We have added the related content in the revised version. (Lines 20-23)
Point 2: The section Proposed Method also contains state-of-the-art methods. It is recommended to separate them from the one proposed by the authors.
Thanks for this comment. In fact, the state-of-the-art methods described there are components of the LWANet proposed in this paper; we included them in the Proposed Method section to make the description sufficient. To make the organization of this part more reasonable, we have renamed Section 3 to “Method”. (Line 149)
Point 3: The purpose of choosing such neural network architecture should be described.
We appreciate this suggestion. It is important to describe the purpose of choosing such a neural network architecture. We have added the related content in the revised version. (Lines 247-251)
Point 4: The datasets used for the experimental study should be described in more detail.
We appreciate this suggestion. A more detailed description of the datasets describes the experimental process more fully and makes the experiments more convincing. The SF3D and AUC2D datasets both contain ten classes: safe driving, texting-right, talking on the phone-right, texting-left, talking on the phone-left, operating the radio, drinking, reaching behind, hair and makeup, and talking to passenger. We have added example images of the ten classes in the SF3D and AUC2D datasets in Section 4.1, shown in Figure 5 and Figure 6, respectively, and have added the related content in the revised version. (Lines 278-279, Lines 285-286) In addition, the number of samples in each class of the SF3D and AUC2D datasets can be found in Table 3.
Point 5: The conclusions section should include numerical results obtained by the authors.
We appreciate this suggestion. We had overlooked the importance of numerical results in the conclusions section, which made it hard to understand. We have therefore rewritten the conclusions section and added the numerical results of our work. (Lines 470-482)
Point 6: The scientific novelty of the paper should be highlighted.
We greatly appreciate this valuable comment. Highlighting the scientific novelty better reflects the value of this paper and makes the content easier to understand. We have revised the Introduction to highlight the scientific novelty of this paper and added the related content in the revised version. (Lines 63-78) Thanks again for this kind reminder.
Author Response File: Author Response.docx
Reviewer 2 Report
The article has a good structure and is well written. The experiments are well done. So, the article is at an acceptable level and can be accepted after minor revisions:
- It is necessary to talk about the role of the parameters of the proposed algorithm in a separate section.
- What are the advantages and disadvantages of this study compared to the existing studies in this area?
- There are some grammatical mistakes and typo errors.
2- Introduction: the authors included the needs, the importance of the research topic, the methods, and the main contributions, besides the paper structure. No comments.
3- Related works: the authors have clearly reviewed and illustrated the most important parts of the state-of-the-art methods, including lightweight network design to solve the trade-off between computation cost and overall accuracy. No comments.
4- Preliminaries: the main concepts used in the proposed method are clearly explained. No comments.
5- Methodology: The proposed Inverted Residual Attention Module (IRAM), the proposed LWANet Network Structure, and their schematics are clearly explained. No comments
6- Experiments: a) Dataset: the authors used two publicly available datasets to train and evaluate the performance of the network: the State Farm Distracted Driver Detection (SF3D) and the American University in Cairo Distracted Driver (AUC2D). No comments. b) Evaluation Metrics: the evaluation metrics are presented in a good manner: Accuracy, Loss, GPU speed, and Android Phone speed. No comments. c) Parameter Setting: It is necessary to talk about the role of the parameters of the proposed algorithm in a separate section.
d) Results and analysis: 1) the results in tables and graphs are clearly presented. No comments.
2) What are the advantages and disadvantages of this study compared to the existing studies in this area? What makes your method outperform other methods? In which part, and how? Critical analysis is needed.
7- Conclusion: methods, results and conclusion are explained besides future work is mentioned. No Comments
8- References: the references provided are applicable and sufficient.
The final decision for this paper is acceptance with minor corrections.
Author Response
Response to Reviewer 2 Comments
We greatly appreciate the Reviewer’s valuable comments and concerns, which have led us to significantly improve the quality of our manuscript. We have intensively revised our paper and rewritten some parts of it. We hope this makes our work easier to understand and our contributions, novelties, and key techniques more clearly highlighted. These revisions are marked using the “Track Changes” function in Word. Below are our point-by-point responses to the Reviewer’s comments.
Point 1: It is necessary to talk about the role of the parameters of the proposed algorithm in a separate section.
We appreciate this suggestion. To make the experimental results clearer, it is necessary to describe the parameters of the proposed algorithm in a separate section. We have added Section 4.2 to the experimental results part to explain our parameter settings. (Lines 297-306)
Point 2: What are the advantages and disadvantages of this study compared to the existing studies in this area?
Thanks for this comment. The main advantages of this study are as follows. First, we propose a lightweight Inverted Residual Attention Module (IRAM) to address the relatively low accuracy of current lightweight networks. Second, because most existing methods cannot be deployed on edge devices due to high model complexity, we embed depthwise separable convolutions into the classic VGG16 and optimize the network structure, which greatly reduces the model size. Third, a subtract-mean filter is used in image preprocessing, which improves the environmental adaptability of the model compared with other methods. (Lines 66-78)
However, there are also some shortcomings in this study. The parameters and size of the model need to be further compressed to better suit edge devices. In addition, the proposed model has only been verified on the SF3D and AUC2D datasets; limited by the experimental conditions, it has not been tested in actual driving scenes. Therefore, in future work, we plan to apply our network to video-based distracted driving detection. (Lines 483-484)
Point 3: There are some grammatical mistakes and typo errors.
Thanks for your comment. We apologize for the grammatical mistakes and typographical errors. We have corrected them according to your suggestions. (Lines 29, 32, 44, 60-62, 70, 93)
Author Response File: Author Response.docx