Risky-Driving-Image Recognition Based on Visual Attention Mechanism and Deep Learning

Risky driving behavior seriously impairs a driver's ability to react, execute and judge, and is one of the major causes of traffic accidents. Timely and accurate identification of the driver's status is particularly important, since it allows drivers to quickly adjust their driving status and avoid accidents. To further improve identification accuracy, this paper proposes a risky-driving image-recognition system based on the visual attention mechanism and deep-learning technology to identify four types of driving-status images: normal driving, driving while smoking, driving while drinking and driving while talking. With reference to ResNet, we build four deep-learning models of different depths and embed the proposed visual attention blocks into the image-classification models. The experimental results indicate that, with the visual attention modules embedded, shallower ResNet models can exceed the classification accuracy of deeper ResNet models without a significant change in model complexity, so recognition accuracy improves without reducing recognition efficiency.


Introduction
Motor vehicles have become an important means of transportation for residents' daily travel and for cargo transport. The number of vehicles in use and its annual increase are growing explosively, which unavoidably causes an increasing number of traffic safety problems and accidents. Against this background, how to reduce the probability of traffic safety problems and improve traffic safety has become a common concern of scholars. Risky driving is one of the essential factors leading to traffic safety problems. It reduces the driver's control over the vehicle, which in turn leaves the driver unable to perform normal maneuvers such as steering, gear shifting, and deceleration [1][2][3].
Statistical results indicate that more than 75% of traffic accidents and traffic safety problems are closely related to irregular driving and risky driving behaviors [4]. For example, during the driving process, calling, drinking and smoking can affect driver attention, making them unable to focus on the driving conditions ahead and the environment around the motor vehicle, which may directly lead to the occurrence of safety accidents. Therefore, it is important to improve the detection capacity of the driver's driving status, and the timely identification and correction of risky driving behaviors can avoid traffic safety problems to the greatest extent [5]. At this stage, a large number of scholars have carried out experimental research on the detection of risky driving and have achieved relatively excellent performance. Among them, the early risky-driving-detection systems were mainly based on vehicle driving information, driver physiological signals or driver facial characteristics, and they also achieved relatively stable detection accuracy supported by accurate sensor devices. However, the traditional risky-driving-detection system still has some application problems, such as a slow detection efficiency, complex detection scheme and difficult application deployment.
The following sections introduce convolutional neural networks, the ResNet architecture, different pooling operation schemes, different types of visual attention modules, and data-augmentation techniques.

Convolutional Neural Networks & ResNet
Convolutional neural networks (CNNs) are a kind of feed-forward neural network with a deep structure and convolution computation. They have a strong learning capability and use the convolutional layer structure to classify input information in a shift-invariant manner [17]. A basic CNN consists of five structures: the input layer, convolutional layer, pooling layer, fully connected layer and classification layer. The CNN network architecture is shown in Figure 1. ResNet [18] is a class of networks designed to solve the gradient explosion and overfitting problems that arise during the model-training phase as the network deepens. The purpose of the residual module (Figure 2a) is to add the features extracted by earlier layers of the model to those of later layers, and by using the shortcut connection (Figure 2b), ResNet effectively solves the problems of network gradient explosion and overfitting during the training process. At the same time, by introducing batch-normalization (BN) layers, ResNet improves training speed and convergence stability. Because of these advantages, this experiment selects ResNets of different depths as the guiding architecture for the risky-driving-image-classification tasks.
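As a minimal sketch of the shortcut connection described above, the following pure-Python function adds a block's input back to the output of its residual branch, y = ReLU(F(x) + x). The names `residual_block`, `branch` and `zero_branch` are illustrative and not taken from the paper; in a real ResNet the branch F consists of convolution and BN layers operating on feature maps rather than flat lists.

```python
def relu(values):
    """Element-wise ReLU on a flat list of floats."""
    return [max(0.0, v) for v in values]


def residual_block(x, branch):
    """Return ReLU(branch(x) + x): the identity shortcut adds the input back.

    `branch` stands in for the stacked conv/BN layers F(x); it must return
    a list with the same length as `x`.
    """
    fx = branch(x)
    if len(fx) != len(x):
        raise ValueError("identity shortcut needs matching shapes")
    return relu([f + xi for f, xi in zip(fx, x)])


# Hypothetical branch that outputs zeros: the block then passes x through.
def zero_branch(x):
    return [0.0 for _ in x]


print(residual_block([1.0, 2.0, 3.0], zero_branch))  # [1.0, 2.0, 3.0]
```

Because the shortcut carries the input forward unchanged, the branch only has to learn a correction to the identity mapping, which is the property the residual module relies on.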

Pooling Operation & Different Pooling Strategies
The pooling operation is one of the most important processing units in CNN models. It extracts representative features from the captured image features, and is therefore also called the sub-sampling or down-sampling operation. After the pooling operation, the dimension of the output feature is effectively reduced, which helps to reduce the number of network training parameters and prevent overfitting. In the CNN architecture, the common pooling strategies include max pooling, average pooling and stochastic pooling, as shown in Figure 3. For different types of image-classification tasks, different pooling strategies focus on preserving different image features, such as texture, contour, background or other types of features in the input feature maps, and researchers can select different pooling strategies to optimize the CNN models for a specific task. However, a single pooling strategy often loses useful features. For instance, max pooling discards all non-maximum values in the pooling kernel, average pooling fails to retain the maximum feature values, and stochastic pooling does not focus on retaining features in a specific direction. Therefore, a single pooling strategy also limits the classification performance of CNN models, and this limitation needs to be compensated for by other optimization methods.
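The three pooling strategies can be compared on a small feature map. The sketch below is illustrative only: the function names and the 4 × 4 input are assumptions, and stochastic pooling is implemented in its usual form (sampling one activation with probability proportional to its value), which presumes non-negative activations such as those produced by ReLU.

```python
import random


def max_pool(window):
    """Keep only the largest activation in the window."""
    return max(window)


def average_pool(window):
    """Replace the window by the mean of its activations."""
    return sum(window) / len(window)


def stochastic_pool(window, rng):
    """Sample one activation with probability proportional to its value.

    Assumes non-negative activations; an all-zero window returns 0.0.
    """
    if sum(window) == 0:
        return 0.0
    return rng.choices(window, weights=window, k=1)[0]


def pool2d(feature_map, size, pool_fn):
    """Apply `pool_fn` over non-overlapping size x size windows."""
    rows, cols = len(feature_map), len(feature_map[0])
    out = []
    for r in range(0, rows - size + 1, size):
        out_row = []
        for c in range(0, cols - size + 1, size):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            out_row.append(pool_fn(window))
        out.append(out_row)
    return out


fmap = [[1.0, 3.0, 0.0, 2.0],
        [5.0, 7.0, 4.0, 6.0],
        [0.0, 0.0, 8.0, 0.0],
        [0.0, 4.0, 0.0, 0.0]]
rng = random.Random(0)
print(pool2d(fmap, 2, max_pool))      # [[7.0, 6.0], [4.0, 8.0]]
print(pool2d(fmap, 2, average_pool))  # [[4.0, 3.0], [1.0, 2.0]]
print(pool2d(fmap, 2, lambda w: stochastic_pool(w, rng)))
```

In the lower-right window, max pooling keeps the single strong response (8.0) while average pooling dilutes it to 2.0, which illustrates how each strategy on its own discards part of the information.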

Visual Attention Module Design
To solve the problem of feature loss caused by using a single pooling strategy and to improve the classification performance of deep-learning models in risky-driving-image-classification tasks, this paper proposes incorporating visual attention mechanisms into risky-driving-image-classification models. This section describes the four visual attention module designs.

Squeeze and Excitation Visual Attention Block (SE Block)
The squeeze and excitation visual attention block (SE block) was first proposed by Hu et al. in SENet [19]. It adds the visual attention mechanism to the CNN model in the channel direction to obtain more channel feature information, and the structure of the SE block is shown in Figure 4a.

The SE block mainly contains three processes: squeeze, excitation and scale. The output of the previous layer is the processing object, and a 1 × 1 convolution operation is performed first to obtain the feature map:
$$u_c = v_c * X$$
where $v_c$ represents the parameters of the c-th filter, $X$ is the input image, $*$ represents the convolution operation, and $u_c$ is the output feature map. Afterwards, the SE block uses the convolutional output to perform the squeeze, excitation and scale operations in sequence. The squeeze process is implemented as a global average pooling operation; that is, each feature channel of the feature map is compressed into a single value that characterizes the global distribution of responses over the feature channels. The excitation process is implemented with a fully connected layer; its result is passed through another fully connected layer to recover the feature dimension, and the sigmoid activation function is used to obtain a weight value between 0 and 1 for each channel. This process allows the CNN model to effectively learn the nonlinear interactions and non-mutually-exclusive relationships between channels, and ensures that multiple channels can be emphasized at once. Finally, the output of the excitation process is used to reweight the features of each channel, a step also known as scale, in which the normalized weights are multiplied with the previous features channel by channel. Through the SE block operation, the CNN model's feature extraction in the channel direction is effectively enhanced, and the SE block can be flexibly embedded in the residual branch of the ResNet model, as shown in Figure 4b.
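The squeeze, excitation and scale steps can be sketched in a few lines of pure Python. This is an illustration under stated assumptions rather than the authors' implementation: the function name `se_block`, the weight matrices `w_reduce` and `w_expand`, the omission of biases, and the ReLU between the two fully connected layers (as in the original SENet design) are all assumptions, and the toy example uses 2 channels with a reduction to 1 hidden unit.

```python
import math


def se_block(feature_map, w_reduce, w_expand):
    """Squeeze, excite and scale a feature map given as C channels of H x W lists.

    w_reduce: (C/r) x C weights of the first fully connected layer (ReLU).
    w_expand: C x (C/r) weights of the second fully connected layer (sigmoid).
    Biases are omitted to keep the sketch short.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
         for ch in feature_map]
    # Excitation, step 1: reduce the dimension and apply ReLU.
    hidden = [max(0.0, sum(w * zi for w, zi in zip(row, z)))
              for row in w_reduce]
    # Excitation, step 2: restore the dimension and squash to (0, 1).
    scale = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_expand]
    # Scale: reweight every channel by its attention weight.
    scaled = [[[s * v for v in row] for row in ch]
              for ch, s in zip(feature_map, scale)]
    return scaled, scale


# Hypothetical input: 2 channels of size 2 x 2, hidden dimension 1.
features = [[[1.0, 1.0], [1.0, 1.0]],
            [[2.0, 2.0], [2.0, 2.0]]]
scaled, weights = se_block(features, w_reduce=[[1.0, 1.0]],
                           w_expand=[[0.0], [1.0]])
print(weights)
```

Each channel is multiplied by a single learned weight in (0, 1), so the block adds only the parameters of the two small fully connected layers.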

Channel Visual Attention Block & Spatial Visual Attention Block (CA Block & SA Block)
Referring to the design idea of the SE visual attention block, Woo et al. proposed two new visual attention blocks, the channel attention module (CA block) and the spatial attention module (SA block), for the channel direction and the spatial direction, respectively [20], which further improve the feature-extraction ability and classification performance of the CNN image-classification model. The structure details of the CA block and SA block are shown in Figure 5. In CNN image-classification models, the CA block and SA block perform visual attention tasks in different ways: the CA block focuses on computing the intrinsic relationships between individual channels, while the SA block focuses on the intrinsic relationships of feature maps at the spatial level.
On the one hand, the CA block performs max pooling, average pooling and stochastic pooling operations on the input feature map F to simultaneously obtain the texture, contour and background information of the input image and enhance the model robustness. The pooled results are then sent to a shared MLP network, and the corresponding elements of the three resulting feature maps are summed to output the channel attention feature map, so the CNN model not only obtains output feature maps of reduced dimensionality, but also retains more comprehensive image features. On the other hand, in the SA block, max pooling, average pooling and stochastic pooling are performed on the input feature maps in turn, and the results are concatenated. Then, the fused feature maps are subjected to a standard convolution operation to recover the feature dimension and output the spatial visual attention feature map, so the SA block can efficiently help the CNN model solve the problem of "which regions are important and which regions are minor" in the input image. In addition, both the CA block and the SA block can be flexibly deployed in the ResNet, and their embedding schemes are similar to those of the SE block.
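A rough sketch of the channel-attention path described above is given below; the SA block is not covered. It is not the authors' code: the names `shared_mlp` and `channel_attention`, the bias-free two-layer MLP, and the final sigmoid gate (borrowed from the design of Woo et al.) are assumptions, and stochastic pooling is taken to sample one activation per channel with probability proportional to its non-negative value.

```python
import math
import random


def shared_mlp(vec, w1, w2):
    """Two-layer perceptron with ReLU, shared by all pooled descriptors."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def channel_attention(feature_map, w1, w2, rng):
    """Return one attention weight per channel from three pooled descriptors."""
    flat = [[v for row in ch for v in row] for ch in feature_map]
    max_desc = [max(ch) for ch in flat]
    avg_desc = [sum(ch) / len(ch) for ch in flat]
    # Stochastic pooling: sample proportionally to the activations.
    sto_desc = [rng.choices(ch, weights=ch, k=1)[0] if sum(ch) > 0 else 0.0
                for ch in flat]
    outs = [shared_mlp(d, w1, w2) for d in (max_desc, avg_desc, sto_desc)]
    # Element-wise sum of the three MLP outputs, then a sigmoid gate (assumed).
    summed = [sum(vals) for vals in zip(*outs)]
    return [1.0 / (1.0 + math.exp(-s)) for s in summed]


# Hypothetical input: 2 channels of size 2 x 2, MLP hidden dimension 1.
features = [[[0.0, 2.0], [0.0, 0.0]],
            [[1.0, 1.0], [1.0, 1.0]]]
attn = channel_attention(features, w1=[[1.0, 1.0]], w2=[[1.0], [-1.0]],
                         rng=random.Random(0))
print(attn)
```

Sharing one MLP across the three descriptors keeps the parameter overhead the same as for a single pooling path, while the summation lets the peak, mean and sampled responses all influence the channel weights.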

Mixed Visual Attention Block (MA Block)
While exploring the use of visual attention mechanisms in CNN models, Woo et al. found that there was still room to improve CNN image-classification models, so they proposed a mixed visual attention block (MA block) that combines the two types of visual attention blocks to improve the feature-extraction and image-classification performance of deep-learning models, as shown in Figure 6.

Data-Augmentation Technology
With the deepening of the deep-learning model and the increase in the model complexity, training a new, deep and large CNN image-classification model needs to be supported by a large amount of labeled image data, and an insufficient amount of image data will directly lead to overfitting and accuracy bottlenecks during the training phase. Besides, as a relatively new research area, there are relatively few public datasets and insufficient image data for the risky-driving-image-classification task. In addition, the acquisition of risky-driving images requires a professional camera at a fixed position on the driver's side of the motor vehicle, which has relatively strict requirements for imaging equipment and shooting environments, which also increases the difficulty of acquiring risky-driving images and preparing data sets.
One solution to the above problem is data-augmentation (DA) technology, which is now widely used by researchers to obtain training data for deep-learning models. Specifically, the classic DA methods include rotating, flipping, scaling, increasing contrast, adding Gaussian noise, and many other forms. Among them, rotation rotates the original training image by a certain angle; flipping inverts the original image horizontally or vertically; scaling enlarges or shrinks the original image by a certain proportion; increasing contrast changes the saturation (S) and value (V) of the original image in the HSV color space; adding Gaussian noise randomly perturbs the RGB values of each pixel in the original image. By using these classic DA methods, researchers can quickly and efficiently expand the training image dataset for their CNN models, which in turn alleviates the problems of overfitting and unbalanced data volume between groups during the training phase.
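Three of these operations can be illustrated on a tiny image. The sketch below is a simplification, not the pipeline used in the experiment: images are single-channel nested lists rather than RGB arrays, rotation is restricted to 90 degrees, and all function names and the noise level `sigma` are illustrative choices.

```python
import random


def flip_horizontal(img):
    """Mirror each row of a 2-D grayscale image (list of rows)."""
    return [row[::-1] for row in img]


def flip_vertical(img):
    """Reverse the order of the rows."""
    return [row[:] for row in img[::-1]]


def rotate_90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def add_gaussian_noise(img, sigma, rng):
    """Perturb every pixel with zero-mean Gaussian noise, clipped to [0, 255]."""
    return [[min(255.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
            for row in img]


image = [[1, 2],
         [3, 4]]
print(flip_horizontal(image))  # [[2, 1], [4, 3]]
print(flip_vertical(image))    # [[3, 4], [1, 2]]
print(rotate_90(image))        # [[3, 1], [4, 2]]
noisy = add_gaussian_noise(image, sigma=5.0, rng=random.Random(0))
```

Each transform produces a new labeled sample from an existing one, so applying several of them to an under-represented category is one way to balance the class sizes.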

Experiment Data Processing & Dataset Preparation
For the task of monitoring the driver's driving status, this experiment selected images of the normal driving status and of three risky driving statuses as the experimental objects; the three risky driving statuses are smoking, drinking and calling. While driving, smoking, drinking and calling seriously distract drivers and reduce their response speed to emergencies, so when these scenarios occur, the probability of potential safety hazards or traffic accidents increases significantly. Therefore, these risky-driving situations should be avoided as much as possible in actual driving.
The experimental data were collected by a professional image data acquisition company. The camera was deployed on the left side of the motor vehicle above the A-pillar, where it can clearly capture images of the driver's driving status. The camera deployment position and two example images from each category are shown in Figure 7.

Sensors 2022, 22, x FOR PEER REVIEW
At the beginning of the dataset preparation, this experiment adopted an 8:1:1 ratio to divide the data into the training set, test set and validation set, respectively. In addition, in order to avoid the problems of insufficient training data and an uneven data volume between different categories of images, this experiment used the DA technology to expand the training set. The data volume of each category after the DA process is shown in Table 1: the training sets of normal, smoking, drinking and calling contain 2403, 2420, 2416 and 2407 images, respectively, totaling 9646; the test sets contain 298, 293, 291 and 299 images, totaling 1181; and the validation set contains 280 images per category, totaling 1120.
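A shuffle-and-split of this kind could look as follows. The helper `split_8_1_1` and its seed are hypothetical, since the paper does not describe how the split was implemented; the last three lines only re-check the totals quoted in the text.

```python
import random


def split_8_1_1(samples, seed=0):
    """Shuffle and split into (train, test, validation) at roughly 8:1:1."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_hold = len(items) // 10           # size of each held-out set
    n_train = len(items) - 2 * n_hold
    train = items[:n_train]
    test = items[n_train:n_train + n_hold]
    val = items[n_train + n_hold:]
    return train, test, val


train, test, val = split_8_1_1(range(1000))
print(len(train), len(test), len(val))  # 800 100 100

# Totals reported in the text after augmenting the training set.
assert sum([2403, 2420, 2416, 2407]) == 9646
assert sum([298, 293, 291, 299]) == 1181
assert 4 * 280 == 1120
```

Augmenting only the training portion after the split, as the text describes, keeps transformed copies of a test or validation image out of the training set.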

Model-Building Details and Experiment Setting
In order to explore the performance of different-depth CNNs and the proposed visual attention blocks for risky-driving image classification, four deep-learning image-classification models with different depths were built with reference to the ResNet: ResNet18 with 18 layers, ResNet34 with 34 layers, ResNet50 with 50 layers, and ResNet101 with 101 layers. After that, the four visual attention blocks (SE block, CA block, SA block, and MA block) were embedded in the different-depth ResNet models. The model-building details and the embedding details of the visual attention blocks are shown in Table 2. By comparing the above CNN models, this experiment systematically explores the performance of deep-learning-based image-classification technology in the field of safe-driving detection. The experimental environment and application platform used in the model-building, platform-deployment and testing phases are detailed in Table 3. This experiment used the SGD optimizer with a learning rate of 1 × 10⁻⁴ and a momentum of 0.95, a dropout rate of 0.5, and the categorical cross-entropy loss function; in the attention modules, the decay rate was 16 and the pooling kernel size was 7 × 7. Meanwhile, in order to improve the model-training efficiency, this experiment used the ReduceLROnPlateau and EarlyStopping algorithms. For ReduceLROnPlateau, the monitor was the validation loss, the learning-rate decay factor was 0.5, and the patience was 4. For EarlyStopping, the monitor was the validation loss, the min_delta was 0, and the patience was 10. In addition, the models were built with the Keras toolbox in a Python 3.7 environment, the maximum number of training epochs was 400 with a batch size of 32, and all models were trained on an Nvidia RTX 2080Ti with CUDA 10.1 and cuDNN 7.3.1.
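To show how the two callbacks interact under these settings, the following pure-Python function replays a sequence of validation losses. It is a simplified stand-in, not the Keras implementation: both rules share one best-loss tracker, cooldown and minimum learning rate are ignored, and the function name and the loss sequence are made up for illustration.

```python
def simulate_callbacks(val_losses, lr=1e-4, factor=0.5,
                       lr_patience=4, stop_patience=10, min_delta=0.0):
    """Replay validation losses; return (last epoch index, learning rate per epoch)."""
    best = float("inf")
    lr_wait = stop_wait = 0
    lrs = []
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # Improvement: remember it and reset both patience counters.
            best, lr_wait, stop_wait = loss, 0, 0
        else:
            lr_wait += 1
            stop_wait += 1
            if lr_wait >= lr_patience:
                lr *= factor       # plateau: decay the learning rate
                lr_wait = 0
        lrs.append(lr)
        if stop_wait >= stop_patience:
            return epoch, lrs      # no improvement for too long: stop
    return len(val_losses) - 1, lrs


# Hypothetical run: the loss improves twice and then plateaus.
losses = [1.0, 0.9] + [0.95] * 20
stopped_at, lrs = simulate_callbacks(losses)
print(stopped_at, lrs[-1])
```

In this toy run the learning rate is halved twice before training stops after ten epochs without improvement, so the nominal limit of 400 epochs acts only as an upper bound.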

Model Comparison and Evaluation
To fully evaluate and compare the performance of the deep-learning models embedded with attention blocks in risky-driving-image-classification tasks, and to explore the application performance between different visual attention blocks, this experiment collected the training accuracy, training loss, validation accuracy, and validation loss of 20 different ResNet models, as shown in Table 4. Among them, training accuracy and training loss were used to evaluate the model-training status, and validation accuracy and validation loss were used to evaluate the model classification performance. Meanwhile, the number of calculation parameters of each model was calculated to evaluate the training difficulty of the models and to observe the increase in training parameters and the training difficulty due to embedding visual attention blocks.
This experiment first compares the base ResNet models of different depths without the visual attention mechanism. The results indicate that the classification accuracy is positively correlated with the depth of the model, and that the model complexity, that is, the number of calculation parameters, also increases with the depth of the model. Among them, ResNet101 achieved 92.73% classification accuracy with 45.13 M parameters on the risky-driving-image dataset; while the recognition accuracy was improved, the model was also more complex.
After that, we compared the ResNet models of each depth embedded with visual attention blocks. The results indicate that the visual attention modules enhance the recognition accuracy of the ResNet model to varying degrees without significantly increasing the number of model parameters. Among them, the ResNet models embedded with the MA block improved the validation accuracy to the greatest extent: the improvements of ResNet18_Mixed, ResNet34_Mixed, ResNet50_Mixed and ResNet101_Mixed were 4.93%, 4.39%, 3.63% and 4.45%, respectively. It is worth noting that, by embedding the visual attention modules, ResNet models with lower depths can exceed the classification accuracy of ResNet models with higher depths. For instance, the validation accuracy of ResNet50_Mixed was 96.52%, while that of ResNet101 was 93.73%, and ResNet101 had 60% more parameters than ResNet50_Mixed. Therefore, embedding a visual attention module can greatly improve the recognition accuracy of the ResNet model without affecting the recognition efficiency, which is of great significance to the practical application and popularization of this technology.

Confusion Matrices Analysis
To further demonstrate the classification performance of the proposed ResNet model embedded with attention blocks on the risky-driving-image dataset, the confusion matrix was introduced as an evaluation tool in this experiment [21,22]. Taking ResNet101 and the ResNet101 variant models with different visual attention blocks as examples, the confusion matrices of these models on the risky-driving-image test set are shown in Figure 8.
On the whole, the ResNet101 models with attention blocks had a lower misclassification rate than the base ResNet101 model: the numbers of misclassified images for ResNet101_SE, ResNet101_SA, ResNet101_CA, and ResNet101_Mixed were 104, 105, 100, and 84, respectively, while the number for the base ResNet101 model reached 111. This indicates that embedding the visual attention module in the ResNet model can effectively improve the classification performance of the CNN model for risky-driving images.
When analyzing the misjudgment rate of the different categories, the results indicate that the misclassification of normal-driving images was relatively high among the four categories, while the misclassification rates of the smoking, drinking, and calling categories were similar to one another. In analyzing the reasons for these results, we believe that the normal-driving images do not have obvious characteristics, and their driving actions are partially similar to those in the three risky-driving categories, which to some extent causes the model misclassification.
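Both quantities discussed above, the total number of misclassified images and the per-category error rate, can be read directly from a confusion matrix. The helpers and the 4 × 4 matrix below are purely illustrative (rows are true classes, columns are predicted classes); the matrix does not reproduce the values in Figure 8.

```python
def misclassified(conf_matrix):
    """Total number of off-diagonal entries, i.e. wrongly classified samples."""
    total = sum(sum(row) for row in conf_matrix)
    correct = sum(conf_matrix[i][i] for i in range(len(conf_matrix)))
    return total - correct


def per_class_error(conf_matrix):
    """Fraction of each true class (row) that was assigned to another class."""
    return [1.0 - row[i] / sum(row) for i, row in enumerate(conf_matrix)]


# Made-up example in the order normal, smoking, drinking, calling.
toy = [[8, 1, 1, 0],
       [0, 10, 0, 0],
       [1, 0, 9, 0],
       [0, 0, 0, 10]]
print(misclassified(toy))    # 3
print(per_class_error(toy))
```

Applied to the counts reported in the text, 111 errors among the 1181 test images correspond to an error rate of about 9.4% for the base ResNet101, and 84 errors to about 7.1% for ResNet101_Mixed.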

Grad-CAM Visualization Analysis
In order to observe the changes in the magnitude and distribution of the classification weights caused by embedding visual attention blocks more intuitively, taking the different-depth ResNet101 models and variant ResNet101 models as examples, one image of each category in the risky-driving-image dataset was selected for Grad-CAM visualization [23], and the visualization results are shown in Figure 9.

On the one hand, in the overall trend, the visualization results show that after embedding attention blocks in the ResNet101 model, the model extracts more image feature information from the input image, as the regions that carry weight become relatively larger, which indicates the improved feature-extraction ability of the ResNet model. On the other hand, when analyzing the Grad-CAM maps of the ResNet101 with different attention blocks, the experimental results show that the distribution and size of the weights occupied by some of the focal regions in the three risky-driving category images increase to some extent. Examples are the weight of the water bottle and drinking-action region in the driving-while-drinking image, the weight of the cigarette and smoking-action region in the driving-while-smoking image, and the weight of the cell phone and talking-action region in the driving-while-talking image, which represent the main direction of the features extracted by the visual attention blocks.
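For reference, the core Grad-CAM computation behind such heat maps can be sketched as follows: each channel of the last convolutional feature map is weighted by the spatial average of the class-score gradient for that channel, the weighted channels are summed, and a ReLU keeps only the positively contributing regions. The function name and the toy inputs are illustrative; in practice the activations and gradients come from the trained network and the map is upsampled to the input size.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM map from K channels of H x W activations and gradients."""
    h, w = len(activations[0]), len(activations[0][0])
    heat = [[0.0] * w for _ in range(h)]
    for act, grad in zip(activations, gradients):
        # Channel weight: global average of the gradient over all positions.
        alpha = sum(sum(row) for row in grad) / (h * w)
        for i in range(h):
            for j in range(w):
                heat[i][j] += alpha * act[i][j]
    # ReLU: keep only regions that increase the class score.
    return [[max(0.0, v) for v in row] for row in heat]


# Toy example: channel 0 supports the class, channel 1 opposes it.
acts = [[[1.0, 0.0], [0.0, 0.0]],
        [[0.0, 0.0], [0.0, 1.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]],
         [[-1.0, -1.0], [-1.0, -1.0]]]
print(grad_cam(acts, grads))  # [[1.0, 0.0], [0.0, 0.0]]
```

Only the top-left position, where the supporting channel is active, remains highlighted, which mirrors how the maps in the text concentrate on the bottle, cigarette or phone regions.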

Conclusions
In order to further improve the performance of the deep-learning image-classification model in the risky-driving-detection task, this paper proposes a solution of embedding visual attention blocks into the deep-learning framework to improve the feature-extraction ability and classification performance. Through the model comparison and evaluation, it is worth noting that the classification accuracy of ResNet models with lower depths can exceed the ResNet models with higher depths by embedding the visual attention modules, while there is no significant change in model complexity. Therefore, we can greatly improve the recognition accuracy of the ResNet model by embedding the visual attention module, but the recognition efficiency will not be affected, which is of great significance to the practical application and popularization of this technology. Moreover, the results of the confusion matrices analysis and Grad-CAM visualization analysis confirm the superiority of the proposed model.
In future studies, we will further expand the range of dangerous-driving scenes that can be recognized and the amount of image data, optimize the configuration of the visual attention module, and carry out practical applications and optimization while continuing to improve recognition accuracy and efficiency.