Deep Deterministic Policy Gradient Algorithm Based on Convolutional Block Attention for Autonomous Driving

Abstract: Autonomous driving based on deep reinforcement learning is an active research area. Traditional autonomous driving requires human involvement, and autonomous driving algorithms based on supervised learning must be trained in advance using human experience. To address these problems, this paper proposes an improved end-to-end deep deterministic policy gradient (DDPG) algorithm based on the convolutional block attention mechanism, called the multi-input attention prioritized deep deterministic policy gradient algorithm (MAPDDPG). Both the actor network and the critic network of the model have the same structure with symmetry. Meanwhile, the attention mechanism is introduced to help the vehicle focus on useful environmental information. The experiments are conducted in the open racing car simulator (TORCS), and the results of five experiment runs on the test tracks are averaged to obtain the final result. Compared with the state-of-the-art algorithm, the maximum reward increases from 62,207 to 116,347, the average speed increases from 135 km/h to 193 km/h, and the number of successful episodes that complete a lap increases from 96 to 147. Also, the variance of the distance from the vehicle to the center of the road is compared: it is 0.6 m for the DDPG and only 0.2 m for the MAPDDPG. These results indicate that the proposed MAPDDPG achieves excellent performance.


Introduction
With the rapid development of artificial intelligence technology, the era of autonomous driving has arrived. However, traditional autonomous driving systems require human involvement in designing rules. Also, to make decisions, structural information about the scene [1] needs to be built, such as subsystems for lanes, road markings, pedestrians, cars, and beacons [2]. Currently, there are two main types of unmanned driving algorithms: modular unmanned driving and end-to-end unmanned driving. The traditional modular unmanned driving algorithm has the following shortcomings.

1.
The process of rule-based policy-making is complex and costly [3].

2.
The predefined input and output of each subsystem are not necessarily optimal, which makes it difficult to adapt to complex and changeable environments [9].

Deep reinforcement learning combines the perceptual capabilities of deep learning with the decision-making abilities of reinforcement learning, which addresses the above problems and enables end-to-end learning. It has a wide range of applications in autonomous driving, such as vehicle power management [10][11][12][13], control [14], and decision-making [15,16].
The main contributions of this paper are as follows.

1.
Channel and spatial attention mechanisms are applied to the actor network of the MAPDDPG to weight the important regions of the input images so that the model can focus on the key region information. Based on an intermediate feature matrix, the module infers attention feature maps along two independent dimensions (channel and space). Then, the attention maps are multiplied with the features to refine the representation of the input image.

2.
In the actor network of the MAPDDPG algorithm, the GRU layer is used by the model as a temporal attention mechanism to weigh the past few frames according to their importance to determine the current driving policy. The important frames in the past can provide further optimization for behavioral decisions. By processing the data in parallel, the entire network obtains the data features symmetrically from multiple dimensions.

3.
A new reward function is proposed and used as the criterion for evaluation. Three tracks in the TORCS simulation environment are used to evaluate the performance of the MAPDDPG model.

Related Work
Deep learning and deep reinforcement learning have been extensively studied and applied to autonomous driving systems. The application of deep learning to autonomous driving dates back as far as 1986, when [26] used a three-layer back-propagation network trained on data obtained from cameras and laser rangefinders to output vehicle actions. LeCun et al. [27] designed a vision-based obstacle avoidance system that maps raw input images to steering angles; the system was trained in supervised mode to predict the steering angles. Karol Zieba mapped raw camera pixels to action commands and trained a convolutional neural network using a small amount of data [4,5]. Besides, to solve the problem of unstable learning, researchers devised several network structures that can be used in complex environments. For example, Mehta et al. [28] proposed a multi-task learning from demonstration (MTLfD) framework that predicts visual affordances and action primitives and guides the prediction of driving commands through direct supervision. Sauer et al. [29] presented a direct perception method that maps video inputs to intermediate representations and is adapted to autonomous navigation in complex urban environments to reduce traffic accidents. All these methods require significant human involvement, which introduces uncertainty into the training of end-to-end self-driving systems. If the data obtained is not independently distributed or the noise is not homogeneous, the results of the end-to-end autonomous driving system training can deviate significantly from the real results.
Deep reinforcement learning learns through exploration and trial and error and does not require prior knowledge. Riedmiller et al. [30] proposed a neural fitted Q iteration (NFQ) network that is purely data-driven and guides a real robot car based on the data collected directly from experiments. Jung et al. [31] extended an approach based on deep inverse reinforcement learning; the extension exploited a new type of neural network to derive contextual relationships from sensory data and blend them with the output. By using expert demonstrations in Q-learning, Xia et al. [32] improved the stability by 32% and reduced the convergence time by 72%. Chae et al. [33] exploited deep Q-learning [17] to train the driving behavior of the agent in urban environments to decrease the occurrence of accidents. Wang et al. [34] treated both the state space and action space as continuous and designed a Q-function approximator with a closed-form greedy policy to train the vehicle to learn automatic lane-changing behavior.
However, DQN can only handle discrete action spaces, so subsequent researchers have used the Actor-Critic algorithm and the deep deterministic policy gradient (DDPG) to deal with continuous action problems. Jaritz et al. [35] mapped the RGB images from the front camera to the output actions and trained the agent with the Asynchronous Advantage Actor-Critic [36] algorithm to achieve fast convergence and stable driving. Wang et al. [37] exploited DDPG to train the lane-changing behavior of the agent. In [16], deep reinforcement learning was applied for the first time to an actual full-size self-driving vehicle, where the DDPG network takes the image information observed by the vehicle as input and is trained with a sparse reward. Wang et al. [38] proposed to set the learning objectives by collecting and analyzing the driving data from different drivers; the DDPG algorithm was then exploited to design a driving decision system. Although the DDPG algorithm works well in autonomous driving research, it still has considerable deficiencies in stability and data processing.
Based on the previous research, this paper proposes the MAPDDPG model that can selectively pay attention to the input information, thus enhancing the safety of autonomous driving and obtaining a better reward.

Methods
This section outlines the MAPDDPG model, and the overall structure of the model is shown in Figure 1. A convolutional block attention mechanism is introduced in this paper to extract channel and spatial features of images to make the model focus on the information of important regions. Meanwhile, a GRU layer is added to make the important frames in the past provide further optimization to behavioral decisions. Unlike previous autopilot models, MAPDDPG takes sensor and image information as input and a new reward function is designed to accelerate the training process. The MAPDDPG model mainly consists of five elements:

1.
Convolutional Neural Network (CNN): The model adopts a five-layer CNN to extract features from the image and obtain the middle feature matrix.

2.
Channel attention and spatial attention layer: The model adds the channel attention and spatial attention layer behind the CNN. First, the model ignores the spatial dimensionality of the input features to obtain the channel attention feature map. Then, the model produces a spatial attention map by using the spatial correlation between the features.

3.
Gated Recurrent Unit layer (GRU): The GRU layer is placed behind the convolutional block attention mechanism. GRU uses gating mechanisms to make the proposed model not only remember past information but also selectively forget unimportant information.

4.
Priority experience replay: The MAPDDPG differs from the previous models in that it exploits priority experience replay to improve sample utilization.

5.
Reward function: A new reward function is designed that consists of three components: speed, distance to the center of the lane, and whether to run out of the track.

Deep Deterministic Policy Gradient
The MAPDDPG model builds on DDPG [23], which has two neural network structures with symmetry: an online network and a target network. The parameters of the target network are copied from the online network after C iterations. Symmetries may be found in Markov decision processes (MDPs). For example, CartPole has symmetry along the longitudinal axis. Our model introduces an MDP with symmetry, which involves a set of transformations on the state-action space that keep the reward function and the transition operator unchanged. A state transformation and a state-dependent action transformation are maps of the form S → S and A → A, respectively.
The actor network takes the state as input and outputs an action; it is in charge of producing actions and interacting with the environment. The critic network is responsible for evaluating the performance of the action and guiding the actor network toward the actions that obtain the maximum Q-value. In the standard DDPG form [23], the loss function L and the target value are defined as follows:

$$L = \frac{1}{N}\sum_{i}\left(y_i - Q(s_i, a_i \mid \theta^{Q})\right)^2$$

$$y_i = r_i + \gamma\, Q'\left(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\right)$$

where $y_i$ and $r_i$ are respectively the target Q-value and the reward, $\gamma$ is the discount factor, and N is the number of sampled transitions. $s_i$ and $s_{i+1}$ are the current state and the next state. $\theta^{Q}$ and $\theta^{\mu}$ denote the network parameters of the critic network and the actor network, and $\theta^{Q'}$ and $\theta^{\mu'}$ are those of the corresponding target networks. The gradient update of the actor network is:

$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_{i} \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_i,\, a=\mu(s_i)}\; \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=s_i}$$

where $\nabla_{\theta^{\mu}} J$ is the gradient of the performance objective, and $J(\mu)$ measures the performance of the actor network $\mu$.
The DDPG algorithm adopts an experience pool and a dual-network architecture with symmetry to break the dependency between data samples. The target network provides the target values $Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$. The weights of the target network are adjusted with a rate $\tau$ that is usually much less than 1. The weights are updated as follows:

$$\theta^{Q'} \leftarrow \tau\,\theta^{Q} + (1-\tau)\,\theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau\,\theta^{\mu} + (1-\tau)\,\theta^{\mu'}$$
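The target value, critic loss, and soft target update described in this subsection can be illustrated with a minimal Python sketch. It assumes the standard DDPG formulation of [23]; the function names and the flat list representation of network weights are hypothetical choices for illustration and are not taken from the MAPDDPG implementation.

```python
def td_targets(rewards, next_q_values, gamma):
    """Target values y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})).

    next_q_values are assumed to come from the target critic evaluated
    on the target actor's action for the next state.
    """
    return [r + gamma * q for r, q in zip(rewards, next_q_values)]


def critic_loss(targets, q_values):
    """Mean squared error between the targets y_i and Q(s_i, a_i)."""
    n = len(targets)
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / n


def soft_update(target_params, online_params, tau):
    """theta' <- tau * theta + (1 - tau) * theta' for every weight."""
    return [tau * w + (1.0 - tau) * wt
            for w, wt in zip(online_params, target_params)]


if __name__ == "__main__":
    y = td_targets([1.0, 0.0], [2.0, 4.0], gamma=0.5)
    print("targets:", y)
    print("critic loss:", critic_loss(y, [1.0, 3.0]))
    print("soft update:", soft_update([0.0, 10.0], [1.0, 0.0], tau=0.1))
```

Because tau is much smaller than 1, each call to soft_update moves the target weights only slightly toward the online weights, which is what keeps the target values stable during training.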

Attention-Based Actor-Critic Structure
Drivers are likely to consider several historical states when taking an action, and they weigh the significance of each state according to its time and place. To gain this ability, the attention mechanism shown in Figure 2 is proposed in this paper. Attention enables the neural network to focus on the useful information of the input data that is relevant to the current output, thus improving the quality of the output. In this paper, attention mechanisms are exploited to make the model concentrate on the significant features. Specifically, the image information is extracted by the convolutional neural network to form an intermediate feature matrix, and the features of the important regions in the image are extracted by the channel and spatial attention mechanisms. Then, the feature matrix is fed into the GRU layer to pick out the important frames from the past, and the final output action is obtained after three fully connected layers.
In this subsection, two attention mechanisms (channel attention and spatial attention) that are incorporated into the actor network to provide better options for action are described.

Channel Attention
The MAPDDPG generates a channel attention map by exploiting the channel-to-channel correlations. Since each channel of the feature map is regarded as a feature detector [39], it makes sense for the channel attention to focus on the "what" of the input image. Based on this, the commonly used average pooling and max pooling are applied to aggregate the spatial information. The architecture of channel attention is illustrated in Figure 3.
The channel attention of the MAPDDPG model is computed as follows. Firstly, a feature map $F \in \mathbb{R}^{H \times W \times C}$ is input to the model. The spatial information of the feature map is aggregated by applying max pooling and average pooling to create two spatial context descriptors, $F^{c}_{max}$ and $F^{c}_{avg}$, that respectively represent the max-pooled and average-pooled features. Then, the two descriptors are input into a shared network that consists of a multilayer perceptron (MLP) with one hidden layer to create a channel attention feature map $M_c \in \mathbb{R}^{C \times 1 \times 1}$. The parameter overhead is minimized by setting the size of the hidden activation to $\mathbb{R}^{C/r \times 1 \times 1}$, where r is the reduction ratio of the number of channels. Finally, the two resulting feature vectors are summed and fed into a sigmoid activation function $\sigma$ to obtain the weights $M_c$:

$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F^{c}_{avg})) + W_1(W_0(F^{c}_{max}))\big)$$

where $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$ are the weights of the shared network.

Spatial Attention
The MAPDDPG adds the spatial attention behind the channel attention to focus on "where" the features are meaningful. The spatial sub-module uses two pooled outputs similar to those of the channel attention and aggregates them along the channel axis to produce an effective feature descriptor. The architecture of spatial attention is shown in Figure 4. The calculation of spatial attention is shown in Equation (7).
Similar to channel attention, a feature $F'$ with a size of H × W × C is given (reconstructed by channel attention). Two 2D feature maps $F^{s}_{avg} \in \mathbb{R}^{1 \times H \times W}$ and $F^{s}_{max} \in \mathbb{R}^{1 \times H \times W}$ are created along the channel axis by using the same pooling operations as the channel attention. Then, they are concatenated and fed into a convolutional layer $f^{7 \times 7}$ with a 7 × 7 kernel to obtain a spatial attention map $M_s(F') \in \mathbb{R}^{H \times W}$:

$$M_s(F') = \sigma\big(f^{7 \times 7}([F^{s}_{avg}; F^{s}_{max}])\big) \tag{7}$$

The activation function $\sigma$ is sigmoid.

Convolutional Block Attention Mechanism
Channel attention and spatial attention can be combined in series or in parallel, but it is found that combining the attention mechanisms in series and putting the channel attention first leads to better results. Therefore, channel attention is placed before the spatial attention in our model, as shown in Figure 5. The output of the attention block in the MAPDDPG model is calculated as follows:

$$F' = M_c(F) \otimes F, \qquad F'' = M_s(F') \otimes F'$$

where $\otimes$ indicates element-wise multiplication, and $F''$ is the feature map reconstructed by the channel and spatial attention mechanism.
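To make the data flow of the attention block concrete, the following is a small pure-Python sketch of channel attention, spatial attention, and their serial combination. It is an illustration under assumptions rather than the MAPDDPG code: the feature map is stored as nested lists in channel-first order (C × H × W), the ReLU in the hidden layer of the shared MLP is an assumption that the text does not state, the convolution uses zero padding so the map size is preserved, and all function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def channel_attention(F, W0, W1):
    """Return one weight per channel: sigmoid(MLP(avg) + MLP(max)).

    F is C x H x W; W0 is (C/r) x C and W1 is C x (C/r).
    """
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in F]
    mx = [max(map(max, ch)) for ch in F]

    def mlp(v):
        hidden = [max(0.0, h) for h in matvec(W0, v)]  # ReLU (assumption)
        return matvec(W1, hidden)

    return [sigmoid(a + m) for a, m in zip(mlp(avg), mlp(mx))]


def spatial_attention(F, kernel):
    """Return an H x W map: sigmoid(conv([avg; max] over channels)).

    kernel is 2 x k x k with k odd (7 in the text); zero padding is used.
    """
    C, H, W = len(F), len(F[0]), len(F[0][0])
    k = len(kernel[0])
    pad = k // 2
    avg = [[sum(F[c][i][j] for c in range(C)) / C for j in range(W)]
           for i in range(H)]
    mx = [[max(F[c][i][j] for c in range(C)) for j in range(W)]
          for i in range(H)]
    maps = [avg, mx]
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            s = 0.0
            for m in range(2):
                for di in range(k):
                    for dj in range(k):
                        y, x = i + di - pad, j + dj - pad
                        if 0 <= y < H and 0 <= x < W:
                            s += kernel[m][di][dj] * maps[m][y][x]
            out[i][j] = sigmoid(s)
    return out


def cbam(F, W0, W1, kernel):
    """Serial combination: F' = M_c(F) * F, then F'' = M_s(F') * F'."""
    Mc = channel_attention(F, W0, W1)
    F1 = [[[Mc[c] * v for v in row] for row in F[c]] for c in range(len(F))]
    Ms = spatial_attention(F1, kernel)
    return [[[Ms[i][j] * F1[c][i][j] for j in range(len(F1[c][i]))]
             for i in range(len(F1[c]))] for c in range(len(F1))]
```

With all weights set to zero, both attention maps equal 0.5 everywhere, so the block simply scales the input by 0.25. This gives a quick sanity check that the two multiplications are applied in series.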

Gated Recurrent Unit
CNNs only take historical observations as input and do not consider the time information. By contrast, an RNN considers longer sequences of historical information through its recurrent connections, which contributes to the generation of more sophisticated driving strategies. As shown in Figure 6, the recurrent neural network chosen in this paper is the Gated Recurrent Unit [40].
After the channel and spatial attention, the refined feature map $F''$ and the hidden vector $h_{t-1}$ are each transformed by a linear transformation to form two gates, i.e., the update gate $z_t$ and the reset gate $r_t$. The update gate assists our model in determining the amount of previous information to pass on to the future, while the reset gate primarily decides the amount of previous information that should be forgotten. The formulas are as follows:

$$z_t = \sigma\big(W^{(z)} F'' + U^{(z)} h_{t-1}\big)$$

$$r_t = \sigma\big(W^{(r)} F'' + U^{(r)} h_{t-1}\big)$$

where W and U are two linear matrices. Then, the reset hidden vector $h'_{t-1} = r_t \odot h_{t-1}$, where $\odot$ denotes the element-wise (Hadamard) product, is concatenated with the input vector $F''$ to obtain the hidden vector $h_t$. The hidden vector $h_t$ retains the information about the current cell. Subsequently, the context vector, which is the weighted sum of the GRU layer outputs, is calculated as follows:

$$C_T = \sum_{t=1}^{T} w_{T+1-t}\, h_{T+1-t}$$

As shown in Figure 2, the context vector $C_T$ is fed into the fully connected layer before the actions are output. The weights $w_{T+1-t}$ learned by the network can be interpreted as the importance of the GRU output at a given time step. The significance of using the GRU layer is that it exploits the output features of the past T frames to obtain the action output policy.
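The gating and the temporal weighting described above can be sketched in a few lines of Python. This is a sketch under assumptions, not the layer used in the paper: the parameter names (Wz, Uz, Wr, Ur, Wh, Uh) are hypothetical, the candidate state and the final blend follow one common form of the standard GRU [40] because the text does not spell these steps out, and a linear map applied to separate inputs is used in place of an explicit concatenation, which is mathematically equivalent.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gru_step(x, h_prev, p):
    """One GRU step on feature vector x with previous hidden state h_prev.

    p is a dict of weight matrices with hypothetical keys
    Wz, Uz (update gate), Wr, Ur (reset gate), Wh, Uh (candidate state).
    """
    z = [sigmoid(a + b)
         for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h_prev))]
    r = [sigmoid(a + b)
         for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h_prev))]
    # Reset hidden vector: element-wise product of r_t and h_{t-1}.
    h_reset = [ri * hi for ri, hi in zip(r, h_prev)]
    cand = [math.tanh(a + b)
            for a, b in zip(matvec(p["Wh"], x), matvec(p["Uh"], h_reset))]
    # Standard GRU blend of old state and candidate (assumption).
    return [(1.0 - zi) * hi + zi * ci
            for zi, hi, ci in zip(z, h_prev, cand)]


def context_vector(outputs, weights):
    """Weighted sum of the last T GRU outputs, one weight per time step."""
    dim = len(outputs[0])
    return [sum(w * h[d] for w, h in zip(weights, outputs))
            for d in range(dim)]
```

In a trained network the weights passed to context_vector would be learned, so frames that matter more for the current decision contribute more to the context vector.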

Network and Priority
As shown in Figure 7, the actor network and the target actor network take the feature maps and the 29-dimensional sensor information obtained from TORCS as input and output the actions. The two networks have the same structure with symmetry.
As shown in Figure 8, the critic network and target critic network differ from the actor network in two aspects. One is that the action information obtained from the actor network is input to the fully connected layers of the model. The other is that only the Q-value of the output actions needs to be calculated to evaluate the performance of the actions in the critic network. In this case, there is no need to use the attention mechanism in the critic network, which reduces the computational complexity of the network and speeds up the training. The critic network and target critic network have the same structure with symmetry.
In this paper, the MAPDDPG is optimized by priority sampling instead of uniform sampling. The prioritized experience replay [41] approach prioritizes all experiences by TD-error and preferentially samples the ones with higher priority. This is achieved with SumTree, a binary tree structure. As shown in Figure 9, each leaf node of the SumTree stores the priority of one experience sample, and each internal node stores the sum of the values of its child nodes. For instance, suppose a random number in [0, 42] is drawn, e.g., 24. The search starts from the root node and compares the number with the value of the left child. If the number does not exceed the value of the left child, the search continues in the left subtree. Otherwise, the value of the left child is subtracted from the number and the search continues in the right subtree. The search proceeds from top to bottom until a leaf is reached, here the sample with a priority of 12.
Figure 9. The tree structure of SumTree [42].

Reward Function
To assess the superiority of the MAPDDPG, a new reward function that consists of three components is designed. Firstly, the velocity is kept in a steady range, neither too large nor too small. This component of the reward function is denoted as $R_{speed}$ and is a function of the speed v. Secondly, it is desired that the vehicle stays on the centerline. Thus, if the vehicle deviates from the center, a penalty is given. This component of the reward function is denoted as $R_{center}$. Thirdly, to penalize the situation where the vehicle runs out of the lane, a third component $R_{out}$ is used:

$$R_{out} = -50 \tag{15}$$

Among the terms of the reward, $V_x \cos(\theta)$ indicates the speed along the lane, which should be encouraged; $V_x \sin(\theta)$ denotes the speed perpendicular to the track, which should be discouraged; $|trackPos|$ and $V_x|trackPos|$ are penalty terms. The farther the vehicle is from the centerline, the lower the reward. The overall reward function is a linear combination of the three components with the specified weights $\alpha$, $\beta$, $\gamma$. Finally, the reward function is normalized, and the MAPDDPG is trained to determine the most appropriate combination of the weights.
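As a rough illustration of how the three components could be combined, the sketch below computes the terms named in the text and forms the weighted sum. Several parts are assumptions and are marked as such: the pairing of the weights with the components, the way the centerline terms are combined in r_center_sketch, and the value 0 for the off-track term while the vehicle is still on the track. The speed component is left as an input because its exact form is not reproduced here. All function names are hypothetical.

```python
import math

R_OUT = -50.0  # off-track penalty, Equation (15)


def r_out(off_track):
    """Off-track component; 0.0 while on the track is an assumption."""
    return R_OUT if off_track else 0.0


def r_center_sketch(vx, theta, track_pos):
    """Hypothetical combination of the terms named in the text.

    vx: longitudinal speed, theta: angle to the track axis in radians,
    track_pos: signed offset from the centerline.
    """
    along = vx * math.cos(theta)         # speed along the lane (rewarded)
    across = abs(vx * math.sin(theta))   # speed across the lane (penalized)
    offset = abs(track_pos)              # distance from the centerline
    speed_offset = vx * abs(track_pos)   # fast driving away from the center
    return along - across - offset - speed_offset


def total_reward(r_speed, r_center, r_out_value, alpha, beta, gamma):
    """Linear combination of the three components.

    Assigning alpha, beta, gamma to speed, center, and out in this
    order is an assumption.
    """
    return alpha * r_speed + beta * r_center + gamma * r_out_value


if __name__ == "__main__":
    rc = r_center_sketch(vx=10.0, theta=0.0, track_pos=0.0)
    print(total_reward(1.0, rc, r_out(False), 0.5, 0.25, 0.01))
```

A vehicle that drives straight along the centerline keeps the full forward-speed term, while any heading error or lateral offset reduces the reward, which matches the qualitative behavior described in the text.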

Experiment
To confirm the validity of the MAPDDPG model, it is run on TORCS, an open-source simulation tool for autopilot. Other software used in the experiment includes Anaconda 3, Keras 0.1.1, and Tensorflow-gpu 0.12. All our experiments are conducted on a machine running Ubuntu 16.04, and the machine is equipped with a 16-core CPU, 64 GB memory, and GTX-1060 GPU. The parameters set for the network are listed in Table 1.

Experiment Settings
In many autonomous driving studies based on deep reinforcement learning, the inputs are divided into two domains, i.e., image and sensor information. In human driving scenarios, the data transferred by sensors and vision needs to be considered. So, the input of MAPDDPG has two aspects: image and sensors. Specifically, the images from the vehicle's front camera are selected as input, and the sensor input is listed in Table 2. To ensure that the proposed method is not restricted to a specific route, the three roads shown in Figure 10 are chosen for training and validation. Aalborg is selected as the training track, and Track-2 and Track-3 as the validation tracks. All models are trained in 500 episodes.

Experiment Results
The vehicle is trained for about 500 episodes (close to 500,000 steps) on the Aalborg map. An episode ends when the vehicle runs off the track or turns in the opposite direction. The length of each episode therefore varies greatly, and a good model could in principle keep an episode running indefinitely, so the maximum number of steps per episode is set to 100,000. Images of size 320 × 240 are collected at a frequency of 5 Hz, and the training is performed on the Aalborg map for hours. In this case, the vehicle collects approximately 100,000 frames for each track. When training autonomous vehicles with deep reinforcement learning, the cumulative return per episode is an important evaluation criterion: the better the model, the higher the return.
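The episode termination rule described above can be sketched as a small predicate. The thresholds are assumptions, not values from the paper: the lateral position is taken to be normalized so that an absolute value above 1 means the vehicle is outside the track edges, and "turning in the opposite direction" is taken to mean a heading more than 90 degrees away from the track direction. The function and argument names are hypothetical.

```python
import math

MAX_STEPS = 100_000  # per-episode step cap stated in the text


def episode_done(track_pos, angle, step, max_steps=MAX_STEPS):
    """Return True when the current episode should end.

    track_pos: lateral offset, assumed normalized to [-1, 1] inside the track
    angle:     angle between the heading and the track direction, in radians
    step:      number of steps taken so far in this episode
    """
    off_track = abs(track_pos) > 1.0      # assumed: vehicle left the track
    reversed_heading = math.cos(angle) < 0.0  # assumed: facing backwards
    timed_out = step >= max_steps         # step cap reached
    return off_track or reversed_heading or timed_out
```

The step cap is what bounds the episode length for a model that never leaves the track.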
During the training, it is found that the training reaches convergence at 200 episodes. Thus, three criteria are adopted to test the model performance: the reward values for 200 episodes, the average velocity along the X-axis, and the distance from the centerline. In this paper, we compare the MAPDDPG model with A3C, PPO, P-DDPG, and DDPG. All models are trained with the same parameters, as listed in Table 1, and the experimental results of five runs are averaged for comparison. The comparison results are shown in Figures 11-13.
It can be seen from Figure 11 that the reward obtained by MAPDDPG is higher than that of the DDPG, P-DDPG (DDPG with only experience priority added), A3C, and PPO algorithms. The reward value of the proposed algorithm increases significantly after 25 episodes, reaching 72,995 after about 50 episodes and becoming stable after about 110 episodes of training. At this point, the reward values of the DDPG, P-DDPG, A3C, and PPO algorithms are 7000, 42,995, 12,772, and 39,995, respectively. This suggests that, with the added attention mechanism, the proposed model can analyze the current state and focus on the important information, which allows the MAPDDPG model to obtain accurate action strategies rapidly. By contrast, the DDPG, P-DDPG, A3C, and PPO algorithms gradually obtain their training results after about 100, 80, 145, and 110 episodes, respectively.
As shown in Figure 12, the comparison of the average speed after 200 episodes indicates that the MAPDDPG model performs better and more stably than the other algorithms. At the 50th episode, the MAPDDPG model reaches 158.2 km/h, whereas the DDPG, P-DDPG, A3C, and PPO algorithms reach only 96.2 km/h, 50.1 km/h, 49.5 km/h, and 82.6 km/h, respectively. This experiment also compares the deviation of the vehicle from the centerline in each episode. As shown in Figure 13, the MAPDDPG travels steadily on the centerline from episode 28, while the DDPG, P-DDPG, A3C, and PPO algorithms tend to stabilize after 75, 62, 146, and 52 episodes, respectively.
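The two computations used throughout these comparisons, averaging a per-episode curve over several runs and measuring how widely the vehicle's distance from the centerline varies, can be sketched as below. This is a minimal illustration and not the authors' evaluation code: the function names are hypothetical, each run is assumed to be a list with one value per episode, and the population variance is an assumed choice since the text does not say which estimator is used.

```python
from statistics import fmean, pvariance


def average_runs(runs):
    """Element-wise mean over runs.

    runs: list of runs, each a list holding one value per episode
          (for example reward or average speed); all runs have equal length.
    """
    return [fmean(values) for values in zip(*runs)]


def deviation_variance(track_positions):
    """Population variance of the distance to the centerline.

    track_positions: signed lateral offsets; the distance is their absolute value.
    """
    return pvariance([abs(p) for p in track_positions])
```

For example, `average_runs([[1.0, 2.0], [3.0, 4.0]])` averages two runs of two episodes each into a single two-episode curve.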
In addition, the convergence time and the variance of the distance from the vehicle to the center of the lane are compared. As listed in Table 3, the MAPDDPG converges about twice as fast as DDPG, and its variance is nearly three times smaller. The number of episodes in which each algorithm drives one full lap smoothly in TORCS is also evaluated. As shown in Table 4, the MAPDDPG performs best in all aspects compared with DDPG, P-DDPG, A3C, and PPO. The MAPDDPG model completes a full lap in 147 of 200 episodes, collides in 36 episodes, and runs out of bounds in 17 episodes. The failures are thus of two kinds: collisions, which account for 36 of the 53 failed episodes, and running out of bounds, which accounts for the remaining 17. Collisions are caused by the vehicle not learning to slow down, while running out of bounds is caused by not controlling speed during a sharp turn.
As shown in Figure 14, the gated recurrent units (GRU), spatial attention (Spat), channel attention (Chan), and their combination are each integrated into the DDPG to study their effects. All models are trained with the same parameters, and all the experimental results presented in the graph are averages of five runs. Figure 14 shows that integrating the GRU accelerates convergence and contributes to higher reward and maximum speed. Integrating the attention mechanisms leads to better use of image and sensor information, as well as safer and more robust autonomous driving behavior. For the model combining all components (comb), the maximum speed increases from 135 km/h to 193 km/h and the number of successful episodes increases from 96 to 147 in comparison to DDPG. Finally, all the salient results are summarized in Table 5.
As shown in Table 5, previous autonomous driving models are trained by feeding all input information into the network without assigning weights to it, whereas MAPDDPG adds an attention mechanism that allows the model to assign different weights to different regions of the input images, thus improving its ability to recognize the environment. All the comparisons show that the MAPDDPG model achieves the best results.

Conclusions
In this paper, a deep reinforcement learning algorithm based on convolutional block attention is proposed to learn self-driving behavior. Using sensor and image information as input, an attention layer is first designed to make the model focus on the key region of the image. Then, a GRU module is designed to optimize the output strategy by using important frames from the past, which gives the model a form of memory. The weights of the attention and GRU layers are fused into the actor-critic network in a hybrid manner. Next, a prioritized experience replay buffer is added to improve sample utilization, and a new reward function is designed to speed up the training process. Owing to the symmetry of the network model, the MAPDDPG model processes data in parallel and captures features better from multiple aspects. Finally, the TORCS simulation demonstrates that the channel and spatial attention mechanisms can improve the performance of deep reinforcement learning algorithms for autonomous driving. Compared with the current state-of-the-art autonomous driving algorithms, including A3C, PPO, DDPG, and P-DDPG, the MAPDDPG model reaches a maximum speed of 193 km/h and a maximum reward of 116,347, and the variance of the distance from the vehicle to the center of the lane is only 0.2 m, indicating that the proposed model achieves excellent performance.
In this paper, the study of autonomous driving strategies is based on a single vehicle. In the future, we will extend the research to multiple vehicles and consider the effect of the other vehicles on a single car.