Imitative Reinforcement Learning Fusing Mask R-CNN Perception Algorithms

Abstract: Autonomous urban driving navigation remains an open problem, with ample room for improvement in unknown, complex environments. This paper proposes an end-to-end autonomous driving approach that combines Conditional Imitation Learning (CIL), Mask R-CNN, and DDPG. In the first stage, data are collected in CARLA, a high-fidelity simulator, and used to train a Mask R-CNN network for object detection and segmentation. The segmented images are fed into the CIL backbone to perform supervised Imitation Learning (IL). In the second stage, DDPG, a Reinforcement Learning algorithm, continues training from the weights learned by the pre-trained CIL model. Combining the two methods speeds up training considerably and can reach performance beyond the human level. We conduct experiments on the CARLA urban driving benchmark. In the final experiments, our algorithm outperforms the original MP by 30%, CIL by 33%, and CIRL by 10% on the most difficult task, dynamic navigation, in new environments and new weather, demonstrating that the proposed two-stage framework generalizes remarkably well to unknown environments on navigation tasks.


Introduction
Autonomous driving has made significant progress in the last decade. To date, there are two main paradigms for vision-based autonomous driving systems: the mediated perception approach, which makes driving decisions by parsing the entire scene, and the behavior reflex approach, which maps input images directly to driving actions via a regressor [1]. The behavior reflex approach, also known as the end-to-end approach, has performed reasonably well over the last five years, and imitation learning for end-to-end autonomous driving has attracted academic attention.
There are two trends in training end-to-end driving models. One is reinforcement learning. To our knowledge, however, many current reinforcement learning-based driving methods rely on trial and error, which makes them difficult to apply in the real world because the training process is not safe. They are also sample-inefficient: for example, the DeepMind parkour paper [2] used 6400 CPU hours to achieve its results. Sample inefficiency matters little in a virtual environment, where training can simply be left to run, but it is a major obstacle in realistic settings such as robotic tasks, where keeping a robot running for many hours is costly. The second is imitation learning. Although imitation learning is easy to understand and implement, policies that learn only from expert demonstrations may be unable to recover from mistakes, because the demonstrations contain no recovery behavior.
The related issues to be addressed in this paper are as follows. It is not hard to define a reward function; designing one that leads the agent to learn the desired behavior is difficult. Sparse rewards can make learning difficult for the agent, but if many manually designed rewards are added and they are poorly designed, the agent may learn unintended behaviors. Current reinforcement learning algorithms are also opaque: in most cases, we have only high-level intuition about what a reinforcement learning algorithm can learn and how it will work. For most problems, we want the algorithm to be predictable and interpretable. The least interpretable and predictable approach is a large neural network that learns the desired knowledge from scratch, given only low-level reward signals or an environment model (like AlphaGo Zero).
In addition, this paper also considers bad weather [12][13][14]. These papers suggest that there is still room for improvement, and it has not yet been determined whether the most promising robustness enhancement techniques require structural modifications, data enhancement schemes, modifications to the loss function, or a combination of these. The solution adopted in this paper is to use Mask-RCNN [10] to process obstacles, including obstacle recognition and instance segmentation, in order to obtain information about them, and to add a pre-trained imitation learning model to improve robustness.
The original innovations and contributions of this work are as follows:
1. This paper proposes a two-stage framework called CIL-DDPG, which combines reinforcement learning and imitation learning, uses obstacle position information as additional input, and uses Mask-RCNN [10] for image segmentation. Combining CIL and DDPG is novel, as previous work has used the two methods separately; in addition, this paper introduces innovations such as obstacle information fusion in CIL.
2. A new reward function is designed to learn an alternative autonomous driving strategy in dynamic scenarios. Extensive experiments on the CARLA simulator benchmark show that the work in this paper enables the network to overcome the effects of image noise.
To better explain the content of this paper, the rest of the paper is organized as follows. Section 2 discusses some important related works, while Section 3 discusses the Mask-RCNN concepts. Section 4 discusses the concepts of conditional imitation learning. The reinforcement learning algorithm is described in Section 5. Section 6 contains simulation experiments. Finally, findings are offered, along with potential future study areas.
Appl. Sci. 2022, 12, 11821

Related Work
Deep learning-based image segmentation commonly builds on backbone networks such as VGGNet [15] and ResNet [16]. Segmentation labels the image pixels so that pixels with the same label share certain features, such as color, intensity, and texture. To date, these two networks remain dominant in the field of feature extraction.
Long et al. [17] presented FCN networks at CVPR in 2015, proposing fully convolutional neural networks that use convolutional layers instead of the final fully connected layer to complete the segmentation task. Many network models still borrow the structure of FCN networks to this day.
Zhang et al. [18] proposed an algorithm called Mask Scoring R-CNN that was used for traffic monitoring to obtain comprehensive vehicle information such as vehicle type, speed, length, current driving lane, etc. Eventually, the average recognition accuracy for the model and the number of axles was above 97% and 88%, respectively.
Among the networks that evolved from Mask R-CNN, the combination of ResNeXt-101 + FPN is arguably the strongest feature learner at present. The specific improvements include the segmentation loss, which changes from the multinomial cross-entropy over a per-pixel softmax used in FCIS to a binary cross-entropy over a per-pixel sigmoid. ROIAlign, which interpolates the feature maps bilinearly, solves the misalignment problem. Therefore, Mask R-CNN is used in this paper.
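To make the change of segmentation loss concrete, the following is a minimal sketch of a per-pixel sigmoid binary cross-entropy for a single class mask. It is written in plain Python for illustration only; the function names `sigmoid` and `mask_bce` are hypothetical and are not taken from any Mask R-CNN implementation.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def mask_bce(logits, target):
    """Mean per-pixel binary cross-entropy for one class mask.

    logits and target are 2D lists of the same shape; target holds 0 or 1.
    Each pixel is scored independently, so classes do not compete as they
    would under a per-pixel softmax.
    """
    eps = 1e-12  # avoids log(0)
    total, count = 0.0, 0
    for row_logits, row_target in zip(logits, target):
        for z, t in zip(row_logits, row_target):
            p = sigmoid(z)
            total -= t * math.log(p + eps) + (1 - t) * math.log(1.0 - p + eps)
            count += 1
    return total / count
```

A zero logit corresponds to a probability of 0.5, so the loss for such a pixel is ln 2 regardless of its label; confident correct predictions drive the loss towards zero.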
In 2016, Bojarski et al. [19] trained a CNN to drive autonomously on different types of roads and achieved over 10 miles of lane-keeping. The network achieved an autonomous driving rate of 98% through real-world testing.
Another deep CNN, PilotNet, was trained using road images from a single front-facing camera paired with driver-generated steering angles captured inside the cabin [20]. A drawback of the above work is that its performance depends on a large amount of manually labeled training data.
Hesham et al. [21] observed that most existing solutions consider only camera frames. They therefore proposed a convolutional long short-term memory recurrent neural network (C-LSTM), an end-to-end approach to learning both the visual and the dynamic temporal dependencies of driving. Although their study ultimately achieved good performance, as a vision-based approach it could not avoid the effects of weather conditions.
Codevilla et al. [22] used high-level commands as additional input to build a conditional imitation learning (CIL) model. Another end-to-end example is the ICCV 2019 Learning to Drive Challenge, in which Columbia University's deep learning team finished in the top two; fusing data from camera sensors and visual maps resulted in significant performance improvements. While these end-to-end approaches have proven to perform well in real-world experiments, their robustness and generality need to be improved.
Wang et al. [23] proposed a new navigation command that does not require human involvement and a new model structure, the angular branching network. Furthermore, in addition to segmentation information, depth information can also improve the performance of the driving model. They conducted experiments in both qualitative and quantitative evaluation to show the effectiveness of the model.
In recent years, deep reinforcement learning (DRL) methods for decision-making in self-driving cars have been increasingly researched. One reason is the great success of DRL on many artificial intelligence tasks; another is the well-known shortcoming of imitation learning, namely its weak generalization performance and the risk of overfitting the training data.
However, decision-making in autonomous driving remains a challenge. Reinforcement learning (RL) has been used to obtain correct behavior in uncertain environments automatically, but it cannot guarantee the performance of the final policy.
Maxime et al. [24] proposed a general approach to strengthen the probabilistic guarantees of RL agents. An exploration policy is derived before training, constraining the agent to choose among actions that satisfy a desired probabilistic specification expressed in linear temporal logic (LTL). Reducing the search space in this way can also simplify reward design.
Jianyu Chen et al. [25] proposed a framework that allows model-free deep reinforcement learning to be applied to challenging urban autonomous driving scenarios. A bird's-eye view input representation was designed to reduce sample complexity, and visual encoding was used to capture low-dimensional latent states. While the proposed method outperforms the baseline, it does not solve the task perfectly. With reinforcement learning (RL), strategies can be learned and improved automatically without manual design. However, current RL methods are usually unsuitable for complex urban scenarios. Furthermore, more complex autonomous driving tasks require a more efficient reward function [26].
This section presents some related work that combines reinforcement learning with imitation learning. The core idea is that the agent can learn quantitative parameters from the image data, which can represent information about the state of the road. These parameters are then used to control the vehicle.
Over the past decade, researchers have achieved good results with end-to-end approaches. However, the approach generally adapts poorly to the environment, especially to dynamic traffic environments. Image information is acquired by the front-facing camera of the ego vehicle and fed into a carefully designed convolutional network to extract features that represent the current state of the vehicle environment. The feature information is then passed to the reinforcement learning framework for learning. Finally, the reinforcement learning model directly outputs the steering, throttle, and other control values that the vehicle will apply at the next moment in time.
Mingxing Peng et al. [27] proposed a two-stage framework called IPP-RL. In their IPP model, the visual information captured by the camera is complemented by the steering angle calculated by a pure pursuit algorithm, so the model can operate well in adverse weather conditions. However, because it relies on visual information, obstacles may be misdetected, and the approach is still inadequate in more challenging and complex driving conditions where the vehicle cannot judge safe distances to avoid collisions.
Xiaodan Liang et al. [28] proposed a general and rule-based Controllable Imitative Reinforcement Learning (CIRL). To alleviate the low exploration efficiency of large continuous action spaces, CIRL initializes the weights of the actor network from a model pre-trained by imitation learning. Furthermore, CIRL proposes adaptive strategies and steering angle reward functions for different control signals (i.e., following, straight ahead, right turn, left turn) to improve the model's ability to handle varying situations. The key references relevant to this study are shown in Table 1.

Structure of Mask R-CNN
Mask-RCNN [10] received the best paper award at ICCV 2017. It improves on Faster-RCNN by adding a fully convolutional segmentation sub-network, so the model changes from two tasks (classification + regression) to three tasks (classification + regression + segmentation). This structure performs instance segmentation of the targets while implementing object detection. Detection is first run on the image to find the ROIs; pixel-level alignment is performed for each ROI using ROIAlign; and the designed FCN framework then predicts, for each ROI, the instance mask of the corresponding class, which yields the final instance segmentation result. Mask R-CNN is a two-stage framework: the first stage scans the image and generates proposals (i.e., regions that are likely to contain a target), and the second stage classifies the proposals and generates bounding boxes and masks. The workflow of Mask-RCNN is shown in Figure 2.
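The alignment step performed by ROIAlign can be sketched as follows. This is a simplified, single-channel illustration in plain Python and not the implementation used in this paper: the function names, the 2 × 2 output size, and the number of sampling points per bin are assumptions chosen for readability, and the coordinate convention is deliberately simple.

```python
import math


def bilinear_sample(feature, y, x):
    """Sample a 2D feature map (list of rows) at a fractional location."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1.0)  # clamp to the valid range
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, roi, out_size=2, samples=2):
    """Average bilinear samples inside each output bin of an RoI.

    roi = (y_min, x_min, y_max, x_max) in feature-map coordinates; the
    coordinates stay fractional and are never rounded, which is what
    removes the misalignment introduced by quantized pooling.
    """
    y_min, x_min, y_max, x_max = roi
    bin_h = (y_max - y_min) / out_size
    bin_w = (x_max - x_min) / out_size
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            acc = 0.0
            for sy in range(samples):
                for sx in range(samples):
                    yy = y_min + (i + (sy + 0.5) / samples) * bin_h
                    xx = x_min + (j + (sx + 0.5) / samples) * bin_w
                    acc += bilinear_sample(feature, yy, xx)
            row.append(acc / (samples * samples))
        output.append(row)
    return output
```

Sampling the map `[[0, 1], [2, 3]]` at its center `(0.5, 0.5)` returns the average of the four values, 1.5, and a constant feature map always yields a constant output regardless of where the RoI boundaries fall.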

The Mask R-CNN training procedure consists of the following steps:
Step 1: The RGB images are fed into ResNet101 for feature fusion;
Step 2: Two sets of feature maps are generated, rpn_feature_maps and mrcnn_feature_maps;
Step 3: rpn_feature_maps of different sizes are sent to the RPN in the feature extraction phase;
Step 4: After the RPN, rpn_class, rpn_box, and the anchors produced by the anchor generator are passed to the Proposal Layer;
Step 5: The proposals of mrcnn_class, mrcnn_bboxes, and input_image_meta are mapped to the final DetectionTargetLayer;
Step 6: A fixed-size feature map is generated for each RoI using an RoI Align layer;
Step 7: The detections are combined with mrcnn_feature_maps in fpn_mask_graph;
Step 8: The final mrcnn_masks are generated.
As can be seen, Mask R-CNN is trained by sending feature maps of different sizes to the RPN in the feature extraction phase. Multiple feature maps are used because the targets in an image come in different sizes: large targets are best detected on low-resolution feature maps, while small targets are best detected on high-resolution feature maps. This is why ResNet + FPN was chosen as the backbone.
After the RPN, a large number of candidate regions are generated, and the corresponding target regions must be cut out of several feature maps of different sizes. The target regions are then fed into ROIAlign (Faster R-CNN uses ROIPooling instead) for subsequent classification and regression.
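How a candidate region is matched to a feature map of suitable resolution can be illustrated with the level-assignment rule from the original FPN paper, k = floor(k0 + log2(sqrt(w·h) / 224)). The text above does not state which rule is used, so the sketch below is an assumption for illustration; the function name and the default values (k0 = 4, levels 2 to 5) follow the FPN paper rather than this work.

```python
import math


def fpn_level(box_h, box_w, k0=4, canonical=224, k_min=2, k_max=5):
    """Return the pyramid level for an RoI of the given size in pixels.

    Higher levels have lower resolution, so large boxes map to high levels
    and small boxes map to low levels. The box must have positive area.
    """
    scale = math.sqrt(box_h * box_w)
    k = math.floor(k0 + math.log2(scale / canonical))
    return int(min(max(k, k_min), k_max))
```

Under these defaults a 224 × 224 box is assigned to level 4, a 112 × 112 box to level 3, and very small or very large boxes are clamped to levels 2 and 5.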

Dataset Acquisition
In order to compare with CIRL and CIL in the later experimental results, this paper uses the same experimental setup as [7] to validate the effectiveness of our imitative reinforcement learning.
The sensor information consists of the images from the forward-facing camera, the velocity measurements from the simulator, and the control commands generated by the navigation planner. In this paper, the CARLA simulator is used, as in [16]. The dataset includes RGB images, controls, and measurements for each step.
The dataset was collected in the CARLA simulator by driving the car through the city with the specified keyboard keys and recording images and labels as a sample set.
To obtain more image information, the image size was set to 800 × 600, the number of vehicles to 15, the number of pedestrians to 30, and the FPS to 10. A total of approximately 700,000 images were acquired, including both the original RGB images and the converted semantic segmentation images.
While the images were captured, the labels and control information were also saved as CSV tables, with each row containing the vehicle position coordinates, vehicle pose, control information, etc.
The CARLA default acquisition method provides 28 labels for each image, but only five labels are used: speed, steer, throttle, brake, and high-level commands, as shown in Table 2.
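Selecting the five labels from the recorded CSV tables could look like the sketch below. The column names in `USED_LABELS` and the contents of `SAMPLE_CSV` are hypothetical, since the exact CSV header is not given here; only the idea of keeping five of the recorded columns comes from the text.

```python
import csv
import io

# Hypothetical column names for the five labels that are kept.
USED_LABELS = ("speed", "steer", "throttle", "brake", "command")

# Hypothetical two-row excerpt; a real table would have 28 label columns.
SAMPLE_CSV = (
    "frame,speed,steer,throttle,brake,command,pos_x,pos_y\n"
    "0,5.5,-0.1,0.6,0.0,2,10.0,4.2\n"
    "1,5.7,0.0,0.5,0.0,2,10.6,4.2\n"
)


def load_labels(csv_file):
    """Read a label table and keep only the five labels used for training."""
    reader = csv.DictReader(csv_file)
    return [{name: float(row[name]) for name in USED_LABELS} for row in reader]


def load_labels_from_text(text):
    """Convenience wrapper for an in-memory CSV string."""
    return load_labels(io.StringIO(text))
```

Calling `load_labels_from_text(SAMPLE_CSV)` returns one dictionary per row with the position columns dropped.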

Image Enhancement
The images captured by the CARLA simulator are 800 × 600 pixels. Training directly on images of this size is slow and prone to over-fitting, so some processing is required first.
First, the image is cropped by removing the sky at the top and the car hood at the bottom, leaving an 800 × 352 image, which is then subsampled twice to reduce the size to 200 × 88.
There are three reasons for this. First, due to the limitations of our equipment, the video memory is too small to process large images, so reducing the size improves the speed of image processing. Second, smaller images can use smaller convolutional kernels to reduce the number of operations, a practice that has been common since VGG. Third, a large image with few convolutional layers leads to a high-dimensional flattened feature vector, so the final output has a huge number of parameters and the model becomes complex, whereas smaller image inputs simplify the model and help avoid overfitting.
As the captured images are too homogeneous, augmentation is required to increase the number and diversity of the data samples. The typical image enhancement methods are flipping the image, adjusting the brightness, adding shadows, and shifting the image. Figure 3 shows a before-and-after comparison for each enhancement method.
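The cropping, subsampling, and two of the augmentations described in this section can be sketched as follows. Images are represented as nested lists of grayscale values to keep the example dependency-free. The crop offset `top=160` is a hypothetical value, since the text states only the resulting height of 352 rows, and negating the steering label when flipping is an assumption about how the labels are kept consistent.

```python
def crop_rows(image, top, height):
    """Keep `height` rows starting at row `top` (drops sky and hood)."""
    return image[top:top + height]


def subsample(image, factor=2):
    """Keep every `factor`-th row and column."""
    return [row[::factor] for row in image[::factor]]


def preprocess(image, top=160, height=352):
    """800 x 600 -> 800 x 352 -> 400 x 176 -> 200 x 88 (width x height)."""
    return subsample(subsample(crop_rows(image, top, height)))


def flip_horizontal(image, steer):
    """Mirror the image and negate the steering label (assumed convention)."""
    return [row[::-1] for row in image], -steer


def adjust_brightness(image, gain):
    """Scale pixel values by `gain` and clamp them to [0, 255]."""
    return [[min(255, max(0, round(p * gain))) for p in row] for row in image]
```

For an input with 600 rows and 800 columns, `preprocess` returns 88 rows of 200 values each. When flipping is used together with high-level commands, left and right turn commands would also need to be swapped, which is omitted here.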

Conditional Imitation Learning
In this work, the structure of the model, the velocity module, and other settings are consistent with CIL [3]. The biggest difference is the use of the output of Mask-RCNN.
The image module (backbone) and the speed module each end in two fully connected layers, with 512 units in the image module and 128 units in the speed module.
The outputs of the backbone and the velocity module are joined by a fully connected layer with 512 units. Each branch is trained separately and is selected by the high-level command through a multi-branch mechanism. Online data augmentation during training of the network is performed as in CIL [3].
The image size is 200 * 88 * M, with M = 3 representing the input RGB image; with M = 1, the input is a semantically segmented image.
The input is normalized by transforming the pixel values to the range [0, 1], which speeds up gradient descent toward the optimal solution and accelerates convergence. Depending on the backbone, the input M is adjusted to provide different inputs to the model, as shown in Figure 4.
In the training phase, the position of each step is defined as the planning path, with one point per 0.4 m along the path. In the test phase, the paths are planned by the planner in CARLA.
The dataset can then be interpreted as:
$$D = \{\langle o_i, s_i, c_i, p_i, a_i \rangle\}_{i=1}^{N}$$
where $o_i$ is the sensor data observation, which in this paper is either RGB image information or a semantic segmentation image; $s_i$ is the vehicle speed; $c_i$ is the high-level command; $p_i$ is the steering result; and $a_i$ is the vehicle ground truth action, including steering angle, acceleration, and braking for each step. The action predicted by the network is defined as follows:
$$\hat{a}_i = F(o_i, s_i, c_i, p_i; \theta) \quad (2)$$
Using the L2 loss function:
$$\ell(\hat{a}_i, a_i) = \| \hat{a}^{s}_i - a^{s}_i \|^2 + \| \hat{a}^{a}_i - a^{a}_i \|^2 + \| \hat{a}^{b}_i - a^{b}_i \|^2$$
where $a^{s}_i$ is the steering angle, $a^{a}_i$ is the acceleration, and $a^{b}_i$ is the braking action. The network is trained to minimize the gap between the predicted commands and the ground truth. In practice, the best parameters $\theta$ are obtained by minimizing the loss:
$$\theta^{*} = \arg\min_{\theta} \sum_{i} \ell\big(F(o_i, s_i, c_i, p_i; \theta), a_i\big)$$
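The command-conditioned prediction and the L2 loss over steering, acceleration, and braking described above can be sketched in a few lines. The dictionary keys, the command names, and the function names are hypothetical and chosen only for illustration; the loss is unweighted, which is an assumption, since no per-term weights are stated.

```python
ACTION_KEYS = ("steer", "accel", "brake")


def select_branch(branch_outputs, command):
    """Return the action predicted by the branch that matches the command.

    branch_outputs maps a command name to that branch's predicted action.
    """
    return branch_outputs[command]


def l2_action_loss(pred, truth):
    """Squared error summed over steering, acceleration, and braking."""
    return sum((pred[k] - truth[k]) ** 2 for k in ACTION_KEYS)


def total_loss(predictions, truths):
    """Sum of the per-sample losses over a dataset or a batch."""
    return sum(l2_action_loss(p, t) for p, t in zip(predictions, truths))
```

For example, with one branch per command ("follow", "left", "right", "straight"), only the output of the branch selected by the current command enters the loss, so each branch receives gradients only from samples recorded under its own command.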

Training and Validation
The inputs include the original RGB image and the semantically segmented image, as well as control information (i.e., measurements). The image is subjected to information feature extraction by a convolutional neural network, which outputs a predicted velocity value.
The predicted velocity values are fused with the control information extracted from the fully connected network, and the model outputs the predicted action values combined with the high-level control commands, which give more accurate results for each branch of the prediction.
As can be seen in Figure 5, the loss of the model tends to decrease as the number of iteration steps increases. Although all the training losses show some jitter along the way, they eventually level off and no longer decrease, indicating that the network has converged and the model has reached its best result. The validation loss is slightly lower than the training loss in the early stage, which suggests that the model is not overfitting during training and that the training result is good.

Markov Decision Process (MDP)
By interacting with the car simulator, the agent can optimize according to the reward signals provided by the environment without human intervention, which can be defined as a Markov Decision Process (MDP).
In an autonomous driving scenario, the MDP is defined by a tuple $\langle I, C, S, A, R, P, \gamma \rangle$ that consists of a set of states $O$, defined by observed frames $I$, velocities $S$, and control commands $C$; a set of actions $A$; a reward function $R(o_t, a_t)$; a transition function $P(o' \mid o, a)$; and a discount factor $\gamma$.
In each state, the agent performs an action $a \in A$. After taking that action and interacting with the environment, the agent receives a reward and arrives at a new state according to a probability distribution. To make driving strategies more realistic, the vehicle must follow the path generated by the topological planner to reach the intended goal. New observations $o'$ are updated from the simulator observations and the sequence of commands towards the goal. The episode terminates when the vehicle reaches the target, collides with an obstacle, or exhausts the time budget.
A deterministic, stationary policy $\pi$ specifies the action that the agent takes in each state. The goal of the driving agent is to find a policy $\pi$ that maps states to actions so as to maximize the total expected discounted reward. This can be learned by using an action-value function:
$$Q^{\pi}(o, a) = \mathbb{E}^{\pi}\Big[\sum_{t} \gamma^{t} R(o_t, a_t)\Big]$$
where $\mathbb{E}^{\pi}$ is the expectation over the distribution of admissible trajectories $(o_0, a_0, \ldots, o_t, a_t)$ obtained by starting from $o_0 = o$, $a_0 = a$ and executing the policy $\pi$ sequentially over time.
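The total discounted reward that the action-value function estimates can be computed for a recorded episode as in the sketch below. The function names are hypothetical, and averaging the returns of several episodes is shown only as the simplest Monte Carlo estimate of the expectation, not as the estimator used in this paper.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one episode, accumulated backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def mc_q_estimate(episodes, gamma):
    """Average discounted return over episodes that share the same start."""
    returns = [discounted_return(rewards, gamma) for rewards in episodes]
    return sum(returns) / len(returns)
```

With rewards `[1, 1, 1]` and a discount factor of 0.5, the discounted return is 1 + 0.5 + 0.25 = 1.75.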
As the autonomous driving system needs to predict continuous movements (steering angle, braking, and acceleration), we use an actor-critic network for the continuous control problem, where both actor and critic are parameterized by a deep network.
In this work, we use the deep deterministic policy gradient (DDPG), a model-free actor-critic algorithm that can operate on a continuous action space. The DDPG algorithm consists of an actor function $\mu(s_t \mid \theta^{\mu})$ and a critic function $Q(s_t, a_t \mid \theta^{Q})$. It performs well on continuous control problems because it uses the gradient of the Q function with respect to the action directly for policy training.
The behavior policy is generated from the current online policy $\mu$ plus random noise $\mathcal{N}$, where $\mathcal{N} \sim OU(\mu, \sigma^2)$ is sampled from an Ornstein-Uhlenbeck (OU) stochastic process that allows action exploration. This noise-driven exploration ensures that the agent behavior does not converge prematurely to a local optimum. The key advantage of our DDPG is that the starting point of exploration is better initialized by learning from human driving, which significantly reduces the exhaustive exploration that can take days in the early stages of DDPG. Starting from a better state, stochastic action exploration allows RL to further refine actions based on simulator feedback and to produce more general and robust driving strategies.
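A discretized Ornstein-Uhlenbeck process for action exploration can be sketched as follows. The mean-reversion rate `theta`, the time step `dt`, and all default values are assumptions that are not given in the text, and the class and function names are hypothetical.

```python
import math
import random


class OUNoise:
    """Euler discretization of dx = theta * (mu - x) * dt + sigma * dW."""

    def __init__(self, mu=0.0, sigma=0.2, theta=0.15, dt=1e-2, x0=None, seed=None):
        self.mu, self.sigma, self.theta, self.dt = mu, sigma, theta, dt
        self.x = mu if x0 is None else x0
        self.rng = random.Random(seed)

    def sample(self):
        """Advance the process by one step and return its new value."""
        drift = self.theta * (self.mu - self.x) * self.dt
        diffusion = self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x


def explore(action, noise, low=-1.0, high=1.0):
    """Add temporally correlated noise to an action and clamp the result."""
    return min(max(action + noise.sample(), low), high)
```

Because each sample depends on the previous one, consecutive perturbations are correlated, which gives smoother exploration of steering and throttle than independent noise at every step. One noise process per action dimension would typically be kept and reset at the start of each episode.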
Unlike the traditional random initialization of $\theta^{\mu}$ in DDPG, our DDPG initializes the actor parameters $\theta^{\mu}$ by loading the weights $\theta^{I}$ pre-trained in the imitation learning stage. In this paper, we define $s_t = \{o_t, f_t\}$ for each step, where $o_t$ is observed from a camera in the simulator and $f_t$ is the additional obstacle perception information defined in the imitation learning phase above.
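When the number of samples in the replay buffer exceeds the batch size, training of the actor and critic networks starts, and they are optimized at each step based on Equations (7) and (9). The loss for the Q-network is defined, as in supervised learning, as the mean squared error (MSE):
$$L(\theta^{Q}) = \frac{1}{N} \sum_{i} \big(y_i - Q(s_i, a_i \mid \theta^{Q})\big)^2$$
where $N$ is the batch size. The label $y_i$ is calculated using the target policy network $\mu'$ and the target Q network $Q'$:
$$y_i = r_i + \gamma\, Q'\big(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big)$$
This makes the learning of the Q-network parameters more stable and easier to converge. Unlike in supervised learning, the label depends on the target networks, which are themselves being learned.
Using a running average, the parameters of the target networks are softly updated toward the parameters of the online networks, with a small update rate $\tau$ as in standard DDPG:
$$\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau)\,\theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau)\,\theta^{\mu'}$$
The actor network, on the other hand, is updated by a gradient step using the sampled policy gradient:
$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i} \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_i,\, a=\mu(s_i)} \, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=s_i}$$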
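The critic label, the MSE loss, and the soft target update described in this section can be written as three small functions. This is a scalar, framework-free sketch under the standard DDPG formulation; the function names are hypothetical, and zeroing the bootstrap term at the end of an episode is common practice that is assumed here rather than stated in the text.

```python
def td_target(reward, next_q, gamma, done):
    """Label y = r + gamma * Q'(s', mu'(s')), with no bootstrap when done."""
    return reward + (0.0 if done else gamma * next_q)


def critic_mse(targets, q_values):
    """Mean squared error between labels and online critic estimates."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / len(targets)


def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau towards the online one."""
    return [tau * w + (1.0 - tau) * w_t
            for w_t, w in zip(target_params, online_params)]
```

With a small `tau`, the target networks change slowly, so the labels produced by `td_target` stay nearly fixed between consecutive critic updates, which is what stabilizes learning.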
First, the simulation environment is initialized and the network parameters are set. The network parameters are passed from the Actor network to the Actor target network, as shown in Figure 6. The target network is mainly used to compute the target action, which is passed to the Critic target network, whose output is passed to the Critic network, as shown in Figure 7. At the same time, the input data are sampled for evaluation and passed to the Actor target network, the Critic target network, and the Critic network, respectively. The Critic network is then updated and passes the information to the Actor network, generating an initial policy, which is explored and given the appropriate reward. When the reward is maximized, an optimal policy (action) is obtained, and this action is applied to the controller to drive the vehicle.

Reward Function of DDPG
The simulation environment is first initialized, and the parameters are set. The parameters are passed from the Actor network to the Actor target network, which is mainly used to compute the target actions. These actions are passed to the Critic target network, and its output values are passed to the Critic network.
At the same time, the input data are sampled for evaluation and passed to the Actor target network, the Critic target network, and the Critic network, respectively. The Critic network is then updated and passes its information to the Actor network, producing an initial policy.
The policy is explored, and a reward is given. When the reward is maximal, an optimal policy (action) is obtained, and the vehicle is driven by applying this action to the controller. The reward design is based on three criteria: efficiency, safety, and comfort. For the DDPG algorithm, the design of the reward function is important because it guides the direction of the gradient of the parameter updates throughout the network. In the framework of reinforcement learning, the process by which an agent learns to adapt to its environment is guided by the reward function. A suitable reward function not only makes the policies learned by the agent more reasonable but also allows the agent to learn faster and improves the convergence of the network.
In Equation (10), the reward term penalizes the vehicle when its speed drops below the reference speed V_ref, and this negative reward is linearly proportional to the vehicle speed. When the vehicle speed reaches the reference speed, the vehicle is rewarded positively. The advantage of this design is that it damps speed changes and prevents the excessive pursuit of reward from causing speed fluctuations and uneven driving.
Equation (11) depends on the command c, with a separate case when c is left or right. When the vehicle is going straight, we want the steering angle to be 0, and we give a larger penalty the larger this value is. When turning left or right, we want the steering to be smooth as the agent negotiates corners, avoids obstacles, etc. Therefore, the magnitude of the steering wheel angle δ is considered, where k_s is the penalty factor for steering and the output range of the steering wheel angle is [−1, 1].
Finally, the penalties for overlapping with the pavement and with the opposite lane are both set to −100. The collision penalty r_d is −100 for collisions with other vehicles and pedestrians and −50 for collisions with other objects (e.g., trees and utility poles).
The magnitude of the steering wheel angle: this reward term penalizes the vehicle for making a large steering input. The vehicle's steering should be smooth when the agent negotiates curves and avoids obstacles, so the magnitude of the steering wheel angle δ is considered.
Here k_s is the penalty factor for steering, the output range of the steering wheel angle is [−1, 1], and R_s is the steering reward term. The final reward r for the different control commands is calculated as follows.
In summary, the final reward function can be obtained as follows:
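As an illustration of how the terms described in this section might be combined, the following Python sketch computes a per-step reward. Only the −100 and −50 penalty values and the [−1, 1] steering range come from the text. The reference speed, the coefficients `K_V`, `K_S`, and `K_STRAIGHT`, the reward at the reference speed, the heavier steering penalty for the straight command, the boolean event flags, and the plain sum of the terms are all assumptions made here and do not reproduce Equations (10) and (11).

```python
from dataclasses import dataclass

V_REF = 30.0       # assumed reference speed; the paper's value and unit are not given here
K_V = 1.0          # assumed slope of the linear speed penalty below V_REF
R_AT_REF = 1.0     # assumed positive reward once the reference speed is reached
K_S = 1.0          # assumed steering penalty factor for left/right commands
K_STRAIGHT = 2.0   # assumed (larger) steering penalty factor when going straight


@dataclass
class StepInfo:
    speed: float
    steer: float                 # steering output in [-1, 1]
    command: str                 # hypothetical labels: "straight", "left", "right"
    on_pavement: bool = False
    in_opposite_lane: bool = False
    hit_vehicle_or_pedestrian: bool = False
    hit_other_object: bool = False


def speed_term(speed: float) -> float:
    """Linear penalty below the reference speed, positive reward at or above it."""
    if speed < V_REF:
        return -K_V * (V_REF - speed)
    return R_AT_REF


def steer_term(steer: float, command: str) -> float:
    """Penalty proportional to |steer|; heavier when the command is to go straight."""
    if not -1.0 <= steer <= 1.0:
        raise ValueError("steer must lie in [-1, 1]")
    k = K_S if command in ("left", "right") else K_STRAIGHT
    return -k * abs(steer)


def penalty_term(info: StepInfo) -> float:
    """Fixed penalties: -100 for pavement or opposite lane, -100 / -50 for collisions."""
    total = 0.0
    if info.on_pavement:
        total -= 100.0
    if info.in_opposite_lane:
        total -= 100.0
    if info.hit_vehicle_or_pedestrian:
        total -= 100.0
    if info.hit_other_object:
        total -= 50.0
    return total


def step_reward(info: StepInfo) -> float:
    """Sum of the three terms (summing them is an assumption of this sketch)."""
    return speed_term(info.speed) + steer_term(info.steer, info.command) + penalty_term(info)
```

For example, `step_reward(StepInfo(speed=20.0, steer=0.5, command="left", hit_other_object=True))` combines a speed shortfall, a moderate steering penalty, and a −50 collision penalty under the assumed coefficients.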

Experimental Settings
This paper uses the CARLA simulator [20], which simulates the urban environment with high fidelity. The CARLA environment contains dynamic obstacles such as other cars and pedestrians crossing the road randomly. The CARLA simulator provides 14 weather conditions, GPS, sensory measurements, and a rough plan consisting of coarse waypoint coordinates in a map without any fine-grained trajectory reference.
We pre-trained the actor network using the same experimental setup as in [27] to demonstrate the effectiveness of our imitative reinforcement learning. Fourteen hours of driving data collected from CARLA were used for training, and the network was trained using the Adam optimizer. In the imitation learning stage, the setup details were the same as mentioned above. In the reinforcement learning stage, the environment was made dynamic by setting the number of traffic participants: the maximum number of vehicles ranges from 20 to 40, and the number of pedestrians ranges from 50 to 100.
The maximum number of episodes is set to 1000, with a maximum of 3000 steps per episode. The remaining parameters used for model training are listed in Table 3 below. TRAIN_PLAY_MODE is the DDPG operating mode: when this parameter is set to 1, the model enters training mode, and when it is set to 0, it enters test mode.
Here tau is the soft-update hyperparameter of the target network, lra is the learning rate of the actor network, buffer_size is the maximum capacity for storing experience samples, batch_size is the number of samples drawn in each batch, and gamma = 0.99 is the reward discount factor of the agent. The larger the discount factor, the more "long-term" the agent's thinking; the smaller it is, the more "immediate." episodes_num is the number of training episodes the agent performs, max_steps is the maximum number of steps the agent can perform per episode, ACTION_NOISE is the number of exploration rounds the agent performs, and NOISE is the coefficient with which the agent explores.
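The effect of the discount factor can be made concrete with a short calculation. The sketch below computes a discounted return and the common rule-of-thumb horizon 1/(1 − gamma); gamma = 0.99 is the value stated above, while the comparison value 0.9 and the function names are assumptions chosen only for illustration.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**k * r_k, accumulated from the last reward backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def effective_horizon(gamma: float) -> float:
    """Rule-of-thumb number of future steps that still carry noticeable weight."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return 1.0 / (1.0 - gamma)


if __name__ == "__main__":
    for gamma in (0.99, 0.9):  # 0.9 is an assumed comparison value
        print(gamma, effective_horizon(gamma), discounted_return([1.0] * 50, gamma))
```

With gamma = 0.99 the agent weighs rewards over roughly 100 future steps, compared with roughly 10 steps for gamma = 0.9, which is the sense in which a larger discount factor is more "long-term."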
Several goal-directed tasks were evaluated using the CARLA benchmark [8], including "straight," "one turn," "navigation," and "dynamic navigation." The experiment is divided into three phases of gradually increasing difficulty, and each phase of the task has two different weather environments, sunny and rainy. In each phase the vehicle is driven toward a goal destination, and the task focuses on how well the vehicle performs in straight-line driving, turning, and decision planning in dynamic traffic, so the effects of traffic signals and yellow lane markings were not taken into account during training and testing. The test map is Town04 in CARLA, a city map with eight lanes in a circle, with elevated, circular, uphill, and downhill sections, three-way intersections, etc.
To consider the influence of weather on decision planning more fully, this paper divides the weather into two groups, Weather1 and Weather2, following the work of Xiaodan Liang et al. [27]. Weather1 includes sunny days, sunny sunsets, rainy days, and days after rain. Weather2 includes cloudy days and rainy days at sunset. The details are shown in Figure 8 below.
Appl. Sci. 2022, 12, x FOR PEER REVIEW 15 of 20
[Figure 8: panels showing sunny days, sunny sunset, rainy days, days after rain, rainy days at sunset, and cloudy days.]

Results
In this experiment, we compare our method with the original MP from [3] and two state-of-the-art approaches: CIL [25] and CIRL. CIRL combines imitation learning with reinforcement learning. To evaluate the generalization performance in unknown weather conditions and environments, we run all four tasks for all methods in four driving conditions, denoted "Training condition," "New town," "New weather," and "New weather&town."
As shown in Figure 9, our model greatly outperforms the MP and CIL baselines on almost all tasks. Although the success rate of the model in this paper is, in some cases, inferior to that of CIRL, this may be because the demonstration dataset used for training is smaller.

In addition, the model achieves better robustness and generalization in unknown environments while taking less time to train. Our actor-critic network was trained using only 200,000 simulation steps, driving non-stop at ten frames per second for about 8 h. This compares to about 12 h for CIRL, which took 300,000 steps. The results in "New weather&town" show that our model improves generalization performance, similar to the effect of using large-scale demonstration data. Our model can obtain a high percentage of completed episodes after a few hours with good sampling efficiency, thanks to a good starting point for exploration provided by the controlled imitation phase. The proposed approach is implemented in the TensorFlow framework.
As can be seen from Figure 10, the average score obtained by our algorithm is about 400. The higher the score, the better the driving behaviour of the algorithm in the experiment and the smaller the collision and violation ratio, which supports the effectiveness of the algorithm. A reward of about 400 indicates that training has converged, i.e., that a reward function designed around specific requirements can lead the agent to meet those requirements; 600 is the maximum reward that can be achieved and is reached only in certain special driving situations.

[Figure: panels titled Straight, One turn, Navigation, and Nav. dynamic.]
This paper analyses the test results for straight ahead, cornering, and mixed conditions during vehicle driving, with a benchmark assessment test mainly for the mixed conditions.
From the results in Tables 4 and 5, we can see that in the same urban environment and under the same weather conditions as in training, the car showed no lane deviation and only occasionally drove in other lanes, with an average of 1.33 occurrences. In the same urban environment under different weather conditions, the car also completed the task well, with only lane-deviation and other-lane violations. Under the same weather conditions in a different urban environment, the model is also robust, with a higher average task completion, a larger average distance travelled by the car before a violation occurred, and a lower number of violations. Under both a different urban environment and different weather conditions, however, the adaptability of the model decreases and the probability of a violation increases accordingly. Although the average task completion is high, this is because the straight-ahead condition is relatively simple and the distance between the start and end points of the task is not very long, so the model is able to complete more of the tasks. In this condition there are no dynamic objects and the task distance is short, so there are no collisions with vehicles, pedestrians, or static objects. Together with the results of the other benchmark evaluations, this shows that the model in this paper can complete the task well in the straight-ahead condition.
From the results in Tables 6 and 7, the driving environment under turning conditions is relatively simple. When the conditions are the same as in training, the model in this paper is basically able to drive through the whole section of road, and the number of collisions is low. In other weather conditions there is a slight decline, but thanks to the high-level control commands the model is less sensitive to weather changes and can still complete the driving task very well. In a different urban environment, adaptability decreases significantly: part of the task is left uncompleted, and the number of collisions increases significantly. This is because a change in the urban environment changes the structure of the image, so the algorithm does not extract enough of the more discriminative features, which affects the neural network's judgment.
From the results in Tables 8 and 9, although the overall performance of the model in this paper is not as good as in the individual driving conditions, the vehicle can complete the driving task well under the same conditions as in training, which is already a good result for a long-distance driving task. The model is robust and generalizable.
In the same urban environment and under different weather conditions, the generalization ability of the model in this paper decreases, but it can still perform the driving task well. The average driving distance before a violation is shorter than with Weather ID = 1, the number of violations increases, and the probability of collision increases, but the test results are still relatively good. This shows that the algorithm is not particularly sensitive to changes in weather and that, in the same test environment, the model retains good stability and prediction ability. However, when the algorithm was transferred to an unfamiliar urban environment, the model was unable to complete the test task properly in Town02 because of the longer distances between task points and the increased number of dynamic factors in the environment. This resulted in more violations, more frequent collisions, shorter normal driving distances, and a significant decrease in driving effectiveness. This may be due to the dynamic and unstructured nature of the new urban environment and the limited feature information extracted by the network, which limit the representational capability of the network and reduce the adaptability of the model.

Conclusions
In this paper, a two-stage framework is proposed to address the challenges of autonomous urban driving in complex environments and adverse weather conditions. The framework combines reinforcement learning with imitation learning. Extensive experiments conducted on the CARLA simulator show that the model significantly improves robustness and generalization performance under a variety of driving conditions.
While the results are encouraging, there is also significant scope for improvement under more challenging driving conditions. The driving agent trained by the CIL-DDPG method learns reasonably good driving strategies for navigation tasks in dynamic environments, such as slowing down to avoid collisions with cars, but there are more robust driving strategies that the agent still needs to learn, such as obeying traffic rules and avoiding collisions with pedestrians.
With the continuous development of artificial intelligence applications and autonomous driving technology, urban autonomous driving technology has an extremely important role to play in the safety of vehicles and the efficiency of traffic travel. In this paper, an end-to-end approach based on conditional imitation learning is proposed using the idea of multi-input neural networks and validated by simulation with the CARLA simulator to demonstrate the effectiveness of the algorithm and provide a feasible solution for the study of autonomous driving technology.

Institutional Review Board Statement: Ethical review and approval were waived for this study because it does not involve humans or animals.

Informed Consent Statement: Not applicable.
Data Availability Statement: As the project involves confidentiality, research data is not provided. If readers need research data, please contact the corresponding author.

Conflicts of Interest: The authors declare no conflict of interest.