An NN-Based Double Parallel Longitudinal and Lateral Driving Strategy for Self-Driving Transport Vehicles in Structured Road Scenarios

Abstract: Studies on self-driving transport vehicles have focused on longitudinal and lateral driving strategies in automated structured road scenarios. In this study, a double parallel network (DP-Net) combining longitudinal and lateral strategy networks is constructed for self-driving transport vehicles in structured road scenarios, based on a convolutional neural network (CNN) and a long short-term memory network (LSTM). First, in feature extraction and perception, a preprocessing module is introduced that ensures the effective extraction of visual information under complex illumination. Then, a parallel CNN sub-network is designed, based on multifeature fusion, to ensure better autonomous driving strategies. Meanwhile, a parallel LSTM sub-network is designed, which uses vehicle kinematic features as physical constraints to improve the prediction accuracy for steering angle and speed. The Udacity Challenge II dataset is used as the training set, matching the proposed DP-Net's input requirements. Finally, for the proposed DP-Net, the root mean square error (RMSE) is used as the loss function, the mean absolute error (MAE) is used as the metric, and Adam is used as the optimization method. Compared with competing models such as PilotNet, CgNet, and the E2E multimodal multitask network, the proposed DP-Net is more robust in handling complex illumination. The RMSE and MAE values for predicting the steering angle of the E2E multimodal multitask network are 0.0584 and 0.0163 rad, respectively; for the proposed DP-Net, those values are 0.0107 and 0.0054 rad, i.e., 81.7% and 66.9% lower, respectively. In addition, the proposed DP-Net also has higher accuracy in speed prediction. Upon testing on the collected SYSU Campus dataset, good predictions are also obtained. These results should provide significant guidance for using a DP-Net to deploy multi-axle transport vehicles.
The experimental results show that the proposed DP-Net can accurately predict steering angle and speed, further improving robustness to complex illumination and the accuracy of lateral and longitudinal prediction. In addition, a new SYSU Campus dataset is collected for evaluation and testing. However, the more complex kinematic characteristics (multi-axle steering angles, position, and pose) of the multi-axle transport vehicle are not considered in our approach. Moreover, due to the different distributions of the training and test datasets, the speed predictions of the DP-Net in this study are not sufficiently accurate.


Introduction
The limited scenario of structured roads is an important market for the implementation of autonomous driving, and the driving strategy of transport vehicles is the key technology for its implementation. Traditional decision-making algorithms are based on vehicle kinematic models that combine environmental information with expert control logic to generate decision commands for vehicle driving [1,2]. The most important advantage of explicit vehicle kinematic modeling is its interpretability, which allows it to be subsequently extended to multi-axle steering transport vehicles. However, because of the high complexity and variability of environmental lighting dynamics, various complex driving strategies need to be set manually to cover different weather scenarios and unexpected situations.

Related Work
Dean Pomerleau developed the seminal work of ALVINN [7], which adopted networks that were "shallow" and tiny (mostly fully connected layers) as compared with modern networks with hundreds of layers, and the experimental scenarios were mostly simple roads with few obstacles.
Many studies have started using deep neural networks for environment perception and steering command prediction with the development of deep learning. NVIDIA proposed PilotNet, a CNN-based autonomous driving system [5], which is an end-to-end driving decision algorithm for vehicle steering angle control. PilotNet predicts the steering angle from the image of the road ahead, and since good prediction results were achieved in on-road driving, it has become the base model for many subsequent studies. However, this algorithm does not consider the temporal relationship between consecutive input frames, and it has limited accuracy in predicting driving commands.
In subsequent research, a substantial number of studies have built on PilotNet's end-to-end architecture. The visual temporal dependencies of the input data were considered in [15], where a convolutional long short-term memory (C-LSTM) network was proposed for steering control. In [16], surround-view cameras were used for end-to-end learning, the motivation being that human drivers also use the rear- and side-view mirrors while driving; all of this vehicle information must therefore be gathered and integrated into the network model to produce a suitable control command. The above methods exploit temporal information during vehicle driving and achieve improved performance; however, their model input provides only single-source visual information, which leads to poor perception under backlighting and complex light-and-shadow conditions.
UC Berkeley proposed a network with fully convolutional networks and long short-term memory networks as branches [6]. They introduced semantic segmentation methods to enhance the understanding of driving scenarios and to predict discrete or continuous driving behavior. Peking University proposed the STConv + ConvLSTM + LSTM network [17] to predict the lateral and longitudinal control of self-driving vehicles, using building techniques and modules such as spatio-temporal convolution, multiscale residual aggregation, and convolutional long short-term memory networks. The work most relevant to ours is [11]: the authors proposed a multimodal multitask network with five convolutional layers and four fully connected layers, and they used LSTM networks to extract previous feedback speeds as extra features. We argue that this is inadequate for effectively capturing the temporal dependence of the steering angle in autonomous driving.
Our work enhances lighting robustness by exploring combinations of multiple spatial features and incorporates additional features from vehicle kinematics as physical constraints, thereby improving prediction accuracy. Validation on real data shows that the proposed DP-Net better captures spatial-temporal information based on vehicle kinematics and predicts the steering angle and speed more accurately.

Problem Formulation
For the end-to-end training of self-driving transport vehicles, the central issue is how to measure the quality of a longitudinal and lateral decision-making model. Following the treatment in prior studies [5,18], we regard the behavior of human drivers as the reference for "good" driving skills; in other words, the values produced by human drivers are treated as the ground truth. We then quantitatively evaluate the model's decision-making by calculating the divergence between the model-predicted values and the ground truth. However, the aggregate divergence used in Nvidia's report [5] is not sufficiently intuitive. Therefore, in this study, the divergences between the predicted values and the ground truth of the lateral steering angle and the longitudinal speed are calculated separately.
The general objective of longitudinal and lateral prediction is to predict the steering angle p_1 and speed p_2 given an image x, a steering angle sequence s_1, and a speed sequence s_2. Typically, we use an image as input to encapsulate the spatial information. We also use the steering angle sequence s_1 and speed sequence s_2 as inputs to encapsulate the temporal information, and then learn a function F : (x, s_1, s_2) → (p_1, p_2) for multitask longitudinal and lateral prediction. The steering angle is a continuous value, so this is a regression problem. We adopt a simple form of squared loss that is amenable to gradient back-propagation and minimize the objective

L_steer = Σ_t (s_{t,steer} − s̃_{t,steer})², (2)

where s_{t,steer} denotes the steering angle given by a human driver at time t and s̃_{t,steer} is the learned model's prediction.
The speed loss is obtained in the same way:

L_speed = Σ_t (s_{t,speed} − s̃_{t,speed})², (3)

where s_{t,speed} denotes the speed of a human driver at time t and s̃_{t,speed} is the learned model's prediction. In this study, we mainly train the model by minimizing the above two squared losses. We introduce our method in the next section.

Proposed Method: DP-Net
For clarity of presentation, we conceptually segment the proposed double parallel network (DP-Net) into three sub-networks with complementary functionalities. As shown in Figure 1, the original red (R), green (G), blue (B) image and the processed image are fed into the first parallel network. Through a spatial-feature-extracting sub-network, we obtain a fixed-dimension feature representation that succinctly models the complex-light visual surroundings of the car. At the same time, the steering angle sequence and speed sequence are fed into the second parallel network. A temporal-feature-extracting sub-network generates a fixed-dimension feature representation of the same size that succinctly models the continuous internal kinematic status of the car. The extracted temporal and spatial features are then passed to the longitudinal and lateral prediction sub-network, which performs the multitask prediction of steering angle and speed.

Figure 1. The architecture of the proposed double parallel network for the task of predicting steering angle and speed. The arrows in the network denote the direction of data forwarding.

Figure 2 provides an anatomy of the spatial-feature-extracting sub-network. This sub-network is designed to handle visual perception under complex lighting conditions. To ensure that the sub-network can fully extract multiple types of spatial features, we first create an image preprocessing module.

Spatial-Feature-Extracting Sub-Network
Preprocessing Module. RGB images represent color using three channels: red (R), green (G), and blue (B). However, the quality of acquired lane images can degrade under insufficient light at night or in low-light road environments. There is then no apparent difference between the lane lines and the background in the original RGB images, so using sub-networks to extract spatial features directly from raw RGB lane images collected at night leads to insufficient capture of semantic information. Inspired by image enhancement algorithms, we use a grayscale transformation [8] to increase the contrast between lane lines and the road surface, as well as an HSV color space transformation [19] to improve robustness to illumination. The grayscale transformation maps the input image to the output image through specific pixel operation rules and does not change the spatial relationships within the image. Once the grayscale transformation function T is determined, the grayscale transformation g(·) is defined as

g(x, y) = T[f(x, y)],

where f(x, y) is the input image and g(x, y) is the output image. Meanwhile, the HSV color space represents hue (H), saturation (S), and value (V); compared with RGB images, HSV images are therefore more consistent with human intuition regarding color.

In summary, as shown in Figure 3, the original RGB image and the preprocessed image are used as the dual inputs of the parallel CNN described in the following subsection, which facilitates the subsequent perception of complex illumination.

Figure 2. The design of the spatial-feature-extracting sub-network. The proposed sub-network has several distinctive traits, including a parallel structure, a larger convolution kernel, and a smaller number of convolution kernels.
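As a concrete illustration, the preprocessing step can be sketched in plain NumPy. The specific luminance weights, the linear contrast stretch, and the `frame` placeholder below are standard choices for illustration, not taken from the paper:

```python
import numpy as np

def to_grayscale(rgb):
    """Pointwise grayscale transform g(x, y) = T[f(x, y)]: here T is the
    standard luminance weighting, which preserves spatial relationships."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def stretch_contrast(gray):
    """Linear contrast stretch to [0, 255], increasing the separation between
    lane lines and the road surface in low-light frames."""
    lo, hi = gray.min(), gray.max()
    if hi - lo < 1e-6:
        return np.zeros_like(gray)
    return (gray - lo) / (hi - lo) * 255.0

def to_hsv(rgb):
    """RGB -> HSV with each channel in [0, 1] (value = max, saturation = chroma/max)."""
    rgb = rgb / 255.0
    v = rgb.max(axis=-1)
    c = v - rgb.min(axis=-1)                       # chroma
    s = np.where(v > 0, c / np.maximum(v, 1e-12), 0.0)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    h = np.select(
        [c == 0, v == r, v == g],
        [0.0,
         ((g - b) / np.maximum(c, 1e-12)) % 6,     # red is the max channel
         (b - r) / np.maximum(c, 1e-12) + 2],      # green is the max channel
        default=(r - g) / np.maximum(c, 1e-12) + 4,  # blue is the max channel
    ) / 6.0
    return np.stack([h, s, v], axis=-1)

# a stand-in camera frame; real input comes from the front-view camera
frame = np.random.default_rng(0).integers(0, 256, (66, 200, 3)).astype(float)
processed = stretch_contrast(to_grayscale(frame))
hsv = to_hsv(frame)
```

The original RGB frame and the preprocessed result would then form the dual inputs of the parallel CNN.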

Parallel CNN Design.
It has been demonstrated, based on AlexNet [20,21], that the CNN designed by Yang Z. et al. [11] had a good ability to extract visual features and was capable of directly regressing the steering angle from raw pixels. As shown in Figure 2, we propose an improved CNN structure for this task, with two improvements targeting complex lighting and structured scenes. To fully extract the multidimensional features of complex illuminated lane images, we restructured the single-input CNN into two dual-input parallel CNNs with the same structure. In this way, we can input both the original RGB image and the preprocessed image. Since the lane is the critical information for vehicle steering, a large kernel size (11 × 11) was kept in the first layer.
The other improvement changed the convolutional stack to four convolutional layers, four pooling layers, and four fully connected layers. As shown in Figure 2, each convolutional layer contains an entire set of convolution kernels, and each kernel produces a separate two-dimensional activation map. These activation maps are stacked along the depth dimension to produce the output volume. We also reduced the numbers of kernels to the combination 64-128-192-192; the number of convolution kernels determines the number of output volumes. Previous methods [20,21] adopted five convolutional layers and four fully connected layers, with kernel numbers in the combination 96-256-384-384-256. Going deep is essential for deep learning; however, each convolutional layer's capacity for learning more complex patterns should also be guaranteed [22]. Therefore, while reducing the number of convolutional layers, we also reduced the number of convolution kernels accordingly. The spatial feature extraction sub-network finally outputs 100-dimensional visual information, which is then fused and used for multitask prediction in the longitudinal and lateral prediction sub-network below. The experiments show that these two improvements (multidimensional inputs and new convolution parameters) do improve the accuracy of steering angle prediction in structured scenes under complex lighting.

Figure 4 shows the anatomy of the temporal-feature-extracting sub-network. This sub-network is designed to capture temporal features of continuous vehicle kinematic transitions, such as steering angle sequences and speed sequences.
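A minimal sketch of one configuration consistent with this description can be written with the Keras functional API (the paper's code is in TensorFlow). Only the 11 × 11 first kernel, the 64-128-192-192 kernel counts, and the 100-dimensional output come from the text; the 66 × 200 input resolution, the later kernel sizes, the 2 × 2 pooling, and the layer names are assumptions for illustration:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_branch(name):
    # 66 x 200 x 3 input resolution is an assumed placeholder
    inp = layers.Input(shape=(66, 200, 3), name=f"{name}_image")
    x = inp
    # kernel counts 64-128-192-192 from the text; kernel sizes after the
    # first 11 x 11 layer are assumptions
    for filters, kernel in [(64, 11), (128, 5), (192, 3), (192, 3)]:
        x = layers.Conv2D(filters, kernel, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)  # BN after convolution, as stated
        x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    return inp, layers.Dense(100, activation="relu", name=f"{name}_feat")(x)

# dual inputs: the original RGB frame and the preprocessed frame
rgb_in, rgb_feat = conv_branch("rgb")
pre_in, pre_feat = conv_branch("pre")
# fuse the two parallel branches into the 100-d visual feature
visual = layers.Dense(100, activation="relu", name="visual_feat")(
    layers.Concatenate()([rgb_feat, pre_feat]))
spatial_net = Model([rgb_in, pre_in], visual)
```

The two branches share a structure but not weights, mirroring the "two dual-input parallel CNNs with the same structure" described above.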

Temporal-Feature-Extracting Sub-Network
The steering angle prediction network in [11] used only a single frame image as input. However, in our research project, the steering driving strategy of the multi-axle self-driving transport vehicle is related not only to the input image but also to the steering angle at the previous moment. The steering angles are continuous values in the time dimension. We use recurrent neural networks to capture the temporal dependence in the steering angle sequence, which improves the accuracy of steering angle prediction in the dark and the stability of self-driving transport vehicles.
In fact, both the steering angle of the lateral control and the speed value of the longitudinal control [23,24] affect the driving strategy of self-driving transport vehicles. The vehicle speed during driving depends on various factors, including the driver's habits, the surrounding traffic conditions, and road conditions; these factors cannot be captured by the front-view camera alone. Therefore, in this study, the feedback speed sequences, set as additional auxiliary kinematic information, are also input into the model. The recurrent neural networks likewise capture the temporal dependence in the speed sequences, improving speed prediction accuracy and enabling longitudinal control of autonomous vehicles. LSTM is a variant of the recurrent neural network that can capture long-term time-dependent information [25].
Therefore, in this study, the single-input LSTM based on [11] is improved to two dual-input parallel LSTMs with the same structure, facilitating the simultaneous extraction of temporal features from the steering angle sequences and speed sequences. As seen in Figure 4, the internal structure of the LSTM unit is illustrated by the dashed rectangular box, where x_t denotes the input to the LSTM cell at moment t; c_t denotes the cell state, which records the information passed over time; i_t denotes the input gate, which determines how much of the input x_t enters the current cell state c_t; f_t denotes the forgetting gate, which determines how much information from the previous cell state c_{t−1} is retained in c_t; o_t denotes the output gate, which controls how much of c_t passes to the output h_t of the current state; h_{t−1} indicates the output at the previous moment; and m_t is the state candidate value.
LSTM controls the cell state through gating units. First, the forgetting gate decides what information to discard from the cell state: based on the previous output h_{t−1} and the current input x_t, a sigmoid layer generates the forgetting probability f_t. Second, new information to update the cell state is generated in two steps: the input gate determines the information i_t to be updated through a sigmoid layer, and a tanh layer generates the state candidate m_t. The previous cell state c_{t−1} is multiplied by f_t, and i_t ∘ m_t is added to obtain the new cell state c_t. Finally, the output is decided: a sigmoid layer yields the initial output o_t, and the new cell state c_t is passed through the tanh function and multiplied by o_t to give the current output h_t. The working principle is shown in Equations (5)-(10):

f_t = σ(w_f · [h_{t−1}, x_t] + b_f), (5)
i_t = σ(w_i · [h_{t−1}, x_t] + b_i), (6)
m_t = tanh(w_m · [h_{t−1}, x_t] + b_m), (7)
c_t = f_t ∘ c_{t−1} + i_t ∘ m_t, (8)
o_t = σ(w_o · [h_{t−1}, x_t] + b_o), (9)
h_t = o_t ∘ tanh(c_t), (10)

where w and b denote the weight vector and offset of the corresponding gating unit, respectively; σ(·) denotes the sigmoid activation function; tanh(·) denotes the hyperbolic tangent activation function; and ∘ denotes the Hadamard product. As shown in Figure 4, the input x_t to the LSTM unit represents the steering angle sequence or speed sequence. At moment t, the previous LSTM cell output h_{t−1} and cell state c_{t−1}, as well as x_t, are input to the LSTM cell to obtain the temporal feature output h_t of the current moment. Finally, the temporal-feature-extracting sub-network outputs two 100-dimensional vectors of vehicle dynamics information, which are merged in the longitudinal and lateral prediction sub-network described below and used for multitask prediction. As shown in Figure 5a, the longitudinal and lateral control networks predict the driving strategy based on the new fused features.
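The step described by Equations (5)-(10) can be checked with a few lines of NumPy. The weight layout below (one matrix per gate acting on the concatenated [h_{t−1}, x_t]) is the standard formulation the equations describe; the sequence length and random weights are illustrative only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, w, b):
    """One LSTM step implementing Equations (5)-(10); w[k] maps the
    concatenated [h_{t-1}, x_t] vector and b[k] is the matching offset."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(w["f"] @ z + b["f"])   # (5) forgetting gate
    i_t = sigmoid(w["i"] @ z + b["i"])   # (6) input gate
    m_t = np.tanh(w["m"] @ z + b["m"])   # (7) state candidate
    c_t = f_t * c_prev + i_t * m_t       # (8) new cell state (Hadamard products)
    o_t = sigmoid(w["o"] @ z + b["o"])   # (9) output gate
    h_t = o_t * np.tanh(c_t)             # (10) current output
    return h_t, c_t

# run over a short steering angle sequence with illustrative random weights
rng = np.random.default_rng(0)
hid, inp = 100, 1                        # 100-d temporal feature, scalar angle
w = {k: rng.normal(scale=0.1, size=(hid, hid + inp)) for k in "fimo"}
b = {k: np.zeros(hid) for k in "fimo"}
h = c = np.zeros(hid)
for angle in [0.01, 0.02, 0.04]:
    h, c = lstm_step(np.array([angle]), h, c, w, b)
```

The final h is the 100-dimensional temporal feature for one branch; the speed branch is identical in structure.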
We propose a longitudinal and lateral prediction sub-network consisting of feature fusion (merge) layers and fully connected layers. Unlike Yang Z. et al. [11], who fused only speed sequence features, in this study we separately fuse the visual features and the vehicle kinematic features (steering angle and speed) in the feature merge layer. In fact, the excellent results achieved by ResNet [26] in image classification also demonstrate that feature fusion can enhance network learning, improve the expressiveness of the network, and help the model converge more accurately and faster. In the merge layer, the fusion method is feature cascading (concatenation), as Figure 5b shows. Feature cascading stitches together the feature vectors output by the two branch networks; the newly generated feature is the result of concatenating the two feature vectors. Unlike feature summation, which is a simple superposition, feature cascading significantly increases the dimension of the new feature vector.

Longitudinal and Lateral Prediction Sub-Network
Two 100-dimensional feature vectors, output separately by the spatial feature extraction network and the temporal feature extraction network, are stitched together to generate a 200-dimensional feature. The new feature vector is passed to the fully connected layers, where the numbers of neurons in the four fully connected layers of the lateral prediction network are 200, 100, 50, and 1; the longitudinal prediction network parameters are the same. Finally, the steering angle and speed are simultaneously output to achieve the self-driving transport vehicle's lateral and longitudinal decision making.
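The fusion and prediction head above can be sketched in Keras. The 200-100-50-1 neuron counts and the concatenation merge come from the text; the layer names, activations, and the use of two separate identical heads are assumptions for illustration:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# 100-d outputs of the spatial and temporal feature extraction networks
spatial = layers.Input(shape=(100,), name="spatial_feat")
temporal = layers.Input(shape=(100,), name="temporal_feat")
fused = layers.Concatenate(name="merge")([spatial, temporal])  # 200-d cascade

def head(x, name):
    # four fully connected layers with 200, 100, 50, and 1 neurons
    for units in (200, 100, 50):
        x = layers.Dense(units, activation="relu")(x)
    return layers.Dense(1, name=name)(x)

# identical heads for lateral (steering angle) and longitudinal (speed) outputs
pred_net = Model([spatial, temporal],
                 [head(fused, "steer"), head(fused, "speed")])
```

Concatenation (rather than summation) is what doubles the fused feature dimension from 100 to 200 before the fully connected stack.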

Experiments Setup
Dataset Description. We perform evaluations on a standard benchmark that is widely used in the community, namely Udacity Challenge II [27]. The Udacity dataset is mainly composed of video frames taken from structured urban roads and contains multiple frames with severe lighting changes. As shown in Figure 6, it fits our model's research scenario. Specifically, the data-collecting cars have three cameras mounted at the left/middle/right around the rear-view mirror. Videos are captured at a rate of 20 FPS. For each video frame, the data provider recorded the corresponding geo-location (latitude and longitude), timestamp (in milliseconds), and vehicle states (wheel angle, torque, and driving speed). Recalling the previously designed double parallel network DP-Net, the video frame input of the spatial feature extraction sub-network and the vehicle state input (steering angle, speed) of the temporal feature extraction sub-network are precisely provided by the Udacity Challenge II dataset.

Network Optimization. The experiments are conducted on an Intel(R) Core(TM) i7-8700 CPU and an NVIDIA GeForce GTX 1660 GPU. All code is written in Google's TensorFlow framework.
The following are some crucial parameters for re-implementing our method: dropout with a ratio of 0.5 is used in the fully connected layers; the learning rate is initialized to 1 × 10⁻⁴ and halved when the objective plateaus. We randomly draw 5% of the dataset for validating models and always retain the best model over the whole validation process. We adopt Adam [28] as the stochastic gradient solver, an algorithm for first-order gradient-based optimization of stochastic objective functions. Model training requires about 20 h on the GPU. Inspired by AlexNet's LRN [21], and in order to improve the generalization ability of the network, we introduce batch normalization after the convolutional layers [29].
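In TensorFlow terms, these settings map onto standard components; the `patience` value and `monitor` target below are assumptions, since the paper only states that the rate is halved on a plateau:

```python
import tensorflow as tf

# Adam with the stated initial learning rate of 1e-4
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

# halve the learning rate when the validation objective plateaus
# (patience of 5 epochs is an assumed value)
halve_on_plateau = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=5)

# dropout with a ratio of 0.5, as used in the fully connected layers
dropout = tf.keras.layers.Dropout(rate=0.5)
```

These would be passed to `model.compile(...)` and `model.fit(..., callbacks=[halve_on_plateau])` in a full training script.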
In neural networks, the loss function is used to measure the difference between the predicted value and the ground truth value. We define two types of loss items in the preliminary prediction task; namely, L steer and L speed . The steering angle prediction loss L steer is described in Equation (2) and the speed prediction loss L speed is defined in Equation (3).
In addition, we add the mean absolute error (MAE) as a metric to monitor model performance, which better reflects the actual prediction error. The final objective function combines the steering loss and the speed loss:

L = L_steer + L_speed. (11)
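The loss items of Equations (2) and (3) and the RMSE/MAE monitoring metrics can be written directly in NumPy; reading the combined objective as an unweighted sum of the two loss items is an assumption for illustration:

```python
import numpy as np

def squared_loss(truth, pred):
    # Equations (2)/(3): sum of squared differences over the sequence
    return float(np.sum((np.asarray(truth) - np.asarray(pred)) ** 2))

def rmse(truth, pred):
    # root mean square error, used as the reported loss metric
    return float(np.sqrt(np.mean((np.asarray(truth) - np.asarray(pred)) ** 2)))

def mae(truth, pred):
    # mean absolute error, used to monitor model performance
    return float(np.mean(np.abs(np.asarray(truth) - np.asarray(pred))))

def total_objective(steer_truth, steer_pred, speed_truth, speed_pred):
    # combined objective, read here as the unweighted sum of the two losses
    return (squared_loss(steer_truth, steer_pred)
            + squared_loss(speed_truth, speed_pred))
```

In a framework implementation these would be supplied as the loss and metric functions during model compilation.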

Comparison with Competing Algorithms
First, we evaluate the performance of the end-to-end steering angle prediction. We compare the proposed DP-Net with several competing algorithms. Brief descriptions of these competitors are given below:

•
PilotNet is the network proposed by NVIDIA. It consists of five convolutional layers and five fully connected layers, which use small kernel sizes (3 × 3, 5 × 5). We reimplemented this according to NVIDIA's original technical report. All input video frames are resized to 200 × 66 before feeding PilotNet.

•
CgNet is an open-source model with an excellent ranking in the Udacity Challenge II. Compared with PilotNet, it adjusts the kernel parameters to use only three convolutional layers and two fully connected layers, and it only uses a small kernel size (3 × 3). Note that the input of both PilotNet and CgNet is only the original single frame image, which ignores the visual features in dark light.

•
The E2E multimodal multitask network is a multimodal multitask network with five convolutional layers and four fully connected layers, based on the AlexNet architecture, proposed by Yang Z et al. We re-implemented it on Udacity Challenge II according to the authors' paper. Note that, although the authors extracted temporal features with an LSTM network, these features were not passed on to the steering angle prediction. Therefore, the internal continuous kinematic state of the vehicle is ignored.
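For reference, the convolutional stack of NVIDIA's PilotNet, per the original report (three strided 5 × 5 layers followed by two 3 × 3 layers on 66 × 200 inputs), can be traced shape by shape. This is a sketch of the feature-map sizes only, not of the full network:

```python
def conv_out(size, kernel, stride):
    """Output size of a valid-padding convolution."""
    return (size - kernel) // stride + 1

# PilotNet's convolutional stack: three 5x5 layers with stride 2,
# then two 3x3 layers with stride 1, applied to 66x200 input frames.
h, w = 66, 200
for k, s in [(5, 2), (5, 2), (5, 2), (3, 1), (3, 1)]:
    h, w = conv_out(h, k, s), conv_out(w, k, s)
print(h, w)  # → 1 18, the 64-channel feature map fed to the FC layers
```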
In this section, we evaluate the performance of the proposed DP-Net, which merges the visual and kinematic features extracted by the double parallel network to predict the lateral steering angle. First, we focus on predicting the steering angle. The RMSE (root mean square error) and MAE of the steering angles are shown in Table 1, from which we make several immediate observations. First, image preprocessing and the input of multidimensional image features strongly correlate with the final performance. In particular, image enhancement and HSV color space conversion are advantageous for representing complex lighting conditions. It is well known that lane lines are an essential feature of structured roads. Our image preprocessing module increases the contrast between the lane lines and the road surface in the image. We then merge the original RGB image features and the HSV color space features for the lateral prediction sub-network. Compared to competing algorithms with single raw-image inputs, our network has significantly better accuracy and robustness. To further investigate the experimental results, Figure 7a plots the steering angles in a testing video sequence. We also examine two sample points (t = 3 and t = 1840), which correspond to the intense-light and dark-light road scenes shown in Figure 7a. Clearly, our model predicts very accurately. As shown in Figure 7c, the steering angle errors between the predicted values and the Udacity dataset values are limited to ±0.02 rad. Additionally, we will try to use transfer learning to deal with the individual errors during the subsequence (t = 4000~4300) in future research. Secondly, besides the input of multidimensional visual features, our model also clearly differs from others by fusing vehicle kinematic features.
Inspired by the kinematic modeling of traditional decision algorithms, we conjecture that it is inherently difficult to predict the steering angle from visual input alone. Therefore, we add the steering angle sequence of the previous ten frames as the model's physical constraints. As shown in Table 1, the RMSE and MAE of the proposed DP-Net are reduced from 0.3063 and 0.2213 to 0.0107 and 0.0054 compared with NVIDIA's PilotNet. Noticeably, compared with the E2E multimodal multitask network, we reduce the RMSE by 81.7% (from 0.0584 to 0.0107), the MAE by 66.9% (from 0.0163 to 0.0054), and the maximum prediction error by 65.3% (from 0.2732 to 0.0948). This shows that the additional vehicle kinematic information, from the steering angle sequence, provides richer and more comprehensive inputs. It improves the continuity of steering angle prediction and reduces the jerking of the vehicle's steering wheel. As can be seen from Table 1, our experimental results verify the directional effects of the conjecture. Additionally, our results suggest ways to fuse more vehicle dynamic information, such as the position and attitude of multi-axle transport vehicles, in subsequent projects; we describe this further in the ablation analysis.
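The ten-frame steering-angle constraint can be assembled as sliding windows over the recorded sequence; `kinematic_windows` is a hypothetical helper illustrating how such an input could be constructed:

```python
def kinematic_windows(steer_log, n_prev=10):
    """For each frame t, collect the previous `n_prev` steering angles,
    which serve as the LSTM's physical-constraint input. Frames earlier
    than n_prev are skipped here; padding is an alternative choice."""
    return [steer_log[t - n_prev:t] for t in range(n_prev, len(steer_log))]

angles = [0.01 * i for i in range(13)]   # toy steering-angle log (rad)
windows = kinematic_windows(angles)      # 3 windows: for frames 10, 11, 12
```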
Lastly, unlike PilotNet and CgNet, with their small convolutional kernels, and the E2E multimodal multitask network, with its AlexNet structure, we designed a new combination of convolutional kernel size and number, which we examine in the subsequent ablation analysis.
The proposed DP-Net then merges the continuously extracted visual and kinematic features to predict the longitudinal speed through the double parallel network. As with the regression task for the steering angle, we analyze the model's performance in terms of the root mean square error (RMSE) and mean absolute error (MAE) of the speed. For a CNN model, the predicted speed bias is large when only a single image frame is input; therefore, we do not include PilotNet and CgNet in the comparison in Table 2 and use only the E2E multimodal multitask network results as the baseline.
As can be seen from Table 2, the RMSE is reduced from 1.7112 to 1.4211, a relative improvement of 17%. To further evaluate the overall effectiveness of the proposed DP-Net, Figure 7b plots the speed over 4400 frames from the test set. The orange curve indicates the ground truth and the blue curve indicates the prediction. It can be observed that the predicted values match the ground truth well. The speed errors between the predicted values and the Udacity dataset values fluctuate within roughly ±0.5, as shown in Figure 7d. This indicates that the improved parallel CNN network can extract richer visual features and merge them with the temporal contextual features of the speed sequence, via the LSTM network, to generate new high-level semantic features. Therefore, our network better facilitates the learning of the longitudinal decision network, narrowing the gap between the predicted and ground truth speed values. However, as also shown in Table 2, the proposed DP-Net's MAE and maximum prediction errors are not satisfactory. We conjecture that, because MAE gives each error value the same weight, some abnormal speed points in the test set cannot be well predicted. The subsequence from t = 3500 to approximately t = 4000 in Figure 7b also confirms this conjecture. Therefore, attention should be focused on the handling and prediction of speed outliers when our model is deployed to actual multi-axle transport vehicles.
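The conjecture about MAE weighting every error equally can be illustrated with hypothetical speed values: a single abnormal point dominates the average error even when all other frames are tracked closely. The numbers below are illustrative, not from the dataset:

```python
def mae(preds, targets):
    """Mean absolute error: every per-frame error carries equal weight."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

# Hypothetical speeds (m/s): three well-tracked frames and one abnormal
# point that the model fails to follow.
truth = [10.0, 10.2, 10.1, 25.0]   # last frame is a speed outlier
preds = [10.1, 10.1, 10.0, 11.0]
error = mae(preds, truth)          # the single 14 m/s miss dominates
```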

Validation on SYSU Campus
Safety is always a top priority in autonomous driving. Therefore, to ensure that the proposed DP-Net can subsequently be safely and accurately deployed in our multi-axle transport vehicle project, it must be tested on a real-world dataset. Ideally, the trained model accurately predicts the steering angle and speed on the test set.
We recorded and constructed the SYSU Campus dataset using Baidu's Apollo D-KIT Lite [30]. The dataset comprises two hours of driving data from Gufeng Road and Xiaoyuan West Road on campus, both with clear road edges. The route and some frame images of the dataset are shown in Figure 8. The dataset contains driving data under both normal daylight and dark tunnel/nighttime conditions. Similar to the structure of the Udacity Challenge II dataset, speed values and steering angles are recorded. The video streams come from one center and two side front-view cameras at a frame rate of 20 frames per second.
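Since the 20 fps camera streams and the steering/speed logs are recorded at different rates, each frame must be paired with a label. Nearest-timestamp matching, sketched below, is our assumed synchronization strategy, not necessarily the exact recording pipeline:

```python
def nearest_label(frame_time, label_times, labels):
    """Pair a camera frame with the closest recorded steering/speed
    sample by timestamp (assumed nearest-neighbour sync strategy)."""
    i = min(range(len(label_times)),
            key=lambda j: abs(label_times[j] - frame_time))
    return labels[i]

# Frames arrive every 0.05 s (20 fps); labels are logged at their own rate.
label_times = [0.00, 0.04, 0.11]
speeds = [5.0, 5.2, 5.4]           # toy speed log (m/s)
paired = nearest_label(0.05, label_times, speeds)   # closest sample: 5.2
```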
In order to visualize the longitudinal and lateral test results of the proposed DP-Net (choosing a combination of 64-128-192-192), we prepared six representative key test video frames before, during, and after entering and exiting the tunnel, shown in Figure 9. It is observed that the predictions of steering angle and speed are very close to the ground truth values, which indicates that the learned model indeed captures the critical factor in complex lighting such as tunnels. To further evaluate the overall effectiveness of the longitudinal and lateral prediction results, Figure 10 plots the curves over 8000 frames from the SYSU Campus dataset. The orange curve indicates the ground truth and the blue curve shows the prediction. For the steering angle, our model predicts very accurately; for the speed, as conjectured above, the predicted values are limited. Therefore, transfer learning on the actual dataset could be a good optimization approach for deploying the model on the real multi-axle vehicle in our subsequent project.

Ablation Analysis
Our proposed model includes two novel designs, i.e., large convolutional kernels and a small number of convolutional kernels. First, given the original dataset of 640 × 480 pixel images, we design a large convolutional kernel size (11 × 11) to obtain a larger receptive field and better extract the underlying features in the image [31]. Secondly, the more convolutional kernels there are, the more feature information can be extracted for learning. However, this also causes the number of network parameters to increase abruptly, slowing down computation and causing overfitting during training [23]. This section presents an ablation experiment to quantitatively evaluate the effect of different combinations of convolutional kernel numbers. Specifically, we tested six combinations of kernel numbers, from small to large, to verify the effect of this factor on the final accuracy.
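The parameter-growth trade-off can be quantified by counting weights for each candidate combination. The kernel sizes after the first 11 × 11 layer, and all candidate combinations other than the selected 64-128-192-192, are illustrative assumptions:

```python
def conv_params(combos, in_ch=3, ksizes=(11, 5, 3, 3)):
    """Parameter count (weights + biases) of a four-layer conv stack for
    each combination of kernel numbers. The 11x11 first layer follows the
    paper's design; the later kernel sizes are illustrative assumptions."""
    results = {}
    for combo in combos:
        n, prev = 0, in_ch
        for out_ch, k in zip(combo, ksizes):
            n += out_ch * (prev * k * k + 1)
            prev = out_ch
        results[combo] = n
    return results

# Six candidate combinations from small to large (illustrative values;
# 64-128-192-192 is the combination the paper selects).
combos = [(16, 32, 48, 48), (32, 64, 96, 96), (64, 128, 192, 192),
          (96, 192, 288, 288), (128, 256, 384, 384), (192, 384, 576, 576)]
counts = conv_params(combos)
# Parameters grow roughly quadratically with the kernel numbers.
```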

The results of these evaluations are shown in Table 3, which shows that a moderate number of convolutional kernels provides optimal accuracy. Inspired by the ideas in [23,32], designing deeper networks with larger numbers of convolutional kernels is essentially a constrained optimization problem. In the lane perception task of this study, each convolutional layer's capacity to learn more complex patterns should be guaranteed; it is therefore unreasonable to use too many convolutional kernels or too deep a network. Based on the 64-128-192-192 convolutional kernel parameters above, we further designed controlled experiments to compare the network's performance with different input information. The results, shown in Table 4, indicate that the visual information (such as road edges and lane lines) and the previous vehicle kinematic state (steering angle and speed) provide crucial information for the task under consideration. For the steering angle, the RMSE and MAE of the CNN-LSTM (DP-Net) are reduced from 0.0301 and 0.0243 to 0.0107 and 0.0054 compared with the CNN alone. This shows that the additional vehicle kinematic information improves the continuity of steering angle prediction and reduces the jerking of the vehicle's steering wheel. For the speed, the RMSE and MAE of the CNN-LSTM (DP-Net) are reduced from 3.3778 and 2.7171 to 1.4211 and 0.8802 compared with the CNN alone. This shows that the LSTM networks capture the temporal dependence in speed sequences, improving speed prediction accuracy and supporting the longitudinal control of autonomous vehicles. Therefore, it is possible to extend the network to a multi-parallel structure to fuse more vehicle kinematic information, such as the position and attitude of multi-axle transport vehicles, in our subsequent project.
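Structurally, the fusion compared above can be sketched as two regression heads over shared parallel features; the exact wiring below is our assumption based on the description, with plain lists standing in for feature tensors:

```python
def dp_forward(rgb_feat, hsv_feat, steer_seq_feat, speed_seq_feat):
    """Structural sketch of the double parallel fusion (assumed wiring).
    The lateral head fuses the shared visual features with the LSTM's
    steering-sequence feature; the longitudinal head fuses them with the
    speed-sequence feature."""
    visual = rgb_feat + hsv_feat               # parallel CNN branches, concatenated
    lateral_in = visual + steer_seq_feat       # input to the steering-angle head
    longitudinal_in = visual + speed_seq_feat  # input to the speed head
    return lateral_in, longitudinal_in

lat, lon = dp_forward([0.1], [0.2], [0.3], [0.4])
```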

Conclusions
In this study, we have addressed the task of end-to-end lateral and longitudinal driving strategy prediction in terms of the speed and steering angle. To handle the complex illumination (e.g., tunnels) of structured scenarios in autonomous driving, a double parallel network is proposed. In feature extraction and perception, a preprocessing module is presented to ensure adequate visual information extraction under complex illumination. Then, a parallel CNN sub-network and a parallel LSTM sub-network are designed. The parallel LSTM sub-network uses vehicle kinematic features as physical constraints to help predict the steering angle and speed more accurately.
The experimental results show that the proposed DP-Net can accurately predict the steering angle and speed, improving both robustness to complex illumination and the accuracy of lateral and longitudinal prediction. In addition, a new SYSU Campus dataset was collected for evaluation and testing. However, the more complex kinematic characteristics (multi-axle steering angles, position, and pose) of the multi-axle transport vehicle are not considered in our approach. Moreover, due to the different distributions of the training and test datasets, the DP-Net's speed predictions in this study are not sufficiently accurate. Future work will include the following: (1) incorporating more kinematic features of multi-axle vehicles in feature extraction and perception, and (2) using transfer learning to further improve the accuracy of longitudinal speed prediction.