End-to-End Deep Neural Network Architectures for Speed and Steering Wheel Angle Prediction in Autonomous Driving

Abstract: The complex decision-making systems used for autonomous vehicles or advanced driver-assistance systems (ADAS) are being replaced by end-to-end (e2e) architectures based on deep neural networks (DNNs). DNNs can learn complex driving actions from datasets containing thousands of images and data obtained from the vehicle perception system. This work presents the classification, design and implementation of six e2e architectures capable of generating the driving actions of speed and steering wheel angle directly on the vehicle control elements. The work details the design stages and optimization process of the convolutional networks used to develop the six e2e architectures. In the metric analysis, the architectures have been tested with different data sources from the vehicle, such as images, XYZ accelerations and XYZ angular speeds. The best results were obtained with a mixed data e2e architecture that used front images from the vehicle and angular speeds to predict the speed and steering wheel angle with a mean error of 1.06%. An exhaustive optimization process of the convolutional blocks has demonstrated that it is possible to design lightweight e2e architectures with high performance, more suitable for final implementation in autonomous driving.


Introduction
Autonomous driving technology has advanced greatly in recent years, but it is still an ongoing challenge. Traditionally, intelligent decision-making systems on board autonomous vehicles have been characterized by their enormous complexity [1] and are composed of multiple subsystems, including a perception system, global and local navigation systems, a control system, a surroundings interpretation system, etc. [2]. These subsystems are combined with the aim of covering the complicated decisions and tasks which the vehicle must perform whilst driving. To achieve the objectives of the vehicle, these subsystems use a wide range of techniques, including cognitive systems [3], agent systems [4], fuzzy systems [5], neural networks [6], evolutionary algorithms [7] and rule-based methods [8].
Deep learning techniques are becoming increasingly popular and are now a valuable tool in a wide range of industries, including the automotive industry, due to their powerful image feature extraction capabilities. These techniques have allowed the so-called end-to-end (e2e) driving approach to appear, simplifying the traditional subsystems greatly and reducing the tasks of modeling and control of the vehicle [9] (Figure 1). The appearance of DNNs means that decision-making systems on board autonomous vehicles can replace many of the subsystems mentioned previously with neural blocks [10]. These neural blocks, properly interconnected and trained with the correct data, are capable of obtaining performances greater than 95% for the prediction of vehicle control variables [11]. An advantage of these models is that they generally require fewer onboard sensors, as the main source of information fed to the DNNs usually consists of RGB images and kinematic data from an inertial measurement unit (IMU) [12]. This makes end-to-end driving systems more easily accessible than the traditional perception subsystems with sensors such as LIDAR, which are very costly. Deep learning methods for autonomous driving have gained popularity with advancements in hardware, such as GPUs, and more readily available datasets, both for end-to-end driving techniques [13] and for the use of deep learning in individual subsystems [14]. There have been a variety of different approaches to the development of driving applications using end-to-end learning techniques. In one study, a 98% accuracy was obtained using convolutional neural networks (CNNs) to generate steering angles from images generated by a front view camera [15]. In a similar work, a sequence of images from a public dataset was used as input to the CNN to predict whether the vehicle was accelerating, decelerating or maintaining speed, as well as to calculate the steering angle [16].
An interesting approach designed a CNN to develop a human-like autonomous driving system which aims to imitate human behavior, meaning the vehicle can better adapt to real road conditions [13]. The authors used 3D LIDAR data as input to the model and generated steering and speed commands, and in a driving simulation managed to decrease accidents with the autonomous system ten-fold compared with the human driver. A driving simulator was also used to test a CNN-based closed-loop feedback system to control the steering angle of the vehicle [17]. The authors designed their own CNN, DAVE-2SKY, using the Caffe deep learning framework and tested the system in a lane-keeping simulation. The results were promising, although problems occurred if the distance to the vehicle in front became less than 9 m.
Various long short-term memory (LSTM) models have also been studied. A convolutional LSTM model with backpropagation was trained to obtain the steering angle from video frames using the Udacity dataset [18]. An FCN-LSTM architecture was used to predict driving actions and motion from images, obtaining almost 85% accuracy. A convolutional LSTM model was also used to predict steering angles from a stream of images from a front facing camera [19], improving on the results from previous works [20].
Another approach consists in adding more sensors. In one work a dataset was obtained using surround view cameras in addition to the typical front view camera [21]. The data obtained by the cameras was used to predict the speed and steering angle using existing pretrained CNN models. The use of surround view cameras improved the results obtained at low speeds (<20 km/h), but at greater speeds the improvement was less significant.
In this work, we present a detailed study implementing six end-to-end DNN architectures for the prediction of the vehicle speed and the steering wheel angle. The architectures have been trained and tested using 78,011 images from real driving scenarios, which were captured by the Cloud Incubator Car (CIC) autonomous vehicle [2].

Materials and Methods
DNN end-to-end architectures require large volumes of data for the models to converge correctly. The data needed to create DNN models for autonomous driving or ADAS can be obtained from three different types of sources:

1. Ad hoc tests. To perform this type of testing, large resources are required in the form of one or more vehicles, expensive perception systems (e.g., LIDAR) and personnel capable of the installation, integration and commissioning of sophisticated sensors and data recording systems. In addition, the data must be post-processed, and the synchronization of the different vehicle information sources is required.

2. Public datasets. There are datasets developed by businesses and universities for autonomous driving where data obtained from the perception systems of their vehicles can be accessed [10]. Some of these present diverse scenarios with different light and meteorological conditions [22]. Table 1 shows some recent public datasets, including the number of samples, types of images available and types of vehicle control actions stored.

3. Simulators. Given the complexity of conducting real tests, autonomous driving simulators have become one of the most widely used alternatives. The simulation industry ranges from simulation platforms, vehicle dynamics simulation and sensor simulation to scenario simulation and even scenario libraries. At present, there are many options, including generic solutions which make use of games and physics engines for simulation [23] and robotics simulators [16]. Recently, companies that develop simulation products specifically designed to satisfy the needs of autonomous driving have appeared on the market, including Cognata, CARLA and METAMOTO.

In this work, ad hoc data has been chosen. To obtain the data, a custom dataset was created as the result of ad hoc driving tests performed using the Cloud Incubator Car autonomous vehicle (CICar) [2] (see Figure 2), an autonomous vehicle prototype based on the adaptation of the commercial electric vehicle Renault Twizy. The vehicle has been conveniently modified and houses a complete perception system consisting of a 2D LIDAR, 3D HD LIDAR and ToF cameras, as well as a localization system which contains a real-time kinematic (RTK) unit and an inertial measurement unit (IMU, see Figure 2c), and automation of the driving elements of the vehicle (accelerator, brake, steering and gearbox). All of this is complemented with the biometric data of the drivers taken during the driving tests.

Driving Tests
A group of 30 drivers of different age and gender were selected to perform the driving tests, of which five were discarded due to synchronization problems, recording failure or incomplete data. The driving tests were carried out in Cartagena in the Region of Murcia, Spain, following a previously selected route with real traffic.
This route provides a significant set of typical urban driving scenarios: (a) junctions with right of way and changes of priority; (b) incorporation, internal circulation and exiting of a roundabout; (c) driving along a road with parking areas; (d) merging traffic situations. In order to capture a greater variety of environmental conditions, each driver completed the route twice at different times of day (morning, afternoon or evening). Figure 3 shows a sample of the dataset images, where some of the different driving conditions captured during the tests can be observed.

Vehicle Configuration
As mentioned previously, the data was collected using the CICar prototype vehicle in manual mode, driven by a human driver. In Table 2 the variables and data acquired during the driving tests are shown, as well as the information about the devices and systems used to obtain the data.
Each sensor works with its own sample rate, which in most cases differs between devices. To achieve correct data synchronization and reconstruct the temporal sequence with precision, time stamps have been generated for each sensor, and these have been synchronized at the start and end of the recording. All the devices are controlled by the control unit onboard the vehicle, providing a precise temporal and spatial synchronization of the data obtained by the different sensors. The data from each test is downloaded and stored in the central server once the drive has finished.

Deep Learning End-to-End Architectures Classification
End-to-end (e2e) systems based on DNN architectures applied to autonomous driving can model the complex relationships extracted from the information obtained from the vehicle perception system. This is achieved using different types of neural blocks grouped into layers (e.g., convolutional layers, fully-connected layers, recurrent layers, etc.), with the aim of generating direct control actions on the steering wheel, the accelerator and the brake. These actions on the vehicle control elements can be categorical, e.g., increase or decrease the speed, or they can generate a setpoint on the controller, e.g., turn 13.6 degrees or reach 45 km/h.
The machine learning algorithms that are used to model driving actions belong to the set known as supervised learning. These algorithms acquire knowledge from a dataset of samples previously acquired during driving tests with a previously conditioned vehicle [2] or from driving simulators [27]. These datasets include data from the perception system, such as images (RGB or IR), LIDAR, RADAR and IMU data, as well as the actions performed by the driver on the vehicle control elements, such as the steering wheel, the accelerator and the brake.
The generation of continuous variables by a machine learning algorithm is known as regression and is a widely studied problem [28]. DNN regression models use gradient descent to search for the optimal weights that minimize the loss function. The loss functions used for these models differ from those used in classification models, with the most common being the mean absolute error, mean squared error and mean absolute percentage error, among others.
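The regression losses mentioned above can be sketched in a few lines of NumPy; the speed values below are illustrative, not taken from the paper's dataset:

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean absolute error: average magnitude of the prediction error.
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    # Mean squared error: penalizes large errors more heavily.
    return np.mean((y_true - y_pred) ** 2)

def mape(y_true, y_pred):
    # Mean absolute percentage error, relative to the true values.
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

y_true = np.array([40.0, 50.0, 60.0])   # e.g. speeds in km/h
y_pred = np.array([41.0, 49.0, 63.0])
print(mae(y_true, y_pred))              # ~1.67
```

During training, whichever of these is chosen as the loss is the quantity that gradient descent minimizes.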
This work proposes a classification of e2e architectures based on the type of data received by the DNN from the vehicle perception system. This is done by considering the image provided by the visual perception system of the vehicle as the main data source for the e2e architecture. Based on the type of network input, the architectures have been classified into three types: (1) single data e2e architecture (SiD-e2e), (2) mixed data e2e architecture (MiD-e2e) and (3) sequential data e2e architecture (SeD-e2e).

SiD-e2e Architecture
This type of architecture uses a single data source for the input layer to generate the setpoints directly for the control elements of the vehicle. The SiD architectures use the visual information provided by one or more cameras located on the front and periphery of the vehicle to compose a single image of the vehicle's field of view as the visual input to the network [15,29,30]. Before being processed by the DNN, the images are reduced in size and normalized. Subsequently, the images go through convolutional layers of different kernel size (k × k) and depth (d), which allow the image features that minimize the cost function to be extracted automatically in successive layers. After the convolutional layers, the resulting vector is transformed into one dimension (F layer) and connected to a set of fully-connected (FC) layers which have the decision-making capacity. Lastly, the FC layers end in a number of neurons equal to the number of variables to be predicted [15,28]. Figure 4 shows an example of the SiD architecture where the normalized image feeds a group of convolutional layers with different kernel sizes, followed by a set of fully-connected layers and a final output layer. The number of convolutional layers, their size, padding and stride, as well as the number of neurons in the FC layers, are adjusted empirically. These parameters are dependent on the training dataset and the size of the input images. There are works where the architectures have been designed using banks of convolutional filters of increasing size [30], and there are others where the design is the opposite [31,32]. Generally speaking, the convolutional layers with a small kernel size extract reduced spatial characteristics, such as traffic signs, traffic lights or lane separation lines, while those with a greater kernel size detect larger elements in the image, such as vehicles, pedestrians or the road [31].
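The way kernel size and stride shrink the feature maps through successive convolutional layers can be checked with a small helper. The formula is the standard valid-padding convolution arithmetic; the kernel and stride values below are illustrative:

```python
def conv_output_size(in_size, kernel, stride=1, padding=0):
    # Spatial output size of a convolution: floor((in + 2p - k) / s) + 1
    return (in_size + 2 * padding - kernel) // stride + 1

# Example: a 160 x 180 input image through two convolutional layers,
# one small kernel then one large kernel, both with stride 2.
h, w = 160, 180
for k, s in [(3, 2), (25, 2)]:
    h = conv_output_size(h, k, stride=s)
    w = conv_output_size(w, k, stride=s)
print(h, w)   # 28 33
```

The resulting feature map is then flattened (the F layer) before feeding the fully-connected layers, so its size directly determines the number of FC weights.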

MiD-e2e Architecture
Mixed data architectures allow different data sources from the vehicle, such as RADAR, longitudinal and lateral accelerations, angular velocities, maps or GPS, to be merged together with the visual information from the vehicle's cameras. The inclusion of more information sources in the DNN aims to: (1) improve the performance of the model; (2) improve the prediction of specific cases or abnormal driving; and (3) increase the tolerance to failures produced by the data sources [21,29,33]. As shown in Figure 5, this type of architecture combines the results of a SiD-e2e branch, such as the one shown in Section 2.3.1, with a set of FC layers which allows the mapping of the characteristics from other vehicle data sources onto a layer that concatenates all the information. Figure 5 shows a first input branch where the relevant information is extracted from the image, and a second branch that extracts extra information, for example from the IMU or GPS. The concatenation layer receives a specified number of inputs from both branches of the model. The number of connections from each branch is usually determined empirically. The MiD architecture is commonly used for data fusion in the perception systems of autonomous vehicles or ADAS.
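A minimal numeric sketch of the two-branch idea, using plain NumPy with random weights in place of trained layers (all layer sizes are illustrative, not those of the paper's architectures):

```python
import numpy as np

rng = np.random.default_rng(0)

def fc(x, w, b):
    # Fully-connected layer with ReLU activation.
    return np.maximum(0.0, x @ w + b)

# Branch 1: features already extracted from the image by the
# convolutional blocks and flattened into a vector.
img_features = rng.standard_normal(128)

# Branch 2: IMU data (e.g. XYZ angular velocities) mapped through an FC layer.
imu = rng.standard_normal(3)
imu_features = fc(imu, rng.standard_normal((3, 16)), np.zeros(16))

# The concatenation layer merges both branches ...
merged = np.concatenate([img_features, imu_features])

# ... and a final FC head predicts the two outputs (v, theta).
v, theta = merged @ rng.standard_normal((merged.size, 2))
print(merged.shape)   # (144,)
```

In a real implementation both branches and the head would be trained jointly, so the gradient of the loss on (v, θ) flows back through the concatenation into both data sources.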

SeD-e2e Architecture
Driving is a task where the future actions on the vehicle's control elements depend greatly on the previous actions; therefore, the prediction of the control actions can be modeled as a time series analysis [16,26,34]. Sequential data based architectures aim to model the temporal relationships of the data using feedback neural units (see Figure 6); these types of neural networks are known as recurrent neural networks (RNNs) [34]. Basic RNNs can learn the short-term dependencies of the data, but they have problems capturing long-term dependencies due to the vanishing gradient problem [35]. To solve this, more sophisticated RNN architectures have appeared which use activation functions based on gating units. A gating unit has the capacity to conditionally decide what information is remembered, forgotten or passed through the unit. The long short-term memory (LSTM) [36] and the gated recurrent unit (GRU) [37] are two examples of these kinds of RNN architectures, and RNN [15], LSTM [16] and GRU units are the most used for modeling temporal relationships in the field of e2e architectures.

The use of RNNs in e2e architectures requires the network input data to be transformed into temporal sequences in the form of time steps (ts). The partitioning of the N input samples of the network will generate N-ts temporal sequences, each corresponding to an output vector from the network according to Equation (1):

(I_i, . . . , I_i+ts-1) → (v_i+ts, θ_i+ts), i = 1, . . . , N-ts (1)

Figure 7 shows the procedure to generate N-ts sequences of size ts from a dataset composed of N images and N pairs of output values (v: speed, θ: steering wheel angle). A model based on the SeD-e2e architectures will be trained with temporal sequences of size ts (I_1 to I_ts) and the next output vector to predict (v_ts+1, θ_ts+1), as shown in Figure 7.
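The sequence-generation procedure of Figure 7 can be sketched as follows; the helper name and the toy data are for illustration only:

```python
import numpy as np

def make_sequences(images, outputs, ts):
    """Partition N samples into N - ts overlapping sequences of length ts.

    Each input sequence (I_i .. I_{i+ts-1}) is paired with the next
    output vector (v_{i+ts}, theta_{i+ts}).
    """
    N = len(images)
    X = np.stack([images[i:i + ts] for i in range(N - ts)])
    y = outputs[ts:]
    return X, y

# Toy data: N = 10 "images" and N output pairs (speed, angle).
images = np.arange(10)
outputs = np.stack([np.arange(10) * 1.0, np.arange(10) * 2.0], axis=1)
X, y = make_sequences(images, outputs, ts=4)
print(X.shape, y.shape)   # (6, 4) (6, 2)
```

Each of the N - ts rows of `X` is one time-step window fed to the recurrent layers, and the matching row of `y` is the next (v, θ) pair the network must predict.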

Parameters of Deep Neural Network Architectures
The number of parameters which come into play during the design process of a DNN is enormous, and we can separate them into three types:

(1) Network input parameters. These parameters refer to the way the network input values are presented. For data in the form of images, these include:

• Normalization. Normalization must be performed on the data before training the DNN. An adequate normalization can improve the convergence and performance of the network. Equations (2) and (3) show the most common techniques:

x' = (x - min) / (max - min) (2)

x' = (x - u) / σ (3)

where min and max are the minimum and maximum values present in the dataset X = {x_1, . . . , x_N}, and u and σ are the mean and standard deviation of the dataset, respectively. There are other normalization techniques; for example, the mean can be substituted for the mode in Equation (3) for cases in which the data distribution is not centered on the mean.

• Resizing. As a general rule, and especially in e2e architectures for autonomous driving, the image size is reduced before being processed by the network. The main reason for this is to decrease the network processing time and the resources involved in the prediction.

• Color space transformations. It is common to transform the input image to a color space other than the one supplied by the camera to improve performance, for example HSI, LAB, etc. [10].

• Preprocessing. When the data is captured from different sources or datasets, it tends to have disparate features arising from the device itself or from the lighting of the scene where the images were captured; therefore, histogram equalization or image enhancement algorithms are usually applied to normalize the appearance of the entire dataset.

• Data augmentation. This technique consists in increasing the size of the original dataset in order to achieve higher levels of generalization and to improve the performance of the network [38].
(2) Architecture configuration parameters. These parameters determine the composition of one architecture or another, and include the number and type of layers, the kernel size, stride and padding of the convolutional layers, and the number of neurons in the fully-connected layers.
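The two normalization techniques referred to as Equations (2) and (3), min-max scaling and standardization, can be sketched as follows (the speed values are illustrative):

```python
import numpy as np

def min_max_normalize(x):
    # Equation (2): scale values into [0, 1].
    return (x - x.min()) / (x.max() - x.min())

def standardize(x):
    # Equation (3): zero mean, unit standard deviation.
    return (x - x.mean()) / x.std()

speeds = np.array([0.0, 14.5, 29.0, 43.5, 58.0])  # illustrative km/h values
print(min_max_normalize(speeds))   # values in [0, 1]
print(standardize(speeds).mean())  # ~0
```

In this work the images are scaled with the first technique and the discrete values with the second, as described later in the model configuration.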

End-to-End Architecture Design
From a top-down design point of view, the process of developing an end-to-end architecture is similar to creating a puzzle made up of different blocks, where each block consists of a set of layers and configuration parameters, as explained in Section 2.4. One of the biggest difficulties in the design of DNN architectures appears in the connectivity between the blocks, which must be solved adequately at the data level for all of them to fit correctly.
In the works reviewed in the literature for the classification of SiD, MiD and SeD architectures (see Sections 2.3.1-2.3.3), groups of layers representing higher level blocks or units can be identified. In these groups, input and output blocks, feature extraction blocks (2D or 3D convolutions), decision-making blocks (fully-connected), concatenation blocks and recurrent blocks can be distinguished.
For the design of the e2e architectures developed in this work, seven blocks of different types of layers and parameters with different functionality have been defined, as shown in Figure 8. Once the blocks that form the architectures have been defined, the design process has been divided into three stages: (1) neural block distribution design, (2) definition of constant and variable parameters and (3) block optimization with variable parameters.
In the first design stage, and in order to perform a detailed study of the behavior of the SiD, MiD and SeD architectures for predicting the speed and steering wheel angle (v, θ) for autonomous vehicles, six architectures of two types have been designed, called SiD1, SiD2, MiD1, MiD2, SeD1 and SeD2. The difference between type 1 (Figure 9a,c,e) and type 2 (Figure 9b,d,f) lies in the distribution of the output blocks, either in a single branch or in two independent branches. In the design, the blocks described above have been combined according to the different input and output distributions, as shown in Figure 9.

In the final design stage, and with the objective of obtaining an optimal parameter configuration for the six architectures, two tests have been designed:

(1) Optimization of 2DCB(32,k1,k2,2). In this test the six architectures were evaluated with different combinations of k1 and k2: (a) k1 ≤ k2: k1 with a constant value of 3 and k2 variable between 3 and 29 in increments of 2; (b) k1 > k2: k1 with a constant value of 29 and k2 variable between 3 and 15 in increments of 2. The test calculates which combination of k1 and k2 sizes is able to extract the best spatial features from the input images. The variation of the sizes of k1 and k2 will determine whether the performance of the models increases when initially using small or large kernels.

(2) Optimization of the MiD1 and MiD2 models. This test is used to determine the influence of the vehicle dynamics parameters on the prediction of the models. It is performed using the discrete acceleration and angular velocity values obtained by the IMU on the XYZ axes of the vehicle, together with the input images.
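The kernel sweep of test (1) amounts to enumerating the following combinations (a small sketch, assuming the increments described above):

```python
# Kernel-size combinations evaluated in the 2DCB(32, k1, k2, 2)
# optimization test.
combos_k1_le_k2 = [(3, k2) for k2 in range(3, 30, 2)]   # k1 = 3,  k2 = 3..29
combos_k1_gt_k2 = [(29, k2) for k2 in range(3, 16, 2)]  # k1 = 29, k2 = 3..15
print(len(combos_k1_le_k2), len(combos_k1_gt_k2))       # 14 7
```

So each of the six architectures is trained and evaluated 21 times in total, once per kernel combination.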

Results
The implementation of the models has been performed using Keras 2.

Model Configuration
To train the models, a dataset of 78,011 samples has been used (see Section 2.1), consisting of discrete values of speed (km/h), steering wheel angle (degrees), XYZ accelerations (m/s²) and XYZ angular velocities (°/s), together with color images of size 160 × 180 × 3. The discrete values have been normalized using Equation (3) and the images using Equation (2). The dataset has been shuffled and divided into two groups:

1. Training dataset, which consists of 58,508 samples (75% of the total).

2. Validation dataset, which consists of 19,503 samples (25% of the total).

Table 3 shows the hyperparameters used to configure the training and validation phases of the models.
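The shuffle and 75/25 split described above can be reproduced schematically as follows (the random seed is arbitrary; the paper does not state one):

```python
import numpy as np

N = 78_011
rng = np.random.default_rng(42)   # seed is illustrative
indices = rng.permutation(N)      # shuffle the dataset

split = int(N * 0.75)             # 75% / 25% split
train_idx, val_idx = indices[:split], indices[split:]
print(len(train_idx), len(val_idx))   # 58508 19503
```

Shuffling before splitting avoids the validation set being dominated by the last driving sessions recorded.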

Performance Metrics
To evaluate the performance of the proposed architectures, the mean absolute error (MAE) was calculated during the training and validation of the models. The mean absolute percentage error (MAPE, see Equation (4)) was used to compare the speed and steering wheel angle predictions obtained by the models. All data supplied in this section were calculated using the validation dataset.
MAPE = (MAE / spanV) × 100 (4)

where MAE is the mean absolute error of the prediction (speed or angle) and spanV is the difference between the maximum and minimum values of the variable to predict (or span). The spans of the speed and steering wheel angle are 58.05 km/h and 841°, respectively.
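Equation (4) can be checked numerically against the figures reported later in this section; the helper name is ours:

```python
def span_mape(mae, span):
    # Equation (4): MAE expressed as a percentage of the variable's span.
    return 100.0 * mae / span

SPAN_SPEED = 58.05   # km/h
SPAN_ANGLE = 841.0   # degrees

# Reproducing two figures reported in the paper:
print(round(span_mape(0.96, SPAN_SPEED), 2))   # ~1.65 (reported as 1.66%)
print(round(span_mape(3.61, SPAN_ANGLE), 2))   # 0.43
```

Normalizing by the span rather than by each true value keeps the metric well defined when the true speed or angle passes through zero.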

• Convolutional Block Optimization for k1 ≤ k2

Table 4 shows the values obtained when calculating the MAPE metric for the speed and steering wheel angle predictions during the optimization process of the convolutional blocks for k1 ≤ k2 for all the proposed architectures. Generally, as can be observed in Table 4, all architectures obtain a lower percentage error for the steering wheel angle prediction than for the speed prediction, which is logical since it is simpler for the models to relate the geometric features from the images (lines, obstacles, traffic signs or the road itself) with the steering wheel rotation angle than with the speed of the vehicle. The architecture with the lowest percentage error for speed prediction is MiD2_wxyz with 1.66% (MAE: 0.96 km/h), using 2DCB(32,3,21,2). The lowest percentage error for the steering wheel angle prediction is MiD1_wxyz with 1.06% (MAE: 1.09°), using 2DCB(32,3,25,2).
Regarding the type of architecture (type 1: SiD1, MiD1, SeD1 or type 2: SiD2, MiD2, SeD2), the last row of Table 4 shows that the type 2 architecture design, with two independent output branches, always obtains a lower combined prediction error. Figure 10 shows the trend of the error committed by the models in the prediction of speed for the type 1 (SiD1, MiD1, SeD1; see Figure 10a) and type 2 (SiD2, MiD2, SeD2; see Figure 10b) architectures, with respect to the kernel size k2 × k2. As can be observed in the speed prediction, the error tends to decrease as the size of the kernel k2 × k2 increases, for both types of architecture. Figure 11 shows the trend of the errors made by the models for the steering wheel angle prediction for both architecture types, type 1 (SiD1, MiD1, SeD1; see Figure 11a) and type 2 (SiD2, MiD2, SeD2; see Figure 11b), according to the kernel size k2 × k2. As was the case for the speed, the error tends to decrease as the kernel value k2 increases.

• Convolutional Block Optimization for k1 > k2

Table 5 shows the values obtained using the MAPE metric for the speed and steering wheel angle predictions during the convolutional block optimization process for k1 > k2 in the proposed architectures. As can be observed, the lowest error for the speed prediction is 2.21% (MAE: 1.28 km/h), achieved with the MiD1_wxyz architecture using 2DCB(32,29,3,2). However, the lowest error for the steering wheel angle prediction was given by the MiD2_wxyz architecture with 2DCB(32,29,13,2). The lowest mean percentage error for the speed and steering wheel angle predictions combined is 1.51% (MAE: 1.28 km/h, 6.74°), obtained by the MiD2_wxyz architecture with 2DCB(32,29,3,2). Figure 12 shows the trend of the errors obtained by the models for the speed prediction for the type 1 and type 2 architectures for k1 > k2. Figure 12a shows that the type 1 architectures have a constant error (<10%) with kernel sizes equal to or greater than 9 × 9. Furthermore, it can be clearly observed that the MiD1_axyz architecture obtains the lowest error for the speed prediction. Regarding the type 2 architectures (SiD2, MiD2, SeD2), Figure 12b shows that they behave similarly to those of type 1, except for the SeD2 architecture, whose error decreases as k2 × k2 increases. Figure 13 shows the trend of the MAPE made by the models in the steering wheel angle prediction for the type 1 and type 2 architectures. In Figure 13a,b it can be verified that all architectures maintain the prediction error (<2%) if the kernel size is greater than or equal to 5 × 5. The architectures that obtain the lowest error are MiD1_wxyz and MiD2_wxyz.

Complexity of the Architectures
The number of parameters for each architecture is an indicator of its complexity. Figure 14 shows the average number of parameters obtained for the SiD, MiD and SeD architectures in the optimization tests for kernel sizes k1 ≤ k2 (Figure 14a) and k1 > k2 (Figure 14b). The use of kernels with k1 > k2 drastically reduces the number of parameters for the architectures, as can be observed in Figure 14b compared to Figure 14a. As shown in Table 4, the architecture that offers the lowest error in the optimization test for k1 ≤ k2 was MiD2_wxyz with 2DCB(32,3,25,2), using 1,557,006 parameters (Figure 14a). In the k1 > k2 optimization test, Table 5 shows that the best results were obtained by the MiD2_wxyz architecture with 2DCB(32,29,3,2) and 2,062,478 parameters (Figure 14b). As can be observed, an adequate optimization process produces better results with fewer architecture parameters.
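A quick way to see how kernel choices affect complexity is to count parameters per layer directly. The formulas below are the standard ones for 2D convolutional and dense layers; the channel counts are illustrative:

```python
def conv2d_params(in_channels, out_channels, k):
    # Weights (k * k * in_channels per filter) plus one bias per filter.
    return (k * k * in_channels + 1) * out_channels

def dense_params(in_units, out_units):
    # Weight matrix plus one bias per output neuron.
    return (in_units + 1) * out_units

# A convolutional layer itself is cheap even with a large kernel:
print(conv2d_params(3, 32, 3))    # 896
print(conv2d_params(3, 32, 29))   # 80768
```

The dominant cost usually comes from the first dense layer after flattening, so a large strided kernel early in the network, by shrinking the feature map quickly, can reduce the total parameter count despite its larger per-filter weight count.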

Discussion
In the literature review [16,19-21,26,30,40], compared to the results achieved in this work, a lack of uniformity in the metrics used and a lack of clarity when presenting the results were found. In some cases, subjective ad hoc metrics that are difficult to reproduce were employed; for example, in [31] a metric called "steering autonomy" is used, which quantifies the number of times the driver had to attend to the autonomous system and which was presented as the sole result of the work. In other studies, the set of samples on which the metrics were calculated is not specifically indicated, nor the number of samples used for the training, validation and test sets [15,16,30]. In this work, and in order to compare the architectures designed, the works which use the mean absolute error (MAE) metric have been selected for the comparison. Table 6 shows a set of models used for the prediction of control actions oriented at autonomous driving based on end-to-end architectures. In [15,30], one of the pioneering works in the development of end-to-end architectures in the field of autonomous driving is shown. The model was called PilotNet and was tested in [16] with the Udacity dataset for the prediction of the steering wheel angle, obtaining a MAE of 4.26°. The work performed in [29] uses a PilotNet-based architecture with RGB images and depth images as input, managing to reduce the error in the angle prediction to 2.41°. In [16], a SeD architecture is presented consisting of two blocks, CNN and LSTM. The proposed model was tested with images from the Udacity database (20 min of video) converted to the HSV color space and a database created by the authors (SAIC). The average errors obtained in the speed and steering angle predictions were 1.15 km/h and 0.71°, respectively.
The authors in [26] compare the speed and steering wheel angle prediction performance of three architectures, one of MiD type and two of SeD type based on recurrent LSTM networks. The comparison is carried out on two databases: the Guangzhou Automotive Cooperate (GAC) dataset and the autonomous driving simulation platform of Grand Theft Auto V (GTAV). The best results were an average error of 2.86 km/h and 2.87° with the third architecture proposed. Finally, of the architectures proposed in this work, the one which obtained the lowest average error (1.06%) was MiD2_wxyz with the 2DCB(32,3,25,2) configuration. As can be observed, the use of the IMU information, specifically the angular velocity data in the x, y and z axes, together with the use of a small kernel (k1 = 3) followed by a large kernel (k2 = 25), notably improved the vehicle speed prediction, with a MAE of 0.96 km/h. Regarding the angle prediction, a MAE of 3.61° was obtained, which corresponds to an error of 0.43% of the measurement span.
The last column of Table 6 shows the number of parameters of each of the architectures studied. As can be seen, an excess of complexity, represented by the number of parameters, does not imply the best result. The architecture that presents the best result in the steering wheel angle prediction is the one given by [16], and it uses more than 25 M parameters. However, the best result obtained for the speed prediction is obtained by the proposed architecture MiD2_wxyz (32,3,25,2), with less than 1.5 M parameters. An optimized design of the convolutional layers, such as the one developed in this work, shows that low complexity and high performance can be achieved.

Conclusions
In this work, a novel classification of end-to-end architectures (SiD, MiD and SeD) capable of generating control actions directly on the maneuvering elements of a vehicle has been presented. In addition, the stages for a modular and detailed design of end-to-end architectures for the prediction of the speed and steering wheel angle have been presented. To validate the proposed classification, six architectures have been implemented and evaluated, and it has been concluded that: (1) the type 2 architectures, with an independent output branch for each variable, obtain better results in the optimization tests carried out; (2) in the design of the convolutional blocks, a better performance was obtained using an initial block size of k1 ≤ k2, and increasing the size of k2 gave a better performance for high values of around eight times k1.
In the comparison between the proposed architectures, it was observed that the use of vehicle dynamics information from the IMU together with RGB images improves the speed and steering wheel angle predictions. The inclusion of the angular velocity substantially improved the speed prediction, and a mean prediction error of 1.06% was obtained. In the comparison with other works (see Table 6), the MiD2_wxyz architecture with the 2DCB(32,3,25,2) configuration obtained a better result for the speed prediction in terms of MAE, and an error of 0.43% of the measurement span in the angle prediction.
The optimization process performed on the convolutional layer kernels achieved a lighter architecture with excellent performance results.
Future work will continue to explore new architectures that incorporate new and relevant information, such as depth images and LIDAR or RADAR information, to improve performance. In addition, we intend to increase the complexity of the dataset with more complex driving scenarios.

Data Availability Statement:
The dataset is available upon request from the corresponding author.