Since the light sensor of the camera is very sensitive to changes in the external environment, the image quality of the sensor will be greatly reduced under external interference such as road shadows, abnormal exposure, and motion blur. As a result, the performance of IE will also be greatly affected. Therefore, considering the common visual degradation factors mentioned above, an image-based fusion estimator is proposed to identify the road surface condition that the vehicle will travel in the future period.
3.1. Road Feature Identification Module Considering Deep-Learning Model Uncertainty
As shown in
Figure 2, the road feature extraction network is designed in our previous work [
21]. The network structure is a typical encoder and decoder structure, and the network parameters of the encoder part are shared to the two decoder branches at the same time. Each decoder branch undertakes a segmentation task, namely, to achieve drivable area segmentation and road shadow segmentation.
Considering the computing power limitation of on-board equipment of intelligent vehicles, the lightweight neural network ShuffleNet [
22] is selected as the road feature classification model. The basic unit of the ShuffleNet network is improved based on the residual module of ResNet. It mainly adopts point group convolution and channel random shuffling, which achieves the purpose of improving computational efficiency and maintaining classification accuracy. However, due to the inherent noise in the sample data and insufficient training samples, deep learning models suffer from aleatoric and epistemic uncertainty [
23]. Therefore, the deep learning model still has limitations in high-risk application scenarios such as medical treatment and autonomous driving, and the uncertainty estimation of the model is of great significance in the above application fields.
For multi-classification tasks of road surface conditions, the uncertainty distribution of the model can be obtained by the confidence distribution output by the softmax layer of ShuffleNet. Due to the uncertainty of deep learning models, deep neural network models often suffer from overconfidence or underconfidence. The confidence of the model prediction results does not match the actual accuracy. The model calibration error can be evaluated based on the quantitative index: the expected calibration error (ECE) and the maximum calibration error (MCE) [
24].
In order to train and calibrate the classification model, a structured road image dataset containing the samples disturbed by common visual degradation factors is constructed to expand the sample space of the validation set for subsequent model calibration.
Figure 3 shows the samples of the road image dataset based on BDD100K and the self-collected road image dataset. The constructed road image dataset with noised samples contains common visual degradation factors (i.e., abnormal exposure, motion blur, road shadow). BDD100K provides image data of dry asphalt road and wet asphalt road. The self-collected dataset provides image data of snow/ice road and supplements the image data of dry asphalt road. The image data augmentation library Albumentations [
25] is used to generate abnormal exposure and motion blur interference. Some abnormal exposure interference in the self-collected dataset is generated by adjusting the camera exposure gain.
Considering the large workload of manual extraction from the entire image, first the original road image data are divided according to the road surface condition (i.e., dry asphalt, wet asphalt and snow/ice). Then, the trained road feature extraction network model is used to automatically segment the drivable area and extract the road image parts. The distribution of the constructed road image part dataset is presented in
Table 1.
In
Table 1, it is worth noting that the training set in the constructed dataset in this paper does not contain the road image parts extracted from the noised samples and only contains noised samples in the validation set. This can prevent the model from learning noised features to avoid a significant drop in the model’s prediction accuracy for regular samples. The training loss function is set to cross-entropy loss. The Adam optimizer is also used to train the model, and the initial learning rate is set to 0.0001.
Model calibration methods mainly include regularization calibration methods based on modified training loss function and postprocessing calibration methods based on validation set. Since the addition of noised samples in the training set will make the model learn relevant noised features, the regularization method based on adjusting the loss function may lead to the model underfitting, so the temperature scaling [
24] in the postprocessing calibration method is chosen to calibrate the original ShuffleNet model.
Temperature scaling is a soft label smoothing technology. The main principle is to introduce a temperature coefficient
into the original softmax function to smooth or sharpen the logic vector output by the model. When
, the information entropy of the confidence distribution of the model output will increase, and the model output will become smooth to alleviate the overconfidence of the model. When
, the information entropy of the confidence distribution of the model output will decrease, and the model output will become sharp to alleviate the underconfidence of the model. In addition, since the temperature scaling factor
is a parameter greater than zero, the order of the dimensions of the output logic vector after temperature scaling will not change. As a result, the model calibration method based on temperature scaling will not affect the classification performance of the model. The temperature scaling factor can be learned by optimizing the cross-entropy loss function of the validation set. The learning process of the temperature scaling factor is shown in
Figure 3.
The reliability diagrams of the classification model before and after calibration are presented in
Figure 4, respectively. As can be seen from
Figure 4, the model confidence of the uncalibrated model does not match the actual average accuracy, and the degree of overconfidence is large. Overconfidence of the model is generally caused by the cross-entropy loss function used for training. During the training process of the model, only the dimension of the label corresponding to the maximum value in the confidence distribution participates in the calculation of the loss function, and the relationship between the real label and other labels is ignored, so the output of the model may be extremely biased toward one class, which makes the model overconfident. It can be seen from
Figure 4b that both the expected calibration error (ECE) and the maximum calibration error (MCE) of the model are significantly reduced, indicating that the model is well calibrated.
3.2. Ego-Vehicle Trajectory Reckoning Module Considering Kinematic Model Uncertainty
Existing image-based road recognition methods usually only process the front-view image provided by the camera. Due to the perspective effect of the obtained front-view image, the representative area of the obtained prediction result is often large, and the processing of irrelevant areas in the image may interfere with the final identification result and reduce the computational efficiency. Therefore, in order to accurately identify the road surface condition of the area that the front wheels will pass through, an ego-vehicle trajectory reckoning module considering kinematic model uncertainty is introduced.
Considering that most of the existing mass-produced vehicles are still at the partial driver assistance level below L3, there is no reference trajectory provided by the decision-making and planning system. In addition, most of the vehicle’s driving conditions do not involve extreme driving conditions, so the ego-vehicle trajectory is reckoned based on the extended kinematic bicycle model.
As shown in
Figure 5, taking the vehicle acceleration
and the front wheel steering angle
as the control input, the vehicle single-track kinematics model in the inertial coordinate system can be established. In order to reckon the trajectory of the left and right front wheels, the single-track bicycle model can be extended to the double-track form, as shown in Equation (1).
where
,
are the horizontal and vertical coordinates of the left front wheel in the inertial coordinate system, respectively,
,
are the horizontal and vertical coordinates of the right front wheel in the inertial coordinate system, respectively, and
is the front wheel track,
is the vehicle yaw angle,
is the vehicle speed, and
,
are the distances between the vehicle center of gravity and the front or rear axles, respectively.
Due to the uncertainty of driving behavior, the reckoned ego-vehicle trajectory based on the kinematic model will not be completely consistent with the real trajectory. Generally, it can be considered that the reckoned trajectory is relatively accurate in a short time, and the relative error of the reckoned trajectory in a long time is relatively large. Therefore, it is necessary to establish the uncertainty distribution to better characterize the trajectory distribution.
It can be seen from Equation (1) that the system variables that directly affect the reckoned trajectory are the coordinates of the vehicle center of gravity
,
and the vehicle yaw angle
. For the modeling of vehicle state uncertainty, the uncertainty of the vehicle state can be described as a Gaussian distribution [
26]. Therefore, the Gaussian distribution shown in Equation (2) is used to describe the uncertainty of the above system variables.
Since the error of the reckoned trajectory will continue to be superimposed, assuming that the system state variable
changes linearly, the state prediction equation considering the uncertainty distribution is derived as follows.
where
represents the system uncertainty distribution (i.e., system covariance distribution), the definitions of the state transition matrix
and the noise covariance matrix
are shown in Equation (4). Due to the high consistency of the lateral distribution of the road surface, in order to facilitate the extraction of ROIs,
and
(
) can be set.
where
is the sampling period,
and
are the uncertainty variances of the horizontal and vertical coordinates of the vehicle center of gravity, respectively.
It can be seen from Equation (3) that with the increase of the recursion time, the uncertainty distribution variance of the reckoned trajectory will increase. Therefore, a 95% confidence ellipse with a certain step interval can be set to represent the uncertainty distribution of the reckoned trajectory.
3.3. Fusion Module Based on Improved DSET
In our previous work, the class corresponding to the maximum number of candidate ROIs in a frame is defined as the prediction result of the frame [
20]. Considering that in real driving scenarios, the accuracy of model recognition cannot reach 100%. Moreover, it is easily interfered with by various external environmental factors, causing the estimation results to oscillate. Therefore, the image-based fusion module based on improved DSET is added for subsequent high-quality fusion of IE with DE. The main idea is to define candidate ROIs as virtual sensors to make them form redundant sensor networks. The confidence distribution output by the calibrated road feature classification model is defined as the evidence, and the visual information confidence of the candidate confidence ellipse region is defined. Then, the improved DSET is used to obtain the fusion prediction results and visual information confidence.
For the subsequent extraction of candidate ROIs, it is necessary to project the uncertainty distribution of the reckoned trajectory based on the vehicle coordinate system into the image coordinate system.
It is assumed that the road where the vehicle is located is flat, and the coordinate in the vehicle coordinate system is defined as
. The origin of the vehicle coordinate system is the center of the front axle of the vehicle. The coordinate in the image coordinate system is
, and the origin is the upper left corner pixel of the image. The coordinate in the camera coordinate system is
. The following transformation relationship between the camera coordinate system and the image coordinate system is derived.
where
is the normalization coefficient, and
is the camera internal parameter matrix obtained by calibration.
Based on Equation (5), the transformation relationship from the image coordinate system to the vehicle coordinate system can be expressed as:
where
and
represent the normalized vehicle coordinate system coordinates and image coordinate system coordinates, respectively,
is the camera extrinsic parameter matrix obtained by calibration, and
is the transformation matrix from the vehicle coordinate system to the image coordinate system.
A sample of the reckoned trajectory and uncertainty distribution of the reckoned trajectory in the vehicle coordinate system is shown in
Figure 6. The sample after projection to the image coordinate system is shown in
Figure 7.
Considering that if the candidate ROIs are extracted directly based on the center of the candidate confidence ellipse within the reckoned trajectory, the following problems may be caused: (1) The extracted part of the road surface area is limited, and the identification result is easily disturbed by the external environment; (2) The identification results of the candidate ROI are not representative of the local region. In the image coordinate system, the feature details of the image near the vehicle are richer, and the feature details of the image far away from the vehicle are slightly blurred. Therefore, based on the distance from the vehicle, the confidence ellipses within the selected image regions can be grouped to extract candidate ROIs.
Based on the above analysis, a method for extracting candidate ROIs in the confidence ellipse region is proposed as shown in
Figure 8. Firstly, the selected area in the vehicle coordinate system is divided into grids with size
, and then the coordinates of the corner points of the divided grids are projected and transformed. The total number of corner points covered by road shadows in the group of confidence ellipses
, the total number of grid corner points in the group of confidence ellipses
, and the proportion of non-shaded grid corner points in the group of confidence ellipses
are calculated, respectively. If the number of remaining coverage grids in the confidence ellipse region without shadow coverage
, it means that the available area is too small, then exit the algorithm and continue to search for the next candidate group of confidence ellipses. Otherwise, the
sets of grid corners are randomly sampled in the confidence ellipse region without shadow coverage according to the relative distance, and finally the sequence of candidate ROIs in the confidence ellipse region is output.
The extraction process of the proposed candidate ROIs extraction method is shown in
Figure 9. It is worth noting that both the grid size
and the random sampling group number of the grid corners
will affect the extraction efficiency and accuracy of candidate ROIs. In this paper,
is selected based on reasonable experiments. Since the feature details in the confidence ellipse region with a farther distance from the vehicle are more blurred, it is necessary to sample the image features in the farther region multiple times and select
, respectively.
After obtaining candidate ROIs, inspired by virtual sensing theory, a fusion method based on improved DSET is designed. A virtual sensor is a unique form of sensor, which outputs the required measurement information based on the established mathematical and statistical analysis model. The measurements derived from the mathematical model are defined as outputs of virtual sensors [
27]. The candidate ROIs extracted in each frame are regarded as virtual sensors. The measurement information is defined as the predicted probability of each road surface condition output by the calibrated road feature classification model. Based on the rich measurement information of road surface conditions provided by the constructed redundant virtual sensor network, the improved DSET is used to fuse them to ensure the stability and accuracy of the IFE estimation results.
DSET is a data fusion method that can objectively deal with ambiguous and uncertain information. By combining the independent judgments of multiple incompatible bodies of evidence on the sub-propositions of the identification framework, more reliable conclusions can be drawn than with a single source of information [
28].
In this paper, the DSET combinational rule can be derived as follows:
where
is defined as a virtual sensor (i.e., evidence in DSET),
is the basic probability assignment of each road surface condition in each piece of evidence,
is the conflict coefficient, which can measure the degree of conflict between evidence,
is the fusion confidence corresponding to a certain class of road surface condition.
However, the basic DSET still has the following two problems in practical applications: (1) The fusion problem when a strong conflict exists between evidence. The DSET combinational rule may not be able to combine the known evidence; (2) The problem of the basic probability assignment generation of evidence. The generation result of the basic probability assignment will directly affect the fusion result output by the system. The basic DSET defaults that the weight of each evidence is the same. When the evidence is strongly conflicting, it leads to abnormal fusion results.
In order to solve the problem of abnormal fusion results caused by strong conflicts between evidence, the historical identification results are used to assign different weights to the evidence corresponding to each road surface condition. As a result, a new basic probability assignment is constructed and the weights in it are updated online.
The constructed new basic probability assignment is shown in Equation (8).
where
,
, and
represent the confidence of each class of candidate ROIs output by the road feature classification model,
,
, and
represent the evidence weight corresponding to different road surface conditions,
,
, and
represent the number of candidate ROIs classified as different road surface conditions, and
is the normalized basic probability assignment corresponding to a certain road surface condition.
Considering that the road surface condition changes very slowly within a certain sampling time, an online update strategy for the evidence weight shown in Equation (9) is proposed to make the fusion result as stable as possible. In addition, if there is no candidate ROI or only a single ROI in a certain frame, the class corresponding to the maximum weight value can be defined as the prediction result based on the historical information.
where
is the updated weight of the road surface condition
at time
,
is the normalized weight of the road surface condition
at time
,
is the normalized weight of road surface condition
at time
,
is the total number of extracted candidate ROIs,
is the number of candidate ROIs predicted by ShuffleNet as road surface condition
, and
is a relaxation factor that balances update rate and update stability.
Based on the candidate ROIs extraction method shown in
Figure 8, the obtained sequence of candidate ROIs can be input into the calibrated road feature classification model for uncertainty distribution estimation. Based on the proposed fusion method based on improved DSET, the fusion prediction results and fusion confidence
in a single group of confidence ellipses can be obtained. Therefore, the visual information confidence
for a single group of confidence ellipses within the uncertainty distribution can be defined as follows.
where
is the proportion of the available area in the candidate confidence ellipse region of group
. Considering the subsequent fusion, the visual information confidence flag is defined. After a reasonable experiment, when
, the flag is set to flag “1”.
3.4. Spatiotemporal Transformation Strategy for Identification Results
Since the visual information is predictive, it is necessary to consider the algorithm processing time and the camera image acquisition cycle to perform spatiotemporal transformation processing on the identification results of the proposed IFE. The vehicle speed in the geodetic coordinate system is defined as
, and
represents the preview distance from the image captured by the camera to the front wheel of the vehicle. When the algorithm finishes processing one frame,
represents the current center position of the front wheel of the vehicle in the geodetic coordinate system derived from the reckoned trajectory. Considering that the feature details near the vehicle are rich, this paper chooses to update the identification results of each frame in real time, and then the displacement stamp
of the final identification results of the frame in the geodetic coordinate system can be obtained as follows.
where
is the processing time of the algorithm for one frame, and
is the cycle time of the camera to capture the image. Here,
is set for subsequent synchronization processing.
The spatial position relationship in the spatiotemporal transformation strategy is shown in
Figure 10, including the image coordinate system and the vehicle coordinate system at the corresponding moment of the frame.
is the vehicle’s front wheel position in the geodetic coordinate system when the camera captures the frame. After the algorithm processing time
, the vehicle reaches the position
as shown in
Figure 10, and the identification results of all groups of the candidate confidence ellipses in the red trapezoid can be obtained. Based on the frame-by-frame update strategy selected in this paper, the identification results corresponding to the coordinate
in the vehicle coordinate system corresponding to the displacement stamp
in the geodetic coordinate system can be recorded as the final identification results of the frame. The preview distance set in this paper is
, which is the distance between the bottom of the red trapezoid in
Figure 10 and the center of the front wheel of the vehicle. The preview distance can be adjusted according to the application requirements.