1. Introduction
According to the Global Status Report on Road Safety 2021, 1.3 million people die each year as a result of road traffic crashes, and an estimated 50 million people suffer nonfatal injuries [1]. Statistics from the NHTSA show that 30-50% of traffic accidents are due to rear-end collisions [2,3]. Such a scenario can occur when unforeseen circumstances cause a leading vehicle to brake suddenly [4]. Because drivers are unaware of the situation ahead of the leading vehicle, they do not have enough time to react. Studies report that an extra 0.5 s of warning time can prevent 60% of collisions, rising to 90% with an extra 1 s of warning time [2]. Hence, the risk can clearly be reduced if the forward vehicle's images are fused with the host vehicle's images to enhance the visual perception of the driver or the autonomous driving system. A cooperative visual augmentation algorithm based on V2V communication can therefore become a key part of advanced driver assistance systems (ADAS) or autonomous driving systems in preventing potential hazards.
To decrease the possibility of tailgating accidents, several works have focused on the implementation of ADAS or autopilot systems. In [5], binocular cameras mounted on vehicles generate stereo images, which are combined with optical flow to calculate the distance between leading and following vehicles; the system monitors this distance and alerts drivers. In [6], as an alternative to equipping the vehicle with expensive sensors, the binocular camera of a smartphone or tablet detects and tracks forward obstacles, vehicles, and lanes. A further study [7] proposes a time-based collision avoidance warning system (CSW) for lead vehicles in rear-end collisions; it directly quantifies the threat level of the current dynamic situation using velocity, acceleration, and the gap between vehicles. The authors of [8] propose a tailgating model used to monitor drivers' tailgating behavior. The model calculates, in real time, the minimum gap required considering relative speed, the driver's perception-reaction time, weather conditions, and brake efficiency, and alerts drivers with an audio or visual signal.
All the rear-end collision avoidance systems mentioned above use only information obtained from sensors or cameras equipped on the host vehicle; even autonomous vehicle systems rely solely on ego-vehicle sensors. Such methods have limitations in dealing with collisions caused by blind spots. If the blind spot could be made translucent, drivers would realize the situation before the sudden braking of the leading vehicle occurs, and would then have enough reaction time to avoid a collision. Therefore, the risk can be decreased by utilizing sensed information from neighboring vehicles through vehicle-to-vehicle (V2V) communication [9]. Motivated by this deduction, we contribute to research in the field by elaborating a cooperative system, formed from forward neighboring vehicles and the host vehicle, to augment the host vehicle driver's visual ability. Our method is valuable not only for ADAS but also for autonomous vehicle systems, as it can improve driving safety by extending visual perception into obstructed areas.
Although many groups have presented research on collaborative approaches for safe driving, finding an efficient way to enhance visual perception in order to guarantee safe (automated) driving is still an open question. In [10,11], location information of vehicles is exchanged periodically to prevent potential danger. The authors of [4] provided a rear-end distance warning system based on images gathered from stereoscopic cameras on rear vehicles and rear cameras on leading vehicles. These cooperative systems give textual or numerical information, such as a warning message, the time gap between cars, and routing data; it is still difficult for drivers to sense the immediate danger because human beings tend to believe what they can see. In [12,13], the authors proposed a collision avoidance scheme based on an occupancy grid determined by combining light detection and ranging (LiDAR) data. In [14], the authors also fused features extracted from sparse point clouds; expensive sensors were used to make up for the missing parts. In [15], vehicle trajectories at intersections were estimated based on each vehicle's velocity through V2V communication. A system for cooperative collision avoidance in overtaking scenarios was proposed in [16]. The authors of [9] designed a real-time multisource data fusion scheme through cooperative V2V communications, in which multiple confidences were fused based on the Dempster–Shafer theory of evidence (DS).
There exist many studies on collision prediction or avoidance, but few works have addressed visual augmentation in a cooperative way. The work [17] proposes a transparent-vehicle method based on V2V video streams in order to deal with passing maneuvers; it needs accurate distance information gathered from radar sensors to realize the object projection between two images. The work [18] uses linear constraints to enable rear vehicle drivers to see through the front vehicle; however, this method only makes sense when both vehicles are in the same lane. The authors of [19] introduced a method which can "see through" the forward vehicle by adopting an affine transformation to fuse images from adjacent vehicles, whether or not they are in the same lane. However, deviation in the occluded object's location and size always exists, and this deviation might cause incorrect judgement by drivers or an autonomous system.
Following this line, we propose a cooperative visual augmentation algorithm based on V2V technology. Expensive sensors, such as LiDAR, are not needed; an ordinary camera, for example a driving recorder, meets our requirements, and there is no limitation on the locations of the leading and host vehicles. The main contributions of this paper are as follows:
- (1)
A new collaborative visual augmentation method to eliminate blind spots is proposed. Our method extends the visual perception ability of the driver or autonomous driving system into occluded areas by fusing images from forward vehicles.
- (2)
We also propose a deep-affine transformation to realize the visual fusion. Depth information and geometric constraints are introduced to optimize the affine matrix parameters.
- (3)
We improve the results of the visual augmentation method by projecting occluded objects onto host vehicle images. KITTI data are used as the evaluation dataset.
2. Architecture of Cooperative Visual Augmentation Algorithm
Because dedicated short range communications (DSRC) can support safety applications at high data rates [20,21], video information can be transmitted between nearby vehicles in real time. By fusing image information from neighboring vehicles, we can enhance the visual field of the host vehicle's driver or auto-driving system. As shown in Figure 1, the view of the host vehicle (Vehicle B in Figure 1) is blocked by other vehicles. Vehicle B's visual perception can be augmented by combining the visual images from leading vehicles (Vehicle A in Figure 1). The fusing algorithm for inter-vehicle images is based on the 3D inter-vehicle projection model and the new deep-affine transformation. Similar to Superman's ability, our visual augmentation method can make occluded objects visible so as to eliminate blind spots, and thus potential traffic accidents can be decreased sharply. The locations of the two cooperative vehicles and the occluded objects are flexible: the vehicles can drive in the same lane (Figure 1a) or in different lanes (Figure 1b).
An overview of the cooperative augmentation procedure is shown in Figure 2. A connected vehicle environment is considered, so that sensor data (images) from forward vehicles are available for acquisition. The algorithm is divided into two main phases: (1) geometric projection based on deep-affine and (2) object-based fusion. In the first phase, the features of the two images, I_A and I_B, are extracted separately. Matching feature pairs are selected based on the two feature maps, and mismatches are eliminated. Based on the filtered matching feature pairs, the parameters of the projection matrix H are computed. Our method adopts the affine transformation as the geometric projective transformation, and the parameters of the matrix H are automatically optimized by integrating the depth information. We name this optimized affine the deep-affine transformation, with new matrix H′. The optimizing part is described in Section 3.4. The second phase is the fusion part, which applies the deep-affine matrix H′ to improve the results of the visual augmentation. The fusion region is decided by merging results from the object detection module. This step is detailed in Section 3.5.
3. Implementation
The key idea of the cooperative method is to share sensor data obtained from vehicles in different locations via V2V communications. Here, video images from forward vehicles are transmitted to host vehicles through DSRC technology, enhancing the host vehicle's ability to see occluded objects. The implementation involves five steps: (1) create the 3D projection model between front and back vehicle views; (2) select feature pairs from paired images (synchronized front and rear vehicle images); (3) obtain the depth map of the rear (host) vehicle; (4) calculate and optimize the parameters of the affine transformation matrix; and (5) fuse images to augment the view of the host vehicle. All steps are described in the following sections.
3.1. The 3D Inter-Vehicle Projection Model
The key step in realizing cooperative augmentation is to model the geometric projective relation between the two view images. As shown in Figure 3, the same object maps to a different location, scale, and shape in the front vehicle (vehicle A) and host vehicle (vehicle B) images. The object's points in image planes A and B are therefore related by geometric projective constraints. We suppose that the view angle between the two cameras is limited, so shape deformation can be ignored here. The mapping relation between the two image planes then satisfies a linear geometric transformation. In our model, the affine transformation, a non-singular linear transformation [22], is adopted. It has the matrix representation in block form:
x_B = H x_A, with H = ( A T ; 0ᵀ 1 ), where A is a 2 × 2 non-singular matrix, T a translation 2-vector, and 0 a null 2-vector. x_A and x_B represent point sets in image planes A and B.
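The block-form mapping above can be illustrated numerically. The following is a minimal sketch; the helper names are illustrative, not from the paper:

```python
import numpy as np

def make_affine(A, t):
    """Assemble the block-form affine matrix [[A, t], [0, 1]]."""
    H = np.eye(3)
    H[:2, :2] = A
    H[:2, 2] = t
    return H

def apply_affine(H, pts):
    """Map an (n, 2) array of points through a 3x3 affine matrix."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # homogeneous coordinates
    mapped = (H @ pts_h.T).T
    return mapped[:, :2]

# Example: pure scaling by 0.5 with a translation of (10, 20)
H = make_affine(np.diag([0.5, 0.5]), np.array([10.0, 20.0]))
print(apply_affine(H, np.array([[100.0, 100.0]])))  # [[60. 70.]]
```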
Our geometric projection model is shown in Figure 3. O_A and O_B denote the optical centers of the two cameras, and π_A and π_B are the corresponding image planes. Points X_V and X_T represent the 3-space points of the vehicle and the tree, respectively, in the Euclidean world frame. Applying projective geometry, the 3D point X_T in R³ (three-dimensional Euclidean space) is mapped to points x_A^T and x_B^T in R² (two-dimensional Euclidean space) in image planes π_A and π_B. Similarly, x_A^V and x_B^V are the mapping points of the 3D point X_V in R³. Here, the tree can be seen by both vehicles (vehicles A and B); however, the red vehicle is visible to vehicle A but occluded from vehicle B. As illustrated in Figure 3, x_A^V is the known image point and x_B^V is the unknown point that needs to be estimated. The estimation process based on this model is as follows:
- (1)
Suppose we have n 3-space points X_i (i = 1, …, n) seen by both vehicles; the matching point pairs (x_A^i, x_B^i) are obtained correspondingly.
- (2)
The projection matrix H, i.e., the geometric transformation parameters, is estimated based on the n matching point pairs (x_A^i, x_B^i).
- (3)
Through H and x_A^V, the occluded point x_B^V can be calculated.
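The estimation process above can be sketched with a plain least-squares fit of the six affine parameters (function and variable names here are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def estimate_affine(pts_a, pts_b):
    """Least-squares fit of the 6 affine parameters from matched point pairs.

    Each pair contributes two linear equations:
      x_b = a1*x_a + a2*y_a + b1
      y_b = a3*x_a + a4*y_a + b2
    """
    n = len(pts_a)
    M = np.zeros((2 * n, 6))
    v = np.zeros(2 * n)
    for i, ((xa, ya), (xb, yb)) in enumerate(zip(pts_a, pts_b)):
        M[2 * i] = [xa, ya, 1, 0, 0, 0]
        M[2 * i + 1] = [0, 0, 0, xa, ya, 1]
        v[2 * i], v[2 * i + 1] = xb, yb
    p, *_ = np.linalg.lstsq(M, v, rcond=None)
    a1, a2, b1, a3, a4, b2 = p
    return np.array([[a1, a2, b1], [a3, a4, b2], [0.0, 0.0, 1.0]])

# Synthetic check: point pairs generated by a known affine are recovered
H_true = np.array([[0.8, 0.0, 30.0], [0.0, 0.8, 12.0], [0.0, 0.0, 1.0]])
pts_a = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0], [50.0, 80.0]])
pts_b = (H_true @ np.hstack([pts_a, np.ones((4, 1))]).T).T[:, :2]
H_est = estimate_affine(pts_a, pts_b)
```

With noise-free pairs the fit recovers the matrix exactly; with real matches the least-squares solution averages out small localization errors.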
There is an assumption here: if the space points X_T and X_V are coplanar, then a precise projective transformation exists. In fact, however, they usually lie at different depths, which causes a deviation in projection and results in an inaccurate estimation of point x_B^V. In order to obtain a more accurate result, depth information is adopted to improve the mapping; we propose a new deep-affine transformation to solve this problem, detailed in Section 3.4.
3.2. Feature Pair Selection
In order to obtain the projection matrix H, the selection of reliable and accurate matching point pairs between the images plays a key role. To perform trustworthy matching, the feature descriptors of image points should be representative and stable. Matching pair selection includes feature detection, feature matching, and mismatch elimination.
- (1)
Feature detection: Lowe's SIFT method [23] is used for feature selection and description. It uses a 128-element feature vector descriptor to characterize the gradient pattern in a properly oriented neighborhood surrounding a SIFT feature. The features are invariant to incidental environmental changes in lighting, viewpoint, and scale.
- (2)
Feature matching: By searching for the most similar descriptors, SIFT features in the front and back images are matched. The brute-force algorithm [23] is adopted to match feature pairs, with the Euclidean distance between feature vectors used as the matching score. The selected matching point pairs (also called feature pairs in the following) must satisfy Lowe's ratio test: a pair of corresponding points (x_A, x_B) in images A and B, with descriptors D(x_A) and D(x_B), is accepted only if the distance to the best matching candidate is significantly smaller than the distance to the second-best one. Figure 7a in the experiment part displays the matching result; it is obvious that erroneous matching pairs exist when matching is based on similarity alone.
- (3)
Mismatch elimination: To obtain more accurate feature pairs, we use the RANSAC algorithm [24] to eliminate mismatched feature pairs. Small subsets ("seeds" of a few matching point pairs) are randomly selected, and the calculation of the fundamental matrix F is repeated n times. The value of x_Bᵀ F x_A is called the residual error, which is ideally zero for a correct correspondence. An F computed from an outlier-free seed produces small residual errors for most inlier matching pairs. We preserve the seed that produces the minimum median residual error, so that erroneous pairs are filtered out. Figure 7b in the experiment part displays the features after the RANSAC procedure; most erroneous feature pairs are eliminated.
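The seed-and-minimum-median-residual idea can be sketched as follows. This simplified illustration uses an affine model in place of the fundamental matrix F for brevity, and all names are illustrative:

```python
import numpy as np

def fit_affine(pa, pb):
    """Least-squares affine fit from matched point pairs."""
    n = len(pa)
    M = np.zeros((2 * n, 6)); v = np.zeros(2 * n)
    for i, ((xa, ya), (xb, yb)) in enumerate(zip(pa, pb)):
        M[2 * i] = [xa, ya, 1, 0, 0, 0]
        M[2 * i + 1] = [0, 0, 0, xa, ya, 1]
        v[2 * i], v[2 * i + 1] = xb, yb
    p = np.linalg.lstsq(M, v, rcond=None)[0]
    return np.array([[p[0], p[1], p[2]], [p[3], p[4], p[5]], [0, 0, 1]])

def residuals(H, pa, pb):
    """Reprojection error of each pair under model H."""
    ph = np.hstack([pa, np.ones((len(pa), 1))])
    return np.linalg.norm((H @ ph.T).T[:, :2] - pb, axis=1)

def ransac_inliers(pa, pb, n_seeds=200, thresh=2.0, rng=None):
    """Draw random minimal seeds, fit the model to each, and keep the
    seed whose fit has the minimum median residual; pairs whose residual
    under that model stays below `thresh` are kept as inliers."""
    rng = rng or np.random.default_rng(0)
    best_mask, best_med = None, np.inf
    for _ in range(n_seeds):
        idx = rng.choice(len(pa), size=3, replace=False)
        H = fit_affine(pa[idx], pb[idx])
        r = residuals(H, pa, pb)
        med = np.median(r)
        if med < best_med:
            best_med, best_mask = med, r < thresh
    return best_mask
```

A seed containing only inliers fits a model under which most true matches have near-zero residual, so its median residual wins and the gross mismatches fall outside the threshold.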
3.3. Acquisition of Depth Map
Depth information is critical to improving the geometric projection results. In this section, we use a neural network called the monocular residual matching (monoResMatch) network to infer accurate and dense depth estimates in a self-supervised manner from a single image [25]. As shown in Figure 4, first, a multi-scale feature extractor takes a single raw image as input and computes deep learnable representations at different scales, from quarter resolution to full resolution, in order to make the network robust to ambiguities in photometric appearance. Second, deep high-dimensional features at the input image resolution are processed, through an hourglass structure with skip-connections, to estimate multi-scale inverse depth (i.e., disparity) maps aligned with the input, together with a virtual right view learned during training; this makes the network learn to emulate a binocular setup, allowing further processing in the stereo domain. Third, a disparity refinement stage estimates residual corrections to the initial disparity; in particular, deep features from the first stage and back-warped features of the virtual right image are used to construct a cost volume that stores the stereo matching costs using a correlation layer. Finally, the depth map is obtained according to the theory of binocular matching.
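Once a dense depth map is available, the distance ratio used later in Section 3.4 can be read off directly. A minimal sketch, assuming the depth map is a 2D array in meters and objects are given as pixel bounding boxes (both assumptions; names are hypothetical):

```python
import numpy as np

def depth_ratio(depth_map, box_t, box_v):
    """Ratio k of the depths of object T and occluded object V.

    Boxes are (x0, y0, x1, y1) pixel rectangles; the median depth
    inside each box is used to be robust to boundary pixels.
    """
    def median_depth(box):
        x0, y0, x1, y1 = box
        return float(np.median(depth_map[y0:y1, x0:x1]))
    return median_depth(box_t) / median_depth(box_v)

# Toy depth map: T at ~10 m, V at ~25 m, background at 50 m
depth = np.full((120, 160), 50.0)
depth[20:60, 30:70] = 10.0     # object T region
depth[70:100, 90:140] = 25.0   # object V region
k = depth_ratio(depth, (30, 20, 70, 60), (90, 70, 140, 100))
print(k)  # 0.4
```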
3.4. Deep-Affine Transformation
Selected feature pairs are used to calculate the geometric transformation parameters that map occluded objects from the front image plane π_A to the host image plane π_B. Here, we take the geometric transformation to be the affine transformation, with the matrix representation of Equation (1). P = {(x_A^i, x_B^i)} represents the matching point pair set in the two image planes. H is the affine matrix, and its homogeneous formula is as follows:

H = ( a1 a2 b1 ; a3 a4 b2 ; 0 0 1 )
a1, a2, a3, a4, b1, and b2 are the six parameters of the H matrix. In our situation, the two vehicles are travelling in the same direction, so it is reasonable to assume that there is no rotation or shear transformation; the parameters a2 and a3 therefore normally approach 0. The parameters a1 and a4 are the scale factors of the horizontal and vertical coordinates. They can be computed from the ratios of the corresponding bounding-box dimensions of an object visible in both views, e.g., a1 = l_B / l_A and a4 = w_B / w_A, where l and w denote the length and width of the object's bounding box in planes A and B (see Figure 5).
Figure 5 represents the geometric constraints of the affine transformation and the depth information. Taking object T as an example, l_A and w_A are the length and width of T's bounding box in image plane π_A; similarly, l_B and w_B represent the length and width of T's bounding box in image plane π_B. As illustrated in Figure 5, d_T denotes the distance from object T to the camera optical center O_A, and d is the distance between the two cameras.
Depending on the matched feature pairs of object T, the parameters of matrix H can be calculated. However, object T and the occluded object V may lie at different depths from the camera, which leads to inaccurate mapping and fusing of object V in image plane π_B (shown in Figure 5) under the 3D inter-vehicle projection model (Section 3.1). Here, we introduce the depth information to adjust the parameters of the affine matrix H. In the depth map, the value of a pixel represents the depth distance, so we can obtain the distance ratio k = d_T / d_V of object T and the occluded object V relative to the camera optics.
Suppose the new deep-affine transformation matrix is H′. According to Equation (3), the parameter a1′ of H′ can be computed from the scale of object V at its own depth, where d_V is the distance from the occluded object V to the camera optical center O_A. Because d_T and d_V are not known in absolute terms, Equations (3)–(5) are substituted into (6). Here, we suppose the two focal lengths are equal (f_A = f_B) for two reasons: (1) the value of the focal length is much smaller than the distances involved, and (2) our method uses the KITTI dataset, which employs the same camera. Equation (7) can then be simplified to an expression of a1′ in terms of a1 and the depth ratio k alone.
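Under the assumptions above (equal focal lengths, pinhole projection with the object's image size inversely proportional to its depth), one consistent derivation of the corrected scale factor is the following sketch; it matches the stated constraints but is not necessarily the paper's exact formulation:

```latex
a_1 = \frac{l_B}{l_A}
    = \frac{f L / (d_T + d)}{f L / d_T}
    = \frac{d_T}{d_T + d},
\qquad
a_1' = \frac{d_V}{d_V + d}.
% Eliminate d via d = d_T (1 - a_1)/a_1 and substitute d_V = d_T / k:
a_1' = \frac{d_T / k}{\,d_T / k + d_T (1 - a_1)/a_1\,}
     = \frac{a_1}{\,a_1 + k\,(1 - a_1)\,}.
```

Note that for k = 1 (T and V at the same depth) this reduces to a1′ = a1, as expected.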
The same procedure is applied to the parameter a4. As for the translation parameters b1 and b2, their values are related to the image size and to the parameters a1 and a4, with the image center remaining unchanged; that is, b1 and b2 are chosen so that the center of the image maps to itself: b1 = α (1 − a1′) W / 2 and b2 = β (1 − a4′) H_img / 2, where W and H_img are the length and width of the image, and α and β are adjustment factors. The new deep-affine transformation results in the following matrix representation:

H′ = ( a1′ 0 b1 ; 0 a4′ b2 ; 0 0 1 )
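The deep-affine construction can be summarized in a short sketch. The function and argument names are hypothetical; the scale correction follows the pinhole-model assumption stated above, and the translations keep the image center fixed:

```python
import numpy as np

def deep_affine(l_a, w_a, l_b, w_b, k, W, H_img, alpha=1.0, beta=1.0):
    """Build a deep-affine matrix H' from object T's bounding boxes in the
    two image planes and the depth ratio k = d_T / d_V.

    Scales are corrected as a' = a / (a + k (1 - a)) under the pinhole
    assumption; translations b1, b2 keep the image center unchanged.
    """
    a1 = l_b / l_a                      # horizontal scale from T's boxes
    a4 = w_b / w_a                      # vertical scale from T's boxes
    a1p = a1 / (a1 + k * (1.0 - a1))    # depth-corrected scales
    a4p = a4 / (a4 + k * (1.0 - a4))
    b1 = alpha * (1.0 - a1p) * W / 2.0      # center-preserving translation
    b2 = beta * (1.0 - a4p) * H_img / 2.0
    return np.array([[a1p, 0.0, b1], [0.0, a4p, b2], [0.0, 0.0, 1.0]])

# Example: T's box shrinks by 0.8 between views; V is twice as far (k = 0.5)
H_prime = deep_affine(100, 50, 80, 40, 0.5, 1242, 375)
```

With k = 1 the correction vanishes and H′ coincides with the plain affine scale estimated from T.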
3.5. Object-Based Image Fusion
To achieve visual augmentation, we need to fuse multiview sensor images from adjacent vehicles. This section determines the fusion region and the functional form required for image fusion. In order to map objects from the forward vehicle image A to the host image B, we first need to work out the geometric configuration: the size, shape, and location of the fusion region. All detected street objects' bounding boxes in image A are candidate fusion objects, but only those objects occluded by vehicle A are merged into the fusion region in image B; epipolar constraints can be used to eliminate the objects that are not occluded by vehicle A. Here, the fusion region in image B is a circular area (a rectangle or other shapes are also possible), whose center and radius depend on the location and size of the detected vehicle region (vehicle A).
Secondly, we need to estimate a functional form to map pixels from the front image to the back one. The mapping matrix H′ between the two images is estimated in Section 3.4, and the affine transformation, regarded as the mapping relationship, has the matrix representation given there; the fusing location is therefore determined by the affine mapping. The blending method is similar to [18]: the blending weight is adjusted to use more color from the front image A close to the fusion center and more color from the host image B away from the center, toward the edge of the circle. A transparency parameter controls the mixture of the two images.
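The radial blending described above can be sketched as follows, assuming a linear falloff of the weight from the circle center (the function name and falloff profile are illustrative assumptions):

```python
import numpy as np

def radial_blend(img_front, img_host, center, radius, transparency=1.0):
    """Blend the warped front image into the host image inside a circle.

    The weight is 1 for the front image at the circle center and falls off
    linearly to 0 at the circle edge; `transparency` scales the mixture.
    """
    h, w = img_host.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    dist = np.hypot(xx - center[0], yy - center[1])
    weight = np.clip(1.0 - dist / radius, 0.0, 1.0) * transparency
    weight = weight[..., None]  # broadcast over color channels
    return (weight * img_front + (1.0 - weight) * img_host).astype(img_host.dtype)

# Toy example: a white front image blended into a black host image
front = np.full((100, 100, 3), 255.0)
host = np.zeros((100, 100, 3))
fused = radial_blend(front, host, center=(50, 50), radius=30)
```

Lowering `transparency` keeps some of the host image visible even at the center, so the occluding vehicle appears translucent rather than fully removed.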