Cooperative Visual Augmentation Algorithm of Intelligent Vehicle Based on Inter-Vehicle Image Fusion

In a connected vehicle environment based on vehicle-to-vehicle (V2V) technology, images from the front and ego vehicles are fused to augment the visual field of a driver or an autonomous system. This helps avoid road accidents by eliminating blind spots (objects occluded by vehicles), especially in tailgating situations in urban areas. Multi-view image fusion is difficult when the relative location of the two sensors is unknown and the object to be fused is occluded in some views. Therefore, we propose an image geometric projection model and a new cooperative fusion method between neighboring vehicles. Based on a 3D inter-vehicle projection model, selected feature matching points are used to estimate the geometric transformation parameters. By adding depth information, our method also introduces a new deep-affine transformation to fuse inter-vehicle images. Experimental results on the KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) dataset validate our algorithm. Compared with previous work, our method improves the IoU index by 2~3 times. This algorithm can effectively enhance the visual perception ability of intelligent vehicles, and it will help promote the further development of computer vision technology in the field of cooperative perception.

Author Contributions: Conceptualization, W.L. and Y.M.; methodology, W.L. and Y.M.; software, Y.M.; validation, M.G., S.D. and L.W.; formal analysis, Y.M.; investigation, W.L.; resources, W.L.; data curation, Y.M.; writing - original draft preparation, W.L. and Y.M.; writing - review and editing, M.G. and S.D.; visualization, Y.M.;


Introduction
According to the Global status report on road safety, 2021, 1.3 million people die each year in road traffic crashes, and an estimated 50 million people suffer nonfatal injuries [1]. Statistics from NHTSA show that 30~50% of traffic accidents are due to rear-end collisions [2,3]. Such a scenario might occur when unforeseen circumstances cause a leading vehicle to brake suddenly [4]. Because they are unaware of the situation ahead of the leading vehicle, drivers do not have enough time to react. Studies report that an extra 0.5 s of warning time can prevent 60% of collisions, and that this figure rises to 90% with an extra 1 s of warning time [2]. Hence, the risk can be reduced if the forward vehicle's images are fused with the host vehicle's images to enhance the visual perception ability of the driver or the autonomous driving system. The cooperative visual augmentation algorithm based on V2V will be a key part of advanced driver assistance systems (ADAS) and autonomous driving systems in preventing potential hazards.
To decrease the possibility of tailgating accidents, several works have focused on implementation of ADAS or autopilot. In [5], binocular cameras equipped in vehicles generate stereo images, which are used to calculate the distance between leading and following vehicles combined with optical flow. The system monitors the distance and alerts drivers. In [6], as an alternative to equipping the vehicle with expensive sensors, the binocular camera of smartphones or tablets can detect and track forward obstacles, vehicles, and lanes. A further study [7] proposes a time-based collision avoidance warning system (CSW) for lead vehicles in rear-end collisions. It directly quantifies the threat level of the current dynamic situation using velocity, acceleration, and the gap between vehicles. The authors of [8] propose a tailgating model used to monitor tailgating behavior of drivers. The tailgating model calculates the minimum gap required considering relative speed, driver's perception reaction time, weather conditions, and brake efficiency in real time, and alerts drivers with an audio or visual signal.
All the rear-end collision avoidance systems mentioned above use only information obtained from sensors or cameras mounted on the host vehicle. Even autonomous vehicle systems rely solely on ego-vehicle sensors. These methods are limited in dealing with collisions because of blind spots. If the blind spot could be made translucent, drivers could recognize the situation before the leading vehicle brakes suddenly, and would then have enough reaction time to avoid a collision. Therefore, the risk can be decreased by utilizing sensed information from neighboring vehicles through vehicle-to-vehicle (V2V) communication [9]. Motivated by this observation, we contribute to research in the field by elaborating on the cooperative system, formed from forward neighboring vehicles and the host vehicle, to augment the host vehicle driver's visual ability. Our method is valuable not only to ADAS but also to autonomous vehicle systems, which can improve driving safety by extending visual perception to obstructed areas.
Although many groups have presented research on collaborative approaches to safe driving, finding an efficient way to enhance visual perception in order to guarantee safe (automated) driving is still an open question. In [10,11], location information of vehicles is exchanged periodically to prevent potential danger. The authors of [4] provided a rear-end distance warning system based on images from stereoscopic cameras on rear vehicles and rear cameras on leading vehicles. These cooperative systems provide textual or numerical information, such as a warning message, the time gap between cars, and routing data. It is still difficult for drivers to sense immediate danger from such information because human beings tend to believe what they can see. The authors of [12,13] proposed a collision avoidance scheme based on an occupancy grid determined by combining light detection and ranging (LiDAR) data. In [14], the authors also fused features extracted from sparse point clouds. In these works, expensive sensors were used to fill in the missing parts. In [15], vehicle trajectories at intersections were estimated based on each vehicle's velocity through V2V communication. A system for cooperative collision avoidance in overtaking scenarios was proposed in [16]. The authors of [9] designed a real-time multisource data fusion scheme through cooperative V2V communications, in which multiple confidences were fused based on the Dempster-Shafer theory of evidence (DS).
There exist many studies on collision prediction or avoidance, but few works address visual augmentation in a cooperative way. Work [17] proposes a transparent vehicle method based on V2V video streams in order to deal with passing maneuvers. The method needs accurate distance information gathered from radar sensors to realize the object projection between two images. The work [18] uses linear constraints to enable rear vehicle drivers to see through the front vehicle. However, this method only makes sense when both vehicles are in the same lane. The authors of [19] introduced a method that can "see through" the forward vehicle by adopting affine transformation to fuse images from adjacent vehicles, whether or not they are in the same lane. However, a deviation in the occluded object's location and size always exists, and this deviation might cause incorrect judgement by drivers or an autonomous system.
Following this line, we propose a cooperative visual augmentation algorithm based on V2V technology. Expensive sensors, such as LiDAR, are not needed. An ordinary camera, for example a driving recorder, meets our requirements, and there is no limitation on the relative location of the leading and host vehicles. The main contributions of this paper are as follows:
(1) A new collaborative visual augmentation method to eliminate blind spots is proposed. Our method can extend the visual perception ability of the driver or autonomous driving system to the obstructed area by fusing images from forward vehicles.
(2) We also propose a deep-affine transformation to realize the visual fusion. Depth information and geometric constraints are introduced to optimize the affine matrix parameters.
(3) We improve the results of the visual augmentation method by projecting occluded objects onto host vehicle images. KITTI data are used as the evaluation dataset.

Architecture of Cooperative Visual Augmentation Algorithm
Because dedicated short range communications (DSRC) can support safety applications at high data rates [20,21], video information can be transmitted between nearby vehicles in real time. By fusing the image information from neighboring vehicles, we can enhance the visual field of the host vehicle's driver or autonomous driving system. As shown in Figure 1, the view of the host vehicle (Vehicle B in Figure 1) is blocked by other vehicles. Vehicle B's visual perception can be augmented by combining the visual images from leading vehicles (Vehicle A in Figure 1). The fusion algorithm for inter-vehicle images is based on the 3D inter-vehicle projection model and the new deep-affine transformation. Similar to Superman's ability, our visual augmentation method makes occluded objects visible so as to eliminate blind spots, and thus reduce potential traffic accidents. The locations of the two cooperative vehicles and the occluded objects are flexible: the vehicles can drive in the same lane (Figure 1a) or in different lanes (Figure 1b).

An overview of the cooperative augmentation procedure is shown in Figure 2. A connected vehicle environment is considered so that sensor data (images) from forward vehicles are available. The algorithm is divided into two main phases: (1) geometric projection based on the deep-affine transformation and (2) object-based fusion. In the first phase, the features of the two images, $f_a$ and $f_b$, are extracted separately. Matching feature pairs $(P_A, P_B)$ are selected based on the two feature maps, and mismatches are eliminated. Based on the filtered matching feature pairs $(P_A, P_B)$, the parameters of the projection matrix $H$ are computed. Our method adopts affine transformation as the geometric projective transformation, and the parameters of the matrix $H$ are automatically optimized by integrating depth information. We name this optimized affine transformation the deep-affine transformation, with new matrix $H_{new}$. The optimization is described in Section 3.4. The second phase is the fusion part, which applies the deep-affine matrix $H_{new}$ to improve the results of the visual augmentation. The fusion region is decided by merging results from the object detection module. This step is detailed in Section 3.5.

Implementation
The key idea of the cooperative method is to share sensor data obtained from vehicles in different locations via V2V communications. Here, video images from forward vehicles are transmitted to host vehicles through DSRC technology, enhancing the ability of the host vehicle to see occluded objects. The implementation involves five steps: (1) create the 3D projection model between the front and back vehicle views; (2) select feature pairs from paired images (front and rear vehicle images obtained synchronously); (3) obtain the depth map of the rear vehicle (host vehicle); (4) calculate and optimize the parameters of the affine transformation matrix; and (5) fuse images to augment the view of the host vehicle. All steps are described in the following sections.

The 3D Inter-Vehicle Projection Model
The key step in realizing cooperative augmentation is to model the geometric projective relation between the two view images. As shown in Figure 3, the same object maps to a different location, scale, and shape in the front vehicle (vehicle A) and host vehicle (vehicle B) images. The object's points in image planes A and B obey geometric projective constraints. We suppose that the view angle between the two cameras is limited, and thus the shape deformation is ignored here. Therefore, the mapping relation between the two image planes satisfies a linear geometric transformation. In our model, the affine transformation, a non-singular linear transformation [22], is adopted. It has the matrix representation in block form:

$$P_B = H P_A = \begin{bmatrix} A & T \\ 0^{T} & 1 \end{bmatrix} P_A$$

with $A$ a 2 × 2 non-singular matrix, $T$ a translation 2-vector, and $0^{T}$ a null 2-vector. $P_A$ and $P_B$ represent the point sets in image planes A and B.

Our geometric projection model is shown in Figure 3. $O_a$ and $O_b$ denote the optical centers of the two cameras, and $\Pi_a$ and $\Pi_b$ are the corresponding image planes. Points $v$ and $t$ represent 3-space points of the vehicle and the tree, respectively, in the Euclidean world frame. Applying projective geometry, the 3D point $v$ in $\mathbb{R}^3$ (three-dimensional Euclidean space) is mapped to points $v_a \in P_A$ and $v_b \in P_B$ in $\mathbb{R}^2$ (two-dimensional Euclidean space) in image planes $\Pi_a$ and $\Pi_b$. Similarly, $t_a \in P_A$ and $t_b \in P_B$ are the images of the 3D point $t$ in $\mathbb{R}^3$. Here, the tree can be seen by both vehicles (vehicle A and B); however, the red vehicle is visible to vehicle A and occluded from vehicle B. As illustrated in Figure 3, $v_a$ is the known image point and $v_b$ is the unknown point that needs to be estimated. The estimation process based on this model is as follows: (1) Suppose we have $n$ points $T_i \in \mathbb{R}^3$ $(i = 1, \cdots, n)$ seen by both vehicles; the matching point pairs $(t_a^i, t_b^i)$, $t_a^i \in P_A$, $t_b^i \in P_B$, are obtained correspondingly.
A precise projective transformation exists only under the assumption that the space points $v$ and $t$ are coplanar. In practice, they are usually at different depths, which causes a deviation in the projection and results in an inaccurate estimate of point $v_b$. To obtain a more accurate result, depth information is used to improve the mapping. We propose a new deep-affine transformation to solve this problem, which is detailed in Section 3.4.
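The estimation idea in this section can be illustrated with a minimal sketch: fit an affine matrix to matched points of an object seen by both vehicles, then apply it to a point seen only by the front vehicle. The least-squares fit via NumPy, the function names, and the sample coordinates are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def estimate_affine(pts_a, pts_b):
    """Least-squares fit of a 3x3 affine H such that pts_b ~ H @ pts_a.

    pts_a, pts_b: (n, 2) arrays of matched image points, n >= 3.
    """
    pts_a = np.asarray(pts_a, dtype=float)
    pts_b = np.asarray(pts_b, dtype=float)
    # Homogeneous source points: each row is (x_a, y_a, 1).
    src = np.hstack([pts_a, np.ones((len(pts_a), 1))])
    # Solve src @ M = pts_b; M.T holds the rows [a11 a12 t1] and [a21 a22 t2].
    M, *_ = np.linalg.lstsq(src, pts_b, rcond=None)
    H = np.eye(3)
    H[:2, :] = M.T
    return H

def project_point(H, p_a):
    """Map one image point from plane A to plane B with the affine matrix H."""
    x, y, _ = H @ np.array([p_a[0], p_a[1], 1.0])
    return np.array([x, y])

# Hypothetical matches on an object visible to both vehicles (e.g., the tree).
t_a = np.array([[100.0, 50.0], [200.0, 60.0], [150.0, 120.0], [120.0, 90.0]])
H_true = np.array([[0.8, 0.0, 30.0], [0.0, 0.8, 20.0], [0.0, 0.0, 1.0]])
t_b = (np.hstack([t_a, np.ones((4, 1))]) @ H_true.T)[:, :2]

H = estimate_affine(t_a, t_b)
v_b = project_point(H, (180.0, 100.0))  # estimate of the occluded point in image B
```

With noise-free matches the fit recovers the generating matrix. With real matches, the estimate for the occluded point inherits the depth-related deviation discussed above.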

Feature Pair Selection
To obtain the projection matrix $H$, selecting reliable and accurate matching point pairs between the images plays a key role. For reliable matching, the feature descriptors of points in the images should be representative and stable. Matching pair selection includes feature detection, feature matching, and mismatch elimination.
(1) Feature detection: Lowe's SIFT method [23] is used for feature detection and description. It uses a 128-element feature vector to characterize the gradient pattern in a properly oriented neighborhood surrounding a SIFT feature. The features are largely invariant to incidental environmental changes in lighting, viewpoint, and scale.
(2) Feature matching: SIFT features in the front and back images are matched by searching for the most similar descriptors. A brute-force algorithm [23] is adopted to match feature pairs. The Euclidean distance between feature vectors is used as the matching score. The selected matching point pairs (also called feature pairs in the following) need to satisfy Equation (1). $(p_a, p_b)$ is a pair of corresponding points in image A and image B. $f_a$ and $f_b$ represent the feature descriptors of $p_a \in P_A$ and $p_b \in P_B$. $\max(\mathrm{dis}(f_a, f_b))$ denotes the best matching pair and $\mathrm{max\_sec}(\mathrm{dis}(f_a, f_b))$ the second best one. Figure 7a in the experiment section displays the matching result; erroneous matching pairs clearly remain when matching is based on similarity alone.
(3) Mismatch elimination: To obtain more accurate feature pairs, we use the RANSAC algorithm [24] to eliminate mismatched feature pairs. We randomly select $n$ small subsets, or "seeds" ($n$ pairs of matching points), and repeat the calculation of the fundamental matrix $F$ $n$ times. The value $|p_a^{T} F p_b|$ is called the residual error, which is ideally zero. An $F$ computed from an outlier-free seed produces small residual errors $|p_a^{T} F p_b|$ for most inlier matching pairs. We preserve the seeds that produce the minimum median residual error, so that erroneous pairs are filtered out. Figure 7b in the experiment section displays the features after the RANSAC procedure; most erroneous feature pairs are eliminated.
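The two scoring ideas used above, comparing the best and second-best descriptor match and measuring the epipolar residual of a pair, can be sketched as follows. The acceptance rule here is Lowe's ratio test with an assumed threshold of 0.75, and smaller distance is treated as better; neither detail is confirmed by the text, and the descriptors, points, and fundamental matrix are hypothetical toy values. The RANSAC sampling loop itself is not shown.

```python
import numpy as np

def ratio_test_matches(desc_a, desc_b, ratio=0.75):
    """Brute-force descriptor matching with a best/second-best distance check.

    desc_a: (n_a, d) array, desc_b: (n_b, d) array with n_b >= 2.
    Returns a list of (index_in_a, index_in_b) pairs that pass the test.
    """
    desc_a = np.asarray(desc_a, dtype=float)
    desc_b = np.asarray(desc_b, dtype=float)
    # Pairwise Euclidean distances, shape (n_a, n_b).
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    matches = []
    for i, row in enumerate(dists):
        best, second = np.argsort(row)[:2]
        # Keep the match only if the best candidate is clearly better than
        # the runner-up (assumed threshold, not the paper's value).
        if row[best] < ratio * row[second]:
            matches.append((i, int(best)))
    return matches

def epipolar_residuals(F, pts_a, pts_b):
    """Residual |p_a^T F p_b| for each matched pair, in homogeneous coordinates."""
    ones = np.ones((len(pts_a), 1))
    pa = np.hstack([np.asarray(pts_a, dtype=float), ones])
    pb = np.hstack([np.asarray(pts_b, dtype=float), ones])
    return np.abs(np.einsum("ij,jk,ik->i", pa, F, pb))

# Toy descriptors: the second descriptor in A has two equally good candidates in B.
desc_a = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
desc_b = np.array([[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.9, -0.1]])
matches = ratio_test_matches(desc_a, desc_b)

# Hypothetical F for a rectified, sideways-shifted pair: true matches share a row.
F = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]])
residuals = epipolar_residuals(F, [[10, 20], [30, 40]], [[15, 20], [33, 45]])
```

In this toy case the ambiguous descriptor is rejected, and the second point pair has a nonzero residual because its rows differ, which is the kind of pair a RANSAC filter would discard.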

Acquisition of Depth Map
Depth information is critical to improving the geometric projection results. In this section, we use a neural network called the monocular residual matching (monoResMatch) network to infer accurate and dense depth estimates from a single image in a self-supervised manner [25]. As shown in Figure 4, first, a multi-scale feature extractor takes a single raw image as input and computes deep learnable representations at different scales, from quarter resolution $F_L^2$ to full resolution $F_L^0$, in order to make the network robust to ambiguities in photometric appearance. Second, deep high-dimensional features at input image resolution are processed, through an hourglass structure with skip-connections, to estimate multiscale inverse depth (i.e., disparity) maps aligned with the input and with a virtual right view learned during training. The network thus learns to emulate a binocular setup, which allows further processing in the stereo domain. Third, a disparity refinement stage estimates residual corrections to the initial disparity. In particular, deep features from the first stage and back-warped features of the virtual right image are used to construct a cost volume that stores the stereo matching costs using a correlation layer. Finally, the depth map is obtained according to the theory of binocular matching.
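The final step, going from disparity to depth, follows the standard binocular relation depth = focal length × baseline / disparity. A minimal sketch, with a hypothetical focal length and baseline chosen only to give round numbers:

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m, min_disparity=1e-6):
    """Convert a disparity map (pixels) to metric depth: depth = f * B / d.

    disparity_px is a list of rows; tiny disparities are clamped to avoid
    division by zero for points at (near) infinite distance.
    """
    return [
        [focal_px * baseline_m / max(d, min_disparity) for d in row]
        for row in disparity_px
    ]

# Hypothetical calibration values, for illustration only.
depth = disparity_to_depth([[36.0, 18.0]], focal_px=720.0, baseline_m=0.5)
```

Halving the disparity doubles the depth, so distant objects are the most sensitive to disparity errors.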

Deep-Affine Transformation
Selected feature pairs are used to calculate the geometric transformation parameters, which map occluded objects from the front image plane $\Pi_a$ to the host image plane $\Pi_b$. Here, we take the geometric transformation to be the affine transformation, whose matrix representation is given in Equation (1). $(p_a, p_b)$ represents a matching point pair in the two image planes, with $p_a = (x_a, y_a)$ and $p_b = (x_b, y_b)$. $H$ is the affine matrix, and the homogeneous formula is as follows:

$$\begin{bmatrix} x_b \\ y_b \\ 1 \end{bmatrix} = H \begin{bmatrix} x_a \\ y_a \\ 1 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & t_1 \\ a_{21} & a_{22} & t_2 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_a \\ y_a \\ 1 \end{bmatrix}$$

Here, $a_{11}$, $a_{12}$, $a_{21}$, $a_{22}$, $t_1$, and $t_2$ are the six parameters of the $H$ matrix. In our situation, the two vehicles are running in the same direction, so it is reasonable to assume that there is no rotation or shear. The parameters $a_{12}$ and $a_{21}$ are therefore normally close to 0. The parameters $a_{11}$ and $a_{22}$ are the scale factors of the horizontal and vertical coordinates. They can be computed as:

Figure 5 represents the geometric constraints of the affine transformation and depth information. Taking object $T$ as an example, $l_a$ and $h_a$ are the length and width of the $T_a$ bounding box in image plane $\Pi_a$. Similarly, $l_b$ and $h_b$ represent the length and width of the $T_b$ bounding box in image plane $\Pi_b$. As illustrated in Figure 5, $d_{ta}$ is the distance from object $T$ to the camera optical center $O_a$, and $d_{ab}$ is the distance between the two cameras. The parameters of matrix $H$ can be calculated from the matched feature pairs of object $T$. However, object $T$ and the occluded object $V$ may be at different depths from the camera, which leads to inaccurate mapping and fusion of object $V$ in image plane $\Pi_b$ (shown in Figure 5) under the 3D inter-vehicle projection model (Section 3.1). Here, we introduce depth information to adjust the parameters of the affine matrix $H$. In the depth map, the value of each pixel represents the depth distance, so we can obtain the distance ratio $\gamma$ of object $T$ and the occluded object $V$ relative to the camera optical center.

Suppose the new deep-affine transformation matrix is $H_{new}$. According to Equation (3), the parameter $a_{11new}$ of $H_{new}$ can be computed as:

Here, $d_{va}$ is the distance from the occluded object $V$ to the camera optical center $O_a$. Because $d_{ab}$ and $d_{va}$ are unknown, Equations (3)-(5) are substituted into Equation (6).

We suppose $d_a = d_b$ for two reasons: (1) the focal length is much smaller than the distance and (2) our method uses the KITTI dataset, which employs the same camera. Equation (7) can then be simplified to:

$$a_{11new} = \frac{a_{11}}{a_{11} + \gamma - \gamma \times a_{11}} \tag{8}$$

The same procedure is applied to the parameter $a_{22new}$. The parameters $t_{1new}$ and $t_{2new}$ depend on the image size and on $a_{11new}$ and $a_{22new}$, with the center kept unchanged. The equation for $t_{1new}$ and $t_{2new}$ is as follows:

where $L$ and $W$ are the length and width of the image, and $\alpha$, $\beta$ are the adjustment factors. The new deep-affine transformation results in the following matrix representation:
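A minimal sketch of the scale correction in Equation (8) follows. Two points are assumptions of this sketch rather than statements from the text: $\gamma$ is taken as the depth of the reference object $T$ divided by the depth of the occluded object $V$, and each object's depth is read as the median depth inside its bounding box in a depth map where both objects are visible. Under a simple pinhole model with equal focal lengths, this choice of $\gamma$ makes Equation (8) agree with the direct scale $d_{va} / (d_{va} + d_{ab})$.

```python
from statistics import median

def depth_ratio(depth_map, box_t, box_v):
    """gamma = (depth of reference object T) / (depth of occluded object V).

    depth_map is a list of rows; boxes are (x_min, y_min, x_max, y_max) with
    exclusive maxima. Using the median depth per box is an assumption.
    """
    def box_depth(box):
        x0, y0, x1, y1 = box
        return median(depth_map[y][x] for y in range(y0, y1) for x in range(x0, x1))
    return box_depth(box_t) / box_depth(box_v)

def deep_affine_scale(a, gamma):
    """Equation (8): a_new = a / (a + gamma - gamma * a)."""
    return a / (a + gamma - gamma * a)

# Hypothetical geometry: cameras 10 m apart, T at 20 m and V at 40 m from camera A.
d_ab, d_ta, d_va = 10.0, 20.0, 40.0
a11 = d_ta / (d_ta + d_ab)                 # scale measured on object T
depth_map = [[20.0, 20.0, 40.0, 40.0],
             [20.0, 20.0, 40.0, 40.0]]
gamma = depth_ratio(depth_map, (0, 0, 2, 2), (2, 0, 4, 2))
a11_new = deep_affine_scale(a11, gamma)    # corrected scale for object V
```

When both objects lie at the same depth, $\gamma = 1$ and the correction leaves the scale unchanged, which matches the coplanar case of the plain affine model.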

Object-Based Image Fusion
To achieve visual augmentation, we need to fuse multiview sensor images from adjacent vehicles. This section describes how the fusion region and the functional form needed for image fusion are estimated. To map objects from the forward vehicle image A to the host image B, we first need information about the geometric configuration, namely the size, shape, and location of the fusion region. The bounding boxes of all street objects detected in image A are the candidate fusion objects. Only those objects occluded by vehicle A are merged into the fusion region in image B. The epipoles $e_a$ and $e_b$ can be used to eliminate objects that are not occluded by vehicle A. Here, the fusion region in image B is a circular area (rectangles and other shapes are also possible). The center and radius of the circle depend on the location and size of the detected vehicle region (vehicle A).
Secondly, we need to estimate a functional form to map pixels from the front image to the back one. The mapping matrix $H_{new}$ between the two images is estimated in Section 3.4. The affine transformation regarded as the mapping relationship has the following matrix representation, with $p_a$ and $p_b$ in homogeneous coordinates:

$$p_b = H_{new}\, p_a$$

The fusion location is determined by this affine mapping. The blending method is similar to [18]. The blending weight is adjusted to use more color from the front image (image A) close to the fusion center and more color from the back image (image B) away from the center, toward the edge of the circle. The transparency parameter controls the mixture of the two images.
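The center-weighted blending can be sketched as below. The text does not give the exact weight function or the rule that derives the circle from the detected vehicle box, so the linear falloff, the choice of box center and half the shorter side as center and radius, and all function names are assumptions of this sketch.

```python
import math

def fusion_circle(box):
    """Circle derived from the detected front-vehicle box (x_min, y_min, x_max, y_max).

    Assumption: center = box center, radius = half the shorter side.
    """
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2), min(x1 - x0, y1 - y0) / 2

def front_weight(x, y, center, radius, transparency=1.0):
    """Weight of the front image at pixel (x, y): highest at the center,
    falling linearly to 0 at the circle edge and staying 0 outside it."""
    dist = math.hypot(x - center[0], y - center[1])
    return transparency * max(0.0, 1.0 - dist / radius)

def blend_pixel(front_rgb, host_rgb, w):
    """Per-channel mix: w of the warped front pixel, (1 - w) of the host pixel."""
    return tuple(w * f + (1.0 - w) * h for f, h in zip(front_rgb, host_rgb))

center, radius = fusion_circle((100, 50, 200, 110))
w_mid = front_weight(165, 80, center, radius)      # halfway to the edge
mixed = blend_pixel((200, 100, 0), (0, 100, 200), w_mid)
```

Lowering `transparency` keeps more of the host image everywhere, which may be preferable when the projected object is uncertain.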

Datasets
Experiments were performed on the KITTI dataset. The KITTI stereo dataset [26] is a collection of rectified stereo pairs made up of 61 scenes (containing about 42,382 stereo frames) mainly concerned with driving scenarios. The predominant image size is 1242 × 375 pixels. Here, only image frames from the left camera are used as the testing data in our method (the right camera could be used in the same way). Instead of obtaining images from the front vehicle and host vehicle simultaneously, we use two frames of the video (with interval ∆t) to imitate the cooperation of the front and back vehicles. ∆t is a random value within 3~20 frames. To simulate occlusion, we selected some vehicles in the picture as the blind spot and blocked these objects with a white panel in the picture of the back vehicle (shown in Figure 6d). Figure 6 gives an example of how images from the KITTI dataset are used to simulate V2V in reality. In the left column, (a) and (b), two KITTI images with an interval of 15 frames are chosen. In the right column, (c) and (d), these two images are treated as the images from the front and host vehicle, respectively. The vehicle marked with the red rectangle is taken to be the occluded object. The bottom host image is processed by using a white panel to block the vehicle. We used these image pairs to test our method's effectiveness and flexibility.
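The simulation protocol can be sketched in a few lines. The assignment of the later frame to the front vehicle, the grayscale list-of-rows image format, and the function names are assumptions of this sketch.

```python
import random

def pick_frame_pair(num_frames, min_gap=3, max_gap=20, rng=random):
    """Return (host_index, front_index) with front_index = host_index + dt.

    Assumption: the later frame stands in for the front vehicle, since the
    recording car has moved forward by then. Requires num_frames > max_gap.
    """
    dt = rng.randint(min_gap, max_gap)
    host = rng.randrange(0, num_frames - dt)
    return host, host + dt

def occlude_with_panel(image, box, value=255):
    """Return a copy of a grayscale image (list of rows) with the box painted white.

    box is (x_min, y_min, x_max, y_max) with exclusive maxima.
    """
    x0, y0, x1, y1 = box
    out = [list(row) for row in image]
    for y in range(y0, y1):
        for x in range(x0, x1):
            out[y][x] = value
    return out

rng = random.Random(0)
host_idx, front_idx = pick_frame_pair(100, rng=rng)
image = [[0] * 4 for _ in range(3)]
panel = occlude_with_panel(image, (1, 0, 3, 2))
```

Because the unmodified host frame still contains the blocked vehicle, it can serve as ground truth for the projected object.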

Depth Map Acquisition Results
We adopt the monoResMatch network to generate the monocular depth map. This network can obtain a high accuracy of up to 90% in the depth map on the KITTI data. The depth map results are shown in Figure 8. The top images are the colored depth maps and the bottom images are the corresponding input images.

Deep-Affine Transformation Results
Based on the above analysis, we assume that the correspondence between the two vehicle images roughly follows an affine transformation. To address the non-coplanarity problem, our method adopts depth information. The deep-affine matrix is used to estimate the occluded objects' points in the host vehicle's image, which are mapped from the same objects' points in the front vehicle's image. To fully test the geometric projection effect of the deep-affine projection, more than 200 images in KITTI were selected as testing data. The results are shown in Figure 9.
In Figure 9, image (a) is the front vehicle image and image (b) is the host vehicle image processed by adding the white panel. Image (c) shows the ground truth. Images (d) and (e) give the transformed front images based on affine transformation and deep-affine transformation, respectively. Compared with image (d), image (e) is closer to the ground truth in both size and location. These results indicate that adding depth information is effective in improving the transformed images.
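To compare a projected object with the ground truth, its bounding box in the front image can be pushed through the estimated matrix. A minimal sketch, assuming axis-aligned boxes and a matrix stored as nested lists (the matrix values are hypothetical):

```python
def project_box(H, box):
    """Map an axis-aligned box (x_min, y_min, x_max, y_max) through a 3x3
    affine matrix H (nested lists) and return the bounding box of the result."""
    x0, y0, x1, y1 = box
    corners = [(x0, y0), (x1, y0), (x0, y1), (x1, y1)]
    xs = [H[0][0] * x + H[0][1] * y + H[0][2] for x, y in corners]
    ys = [H[1][0] * x + H[1][1] * y + H[1][2] for x, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))

# Hypothetical matrix: uniform scale 0.8 plus a translation.
H_example = [[0.8, 0.0, 30.0], [0.0, 0.8, 20.0], [0.0, 0.0, 1.0]]
box_b = project_box(H_example, (100, 50, 200, 150))
```

Running the same box through the affine and the deep-affine matrices gives the two candidate boxes that are compared against the ground truth.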
Moreover, quantitative evaluation is used to measure the performance of the deep-affine transformation. Figure 10 gives the IoU (intersection over union) results. The average IoU and the IoU statistics are shown in Table 1 and Figure 11. The IoU can be computed as:

$$\mathrm{IoU} = \frac{\mathrm{area}(B_{p} \cap B_{gt})}{\mathrm{area}(B_{p} \cup B_{gt})}$$

where $B_p$ is the projected bounding box and $B_{gt}$ is the ground-truth bounding box. In Figure 10, the red boxes are the ground-truth bounding boxes, the yellow boxes show the result of the affine transformation, and the green boxes show the result of our deep-affine transformation. The higher the IoU value, the closer the result is to the ground truth. The proposed method achieves good performance and is robust to changes in view and distance, as well as to illumination variations and differing backgrounds. Table 1 lists the average IoU for different groups. A total of 10 groups (10 images in a group) are randomly selected from the KITTI data, and the average IoU values are listed in Table 1. Adding depth information to adjust the affine transformation improves the IoU value remarkably.
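For reference, the IoU of two axis-aligned boxes can be computed as in the following sketch; the box format (x_min, y_min, x_max, y_max) is an assumption.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at 0 for disjoint boxes.
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))  # half of each box overlaps
```

Two equal boxes shifted by half their width share one third of their union, which shows how quickly IoU drops with location error.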
Statistical data of the IoU values are shown in Figure 11. The IoU value ranges from 0 to 1 and is divided into five intervals. The pie chart gives the share of each interval. The deep-affine transformation performs better than the normal affine transformation, largely due to the effective use of depth information. As shown in Figure 11, the IoU values of the affine projection are mainly concentrated between 0 and 0.2. By comparison, the results of the deep-affine transformation mainly fall in (0.4, 0.6) and (0.6, 0.8).

Figure 12 shows the final visual augmentation results: the left column shows the fusion results based on affine transformation and the right column shows the results based on deep-affine transformation. If the front vehicle detects an object on the street, it sends its image data to the host vehicle to realize fusion. After being filtered, the occluded objects are fused into the fusion region in the host image. The fusion process blends the pixel colors in the back vehicle image with the corresponding pixels of the occluded objects' area in the front vehicle image. The fusion region is a circle. As described in Section 3.5, the blending weight is adjusted to use more color from the front image close to the center and to retain more color from the host image away from the center.

Cooperative Visual Augmentation Results Based on Fusion
As shown in Figure 12, the left column shows the final fusion results using the original affine matrix in [19], and the right column shows the results of the new deep-affine matrix in our method. The top three rows show real occlusion, in which the occluded vehicles are blocked by other vehicles on the road. In the bottom three rows, the vehicles are artificially blocked by a white panel to simulate occlusion. In either situation, the occluded vehicles can be visually perceived by drivers or autonomous systems of ego vehicles. Furthermore, the fused size and location of the blind spots are closer to the ground truth after adding the depth information. Our method does fail to obtain accurate fused images in some cases. As shown in Figure 13, for example, many mismatched or sparse feature pairs between the inter-vehicle images result in incorrect fusion and poor IoU performance.

Conclusions
In this paper, we propose a cooperative visual augmentation algorithm for occluded objects in connected vehicle environments. Front and host vehicle images are used cooperatively to enhance the visual perception of the host vehicle when occluded objects exist in front of the leading vehicle. To obtain the correct size and location of the transformed occluded objects, the algorithm optimizes the parameters of the geometric transformation matrix by combining depth information with the geometric constraints of the camera model. Experiments on the KITTI dataset are conducted to evaluate the effectiveness and scalability of our algorithm. The results show that IoU values are greatly improved (2~3 times higher than the previous method) and that the fused objects approach the ground truth. A limitation of this method is that the influence of the view angle is ignored, which causes size deviation in some situations. Furthermore, the method does not perform well when few feature pairs are matched. In spite of this, our cooperative visual enhancement algorithm can still effectively eliminate blind spots to avoid accidents in urban areas.