A Unified Multiple-Target Positioning Framework for Intelligent Connected Vehicles

Future intelligent transport systems depend on the accurate positioning of multiple targets in the road scene, including vehicles and all other moving or static elements. The existing self-positioning capability of individual vehicles remains insufficient. Also, bottlenecks in developing on-board perception systems stymie further improvements in the precision and integrity of positioning targets. Vehicle-to-everything (V2X) communication, which is fast becoming a standard component of intelligent and connected vehicles, renders new sources of information such as dynamically updated high-definition (HD) maps accessible. In this paper, we propose a unified theoretical framework for multiple-target positioning by fusing multi-source heterogeneous information from the on-board sensors and V2X technology of vehicles. Numerical and theoretical studies are conducted to evaluate the performance of the framework proposed. With a low-cost global navigation satellite system (GNSS) coupled with an initial navigation system (INS), on-board sensors, and a normally equipped HD map, the precision of multiple-target positioning attained can meet the requirements of high-level automated vehicles. Meanwhile, the integrity of target sensing is significantly improved by the sharing of sensor information and exploitation of map data. Furthermore, our framework is more adaptable to traffic scenarios when compared with state-of-the-art techniques.


Introduction
The intelligent transportation system (ITS) is one of the most indispensable components of the smart city concept that integrates sensing, control, information, and communication technologies into transportation [1]. In recent years, with the emergence of cutting-edge applications of ITS, the positioning of multiple targets, including vehicles and other elements has been playing an increasingly important role in improving safety, mobility, and efficiency [2][3][4]. For example, future intelligent connected vehicles (ICVs) require the positioning of their own real-time location with centimeter-level precision [5], and the awareness of all objects such as surrounding vehicles and vulnerable road users with significant integrity and confidence. In ITS, the positioning of vehicles and other targets are usually referred to as vehicular self-positioning and target localization, respectively. Although attention has been paid in these topics [6][7][8][9], there still exist many limitations that need to be eliminated.

(FA)
Completely Automated Driving HD Centimeter Essential (auto updated) The HD map-based method benefits from the high precision of the map used. For example, the digital map used in [25][26][27] is created from light detection and ranging (LiDAR) data and has a precision of up to 10 cm. A high-accuracy localization technique using urban environment maps for vehicles in motion is proposed in [28], and these maps are generated by integrating GNSS, LiDAR data, and on-board sensors. In addition, more features are used in the process of map alignment, which contributes to higher self-positioning precision. Traffic lights and visual lane markers are used as landmarks in [5,29], respectively. In a more recent study [26], a unified framework using more references in addition to the abovementioned features (lamp poles, traffic signs, etc.) is proposed, and self-localization with an accuracy of within 30 cm is achieved with merely a low-cost camera. While the abovementioned studies conduct map-based localization independently, there is much research that integrates it with other on-board sensors. A lane determination system that fuses on-board sensors, GNSS, and commercially available road network maps is proposed in [30]. A proof-of-concept study using INS and maps for vehicular localization in GNSS-denied environments is conducted in [31].

Target Localization
Apart from vehicular self-positioning, the relative localization of targets in the surrounding environment is another fundamental technology underpinning ICVs. This task is mainly undertaken by vehicle perception modules, and the positioning result is obtained in the vehicular coordinate instead of in the world coordinate. Although the recent decades have witnessed the rapid development of on-board sensors, the current on-board sensing technology still faces the following problems [4]. First, there is a trade-off between localization accuracy and cost. For example, low-cost cameras and radars can achieve accuracies of only several centimeters, while LiDAR systems with centimeter-level ranging accuracy are expensive [32]. Second, all sensors have limited sensing ranges, and the occlusion of sensors by other vehicles and objects is a frequent occurrence [33]. Irrespective of the number of sensors equipped in the vehicle, the perception of the environment remains incomplete. Attempts to improve the perception of bicycles have already encountered bottlenecks to a certain extent.
Studies using the perceptual information of other vehicles to improve the integrity of target localization have proven to be effective. This is because other vehicles in the network may have seen a target that cannot be seen by the ego-vehicle because of occlusion or limited field of view. A vehicle-to-vehicle (V2V) communication and map merging-based cooperative perception system to extend the perception range beyond line of sight and field of view is proposed in [34]. In [35], the results of the awareness of other vehicles are integrated into the ego-vehicle's perception system as virtual sensors to achieve perception enhancement. In [36], a multi-vehicle perception framework combining image and semantic features is proposed, and experiments have proved that the problem of front-vehicle occlusion can be solved. In these studies, the problems of self-positioning and localization of other targets were considered separately, which rendered the effect of the fusion very sensitive to their relative positioning. In addition, these articles do not provide quantitative analyses of the integrity of the results perceived.
Maps also contribute to the relative positioning of targets. For example, the geometry of intersection can be directly extracted from a HD map for motion planning and control [37]. This reduces the pressure on the vehicle-mounted sensing system, but relies on vehicular self-positioning. Incorrect self-positioning greatly affects decision-making. Other works integrate the semantic and geometric prior knowledge in HD maps with the on-board sensing system to improve positioning confidence. In [38,39], a prior probability map is generated in a bird's-eye view or image plane to aid understanding of the scene. Recently, a neural network incorporating prior knowledge with on-board sensors was presented in [40]. However, these studies treat the map only as an auxiliary tool for perception to improve the recognition result without improving the integrity of perception. Moreover, these studies also rely on the accuracy of vehicular positioning.

Contributions
In general, recent research exploits additional sources of information to improve vehicular self-positioning and localization of other targets. However, to the best of our knowledge, V2X and HD maps are considered separately in the literature, while the positioning of the vehicles themselves and those of other targets are usually treated as different modules. In this paper, we propose a unified theoretical positioning framework for multiple targets for ICVs. The bottlenecks of vehicular self-positioning and target localization can be eliminated. Our main contributions are summarized as follows.

•
A unified theoretical framework for vehicular self-positioning and relative localization of targets based on V2X is proposed, and it can integrate data from the on-board sensors in the vehicular network and HD maps with GNSS/INS measurements into a unified system. • By cooperative positioning, accuracy of under 0.2 m can be achieved in terms of self-positioning and relative localization of targets in urban areas using low-cost GNSS/INS, on-board sensors, and widely equipped HD maps. Simultaneously, the target sensing range is extended beyond the line of sight and field of view, and this greatly improves the integrity of perception. • Furthermore, compared with state-of-the-art techniques, the proposed framework places fewer demands on vehicular network nodes' density and the amount of vehicle-to-target measurements.
The remainder of the paper is organized as follows. In Section 2, the system model is provided. The development of the proposed joint multiple-target positioning for ICVs is detailed in Section 3. Detailed implementation aspects are introduced in Section 4. Theoretical studies are explained in Section 5. Numerical results are given in Section 6. Finally, we conclude the paper in Section 7.

Problem Formulation
Firstly, we describe the targets in a traffic scene in this section.

•
Targets: All objects related to vehicle driving, including the connected vehicles themselves and the elements that constitute the environment.

•
Connected vehicles: Vehicles in the vehicular network that can obtain information from other vehicles and HD maps.
A set of N f static features F = 1, 2, · · · , N f also exists in the scene. Their positions are stored on a HD map with noise and, although not necessary, can be captured by the on-board sensors. We use Equation (2) to describe the two-dimensional position of the jth feature.
In addition, we also consider a set of N o objects, moving or static, O = 1, 2, · · · , N o , which is described in Equation (3).
It must be noted that we do not estimate the orientations of the features and objects because the planning module usually does not require this information.
Our task is to estimate the localization and orientation of all the connected vehicles, We also attempt to localize the features and objects (see Equation (5)) Based on Equations (4) and (5), we can obtain the relative localization of other targets by transforming their location into the vehicles' coordinate system.
In terms of the measurements, a target may be captured by a vehicle if it is within the vehicle's sensing range and without any occlusions. As shown in Figure 1, the target can be a connected vehicle, a feature, or an object, which are indicated with red, blue, and brown arrows, respectively. Its measurement model is described as Equation (6).
is a function which denotes the measurement of target at position The connected vehicles are also equipped with GNSS/INS, which can provide measurements of their localization and orientation. The corresponding measurements are indicated with black arrows in Figure 1. Similarly, we treat the map as a virtual sensor with measurements pertaining to an associated feature as shown with green arrows. The measurement of GNSS/INS on vehicle i and that of HD map on feature j is indicated by Equations (7) and (8).
and z (M) where denotes the measurement noise from the map.
If we consider this problem as analogous to a distributed sensor network, the features and connected vehicles are static and mobile anchors, respectively, and their locations are constrained by the HD map and GNSS/INS. The objects are static or mobile nodes, and the on-board sensors generate constraints between the vehicle and nodes. For the vehicles, additional constraints arise from the V-V measurements.

The Unified Multiple-Target Positioning Framework
The objective of multiple-target positioning in this paper is to estimate the states , from measurements, where k ∈ V, j ∈ {V, F , O}, l ∈ F . From a probabilistic perspective, the maximum likelihood estimation of X t is given by Equation (10).
where P (Z t |X t ) is the likelihood of the measurements Z t given the states X t . The conditional distribution of the on-board sensor measurements in Equation (6) is given by Equation (11), where P z Similarly, we get the conditional distributions of the measurement from GNSS/INS P z given states of vehicles p (V) i,t or feature states p (F) j,t (see Equations (12) and (13)). Their distributions are also normal with perception model h as expectation. and Remark 1. Given a set of independent and identically distributed (i.i.d.) data D = {x n , n = 1, 2, · · · , N}, where observation x n ∈ R D×1 is drawn from a multivariate Gaussian distribution N (x n ; µ n , R n ).
The log-likelihood of the data set can be written as Equation (14).
where e n = x n − µ n .
According to Remark 1, the maximization of L(D) is equivalent to the minimization of J (D) (see Equation (15)). The problem is solved with optimization as described in Section 4.
Considering (11)- (13) and assuming that the three types of measurements are independent, the joint probability density can be factorized as given in Equation (16).
The maximization of P(Z t |X t ) can be reformulated as the following nonlinear least squares problem (see Equation (17)).
To enable insightful visualization, the nonlinear least-squares problem is interpreted in terms of inference over a factor graph [41]. This graph consists of 2 types of nodes: variable nodes, which represent the state X t , and factor nodes, which represent the constraints to on the variables. The factor nodes can be further divided into bi-directed nodes, which denote the constraints for 2 states (from the on-board sensor measurements), and directed prior nodes, which denote the constraints from the map and GNSS/INS. As shown in Figure 2, for each measurement, we have the following factors.
• Factor between the variables V and Ξ = {V, F , O}, on behalf of the constraints of V-V, vehicle-feature (V-F), vehicle-object (V-O), as expressed in Equation (18).
• Factor between the variables V and GNSS/INS, on behalf of the constraints from GNSS/INS, as expressed in Equation (19).
• Factor between the variables F and the map, on behalf of the constraints from the HD map, as expressed in Equation (20).
The joint probability in Equation (16) can then be rewritten as the product of all the factors.
We can clearly see the constraints applied on each node in the factor graph.

Implementation Aspects
In this section, we introduce the implementation aspects related to the hypothesis on vehicle perception, the measurement model, optimization, and data association.

Perception Demands and Sensing Capability of Vehicles
We assume that due to occlusion and the limitations of perception range, the vehicle cannot completely locate the desired target. In this section, we explain the hypothesis of this work. It should be noted that our hypothesis is based on typical perceptual systems, but can be easily adapted to other forms.
As shown in Figure 3, we identify the scope of targets that need to be localized by a vehicle as "demanding space" and assume that it is a rectangle that can be quantitatively described by l f and l r , i.e., the distances that the vehicle requires to sense ahead of and behind itself, respectively, and W d , the range that should be perceived laterally. We assume that the vehicle sensing range is a forward-facing cone with a radius of R s , and the field of view is θ FOV . To consider situations of occlusion, we assume that there is an object P in the sensing range, and that the area outside P in the sector with line VP as the axis of symmetry is regarded as the occlusion area. Thus, only the blue area in Figure 3 can be perceived. The limitations in sensing range and occlusion constitute the blind spots of environment perception.

Measurement Model
In this work, we assume that data from the on-board sensors are in a 2D vehicular coordinate fashion where h (S) ∼ R 2×1 , correspond to the measurements from low-cost cameras. This can be easily adapted to other measurement types, such as polar coordinates in LiDAR measurements. The measurement model is expressed as, where the rotation matrix is expressed as, and the covariance of the on-board sensors' measurements is given in Equation (24).
We assume that the GNSS/INS can provide the measurements of the coordinates and angles of the vehicles. There are many studies on modeling the measurements noise of GNSS/INS [42,43]. In this paper, we simplify the error of GNSS to Gaussian distribution, and the measurement model and uncertainty of the GNSS/INS on vehicle i are indicated as Equations (25) and (26). Our framework is also suitable for extending to other error assumptions. and In commercial HD maps, the coordinates of the features are provided along with noise, so we formulate the measurement mode and covariance matrix as expressed by Equations (27) and (28).

Optimized Variable Allocation and Data Association
The optimized variables are allocated to observations within the demanding space of perception. Unlike in a traditional multi-vehicle cooperative system, barring objects and features captured by vehicles, features that are not seen by any vehicle but are within the demanding space are also included in the optimized variables, and further optimized to yield the results of perception.
Observations that are associated are merged into existing variables and form constraints in the process of optimization. There are many methods that can be applied to our framework [44,45]. In this study, we assume that the vehicle's on-board sensors and HD maps can provide enough semantic clue to identify objects. The association algorithm itself is beyond the scope of this article.

Optimization Problem Solving
In this study, the nonlinear optimization problem in Equation (17) is solved via the Levenberg-Marquardt method [46]. We reorganize the residuals of time t into one vector, as expressed in Equation (29). is the residual of measurement from HD map to feature q. The optimized variable at time t can then be rewritten as, where Ξ ∈ F, O . Let R be the overall covariance matrix such that The cost function can be rewritten as, We can get the Jacobian matrix (see Equation (33)).

J(X)
The initial values of the optimization iterations are given as follows. The vehicular position and attitude are calculated by the measurements of the GNSS/INS. The positions of features are determined by the map, and the initial positions of objects are calculated by converting the positions measured by the on-board sensors to the geodetic coordinate system according to the initial vehicle position and attitude. The cost function (17) can be minimized towards zero by iterations: where λ is determined by the Levenberg-Marquardt method, and J and f are defined in Equations (32) and (33), respectively.

Theoretical Analysis on the Framework Performance
The proposed multiple-target positioning framework aims to solve a parameter estimation problem. Its performance can be evaluated either numerically or theoretically. In this section, the lower bounds on the estimation errors are determined from theoretical studies. As one of the most widely used lower bounds, the Cramér-Rao lower bound (CRLB) is chosen as the performance benchmark. The framework is performance-bound in terms of the minimum achievable variance provided by any unbiased estimators.
Assume that a deterministic signal s t (θ) with an unknown vector parameter θ is observed in white Gaussian noise as Equation (35).
where v t ∼ N (0, C t ). We wish to estimate θ from z. The Fisher information matrix [47] is given by Equation (36).
Taking the inverse of I(θ), the CRLB for the parameters is then obtained from its diagonal elements. The CRLB for θ m is the (m, m) entry of I −1 (θ).
For the proposed framework, the following measurements are considered. For convenience, all the measurements available are reformulated to the following compact form: where z G i,t , z M j,t and z Ξ i,j,t are defined in Equations (6)-(8), respectively. The unknown parameters are obtained from Equation (38). (38) We observe that z t is Gaussian distributed with mean h t (θ) and covariance matrix C t : The CRLB for θ is obtained by substituting h t (θ) and C t into Equation (36).

Numerical Results
In this section, we discuss the simulation experiments conducted under typical vehicular network scenarios to verify the localization and perception capacity results of the proposed algorithm. We also demonstrate its environmental adaptability in subsequent discussions on factors that influence the final performance by considering different scene configurations.
As shown in Figure 4, we build an intersection with 2 two-way two-lane roads. This scenario consists of a busy urban area and a suburban area. The trajectories of all the vehicles and objects as well as the traffic scene configuration come from VISSIM, a behavior-based traffic flow simulator [48]. Each road is 330 m long. In the middle, until approximately 200 m from the intersection, we simulate a busy urban scenario with lamps, traffic lights, and traffic signs located randomly on the roadside. Pedestrians walking around the road and across the intersection are generated. Outside of and far from the intersection, nothing is placed on the roadside, which simulates the scenario of a suburban area. In the simulation, connected and disconnected vehicles start from one end of the road simultaneously, then travel straight or turn left or right at the intersection, and exit the scene almost simultaneously. Therefore, vehicles are in the urban area in the middle section of the simulation steps, and the starting and ending segments correspond to suburban scenes.

Performance in a Typical Scenario
First, we validate the effectiveness of our algorithm in a fixed scenario and compare it with the method proposed by Gloria et al. [19], as well as the theoretical bound CRLB. We set the accuracy of each measurement to that of low-cost devices. The configuration for the scenario and each measurement are as follows. We run the simulation 200 times, and noise is added to the measurements independently for each iteration. One localization result of the vehicles, objects, and features, and their true positions in the urban area is shown in Figure 5. As the 6 vehicles face similar scenes in every simulation step, we statistically analyze the positioning errors of all the vehicles. The root-mean-square error (RMSE) of self-positioning for all 6 vehicles at simulation time t is calculated using Equation (40).
i,j,t is the self-positioning result of vehicle i at the jth run at simulation step t, and p i,j,t is its corresponding ground truth. The self-positioning mean-square error MSE bound is calculated using Equation (41) where CRLB(x  As shown in Figure 6, the proposed method is compared with that of Gloria et al. [19], as well as the RMSE bound √ CRLB t (see Equation (41)). It is obvious that compared with the original GNSS measurements, we obtain a significantly improved positioning result by using the information from V2X and the HD map. In particular, in the urban area (simulation steps 21-55), our algorithm achieves high positioning accuracy (0.16 m), which is also lower than that of the method in [19] (1.79 m). Our positioning accuracy is close to the theoretical lower bound given by CRLB, which shows that we have effectively used all valuable information. We also give the number of constraints at each simulation time in Figure 7. Overall, the greater the number of constraints available, the better our positioning results are. In fact, the study in [19] only uses GNSS and V-O constraints, while we have used additional constraints including V-F, V-V, and prior constraints of the HD map. In the suburban area where the sensing ranges of different vehicles have little overlap, the method proposed in [19] acts ineffectively as few constraints are available. However, our positioning is still a significant improvement in these challenging areas. Environmental adaptability will be further discussed in the next section.
In terms of the positioning of other targets, we compared both the target location precision and sensing integrity. We transform the positioning results of these targets into the body coordinate system (i.e., the vehicle coordinate system shown in Figure 3 for analysis, as this analysis is consistent with the vehicle sensing system. The target positioning accuracy of a vehicle is evaluated by the RMSE of all targets within the demanding space (see Equation (42)).
k,j,t is the localization result of the kth target in the vehicle coordinate system at the jth runtime, and p (T) k,j,t is its truth position. The RMSE of relative localization i.e., RMSE (T) t is defined as the root mean square of the location of all vehicles. In fact, such a result is affected by both the absolute positioning and vehicular self-positioning, which makes our analysis more rigorous.The result is shown in Figure 8. In the urban area, the RMSE is 0.17 m; it is much smaller than that (0.24 m) obtained by Gloria's method as well as that (0.32 m) provided by the vehicles' on-board sensors. Higher perception accuracy is also achieved in suburban areas.   In summary, our approach significantly improves multiple-target positioning in terms of accuracy and integrity over that achieved using the original measurements, and is also more effective than other methods.

Adaptability to Different Scenarios
In the following section, we analyze the impact of different elements on the results of multiple-target positioning to demonstrate the adaptability of our method to different environments. Simultaneously, we discuss the contributions of different constraint types to the results.

Number of Connected Vehicles
The vehicular positioning and relative localization of the targets in terms of the number of connected vehicles are shown in Figures 10 and 11, respectively. Except for the number of connected vehicles, the configurations of the scene are identical, with 6 features and 3 objects. The accuracy of sensing and GNSS/INS are the same as those in previous experiments.  As can be seen from the figures, although there is a limited number of connected vehicles (only 2 vehicles), the positioning capacities of the vehicles and other targets are significantly improved in the urban areas. In general, the greater the number of connected vehicles, the smaller the corresponding positioning errors, as more V-V constraints are imposed between the vehicles. It is worth noting that there is a trend that the RMSEs of self-positioning increase as the simulation step increases from 70 to 80, as the vehicles are heading the end of the roads, where features and objects are becoming increasingly sparse. In theory, the lower bound of positioning error of 12 vehicles is lower than 8 vehicles. Due to the limited simulation times, the RMSE fluctuates near the theoretical bounds. Therefore, the RMSE of 12 vehicles seems close to those of 8 vehicles. However, based on the existing results, we cannot say that there is a trend that they will exceed those of 8 vehicles.
Another interesting observation is that for a given number of connected vehicles, the positioning errors in the suburban areas are larger than those in the urban areas, but it is worth noting that the decreasing trend of RMSE with increasing the number of connected vehicles is more significant. This is because, unlike the urban area with sufficient types of constraints, the constraints in suburbs are mainly of V-V. Therefore, we argue that the number of connected vehicles is important for improving location accuracy in the suburban area, which is consistent with the results shown in Figure 10.
As for the perception accuracy, in suburban area, with the increase of the number of vehicles, there are more constraints which benefits the positioning of connected vehicles and other objects. The underlying mechanism can be explained by Equation (36). As the constraints increase, the dimension of h t (θ) increases, which leads to an increase in the value of the information matrix I(θ). The CRLB for θ m , which can be calculated with the diagonal element of I(θ)) −1 will decrease, which leads to the reduction of the absolute positioning error for each object. As the perception result is gained by projecting the absolute of other targets to the vehicle-body coordinate system based on the self-positioning, the reduction of the absolute positioning error finally improves the perception accuracy.

Number of Features
The effect of the number of features are shown in Figures 12 and 13. There are four connected vehicles running on the road with four objects and different numbers of features. It is obvious that increasing the number of features improves the accuracy of vehicle positioning and relative localization of the targets. Compared to raw measurements, even with a few features, the positioning and perception errors are reduced by exploiting the vehicle-to-target constraints and HD map information. It is noteworthy that at the intersection, the positioning accuracy is high when the number of features is 5, 10, or 30. However, if there is no feature i.e., no map information is used, the positioning error is obviously higher. This reflects the contribution of the HD map to vehicle positioning.

Number of Objects
In comparing the results obtained when the number of objects varies, we set N v = 8 and N f = 0. The corresponding results are given in Figures 14 and 15. In the suburban zones, the increase in the number of objects improves the positioning. However, it is noteworthy that such an improvement in the intersection zone is not obvious. The reason is that in the former zone, there are very few V-V constraints, and V-O constraints play the main role in improving the results. Hence, adding objects can effectively improve the positioning. However, at the intersections, the V-V constraints formed by 8 vehicles are dominant and the results approach the theoretical bounds achievable. There is no significant improvement in accuracy with the addition of objects.  In the case of relative localization, increasing the number of objects can reduce the overall perception accuracy, instead. As the number of objects increases, the proportion of objects among all the perceived targets participating in the perceptual precision calculation increases. The objects are less constrained relative to other targets, and the overall clarity of perception declines. Considering this and the former discussion, we argue that V-O constraints are less effective than V-V and V-F constraints in improving multiple-target positioning accuracy.

Conclusions
This study focuses on the problem of multiple-target positioning for ICVs. We propose a unified theoretical framework for positioning both vehicles and other targets, wherein sensor data from V2X and HD map data are effectively fused with GNSS/INS and on-board sensors. By jointly exploiting the vehicle-to-target constraints and HD map information, the vehicular localization accuracy can be enhanced to meet the requirements of high-level automated driving by using low-cost GNSS/INS and on-board sensors in urban areas. Meanwhile, the confidence and integrity of the results of relative localization of targets are significantly improved, realizing sensing beyond line of sight and field of view, which can improve the transportation efficiency and safety. Furthermore, the proposed framework is applicable to more challenging scenarios entailing fewer connected vehicles and sparse features and objects. In future research, we plan to remove the limiting assumption of data association employed in this study by applying association methods in the process of optimization. We will also study the formulation of communication delay of V2X in the data fusion framework.

Conflicts of Interest:
The authors declare no conflict of interest.