Sensor Fusion with Asynchronous Decentralized Processing for 3D Target Tracking with a Wireless Camera Network

We present a method to acquire 3D position measurements for decentralized target tracking with an asynchronous camera network. Cameras with known poses have fields of view with overlapping projections on the ground plane and in the 3D volume above it. The goal is to track targets in 3D space without constraining their motion to a reference ground plane. Cameras exchange line-of-sight vectors and the respective time tags asynchronously. From stereoscopy, we obtain the fused 3D measurement at the local frame capture instant. For target state estimation, we use decentralized Kalman information filtering and particle filtering with only local estimation. Monte Carlo simulations include communication losses due to frame processing delays. Performance is measured by the average root mean square error of the 3D position estimates projected onto the image planes of the cameras. We then compare local-only estimation against exchanging additional asynchronous communications with the Batch Asynchronous Filter (BAF) and the Sequential Asynchronous Particle Filter (SA-PF), which further fuse information-pair estimates and fused 3D position measurements, respectively. Performance is similar despite the additional communication load relative to our local estimation approach, which exchanges just line-of-sight vectors.


Introduction
Tracking multiple targets in a distributed-processing wireless-communication network (WCN) of cameras encompasses several areas such as surveillance, computer vision, threat detection [1], inertial navigation aided by additional sensors, and autonomous vehicles. In the tracking process investigated here, we assume fixed cameras whose poses are known and that have computational processing resources (network nodes). The cameras estimate the location of a moving object of interest, here called a target, over time [2] in three-dimensional (3D) space. Network topology affects the distributed processing, which aims at greater flexibility and robustness [3], so target tracking is expected to become less prone to failure. Each network node implements detection algorithms and state estimators, where the state estimate refers to the position and velocity of the tracked target. Detected targets are represented by the location of their centroids on the image plane, and the corresponding measurements are the line-of-sight (LoS) vectors. The target 3D position arises from intersecting LoS vectors measured by neighboring nodes covering the 3D space the target traverses. Hence, the target 3D motion is not restricted to a reference ground plane, and targets can move freely in the 3D space covered by the field of view of each camera in the network. The LoS measurements are usually noisy or erroneous owing to the limitations of computer vision algorithms [4]. With a network of cameras, it is possible to combine LoS measurements from multiple viewpoints, mitigating such errors and recovering the target position in 3D space.

Related Work
Target tracking in a distributed asynchronous camera network can use either a sequential method [13] or a batch method [6]. In sequential methods, a network node computes the target state not only when local measurements are obtained but also whenever it receives information from neighboring camera nodes. The nodes do not propagate the estimates received from neighboring nodes into alignment with the local frame capture instant prior to the update; the reception instants are taken as the sending cameras' frame capture instants, and the probability density for target state estimation, namely the particle weights, is updated sequentially [15,17]. The SA-PF filter disregards processing delays.
In batch processing methods, the update with information received asynchronously from other nodes occurs after the received data are propagated to the local frame capture instant for time alignment [6,15]. The BAF [15] considers the asynchronous frame capture among cameras and the processing delay in the estimation and fusion phases at each camera. In the estimation phase, a camera obtains the 3D position measurement assuming the target moves on a ground reference plane and estimates the information pair at the frame capture instant. Cameras exchange local target state information, and all information received during a time window is stored in a buffer. At the end of the time window, a camera enters the fusion phase and propagates the received target information pairs of the other cameras either backward or forward into temporal alignment with the local frame capture instant [21].
Previous work mostly related to target tracking with a camera network used information exchange of two-dimensional target states restricted to move on a reference ground plane [2,3,7,8,10,13-15,17,21]. However, much less attention has been given to the often-occurring scenario in which targets move in 3D space while taking into account network asynchrony and delays in image processing. For example, in [22], multi-view histograms were employed to characterize targets in 3D space using color and texture as the visual features; the implementation was based on particle filtering. In [23], a deep network model was trained with an RGB-D dataset, and then the 3D structure of the target was reconstructed from the corresponding RGB image to predict depth information. In [24], images with a detected hand and head were used, with each camera transmitting to a central processor over a low-power communication network. The central processor performed 3D pose reconstruction, combining the data in a probabilistic manner. In [16], state estimation with complete maximum likelihood information transfer produced 3D Cartesian estimates of target position and velocity from azimuth and elevation provided by multiple cameras. Ref. [25] addressed the competing objectives of monitoring 3D indoor structured areas versus tracking multiple 3D points with a heterogeneous visual sensor network composed of both fixed and Pan-Tilt-Zoom (PTZ) controllable cameras. The measurement was the projection of a distinct 3D point representing the target onto an image plane, wherein all cameras followed the pinhole model with unit focal length. The authors applied decentralized EKF information filtering, handling neither camera network asynchrony nor the issue that in real image sequences a target occupies a volume; cameras aiming at a target from different directions thus see quite distinct portions of the target surface and different 3D points.
None of the above methods considered frame processing delays and network asynchrony among camera nodes.
Reference [26] applied DAT2TF to an asynchronous network of 2D radars that provided local Kalman filter state estimates to a central fusion node performing global estimation of the 3D target state. While focusing on the fusion of state estimates, the authors highlight a linearized approach to determine the Cramér-Rao estimation error uncertainty bound when converting radar measurements from polar to Cartesian coordinates, and they claim the result is equivalent to decentralized filtering.

Our Proposal
To track multiple actual targets undergoing unrestricted motion in a real image sequence with a network of calibrated cameras whose poses are known, we propose a decentralized approach focused on the fusion of 3D position measurements combined with only local estimation, without involvement of a central node. For that, we invert the projection function from 3D space to the image plane using stereoscopy together with the known camera poses and calibration data:
Step 1: Within a time window about the local frame capture instant, transmit to and receive from neighboring cameras, asynchronously, the LoS vectors to each target centroid in the image plane, represented in the world coordinate system {W}, together with the corresponding time stamps [15];
Step 2: Use the unscented transform (UT) and stereovision principles to acquire 3D position measurements and covariances from the received LoS vectors when they intersect to within a 3D spatial distance threshold;
Step 3: Propagate the 3D position measurements and respective covariances forward or backward in time via interpolation until temporal alignment with the local frame capture instant;
Step 4: Fuse locally, by averaging, the 3D position measurements and covariances at the frame capture instant, use them to update the local estimate, and propagate to the next frame capture instant.
We compare our approach to SA-PF and BAF, wherein additional communication is exchanged beyond the LoS vectors used to measure 3D positions. After the time window of Step 1, each camera opens a fusion window to receive and process these additional communications. In SA-PF, a camera transmits and receives fused 3D position measurements and respective covariances at distinct instants within the fusion window. BAF, in contrast, receives asynchronous information-pair estimates at distinct instants within the fusion window.
Both BAF and SA-PF are compared to our proposal using the average root mean square error in the image planes of the cameras tracking the various targets in the scene. The comparison also investigates whether performance changes appreciably when a camera fails to receive communications due to ongoing frame processing.

Contributions
Our innovative approach directly measures the 3D centroid position of an actual target with stereoscopy as different, calibrated cameras with known poses view distinct portions of the surface of the same target. Moreover, we explicitly dispense with the fusion node to add robustness to target tracking via decentralized processing in a camera network with partial connectivity. Our focus is on the fusion of 3D position measurements and respective covariances computed with LoS vectors received asynchronously from neighboring cameras. We also consider the image processing delay of each node when obtaining the local measurements.
We derive LoS vectors from the optical center of each camera to the various target centroids in the respective image plane, with the corresponding error covariances. Then, we use stereoscopy combined with an unscented transform (UT) that employs the minimum-distance search between pairs of LoS vectors to compute the corresponding 3D positions and error covariances in the world reference frame. One LoS vector comes from the local camera, and the other is received asynchronously from any of the neighboring cameras tracking the same target. We handle asynchronous LoS vectors and the corresponding error covariances, all represented in the world coordinate frame, with time propagation from the reception instant into alignment with the camera's local capture instant. A set of 3D position measurements and covariances arises, represented in the world coordinate frame at the capture instant of the camera. Then, we fuse the measurements and covariances by averaging and proceed to local estimation.
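As an illustration of the UT step, a generic sigma-point propagation of a Gaussian through a nonlinear map (such as the LoS-pair triangulation) can be sketched as follows; the kappa scaling and weights are standard UT choices, not parameters taken from this work.

```python
import numpy as np

# Minimal unscented-transform sketch (assumption: standard symmetric
# sigma-point set with kappa scaling; not the paper's exact tuning).
def unscented_transform(f, mean, cov, kappa=0.0):
    """Propagate N(mean, cov) through the nonlinearity f via sigma points."""
    n = mean.size
    S = np.linalg.cholesky((n + kappa) * cov)          # matrix square root
    sigma = [mean] + [mean + S[:, i] for i in range(n)] \
                   + [mean - S[:, i] for i in range(n)]
    w = np.full(2 * n + 1, 1.0 / (2 * (n + kappa)))
    w[0] = kappa / (n + kappa)
    ys = np.array([f(s) for s in sigma])               # transformed points
    y_mean = w @ ys
    diff = ys - y_mean
    y_cov = (w[:, None] * diff).T @ diff
    return y_mean, y_cov
```

For a linear map the UT reproduces the exact mean and covariance, which is a convenient sanity check before applying it to the triangulation nonlinearity.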
We emphasize that our contribution is not in filtering. Nevertheless, we modified both BAF and SA-PF to add a fusion stage, testing our fused 3D position measurements with local estimation against a further fusion step among the neighboring cameras and thus investigating the tradeoff between performance and required communications. We find that local estimation provides similar performance with a smaller communication workload.
Moreover, in the real scenario evaluated here, namely the PETS2009 sequence [27], subjects enter and leave the fields of view of the cameras in the network, resulting in targets that merge and separate as they suddenly change motion direction in 3D space. Our approach handles such occurrences gracefully. To the best of our knowledge, no previous investigation has focused on decentralized target tracking using LoS vectors exchanged asynchronously among neighboring camera nodes with frame processing delay to obtain position measurements in 3D space from stereoscopy principles. Table 1 presents an analytical comparison with methods found in the literature.

Paper Structure
The paper is organized as follows. Section 2 formulates how to measure 3D position considering the processing delay, network asynchrony, and tracking without constraining target motion to a reference ground plane. Section 3 describes the algorithms to obtain 3D measurements and the modifications to the Bayesian BAF and SA-PF filter algorithms to include an additional fusion window for comparison with just local estimation. Section 4 presents simulation results and discusses performance, and Section 5 concludes the paper.

Problem Formulation
Some background is required initially:
- Instants or time steps k within a time window: integer multiples of the camera inter-frame interval. They represent the capture instants of target centroids and the instants at which LoS vectors are received from neighboring cameras during the time window about the frame capture instant k_c of a local camera. They are stored in the i-th camera node;
- Frame capture instant k_c: the moment when a frame is captured; it defines a time window;
- Communication is assumed ideal among cameras, i.e., there are no communication delays.
Unlike the fully connected network in [15,17], we investigate a partially connected network. Consider a WCN of cameras C_i, i = {1, 2, ..., N_s}, where N_s denotes the maximum number of sensors. This network is based on decentralized processing, meaning that each camera node C_i fuses its information with the information received from the set N_{j,i} of camera nodes C_j, j ≠ i, considered neighbors; i.e., N_{j,i} is the set of cameras with which the i-th camera communicates to track targets in 3D space. The network topology is fixed and independent of the tracked target. The term decentralized indicates that each camera in the network has a communication module, with detection and tracking algorithms independent from the rest of the cameras. There is no fusion center involved. Given that we aim at 3D tracking, the state of the t-th target at the i-th camera comprises its 3D position and velocity in the world frame. The first phase of the problem is obtaining the t-th target position measurements z^i_{t,k_c,3D} in 3D space at the i-th camera node from intersecting LoS vectors measured asynchronously in neighboring camera nodes. Subsequently, at each camera node, position and velocity estimates are initialized for each target. The local estimation algorithms are the Bayesian filters BAF and particle filtering (PF), though without engaging in the fusion step [15,17]. They compute the posterior pdf of the target state p(x^{i,i}_{t,k_c,3D} | z^i_{t,1:k_c,3D}) based on the predicted pdf p(x^i_{t,k_c,3D} | z^i_{t,1:k_c-1,3D}), corresponding to the previous capture instant k_c - 1 and propagated to capture instant k_c, the state transition pdf p(x^{i,i}_{t,k_c,3D} | x^i_{t,k_c-1,3D}), and the measurement likelihood pdf p(z^i_{t,k_c,3D} | x^{i,i}_{t,k_c,3D}). The second superscript in (i,i) indicates that the local state estimate processed at the i-th camera does not use information received from a neighboring camera.
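For concreteness, a minimal sketch of a constant-velocity model for the 6-dimensional position-velocity state, together with an information-form update using a fused 3D position measurement, might look as follows; the inter-frame interval T and all numeric values are illustrative assumptions, and the actual BAF/PF recursions follow [15,17].

```python
import numpy as np

# Constant-velocity model for the 6-state target x = [position, velocity]
# in 3D; T is the inter-frame interval (illustrative value).
def cv_matrices(T):
    F = np.eye(6)
    F[:3, 3:] = T * np.eye(3)                     # p_{k+1} = p_k + T * v_k
    H = np.hstack([np.eye(3), np.zeros((3, 3))])  # only 3D position is measured
    return F, H

# One information-filter update: y = Y x is the information vector and
# Y = P^{-1} the information matrix; z is the fused 3D position, R its covariance.
def info_update(y, Y, z, R, H):
    Ri = np.linalg.inv(R)
    return y + H.T @ Ri @ z, Y + H.T @ Ri @ H
```

The information form is convenient here because contributions from several measurements add linearly before the state is recovered as x = Y^{-1} y.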
We consider the frame processing delay incurred to obtain z^i_{t,k_c,3D} and the asynchrony of neighboring camera capture instants. To this end, we borrow from [15] the concept of a time window K^i_k about the local camera capture instant k_c, wherein LoS vectors are received from neighboring cameras. We define τ^i_{k_c} as the frame processing delay to obtain the measurement z^i_{t,k_c,3D} at the i-th camera, and α_{ij} as the relative offset between the capture instants of neighboring cameras C_i and C_j.
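In code, deciding whether a neighbor's stamped LoS vector belongs to the window K^i_k reduces to an interval test; the window bounds here are hypothetical parameters, not values from the paper.

```python
# Hypothetical membership test for the time window about the capture instant
# k_c (all quantities in units of the inter-frame interval). The stamp of a
# neighboring camera already reflects its offset alpha_ij and delay tau_j.
def in_window(stamp_j, k_c, w_before, w_after):
    return (k_c - w_before) <= stamp_j <= (k_c + w_after)
```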
The problems arising from false positives and false negatives in target detection were considered previously [28] and are addressed here as well. Scene complexity (number and speed of targets) [29], construction parameters of the cameras, and the accuracy of their calibration [30-32] are factors that also hinder target detection. Thus, if the detection algorithm fails to confirm the presence of the same target within a minimum capture count up to a maximum count, the target is discarded. When a novel target is detected, a new identifier number (Id) is assigned, and a new initial target centroid measurement in the image plane, expressed in pixels, is generated. We assume the detection algorithm does not incur a frame processing delay [33] to obtain a target centroid measurement in the image plane; the delay comes solely from computing the position measurement z^i_{t,k_c,3D} in 3D space with the LoS vectors received from neighboring camera nodes.
As stated earlier, the cameras fuse information from targets detected in their FoVs with information received about the targets of neighboring cameras belonging to N_{j,i}. Recall that the LoS vector from a single pinhole camera to a target does not convey depth information for tracking in 3D space [32]. Then, subject to network asynchrony, the j-th cameras in the i-th camera neighborhood set N_{j,i} transmit LoS vectors to distinct targets, each target being detected by at least two cameras with overlapping FoVs [30]. We use stereoscopy to measure depth along a LoS vector and determine the 3D position as targets move freely. Stereoscopy requires views of the same target from at least two different directions. The LoS vector from the camera's optical center to the target centroid in the image plane is represented in the global Cartesian coordinate system by means of camera calibration and pose data [31]. In a real scenario, the same target may have very distinct portions of its surface viewed from very different directions by cameras located far apart. In such a case, stereoscopy can produce a biased 3D position measurement located somewhere inside the target surface. A distance tolerance d_min ≤ δ is established to detect LoS proximity and to ascertain that the measured 3D position is consistent with the pair of LoS vectors involved.
We evaluate three approaches to information exchange:

• First: Exchange among cameras C_i and C_j, where j ∈ N_{j,i}, the LoS vector →LoS^w_{t,k_c+τ^i_{k_c},i} to the t-th target from the i-th camera at instant k_c + τ^i_{k_c} within the time window K^i_k, with the data represented in the world coordinate system {W}. The measurement z^i_{t,k,2D} (the target centroid in the image plane in pixel coordinates) is transformed into →LoS^w_{t,k,i} with the camera construction parameters as described in Section 2.1. The fused 3D position measurement z^i_{t,k_c,3D} results from averaging intersecting pairs of vector constraints →r^w_{t,k,i}, →r^w_{t,k,j} derived from LoS vectors of neighboring j-th cameras pointing at the same t-th target according to a LoS-vector 3D proximity determination algorithm. Local target state estimation uses the fused 3D position measurement z^i_{t,k_c,3D} to update and predict the state estimate at the next frame capture instant k_c + 1.
• Second: For comparison, exchange among cameras C_i and C_j, where j ∈ N_{j,i}, the fused 3D measurement z^i_{t,k_c,3D} for each target, transmitted and instantaneously received at instants within the fusion window of configurable size p after the K^i_k time window. This is used in the sequential approach in SA-PF.
• Third: Again for comparison, exchange among cameras C_i and C_j, where j ∈ N_{j,i}, j ≠ i, the information-pair estimates for each target, transmitted and instantaneously received at instants within the fusion window of configurable size p after the K^i_k time window. This is used in the batch approach in BAF.
The first approach extends the concept of stereoscopy to neighboring calibrated cameras with known poses and requires a 3D proximity metric among locally detected LoS vectors in the i-th camera and LoS information received from cameras belonging to N j,i .
Note that internal temporary storage of the LoS vectors →LoS^w_{t,k,i} at each instant within the time window K^i_k is required because the LoS vectors →LoS^w_{t,k,i} at C_i must be aligned temporally, through time stamps, with the vectors →LoS^w_{t,k,j} received from C_j. In [34], the concept of "virtual synchronization" was used along with a "dynamic" memory to store the captured data. Based on this concept, sending and receiving information are concurrent activities at instants k within the time window.
The information vector I^w_{t,k_c+τ^i_{k_c},i} exchanged between cameras conveys the LoS vector to the t-th target at instant k_c + τ^i_{k_c}, the time stamp referring to that instant, and the optical center O^w_{c,i} of camera C_i expressed in world coordinates {W}. To obtain measurements in 3D space, the following steps are required, from capturing the target centroid in the image plane z^i_{t,k,2D} in pixels to measuring the centroid z^i_{t,k_c,3D} in 3D space in meters. For each target detected by the i-th camera, the target centroid is represented in the image by a red dot, as shown in Figure 1. All targets are assigned an identification number (Id).

Concerning the detection algorithm, accurate data association for maintaining the assignment of an identifier to a target is a challenging problem. The "global nearest neighbor" algorithm [35], which in turn implements the Munkres algorithm [36], was used at each camera for this purpose. However, as shown in Figure 1, an Id assigned to a target by one of the cameras will not necessarily match the Id assigned to the same target by neighboring cameras. For example, in Figure 1, for the same target observed by Cameras 1 and 2, Camera 1 assigned Id = 1 while Camera 2 assigned Id = 3. This issue is addressed in Section 2.3.
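The Id-to-detection matching can be illustrated with a brute-force optimal assignment, an exact but exhaustive stand-in for the Munkres algorithm [36] that is only practical for the few targets per camera considered here.

```python
from itertools import permutations

# Exhaustive optimal assignment between existing track Ids and new detections,
# minimizing the total association cost; an illustrative stand-in for Munkres.
def assign(cost):
    """cost[i][j]: association cost between track i and detection j (n x n)."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    return list(best)      # best[i] = index of the detection matched to track i
```

When targets enter or leave the FoV, the cost matrix is typically padded to square form with a gating cost before assignment.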
Centroid pixels in Figure 1 should be represented in the corresponding camera {C} coordinate system, and the cameras' calibration will correct the distortions caused by camera construction parameters.

Camera Calibration
All cameras follow the pinhole CCD model [32]. Using intrinsic or construction parameters of a camera, it is possible to correct its distortions and calibrate it so that the distorted pixel of the detected centroid of the t-th target, represented by point P d in Figure 2, corresponds to the corrected position in the image plane represented by point P u . The transformation (x d , y d ) → (x u , y u ) follows [37].
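As a sketch of the correction (x_d, y_d) → (x_u, y_u), assume a one-parameter radial model x_d = x_u(1 + k1·r_u²) in normalized coordinates; the actual model follows [37] and may include more terms. The inverse has no closed form, so a fixed-point iteration is a common choice:

```python
# Correct a distorted, normalized centroid (x_d, y_d) under an assumed
# one-parameter radial model x_d = x_u * (1 + k1 * r_u^2) by fixed-point
# iteration; the distortion model of Ref. [37] may differ from this sketch.
def undistort(x_d, y_d, k1, iters=20):
    x_u, y_u = x_d, y_d                    # start from the distorted point
    for _ in range(iters):
        r2 = x_u * x_u + y_u * y_u         # current estimate of r_u^2
        x_u = x_d / (1.0 + k1 * r2)
        y_u = y_d / (1.0 + k1 * r2)
    return x_u, y_u
```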
After correcting point P_d to point P_u in the camera coordinate system {C}, we represent two main points in the world coordinate system {W} for each camera: the optical center O_{c,i} and the point P_u, which give rise to the LoS vector from the i-th camera to the t-th target.
Figure 2. Pinhole camera model C_i: {C} represents the camera coordinate system; {W} represents the world coordinate system; O_{c,i} represents the optical center of the i-th camera expressed in the camera coordinate system; O_w represents the origin of the world coordinate system; f is the focal length; P_d is the distorted point given in pixels; P_u represents the corrected point given in pixels; P_w is a point in the 3D space represented in world coordinates {W}; Z_c = f is the camera image plane.
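A sketch of the chain from a corrected point P_u = (x_u, y_u) plus focal length f in {C} to a LoS direction parallel to {W}, using the zyx rotation order of Equation (4) and the transpose for the {C} → {W} direction; the angle and pixel values used below are hypothetical.

```python
import numpy as np

def rot_zyx(tz, ty, tx):
    """Rotation composed in zyx order, R = Rz @ Ry @ Rx, as in Equation (4)."""
    cz, sz = np.cos(tz), np.sin(tz)
    cy, sy = np.cos(ty), np.sin(ty)
    cx, sx = np.cos(tx), np.sin(tx)
    return np.array([
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy,     cy * sx,                cy * cx],
    ])

def los_world(x_u, y_u, f, R):
    """Re-express the LoS vector [x_u, y_u, f] from {C} parallel to {W};
    the transpose realizes the {C} -> {W} rotation noted in the text."""
    return R.T @ np.array([x_u, y_u, f])
```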

The Optical Center O c,i and Point P u Represented in the World Coordinate System {W}: A Preamble for Obtaining the LoS Vector
Given that the extrinsic parameters of camera C_i comprise its rotation R_i and translation →t_i, the optical center O_{c,i}, represented in the camera coordinate system {C}, is expressed in the world coordinate system {W} as in Equation (1). Point P_u is located in the image plane, as shown in Figure 2. We must first represent the pixels of P_u in coordinates of the camera system {C}, as in Equation (2). Denoting the LoS vector X^i_{t,c} = [x_c = x_u, y_c = y_u, f] as the coordinates obtained from point P_u for the t-th target, now represented in coordinates of the camera system {C}, we can represent it in a vector base with origin at the camera optical center O_{c,i} and parallel to the world coordinate system {W} as in Equation (3). The rotation matrix ^w_c R_i is expressed as follows:

^w_c R_i(θ_z, θ_y, θ_x) =
[ cosθ_z cosθ_y    cosθ_z sinθ_y sinθ_x - sinθ_z cosθ_x    cosθ_z sinθ_y cosθ_x + sinθ_z sinθ_x ]
[ sinθ_z cosθ_y    sinθ_z sinθ_y sinθ_x + cosθ_z cosθ_x    sinθ_z sinθ_y cosθ_x - cosθ_z sinθ_x ]    (4)
[ -sinθ_y          cosθ_y sinθ_x                            cosθ_y cosθ_x                         ]

This is the matrix that performs the rotation from system {W} to system {C} for each camera; since its rotation order in camera axes is zyx and the angles θ_z, θ_y, θ_x rotate from {C} to {W} [31], it is necessary to apply the transpose operation (′) in Equations (1) and (3).

LoS Vector
It should be noted that ^w→t_i is a constant vector because it refers to the fixed position of the optical center of the i-th camera. The LoS vector, in turn, depends on the position of the target, referring to point X^i_{t,c} in the image plane expressed in the camera coordinate system {C}. Writing the constraint vector →r^w_{t,k,i} in the form of a parametric equation [38] yields the following expression:

→r^w_{t,k,i}(λ) = O^w_{c,i} + λ →LoS^w_{t,k,i},  λ ∈ R.    (5)

Thus, the vector constraint →r^w_{t,k,i}(λ) changes when extending the LoS vector →LoS^w_{t,k,i} with parameter λ.

Proximity Metric between Vector Constraints Originating from LoS to a Target
Assume camera C_i receives the information vector I^w_{t,k_c+τ^j_{k_c},j} from one of the transmitting cameras C_j. Suppose also that the observed scene contains only one target and that two synchronized cameras C_i and C_j, with α_{ij} = 0, observe it with overlapping FoVs at the same capture instant; moreover, there is no image processing delay, i.e., τ^i_{k_c} = τ^j_{k_c} = 0. Under these circumstances, we adopt for simplicity the subscript k to mean the capture instant k_c. The vector constraints →r^w_{t,k,i} and →r^w_{t,k,j} are computed from the information vectors of cameras C_i and C_j, respectively, and can be written as in Equation (6). Then, target triangulation must find the parameters (λ*, β*) that minimize the distance between the vector constraints →r^w_{t,k,i}(λ) and →r^w_{t,k,j}(β). The concept of distance between skew straight lines is applied [38,39], and the solution is obtained by setting to zero the partial derivatives of the distance with respect to λ and β.
Solving (7) produces the optimized nonlinear parameters λ* and β*. The operator (·) represents the inner product between vectors, and ‖·‖ represents the vector 2-norm. Appendix A shows how to find the optimized parameters λ* and β*.
Once obtained, these parameters are substituted in (6), and the minimum distance occurs where the 3D position is given by →r^w_{t,k,i}(λ*). A tolerance d_min ≤ δ = 0.3 m was chosen to indicate whether the pair of LoS vectors points at the same target. Figure 5 shows LoS vectors →LoS^w_{t,k,i} and →LoS^w_{t,k,j} pointing at the same target. The above description concerns information vectors exchanged synchronously. Next, we analyze asynchronous information exchange considering the processing time required for obtaining position measurements in 3D space.
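The closest-approach computation between the two constraint lines can be sketched directly from the normal equations of the distance minimization; variable names are ours, and the paper's closed forms for λ* and β* are given in its Appendix A.

```python
import numpy as np

# Closest approach between the constraint lines r_i(l) = O_i + l*d_i and
# r_j(b) = O_j + b*d_j (the lambda*, beta* of the text). The midpoint of the
# shortest segment is taken as the 3D position when the gap is within delta.
def triangulate(O_i, d_i, O_j, d_j, delta=0.3):
    w0 = O_i - O_j
    a, b, c = d_i @ d_i, d_i @ d_j, d_j @ d_j
    d, e = d_i @ w0, d_j @ w0
    denom = a * c - b * b                 # zero only for parallel lines
    lam = (b * e - c * d) / denom
    beta = (a * e - b * d) / denom
    p_i, p_j = O_i + lam * d_i, O_j + beta * d_j
    gap = np.linalg.norm(p_i - p_j)
    return (0.5 * (p_i + p_j), gap) if gap <= delta else (None, gap)
```

Returning the gap alongside the point lets the caller apply the d_min ≤ δ proximity test before accepting the measurement.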
Figure 6 shows the top and front views of the t-th target centroid trajectory during the time window K^i_k in one of the cameras, assuming k = k_c to be the frame capture instant. Both the optical center O_{c,i} and the LoS vector →LoS^w_{t,k,i} that define the vector constraint →r^w_{t,k,i}(λ) are shown in Figure 6. The frame processing delay of the detection algorithm is assumed negligible relative to acquiring the 3D position measurements of target centroids at instants within the time window K^i_k. An example of a target centroid trajectory in the image plane expressed in world coordinates {W} is depicted in Figure 7.

Asynchronous Exchange of Information Vectors
Following an idea reported in [34], the information vectors I^w_{t,k,i} are stored for each k-th instant within the time window K^i_k, as can be observed in Buffer 1 of the i-th camera in Figure 8. They are subsequently replaced by the information vectors of the next time window K^i_{k+1}. Note that to each detected t-th target a buffer is allocated to store its information vectors.
Information vectors are sent and received by the i-th camera within the time window K^i_k about the capture instant k_c, an interval that starts before and ends after k_c by configurable positive offsets.
The information vectors I^w_{t,k,i} of the t-th target in the i-th camera (Buffer 1) are aligned according to their time stamps with the time stamps in the information vectors of various targets received from the j-th neighboring cameras (Buffer 2). After the temporal alignment, C_i will contain local information vectors (Buffer 1) and received information vectors from the j-th neighboring cameras (Buffer 2), as depicted in Figure 9.
After temporal synchronization of the information vectors, the proximity metric algorithm is initialized to determine the distance between constraint pairs → r w t,k,i and → r w t,k,j . The 3D position measurement z i t,k,3D results from local constraints → r w t,k,i and constraints → r w t,k,j received from neighboring cameras within each time window K i k that are found to point to the same t-th target at the same instant k, when the distance between the constraint pair is less than tolerance δ. A polynomial interpolates the 3D position measurements of the same target z i t,k,3D , propagating these 3D measurements backward or forward in time within the time window K i k until alignment with the i-th camera capture instant k c , and fuses these 3D measurements by averaging them, thus producing the fused 3D position measurement z i t,k c ,3D . An internal variable N c defined for each t-th target counts the number of cameras tracking the same target for the purpose of averaging.
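The temporal alignment of the two buffers can be sketched as a time-stamp match; a minimal illustration with hypothetical container layouts (the actual buffer structure in Figures 8 and 9 is not fully specified here):

```python
def align_buffers(buffer1, buffer2):
    """Temporal alignment sketch: buffer1 holds local information vectors
    {k: los_vector} for one target; buffer2 holds received information
    vectors {(j, k): los_vector} from neighboring cameras j.  Returns the
    list of (k, local_los, j, received_los) tuples sharing a time stamp,
    i.e., the candidate constraint pairs for the proximity metric."""
    pairs = []
    for (j, k), los_j in sorted(buffer2.items()):
        if k in buffer1:
            pairs.append((k, buffer1[k], j, los_j))
    return pairs
```

Only pairs that survive this alignment are handed to the proximity metric; received vectors with time stamps absent from Buffer 1 produce no candidate pair.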
Local estimation then follows: an update with the measurement z i t,k c ,3D and time propagation to the next capture instant, either with BAF or SA-PF, without engaging in any further fusion phase, hence reducing the communication load.
For comparison, the fusion window takes place at the end of each time window K i k at time (k c + α 2 ), either using information pairs' estimates in BAF or fused 3D position measurements in SA-PF received asynchronously from neighboring cameras. As described in [15], each camera node knows the maximum relative offset α max = max i,j α ij and the maximum and minimum processing delays τ max and τ min . To avoid the next capture instant k c coinciding with the fusion phase of the previous time window, a separation T = α 1 + α 2 between consecutive capture instants must exist, where α 1 = α max − τ min and α 2 = α max + τ max [15]. Figure 10 shows the timeline that describes how the actions elapse within time window K i k . The algorithm that obtains the 3D position measurements takes into account the information vectors I w t,k c +τ i kc ,j exchanged within the period [k c − α 1 , k c + τ i k c ]. The estimation phase takes place at k c + τ i k c using the fused 3D position measurement z i t,k c ,3D .
An important point addressed next is how to model the uncertainties inherent to the whole process described above and thus find the fused 3D position measurement z i t,k c ,3D .


Obtaining the 3D Position Measurement at Capture Instant k c
In the initialization of the proximity metric algorithm to determine the distance between constraint pairs → r w t,k,i and → r w t,k,j , we model the LoS vector → LoS w t,k,i with the "Unscented Transform" (UT) [19,40] to approximate the statistics of the random LoS variables transformed by the nonlinear function that determines depth from stereo. The uncertainty in the image plane is a three-dimensional Gaussian distribution N (µ x , C x ) that also accounts for camera calibration error in the focal length f , where µ x is the mean, and C x is the covariance. This Gaussian distribution defines seven sigma points [40]. Figure 11 shows the 3D sigma points along the three axes of the i-th camera due to uncertainty in the detected centroid location in the image plane that gives rise to the LoS vector → LoS w t,k,i = w c R i X i t,c represented in the world coordinate system {W}.
Given that the dimension of X i t,c is L = 3, then 2L + 1 = 7 sigma points in three dimensions are generated in the image plane and along the optical axis, as shown in Figure 12. Consequently, seven constraints → r w t,k,i are found for the same t-th target. The sigma points in the image plane are generated for each target from information that camera C i either detects locally or receives from a neighboring camera C j . The mean µ x of the Gaussian distribution is the LoS vector → LoS w t,k,i = w c R i X i t,c shown in Figure 11, represented by a red arrow.
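Sigma-point generation for L = 3 can be sketched as follows. This is a minimal, hypothetical implementation of the plain UT point set (UT weights and the specific scaling parameter used in [40] are omitted; κ = 0 is assumed):

```python
import math

def cholesky3(C):
    """Lower-triangular Cholesky factor of a 3x3 covariance (nested lists)."""
    L = [[0.0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(i + 1):
            s = sum(L[i][m] * L[j][m] for m in range(j))
            if i == j:
                L[i][j] = math.sqrt(C[i][i] - s)
            else:
                L[i][j] = (C[i][j] - s) / L[j][j]
    return L

def sigma_points(mu, C, kappa=0.0):
    """2L+1 = 7 sigma points for an L = 3 Gaussian N(mu, C):
    the mean plus/minus the columns of the square root of (L+kappa)*C."""
    n = 3
    S = cholesky3([[(n + kappa) * C[i][j] for j in range(3)] for i in range(3)])
    pts = [tuple(mu)]
    for col in range(n):
        column = [S[row][col] for row in range(3)]
        pts.append(tuple(m + c for m, c in zip(mu, column)))
        pts.append(tuple(m - c for m, c in zip(mu, column)))
    return pts
```

Each of the seven points, once back-projected through the camera model, yields one of the seven constraints → r w t,k,i mentioned above.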
The proximity metric determines the distance (a nonlinear function) between each candidate constraint pair, i.e., → r w t,k,i and → r w t,k,j (amounting to a total of 49 candidate pairs, but only the 7 points with minimum distance are considered for each constraint pair). Any point in 3D space along constraint → r w t a ,k,i whose distance to another candidate constraint → r w t b ,k,j falls within the proximity tolerance d min ≤ δ indicates that targets t a seen by C i and t b seen by C j are the same target, and a set of points in 3D space is thus obtained. Figure 13 shows the result of proximity determination between a constraint to a target detected locally by Camera 1 and each of the constraints to targets received in information vectors from neighboring cameras. In this case, Cameras 2, 3, 4, and 5 send constraints → r w t,k,j to Camera 1 within time window K 1 k , and some of them are relative to the same Target 1.
Figure 11. Application of the UT to approximate the 3D position measurement error statistics arising from the uncertainty in the LoS vector → LoS w t,k,i in the image plane. This uncertainty in the LoS vector affects the output error statistics of the proximity metric algorithm that determines the distance between constraint vector pairs → r w t,k,i and → r w t,k,j (the latter not shown for the sake of clarity) and consequently impacts the 3D position measurement. The resulting seven points in 3D space are approximated with a normal distribution N (µ f k , C f k ). Subscript f here means "final" in the UT 3D position mean and covariance determined from averaging with respect to the neighboring cameras tracking the same target centroid at the same time instant, for each instant k within a temporal window. 3D motion of the imaged targets is not constrained to a ground reference plane. The 3D position is in → r w t,k,i (λ*) when the distance between → r w t,k,i (λ*) and → r w t,k,j (λ*) is d min ≤ δ.

Each set of 3D points with corresponding UT mean and covariance (µ f k , C f k ) after averaging refers to a time instant within the time window K i k . Notice that more than one such set may relate to the same instant k when the UT processes simultaneous constraints that originate in information vectors transmitted by n neighboring cameras tracking the same target at instant k. The representative point in 3D space and respective covariance at instant k, µ f k and C f k , are the average over the n UT means and covariances µ jk , C jk , j ∈ N j,i , of simultaneous sets of points in 3D space at instant k.
The final 3D mean, µ f k , averages over the n simultaneous UT means µ 1k , . . . , µ nk in 3D space at instant k as follows:

µ f k = (1/n) ∑ j=1,...,n µ jk .

Likewise, C f k averages over the simultaneous UT covariances C 1k , . . . , C nk . Figure 14 depicts this situation, where five UT means are obtained at instant k = 13, one at instant k = 14, and one at instant k = 15. The result is the average of the UT means at each time instant k, as depicted in Figure 15. The corresponding UT covariance average is not shown in Figure 15.

Figure 14. An instance of distinct UT means µ jk , j = 1, . . . , n, in 3D space at instant k for the sets of 3D points shown in Figure 13.
Thus, 3D points are measured at each instant k within period [k c − α 1 , k c + τ i k c ] of the time window K i k . Two conditions must be met for these measurements to be generated: (1) at least one information vector I w t,k c +τ j kc ,j is received from a neighboring camera at any instant k within the time window period [k c − α 1 , k c + τ i k c ], and (2) this information vector relates to the same target observed at the i-th camera. These two conditions are not always met; in that case, the 3D position of the target centroid is not measured at all instants within the aforementioned period. At least two 3D position measurements at different instants within the time window period [k c − α 1 , k c + τ i k c ] are necessary to derive a linear polynomial and interpolate from the target 3D position measurements forward or backward in time until the capture instant k c to compute the fused 3D position measurement z i t,k c ,3D . This interpolation process occurs at the instant k c + τ i k c , and z i t,k c ,3D is the average of the 3D UT means measured within the period [k c − α 1 , k c + τ i k c ] and interpolated in time until the capture instant k c . The corresponding fused covariance C i t,k c ,3D is the average of UT covariances interpolated in time until the capture instant.
In essence, our approach provides the fused 3D position measurement z i t,k c ,3D and respective covariance C i t,k c ,3D from the following steps: (1) The 3D position measurements z i t,k,3D and covariances C i t,k,3D are output by the i-th camera local proximity algorithm at each k instant within period [k c − α 1 , k c + τ i k c ] if the two conditions above are met; (2) Then, a linear polynomial interpolation propagates z i t,k,3D and C i t,k,3D backward or forward in time until alignment with the i-th camera local capture instant k c ; (3) Finally, the fused 3D position measurements z i t,k c ,3D and covariances C i t,k c ,3D result from averaging the interpolated measurements and covariances at capture instant k c .
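Steps (1)-(3) above can be sketched for the position measurements as follows; a minimal illustration with hypothetical names, assuming the linear polynomial is taken through the first and last samples in the window (the text only states that the interpolating polynomial is linear):

```python
def fuse_to_capture_instant(measurements, k_c):
    """'measurements' is a list of (k, xyz) pairs of 3D position
    measurements inside the time window.  A linear polynomial through the
    first and last samples propagates each axis to the capture instant
    k_c; a single sample at k_c is accepted directly, and a single sample
    elsewhere is discarded (None), as in the text."""
    if len(measurements) == 1:
        k, xyz = measurements[0]
        return tuple(xyz) if k == k_c else None
    measurements = sorted(measurements)
    (k0, p0), (k1, p1) = measurements[0], measurements[-1]
    slope = [(b - a) / (k1 - k0) for a, b in zip(p0, p1)]
    # propagate every measurement backward/forward to k_c, then average
    fused = [0.0, 0.0, 0.0]
    for k, p in measurements:
        for ax in range(3):
            fused[ax] += p[ax] + slope[ax] * (k_c - k)
    return tuple(f / len(measurements) for f in fused)
```

The covariances C i t,k,3D would be interpolated and averaged in the same manner to produce C i t,k c ,3D .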
A remark is that when just one 3D position measurement is generated at an instant that differs from k c , the measurement is discarded because there is no other point to derive a linear interpolating polynomial, and the next time window is awaited. By contrast, when just a single 3D position measurement occurs at frame capture instant k c , this measurement is accepted as z i t,k c ,3D with the corresponding covariance C i t,k c ,3D . The fused 3D position measurement z i t,k c ,3D and covariance C i t,k c ,3D at instant k c are the required information for our decentralized approach with either the local estimation phase at instant k c + τ i k c or further fusion-based estimation with BAF or SA-PF at instant k c + α 2 .

Dynamic Target Model: Linear Gaussian Model
The discrete-time dynamic model for each target [41] at each camera is given by

x i t,k c +∆k c = F(k c , k c + ∆k c ) x i t,k c + w(k c , k c + ∆k c ),

where F(k c , k c + ∆k c ) is the state transition matrix from instant k c to k c + ∆k c , defined as

F(k c , k c + ∆k c ) = [ I 3 ∆k c I 3 ; 0 3 I 3 ],

and w(k c , k c + ∆k c ) is the cumulative noise between time instants k c and k c + ∆k c , which is considered to be Gaussian with zero mean and covariance Q(k c , k c + ∆k c ) expressed as follows:

Q(k c , k c + ∆k c ) = σ 2 u [ ((∆k c ) 3 /3) I 3 ((∆k c ) 2 /2) I 3 ; ((∆k c ) 2 /2) I 3 ∆k c I 3 ],

where σ 2 u is the variance of process noise in the nearly-constant-velocity model [41]. The measurement model for the i-th camera is given by z i t,k c ,3D = H x i t,k c ,3D + ν i , where ν i is the measurement noise vector with dimensions 3 × 1, which is assumed to be Gaussian with zero mean and covariance R i .
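The transition and process-noise matrices of the nearly-constant-velocity model can be sketched as follows (hypothetical helper names; state ordered as [x y z vx vy vz], with the standard continuous-white-noise-acceleration discretization [41]):

```python
def ncv_model(dk, sigma_u2):
    """Returns (F, Q) for a propagation interval dk and process-noise
    variance sigma_u2 in the nearly-constant-velocity model."""
    I3 = [[1.0 if r == c else 0.0 for c in range(3)] for r in range(3)]
    def scale(M, s):
        return [[s * v for v in row] for row in M]
    def block(a, b, c, d):
        # assemble a 6x6 matrix from four 3x3 blocks
        top = [ra + rb for ra, rb in zip(a, b)]
        bot = [rc + rd for rc, rd in zip(c, d)]
        return top + bot
    Z3 = scale(I3, 0.0)
    F = block(I3, scale(I3, dk), Z3, I3)
    Q = block(scale(I3, sigma_u2 * dk ** 3 / 3), scale(I3, sigma_u2 * dk ** 2 / 2),
              scale(I3, sigma_u2 * dk ** 2 / 2), scale(I3, sigma_u2 * dk))
    return F, Q
```

The measurement matrix H simply selects the position block of the state, so z = H x extracts the first three components.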

Target Information Fusion within the Fusion Window [15]
Our proposal is the use of fused 3D position measurements derived from stereoscopy in combination with local state estimation to track actual moving subjects undergoing free 3D motion within the fields of view of asynchronous, calibrated cameras with known poses and located far apart. In certain circumstances, very distinct parts of a target surface can be viewed from very different viewing directions; thus, the 3D centroid position of such a target lies at some point inside and not on the viewed surfaces. Local estimation uses either the Kalman filter in information form or particle filtering. Then we test additional fusion with either BAF or SA-PF for comparison with just local estimation and investigate the tradeoff with the additional communication load.
For that purpose, the SA-PF and BAF algorithms are modified as follows. After the local estimation at the i-th camera uses the fused 3D position measurement obtained from the asynchronous LoS vectors received within the time window K i k , further fusion occurs with the additional content received asynchronously from the neighboring cameras within a fusion window with a configurable size p. This additional content is either the information pair ( y i,i t,k c ,3D , Y i,i t,k c ,3D ) in BAF or the fused 3D position measurement and covariance (z i t,k c ,3D , C i t,k c ,3D ) in SA-PF. Figure 16 represents the actions occurring in two consecutive time windows. The motivation to use SA-PF is that it is not limited to the linear Gaussian assumption of the information filter. We then investigate whether an improvement occurs with the additional communication exchange in the fusion window. Transmission and reception instants within the fusion window are defined for the exchange between camera C i and its neighboring cameras C j of either the information pair in BAF or the fused 3D position measurement and covariance in SA-PF. Fusion with additional communication exchange in BAF is as follows: (1) BAF fusion: the t-th target information pair is propagated from frame capture instant k c to the transmission instant, thus yielding ( y i,i t,k c ,3D , Y i,i t,k c ,3D ) for C i transmission; camera C i receives information pair estimates from neighboring cameras at reception instants within the fusion window. At the end of the fusion window, C i updates the state estimate at k c and propagates to the next frame capture instant k c + 1. Figure 16 shows how we changed the timeline of the Batch Asynchronous Filter (BAF)-Lower Load Mode (LLM) [15] and the Sequential Asynchronous Particle Filter (SA-PF) [17] to accommodate further fusion.
Our purpose here is to investigate whether improved target tracking across the asynchronous camera network results from further communication other than just exchanging LoS vectors among neighboring cameras to compute fused 3D position measurements and then use local estimation.

Asynchronous Bayesian State Estimation
Neighboring cameras transmit either the information pairs' estimates (BAF) or the fused 3D position measurement and covariance (SA-PF) at the end of the time window K i k , and introduce a fusion window with configurable duration p after K i k . Within the fusion window, neighboring cameras receive asynchronously either information pairs' estimates (BAF) or fused 3D position measurements and covariances (SA-PF).
Relative to local estimation with fused 3D position measurements, we investigate (a) the KLA-based fusion of received information pairs' estimates temporally aligned in batch with the local frame capture instant k c (BAF), or (b) the sequential update of particles' weights by sampling from a likelihood density function with the received fused 3D position measurements and covariances (SA-PF).
The following Algorithm 1 describes the steps to investigate the effect of additional communications and fusion other than exchanging LoS vectors among neighboring cameras for computing fused 3D position measurements:

Algorithm 1. Algorithm to investigate the effect on target tracking of further fusion other than exchanged LoS vectors and local estimation
Input: I — images taken by camera C i
Output: estimated state x i t,kc,3D and the root mean square error computation ε i t,kc,2D for r Monte Carlo simulation runs
for each camera
1- If t-th target detection is confirmed after three occurrences in images, its position estimate ẑ i t,kc,3D is computed as follows:
for each detected target
2- Detect target centroids in the image plane z i t,k,2D within the time window period K i k ;
3- Transform measurements z i t,k,2D into stored information vectors i w t,k,i = {k ; O w c,i ; → LoS w t,k,i };
4- Asynchronous transmission/reception of information vectors I w t,kc+τ i kc ,i = {k c + τ i kc ; O w c,i ; → LoS w t,kc+τ i kc ,i } among cameras C i and C j , j ∈ N j,i , j ≠ i;
5- Compute 3D position measurements z i t,k,3D and covariances C i t,k,3D within the time window K i k from received LoS vectors, and interpolate all of them into temporal alignment with the frame capture instant k c to account for frame processing delay and camera asynchrony;
6- Local estimation of the target position and velocity in 3D space x i t,kc,3D ;
7- Investigate the added fusion phase with either the BAF or the SA-PF: asynchronously transmit/receive information pairs' estimates (BAF) or fused 3D position measurements (SA-PF) among cameras C i and C j , j ∈ N j,i , j ≠ i;
8- ẑ i t,kc,3D is the resulting 3D position component of the target state estimate;
9- Perspective projection of the 3D position estimate of each target being tracked onto the 2D image planes of each neighboring camera, expressed in pixel coordinates: ẑ i t,kc,3D → ẑ i t,kc,2D ;
end for; end for;
10- Average root mean square error ε i t,kc,2D = ‖ẑ i t,kc,2D,r − z i t,kc,2D ‖ 2 2 to compare performance in steps 6 and 7.
end for.

Batch Asynchronous Filter (BAF)-Lower Load Mode (LLM) [15]
The initialization of each target involves the fused 3D position measurement z i t,k c ,3D and respective covariance C i t,k c ,3D . Each camera C i uses Kalman filtering in information form [7]. The fused 3D position measurement pair (z i t,k c ,3D , C i t,k c ,3D ) is transformed into the measurement information pair (H T (C i t,k c ,3D ) −1 z i t,k c ,3D , H T (C i t,k c ,3D ) −1 H). Local estimation uses the measurement information pair and the information pair (ŷ i t,k c ,3D , Ŷ i t,k c ,3D ) predicted from the previous frame capture instant to update locally the information pair to (y i,i t,k c ,3D , Y i,i t,k c ,3D ) of the t-th target at camera C i , where ẑ i t,k c ,3D = H (Y i,i t,k c ,3D ) −1 y i,i t,k c ,3D is the 3D position estimate of the target for later use to evaluate performance in step 6 of Algorithm 1 with the average root mean square error in the image plane.
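The information-form update can be sketched in one dimension; this is a deliberately scalar illustration (hypothetical names, H = 1) of the additive update that the full 6-state filter performs with matrices:

```python
def info_update(y_pred, Y_pred, z, C):
    """Information-form Kalman update sketch in one dimension (H = 1):
    the measurement pair (z, C) becomes the information contribution
    (z / C, 1 / C), added to the predicted information pair."""
    y = y_pred + z / C
    Y = Y_pred + 1.0 / C
    return y, Y

def position_estimate(y, Y):
    # state estimate recovered from the information pair
    return y / Y
```

In the full filter the contribution is (H^T C^-1 z, H^T C^-1 H), and the estimate is recovered as Y^-1 y; the scalar case keeps the same structure with divisions in place of matrix inverses.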
To investigate the added fusion phase in step 7 of Algorithm 1, for the purpose of transmission C i predicts the information pair to the transmission instant within the fusion window illustrated in Figure 16. C i transmits the predicted information pair ( y i,i t,k c ,3D , Y i,i t,k c ,3D ) to all cameras j ∈ N j,i . Furthermore, C i receives at the local reception instant within the fusion window the information pair ( y j,j t,k c ,3D , Y j,j t,k c ,3D ) from a neighboring camera C j , as depicted in Figure 16. Camera C i has no knowledge of the frame capture instant of neighboring camera C j ; therefore, the received information pair undergoes propagation backward in time to the local frame capture instant k c . Fusion of local and received information pairs at camera C i for each target at instant k c is obtained with the Kullback-Leibler Average (KLA) [20], where n N j,i indicates the number of neighboring cameras in the set N j,i whose information pairs of the t-th target are received within the fusion window of the i-th camera. Then, the fused information pair yields the fused 3D position estimate ẑ i t,k c ,3D of the target in step 7 of Algorithm 1 for later use to evaluate performance with the average root mean square error in the image plane.
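The KLA step can be sketched as follows, assuming (as is common for Gaussian densities in information form [20]) that the KLA with uniform weights reduces to averaging the information vectors and matrices; scalars stand in for the full information pairs, and the function name is hypothetical:

```python
def kla_fuse(local_pair, received_pairs):
    """Kullback-Leibler Average sketch with uniform weights: average the
    information vectors and matrices (scalars here for brevity) of the
    local pair and of the (y, Y) pairs received from the neighboring
    cameras within the fusion window."""
    pairs = [local_pair] + list(received_pairs)
    n = len(pairs)
    y = sum(p[0] for p in pairs) / n
    Y = sum(p[1] for p in pairs) / n
    return y, Y
```

The fused pair then plays the role of (ŷ i t,k c ,3D , Ŷ i t,k c ,3D ) in the prediction to the next capture instant.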
We define as strategy (a) the local estimation in step 6, with exchange of just the information vectors I w t,k c +τ i kc ,i conveying LoS vectors. This strategy predicts from the updated (y i,i t,k c ,3D , Y i,i t,k c ,3D ) to (ŷ i t,k c +1,3D , Ŷ i t,k c +1,3D ) at the next capture instant k c + 1, and computes the next measurement information pair H T (C i t,k c +1,3D ) −1 H based on the fused 3D position measurement and covariance, which we compute with LoS vectors received in the next time window K i k+1 . For comparison, we define strategy (b) in step 7 as adding KLA-based fusion with information pairs' estimates received from neighboring cameras within the fusion window, beyond the LoS vectors received within K i k . This strategy predicts from the KLA-fused information pair (ŷ i t,k c ,3D , Ŷ i t,k c ,3D ) to (ŷ i t,k c +1,3D , Ŷ i t,k c +1,3D ) at the next capture instant k c + 1 and, as in strategy (a), computes the next measurement information pair H T (C i t,k c +1,3D ) −1 H from the fused 3D position measurement and covariance from LoS vectors received in the next time window K i k+1 . For clarification, see Figure 17.

The position estimates of the t-th target resulting from strategies (a) and (b) at camera C i at frame capture instant k c are, respectively, ẑ i t,k c ,3D = H (Y i,i t,k c ,3D ) −1 y i,i t,k c ,3D and its counterpart computed from the KLA-fused information pair (ŷ i t,k c ,3D , Ŷ i t,k c ,3D ).

Sequential Asynchronous Particle Filter [17]
The initial step for obtaining a fused 3D position measurement pair (z i t,k c ,3D , C i t,k c ,3D ) at each camera C i is the same as previously described. Each camera has the same inter-frame period T = α 1 + α 2 between consecutive frame capture instants, as depicted in Figure 10. The model of the t-th target is the same as in Section 3.2.
In the prediction process of SA-PF, new particles x i,(l) t,k c , l = 1, . . . , L, are drawn from the dynamic model of the target. The corresponding weights ω̃ i,j m ,(l) t,k c , l = 1, . . . , L, are updated sequentially, where p(z) is the likelihood function in Equation (21). The weights in Equation (19) are normalized and account for the fused 3D position measurements z j t,k c ,3D received at camera C i during the fusion window at distinct reception instants, including the local z i t,k c ,3D and up to the neighboring camera C j m , as indicated by superscript (j 0 : j m ) [17].
This update in SA-PF is sequential until the last remaining measurement from the neighboring cameras received within the fusion window is processed, at which point the weights become the "global posterior" weights [17].
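The sequential weight update can be sketched as follows; a minimal scalar illustration with hypothetical names, where a one-dimensional Gaussian stands in for the multivariate likelihood of Equation (21):

```python
import math

def gaussian_likelihood(z, mu, var):
    """Scalar stand-in for the multivariate Gaussian likelihood."""
    return math.exp(-0.5 * (z - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def sequential_weight_update(particles, weights, measurements):
    """SA-PF fusion-window sketch: weights are multiplied by the likelihood
    of each fused 3D position measurement (local first, then each
    neighboring camera's as it arrives) and renormalized after every step,
    yielding the 'global posterior' weights once the last measurement is in."""
    for z, var in measurements:          # measurements arrive sequentially
        weights = [w * gaussian_likelihood(z, p, var)
                   for w, p in zip(weights, particles)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return weights
```

A single heavily delayed measurement produces near-zero likelihoods for most particles, which is precisely the degradation mechanism discussed in the results.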

Evaluation Setup
We use MATLAB [42] and the PETS2009 database [27] with a camera network composed of N s = 8 cameras. The targets move within the surveillance volume of a network of calibrated cameras, whose construction (intrinsic) parameters and poses are given in the header of the PETS2009 image sequence dataset file. This dataset is challenging for tracking tasks because targets change direction, and clustering and scattering occur frequently. Scene illumination and obstacles also hinder tracking. In this experiment, the frame sampling rate was 7 frames/s (a period of ≈ 143 ms), with a total of 794 frames per camera. Each time step k takes approximately 143 ms. Figure 18 shows all FoVs projected on the ground reference plane for each of the eight cameras. Portions of some FoVs do not project on the ground and instead cover the scenario above ground. The surveillance area of the cameras is approximately 60 m × 90 m as depicted in Figure 18, and their FoVs overlap. Moreover, the neighborhoods for each camera are defined by the network communication topology, also shown. The chosen topology reflects two camera neighborhoods (clusters), one with cameras having large FoV areas projected on the ground and the other consisting of cameras with smaller projections. Communications between the clusters occur by means of cameras 4 and 8 and cameras 1 and 5. BAF assumes the information pdfs are Gaussian, whereas PF makes no such assumption. They are evaluated under the two strategies (a) and (b) previously explained.
The random process modeling the processing delay in frame units is τ i k ∈ U {0, τ max }, where U is the uniform distribution. A uniform distribution is also used for the offset α ij between cameras: U {0, α max }. Denote Λ i,r ⊂ [1, 794] as the set of frame capture instants during the r-th run of a Monte Carlo simulation, r = 1, . . . , M, of camera C i tracking a target, and P i k ⊂ {1, . . . , N t } as the set of targets in the FoV of camera C i at capture instant k c . The evaluation is based on the average root mean square error ε, computed from the errors ‖ẑ i t,k c ,2D,r − z i t,k c ,2D ‖ 2 averaged over the M runs, the capture instants in Λ i,r , and the targets in P i k , where ẑ i t,k c ,2D,r is the estimated 3D position of the t-th target projected according to the calibrated pinhole model onto the image plane of camera C i at the instant k c during the r-th run, and z i t,k c ,2D corresponds to the actual measurement in the image plane as provided by the target detection algorithm. The latter does not change with r, since the image sequence dataset is given and processed previously by the target detection algorithm. σ 2 u = 4 in Equation (12), and R i = C i t,k c ,3D in the information filter with the measurement Equation (21). F and Q are given by the dynamic model of the target. The error ε statistics converged with M = 12 runs. We skip dataset frames to simulate asynchronous frame capture and frame processing delay. ε is analyzed by increasing the relative offset 0 ≤ α max ≤ 40 (see Figure 19) with a fixed processing delay τ i k = τ i at each i-th camera and r-th run. Each camera has τ i sampled once from U {0, τ max }, 0 ≤ τ max ≤ 3, and it remains the same throughout all runs. The same process is applied to analyze the effect on ε of increasing 0 ≤ τ max ≤ 40 (see Figure 19) with a fixed relative offset α i at each i-th camera and r-th run. Each camera has α i sampled once from U {0, α max }, with α max = 0 (synchronous case) or 0 < α max ≤ 4 (asynchronous case), and it remains the same throughout all runs.
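The evaluation metric can be sketched as follows; a minimal illustration with hypothetical container layouts, assuming a plain average of Euclidean image-plane errors over runs, targets, and capture instants:

```python
import math

def average_rmse(estimates, truths):
    """'estimates' maps (r, t, kc) -> estimated target projection in pixel
    coordinates and 'truths' maps (t, kc) -> the detector measurement in
    the image plane, which does not change with the run r.  Returns the
    Euclidean error averaged over all runs, targets, and capture instants."""
    errors = [math.dist(est, truths[(t, kc)])
              for (r, t, kc), est in estimates.items()]
    return sum(errors) / len(errors)
```

Each camera evaluates this metric on its own image plane after projecting the 3D estimates with the calibrated pinhole model.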
Concerning the particle filter, the likelihood function in Equation (17) is the multivariate Gaussian pdf [43] with N = 3, since z^i_{t,k_c,3D} and C^i_{t,k_c,3D} refer to the fused 3D position measurement. The likelihood p(z) in Equation (19) shares the structure of Equation (21) with the corresponding variables changed. L = 1000 particles are used. Figure 19 shows the simulation results for strategies (a) and (b): (a) exchanges LoS vectors among neighboring cameras followed by local estimation using fused 3D position measurements (BAF and PF), and (b) employs a fusion window after the time window K^i_k with duration p = {1, 2, 4} in frames to process either exchanged information pairs' estimates (BAF) or fused 3D position measurements (SA-PF) rather than LoS vectors. The reception instant is taken as the frame capture instant of the neighboring camera, since SA-PF is not informed of the frame capture time at the neighboring camera. Moreover, just one significantly delayed measurement from a neighboring camera yields a small likelihood value, thus compromising the remaining particle weight updates in the sequential fusion step of SA-PF. The small likelihood in such a condition is due to the particle state estimate undergoing time propagation in Equation (19) from the capture instant to the reception instant, whereas the neighboring camera measurement received asynchronously in the fusion window presents a large delay relative to the local camera capture instant. Then, if the neighboring capture instant differs significantly from the local capture instant k_c, the updated estimate is negatively affected by the delay. Therefore, for τ_max and α_max ≤ 16 (small values of processing delay and asynchrony), the SA-PF filter produces an average root mean square error similar to that of the BAF-LLM filter; i.e., small α_max and τ_max do not degrade SA-PF filter performance in strategy (b).
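A minimal sketch of the Gaussian-likelihood particle weight update described above; the function and variable names are ours, with z and C standing in for the fused 3D position measurement z^i_{t,k_c,3D} and its covariance C^i_{t,k_c,3D}:

```python
import numpy as np

def update_weights(particles, weights, z, C):
    """Multivariate Gaussian likelihood update for an N = 3 position measurement.

    particles: (L, 3) predicted particle positions
    weights:   (L,) prior weights
    z:         (3,) fused 3D position measurement
    C:         (3, 3) measurement covariance
    """
    Cinv = np.linalg.inv(C)
    norm = 1.0 / np.sqrt(((2 * np.pi) ** 3) * np.linalg.det(C))
    diff = particles - z
    # Gaussian likelihood per particle (Mahalanobis form via einsum)
    lik = norm * np.exp(-0.5 * np.einsum('li,ij,lj->l', diff, Cinv, diff))
    w = weights * lik
    return w / np.sum(w)   # normalize
```

With L = 1000 particles this is one sequential fusion step; a strongly delayed neighboring measurement drives every likelihood term toward zero, which is the degradation mechanism discussed above.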

Discussions
As depicted in Figure 19 for all tested p = {1, 2, 4}, the average root mean square error of SA-PF degrades suddenly relative to BAF-LLM when either the asynchrony or the frame processing delay increases significantly (blue and magenta curves, strategy (b)). BAF-LLM is not susceptible because of its prediction step from the frame capture instant to the transmission instant.
One problem is to validate the target information received by camera C_i because, as stated previously, different IDs can be assigned to the same target by different cameras. In BAF, this problem is solved by evaluating the Euclidean distance between the target 3D position estimate in camera C_i and the position estimate received from the neighborhood. In the SA-PF filter, we use the Euclidean distance between measurement z^i_{t,k_c,3D} and measurement z^{j_m}_{t,k_c,3D} received from the neighborhood; i.e., for each information exchange between cameras in the fusion window of a potential target, a distance tolerance ensures the correct fusion of data arising from the same target.
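The distance-tolerance validation can be sketched as a nearest-neighbor gate; the helper name and the specific tolerance value in the example are illustrative assumptions, not values from the paper:

```python
import numpy as np

def match_targets(local_pos, received_pos, tol):
    """Pair each received 3D position with the nearest local 3D position,
    accepting the pair only if their Euclidean distance is within tol;
    unmatched entries are treated as distinct targets."""
    pairs = []
    for j, q in enumerate(received_pos):
        dists = [np.linalg.norm(p - q) for p in local_pos]
        i = int(np.argmin(dists))
        if dists[i] <= tol:
            pairs.append((i, j))
    return pairs

local = [np.array([0.0, 0.0, 0.0]), np.array([10.0, 0.0, 0.0])]
received = [np.array([0.1, 0.0, 0.0]), np.array([50.0, 0.0, 0.0])]
print(match_targets(local, received, tol=0.5))  # → [(0, 0)]
```

This gating is what lets two cameras fuse data for the same physical target even when their detection algorithms assigned it different IDs.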
Target tracking by a camera is lost if the camera capture interval T = α_1 + α_2 exceeds the time the target remains within the camera FoV. Unlike [15], where a fixed number of targets remain within sight of the camera network during the experiment, our distributed target tracking algorithm can track a varying number of targets as they move within the FoV of at least 2 cameras.
Cameras in close spatial proximity degrade the 3D estimation accuracy of targets at long range, since the 3D measurement from imaging proposed here borrows the concept of depth estimation from baseline length and the epipolar constraint in stereoscopy.
Additionally, we conducted further evaluation of (1) the effect of camera pose uncertainty, with focus on camera attitude error, and (2) computational complexity of our proposed 3D position measurement approach.
The PETS2009 image sequence dataset is a real benchmark scenario for target tracking with fixed cameras whose calibration parameters and pose data are accurate. In spite of that, we conduct a Monte Carlo simulation with M = 20 runs wherein uncertainties are added to the extrinsic calibration parameters of each camera (rotation and translation parameters). To the calibrated rotation angles θ_z, θ_y, θ_x of each camera in radians, we add a Gaussian uncertainty N(µ = 0.05 rad, σ = 0.001 rad). Additive Gaussian uncertainty N(µ = 10 cm, σ = 1 cm) in camera location is also applied. We evaluate the mean square error in Equation (20). The simulation investigated processing delays τ = (0, 2, 4, 6, 8, 10, 12, 14, 16, 18), maximum asynchrony among cameras α_max = 4, and the particular asynchrony value for each camera sampled from the uniform distribution U{0, α_max}; i.e., each camera received 0 ≤ α ≤ α_max. Figure 20 indicates that uncertainties in the cameras' poses degrade the fused 3D position measurement of targets when compared with exact poses, as expected.
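A sketch of how the stated pose uncertainties could be injected in each run; we assume the Gaussian perturbation is applied independently to each rotation angle and each position coordinate, which the text does not state explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def perturb_pose(angles_rad, position_m):
    """Return one noisy extrinsic sample: N(mu=0.05 rad, sigma=0.001 rad)
    added to each rotation angle (theta_z, theta_y, theta_x) and
    N(mu=10 cm, sigma=1 cm) added to each coordinate of the camera
    position, expressed in meters."""
    noisy_angles = angles_rad + rng.normal(0.05, 0.001, size=3)
    noisy_position = position_m + rng.normal(0.10, 0.01, size=3)
    return noisy_angles, noisy_position
```

Each Monte Carlo run would draw one such perturbed pose per camera before recomputing the projection used by the fused 3D position measurement.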
We evaluate numerically the computational complexity by increasing the number n_i of targets observed in the field of view of the i-th camera and the number n_j of targets received by the i-th camera from the j-th cameras, thus resulting in a total of n_i · n_j targets.
The processing tasks considered for each of the targets are:
(a) Variable n_i: each target has 7 sigma points assigned to the local LoS vector pointing to the t-th target in the image plane;
(b) Variable n_j: each target has 7 sigma points assigned to the received LoS vector pointing to the received t-th target in the image plane;
(c) Intersection search over pairs of LoS vectors from the set of 49 LoS vectors, providing points in 3D space (Equations (6) and (8));
(d) Check which 3D points pass the distance criterion d ≤ 0.3 m;
(e) Select the 7 LoS vectors from the set of 49 LoS vectors that pass the above test with the smallest distance d; and
(f) UT computation: 3D position measurement mean and covariance obtained at a k_c capture instant of the i-th camera.
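Tasks (c) and (d) rest on the closest approach between pairs of LoS vectors. A sketch, assuming Equations (6) and (8) correspond to the standard midpoint of the common perpendicular between two skew lines (function and variable names are ours):

```python
import numpy as np

def los_intersection(p1, d1, p2, d2):
    """Approximate intersection of two lines of sight p1 + s*d1 and
    p2 + t*d2: returns the midpoint of their common perpendicular and
    the closest-approach distance d (to check against d <= 0.3 m)."""
    w0 = p1 - p2
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    d, e = d1 @ w0, d2 @ w0
    denom = a * c - b * b          # zero only for parallel lines
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    q1, q2 = p1 + s * d1, p2 + t * d2
    return 0.5 * (q1 + q2), float(np.linalg.norm(q1 - q2))

# Two rays that truly intersect at (1, 1, 0):
point, dist = los_intersection(np.array([0., 0., 0.]), np.array([1., 1., 0.]),
                               np.array([2., 0., 0.]), np.array([-1., 1., 0.]))
# dist is ~0, so this candidate passes the d <= 0.3 m criterion
```

Running this for the 49 sigma-point LoS pairs yields the cloud of 3D points that tasks (d)–(f) filter and feed to the UT.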
We noticed three regions in Figure 21: (a) for the number of targets varying from 1 up to 90 (i.e., from 1 × 1 up to 90 × 90 targets), the elapsed processing time shows an exponential tendency; (b) from 91 up to 262 targets (i.e., from 91 × 91 up to 262 × 262 targets), we observe a linear tendency; and (c) over 262 × 262 targets, the elapsed processing time again presents an exponential tendency.
In the PETS2009 image sequence dataset, the number of observed targets at any k_c capture instant does not exceed 10 simultaneous targets. Therefore, our approach yields a very low processing time to carry out the 3D position measurements.

Conclusions
We propose position measurement in 3D space with intersecting vector constraints derived from exchanged LoS vectors in a camera network with known relative poses. Solving for the 3D intersection of vector constraints is based on computing the closest proximity between skew straight lines. Each camera does not need the construction parameters of the other ones to obtain the 3D measurements. We then applied our approach to tracking simultaneous targets with a network of asynchronously communicating cameras subject to frame processing delays. The targets enter and leave the fields of view of the cameras in the image sequence dataset.
We use the average root mean square error across all image planes of the asynchronous camera network to measure tracking performance under two conditions: fused 3D position measurement from exchanged LoS vectors and local state estimation (strategy (a)), and fusion of additional information other than the exchanged LoS vectors (strategy (b)).
BAF and PF present similar performance under strategy (a), which uses local estimation. With regard to strategy (b), involving further fusion, BAF-LLM is superior to SA-PF under large asynchrony and processing delay. Comparing strategies (a) and (b), BAF, PF, and BAF-LLM yield approximately the same performance over a wide range of asynchrony and processing delay. We remark that local estimation in BAF and PF demands less communication than the further fusion used in BAF-LLM and SA-PF. Further investigation should focus on the use of consensus to improve target tracking subject to asynchronous exchange in the camera network.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: