Article

Sensor Fusion with Asynchronous Decentralized Processing for 3D Target Tracking with a Wireless Camera Network

by
Thiago Marchi Di Gennaro
1,*,† and
Jacques Waldmann
2
1
Systems and Control Department, Weapons Systems Directorate, Brazilian Navy, Rio de Janeiro 20010-100, RJ, Brazil
2
Systems and Control Department, Electronics Engineering Division, Instituto Tecnológico de Aeronáutica, São José dos Campos 12228-900, SP, Brazil
*
Author to whom correspondence should be addressed.
Current Address: Rua Primeiro de Março 118—Edifício Barão de Ladário—Andares 17° a 21°—Centro, Rio de Janeiro 20010-100, RJ, Brazil.
Sensors 2023, 23(3), 1194; https://doi.org/10.3390/s23031194
Submission received: 15 November 2022 / Revised: 13 January 2023 / Accepted: 16 January 2023 / Published: 20 January 2023
(This article belongs to the Section Sensor Networks)

Abstract:
We present a method to acquire 3D position measurements for decentralized target tracking with an asynchronous camera network. Cameras with known poses have fields of view with overlapping projections on the ground and 3D volumes above a reference ground plane. The purpose is to track targets in 3D space without constraining motion to a reference ground plane. Cameras exchange line-of-sight vectors and respective time tags asynchronously. From stereoscopy, we obtain the fused 3D measurement at the local frame capture instant. We use local decentralized Kalman information filtering and particle filtering for target state estimation to test our approach with only local estimation. Monte Carlo simulation includes communication losses due to frame processing delays. We measure performance with the average root mean square error of 3D position estimates projected on the image planes of the cameras. We then compare only local estimation to exchanging additional asynchronous communications using the Batch Asynchronous Filter and the Sequential Asynchronous Particle Filter for further fusion of information pairs’ estimates and fused 3D position measurements, respectively. Similar performance occurs in spite of the additional communication load relative to our local estimation approach, which exchanges just line-of-sight vectors.

1. Introduction

Tracking multiple targets in a distributed-processing wireless-communication network (WCN) of cameras encompasses several areas such as surveillance, computer vision, threat detection [1], inertial navigation aided by additional sensors, and autonomous vehicles. In the tracking process investigated here, we assume fixed cameras whose poses are known and that have computational processing resources (network nodes). The cameras estimate the location of a moving object of interest, here called a target, over time [2] in three-dimensional (3D) space. Network topology affects the distributed processing, which aims at greater flexibility and robustness [3], so that target tracking becomes less prone to failure. Each network node implements detection algorithms and state estimators, where the state estimate comprises the position and velocity of the target. Detected targets are represented by the location of their centroids on the image plane, and the corresponding measurements are the line-of-sight (LoS) vectors. The target 3D position arises from intersecting LoS vectors measured by neighboring nodes covering the 3D space the target traverses. Hence, target 3D motion is not restricted to a reference ground plane, and targets can move freely in the 3D space covered by the field of view of each camera in the network. LoS measurements are usually noisy or erroneous owing to the limitations of computer vision algorithms [4]. With a network of cameras, it is possible to improve target state estimation via fusion of information from different cameras observing the same target.
Two approaches for information exchange among cameras are found in the literature: measurement information exchange [3,5,6] and state estimate information exchange [7,8,9,10,11,12,13,14,15,16,17,18]. The former computes the fusion of exchanged measurements and subsequently proceeds to target state estimation. Concerning the target state estimate exchange, the i-th camera fuses the local state estimate with the same target’s state estimate received from neighboring cameras.
Each camera node has a local clock representing the timeline for each of its pre-programmed actions. The network nodes share the same inter-frame interval, though frame capture instants are asynchronous [8]. Moreover, for each captured frame, there is a processing delay when detecting the targets and computing the measurements. Here, this measurement is the 3D target position in world Cartesian coordinates from intersecting local and received LoS vectors of cameras with known poses followed by the unscented transform (UT) [19]. Processing delay and asynchronous capture are both integer multiples of the inter-frame interval and dealt with by means of temporal alignment of the 3D position measurements with the local frame capture instant and measurement fusion by averaging the aligned 3D position measurements. Then, we proceed with the local estimation of either the target information pair with Kalman filtering in information form or the target state estimate with particle filtering.
We compare the above with the Batch Asynchronous Filter (BAF) [15] and the Sequential Asynchronous Particle Filter (SA-PF) [17], both Bayesian decentralized filters, in combination with our fused 3D position measurement approach. In their original form, BAF and SA-PF exchange more than just asynchronous line-of-sight vectors to targets: BAF exchanges asynchronous information pairs' estimates, and SA-PF exchanges fused 3D position measurements.
BAF exchanges asynchronous information pair estimates among neighboring cameras from local Kalman information filters, aligns them temporally at the frame capture instant for Kullback–Leibler Average (KLA) fusion [20], and proceeds to target information pair update followed by time propagation to the next local capture instant.
SA-PF [17] asynchronously exchanges fused 3D position measurements among neighboring cameras, then propagates each local particle state estimate to the reception instant of each measurement in the local reception sequence, iteratively updates each local particle weight with the likelihood of each received measurement, updates the local state estimate, and propagates to the next local capture instant.

1.1. Related Work

Target tracking in a distributed asynchronous camera network can use either a sequential method [13] or a batch method [6]. In sequential methods, a network node computes the target state not only when local measurements are obtained but also when the node receives information from neighboring camera nodes. The nodes do not propagate the estimates received from neighboring nodes into alignment with the local frame capture instant prior to the update; the reception instants are treated as the sending cameras' frame capture instants, and the probability density for target state estimation, namely the particle weights, is updated sequentially [15,17]. The SA-PF filter disregards the processing delays.
In batch processing methods, the update with data information received asynchronously from other nodes occurs after the received data are propagated to the local frame capture instant for the purpose of time alignment [6,15]. BAF [15] considers the asynchronous frame capturing among cameras and the processing delay in the estimation and fusion phases at each camera. In the estimation phase, a camera obtains the 3D position measurement assuming the target motion develops on a ground reference plane and estimates the information pair at the frame capture instant. Cameras exchange local target state information, and all information received during a time window period is stored in a buffer. At the end of the time window, a camera enters the fusion phase and propagates the received target information pairs of other cameras either backward or forward to temporal alignment with the local frame capture instant [21].
Previous work most closely related to target tracking with a camera network used information exchange of two-dimensional target states restricted to motion on a reference ground plane [2,3,7,8,10,13,14,15,17,21]. However, much less attention has been given to the often-occurring scenario in which targets move in 3D space while taking into account network asynchrony and delays in image processing. For example, in [22], multi-view histograms were employed to characterize targets in 3D space using color and texture as the visual features; the implementation was based on particle filtering. In [23], a deep network model was trained with an RGB-D dataset, and the 3D structure of the target was then reconstructed from the corresponding RGB image to predict depth information. In [24], images with a detected hand and head were used, with each camera transmitting to a central processor over a low-power communication network. The central processor performed 3D pose reconstruction, combining the data in a probabilistic manner. In [16], state estimation with complete maximum likelihood information transfer produced 3D Cartesian estimates of target position and velocity from azimuth and elevation provided by multiple cameras. Ref. [25] addressed the competing objectives of monitoring 3D indoor structured areas against tracking multiple 3D points with a heterogeneous visual sensor network composed of both fixed and Pan-Tilt-Zoom (PTZ) controllable cameras. The measurement was the projection of a distinct 3D point representing the target onto an image plane, wherein all cameras followed the pinhole model with unit focal length. The authors applied decentralized EKF information filtering, handling neither camera network asynchrony nor the issue that in real image sequences a target occupies a volume; thus, cameras aiming at a target from different directions see quite distinct portions of the target surface and different 3D points.
None of the above methods considered frame processing delays and network asynchrony among camera nodes.
Reference [26] applied DAT2TF to an asynchronous network of 2D radars that provided local Kalman filter state estimates to a central fusion node performing global estimation of the 3D target state. While focusing on the fusion of state estimates, the authors highlight a linearized approach to determine the Cramer–Rao estimation error uncertainty bound when converting radar measurements from polar to Cartesian coordinates, and they claim the result is equivalent to decentralized filtering.

1.2. Our Proposal

To track multiple actual targets undergoing unrestricted motion in a real image sequence with a network of calibrated cameras whose poses are known, we propose the decentralized approach wherein our focus is on the fusion of 3D position measurements combined with only local estimation, without involvement of a central node. For that, we need to invert the projection function from 3D space to the image plane with stereoscopy and the known camera poses and calibration data:
Step 1: Within a time window about the local frame capture instant, transmit to and receive from neighboring cameras, asynchronously, LoS vectors represented in the world coordinate system {W} to a target centroid in the image plane, together with the corresponding time stamps [15];
Step 2: Use the unscented transform (UT) and stereovision principles to acquire 3D position measurements and covariances from the received LoS vectors when they intersect to within a 3D spatial distance threshold;
Step 3: Propagate forward or backward in time the 3D position measurements and respective covariances via interpolation until temporal alignment with the local frame capture instant;
Step 4: Fuse locally the 3D position measurements and covariances at the frame capture instant by means of averaging, use them to update the local estimate, and propagate to the next frame capture instant.
We compare our approach to SA-PF and BAF, wherein additional communication is exchanged beyond just the LoS vectors used to measure 3D positions. After the time window in Step 1, each camera starts a fusion window to receive and process the additional communications. In SA-PF, a camera transmits and receives fused 3D position measurements and respective covariances at distinct instants in the fusion window. Differently, BAF receives asynchronous information pairs' estimates at distinct instants in the fusion window.
Both BAF and SA-PF are compared to our proposal with the average root mean square error in the image planes of the cameras tracking the various targets in the scene. The comparison also aims at investigating whether there is a relevant change in performance when a camera fails to receive communications due to ongoing frame processing.

1.3. Contributions

Our innovative approach directly measures the 3D centroid position of an actual target with stereoscopy as different, calibrated cameras with known poses view distinct portions of the surface of the same target. Moreover, we explicitly dispense with the fusion node to add robustness to target tracking via decentralized processing in a camera network with partial connectivity. Our focus is on the fusion of 3D position measurements and respective covariances computed with LoS vectors received asynchronously from neighboring cameras. We also consider the image processing delay of each node when obtaining the local measurements.
We derive LoS vectors from the optical center of each camera to the various target centroids in the respective image plane with the corresponding error covariances. Then, we use stereoscopy combined with an unscented transform (UT) that employs the minimum distance search between pairs of LoS vectors to compute the corresponding 3D positions and error covariance in the world reference frame. One LoS vector is from the local camera and the other LoS vector is received asynchronously from any of the neighboring cameras tracking the same target. We handle asynchronous LoS vectors and the corresponding error covariances, all represented in the world coordinate frame, with time propagation from the received instant into alignment with the camera local capture instant. A set of 3D position measurements and covariances arise with representation in the world coordinate frame at the capture instant of the camera. Then, we fuse the measurements and covariances by means of averaging and proceed to local estimation.
We emphasize that our contribution is not in filtering, though we modified both BAF and SA-PF to add a fusion stage to test our fused 3D position measurement approach and local estimation against a further fusion step among the neighboring cameras, and thus investigate the tradeoff between performance and required communications. We find that local estimation provides similar performance with a smaller communication workload.
Moreover, in the real scenario evaluated here, namely the PETS2009 sequence [27], subjects enter and leave the fields of view of the cameras in the network, resulting in targets that merge and separate as they suddenly change motion direction in 3D space. Our approach handles such occurrences gracefully. To the best of our knowledge, no previous investigation has focused on decentralized target tracking using LoS vectors exchanged asynchronously among neighboring camera nodes with frame processing delay to obtain position measurements in 3D space from stereoscopy principles. Table 1 presents an analytical comparison with methods found in the literature.

1.4. Paper Structure

The paper is organized as follows. Section 2 formulates how to measure 3D position considering the processing delay, network asynchrony, and tracking without constraining target motion to a reference ground plane. Section 3 describes the algorithms to obtain 3D measurements and the modifications to the Bayesian BAF and SA-PF filter algorithms to include an additional fusion window for comparison with just local estimation. Section 4 presents simulation results and discusses performance, and Section 5 concludes the paper.

2. Problem Formulation

Some background is required initially:
- Instants or time steps k within a time window: integer multiples of the camera inter-frame interval; they denote the capture instants of target centroids and the instants at which LoS vectors are received from neighboring cameras during the time window about the frame capture instant k c of the local camera. They are stored in the i-th camera node;
- Frame capture instant k c : the moment when a frame is captured; it defines a time window;
- Communication among cameras is assumed to be ideal, i.e., there are no communication delays.
Unlike the fully connected network in [15,17], we investigate a partially connected network. Consider a WCN of cameras C i , i = { 1 , 2 , , N s } , where N s denotes the maximum number of sensors. This network is based on decentralized processing, meaning that each C i camera node fuses its information with the information received from the set N j , i of camera nodes C j , j i considered neighbors; i.e., N j , i is the set of cameras with which the i-th camera communicates to track targets in 3D space. The network topology is fixed and independent of the tracked target. The term decentralized indicates that each camera in the network has a communication module, with detection and tracking algorithms independent from the rest of the cameras. There is no fusion center involved. Given that we aim at 3D tracking, the state of the t-th target at the i-th camera is given by x t , k c , 3 D i = [ x t , k c i , x ˙ t , k c i , y t , k c i , y ˙ t , k c i , z t , k c i , z ˙ t , k c i ] T , where [ x t , k c i , y t , k c i , z t , k c i ]   T represents the position, and [ x ˙ t , k c i , y ˙ t , k c i , z ˙ t , k c i ]   T is the velocity at the frame capture instant k c .
The first phase in modeling this problem involves obtaining the t-th target position measurements $z_{t,k_c,3D}^{i}$ in 3D space at the i-th camera node from intersecting LoSs measured asynchronously in neighboring camera nodes. Subsequently, at each camera node, position and velocity estimation is initialized for each target. The local estimation algorithms are the Bayesian filters BAF and particle filtering (PF), though without engaging in the fusion step [15,17]. They compute the target posterior pdf $p(x_{t,k_c,3D}^{i,i} \mid z_{t,1:k_c,3D}^{i})$ based on the known predicted posterior pdf $p(\hat{x}_{t,k_c,3D}^{i} \mid z_{t,1:k_c-1,3D}^{i})$ corresponding to the previous capture instant $k_c-1$ and propagated to capture instant $k_c$, the state transition pdf $p(x_{t,k_c,3D}^{i,i} \mid \hat{x}_{t,k_c-1,3D}^{i})$, and the measurement likelihood pdf $p(z_{t,k_c,3D}^{i} \mid x_{t,k_c,3D}^{i,i})$. The second superscript in i,i refers to the local state estimate processed at the i-th camera without using information received from a neighboring camera.
We consider the frame processing delay incurred to obtain z t , k c , 3 D i and the asynchrony of neighboring camera capture instants. To this end, we borrow from [15] the concept of a time window K k i about the local camera capture instant k c , wherein LoS vectors are received from neighboring cameras. We define τ k c i as the frame processing delay to obtain the measurement z t , k c , 3 D i for the i-th camera, and α i j is the relative offset between the capture instants of neighboring cameras C i and C j .
The problems arising with false positives and false negatives in target detection were considered previously [28] and are addressed here as well. Scene complexity (quantity and speed of the targets) [29], construction parameters of the cameras, and accuracy in their calibration [30,31,32] are factors that also hinder target detection. Thus, if the detection algorithm fails to confirm the presence of the same target within a minimum capture count up to a maximum count, the target is discarded. A new identifier number (Id) is assigned, and a new initial target centroid measurement in the image plane, expressed in pixels, is generated when a novel target is detected. We assume the detection algorithm does not incur a frame processing delay [33] to obtain a target centroid measurement in the image plane; the delay comes solely from computing the position measurement z t , k c , 3 D i in 3D space with the received LoS vectors transmitted by neighboring camera nodes.
As stated earlier, the cameras fuse information from targets detected in their FoVs with information received about the targets of neighboring cameras belonging to N j , i . Recall that the LoS vector from a single pinhole camera to a target does not convey depth information for tracking in 3D space [32]. Then, subject to network asynchrony, the j-th cameras in the i-th camera neighborhood set N j , i transmit LoS vectors to distinct targets, each target being detected by at least two cameras with overlapping FoVs [30]. We use stereoscopy to measure depth along a LoS vector and determine 3D position as targets move freely. Stereoscopy requires views of the same target from at least two different directions. The LoS vector from the camera's optical center to the target centroid in the image plane is represented in the global Cartesian coordinate system by means of camera calibration and pose [31] data. In a real scenario, the same target may have very distinct portions of its surface viewed from very different directions by cameras located far apart. In such a case, stereoscopy can produce a biased 3D position measurement located somewhere inside the target surface. A distance tolerance $d_{min} \leq \delta$ is established to detect LoS proximity and ascertain that the measured 3D position is consistent with the pair of LoSs involved.
We evaluate three approaches to information exchange:
  • First: We propose to exchange among neighboring cameras LoS vector measurements L o S t , k c + τ k c i , i w to the t-th target from the i-th camera at instant k c +   τ k c i within the time window K k i period, and the data are represented in the world coordinate system {W}. The measured z t , k , 2 D i (referring to the target centroid in the image plane in pixel coordinates) is transformed into L o S t , k , i w with the camera construction parameters as described in Section 2.1. The fused 3D position measurement z t , k c , 3 D i results from averaging intersecting pairs of vector constraints r t , k , i w ,   r t , k , j w derived from LoS vectors of neighboring j-th cameras pointing at the same t-th target according to a LoS-vector 3D proximity determination algorithm. Local target state estimation uses the fused 3D position measurement z t , k c , 3 D i to update and predict the state estimate in the next frame capture instant k c + 1 .
  • Second: For comparison, exchange among cameras C i and C j , where j ∈ N j , i , the fused 3D measurement z t , k c , 3 D i for each target, transmitted and instantaneously received at instants within the fusion window with configurable size p after the K k i time window. This is used in the sequential approach in SA-PF.
  • Third: Again, for comparison, exchange among cameras C i and C j , where j ∈ N j , i , j ≠ i , the information pairs' estimates for each target, transmitted and instantaneously received at instants within the fusion window with configurable size p after the K k i time window. This is used in the batch approach in BAF.
The first approach extends the concept of stereoscopy to neighboring calibrated cameras with known poses and requires a 3D proximity metric between LoS vectors detected locally at the i-th camera and LoS information received from cameras belonging to N j , i . Note that internal temporary storage of LoS vectors L o S t , k , i w at each instant within the time window K k i period is required because the LoS vectors L o S t , k , i w at C i must be aligned temporally, through time stamps, with the vectors L o S t , k , j w received from C j . In [34], the concept of “virtual synchronization” was used along with a “dynamic” memory to store the captured data. Based on this concept, sending and receiving information from cameras are concurrent activities at instants k within the time window period.
The information vector I t , k c + τ k c i , i w exchanged between cameras conveys the LoS vector to the t-th target at the time instant k c + τ k c i , the time stamp referring to the k c + τ k c i instant and the optical center O c , i w of camera C i expressed in world coordinates {W}: I t , k c + τ k c i , i w = { k c + τ k c i ; O c , i w ; L o S t , k c + τ k c , i w } ; the information vector stored for each of the k instants within the time window K k i period is defined as i t , k , i w = { k ; O c , i w ; L o S t , k , i w } . We adopt capital I for the exchanged information vector and small i for the stored information vector.
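To make the bookkeeping concrete, the following is a minimal Python sketch of how the stored information vectors i t , k , i w and the per-target time-window buffer could be organized; the class and field names are illustrative choices of ours, not the authors' implementation.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class InfoVector:
    """Information vector {k; O_c^w; LoS^w} for one target (stored form i, exchanged form I)."""
    k: int                       # time stamp, an integer multiple of the inter-frame interval
    optical_center: np.ndarray   # O_c^w, camera optical center in the world frame {W}, shape (3,)
    los: np.ndarray              # LoS vector to the target centroid, in {W}, shape (3,)

@dataclass
class TargetBuffer:
    """Per-target storage of information vectors within the current time window K_k^i."""
    entries: list = field(default_factory=list)

    def add(self, iv: InfoVector) -> None:
        self.entries.append(iv)

    def in_window(self, k_c: int, alpha1: int, tau: int) -> list:
        # Keep only vectors whose time stamps fall inside [k_c - alpha1, k_c + tau],
        # the part of the window used to compute the 3D position measurements.
        return [iv for iv in self.entries if k_c - alpha1 <= iv.k <= k_c + tau]

    def clear(self) -> None:
        # Entries are replaced when the next time window starts.
        self.entries.clear()

buf = TargetBuffer()
buf.add(InfoVector(k=13, optical_center=np.zeros(3), los=np.array([1.0, 0.5, -0.2])))
print(len(buf.in_window(k_c=15, alpha1=3, tau=1)))   # 1
```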

2.1. Obtaining z t , k c , 3 D i

To obtain measurements in 3D space, the following steps are required, from capturing the target centroid in the image plane z t , k , 2 D i in pixels to measuring the centroid z t , k c , 3 D i in 3D space in meters.

2.1.1. Target Centroid Detection in the Image Plane in Pixels and Identification Number (Id) Assignment

For each target detected by the i-th camera, the target centroid in the image plane is represented in the image by a red dot, as shown in Figure 1. All targets are assigned an identification number (Id).
Concerning the detection algorithm, accurate data association for maintaining the assignment of an identifier to a target is a challenging problem. The “global nearest neighbor” algorithm [35], which in turn implements the Munkres algorithm [36], was used at each camera for this purpose. However, as shown in Figure 1, an Id assigned to a target by one of the cameras will not necessarily match the Id assigned to the same target by neighboring cameras. For example, in Figure 1, for the same target observed by Cameras 1 and 2, Camera 1 assigned Id = 1 while Camera 2 assigned Id = 3. This issue is addressed in Section 2.3.
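As an illustration of the data association step, the sketch below runs the Munkres (Hungarian) assignment via SciPy on a toy cost matrix of Euclidean distances between predicted track centroids and detected centroids; the gating threshold and the coordinates are arbitrary placeholders, and a full tracker would add track initiation and deletion logic.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_ids(predicted, detected, gate=50.0):
    """Match predicted track centroids to detected centroids (both in pixels).

    predicted: (N, 2) array of predicted centroid positions of existing tracks
    detected:  (M, 2) array of detected centroids in the current frame
    Assignments whose distance exceeds `gate` pixels are dropped, leaving those
    detections unmatched (candidate new Ids).
    """
    cost = np.linalg.norm(predicted[:, None, :] - detected[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)          # Munkres assignment
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= gate]

# Example: two tracks, three detections (the third one would start a new Id).
tracks = np.array([[100.0, 120.0], [300.0, 90.0]])
dets = np.array([[302.0, 93.0], [101.0, 118.0], [500.0, 400.0]])
print(associate_ids(tracks, dets))   # [(0, 1), (1, 0)]
```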
Centroid pixels in Figure 1 should be represented in the corresponding camera {C} coordinate system, and the cameras’ calibration will correct the distortions caused by camera construction parameters.

2.1.2. Camera Calibration

All cameras follow the pinhole CCD model [32]. Using the intrinsic or construction parameters of a camera, it is possible to correct its distortions and calibrate it so that the distorted pixel of the detected centroid of the t-th target, represented by point P d in Figure 2, corresponds to the corrected position in the image plane represented by point P u . The transformation ( x d , y d ) → ( x u , y u ) follows [37].
After correcting point P d to point   P u in the camera coordinate system {C}, we represent two main points in the world coordinate system {W} for each camera: the optical center point O c , i and point   P u that give rise to the LoS vector from the i-th camera to the t-th target.

2.1.3. The Optical Center O c , i and Point P u Represented in the World Coordinate System {W}: A Preamble for Obtaining the LoS Vector

Given that the extrinsic parameters of camera C i comprise its rotation R i and translation t i , the optical center O c , i , defined as the vector C ˜ i = O c , i [32] and represented in the camera coordinate system {C}, is expressed in the world coordinate system {W} as follows:
$$\{C\} \rightarrow \{W\}: \qquad O_{c,i}^{w} = t_{cw}^{i} = R_{cw}^{i}\,\tilde{C}_{i} \qquad (1)$$
Point P u is located in the image plane, as shown in Figure 2. We must first represent the pixels of P u in coordinates of the camera system {C}:
$$\{pixels\} \rightarrow \{C\}: \qquad \Big(x_u = \frac{f\,x_c}{z_c},\; y_u = \frac{f\,y_c}{z_c},\; z_c = f\Big) \;\Rightarrow\; \big(x_u = x_c,\; y_u = y_c,\; f\big) \qquad (2)$$
Denoting the LoS vector X t , c i = [ x c = x u , y c = y u , f ] as the coordinates obtained from point P u for the t-th target, now represented in coordinates of the camera system {C}, we can represent it in a vector basis with origin at the camera optical center O c , i and parallel to the world coordinate system {W} as follows:
$$\{C\} \rightarrow \{W\}: \qquad LoS_{t,k,i}^{w} = R_{cw}^{i}\,X_{t,c}^{i} \qquad (3)$$
The rotation matrix R c w i is expressed as follows:
$$R_{cw}^{i}(\theta_z,\theta_y,\theta_x) = \begin{bmatrix} \cos\theta_y\cos\theta_z & \cos\theta_z\sin\theta_x\sin\theta_y - \cos\theta_x\sin\theta_z & \sin\theta_x\sin\theta_z + \cos\theta_x\cos\theta_z\sin\theta_y \\ \cos\theta_y\sin\theta_z & \cos\theta_x\cos\theta_z + \sin\theta_x\sin\theta_y\sin\theta_z & \cos\theta_x\sin\theta_y\sin\theta_z - \cos\theta_z\sin\theta_x \\ -\sin\theta_y & \cos\theta_y\sin\theta_x & \cos\theta_x\cos\theta_y \end{bmatrix} \qquad (4)$$
This is the matrix that performs the rotation from system {W} to system {C} for each camera; its rotation order about the camera axes is z–y–x, and the angles θ z , θ y , θ x rotate from {C} to {W} [31]. Hence, it is necessary to apply the transpose operation $(\cdot)^{T}$ in Equations (1) and (3).
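A small numerical sketch of Equations (1)–(4) in Python: building the z–y–x rotation matrix, expressing the optical center in {W}, and converting a corrected image-plane point P u into a world-frame LoS vector. The angles, focal length, and optical-center values are arbitrary placeholders, and the sign/transpose conventions follow the equations as written above.

```python
import numpy as np

def rot_zyx(theta_z, theta_y, theta_x):
    """Rotation matrix for rotation order z-y-x (Equation (4))."""
    cz, sz = np.cos(theta_z), np.sin(theta_z)
    cy, sy = np.cos(theta_y), np.sin(theta_y)
    cx, sx = np.cos(theta_x), np.sin(theta_x)
    return np.array([
        [cy * cz, cz * sx * sy - cx * sz, sx * sz + cx * cz * sy],
        [cy * sz, cx * cz + sx * sy * sz, cx * sy * sz - cz * sx],
        [-sy,     cy * sx,                cx * cy],
    ])

def los_world(p_u_xy, f, R_wc, C_cam):
    """World-frame optical center and LoS vector from a corrected image point P_u.

    p_u_xy: corrected centroid (x_u, y_u) on the image plane (so z_c = f)
    f:      focal length
    R_wc:   rotation from {W} to {C}; its transpose maps {C} to {W}
    C_cam:  optical center expressed in the camera system {C}
    """
    X_tc = np.array([p_u_xy[0], p_u_xy[1], f])   # point on the image plane in {C}
    R_cw = R_wc.T                                # transpose, as noted for Equations (1) and (3)
    los = R_cw @ X_tc                            # Equation (3): LoS vector in {W}
    o_cw = R_cw @ C_cam                          # Equation (1): optical center in {W}
    return o_cw, los

R_wc = rot_zyx(np.deg2rad(30.0), np.deg2rad(-10.0), np.deg2rad(5.0))
origin, los = los_world((0.12, -0.05), f=1.0, R_wc=R_wc, C_cam=np.array([4.0, 2.0, 3.0]))
print(origin, los / np.linalg.norm(los))
```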

2.1.4. LoS Vector L o S t , k , i w and Parametric Equation of the Line

Note in Figure 3 that the LoS vector L o S t , k , i w is simply the point X t , c i multiplied by the transposed rotation matrix of the i-th camera according to Equation (3).
It should be noted that t c w i is a constant vector because it refers to the fixed position of the optical center of the i-th camera. The LoS vector, in turn, depends on the position of the target, i.e., on point X t , c i in the image plane expressed in the camera coordinate system {C}. Writing the constraint vector r t , k , i w in the form of a parametric equation [38] yields the following expression:
$$r_{t,k,i}^{w}(\lambda) = t_{cw}^{i} + \lambda\, LoS_{t,k,i}^{w}, \qquad \lambda \in \mathbb{R},\; \lambda > 1 \qquad (5)$$
Thus, the vector constraint r t , k , i w ( λ ) changes when extending the LoS vector L o S t , k , i w with the parameter λ . Figure 4 shows the LoS vector L o S t , k , i w and the points related to the vector constraint r t , k , i w ( λ ) and the optical center position t c w i for Target 1 on Camera 1. The last two vectors have their origins coinciding with the origin { 0 ; 0 ; 0 } of the world coordinate system {W}.

2.1.5. Proximity Metric between Vector Constraints Originating from LoS to a Target

Assume camera C i receives the information vector I t , k c + τ k c j , j w from one of the transmitting cameras C j . Suppose also that the observed scene contains only one target and that two synchronized cameras— C i and C j with α i j = 0 —observe it with overlapping FoVs at the same capture instant; moreover, there is no image processing delay, i.e., τ k c i = τ k c j = 0 . In these circumstances, we adopt for simplicity the subscript k to mean the capture instant k c . The vector constraints r t , k , i w and r t , k , j w are computed from the information vectors of cameras C i and C j , respectively, and can be written as follows:
$$\begin{aligned} r_{t,k,i}^{w}(\lambda) &= t_{cw}^{i} + \lambda\, LoS_{t,k,i}^{w}, & \lambda \in \mathbb{R},\; \lambda > 1 \\ r_{t,k,j}^{w}(\beta) &= t_{cw}^{j} + \beta\, LoS_{t,k,j}^{w}, & \beta \in \mathbb{R},\; \beta > 1 \end{aligned} \qquad (6)$$
Then, target triangulation must find $(\lambda^{*}, \beta^{*}) = \arg\min_{\lambda,\beta}\, d\big(r_{t,k,i}^{w}(\lambda),\, r_{t,k,j}^{w}(\beta)\big) = \arg\min_{\lambda,\beta}\, d\big(t_{cw}^{i} + \lambda\, LoS_{t,k,i}^{w},\; t_{cw}^{j} + \beta\, LoS_{t,k,j}^{w}\big)$, i.e., the parameters that minimize the distance between the vector constraints $r_{t,k,i}^{w}(\lambda)$ and $r_{t,k,j}^{w}(\beta)$. The concept of distance between skew lines is applied [38,39], and the solution is obtained by setting to zero the partial derivatives of the distance with respect to λ and β .
$$\begin{cases} \dfrac{\partial\, d\big(r_{t,k,i}^{w}(\lambda),\, r_{t,k,j}^{w}(\beta)\big)}{\partial \lambda} = 0 \\[2ex] \dfrac{\partial\, d\big(r_{t,k,i}^{w}(\lambda),\, r_{t,k,j}^{w}(\beta)\big)}{\partial \beta} = 0 \end{cases} \qquad (7)$$
Solving (7) produces the optimized nonlinear parameters expressed as follows:
$$\begin{cases} \lambda^{*} = \dfrac{\big(LoS_{t,k,i}^{w}\cdot LoS_{t,k,j}^{w}\big)\big[(t_{cw}^{i}-t_{cw}^{j})\cdot LoS_{t,k,j}^{w}\big] - \lVert LoS_{t,k,j}^{w}\rVert^{2}\big[(t_{cw}^{i}-t_{cw}^{j})\cdot LoS_{t,k,i}^{w}\big]}{\lVert LoS_{t,k,i}^{w}\rVert^{2}\,\lVert LoS_{t,k,j}^{w}\rVert^{2} - \big(LoS_{t,k,i}^{w}\cdot LoS_{t,k,j}^{w}\big)^{2}} \\[3ex] \beta^{*} = \dfrac{(t_{cw}^{i}-t_{cw}^{j})\cdot LoS_{t,k,j}^{w} + \lambda^{*}\big(LoS_{t,k,i}^{w}\cdot LoS_{t,k,j}^{w}\big)}{\lVert LoS_{t,k,j}^{w}\rVert^{2}} \end{cases} \qquad (8)$$
The operator · represents the inner product between the vectors, and || . || represents the vector 2-norm operator. Appendix A shows how to find the optimized parameters   λ *   and   β * .
Once obtained, these parameters are substituted into (6), and the 3D position at the minimum distance is given by r t , k , i w ( λ * ) . A tolerance $d_{min} \leq \delta = 0.3$ m was chosen to indicate whether the pair of LoS vectors point at the same target. Figure 5 shows LoS vectors L o S t , k , i w and L o S t , k , j w pointing at the same target.
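The closed-form solution of Equation (8) and the proximity test against δ can be sketched as below, assuming each LoS line is given by its optical-center origin and direction in {W}; the camera positions and LoS directions in the example are made-up values.

```python
import numpy as np

def triangulate_pair(o_i, los_i, o_j, los_j, delta=0.3):
    """Closest points of two LoS lines r_i(l) = o_i + l*los_i and r_j(b) = o_j + b*los_j.

    Returns (point_3d, d_min, accepted): the 3D point on r_i at lambda*, the
    minimum distance between the lines, and whether d_min <= delta (same target).
    Follows the closed-form solution of Equation (8).
    """
    w = o_i - o_j
    uu = los_i @ los_i
    vv = los_j @ los_j
    uv = los_i @ los_j
    denom = uu * vv - uv ** 2          # ~0 for (near-)parallel LoS vectors
    if abs(denom) < 1e-12:
        return None, np.inf, False
    lam = (uv * (w @ los_j) - vv * (w @ los_i)) / denom
    beta = ((w @ los_j) + lam * uv) / vv
    p_i = o_i + lam * los_i
    p_j = o_j + beta * los_j
    d_min = np.linalg.norm(p_i - p_j)
    return p_i, d_min, d_min <= delta

# Two cameras looking at a target near (5, 5, 1.7) m from different positions.
o1, los1 = np.array([0.0, 0.0, 3.0]), np.array([5.0, 5.0, -1.3])
o2, los2 = np.array([10.0, 0.0, 3.0]), np.array([-5.0, 5.05, -1.3])
print(triangulate_pair(o1, los1, o2, los2))
```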
The above description concerns information vectors exchanged synchronously. Next, we analyze asynchronous information exchange considering the processing time required for obtaining position measurements in 3D space.

2.1.6. Asynchronous Exchange of Information Vectors

Figure 6 shows the top and front views of the t-th target centroid trajectory during the time window K k i in one of the cameras, assuming k c = k 3 to be the frame capture instant. Both the optical center O c , i w and the LoS vector L o S t , k , i w that define the vector constraint r t , k , i w ( λ ) are observed in Figure 6. The frame processing delay of the detection algorithm is assumed negligible relative to that of acquiring the 3D position measurements of the targets' centroids at instants k within the time window K k i .
An example of a target centroid trajectory in the image plane expressed in world coordinates {W} is depicted in Figure 7.
Following an idea reported in [34], the information vectors i t , k , i w are stored for each k-th instant within the time window K k i , as can be observed in Buffer 1 of the i-th camera in Figure 8. They are subsequently replaced by information vectors of the next time window K k + 1 i . Note that to each detected t-th target a buffer is allocated to store its information vectors.
Information vectors are sent and received by the i-th camera within the time window about the capture instant k c , expressed as $K_{k}^{i} = [k_c - \alpha_1,\, k_c + \alpha_2]$, with $\alpha_1 > 0$ and $\alpha_2 > \tau_{k_c}^{i}$. The information vectors i t , k , i w of the t-th target in the i-th camera (Buffer 1) are aligned according to their time stamps with respect to the time stamps in the information vectors of the various targets received from the j-th neighboring cameras (Buffer 2). After the temporal alignment, C i will contain local information vectors (Buffer 1) and information vectors received from the j-th neighboring cameras (Buffer 2), as depicted in Figure 9.
After temporal synchronization of the information vectors, the proximity metric algorithm is initialized to determine the distance between the constraint pairs r t , k , i w and r t , k , j w . The 3D position measurement z t , k , 3 D i results from local constraints r t , k , i w and constraints r t , k , j w received from neighboring cameras within each time window K k i that are found to point to the same t-th target at the same instant k, when the distance between the constraint pair is less than the tolerance δ . A polynomial interpolates the 3D position measurements of the same target z t , k , 3 D i , propagating these 3D measurements backward or forward in time within the time window K k i until alignment with the i-th camera capture instant k c , and fuses these 3D measurements by averaging them, thus producing the fused 3D position measurement z t , k c , 3 D i . An internal variable N c defined for each t-th target counts the number of cameras tracking the same target for the purpose of averaging. Then follows local estimation via update with measurement z t , k c , 3 D i and time propagation to the next capture instant, either with BAF or SA-PF, without engaging in any further fusion phase and hence reducing the communication load.
For comparison, the fusion window takes place at the end of each time window K k i at time ( k c + α 2 ) , either using information pairs' estimates in BAF or fused 3D position measurements in SA-PF received asynchronously from neighboring cameras. As described in [15], each camera node knows the maximum relative offset $\alpha_{max} = \max_{i,j}\{\alpha_{ij}\}$ and the maximum and minimum processing delays $\tau_{max} = \max_{i,k_c}\{\tau_{k_c}^{i}\}$ and $\tau_{min} = \min_{i,k_c}\{\tau_{k_c}^{i}\}$. To avoid the next capture instant k c coinciding with the fusion phase of the previous time window, a separation $T = \alpha_1 + \alpha_2$ between consecutive capture instants must exist, where $\alpha_1 = \alpha_{max} - \tau_{min}$ and $\alpha_2 = \alpha_{max} + \tau_{max}$ [15]. Figure 10 shows the timeline that describes how the actions elapse within the time window K k i . The algorithm that obtains the 3D position measurements takes into account the information vectors I t , k c + τ k c i , j w exchanged within the period $[k_c - \alpha_1,\, k_c + \tau_{k_c}^{i}]$ of the time window. The estimation phase takes place at k c + τ k c i using the fused 3D position measurement z t , k c , 3 D i .
An important point addressed next is how to model the uncertainties inherent to the whole process described above and thus find the fused 3D position measurement z t , k c , 3 D i .

2.1.7. Obtaining the 3D Position Measurement at Capture Instant k c

In the initialization of the proximity metric algorithm to determine the distance between constraint pairs r t , k , i w and r t , k , j w , we model the LoS vector L o S t , k , i w with the “Unscented Transform” (UT) [19,40] to approximate the statistics of the random LoS variables transformed by the nonlinear function that determines depth from stereo. The uncertainty in the image plane is a three-dimensional Gaussian distribution N ( μ x , C x ) that also accounts for camera calibration error in the focal length f , where μ x is the mean, and C x is the covariance. This Gaussian distribution defines seven sigma points [40]. Figure 11 shows the 3D sigma points along the three axes of the i-th camera due to uncertainty in the detected centroid location in the image plane that gives rise to the LoS vector L o S t , k , i w = R c w i X t , c i represented in the world coordinate system {W}.
Given that the dimension of X t , c i is L = 3 , then 2 L + 1 = 7 sigma points in three dimensions are generated in the image plane and along the optical axis, as shown in Figure 12. Consequently, seven constraints r t , k , i w are found for the same t-th target. The sigma points in the image plane are generated for each target from information that camera C i either detects locally or receives from a neighboring camera C j . The mean μ x of the Gaussian distribution is the LoS vector L o S t , k , i w = R c w i X t , c i shown in Figure 11, represented by a red arrow.
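The sigma-point generation and the propagation of the image-plane uncertainty through a nonlinear mapping can be sketched with a generic unscented transform as below; the Gaussian parameters are placeholders, and the normalized world-frame LoS direction is used here only as a simple stand-in for the stereo-depth nonlinearity of the paper.

```python
import numpy as np

def ut_sigma_points(mu, P, alpha=1.0, beta=2.0, kappa=0.0):
    """Generate 2L+1 sigma points and UT weights for N(mu, P) (L = 3 gives 7 points)."""
    L = mu.size
    lam = alpha ** 2 * (L + kappa) - L
    S = np.linalg.cholesky((L + lam) * P)           # matrix square root; columns used below
    pts = [mu] + [mu + S[:, i] for i in range(L)] + [mu - S[:, i] for i in range(L)]
    wm = np.full(2 * L + 1, 1.0 / (2 * (L + lam)))
    wc = wm.copy()
    wm[0] = lam / (L + lam)
    wc[0] = lam / (L + lam) + (1.0 - alpha ** 2 + beta)
    return np.array(pts), wm, wc

def unscented_transform(mu, P, g):
    """Approximate the mean and covariance of y = g(x) for x ~ N(mu, P)."""
    pts, wm, wc = ut_sigma_points(mu, P)
    ys = np.array([g(p) for p in pts])
    mean = wm @ ys
    diff = ys - mean
    cov = (wc[:, None] * diff).T @ diff
    return mean, cov

# Image-plane centroid (x_u, y_u) plus the focal length f, with pixel/calibration noise.
mu_x = np.array([0.12, -0.05, 1.0])
C_x = np.diag([2e-4, 2e-4, 1e-5])
R_cw = np.eye(3)                                     # placeholder camera-to-world rotation
g = lambda x: (R_cw @ x) / np.linalg.norm(R_cw @ x)  # unit LoS direction in {W} (nonlinear map)
mean, cov = unscented_transform(mu_x, C_x, g)
print(mean, np.diag(cov))
```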
The proximity metric determines the distance (a nonlinear function) between each candidate constraint pair, i.e., r t , k , i w and r t , k , j w (amounting to a total of 49 candidate pairs, of which only the 7 points with minimum distance are considered for each constraint pair). Any point in 3D space along a constraint r t a , k , i w whose distance to another candidate constraint r t b , k , j w falls within the proximity tolerance $d_{min} \leq \delta$ indicates that targets t a seen by C i and t b seen by C j are the same target, and a set of points in 3D space is thus obtained. Figure 13 shows the result of proximity determination between a constraint to a target detected locally by Camera 1 and each of the constraints to targets received in information vectors from neighboring cameras. In this case, Cameras 2, 3, 4, and 5 send constraints r t , k , j w to Camera 1 within the time window K k 1 , and some of them are relative to the same Target 1.
Each set of 3D points, with the corresponding UT mean and covariance ( μ f k , C f k ) after averaging, refers to a time instant within the time window K k i . Notice that more than one such set may relate to the same instant k when the UT processes simultaneous constraints that originate in information vectors transmitted by n neighboring cameras tracking the same target at instant k. The representative point in 3D space and respective covariance at instant k, μ f k and C f k , are the averages over the n UT means and covariances μ j k , C j k , j ∈ N j , i , of the simultaneous sets of points in 3D space at instant k.
The final 3D mean, μ f k , averages over the n simultaneous UT means μ 1 k , , μ n k in 3D space at instant k as follows:
$$\mu_{f}^{k} = \frac{\mu_{1}^{k} + \mu_{2}^{k} + \cdots + \mu_{n}^{k}}{n} \qquad (9)$$
Likewise, C f k averages over the simultaneous UT covariances $C_{1}^{k}, \ldots, C_{n}^{k}$. Figure 14 depicts this situation, where five UT means are obtained at instant k = 13, one at instant k = 14, and one at instant k = 15. The result is the average of the UT means at each time instant k, as depicted in Figure 15. The corresponding UT covariance average is not shown in Figure 15.
Thus, 3D points are measured at each instant k within the period $[k_c - \alpha_1,\, k_c + \tau_{k_c}^{i}]$ of the time window K k i . Two conditions must be met for these measurements to be generated: (1) at least one information vector I t , k c + τ k c j , j w is received from a neighboring camera at some instant k within the time window period $[k_c - \alpha_1,\, k_c + \tau_{k_c}^{i}]$, and (2) this information vector relates to the same target observed at the i-th camera. These two conditions are not always met; in that case, the 3D position of the target centroid is not measured at all instants within the aforementioned period. At least two 3D position measurements at different instants within the time window period $[k_c - \alpha_1,\, k_c + \tau_{k_c}^{i}]$ are necessary to derive a linear polynomial and interpolate the target 3D position measurements forward or backward in time until the capture instant k c to compute the fused 3D position measurement z t , k c , 3 D i . This interpolation process occurs at the instant k c + τ k c i , and z t , k c , 3 D i is the average of the 3D UT means measured within the period $[k_c - \alpha_1,\, k_c + \tau_{k_c}^{i}]$ and interpolated in time until the capture instant k c . The corresponding fused covariance C t , k c , 3 D i is the average of the UT covariances interpolated in time until the capture instant.
In essence, our approach provides the fused 3D position measurement z t , k c , 3 D i and respective covariance C t , k c , 3 D i from the following steps:
(1) The 3D position measurements z t , k , 3 D i and covariances C t , k , 3 D i are output by the i-th camera local proximity algorithm at each instant k within the period $[k_c - \alpha_1,\, k_c + \tau_{k_c}^{i}]$ if the two conditions above are met;
(2) Then, a linear polynomial interpolation propagates z t , k , 3 D i and C t , k , 3 D i backward or forward in time until alignment with the i-th camera local capture instant k c ;
(3) Finally, the fused 3D position measurements z t , k c , 3 D i and covariances C t , k c , 3 D i result from averaging the interpolated measurements and covariances at capture instant k c .
A remark is that when just one 3D position measurement is generated at an instant that differs from k c , the measurement is discarded because there is no other point to derive a linear interpolating polynomial, and the next time window is awaited. By contrast, when just a single 3D position measurement occurs at frame capture instant k c , this measurement is accepted as z t , k c , 3 D i with the corresponding covariance C t , k c , 3 D i .
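One reasonable reading of steps (1)–(3) above is sketched below: a degree-1 polynomial is fitted per coordinate to the time-tagged UT means (and, entry-wise, to the covariances) and evaluated at the capture instant k c ; the function name and the numerical values are illustrative only.

```python
import numpy as np

def fuse_at_capture_instant(ks, means, covs, k_c):
    """Align time-tagged 3D measurements with the capture instant k_c and fuse them.

    ks:    instants k within [k_c - alpha_1, k_c + tau] at which 3D measurements exist
    means: list of (3,) UT means, one per instant in ks
    covs:  list of (3, 3) UT covariances, one per instant in ks
    Returns (z_fused, C_fused), or None when a single measurement exists at an
    instant other than k_c (it is then discarded, as described above).
    """
    ks = np.asarray(ks, dtype=float)
    means = np.asarray(means)
    covs = np.asarray(covs)
    if len(ks) == 1:
        return (means[0], covs[0]) if ks[0] == k_c else None
    # Degree-1 (linear) polynomial fitted per coordinate and evaluated at k_c:
    z_fused = np.array([np.polyval(np.polyfit(ks, means[:, d], 1), k_c) for d in range(3)])
    # Covariance entries are aligned entry-wise in the same way; symmetry is preserved
    # because the input covariances are symmetric.
    C_fused = np.array([[np.polyval(np.polyfit(ks, covs[:, r, c], 1), k_c)
                         for c in range(3)] for r in range(3)])
    return z_fused, C_fused

ks = [13, 14, 15]
means = [np.array([2.0, 1.0, 1.7]), np.array([2.1, 1.2, 1.7]), np.array([2.2, 1.4, 1.7])]
covs = [0.05 * np.eye(3)] * 3
print(fuse_at_capture_instant(ks, means, covs, k_c=16))
```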
The fused 3D position measurement z t , k c , 3 D i and covariance C t , k c , 3 D i at instant k c are the required information for our decentralized approach with either the local estimation phase at instant k c + τ k c i or further fusion-based estimation with BAF or SA-PF at instant k c + α 2 .

2.2. Dynamic Target Model: Linear Gaussian Model

The discrete-time dynamic model for each target [41] at each camera is given by
$$x_{t,k_c+\Delta k_c,3D}^{i} = F(k_c,\, k_c+\Delta k_c)\, x_{t,k_c,3D}^{i} + w(k_c,\, k_c+\Delta k_c) \qquad (10)$$
where F ( k c , k c + Δ k c ) is the state transition matrix from instant k c to k c + Δ k c , defined as
$$F(k_c,\, k_c+\Delta k_c) = \begin{bmatrix} 1 & \Delta k_c & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & \Delta k_c & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & \Delta k_c \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix} \qquad (11)$$
and w ( k c , k c + Δ k c ) is the cumulative noise between time instants k c and k c + Δ k c , which is considered to be Gaussian with zero mean and covariance Q ( k c ,   k c + Δ k c ) expressed as follows:
$$Q(k_c,\, k_c+\Delta k_c) = \sigma_u^2 \begin{bmatrix} \frac{\Delta k_c^3}{3} & \frac{\Delta k_c^2}{2} & 0 & 0 & 0 & 0 \\ \frac{\Delta k_c^2}{2} & \Delta k_c & 0 & 0 & 0 & 0 \\ 0 & 0 & \frac{\Delta k_c^3}{3} & \frac{\Delta k_c^2}{2} & 0 & 0 \\ 0 & 0 & \frac{\Delta k_c^2}{2} & \Delta k_c & 0 & 0 \\ 0 & 0 & 0 & 0 & \frac{\Delta k_c^3}{3} & \frac{\Delta k_c^2}{2} \\ 0 & 0 & 0 & 0 & \frac{\Delta k_c^2}{2} & \Delta k_c \end{bmatrix}, \qquad (12)$$
where σ u 2 is the variance of process noise in the nearly-constant-velocity model [41]. The measurement model for the i-th camera is given by z t , k c , 3 D i = H x t , k c , 3 D i + ν i , where ν i is the measurement noise vector with dimensions 3 × 1, which is assumed to be Gaussian with zero mean and covariance matrix R i . The latter is the fused covariance matrix C t , k c , 3 D i resulting from interpolation into alignment with frame capture instant k c and averaging as described in Section 3.1, and H is given by
$$H = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \end{bmatrix} \qquad (13)$$
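For reference, the matrices of Equations (11)–(13) can be built as follows; the values of Δ k c and σ u 2 passed in the example are placeholders.

```python
import numpy as np

def ncv_matrices(dt, sigma_u2):
    """State-transition F, process-noise Q, and measurement H for the 3D
    nearly-constant-velocity model of Equations (11)-(13); dt is Delta k_c."""
    f_blk = np.array([[1.0, dt],
                      [0.0, 1.0]])
    q_blk = sigma_u2 * np.array([[dt ** 3 / 3.0, dt ** 2 / 2.0],
                                 [dt ** 2 / 2.0, dt]])
    F = np.kron(np.eye(3), f_blk)          # block-diagonal over the x, y, z axes
    Q = np.kron(np.eye(3), q_blk)
    H = np.zeros((3, 6))
    H[0, 0] = H[1, 2] = H[2, 4] = 1.0      # select positions from [x, xdot, y, ydot, z, zdot]
    return F, Q, H

F, Q, H = ncv_matrices(dt=1.0, sigma_u2=0.5)
print(F.shape, Q.shape, H.shape)
```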

2.3. Target Information Fusion within the Fusion Window [15]

Our proposal is the use of fused 3D position measurements derived from stereoscopy in combination with local state estimation to track actual moving subjects undergoing free 3D motion within the fields of view of asynchronous, calibrated cameras with known poses and located far apart. In certain circumstances, very distinct parts of a target surface can be viewed from very different viewing directions; thus, the 3D centroid position of such a target lies at some point inside and not on the viewed surfaces. Local estimation uses either the Kalman filter in information form or particle filtering. Then we test additional fusion with either BAF or SA-PF for comparison with just local estimation and investigate the tradeoff with the additional communication load.
For that purpose, the SA-PF and BAF algorithms are modified as follows. After the local estimation at the i-th camera with the fused 3D position measurement computed from the asynchronous LoS vectors received within the time window K k i , further fusion occurs with the additional content received asynchronously from the neighboring cameras within a fusion window of configurable size p. This additional content being exchanged is either the information pair ( y ˜ t , k c , 3 D i , i , Y ˜ t , k c , 3 D i , i ) in BAF or the fused 3D position measurement and covariance ( z t , k c , 3 D i , C t , k c , 3 D i ) in SA-PF. Figure 16 represents the actions occurring in two time windows in sequence. The motivation to use SA-PF is that it is not limited to the linear Gaussian assumption of the information filter.
We then investigate whether an improvement occurs with the additional communication exchange in the fusion window. Define the transmit and reception instants, denoted here $k_c'$ and $k_c''$, respectively, at which camera C i exchanges with the neighboring cameras C j either the information pair in BAF or the fused 3D position measurement and covariance in SA-PF. Fusion with the additional communication exchange in BAF and SA-PF proceeds as follows:
(1) BAF fusion: The t-th target information pair is propagated from the frame capture instant k c to the transmit instant $k_c'$, thus yielding ($\widetilde{y}_{t,k_c',3D}^{i,i}$, $\widetilde{Y}_{t,k_c',3D}^{i,i}$) for transmission by C i . Define $k_c''$ to be the reception instant at which camera C i receives an information pair estimate ($y_{t,k_c'',3D}^{i,j}$, $Y_{t,k_c'',3D}^{i,j}$) in the fusion window. We use the Euclidean distance applied to the corresponding 3D centroid positions to indicate whether the target is the same. Then, all the received information pairs' estimates are propagated backward in time to k c for Kullback–Leibler Average (KLA) fusion with the local i-th camera information pair estimate ($y_{t,k_c,3D}^{i,i}$, $Y_{t,k_c,3D}^{i,i}$), and the result is propagated to the next frame capture instant k c + 1 .
(2) SA-PF fusion: As stated before, C i sends the fused 3D measurement z t , k c , 3 D i and covariance C t , k c , 3 D i at the transmit instant $k_c'$ to the neighboring cameras C j without time propagation from k c to $k_c'$. At various reception instants $k_c''$, C i receives ( z t , k c , 3 D j , C t , k c , 3 D j ) from the neighboring cameras, only once from each C j . Here as well, we use the Euclidean distance applied to the 3D position measurements to indicate whether the target is the same. Fusion occurs in this filter by propagating the particles' state estimates from the capture instant k c to the distinct reception instants $k_c''$ and sequentially updating the respective weights with ( z t , k c , 3 D j , C t , k c , 3 D j ). At the end of the fusion window, C i updates the state estimate at k c and propagates it to the next frame capture instant k c + 1 .

3. Asynchronous Bayesian State Estimation

Figure 16 shows how we changed the timeline of the Batch Asynchronous Filter (BAF)-Lower Load Mode (LLM) [15] and the Sequential Asynchronous Particle Filter (SA-PF) [17] to accommodate further fusion. Our purpose here is to investigate whether improved target tracking across the asynchronous camera network results from further communication other than just exchanging LoS vectors among neighboring cameras to compute fused 3D position measurements and then use local estimation.
Neighboring cameras transmit either the information pairs' estimates (BAF) or the fused 3D position measurement and covariance (SA-PF) at the transmit instant $k_c'$ at the end of the time window K k i , and a fusion window of configurable duration p is introduced after K k i . At reception instants $k_c''$ within the fusion window, neighboring cameras asynchronously receive either information pairs' estimates (BAF) or fused 3D position measurements and covariances (SA-PF).
Relative to local estimation with fused 3D position measurements, we investigate (a) the KLA-based fusion of received information pairs’ estimates temporally aligned in batch with the local frame capture instant k c (BAF), or (b) the sequential update of particles’ weights by sampling from a likelihood density function with the received fused 3D position measurements and covariances (SA-PF).
The following Algorithm 1 describes the steps to investigate the effect of additional communications and fusion other than exchanging LoS vectors among neighboring cameras for computing fused 3D position measurements:
Algorithm 1. Algorithm to investigate the effect on target tracking of further fusion beyond the exchanged LoS vectors and local estimation
Input: I–images taken by camera C i
Output: estimated state x t , k c , 3 D i and the root mean square error computation ϵ t , k c , 2 D i
for r Monte-Carlo simulation runs
for each camera
1-If t-th target detection is confirmed after three occurrences in images, its position estimate   z ^ t , k c , 3 D i is computed as follows:
for each detected target
2-Detect target centroids in image plane z t , k , 2 D i within the time window period K k i ;
3-Transform measurements z t , k , 2 D i into stored information vectors i t , k , i w = { k ; O c , i w ; L o S t , k , i w } ;
4-Asynchronous transmission/reception of information vectors I t , k c + τ k c i , i w = { k c + τ k c i ; O c , i w ; L o S t , k c + τ k c , i w } among cameras C i and C j , j ∈ N j , i , i ≠ j ;
5-Compute 3D position measurements z t , k , 3 D i and covariances C t , k , 3 D i within the time window K k i from received LoS vectors, and interpolate all of them into temporal alignment with the frame capture instant k c to account for frame processing delay and camera asynchrony;
6-Local estimation of the target position and velocity in 3D space x t , k c , 3 D i ;
7-Investigate the added fusion phase with either the BAF or the SA-PF: asynchronously transmit/receive information pairs' estimates (BAF) or fused 3D position measurements (SA-PF) among cameras C i and C j , j ∈ N j , i , j ≠ i ;
8- z ^ t , k c , 3 D i is the resulting 3D position component of the target state estimate;
9-Perspective projection of the 3D position estimate of each target being tracked onto the 2D image planes of each neighboring camera, expressed in pixel coordinates: z ^ t , k c , 3 D i z ^ t , k c , 2 D i ;
end for;
end for;
10-Average root mean square error $\epsilon_{t,k_c,2D}^{i} = \lVert \hat{z}_{t,k_c,2D,r}^{i} - z_{t,k_c,2D}^{i} \rVert_2^2$ over the Monte Carlo runs to compare performance in steps 6 and 7 (a code sketch of this projection and error computation follows the algorithm).
end for.
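Steps 9–10 (perspective projection of the 3D estimate onto each camera's image plane and the squared pixel error) can be sketched as follows; the simplified pinhole intrinsics (focal length in pixels and principal point) and all numerical values are illustrative assumptions rather than the calibration used in the paper.

```python
import numpy as np

def project_to_pixels(X_w, R_wc, C_w, f, c_xy):
    """Perspective projection of a 3D world point onto a camera image plane (pixels).

    X_w:  3D point in {W};  R_wc: rotation {W}->{C};  C_w: optical center in {W}
    f:    focal length in pixels;  c_xy: principal point (c_x, c_y) in pixels
    """
    X_c = R_wc @ (X_w - C_w)                 # point expressed in the camera frame {C}
    u = f * X_c[0] / X_c[2] + c_xy[0]
    v = f * X_c[1] / X_c[2] + c_xy[1]
    return np.array([u, v])

def squared_pixel_error(est_3d, truth_2d, R_wc, C_w, f, c_xy):
    """Squared 2-norm between the projected 3D estimate and the 2D reference (step 10);
    averaging over Monte Carlo runs and taking the square root gives the RMSE."""
    err = project_to_pixels(est_3d, R_wc, C_w, f, c_xy) - truth_2d
    return float(err @ err)

R_wc = np.eye(3)
print(squared_pixel_error(np.array([1.0, 0.5, 6.0]), np.array([420.0, 320.0]),
                          R_wc, np.zeros(3), f=800.0, c_xy=(400.0, 300.0)))
```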

3.1. Batch Asynchronous Filter (BAF)-Lower Load Mode (LLM) [15]

The initialization of each target involves the fused 3D position measurement z t , k c , 3 D i and respective covariance C t , k c , 3 D i . Each camera C i uses Kalman filtering in information form [7]. The fused 3D position measurement pair ( z t , k c , 3 D i , C t , k c , 3 D i ) is transformed into the measurement information pair $\big(H^{T} (C_{t,k_c,3D}^{i})^{-1} z_{t,k_c,3D}^{i},\; H^{T} (C_{t,k_c,3D}^{i})^{-1} H\big)$.
Local estimation uses the measurement information pair and the information pair ( y ^ t , k c , 3 D i , Y ^ t , k c , 3 D i ) predicted from the previous frame capture instant to locally update the information pair of the t-th target at camera C i to ( y t , k c , 3 D i , i , Y t , k c , 3 D i , i ), where $y_{t,k_c,3D}^{i,i} = \hat{y}_{t,k_c,3D}^{i} + H^{T} (C_{t,k_c,3D}^{i})^{-1} z_{t,k_c,3D}^{i}$ and $Y_{t,k_c,3D}^{i,i} = \hat{Y}_{t,k_c,3D}^{i} + H^{T} (C_{t,k_c,3D}^{i})^{-1} H$. Then, $\hat{z}_{t,k_c,3D}^{i} = H\, (Y_{t,k_c,3D}^{i,i})^{-1} y_{t,k_c,3D}^{i,i}$ is the 3D position estimate of the target for later use to evaluate the performance of step 6 of Algorithm 1 with the average root mean square error in the image plane.
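A minimal numerical sketch of this information-form update, with toy values for the predicted pair and the fused measurement; it mirrors the equations above but is not the authors' implementation.

```python
import numpy as np

def information_update(y_pred, Y_pred, z, C, H):
    """Information-form update with the fused 3D measurement pair (z, C).

    (y_pred, Y_pred): information pair predicted to the capture instant k_c
    Returns the updated pair (y, Y) and the 3D position estimate H Y^{-1} y.
    """
    C_inv = np.linalg.inv(C)
    y = y_pred + H.T @ C_inv @ z             # y = y_hat + H^T C^{-1} z
    Y = Y_pred + H.T @ C_inv @ H             # Y = Y_hat + H^T C^{-1} H
    z_hat = H @ np.linalg.solve(Y, y)        # 3D position component of the state estimate
    return y, Y, z_hat

# Toy numbers for a 6-state nearly-constant-velocity model.
H = np.zeros((3, 6)); H[0, 0] = H[1, 2] = H[2, 4] = 1.0
P_pred = np.diag([1.0, 0.5, 1.0, 0.5, 1.0, 0.5])
x_pred = np.array([2.0, 0.1, 1.0, 0.0, 1.7, 0.0])
Y_pred = np.linalg.inv(P_pred)
y_pred = Y_pred @ x_pred
z, C = np.array([2.2, 1.1, 1.65]), 0.04 * np.eye(3)
print(information_update(y_pred, Y_pred, z, C, H)[2])
```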
To investigate the added fusion phase in step 7 of Algorithm 1, for the purpose of transmission, C i predicts the information pair at the transmit instant $k_c'$ within the fusion window illustrated in Figure 16 as follows:
$$\begin{aligned} \widetilde{Y}_{t,k_c',3D}^{i,i} &= \Big(F(k_c, k_c')\, \big(Y_{t,k_c,3D}^{i,i}\big)^{-1} F(k_c, k_c')^{T} + Q(k_c, k_c')\Big)^{-1} \\ \widetilde{y}_{t,k_c',3D}^{i,i} &= \widetilde{Y}_{t,k_c',3D}^{i,i}\; F(k_c, k_c')\, \big(Y_{t,k_c,3D}^{i,i}\big)^{-1} y_{t,k_c,3D}^{i,i} \end{aligned} \qquad (14)$$
C i transmits the predicted information pair ($\widetilde{y}_{t,k_c',3D}^{i,i}$, $\widetilde{Y}_{t,k_c',3D}^{i,i}$) to all cameras j ∈ N j , i . Furthermore, C i receives at the local reception instant $k_c''$ within the fusion window the information pair ( y t , k c , 3 D j , j , Y t , k c , 3 D j , j ) from a neighboring camera C j , as depicted in Figure 16. Camera C i has no knowledge of the frame capture instant of the neighboring camera C j . Therefore, C i reads the information pair received from C j as ($y_{t,k_c'',3D}^{i,j}$, $Y_{t,k_c'',3D}^{i,j}$), which undergoes propagation backward in time to the local frame capture instant k c :
$$\begin{aligned} \widetilde{Y}_{t,k_c,3D}^{i,j} &= \Big(F(k_c'', k_c)\, \big(Y_{t,k_c'',3D}^{i,j}\big)^{-1} F(k_c'', k_c)^{T} + Q(k_c'', k_c)\Big)^{-1} \\ \widetilde{y}_{t,k_c,3D}^{i,j} &= \widetilde{Y}_{t,k_c,3D}^{i,j}\; F(k_c'', k_c)\, \big(Y_{t,k_c'',3D}^{i,j}\big)^{-1} y_{t,k_c'',3D}^{i,j} \end{aligned} \qquad (15)$$
Fusion of local and received information pairs at camera C i for each target at instant k c is obtained with the Kullback–Leibler Average (KLA) [20] as follows:
$$\begin{aligned} \widehat{Y}_{t,k_c,3D}^{i} &= \frac{1}{n(N_{j,i})+1}\Big(Y_{t,k_c,3D}^{i,i} + \sum_{j \in N_{j,i}}^{n(N_{j,i})} \widetilde{Y}_{t,k_c,3D}^{i,j}\Big) \\ \widehat{y}_{t,k_c,3D}^{i} &= \frac{1}{n(N_{j,i})+1}\Big(y_{t,k_c,3D}^{i,i} + \sum_{j \in N_{j,i}}^{n(N_{j,i})} \widetilde{y}_{t,k_c,3D}^{i,j}\Big) \end{aligned} \qquad (16)$$
where n ( N j , i ) indicates the number of neighboring cameras in the set N j , i whose information pairs of the t-th target are received within the fusion window of the i-th camera. Then, $\breve{z}_{t,k_c,3D}^{i} = H\, (\widehat{Y}_{t,k_c,3D}^{i})^{-1} \widehat{y}_{t,k_c,3D}^{i}$ is the fused 3D position estimate of the target in step 7 of Algorithm 1 for later use to evaluate performance with the average root mean square error in the image plane.
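The information-pair prediction of Equations (14)/(15) and the KLA fusion of Equation (16) reduce to the following sketch; the nearly-constant-velocity matrices and the toy information pairs are placeholders.

```python
import numpy as np

def predict_information_pair(y, Y, F, Q):
    """Propagate an information pair through the motion model (Equations (14)/(15))."""
    P = np.linalg.inv(Y)
    x = P @ y
    P_new = F @ P @ F.T + Q
    Y_new = np.linalg.inv(P_new)
    y_new = Y_new @ (F @ x)
    return y_new, Y_new

def kla_fuse(local_pair, received_pairs):
    """Kullback-Leibler Average of the local and received information pairs (Equation (16))."""
    y_i, Y_i = local_pair
    n = len(received_pairs)
    Y_f = (Y_i + sum(Y for _, Y in received_pairs)) / (n + 1)
    y_f = (y_i + sum(y for y, _ in received_pairs)) / (n + 1)
    return y_f, Y_f

def ncv(dt, s2=0.5):
    """Nearly-constant-velocity F and Q (Section 2.2), repeated here for self-containment."""
    fb = np.array([[1.0, dt], [0.0, 1.0]])
    qb = s2 * np.array([[dt ** 3 / 3, dt ** 2 / 2], [dt ** 2 / 2, dt]])
    return np.kron(np.eye(3), fb), np.kron(np.eye(3), qb)

F, Q = ncv(1.0)
Y_local = np.eye(6)
y_local = Y_local @ np.array([2.0, 0.1, 1.0, 0.0, 1.7, 0.0])
# A toy received pair, first aligned with the local capture instant via the model:
y_rx, Y_rx = predict_information_pair(y_local + 0.05, np.eye(6), F, Q)
print(kla_fuse((y_local, Y_local), [(y_rx, Y_rx)])[0])
```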
We define as strategy (a) the local estimation in step 6, with exchange of just the information vectors I t , k c + τ k c i , i w conveying LoS vectors. This strategy predicts from the updated pair ($y_{t,k_c,3D}^{i,i}$, $Y_{t,k_c,3D}^{i,i}$) to ($\hat{y}_{t,k_c+1,3D}^{i}$, $\hat{Y}_{t,k_c+1,3D}^{i}$) at the next capture instant k c + 1 and computes the next measurement information pair $\big(H^{T} (C_{t,k_c+1,3D}^{i})^{-1} z_{t,k_c+1,3D}^{i},\; H^{T} (C_{t,k_c+1,3D}^{i})^{-1} H\big)$ based on the fused 3D position measurement and covariance, which we compute with the LoS vectors received in the next time window K k + 1 i .
For comparison, we define strategy (b) in step 7 as adding KLA-based fusion with information pairs' estimates received from neighboring cameras within the fusion window, beyond the LoS vectors received within K k i . This strategy predicts from the KLA-fused information pair ($\hat{y}_{t,k_c,3D}^{i}$, $\hat{Y}_{t,k_c,3D}^{i}$) to ($\hat{y}_{t,k_c+1,3D}^{i}$, $\hat{Y}_{t,k_c+1,3D}^{i}$) at the next capture instant k c + 1 and, as in strategy (a), computes the next measurement information pair $\big(H^{T} (C_{t,k_c+1,3D}^{i})^{-1} z_{t,k_c+1,3D}^{i},\; H^{T} (C_{t,k_c+1,3D}^{i})^{-1} H\big)$ from the fused 3D position measurement and covariance obtained with the LoS vectors received in the next time window K k + 1 i . For clarification, see Figure 17.
The position estimates of the t-th target resulting from strategies (a) and (b) at camera $C_i$ at frame capture instant $k_c$ are, respectively, $\hat{z}_{t,k_c,3D}^{i} = H (Y_{t,k_c,3D}^{i,i})^{-1} y_{t,k_c,3D}^{i,i}$ and $\check{z}_{t,k_c,3D}^{i} = H (\hat{Y}_{t,k_c,3D}^{i})^{-1} \hat{y}_{t,k_c,3D}^{i}$.

3.2. Sequential Asynchronous Particle Filter [17]

The initial step for obtaining a fused 3D position measurement pair $(z_{t,k_c,3D}^{i}, C_{t,k_c,3D}^{i})$ at each camera $C_i$ is the same as previously described. Each camera has the same inter-frame period $T = \alpha_1 + \alpha_2$ between consecutive frame capture instants, as depicted in Figure 10. The motion model of the t-th target is the same as the one in Equation (10).
Define $k_c^-$ as the previous frame capture instant ($k_c^- = k_c - 1$) and $m = 1, \ldots, n(N_{j,i})$, where we recall that $n(N_{j,i})$ indicates the number of neighboring cameras $C_j$, $j \in N_{j,i}$, whose fused 3D position measurements $z_{t,k_c'',3D}^{j_m}$ of the t-th target are received within the fusion window of the i-th camera. Each camera $C_i$ receives the measurements $z_{t,k_c'',3D}^{j_m}$ at distinct instants $k_c''$ within the fusion window and computes the corresponding state estimates $\tilde{x}_{t,k_c'',3D}^{i,j_m}$. The updated state estimate in $C_i$ is then computed from $\tilde{x}_{t,k_c'',3D}^{i,j_m}$, $j_m \in \{j_1, j_2, \ldots, j_{n(N_{j,i})}\}$, where $j_0$ denotes camera $C_i$ itself. As an example, let camera 2 ($i = 2$) receive measurements from $n(N_{j,i}) = 3$ neighboring cameras, say, cameras 5, 7, and 9; then $m = 1, 2, 3$, $N_{j,2} = \{j_1, j_2, j_3\} = \{5, 7, 9\}$, and $j_0 = 2$.
In the prediction process of SA-PF, new particles $x_{t,k_c,3D}^{i,(l)}$ sample the state space from a proposal pdf $q(x_{t,k_c,3D}^{i,(l)} \mid x_{t,k_c^-,3D}^{i,(l)}, z_{t,k_c,3D}^{i})$ given the set of previous particles $x_{t,k_c^-,3D}^{i,(l)}$ and weights $\omega_{t,k_c^-}^{i,(l)}$, and a normalization yields weights $\hat{\omega}_{t,k_c}^{i,(l)}$ associated with the particles $x_{t,k_c,3D}^{i,(l)}$. Then follows the update process: when $C_i$ asynchronously receives a measurement $z_{t,k_c'',3D}^{j_m}$ from camera $C_{j_m}$ in the fusion window, the predicted particles $\tilde{x}_{t,k_c'',3D}^{i,j_m,(l)}$ are computed, and the corresponding weights $\hat{\omega}_{t,k_c}^{i,j_m,(l)}$ are updated sequentially using the measurements $z_{t,k_c'',3D}^{j_m}$ and the particles $\tilde{x}_{t,k_c'',3D}^{i,j_m,(l)}$ in the likelihood function. After receiving the measurement from the last of the neighboring cameras within the fusion window, the updated state estimate $x_{t,k_c,3D}^{i}$ is computed with the set of weights $\omega_{t,k_c}^{i,(l)}$ and particles $x_{t,k_c,3D}^{i,(l)}$.
Prediction. At $k_c = 0$, $L$ particles are initialized as $x_{t,0,3D}^{i,(l)}$, $l = 1, \ldots, L$, sampled from the known pdf $p(x_{t,0,3D}^{i})$. The weights are initialized equally as $\omega_{t,0}^{i,(l)} = 1/L$ for all $l$.
At instants $k_c \geq 1$, with $z_{t,k_c,3D}^{i}$ previously obtained, new particles $x_{t,k_c,3D}^{i,(l)}$ are propagated from the previous particles $x_{t,k_c^-,3D}^{i,(l)}$ with the deterministic part of the motion model in Equation (10). The corresponding weights $\tilde{\omega}_{t,k_c}^{i,(l)}$ use the fused 3D position measurement $z_{t,k_c,3D}^{i}$:
$$\tilde{\omega}_{t,k_c}^{i,(l)} = \omega_{t,k_c^-}^{i,(l)} \frac{p(z_{t,k_c,3D}^{i} \mid x_{t,k_c,3D}^{i,(l)})\; p(x_{t,k_c,3D}^{i,(l)} \mid x_{t,k_c^-,3D}^{i,(l)})}{q(x_{t,k_c,3D}^{i,(l)} \mid x_{t,k_c^-,3D}^{i,(l)}, z_{t,k_c,3D}^{i})} = \omega_{t,k_c^-}^{i,(l)}\, p(z_{t,k_c,3D}^{i} \mid x_{t,k_c,3D}^{i,(l)})$$
The above weights are normalized according to $\hat{\omega}_{t,k_c}^{i,(l)} = \tilde{\omega}_{t,k_c}^{i,(l)} / \sum_{l=1}^{L} \tilde{\omega}_{t,k_c}^{i,(l)}$. The set of particles and respective normalized weights is $\{ x_{t,k_c,3D}^{i,(l)}, \hat{\omega}_{t,k_c}^{i,(l)} \}_{l=1}^{L}$. This is the local estimation defined as strategy (a), wherein SA-PF reduces to a plain PF, which exchanges nothing other than LoS vectors; the state estimate is $\hat{x}_{t,k_c,3D}^{i} = \sum_{l=1}^{L} \hat{\omega}_{t,k_c}^{i,(l)} x_{t,k_c,3D}^{i,(l)}$, followed by propagation to the next frame capture instant $k_c + 1$ (step 6 of Algorithm 1).
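A compact sketch of this local PF cycle under strategy (a) is given below, assuming a bootstrap-style proposal so that the weight update reduces to the likelihood of the fused 3D measurement, as in the recursion above. All array names are placeholders, not the authors' variables.

```python
import numpy as np

def pf_local_step(particles, weights, F, H, z_meas, C_meas):
    """One local PF cycle under strategy (a) (illustrative sketch).

    particles : (L, n) previous particles
    weights   : (L,) previous normalized weights
    F         : state transition matrix (deterministic part of the model)
    z_meas, C_meas : fused 3D position measurement and its covariance
    """
    pred = particles @ F.T                              # propagate particles
    innov = z_meas - pred @ H.T                         # residuals vs. fused 3D measurement
    C_inv = np.linalg.inv(C_meas)
    loglik = -0.5 * np.einsum('li,ij,lj->l', innov, C_inv, innov)
    w = weights * np.exp(loglik - loglik.max())         # unnormalized weights
    w /= w.sum()                                        # normalization
    x_est = w @ pred                                    # weighted state estimate
    return pred, w, x_est
```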
Update. Strategy (b), i.e., step 7 of Algorithm 1, involves the fusion window: when $C_i$ asynchronously receives measurements $z_{t,k_c'',3D}^{j_m}$ from the neighboring cameras $C_{j_m}$ at distinct reception instants $k_c''$, SA-PF predicts the particles $\tilde{x}_{t,k_c'',3D}^{i,j_m,(l)}$ and updates the weights $\hat{\omega}_{t,k_c}^{i,(l)}$ as in Equations (18) and (19), respectively.
Initially, $\hat{\omega}_{t,k_c}^{i,j_0,(l)} = \hat{\omega}_{t,k_c}^{i,(l)}$ as described above, and the predicted particles $\tilde{x}_{t,k_c'',3D}^{i,j_m,(l)}$ are:
$$\tilde{x}_{t,k_c'',3D}^{i,j_m,(l)} = E\{ x_{t,k_c'',3D}^{j_m,(l)} \mid x_{t,k_c,3D}^{i,(l)} \} = F(k_c'',k_c)\, x_{t,k_c,3D}^{i,(l)}$$
The corresponding weights $\hat{\omega}_{t,k_c}^{i,j_m,(l)}$, $l = 1, \ldots, L$, are updated sequentially according to:
$$\tilde{\omega}_{t,k_c}^{i,j_m,(l)} = \hat{\omega}_{t,k_c}^{i,j_{m-1},(l)}\, p(z_{t,k_c'',3D}^{j_m} \mid \tilde{x}_{t,k_c'',3D}^{i,j_m,(l)}), \qquad m = 1, \ldots, n(N_{j,i})
where $p(z_{t,k_c'',3D}^{j_m} \mid \tilde{x}_{t,k_c'',3D}^{i,j_m,(l)})$ is the likelihood function in Equation (21). The weights in Equation (19) are normalized according to $\hat{\omega}_{t,k_c}^{i,j_m,(l)} = \tilde{\omega}_{t,k_c}^{i,j_m,(l)} / \sum_{l=1}^{L} \tilde{\omega}_{t,k_c}^{i,j_m,(l)}$. The set of particles and normalized weights $\{ \tilde{x}_{t,k_c'',3D}^{i,j_m,(l)}, \hat{\omega}_{t,k_c}^{i,j_m,(l)} \}_{l=1}^{L}$ represents the “partial posterior” $p(\tilde{x}_{t,k_c,3D}^{i(j_0:j_m)} \mid z_{t,k_c,3D}^{i(j_0:j_m)}, z_{t,1:k_c^-,3D})$. The conditioning is on all previous measurements $z_{t,1:k_c^-,3D}$ and on all current measurements $z_{t,k_c,3D}^{i(j_0:j_m)}$ received at camera $C_i$ during the fusion window at distinct instants $k_c''$, including the local $z_{t,k_c,3D}^{i}$ and up to the neighboring camera $C_{j_m}$, as indicated by the superscript $(j_0:j_m)$ [17].
This update in SA-PF is sequential until the last remaining measurement from the neighboring cameras is received within the fusion window, and then the “global posterior” weights [17] become $\omega_{t,k_c}^{i,(l)} = \hat{\omega}_{t,k_c}^{i,j_{n(N_{j,i})},(l)}$. The set of particles and respective normalized weights is $\{ x_{t,k_c,3D}^{i,(l)}, \omega_{t,k_c}^{i,(l)} \}_{l=1}^{L}$, and the state estimate is $x_{t,k_c,3D}^{i} = \sum_{l=1}^{L} \omega_{t,k_c}^{i,(l)} x_{t,k_c,3D}^{i,(l)}$. Finally, we propagate the particles $x_{t,k_c,3D}^{i,(l)}$ to the next frame capture instant $k_c + 1$ with the deterministic part of the motion model in Equation (10).
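The sequential weight update over the measurements received within the fusion window can be sketched as follows. Here `gaussian_lik` stands for a likelihood evaluator such as the Gaussian pdf of Equation (21), and the assumption that each received tuple carries the transition matrix from the capture to the reception instant, as well as all names, is hypothetical.

```python
import numpy as np

def sapf_sequential_update(particles, weights, H, received, gaussian_lik):
    """Sequential asynchronous weight update of SA-PF (illustrative sketch).

    particles : (L, n) local particles at the frame capture instant
    weights   : (L,) normalized local weights (initial w^{j0})
    received  : list of (F_prop, z_jm, C_jm) tuples, one per neighbor
                measurement, in order of arrival within the fusion window;
                F_prop propagates from the capture to the reception instant
    gaussian_lik : callable returning p(z | H x) for an array of particles
    """
    w = weights.copy()
    for F_prop, z_jm, C_jm in received:
        pred = particles @ F_prop.T              # particles at the reception instant
        w = w * gaussian_lik(pred @ H.T, z_jm, C_jm)
        w /= w.sum()                             # sequential normalization
    x_est = w @ particles                        # global posterior estimate at k_c
    return w, x_est
```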

4. Simulation Results

4.1. Evaluation Setup

We use MATLAB [42] and the PETS2009 database [27] with a camera network composed of $N_s = 8$ cameras. The targets move within the surveillance volume of a network of calibrated cameras, whose construction (intrinsic) parameters and poses are given in the header of the PETS2009 image sequence dataset file. This dataset is challenging for tracking tasks because targets change direction, cluster, and scatter frequently. Scene illumination and obstacles also hinder tracking. In this experiment, the frame sampling rate was 7 frames/s (an inter-frame period of approximately 143 ms), with a total of 794 frames per camera; each time step k therefore lasts approximately 143 ms. Figure 18 shows all FoVs projected on the ground reference plane for each of the eight cameras. Portions of some FoVs do not project on the ground and instead cover the scenario above ground. The surveillance area of the cameras is approximately 60 m × 90 m, as depicted in Figure 18, and their FoVs overlap. Moreover, the neighborhoods of each camera are defined by the network communication topology, also shown. The chosen topology reflects two camera neighborhoods (clusters), one with cameras having large FoV areas projected on the ground and the other consisting of cameras with smaller projections. Communication between the clusters occurs through cameras 4 and 8 and through cameras 1 and 5.
BAF assumes the information pdfs are Gaussian whereas PF makes no such assumption. They are evaluated under the two strategies (a) and (b) previously explained.
The random process modeling the processing delay in frame units is $\tau_k^i \sim U\{0, \tau_{max}\}$, where $U$ denotes the uniform distribution. A uniform distribution is also used for the offset $\alpha_{ij}$ between cameras: $\alpha_{ij} \sim U\{0, \alpha_{max}\}$. Denote $\Lambda_{i,r} \subset [1, 749]$ as the set of frame capture instants during the r-th run, $r = 1, \ldots, M$, of a Monte Carlo simulation of camera $C_i$ tracking a target, and $P_k^i \subseteq \{1, \ldots, N_t\}$ as the set of targets in the FoV of camera $C_i$ at capture instant $k_c$. The evaluation is based on the average root mean square error:
$$\varepsilon = \frac{1}{M N_c} \sum_{r=1}^{M} \sum_{i=1}^{N_c} \frac{1}{|\Lambda_{i,r}|} \sum_{k \in \Lambda_{i,r}} \frac{1}{|P_k^i|} \sum_{t \in P_k^i} \left\| \hat{z}_{t,k_c,2D,r}^{i} - z_{t,k_c,2D}^{i} \right\|_2^2$$
where $\hat{z}_{t,k_c,2D,r}^{i}$ is the estimated 3D position of the t-th target projected according to the calibrated pinhole model onto the image plane of camera $C_i$ at instant $k_c$ during the r-th run, and $z_{t,k_c,2D}^{i}$ corresponds to the actual measurement in the image plane as provided by the target detection algorithm. The latter does not change with $r$, since the image sequence dataset is given and processed beforehand by the target detection algorithm. We set $\sigma_u^2 = 4$ in Equation (12) and $R^i = C_{t,k_c,3D}^{i}$ in the information filter, with the measurement information pair $(H^{T}(C_{t,k_c,3D}^{i})^{-1} z_{t,k_c,3D}^{i},\; H^{T}(C_{t,k_c,3D}^{i})^{-1} H)$, and in the likelihood function in Equation (21). $F$ and $Q$ are given by the dynamic model of the target. The statistics of the error $\varepsilon$ converged with $M = 12$ runs.
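A sketch of how this average error can be accumulated over runs, cameras, capture instants, and targets is given below; the nested-dictionary layout of `err_2d` is a hypothetical data structure chosen for illustration, not the authors' implementation.

```python
import numpy as np

def average_error(err_2d):
    """Average image-plane error over runs, cameras, instants, and targets (sketch).

    err_2d : err_2d[r][i][k] -> (n_targets, 2) array of differences between
             projected 3D estimates and detected centroids for run r,
             camera i, capture instant k.
    """
    total, count = 0.0, 0
    for r, cams in err_2d.items():
        for i, instants in cams.items():
            cam_sum = 0.0
            for k, diffs in instants.items():
                cam_sum += np.mean(np.sum(diffs ** 2, axis=1))  # average over targets in FoV
            total += cam_sum / max(len(instants), 1)            # average over capture instants
            count += 1                                          # one term per (run, camera)
    return total / max(count, 1)
```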
We skip dataset frames to simulate asynchronous frame capture and frame processing delay. $\varepsilon$ is analyzed by increasing the maximum relative offset $0 \leq \alpha_{max} \leq 40$ (see Figure 19) while keeping the processing delay fixed at each i-th camera and r-th run, $\tau_k^i = \tau^i$. Each camera has $\tau^i$ sampled once from $U\{0, \tau_{max}\}$, $0 \leq \tau_{max} \leq 3$, and it remains the same throughout all runs.
The same procedure is applied to analyze the effect on $\varepsilon$ of increasing $0 \leq \tau_{max} \leq 40$ (see Figure 19) while keeping the relative offset fixed at each i-th camera and r-th run. Each camera has $\alpha^i$ sampled once from $U\{0, \alpha_{max}\}$, with $\alpha_{max} = 0$ (synchronous case) or $0 < \alpha_{max} \leq 4$ (asynchronous case), and it remains the same throughout all runs.
Concerning the particle filter, the likelihood function in Equation (17) is the multivariate Gaussian pdf [43]:
$$p(z_{t,k_c,3D}^{i} \mid x_{t,k_c,3D}^{i,(l)}) = \frac{1}{(2\pi)^{N/2} \det^{1/2}(C_{t,k_c,3D}^{i})} \exp\!\left[ -\tfrac{1}{2} \left( H x_{t,k_c,3D}^{i,(l)} - z_{t,k_c,3D}^{i} \right)^{T} (C_{t,k_c,3D}^{i})^{-1} \left( H x_{t,k_c,3D}^{i,(l)} - z_{t,k_c,3D}^{i} \right) \right],$$
where $N = 3$, since $z_{t,k_c,3D}^{i}$ and $C_{t,k_c,3D}^{i}$ refer to the fused 3D position measurement. The likelihood $p(z_{t,k_c'',3D}^{j_m} \mid \tilde{x}_{t,k_c'',3D}^{i,j_m,(l)})$ in Equation (19) shares the same structure as Equation (21), with the corresponding variables replaced. $L = 1000$ particles are used.
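For reference, the Gaussian likelihood of Equation (21), evaluated for a batch of particles, can be sketched as follows (placeholder names, $N = 3$):

```python
import numpy as np

def gaussian_likelihood(z_pred, z_meas, C):
    """Multivariate Gaussian likelihood used to weight particles (sketch).

    z_pred : (L, 3) predicted 3D positions H x for each particle
    z_meas : (3,) fused 3D position measurement
    C      : (3, 3) measurement covariance
    """
    d = z_meas - z_pred
    C_inv = np.linalg.inv(C)
    quad = np.einsum('li,ij,lj->l', d, C_inv, d)          # per-particle Mahalanobis term
    norm = 1.0 / np.sqrt(((2 * np.pi) ** 3) * np.linalg.det(C))
    return norm * np.exp(-0.5 * quad)
```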
Figure 19 shows the simulation results for strategies (a) and (b). Strategy (a) exchanges LoS vectors among neighboring cameras followed by local estimation using fused 3D position measurements (BAF and PF), whereas strategy (b) employs a fusion window after the time window $K_k^i$, with duration $p = \{1, 2, 4\}$ frames, to additionally process either exchanged information pairs' estimates (BAF) or fused 3D position measurements (SA-PF), beyond the LoS vectors.

4.2. Discussions

Figure 19 shows that the BAF and PF filters present similar performance when varying $\alpha_{max}$ and $\tau_{max}$ in all tested situations for strategy (a), where the LoS vectors exchanged among cameras in the time window $K_k^i$ generate the fused 3D position measurement $z_{t,k_c,3D}^{i}$ and both filters use it to estimate the tracked targets' positions and velocities. The BAF-LLM filter outperforms the SA-PF filter in strategy (b) when $\tau_{max}$ and $\alpha_{max}$ reach 16 and beyond. The initial weights $\hat{\omega}_{t,k_c}^{i,j_0,(l)} = \hat{\omega}_{t,k_c}^{i,(l)}$ in the update phase carry the weights computed at instant $k_c$ over to the reception instant $k_c''$, and the reception instant of the received measurement $z_{t,k_c'',3D}^{j_m}$ is taken as the frame capture instant of the neighboring camera, since SA-PF is not informed of the frame capture time at the neighboring camera. Moreover, just one significantly delayed measurement from a neighboring camera yields a small likelihood value, thus compromising the remaining particle weight updates in the sequential fusion step of SA-PF. The small likelihood in such a condition arises because the particle state estimate undergoes time propagation in Equation (19) from the capture instant to the reception instant, whereas the neighboring camera measurement received asynchronously in the fusion window carries a large delay relative to the local capture instant. Then, if $k_c'' \gg k_c$, the updated estimate is negatively affected by the delay. Therefore, for $\tau_{max}$ and $\alpha_{max}$ below 16 (small values of processing delay and asynchrony), the SA-PF filter produces an average root mean square error similar to that of the BAF-LLM filter; i.e., small $\alpha_{max}$ and $\tau_{max}$ do not degrade SA-PF performance in strategy (b).
As depicted in Figure 19 for all tested $p = \{1, 2, 4\}$, the average root mean square error of SA-PF undergoes sudden degradation relative to BAF-LLM when either the asynchrony or the frame processing delay increases significantly (blue and magenta curves, strategy (b)). BAF-LLM is not susceptible to this degradation because of its prediction step from the frame capture instant to the transmission instant.
One problem is validating the target information received by camera $C_i$ because, as stated previously, different IDs can be assigned to the same target in different cameras. BAF solves this by evaluating the Euclidean distance between the target 3D position estimate in camera $C_i$ and the position estimate received from the neighborhood. In the SA-PF filter, we use the Euclidean distance between the local measurement $z_{t,k_c,3D}^{i}$ and the measurement $z_{t,k_c'',3D}^{j_m}$ received from the neighborhood; i.e., for each information exchange between cameras in the fusion window concerning a potential target, a distance tolerance ensures that only data arising from the same target are fused.
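The distance gate itself is a one-line check; a hypothetical helper, with `tol` standing for the distance tolerance mentioned above, could look like this:

```python
import numpy as np

def same_target(p_local, p_received, tol):
    """Distance gate used to validate exchanged target data (sketch).

    p_local, p_received : 3D positions (estimates in BAF, fused measurements
                          in SA-PF) of the two candidate targets.
    tol : Euclidean distance tolerance; data are fused only if the candidates
          are closer than tol.
    """
    return np.linalg.norm(np.asarray(p_local) - np.asarray(p_received)) <= tol
```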
Target tracking by a camera is lost if the capture interval $T = \alpha_1 + \alpha_2$ exceeds the time the target remains within the camera FoV. Unlike [15], where a fixed number of targets remain within sight of the camera network during the experiment, our distributed target tracking algorithm can track a varying number of targets as long as they move within the FoVs of at least two cameras.
Cameras in close spatial proximity degrade the 3D estimation accuracy for targets at long range, since the 3D measurement from imaging proposed here borrows the concept of depth estimation from the baseline length and the epipolar constraint in stereoscopy.
Additionally, we conducted further evaluation of (1) the effect of camera pose uncertainty, with focus on camera attitude error, and (2) computational complexity of our proposed 3D position measurement approach.
The PETS2009 image sequence dataset is a real benchmark scenario for target tracking with fixed cameras whose calibration parameters and pose data are accurate. Nevertheless, we conduct a Monte Carlo simulation with $M = 20$ runs in which uncertainties are added to the extrinsic calibration parameters of each camera (rotation and translation). To the calibrated rotation angles $\theta_z, \theta_y, \theta_x$ of each camera, in radians, we add Gaussian uncertainty $N(\mu = 0.05\ \mathrm{rad}, \sigma = 0.001\ \mathrm{rad})$. Additive Gaussian uncertainty $N(\mu = 10\ \mathrm{cm}, \sigma = 1\ \mathrm{cm})$ in camera location is also applied. We evaluate the mean square error in Equation (20). The simulation investigated processing delays $\tau = (0, 2, 4, 6, 8, 10, 12, 14, 16, 18)$, maximum asynchrony among cameras $\alpha_{max} = 4$, and the particular asynchrony value of each camera sampled from the uniform distribution $U\{0, \alpha_{max}\}$, i.e., each camera received $0 \leq \alpha \leq \alpha_{max}$.
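A sketch of the per-run pose perturbation, using the uncertainty levels quoted above and a hypothetical helper name, is shown below.

```python
import numpy as np

def perturb_pose(angles, location, rng):
    """Add pose uncertainty for the Monte Carlo sensitivity study (sketch).

    angles   : (3,) calibrated rotation angles (theta_z, theta_y, theta_x) in rad
    location : (3,) calibrated camera location in metres
    rng      : numpy random Generator
    The perturbation levels follow the text: N(0.05 rad, 0.001 rad) on each
    angle and N(10 cm, 1 cm) on each coordinate of the camera location.
    """
    noisy_angles = angles + rng.normal(0.05, 0.001, size=3)
    noisy_location = location + rng.normal(0.10, 0.01, size=3)  # metres
    return noisy_angles, noisy_location
```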
Figure 20 indicates that uncertainties in the cameras' poses degrade the fused 3D position measurement of targets when compared with exact poses, as expected.
We evaluate the computational complexity numerically by increasing the number $n_i$ of targets observed in the field of view of the i-th camera and the number $n_j$ of targets received by the i-th camera from the j-th cameras, resulting in a total of $n_i \cdot n_j$ target combinations. The processing tasks considered for each of the targets are:
(a) Variable $n_i$: each locally detected target has 7 sigma points assigned to the local LoS vector pointing to the t-th target in the image plane;
(b) Variable $n_j$: each received target has 7 sigma points assigned to the received LoS vector pointing to the received t-th target in the image plane;
(c) Intersection search over pairs of LoS vectors from the set of 49 LoS vectors, providing points in 3D space (Equations (6) and (8));
(d) Check of which 3D points pass the distance criterion $d \leq 0.3$ m;
(e) Selection of the 7 LoS vectors, from the set of 49, that pass the above test with the smallest distance $d$; and
(f) UT computation: 3D position measurement mean and covariance obtained at a $k_c$ capture instant of the i-th camera.
We noticed three regions in Figure 21:
(a) From 1 up to 90 targets (i.e., from 1 × 1 up to 90 × 90 target combinations), the elapsed processing time shows an exponential tendency;
(b) From 91 up to 262 targets (i.e., from 91 × 91 up to 262 × 262 combinations), the elapsed processing time shows a linear tendency; and
(c) Above 262 × 262 combinations, the elapsed processing time again presents an exponential tendency.
In the PETS2009 image sequence dataset, the number of targets observed at any capture instant $k_c$ does not exceed 10 simultaneous targets. Therefore, our approach carries out the 3D position measurements with very low processing time.

5. Conclusions

We propose position measurement in 3D space with intersecting vector constraints derived from exchanged LoS vectors in a camera network with known relative poses. Solving for the 3D intersection of the vector constraints is based on computing the point of closest proximity between skew lines. Each camera does not need the construction parameters of the other cameras to obtain the 3D measurements. We then applied our approach to tracking simultaneous targets with a network of asynchronously communicating cameras subject to frame processing delays. The targets enter and leave the fields of view of the cameras in the image sequence dataset.
We use the average root mean square error across all image planes of the asynchronous camera network to measure tracking performance under two conditions: fused 3D position measurement from exchanged LoS vectors with local state estimation (strategy (a)), and fusion of additional information beyond the exchanged LoS vectors (strategy (b)).
BAF and PF present similar performance under strategy (a), which uses local estimation. With regard to strategy (b), involving further fusion, BAF-LLM is superior to SA-PF under large asynchrony and processing delay. Comparing strategies (a) and (b), BAF, PF, and BAF-LLM yield approximately the same performance over a wide range of asynchrony and processing delay. We remark that local estimation in BAF and PF demands less communication than the further fusion used in BAF-LLM and SA-PF. Further investigation should focus on the use of consensus to improve target tracking subject to asynchronous exchange in the camera network.

Author Contributions

Software, T.M.D.G.; Validation, T.M.D.G.; Investigation, T.M.D.G.; Writing—review & editing, J.W.; Supervision, J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
WCN: Wireless Communication Network
BAF: Batch Asynchronous Filter
BAF-LLM: Batch Asynchronous Filter – Lower Load Mode
SA-PF: Sequential Asynchronous Particle Filter
LoS: Line of Sight
FoV: Field of View

Appendix A

Point of minimum distance between skew lines:
The expressions describing lines r and s are as follows:
$$\begin{cases} r = P_1 + \lambda v_1 \\ s = P_2 + \beta v_2 \end{cases}$$
where $\lambda, \beta \in \mathbb{R}_+$ and $v_1, v_2$ are the direction vectors. The distance between any two points on lines r and s can be calculated as follows:
$$d(r,s) = \sqrt{(r_x - s_x)^2 + (r_y - s_y)^2 + (r_z - s_z)^2}.$$
To find the point of minimum distance between the lines, it is necessary to find the parameters $\lambda$ and $\beta$ that minimize the distance function $d(r,s)$. Therefore, $(\lambda^*, \beta^*) = \arg\min_{\lambda,\beta} d(\lambda,\beta)$ is obtained by setting the partial derivatives of $d(r,s)$ to zero:
$$\begin{cases} \dfrac{\partial d}{\partial \lambda} = 0 \\[4pt] \dfrac{\partial d}{\partial \beta} = 0 \end{cases}$$
$$(\mathrm{I})\quad \frac{\partial d}{\partial \lambda} = \frac{1}{2} \left[ (r_x - s_x)^2 + (r_y - s_y)^2 + (r_z - s_z)^2 \right]^{-\frac{1}{2}} \left[ 2 (r_x - s_x) \frac{\partial r_x}{\partial \lambda} + 2 (r_y - s_y) \frac{\partial r_y}{\partial \lambda} + 2 (r_z - s_z) \frac{\partial r_z}{\partial \lambda} \right] = 0,$$
where
$$\frac{\partial r_x}{\partial \lambda} = \frac{\partial (p_{1x} + \lambda v_{1x})}{\partial \lambda} = v_{1x}, \qquad \frac{\partial r_y}{\partial \lambda} = \frac{\partial (p_{1y} + \lambda v_{1y})}{\partial \lambda} = v_{1y}, \qquad \frac{\partial r_z}{\partial \lambda} = \frac{\partial (p_{1z} + \lambda v_{1z})}{\partial \lambda} = v_{1z}.$$
Then,
$$(r_x - s_x) v_{1x} + (r_y - s_y) v_{1y} + (r_z - s_z) v_{1z} = 0$$
$$\left[ p_{1x} + \lambda v_{1x} - (p_{2x} + \beta v_{2x}) \right] v_{1x} + \left[ p_{1y} + \lambda v_{1y} - (p_{2y} + \beta v_{2y}) \right] v_{1y} + \left[ p_{1z} + \lambda v_{1z} - (p_{2z} + \beta v_{2z}) \right] v_{1z} = 0$$
Arranging the terms of these equations:
$$(p_{1x} - p_{2x}) v_{1x} + (p_{1y} - p_{2y}) v_{1y} + (p_{1z} - p_{2z}) v_{1z} + \lambda v_{1x}^2 + \lambda v_{1y}^2 + \lambda v_{1z}^2 - \beta \left( v_{1x} v_{2x} + v_{1y} v_{2y} + v_{1z} v_{2z} \right) = 0$$
$$(\mathrm{A})\quad (P_1 - P_2) \cdot v_1 + \lambda \| v_1 \|^2 - \beta\, v_1 \cdot v_2 = 0$$
$$(\mathrm{II})\quad \frac{\partial d}{\partial \beta} = \frac{1}{2} \left[ (r_x - s_x)^2 + (r_y - s_y)^2 + (r_z - s_z)^2 \right]^{-\frac{1}{2}} \left[ -2 (r_x - s_x) \frac{\partial s_x}{\partial \beta} - 2 (r_y - s_y) \frac{\partial s_y}{\partial \beta} - 2 (r_z - s_z) \frac{\partial s_z}{\partial \beta} \right] = 0$$
$$(r_x - s_x) v_{2x} + (r_y - s_y) v_{2y} + (r_z - s_z) v_{2z} = 0$$
$$(p_{1x} - p_{2x}) v_{2x} + (p_{1y} - p_{2y}) v_{2y} + (p_{1z} - p_{2z}) v_{2z} - \beta \left( v_{2x}^2 + v_{2y}^2 + v_{2z}^2 \right) + \lambda \left( v_{1x} v_{2x} + v_{1y} v_{2y} + v_{1z} v_{2z} \right) = 0$$
$$(\mathrm{B})\quad (P_1 - P_2) \cdot v_2 - \beta \| v_2 \|^2 + \lambda\, v_1 \cdot v_2 = 0$$
From (A) and (B):
$$\lambda = \frac{\beta\, v_1 \cdot v_2 - (P_1 - P_2) \cdot v_1}{\| v_1 \|^2}, \qquad \beta = \frac{(P_1 - P_2) \cdot v_2 + \lambda\, v_1 \cdot v_2}{\| v_2 \|^2}$$
Substituting β in λ , we finally obtain:
$$\begin{cases} \lambda^* = \dfrac{(v_1 \cdot v_2)\left[ (P_1 - P_2) \cdot v_2 \right] - \| v_2 \|^2 \left[ (P_1 - P_2) \cdot v_1 \right]}{\| v_1 \|^2 \| v_2 \|^2 - (v_1 \cdot v_2)^2} \\[8pt] \beta^* = \dfrac{(P_1 - P_2) \cdot v_2 + \lambda^* (v_1 \cdot v_2)}{\| v_2 \|^2} \end{cases}$$
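A direct NumPy implementation of these closed-form parameters, offered as a hypothetical helper rather than the authors' code, is:

```python
import numpy as np

def closest_points(P1, v1, P2, v2):
    """Closest points between skew lines r(l) = P1 + l v1 and s(b) = P2 + b v2.

    Returns (lambda*, beta*) from Appendix A and the minimum distance
    between the two lines (illustrative sketch).
    """
    P1, v1, P2, v2 = map(np.asarray, (P1, v1, P2, v2))
    d12 = P1 - P2
    v1v2 = v1 @ v2
    denom = (v1 @ v1) * (v2 @ v2) - v1v2 ** 2       # zero only for parallel lines
    lam = (v1v2 * (d12 @ v2) - (v2 @ v2) * (d12 @ v1)) / denom
    beta = ((d12 @ v2) + lam * v1v2) / (v2 @ v2)
    r_star = P1 + lam * v1
    s_star = P2 + beta * v2
    return lam, beta, np.linalg.norm(r_star - s_star)
```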

References

  1. Sutton, Z.; Willet, P.; Bar-Shalom, Y. Target Tracking Applied to Extraction of Multiple Evolving Threats from a Stream of Surveillance Data. IEEE Transac. Comput. Soc. Syst. 2021, 8, 434–450. [Google Scholar] [CrossRef]
  2. Song, B.; Kamal, A.T.; Soto, C.; Ding, C.; Farrel, J.A.; Roy-Chowdhury, A.K. Tracking and Activity Recognition Through Consensus in Distributed Camera Networks. IEEE Transac. Imag. Proc. 2010, 19, 2564–2579. [Google Scholar] [CrossRef] [PubMed]
  3. Olfati-Saber, R. Distributed Kalman filtering for sensor networks. In Proceedings of the 2007 46th IEEE Conference on Decision and Control, New Orleans, LA, USA, 12–14 December 2007; pp. 5492–5498. [Google Scholar] [CrossRef]
  4. Solomon, C.; Breckon, T. Fundamentos de Processamento de Imagens: Uma Abordagem Prática com Exemplos em Matlab; LTC: Rio de Janeiro, Brazil, 2013. (In Portuguese) [Google Scholar]
  5. Chagas, R.; Waldmann, J. Difusão de medidas para estimação distribuída em uma rede de sensores. In Proceedings of the XI Symposium on Electronic Warfare, São José dos Campos, Brazil, 29 September–2 October 2009; pp. 71–75. (In Portuguese). [Google Scholar]
  6. Beaudeau, J.; Bugallo, M.F.; Djurić, P.M. Target tracking with asynchronous measurements by a network of distributed mobile agents. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012; pp. 3857–3860. [Google Scholar] [CrossRef]
  7. Casbeer, D.W.; Beard, R. Distributed information filtering using consensus filters. In Proceedings of the American Control Conference, St. Louis, MO, USA, 10–12 June 2009; pp. 1882–1887. [Google Scholar] [CrossRef]
  8. Garcia-Fernandez, A.; Grajal, J. Asynchronous Particle Filter for tracking using non-synchronous sensor networks. Sig. Proc. 2011, 91, 2304–2313. [Google Scholar] [CrossRef]
  9. Wymeersch, H.; Zazo, V.S. Belief consensus algorithms for fast distributed target tracking in wireless sensor networks. Sig. Process. 2014, 95, 149–160. [Google Scholar] [CrossRef] [Green Version]
  10. Ji, H.; Lewis, F.L.; Hou, Z.; Mikulski, D. Distributed information-weighted Kalman consensus filter for sensor networks. Automatica 2017, 77, 18–30. [Google Scholar] [CrossRef] [Green Version]
  11. Khaleghi, B.; Khamis, A.; Karray, F.O.; Razavi, N.S. Multisensor data fusion: A review of the state-of-the-art. Inf. Fusion 2013, 14, 28–44. [Google Scholar] [CrossRef]
  12. Taj, M.; Cavallaro, A. Distributed and Decentralized Multicamera Tracking. IEEE Sign. Process. Mag. 2011, 28, 46–58. [Google Scholar] [CrossRef]
  13. Hlinka, O.; Hlawatsch, F.; Djurić, P.M. Distributed Sequential Estimation in Asynchronous Wireless Sensor Networks. IEEE Sig. Proc. Lett. 2015, 22, 1965–1969. [Google Scholar] [CrossRef]
  14. Olfati-Saber, R. Kalman-Consensus Filter: Optimality, stability, and performance. In Proceedings of the 48h IEEE Conference on Decision and Control (CDC) Held Jointly with 2009 28th Chinese Control Conference, Shanghai, China, 15–18 December 2009; pp. 7036–7042. [Google Scholar] [CrossRef]
  15. Katragadda, S.; Cavallaro, A. A batch asynchronous tracker for wireless smart-camera networks. In Proceedings of the 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017; pp. 1–6. [Google Scholar] [CrossRef]
  16. Yang, R.; Bar-Shalom, Y. Full State Information Transfer Across Adjacent Cameras in a Network Using Gauss Helmert Filters. J. Advan. Inform. Fus. 2022, 17, 14–28. [Google Scholar]
  17. Zhu, G.; Zhou, F.; Xie, L.; Jiang, R.; Chen, Y. Sequential Asynchronous Filters for Target tracking in wireless sensor networks. IEEE Sens. J. 2014, 14, 3174–3182. [Google Scholar]
  18. Olfati-Saber, R.; Sandell, N.F. Distributed tracking in sensor networks with limited sensing range. In Proceedings of the 2008 American Control Conference, Seattle, WA, USA, 11–13 June 2008; pp. 3157–3162. [Google Scholar] [CrossRef]
  19. Julier, S.; Uhlmann, J.; Durrant-Whyte, H.F. A new method for the nonlinear transformation of means and covariances in filters and estimators. IEEE Trans. Autom. Cont. 2000, 45, 477–482. [Google Scholar] [CrossRef] [Green Version]
  20. Battistelli, G.; Chisci, L. Kullback-Leibler average, consensus on probability densities, and distributed state estimation with guaranteed stability. Automatica 2014, 50, 707–718. [Google Scholar] [CrossRef]
  21. Katragadda, S.; Regazzoni, C.S.; Cavallaro, A. Average consensus-based asynchronous tracking. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 4401–4405. [Google Scholar] [CrossRef]
  22. Kushwaha, M.; Koutsoukos, X. Collaborative 3D Target Tracking in Distributed Smart Camera Networks for Wide-Area Surveillance. J. Sens. Actuator Netw. 2013, 2, 316–353. [Google Scholar] [CrossRef] [Green Version]
  23. Matteo, Z.; Matteo, L.; Marangoni, T.; Mariolino, C.; Marco, T.; Manuela, G. 3D Tracking of Human Motion Using Visual Skeletonization and Stereoscopic Vision. Front. Bioeng. Biotech. 2020, 8. [Google Scholar] [CrossRef]
  24. Zivkovic, Z. Wireless smart camera network for real-time human 3D pose reconstruction. Comp. Vis. Im. Underst. 2010, 114, 1215–1222. [Google Scholar] [CrossRef]
  25. Giordano, J.; Lazzaretto, M.; Michieletto, G.; Cenedese, A. Visual Sensor Networks for Indoor Real-Time Surveillance and Tracking of Multiple Targets. Sensors 2022, 22, 2661. [Google Scholar] [CrossRef] [PubMed]
  26. Yan, J.; Liu, H.; Pu, W.; Bao, Z. Decentralized 3-D Target Tracking in Asynchronous 2-D Radar Network: Algorithm and Performance Evaluation. IEEE Sens. J. 2017, 17, 823–833. [Google Scholar] [CrossRef]
  27. PETS 2009 Benchmark Data. Available online: http://cs.binghamton.edu/~mrldata/pets2009 (accessed on 21 August 2022).
  28. Rottmann, M.; Maag, K.; Chan, R.; Hüger, F.; Schlicht, P.; Gottschalk, H. Detection of False Positive and False Negative Samples in Semantic Segmentation. In Proceedings of the 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 9–13 March 2020; pp. 1351–1356. [Google Scholar] [CrossRef]
  29. Fang, K.; Xiang, Y.; Li, X.; Savarese, S. Recurrent Autoregressive Networks for Online Multi-Object Tracking. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Lake Tahoe, NV, USA, 12–15 March 2018; pp. 466–475. [Google Scholar] [CrossRef]
  30. Cavallaro, A.; Aghajan, H. Multi-Camera Networks: Principles and Applications; Academic Press: Cambridge, MA, USA, 2009. [Google Scholar]
  31. Corke, P. Robotics, Vision, and Control; International Publishing AG: Cham, Switzerland, 2017. [Google Scholar]
  32. Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
  33. Lao, D.; Sundaramoorthi, G. Minimum Delay Object Detection from Video. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5096–5105. [Google Scholar] [CrossRef] [Green Version]
  34. Matsuyama, T.; Ukita, N. Real-time multitarget tracking by a cooperative distributed vision system. Proc. IEEE 2002, 90, 1136–1150. [Google Scholar] [CrossRef]
  35. Konstantinova, P.; Udvarev, A.; Semerdjiev, T. A study of a Target Tracking Algorithm Using Global Nearest Neighbor Approach. In Proceeding of the 4th International Conference on Computer Systems and Technology-CompSysTech, Rousse, Bulgaria, 19–20 June 2003; pp. 290–295. [Google Scholar]
  36. Munkres, J. Algorithms for the assignment and transportation problems. J. Soc. Indus. Appl. Mathem. 1957, 5, 32–38. [Google Scholar] [CrossRef] [Green Version]
  37. Tsai, R.Y. A Versatile Camera Calibration Techniques for High-Accuracy 3D Machine Vision Metrology Using Off-the-Shelf TV Cameras and Lenses. IEEE J. Robot. Autom. 1987, 3, 323–344. [Google Scholar] [CrossRef] [Green Version]
  38. Camargo, I.; Boulos, P. Geometria Analítica um Tratamento Vetorial, 3rd ed.; Pearson: São Paulo, Brazil, 2005. (In Portuguese) [Google Scholar]
  39. Ferreira, P.P.C. Cálculo e Análise Vetoriais com Aplicações Práticas; Moderna: Rio de Janeiro, Brazil, 2012; Volume 1. (In Portuguese) [Google Scholar]
  40. Wan, E.A.; Van Der Merwe, R. The unscented Kalman filter for nonlinear estimation. In Proceedings of the IEEE 2000 Adaptive Systems for Signal Processing, Communications, and Control Symposium (Cat. No.00EX373), Lake Louise, AB, Canada, 4 October 2000; pp. 153–158. [Google Scholar] [CrossRef]
  41. Bar-Shalom, Y.; Li, X.R.; Kirubarajan, T. Estimation with Applications to Tracking and Navigation: Theory, Algorithms and Software; Wiley Online Library: Hoboken, NJ, USA, 2002. [Google Scholar]
  42. Katragadda, S. Available online: http://www.eecs.qmul.ac.uk/~andrea/software.htm (accessed on 10 March 2022).
  43. Thrun, S.; Burgard, W.; Fox, D. Probabilistic Robotics; MIT Press: Cambridge, MA, USA, 2005. [Google Scholar]
Figure 1. Targets detected within the FoVs of Cameras 1, 2, 3, and 7 and the corresponding assigned identification numbers [27]. Red dots indicate the centroid of each target.
Figure 2. Pinhole camera model C i : {C} represents the camera coordinate system; {W} represents the world coordinate system; O c , i represents the optical center of the i-th camera expressed in the camera coordinate system; O w represents the origin of the world coordinate system; f is the focal length; P d is the distorted point given in pixels;   P u represents the corrected point given in pixels; P w is a point in the 3D space represented in world coordinates {W}; Z c = f is the camera image plane.
Figure 3. Camera optical center and LoS vector representation in the world coordinate system {W}. The constraint vector r t , k , i w results from summing the translation vectors t c w i and the extended LoS vector λ L o S t , k , i w ,   λ ,   λ > 1 .
Figure 4. LoS vector L o S 1 , k , 1 w and respective vector constraint r 1 , k , 1 w ( λ ) from Camera 1 to the detected target with I d = 1 . C 1 = t c w 1 : location of the optical center of Camera 1. Target 1: position of Target 1 in the image plane giving rise to LoS vector L o S t , k , i w .
Figure 5. Determining the distance between vector constraints r t , k , i w ( λ ) and r t , k , j w ( β ) . The tolerance d m i n δ indicates whether LoS vectors L o S t , k , i w and L o S t , k , j w point at the same target.
Figure 6. Top and front views of a detected target centroid trajectory within a time window K k i .
Figure 7. LoS vectors from Camera 1 to Target 1 L o S 1 , k , 1 w ,   k = 1 , , 4 in the image plane in {W} coordinates during the time window K k 1 and the interpolated trajectory. The data will be stored in the information vectors i 1 , k , 1 w = { k ; O c , 1 w ; L o S 1 , k , 1 w } for each k-th instant within the time window K k 1 .
Figure 8. Information vector storage buffer i t , k , i w of the t-th target in the i-th camera, k = k 1 , , k n 1 .
Figure 9. Buffer 1: locally obtained information vectors stored in the i-th camera. Buffer 2: information vectors received from neighboring cameras at different time instants within the time window period. The information vectors are temporally aligned for later 3D proximity metric computation. Reception of two or more information vectors may occur at the same time instant within the time window K k i , as shown at time k 3 . Colored arrows indicate temporal alignment of local information vectors with received information vectors.
Figure 10. Front view of the i-th camera with a detected target centroid trajectory. τ k c i denotes the image processing delay incurred measuring z t , k c , 3 D i and the local estimation phase. T = α 1 + α 2 is the inter-frame interval. K k i = [ k c α 1 , k c + α 2 ] : time window. Transmission of information vectors I t , k c + τ k c i , i w occurs at k c + τ k c i whereas the fusion phase occurs at the end of the time window, i.e., at k c + α 2 . Instants k are integer multiples of the frame capture period T.
Figure 11. Application of the UT transformation to approximate the 3D position measurement error statistics arising from the uncertainty in LoS vector L o S t , k , i w in the image plane. This uncertainty in the LoS vector affects the output error statistics of the proximity metric algorithm that determines the distance between constraint vector pairs r t , k , i w and r t , k , j w (the latter not shown for the sake of clarity) and consequently impacts the 3D position measurement. The resulting seven points in 3D space are approximated with a normal distribution N ( μ f k , C f k ) . Subscript f here means “final” in the UT 3D position mean and covariance determined from averaging with respect to the neighboring cameras tracking the same target centroid at the same time instant for each instant k within a temporal window. 3D motion of the imaged targets is not constrained to a ground reference plane. 3D position is in r t , k , i w ( λ * ) when the distance tolerance between r t , k , i w ( λ * ) and r t , k , j w ( β * ) is d m i n δ .
Figure 12. 7 Sigma points found after applying UT to noisy LoS vectors in the image plane (optical axis included) undergoing the proximity metric algorithm to determine the distance between vector constraint pairs and the corresponding set of points in 3D space with distribution N ( μ f k , C f k ) .
Figure 13. Set of 3D points resulting from proximity determination between constraints $r_{1,k,1}^{w}$ and $r_{t,k,j}^{w}$, $j = 2, 3, 4, 5$, for target Id = 1 in Camera 1. Information vectors received from neighboring cameras can include LoS vectors to different targets or to the same target but with a different Id from the one issued in Camera 1.
Figure 14. An instance of distinct UT means μ j k ,   j = 1 , , n in 3D space at instant k for the sets of 3D points shown in Figure 13.
Figure 15. Final 3D means μ f k at instants k = 13, k = 14, and k = 15 after averaging their respective simultaneous UT means μ j k seen in Figure 14.
Figure 16. After each time window, the algorithm enters the fusion phase, where information is exchanged asynchronously within a fusion window of size p. $k_c'$ is the transmission instant of $C_i$, and $k_c''$ represents the reception instants in $C_i$ from neighboring $C_j$ within the fusion window. Content exchanged within the fusion window: information pair $(\tilde{y}_{t,k_c',3D}^{i,i}, \tilde{Y}_{t,k_c',3D}^{i,i})$ in BAF, and fused 3D position measurement and covariance $(z_{t,k_c,3D}^{i}, C_{t,k_c,3D}^{i})$ in SA-PF.
Figure 17. In strategy (a) only the fused 3D position measurement and local estimation to the next time window K k + 1 i are involved. Three steps are shown for strategy (b) where the blue arrows indicate the actions the filter goes through. The fused 3D measurement z t , k c , 3 D i and covariance C t , k c , 3 D i correspond to the frame capture instant k c .
Figure 18. Top: network communications topology. Bottom: projection of the FoVs from each of the N s = 8 cameras on the ground reference plane.
Figure 19. Performance with variation of α m a x and   τ m a x with exchange of just LoS vectors for fused 3D position measurement followed by local estimation (strategy a) and exchange of additional content other than LoS vectors for further fusion (strategy b) in fusion window with duration p = {1,2,4} frames.
Figure 20. Mean square error performance ϵ in pixels with added uncertainties in cameras’ poses.
Figure 21. Elapsed time to process tasks (a)–(f) as a function of the total number of targets n i · n j .
Table 1. Analytical comparison.

| Ref. | Acronym | Processing Delays | Aligns Received Data to Local Capture Instant | Data Exchanged among Camera Nodes | Position Representation | Fusion Method |
|------|---------|-------------------|-----------------------------------------------|-----------------------------------|-------------------------|---------------|
| [16] | GHF | no | no | State estimates | Polar coordinates | No fusion |
| [25] | Decentralized EKF | no | no | State estimates | Cartesian coordinates | Sequential |
| [26] | DAT2TF | no | no | State estimates | Polar coordinates | Sequential |
| [15] | BAF-LLM | yes | yes | State estimates | Cartesian coordinates | Batch |
| [17] | SA-PF | no | no | Measurements | Cartesian coordinates | Sequential |
| This paper [15] | BAF-LLM, local estimation ¹ | yes | yes ³ | (1) LoS vectors to measure 3D positions | Cartesian coordinates | Batch |
| This paper [15] | BAF-LLM, with further fusion | yes | yes | (2) LoS vectors to measure 3D positions and information pairs' estimates | Cartesian coordinates | Batch |
| This paper [17] | PF, local estimation ² | no | yes ³ | (1) LoS vectors to measure 3D positions | Cartesian coordinates | Batch |
| This paper [17] | SA-PF, with further fusion | no | no | (2) LoS vectors to measure 3D positions and fused 3D position measurements | Cartesian coordinates | Sequential |

¹ Kalman Filter in information form. ² Particle Filter. ³ 3D position measurements from asynchronously received LoS vectors undergo interpolation to the local capture instant for the purpose of time alignment. See Section 2.1.6.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
