Article

Sensor Fusion with Asynchronous Decentralized Processing for 3D Target Tracking with a Wireless Camera Network

by
Thiago Marchi Di Gennaro
1,*,† and
Jacques Waldmann
2
1
Systems and Control Department, Weapons Systems Directorate, Brazilian Navy, Rio de Janeiro 20010-100, RJ, Brazil
2
Systems and Control Department, Electronics Engineering Division, Instituto Tecnológico de Aeronáutica, São José dos Campos 12228-900, SP, Brazil
*
Author to whom correspondence should be addressed.
Current Address: Rua Primeiro de Março 118—Edifício Barão de Ladário—Andares 17° a 21°—Centro, Rio de Janeiro 20010-100, RJ, Brazil.
Sensors 2023, 23(3), 1194; https://doi.org/10.3390/s23031194
Submission received: 15 November 2022 / Revised: 13 January 2023 / Accepted: 16 January 2023 / Published: 20 January 2023
(This article belongs to the Section Sensor Networks)

Abstract:
We present a method to acquire 3D position measurements for decentralized target tracking with an asynchronous camera network. Cameras with known poses have fields of view with overlapping projections on the ground and 3D volumes above a reference ground plane. The purpose is to track targets in 3D space without constraining motion to a reference ground plane. Cameras exchange line-of-sight vectors and respective time tags asynchronously. From stereoscopy, we obtain the fused 3D measurement at the local frame capture instant. We use local decentralized Kalman information filtering and particle filtering for target state estimation to test our approach with only local estimation. Monte Carlo simulation includes communication losses due to frame processing delays. We measure performance with the average root mean square error of 3D position estimates projected on the image planes of the cameras. We then compare only local estimation to exchanging additional asynchronous communications using the Batch Asynchronous Filter and the Sequential Asynchronous Particle Filter for further fusion of information pairs’ estimates and fused 3D position measurements, respectively. Similar performance occurs in spite of the additional communication load relative to our local estimation approach, which exchanges just line-of-sight vectors.

1. Introduction

Tracking multiple targets in a distributed-processing wireless-communication network (WCN) of cameras encompasses several areas such as surveillance, computer vision, threat detection [1], inertial navigation aided by additional sensors, and autonomous vehicles. In the tracking process investigated here, we assume fixed cameras whose poses are known and that have computational processing resources (network nodes). The cameras estimate the location of a moving object of interest, here called a target, over time [2] in three-dimensional (3D) space. Network topology affects the distributed processing, which aims at greater flexibility and robustness [3], so that target tracking becomes less prone to failure. Each network node implements detection algorithms and state estimators, where the state estimate comprises the position and velocity of the target. Detected targets are represented by the location of their centroids on the image plane, and the corresponding measurements are the line-of-sight (LoS) vectors. The target 3D position arises from intersecting LoS vectors measured by neighboring nodes covering the 3D space the target traverses. Hence, target 3D motion is not restricted to a reference ground plane, and targets can move freely in the 3D space covered by the field of view of each camera in the network. LoS measurements are usually noisy or erroneous owing to the limitations of computer vision algorithms [4]. With a network of cameras, it is possible to improve target state estimation via fusion of information from different cameras observing the same target.
Two approaches for information exchange among cameras are found in the literature: measurement information exchange [3,5,6] and state estimate information exchange [7,8,9,10,11,12,13,14,15,16,17,18]. The former computes the fusion of exchanged measurements and subsequently proceeds to target state estimation. Concerning the target state estimate exchange, the i-th camera fuses the local state estimate with the same target’s state estimate received from neighboring cameras.
Each camera node has a local clock representing the timeline for each of its pre-programmed actions. The network nodes share the same inter-frame interval, though frame capture instants are asynchronous [8]. Moreover, for each captured frame, there is a processing delay when detecting the targets and computing the measurements. Here, this measurement is the 3D target position in world Cartesian coordinates from intersecting local and received LoS vectors of cameras with known poses followed by the unscented transform (UT) [19]. Processing delay and asynchronous capture are both integer multiples of the inter-frame interval and dealt with by means of temporal alignment of the 3D position measurements with the local frame capture instant and measurement fusion by averaging the aligned 3D position measurements. Then, we proceed with the local estimation of either the target information pair with Kalman filtering in information form or the target state estimate with particle filtering.
We compare the above with the Batch Asynchronous Filter (BAF) [15] and the Sequential Asynchronous Particle Filter (SA-PF) [17], both Bayesian decentralized filters, in combination with our fused 3D position measurement approach. In their original form, BAF and SA-PF exchange more than just asynchronous line-of-sight vectors to targets: BAF exchanges asynchronous information pairs' estimates, and SA-PF exchanges fused 3D position measurements.
BAF exchanges asynchronous information pair estimates among neighboring cameras from local Kalman information filters, aligns them temporally at the frame capture instant for Kullback–Leibler Average (KLA) fusion [20], and proceeds to target information pair update followed by time propagation to the next local capture instant.
SA-PF [17] asynchronously exchanges fused 3D position measurements among neighboring cameras, then propagates each local particle state estimate to the reception instant of each measurement in the local reception sequence, iteratively updates each local particle weight with the likelihood of each received measurement, updates the local state estimate, and propagates to the next local capture instant.

1.1. Related Work

Target tracking in a distributed asynchronous camera network can use either a sequential method [13] or a batch method [6]. In sequential methods, a network node computes the target state not only when local measurements are obtained but also when the node receives information from neighboring camera nodes. The nodes do not propagate the estimates received from neighboring nodes into alignment with the local frame capture instant prior to the update; the reception instants are treated as the sending cameras' frame capture instants, and the probability density for target state estimation, namely the particle weights, is updated sequentially [15,17]. The SA-PF filter disregards the processing delays.
In batch processing methods, the update with data information received asynchronously from other nodes occurs after the received data are propagated to the local frame capture instant for the purpose of time alignment [6,15]. BAF [15] considers the asynchronous frame capturing among cameras and the processing delay in the estimation and fusion phases at each camera. In the estimation phase, a camera obtains the 3D position measurement assuming the target motion develops on a ground reference plane and estimates the information pair at the frame capture instant. Cameras exchange local target state information, and all information received during a time window period is stored in a buffer. At the end of the time window, a camera enters the fusion phase and propagates the received target information pairs of other cameras either backward or forward to temporal alignment with the local frame capture instant [21].
Previous work most closely related to target tracking with a camera network used information exchange of two-dimensional target states restricted to motion on a reference ground plane [2,3,7,8,10,13,14,15,17,21]. However, much less attention has been given to the often-occurring scenario in which targets move in 3D space while taking into account network asynchrony and delays in image processing. For example, in [22], multi-view histograms were employed to characterize targets in 3D space using color and texture as the visual features; the implementation was based on particle filtering. In [23], a deep network model was trained with an RGB-D dataset, and the 3D structure of the target was then reconstructed from the corresponding RGB image to predict depth information. In [24], images with a detected hand and head were used, with each camera transmitting to a central processor over a low-power communication network. The central processor performed 3D pose reconstruction, combining the data in a probabilistic manner. In [16], state estimation with complete maximum likelihood information transfer produced 3D Cartesian estimates of target position and velocity from azimuth and elevation provided by multiple cameras. Ref. [25] addressed the competing objectives of monitoring 3D indoor structured areas against tracking multiple 3D points with a heterogeneous visual sensor network composed of both fixed and Pan-Tilt-Zoom (PTZ) controllable cameras. The measurement was the projection of a distinct 3D point representing the target onto an image plane, wherein all cameras followed the pinhole model with unit focal length. The authors applied decentralized EKF information filtering, handling neither camera network asynchrony nor the issue that in real image sequences a target occupies a volume; thus, cameras aiming at a target from different directions see quite distinct portions of the target surface and different 3D points.
None of the above methods considered frame processing delays and network asynchrony among camera nodes.
Reference [26] applied DAT2TF to an asynchronous network of 2D radars that provided local Kalman filter state estimates to a central fusion node performing global estimation of the 3D target state. While focusing on the fusion of state estimates, the authors highlight a linearized approach to determine the Cramer–Rao estimation error uncertainty bound when converting radar measurements from polar to Cartesian coordinates, and they claim the result is equivalent to decentralized filtering.

1.2. Our Proposal

To track multiple actual targets undergoing unrestricted motion in a real image sequence with a network of calibrated cameras whose poses are known, we propose the decentralized approach wherein our focus is on the fusion of 3D position measurements combined with only local estimation, without involvement of a central node. For that, we need to invert the projection function from 3D space to the image plane with stereoscopy and the known camera poses and calibration data:
Step 1: Within a time window about the local frame capture instant, transmit to and receive from neighboring cameras, asynchronously, LoS vectors represented in the world coordinate system {W} to a target centroid in the image plane, together with the corresponding time stamps [15];
Step 2: Use the unscented transform (UT) and stereovision principles to acquire 3D position measurements and covariances from the received LoS vectors when they intersect to within a 3D spatial distance threshold;
Step 3: Propagate forward or backward in time the 3D position measurements and respective covariances via interpolation until temporal alignment with the local frame capture instant;
Step 4: Fuse locally the 3D position measurements and covariances at the frame capture instant by means of averaging, use them to update the local estimate, and propagate to the next frame capture instant.
We compare our approach to SA-PF and BAF, wherein additional communication is exchanged beyond just the LoS vectors used to measure 3D positions. After the time window in Step 1, each camera starts a fusion window to receive and process the additional communications. In SA-PF, a camera transmits and receives fused 3D position measurements and respective covariances at distinct instants in the fusion window. Differently, BAF receives asynchronous information pairs' estimates at distinct instants in the fusion window.
Both BAF and SA-PF are compared to our proposal with the average root mean square error in the image planes of the cameras tracking the various targets in the scene. The comparison also aims at investigating whether there is a relevant change in performance when a camera fails to receive communications due to ongoing frame processing.

1.3. Contributions

Our innovative approach directly measures the 3D centroid position of an actual target with stereoscopy as different, calibrated cameras with known poses view distinct portions of the surface of the same target. Moreover, we explicitly dispense with the fusion node to add robustness to target tracking via decentralized processing in a camera network with partial connectivity. Our focus is on the fusion of 3D position measurements and respective covariances computed with LoS vectors received asynchronously from neighboring cameras. We also consider the image processing delay of each node when obtaining the local measurements.
We derive LoS vectors from the optical center of each camera to the various target centroids in the respective image plane with the corresponding error covariances. Then, we use stereoscopy combined with an unscented transform (UT) that employs the minimum distance search between pairs of LoS vectors to compute the corresponding 3D positions and error covariance in the world reference frame. One LoS vector is from the local camera and the other LoS vector is received asynchronously from any of the neighboring cameras tracking the same target. We handle asynchronous LoS vectors and the corresponding error covariances, all represented in the world coordinate frame, with time propagation from the received instant into alignment with the camera local capture instant. A set of 3D position measurements and covariances arise with representation in the world coordinate frame at the capture instant of the camera. Then, we fuse the measurements and covariances by means of averaging and proceed to local estimation.
We emphasize that our contribution is not in filtering, though we modified both BAF and SA-PF to add a fusion stage to test our fused 3D position measurement approach and local estimation against a further fusion step among the neighboring cameras, and thus investigate the tradeoff between performance and required communications. We find that local estimation provides similar performance with a smaller communication workload.
Moreover, in the real scenario evaluated here, namely the PETS2009 sequence [27], subjects enter and leave the fields of view of the cameras in the network, resulting in targets that merge and separate as they suddenly change motion direction in 3D space. Our approach handles such occurrences gracefully. To the best of our knowledge, no previous investigation has focused on decentralized target tracking using LoS vectors exchanged asynchronously among neighboring camera nodes with frame processing delay to obtain position measurements in 3D space from stereoscopy principles. Table 1 presents an analytical comparison with methods found in the literature.

1.4. Paper Structure

The paper is organized as follows. Section 2 formulates how to measure 3D position considering the processing delay, network asynchrony, and tracking without constraining target motion to a reference ground plane. Section 3 describes the algorithms to obtain 3D measurements and the modifications to the Bayesian BAF and SA-PF filter algorithms to include an additional fusion window for comparison with just local estimation. Section 4 presents simulation results and discusses performance, and Section 5 concludes the paper.

2. Problem Formulation

Some background is required initially:
- Instants or time steps k within a time window: integer multiples of the camera inter-frame interval; they denote the capture instants of target centroids and the instants at which LoS vectors are received from neighboring cameras during the time window about the frame capture instant k c of the local camera. They are stored in the i-th camera node;
- Frame capture instant k c : the moment when a frame is captured; it defines a time window;
- Communication among cameras is assumed to be ideal, i.e., there are no communication delays.
Unlike the fully connected network in [15,17], we investigate a partially connected network. Consider a WCN of cameras C i , i = { 1 , 2 , , N s } , where N s denotes the maximum number of sensors. This network is based on decentralized processing, meaning that each C i camera node fuses its information with the information received from the set N j , i of camera nodes C j , j i considered neighbors; i.e., N j , i is the set of cameras with which the i-th camera communicates to track targets in 3D space. The network topology is fixed and independent of the tracked target. The term decentralized indicates that each camera in the network has a communication module, with detection and tracking algorithms independent from the rest of the cameras. There is no fusion center involved. Given that we aim at 3D tracking, the state of the t-th target at the i-th camera is given by x t , k c , 3 D i = [ x t , k c i , x ˙ t , k c i , y t , k c i , y ˙ t , k c i , z t , k c i , z ˙ t , k c i ] T , where [ x t , k c i , y t , k c i , z t , k c i ]   T represents the position, and [ x ˙ t , k c i , y ˙ t , k c i , z ˙ t , k c i ]   T is the velocity at the frame capture instant k c .
The first phase in modeling this problem involves obtaining the t-th target position measurements $z_{t,k_c,3D}^{i}$ in 3D space at the i-th camera node from intersecting LoSs measured asynchronously in neighboring camera nodes. Subsequently, at each camera node, position and velocity estimation is initialized for each target. The local estimation algorithms are the Bayesian filters BAF and particle filtering (PF), though without engaging in the fusion step [15,17]. They compute the target posterior pdf $p(x_{t,k_c,3D}^{i,i} \mid z_{t,1:k_c,3D}^{i})$ based on the known predicted posterior pdf $p(\hat{x}_{t,k_c,3D}^{i} \mid z_{t,1:k_c-1,3D}^{i})$ corresponding to the previous capture instant $k_c-1$ and propagated to capture instant $k_c$, the state transition pdf $p(x_{t,k_c,3D}^{i,i} \mid \hat{x}_{t,k_c-1,3D}^{i})$, and the measurement likelihood pdf $p(z_{t,k_c,3D}^{i} \mid x_{t,k_c,3D}^{i,i})$. The second superscript in i,i refers to the local state estimate processed at the i-th camera without using information received from a neighboring camera.
We consider the frame processing delay incurred to obtain z t , k c , 3 D i and the asynchrony of neighboring camera capture instants. To this end, we borrow from [15] the concept of a time window K k i about the local camera capture instant k c , wherein LoS vectors are received from neighboring cameras. We define τ k c i as the frame processing delay to obtain the measurement z t , k c , 3 D i for the i-th camera, and α i j is the relative offset between the capture instants of neighboring cameras C i and C j .
The problems arising with false positives and false negatives in target detection were considered previously [28] and are addressed here as well. Scene complexity (quantity and speed of the targets) [29], construction parameters of the cameras, and accuracy in their calibration [30,31,32] are factors that also hinder target detection. Thus, if the detection algorithm fails to confirm the presence of the same target within a minimum capture count up to a maximum count, the target is discarded. A new identifier number (Id) is assigned, and a new initial target centroid measurement in the image plane, expressed in pixels, is generated when a novel target is detected. We assume the detection algorithm does not incur a frame processing delay [33] to obtain a target centroid measurement in the image plane; the delay comes solely from computing the position measurement z t , k c , 3 D i in 3D space with the received LoS vectors transmitted by neighboring camera nodes.
As stated earlier, the cameras fuse information from targets detected in their FoVs with information received about the targets of neighboring cameras belonging to N j , i . Recall that the LoS vector from a single pinhole camera to a target does not convey depth information for tracking in 3D space [32]. Then, subject to network asynchrony, the j-th cameras in the i-th camera neighborhood set N j , i transmit LoS vectors to distinct targets, each target being detected by at least two cameras with overlapping FoVs [30]. We use stereoscopy to measure depth along a LoS vector and determine 3D position as targets move freely. Stereoscopy requires views of the same target from at least two different directions. The LoS vector from the camera's optical center to the target centroid in the image plane is represented in the global Cartesian coordinate system by means of camera calibration and pose [31] data. In a real scenario, the same target may have very distinct portions of its surface viewed from very different directions by cameras located far apart. In such a case, stereoscopy can produce a biased 3D position measurement located somewhere inside the target surface. A distance tolerance $d_{min} \leq \delta$ is established to detect LoS proximity and ascertain that the measured 3D position is consistent with the pair of LoSs involved.
We evaluate three approaches to information exchange:
  • First: We propose to exchange among neighboring cameras LoS vector measurements L o S t , k c + τ k c i , i w to the t-th target from the i-th camera at instant k c +   τ k c i within the time window K k i period, and the data are represented in the world coordinate system {W}. The measured z t , k , 2 D i (referring to the target centroid in the image plane in pixel coordinates) is transformed into L o S t , k , i w with the camera construction parameters as described in Section 2.1. The fused 3D position measurement z t , k c , 3 D i results from averaging intersecting pairs of vector constraints r t , k , i w ,   r t , k , j w derived from LoS vectors of neighboring j-th cameras pointing at the same t-th target according to a LoS-vector 3D proximity determination algorithm. Local target state estimation uses the fused 3D position measurement z t , k c , 3 D i to update and predict the state estimate in the next frame capture instant k c + 1 .
  • Second: For comparison, exchange among cameras C i and C j , where j ∈ N j , i , the fused 3D measurement z t , k c , 3 D i for each target, transmitted and instantaneously received at instants within the fusion window with configurable size p after the K k i time window. This is used in the sequential approach in SA-PF.
  • Third: Again, for comparison, exchange among cameras C i and C j , where j ∈ N j , i , j ≠ i , the information pairs' estimates for each target, transmitted and instantaneously received at instants within the fusion window with configurable size p after the K k i time window. This is used in the batch approach in BAF.
The first approach extends the concept of stereoscopy to neighboring calibrated cameras with known poses and requires a 3D proximity metric between LoS vectors detected locally at the i-th camera and LoS information received from cameras belonging to N j , i . Note that internal temporary storage of LoS vectors L o S t , k , i w at each instant within the time window K k i period is required because the LoS vectors L o S t , k , i w at C i must be aligned temporally, through time stamps, with the vectors L o S t , k , j w received from C j . In [34], the concept of “virtual synchronization” was used along with a “dynamic” memory to store the captured data. Based on this concept, sending and receiving information from cameras are concurrent activities at instants k within the time window period.
The information vector I t , k c + τ k c i , i w exchanged between cameras conveys the LoS vector to the t-th target at the time instant k c + τ k c i , the time stamp referring to the k c + τ k c i instant and the optical center O c , i w of camera C i expressed in world coordinates {W}: I t , k c + τ k c i , i w = { k c + τ k c i ; O c , i w ; L o S t , k c + τ k c , i w } ; the information vector stored for each of the k instants within the time window K k i period is defined as i t , k , i w = { k ; O c , i w ; L o S t , k , i w } . We adopt capital I for the exchanged information vector and small i for the stored information vector.
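To make the bookkeeping concrete, the following is a minimal Python sketch of how the stored information vectors i t , k , i w and the per-target time-window buffer could be organized; the class and field names are illustrative choices of ours, not the authors' implementation.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class InfoVector:
    """Information vector {k; O_c^w; LoS^w} for one target (stored form i, exchanged form I)."""
    k: int                       # time stamp, an integer multiple of the inter-frame interval
    optical_center: np.ndarray   # O_c^w, camera optical center in the world frame {W}, shape (3,)
    los: np.ndarray              # LoS vector to the target centroid, in {W}, shape (3,)

@dataclass
class TargetBuffer:
    """Per-target storage of information vectors within the current time window K_k^i."""
    entries: list = field(default_factory=list)

    def add(self, iv: InfoVector) -> None:
        self.entries.append(iv)

    def in_window(self, k_c: int, alpha1: int, tau: int) -> list:
        # Keep only vectors whose time stamps fall inside [k_c - alpha1, k_c + tau],
        # the part of the window used to compute the 3D position measurements.
        return [iv for iv in self.entries if k_c - alpha1 <= iv.k <= k_c + tau]

    def clear(self) -> None:
        # Entries are replaced when the next time window starts.
        self.entries.clear()

buf = TargetBuffer()
buf.add(InfoVector(k=13, optical_center=np.zeros(3), los=np.array([1.0, 0.5, -0.2])))
print(len(buf.in_window(k_c=15, alpha1=3, tau=1)))   # 1
```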

2.1. Obtaining z t , k c , 3 D i

To obtain measurements in 3D space, the following steps are required, from capturing the target centroid in the image plane z t , k , 2 D i in pixels to measuring the centroid z t , k c , 3 D i in 3D space in meters.

2.1.1. Target Centroid Detection in the Image Plane in Pixels and Identification Number (Id) Assignment

For each target detected by the i-th camera, the target centroid in the image plane is represented in the image by a red dot, as shown in Figure 1. All targets are assigned an identification number (Id).
Concerning the detection algorithm, accurate data association for maintaining the assignment of an identifier to a target is a challenging problem. The “global nearest neighbor” algorithm [35], which in turn implements the Munkres algorithm [36], was used at each camera for this purpose. However, as shown in Figure 1, an Id assigned to a target by one of the cameras will not necessarily match the Id assigned to the same target by neighboring cameras. For example, in Figure 1, for the same target observed by Cameras 1 and 2, Camera 1 assigned Id = 1 while Camera 2 assigned Id = 3. This issue is addressed in Section 2.3.
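As an illustration of the data association step, the sketch below runs the Munkres (Hungarian) assignment via SciPy on a toy cost matrix of Euclidean distances between predicted track centroids and detected centroids; the gating threshold and the coordinates are arbitrary placeholders, and a full tracker would add track initiation and deletion logic.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_ids(predicted, detected, gate=50.0):
    """Match predicted track centroids to detected centroids (both in pixels).

    predicted: (N, 2) array of predicted centroid positions of existing tracks
    detected:  (M, 2) array of detected centroids in the current frame
    Assignments whose distance exceeds `gate` pixels are dropped, leaving those
    detections unmatched (candidate new Ids).
    """
    cost = np.linalg.norm(predicted[:, None, :] - detected[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)          # Munkres assignment
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= gate]

# Example: two tracks, three detections (the third one would start a new Id).
tracks = np.array([[100.0, 120.0], [300.0, 90.0]])
dets = np.array([[302.0, 93.0], [101.0, 118.0], [500.0, 400.0]])
print(associate_ids(tracks, dets))   # [(0, 1), (1, 0)]
```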
Centroid pixels in Figure 1 should be represented in the corresponding camera {C} coordinate system, and the cameras’ calibration will correct the distortions caused by camera construction parameters.

2.1.2. Camera Calibration

All cameras follow the pinhole CCD model [32]. Using the intrinsic or construction parameters of a camera, it is possible to correct its distortions and calibrate it so that the distorted pixel of the detected centroid of the t-th target, represented by point P d in Figure 2, corresponds to the corrected position in the image plane represented by point P u . The transformation ( x d , y d ) → ( x u , y u ) follows [37].
After correcting point P d to point   P u in the camera coordinate system {C}, we represent two main points in the world coordinate system {W} for each camera: the optical center point O c , i and point   P u that give rise to the LoS vector from the i-th camera to the t-th target.

2.1.3. The Optical Center O c , i and Point P u Represented in the World Coordinate System {W}: A Preamble for Obtaining the LoS Vector

Given that the extrinsic parameters of camera C i comprise its rotation R i and translation t i , the optical center O c , i , defined as the vector C ˜ i = O c , i [32] and represented in the camera coordinate system {C}, is expressed in the world coordinate system {W} as follows:
$$\{C\} \rightarrow \{W\}: \qquad O_{c,i}^{w} = t_{cw}^{i} = R_{cw}^{i}\,\tilde{C}_{i} \qquad (1)$$
Point P u is located in the image plane, as shown in Figure 2. We must first represent the pixels of P u in coordinates of the camera system {C}:
$$\{pixels\} \rightarrow \{C\}: \qquad \Big(x_u = \frac{f\,x_c}{z_c},\; y_u = \frac{f\,y_c}{z_c},\; z_c = f\Big) \;\Rightarrow\; \big(x_u = x_c,\; y_u = y_c,\; f\big) \qquad (2)$$
Denoting the LoS vector X t , c i = [ x c = x u , y c = y u , f ] as the coordinates obtained from point P u for the t-th target, now represented in coordinates of the camera system {C}, we can represent it in a vector basis with origin at the camera optical center O c , i and parallel to the world coordinate system {W} as follows:
$$\{C\} \rightarrow \{W\}: \qquad LoS_{t,k,i}^{w} = R_{cw}^{i}\,X_{t,c}^{i} \qquad (3)$$
The rotation matrix R c w i is expressed as follows:
$$R_{cw}^{i}(\theta_z,\theta_y,\theta_x) = \begin{bmatrix} \cos\theta_y\cos\theta_z & \cos\theta_z\sin\theta_x\sin\theta_y - \cos\theta_x\sin\theta_z & \sin\theta_x\sin\theta_z + \cos\theta_x\cos\theta_z\sin\theta_y \\ \cos\theta_y\sin\theta_z & \cos\theta_x\cos\theta_z + \sin\theta_x\sin\theta_y\sin\theta_z & \cos\theta_x\sin\theta_y\sin\theta_z - \cos\theta_z\sin\theta_x \\ -\sin\theta_y & \cos\theta_y\sin\theta_x & \cos\theta_x\cos\theta_y \end{bmatrix} \qquad (4)$$
This is the matrix that performs the rotation from system {W} to system {C} for each camera; its rotation order about the camera axes is z–y–x, and the angles θ z , θ y , θ x rotate from {C} to {W} [31]. Hence, it is necessary to apply the transpose operation $(\cdot)^{T}$ in Equations (1) and (3).
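A small numerical sketch of Equations (1)–(4) in Python: building the z–y–x rotation matrix, expressing the optical center in {W}, and converting a corrected image-plane point P u into a world-frame LoS vector. The angles, focal length, and optical-center values are arbitrary placeholders, and the sign/transpose conventions follow the equations as written above.

```python
import numpy as np

def rot_zyx(theta_z, theta_y, theta_x):
    """Rotation matrix for rotation order z-y-x (Equation (4))."""
    cz, sz = np.cos(theta_z), np.sin(theta_z)
    cy, sy = np.cos(theta_y), np.sin(theta_y)
    cx, sx = np.cos(theta_x), np.sin(theta_x)
    return np.array([
        [cy * cz, cz * sx * sy - cx * sz, sx * sz + cx * cz * sy],
        [cy * sz, cx * cz + sx * sy * sz, cx * sy * sz - cz * sx],
        [-sy,     cy * sx,                cx * cy],
    ])

def los_world(p_u_xy, f, R_wc, C_cam):
    """World-frame optical center and LoS vector from a corrected image point P_u.

    p_u_xy: corrected centroid (x_u, y_u) on the image plane (so z_c = f)
    f:      focal length
    R_wc:   rotation from {W} to {C}; its transpose maps {C} to {W}
    C_cam:  optical center expressed in the camera system {C}
    """
    X_tc = np.array([p_u_xy[0], p_u_xy[1], f])   # point on the image plane in {C}
    R_cw = R_wc.T                                # transpose, as noted for Equations (1) and (3)
    los = R_cw @ X_tc                            # Equation (3): LoS vector in {W}
    o_cw = R_cw @ C_cam                          # Equation (1): optical center in {W}
    return o_cw, los

R_wc = rot_zyx(np.deg2rad(30.0), np.deg2rad(-10.0), np.deg2rad(5.0))
origin, los = los_world((0.12, -0.05), f=1.0, R_wc=R_wc, C_cam=np.array([4.0, 2.0, 3.0]))
print(origin, los / np.linalg.norm(los))
```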

2.1.4. LoS Vector L o S t , k , i w and Parametric Equation of the Line

Note in Figure 3 that the LoS vector L o S t , k , i w is simply the point X t , c i multiplied by the transposed rotation matrix of the i-th camera according to Equation (3).
It should be noted that t c w i is a constant vector because it refers to the fixed position of the optical center of the i-th camera. The LoS vector, in turn, depends on the position of the target, i.e., on point X t , c i in the image plane expressed in the camera coordinate system {C}. Writing the constraint vector r t , k , i w in the form of a parametric equation [38] yields the following expression:
$$r_{t,k,i}^{w}(\lambda) = t_{cw}^{i} + \lambda\, LoS_{t,k,i}^{w}, \qquad \lambda \in \mathbb{R},\; \lambda > 1 \qquad (5)$$
Thus, the vector constraint r t , k , i w ( λ ) changes when extending the LoS vector L o S t , k , i w with the parameter λ . Figure 4 shows the LoS vector L o S t , k , i w and the points related to the vector constraint r t , k , i w ( λ ) and the optical center position t c w i for Target 1 on Camera 1. The last two vectors have their origins coinciding with the origin { 0 ; 0 ; 0 } of the world coordinate system {W}.

2.1.5. Proximity Metric between Vector Constraints Originating from LoS to a Target

Assume camera C i receives the information vector I t , k c + τ k c j , j w from one of the transmitting cameras C j . Suppose also that the observed scene contains only one target and that two synchronized cameras— C i and C j with α i j = 0 —observe it with overlapping FoVs at the same capture instant; moreover, there is no image processing delay, i.e., τ k c i = τ k c j = 0 . In these circumstances, we adopt for simplicity the subscript k to mean the capture instant k c . The vector constraints r t , k , i w and r t , k , j w are computed from the information vectors of cameras C i and C j , respectively, and can be written as follows:
$$\begin{aligned} r_{t,k,i}^{w}(\lambda) &= t_{cw}^{i} + \lambda\, LoS_{t,k,i}^{w}, & \lambda \in \mathbb{R},\; \lambda > 1 \\ r_{t,k,j}^{w}(\beta) &= t_{cw}^{j} + \beta\, LoS_{t,k,j}^{w}, & \beta \in \mathbb{R},\; \beta > 1 \end{aligned} \qquad (6)$$
Then, target triangulation must find $(\lambda^{*}, \beta^{*}) = \arg\min_{\lambda,\beta}\, d\big(r_{t,k,i}^{w}(\lambda),\, r_{t,k,j}^{w}(\beta)\big) = \arg\min_{\lambda,\beta}\, d\big(t_{cw}^{i} + \lambda\, LoS_{t,k,i}^{w},\; t_{cw}^{j} + \beta\, LoS_{t,k,j}^{w}\big)$, i.e., the parameters that minimize the distance between the vector constraints $r_{t,k,i}^{w}(\lambda)$ and $r_{t,k,j}^{w}(\beta)$. The concept of distance between skew lines is applied [38,39], and the solution is obtained by setting to zero the partial derivatives of the distance with respect to λ and β .
$$\begin{cases} \dfrac{\partial\, d\big(r_{t,k,i}^{w}(\lambda),\, r_{t,k,j}^{w}(\beta)\big)}{\partial \lambda} = 0 \\[2ex] \dfrac{\partial\, d\big(r_{t,k,i}^{w}(\lambda),\, r_{t,k,j}^{w}(\beta)\big)}{\partial \beta} = 0 \end{cases} \qquad (7)$$
Solving (7) produces the optimized nonlinear parameters expressed as follows:
$$\begin{cases} \lambda^{*} = \dfrac{\big(LoS_{t,k,i}^{w}\cdot LoS_{t,k,j}^{w}\big)\big[(t_{cw}^{i}-t_{cw}^{j})\cdot LoS_{t,k,j}^{w}\big] - \lVert LoS_{t,k,j}^{w}\rVert^{2}\big[(t_{cw}^{i}-t_{cw}^{j})\cdot LoS_{t,k,i}^{w}\big]}{\lVert LoS_{t,k,i}^{w}\rVert^{2}\,\lVert LoS_{t,k,j}^{w}\rVert^{2} - \big(LoS_{t,k,i}^{w}\cdot LoS_{t,k,j}^{w}\big)^{2}} \\[3ex] \beta^{*} = \dfrac{(t_{cw}^{i}-t_{cw}^{j})\cdot LoS_{t,k,j}^{w} + \lambda^{*}\big(LoS_{t,k,i}^{w}\cdot LoS_{t,k,j}^{w}\big)}{\lVert LoS_{t,k,j}^{w}\rVert^{2}} \end{cases} \qquad (8)$$
The operator · represents the inner product between the vectors, and || . || represents the vector 2-norm operator. Appendix A shows how to find the optimized parameters   λ *   and   β * .
Once obtained, these parameters are substituted into (6), and the 3D position at the minimum distance is given by r t , k , i w ( λ * ) . A tolerance $d_{min} \leq \delta = 0.3$ m was chosen to indicate whether the pair of LoS vectors point at the same target. Figure 5 shows LoS vectors L o S t , k , i w and L o S t , k , j w pointing at the same target.
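The closed-form solution of Equation (8) and the proximity test against δ can be sketched as below, assuming each LoS line is given by its optical-center origin and direction in {W}; the camera positions and LoS directions in the example are made-up values.

```python
import numpy as np

def triangulate_pair(o_i, los_i, o_j, los_j, delta=0.3):
    """Closest points of two LoS lines r_i(l) = o_i + l*los_i and r_j(b) = o_j + b*los_j.

    Returns (point_3d, d_min, accepted): the 3D point on r_i at lambda*, the
    minimum distance between the lines, and whether d_min <= delta (same target).
    Follows the closed-form solution of Equation (8).
    """
    w = o_i - o_j
    uu = los_i @ los_i
    vv = los_j @ los_j
    uv = los_i @ los_j
    denom = uu * vv - uv ** 2          # ~0 for (near-)parallel LoS vectors
    if abs(denom) < 1e-12:
        return None, np.inf, False
    lam = (uv * (w @ los_j) - vv * (w @ los_i)) / denom
    beta = ((w @ los_j) + lam * uv) / vv
    p_i = o_i + lam * los_i
    p_j = o_j + beta * los_j
    d_min = np.linalg.norm(p_i - p_j)
    return p_i, d_min, d_min <= delta

# Two cameras looking at a target near (5, 5, 1.7) m from different positions.
o1, los1 = np.array([0.0, 0.0, 3.0]), np.array([5.0, 5.0, -1.3])
o2, los2 = np.array([10.0, 0.0, 3.0]), np.array([-5.0, 5.05, -1.3])
print(triangulate_pair(o1, los1, o2, los2))
```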
The above description concerns information vectors exchanged synchronously. Next, we analyze asynchronous information exchange considering the processing time required for obtaining position measurements in 3D space.

2.1.6. Asynchronous Exchange of Information Vectors

Figure 6 shows the top and front views of the t-th target centroid trajectory during the time window K k i in one of the cameras, assuming k c = k 3 to be the frame capture instant. Both the optical center O c , i w and the LoS vector L o S t , k , i w that define the vector constraint r t , k , i w ( λ ) are observed in Figure 6. The frame processing delay of the detection algorithm is assumed negligible relative to that of acquiring the 3D position measurements of the targets' centroids at instants k within the time window K k i .
An example of a target centroid trajectory in the image plane expressed in world coordinates {W} is depicted in Figure 7.
Following an idea reported in [34], the information vectors i t , k , i w are stored for each k-th instant within the time window K k i , as can be observed in Buffer 1 of the i-th camera in Figure 8. They are subsequently replaced by information vectors of the next time window K k + 1 i . Note that to each detected t-th target a buffer is allocated to store its information vectors.
Information vectors are sent and received by the i-th camera within the time window about the capture instant k c , expressed as $K_{k}^{i} = [k_c - \alpha_1,\, k_c + \alpha_2]$, with $\alpha_1 > 0$ and $\alpha_2 > \tau_{k_c}^{i}$. The information vectors i t , k , i w of the t-th target in the i-th camera (Buffer 1) are aligned according to their time stamps with respect to the time stamps in the information vectors of the various targets received from the j-th neighboring cameras (Buffer 2). After the temporal alignment, C i will contain local information vectors (Buffer 1) and information vectors received from the j-th neighboring cameras (Buffer 2), as depicted in Figure 9.
After temporal synchronization of the information vectors, the proximity metric algorithm is initialized to determine the distance between the constraint pairs r t , k , i w and r t , k , j w . The 3D position measurement z t , k , 3 D i results from local constraints r t , k , i w and constraints r t , k , j w received from neighboring cameras within each time window K k i that are found to point to the same t-th target at the same instant k, when the distance between the constraint pair is less than the tolerance δ . A polynomial interpolates the 3D position measurements of the same target z t , k , 3 D i , propagating these 3D measurements backward or forward in time within the time window K k i until alignment with the i-th camera capture instant k c , and fuses these 3D measurements by averaging them, thus producing the fused 3D position measurement z t , k c , 3 D i . An internal variable N c defined for each t-th target counts the number of cameras tracking the same target for the purpose of averaging. Then follows local estimation via update with measurement z t , k c , 3 D i and time propagation to the next capture instant, either with BAF or SA-PF, without engaging in any further fusion phase and hence reducing the communication load.
For comparison, the fusion window takes place at the end of each time window K k i at time ( k c + α 2 ) , either using information pairs' estimates in BAF or fused 3D position measurements in SA-PF received asynchronously from neighboring cameras. As described in [15], each camera node knows the maximum relative offset $\alpha_{max} = \max_{i,j}\{\alpha_{ij}\}$ and the maximum and minimum processing delays $\tau_{max} = \max_{i,k_c}\{\tau_{k_c}^{i}\}$ and $\tau_{min} = \min_{i,k_c}\{\tau_{k_c}^{i}\}$. To avoid the next capture instant k c coinciding with the fusion phase of the previous time window, a separation $T = \alpha_1 + \alpha_2$ between consecutive capture instants must exist, where $\alpha_1 = \alpha_{max} - \tau_{min}$ and $\alpha_2 = \alpha_{max} + \tau_{max}$ [15]. Figure 10 shows the timeline that describes how the actions elapse within the time window K k i . The algorithm that obtains the 3D position measurements takes into account the information vectors I t , k c + τ k c i , j w exchanged within the period $[k_c - \alpha_1,\, k_c + \tau_{k_c}^{i}]$ of the time window. The estimation phase takes place at k c + τ k c i using the fused 3D position measurement z t , k c , 3 D i .
An important point addressed next is how to model the uncertainties inherent to the whole process described above and thus find the fused 3D position measurement z t , k c , 3 D i .

2.1.7. Obtaining the 3D Position Measurement at Capture Instant k c

In the initialization of the proximity metric algorithm to determine the distance between constraint pairs r t , k , i w and r t , k , j w , we model the LoS vector L o S t , k , i w with the “Unscented Transform” (UT) [19,40] to approximate the statistics of the random LoS variables transformed by the nonlinear function that determines depth from stereo. The uncertainty in the image plane is a three-dimensional Gaussian distribution N ( μ x , C x ) that also accounts for camera calibration error in the focal length f , where μ x is the mean, and C x is the covariance. This Gaussian distribution defines seven sigma points [40]. Figure 11 shows the 3D sigma points along the three axes of the i-th camera due to uncertainty in the detected centroid location in the image plane that gives rise to the LoS vector L o S t , k , i w = R c w i X t , c i represented in the world coordinate system {W}.
Given that the dimension of X t , c i is L = 3 , then 2 L + 1 = 7 sigma points in three dimensions are generated in the image plane and along the optical axis, as shown in Figure 12. Consequently, seven constraints r t , k , i w are found for the same t-th target. The sigma points in the image plane are generated for each target from information that camera C i either detects locally or receives from a neighboring camera C j . The mean μ x of the Gaussian distribution is the LoS vector L o S t , k , i w = R c w i X t , c i shown in Figure 11, represented by a red arrow.
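The sigma-point generation and the propagation of the image-plane uncertainty through a nonlinear mapping can be sketched with a generic unscented transform as below; the Gaussian parameters are placeholders, and the normalized world-frame LoS direction is used here only as a simple stand-in for the stereo-depth nonlinearity of the paper.

```python
import numpy as np

def ut_sigma_points(mu, P, alpha=1.0, beta=2.0, kappa=0.0):
    """Generate 2L+1 sigma points and UT weights for N(mu, P) (L = 3 gives 7 points)."""
    L = mu.size
    lam = alpha ** 2 * (L + kappa) - L
    S = np.linalg.cholesky((L + lam) * P)           # matrix square root; columns used below
    pts = [mu] + [mu + S[:, i] for i in range(L)] + [mu - S[:, i] for i in range(L)]
    wm = np.full(2 * L + 1, 1.0 / (2 * (L + lam)))
    wc = wm.copy()
    wm[0] = lam / (L + lam)
    wc[0] = lam / (L + lam) + (1.0 - alpha ** 2 + beta)
    return np.array(pts), wm, wc

def unscented_transform(mu, P, g):
    """Approximate the mean and covariance of y = g(x) for x ~ N(mu, P)."""
    pts, wm, wc = ut_sigma_points(mu, P)
    ys = np.array([g(p) for p in pts])
    mean = wm @ ys
    diff = ys - mean
    cov = (wc[:, None] * diff).T @ diff
    return mean, cov

# Image-plane centroid (x_u, y_u) plus the focal length f, with pixel/calibration noise.
mu_x = np.array([0.12, -0.05, 1.0])
C_x = np.diag([2e-4, 2e-4, 1e-5])
R_cw = np.eye(3)                                     # placeholder camera-to-world rotation
g = lambda x: (R_cw @ x) / np.linalg.norm(R_cw @ x)  # unit LoS direction in {W} (nonlinear map)
mean, cov = unscented_transform(mu_x, C_x, g)
print(mean, np.diag(cov))
```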
The proximity metric determines the distance (a nonlinear function) between each candidate constraint pair, i.e., r t , k , i w and r t , k , j w (amounting to a total of 49 candidate pairs, of which only the 7 points with minimum distance are considered for each constraint pair). Any point in 3D space along a constraint r t a , k , i w whose distance to another candidate constraint r t b , k , j w falls within the proximity tolerance $d_{min} \leq \delta$ indicates that targets t a seen by C i and t b seen by C j are the same target, and a set of points in 3D space is thus obtained. Figure 13 shows the result of proximity determination between a constraint to a target detected locally by Camera 1 and each of the constraints to targets received in information vectors from neighboring cameras. In this case, Cameras 2, 3, 4, and 5 send constraints r t , k , j w to Camera 1 within the time window K k 1 , and some of them are relative to the same Target 1.
Each set of 3D points, with the corresponding UT mean and covariance ( μ f k , C f k ) after averaging, refers to a time instant within the time window K k i . Notice that more than one such set may relate to the same instant k when the UT processes simultaneous constraints that originate in information vectors transmitted by n neighboring cameras tracking the same target at instant k. The representative point in 3D space and respective covariance at instant k, μ f k and C f k , are the averages over the n UT means and covariances μ j k , C j k , j ∈ N j , i , of the simultaneous sets of points in 3D space at instant k.
The final 3D mean, μ f k , averages over the n simultaneous UT means μ 1 k , , μ n k in 3D space at instant k as follows:
$$\mu_{f}^{k} = \frac{\mu_{1}^{k} + \mu_{2}^{k} + \cdots + \mu_{n}^{k}}{n} \qquad (9)$$
Likewise, C f k averages over the simultaneous UT covariances $C_{1}^{k}, \ldots, C_{n}^{k}$. Figure 14 depicts this situation, where five UT means are obtained at instant k = 13, one at instant k = 14, and one at instant k = 15. The result is the average of the UT means at each time instant k, as depicted in Figure 15. The corresponding UT covariance average is not shown in Figure 15.
Thus, 3D points are measured at each instant k within the period $[k_c - \alpha_1,\, k_c + \tau_{k_c}^{i}]$ of the time window K k i . Two conditions must be met for these measurements to be generated: (1) at least one information vector I t , k c + τ k c j , j w is received from a neighboring camera at some instant k within the time window period $[k_c - \alpha_1,\, k_c + \tau_{k_c}^{i}]$, and (2) this information vector relates to the same target observed at the i-th camera. These two conditions are not always met; in that case, the 3D position of the target centroid is not measured at all instants within the aforementioned period. At least two 3D position measurements at different instants within the time window period $[k_c - \alpha_1,\, k_c + \tau_{k_c}^{i}]$ are necessary to derive a linear polynomial and interpolate the target 3D position measurements forward or backward in time until the capture instant k c to compute the fused 3D position measurement z t , k c , 3 D i . This interpolation process occurs at the instant k c + τ k c i , and z t , k c , 3 D i is the average of the 3D UT means measured within the period $[k_c - \alpha_1,\, k_c + \tau_{k_c}^{i}]$ and interpolated in time until the capture instant k c . The corresponding fused covariance C t , k c , 3 D i is the average of the UT covariances interpolated in time until the capture instant.
In essence, our approach provides the fused 3D position measurement z t , k c , 3 D i and respective covariance C t , k c , 3 D i from the following steps:
(1) The 3D position measurements z t , k , 3 D i and covariances C t , k , 3 D i are output by the i-th camera local proximity algorithm at each instant k within the period $[k_c - \alpha_1,\, k_c + \tau_{k_c}^{i}]$ if the two conditions above are met;
(2) Then, a linear polynomial interpolation propagates z t , k , 3 D i and C t , k , 3 D i backward or forward in time until alignment with the i-th camera local capture instant k c ;
(3) Finally, the fused 3D position measurements z t , k c , 3 D i and covariances C t , k c , 3 D i result from averaging the interpolated measurements and covariances at capture instant k c .
A remark is that when just one 3D position measurement is generated at an instant that differs from k c , the measurement is discarded because there is no other point to derive a linear interpolating polynomial, and the next time window is awaited. By contrast, when just a single 3D position measurement occurs at frame capture instant k c , this measurement is accepted as z t , k c , 3 D i with the corresponding covariance C t , k c , 3 D i .
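One reasonable reading of steps (1)–(3) above is sketched below: a degree-1 polynomial is fitted per coordinate to the time-tagged UT means (and, entry-wise, to the covariances) and evaluated at the capture instant k c ; the function name and the numerical values are illustrative only.

```python
import numpy as np

def fuse_at_capture_instant(ks, means, covs, k_c):
    """Align time-tagged 3D measurements with the capture instant k_c and fuse them.

    ks:    instants k within [k_c - alpha_1, k_c + tau] at which 3D measurements exist
    means: list of (3,) UT means, one per instant in ks
    covs:  list of (3, 3) UT covariances, one per instant in ks
    Returns (z_fused, C_fused), or None when a single measurement exists at an
    instant other than k_c (it is then discarded, as described above).
    """
    ks = np.asarray(ks, dtype=float)
    means = np.asarray(means)
    covs = np.asarray(covs)
    if len(ks) == 1:
        return (means[0], covs[0]) if ks[0] == k_c else None
    # Degree-1 (linear) polynomial fitted per coordinate and evaluated at k_c:
    z_fused = np.array([np.polyval(np.polyfit(ks, means[:, d], 1), k_c) for d in range(3)])
    # Covariance entries are aligned entry-wise in the same way; symmetry is preserved
    # because the input covariances are symmetric.
    C_fused = np.array([[np.polyval(np.polyfit(ks, covs[:, r, c], 1), k_c)
                         for c in range(3)] for r in range(3)])
    return z_fused, C_fused

ks = [13, 14, 15]
means = [np.array([2.0, 1.0, 1.7]), np.array([2.1, 1.2, 1.7]), np.array([2.2, 1.4, 1.7])]
covs = [0.05 * np.eye(3)] * 3
print(fuse_at_capture_instant(ks, means, covs, k_c=16))
```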
The fused 3D position measurement z t , k c , 3 D i and covariance C t , k c , 3 D i at instant k c are the required information for our decentralized approach with either the local estimation phase at instant k c + τ k c i or further fusion-based estimation with BAF or SA-PF at instant k c + α 2 .

2.2. Dynamic Target Model: Linear Gaussian Model

The discrete-time dynamic model for each target [41] at each camera is given by
$$x_{t,k_c+\Delta k_c,3D}^{i} = F(k_c,\, k_c+\Delta k_c)\, x_{t,k_c,3D}^{i} + w(k_c,\, k_c+\Delta k_c) \qquad (10)$$
where F ( k c , k c + Δ k c ) is the state transition matrix from instant k c to k c + Δ k c , defined as
$$F(k_c,\, k_c+\Delta k_c) = \begin{bmatrix} 1 & \Delta k_c & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & \Delta k_c & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & \Delta k_c \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix} \qquad (11)$$
and w ( k c , k c + Δ k c ) is the cumulative noise between time instants k c and k c + Δ k c , which is considered to be Gaussian with zero mean and covariance Q ( k c ,   k c + Δ k c ) expressed as follows:
$$Q(k_c,\, k_c+\Delta k_c) = \sigma_u^2 \begin{bmatrix} \frac{\Delta k_c^3}{3} & \frac{\Delta k_c^2}{2} & 0 & 0 & 0 & 0 \\ \frac{\Delta k_c^2}{2} & \Delta k_c & 0 & 0 & 0 & 0 \\ 0 & 0 & \frac{\Delta k_c^3}{3} & \frac{\Delta k_c^2}{2} & 0 & 0 \\ 0 & 0 & \frac{\Delta k_c^2}{2} & \Delta k_c & 0 & 0 \\ 0 & 0 & 0 & 0 & \frac{\Delta k_c^3}{3} & \frac{\Delta k_c^2}{2} \\ 0 & 0 & 0 & 0 & \frac{\Delta k_c^2}{2} & \Delta k_c \end{bmatrix}, \qquad (12)$$
where σ u 2 is the variance of process noise in the nearly-constant-velocity model [41]. The measurement model for the i-th camera is given by z t , k c , 3 D i = H x t , k c , 3 D i + ν i , where ν i is the measurement noise vector with dimensions 3 × 1, which is assumed to be Gaussian with zero mean and covariance matrix R i . The latter is the fused covariance matrix C t , k c , 3 D i resulting from interpolation into alignment with frame capture instant k c and averaging as described in Section 3.1, and H is given by
$$H = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \end{bmatrix} \qquad (13)$$
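For reference, the matrices of Equations (11)–(13) can be built as follows; the values of Δ k c and σ u 2 passed in the example are placeholders.

```python
import numpy as np

def ncv_matrices(dt, sigma_u2):
    """State-transition F, process-noise Q, and measurement H for the 3D
    nearly-constant-velocity model of Equations (11)-(13); dt is Delta k_c."""
    f_blk = np.array([[1.0, dt],
                      [0.0, 1.0]])
    q_blk = sigma_u2 * np.array([[dt ** 3 / 3.0, dt ** 2 / 2.0],
                                 [dt ** 2 / 2.0, dt]])
    F = np.kron(np.eye(3), f_blk)          # block-diagonal over the x, y, z axes
    Q = np.kron(np.eye(3), q_blk)
    H = np.zeros((3, 6))
    H[0, 0] = H[1, 2] = H[2, 4] = 1.0      # select positions from [x, xdot, y, ydot, z, zdot]
    return F, Q, H

F, Q, H = ncv_matrices(dt=1.0, sigma_u2=0.5)
print(F.shape, Q.shape, H.shape)
```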

2.3. Target Information Fusion within the Fusion Window [15]

Our proposal is the use of fused 3D position measurements derived from stereoscopy in combination with local state estimation to track actual moving subjects undergoing free 3D motion within the fields of view of asynchronous, calibrated cameras with known poses and located far apart. In certain circumstances, very distinct parts of a target surface can be viewed from very different viewing directions; thus, the 3D centroid position of such a target lies at some point inside and not on the viewed surfaces. Local estimation uses either the Kalman filter in information form or particle filtering. Then we test additional fusion with either BAF or SA-PF for comparison with just local estimation and investigate the tradeoff with the additional communication load.
For that purpose, the SA-PF and BAF algorithms are modified as follows. After the local estimation at the i-th camera with the fused 3D position measurement computed from the asynchronous LoS vectors received within the time window K k i , further fusion occurs with the additional content received asynchronously from the neighboring cameras within a fusion window of configurable size p. This additional content being exchanged is either the information pair ( y ˜ t , k c , 3 D i , i , Y ˜ t , k c , 3 D i , i ) in BAF or the fused 3D position measurement and covariance ( z t , k c , 3 D i , C t , k c , 3 D i ) in SA-PF. Figure 16 represents the actions occurring in two time windows in sequence. The motivation to use SA-PF is that it is not limited to the linear Gaussian assumption of the information filter.
We then investigate whether an improvement occurs with the additional communication exchange in the fusion window. Define the transmit and reception instants, denoted here $k_c'$ and $k_c''$, respectively, at which camera C i exchanges with the neighboring cameras C j either the information pair in BAF or the fused 3D position measurement and covariance in SA-PF. Fusion with the additional communication exchange in BAF and SA-PF proceeds as follows:
(1) BAF fusion: The t-th target information pair is propagated from the frame capture instant k c to the transmit instant $k_c'$, thus yielding ($\widetilde{y}_{t,k_c',3D}^{i,i}$, $\widetilde{Y}_{t,k_c',3D}^{i,i}$) for transmission by C i . Define $k_c''$ to be the reception instant at which camera C i receives an information pair estimate ($y_{t,k_c'',3D}^{i,j}$, $Y_{t,k_c'',3D}^{i,j}$) in the fusion window. We use the Euclidean distance applied to the corresponding 3D centroid positions to indicate whether the target is the same. Then, all the received information pairs' estimates are propagated backward in time to k c for Kullback–Leibler Average (KLA) fusion with the local i-th camera information pair estimate ($y_{t,k_c,3D}^{i,i}$, $Y_{t,k_c,3D}^{i,i}$), and the result is propagated to the next frame capture instant k c + 1 .
(2) SA-PF fusion: As stated before, C i sends the fused 3D measurement z t , k c , 3 D i and covariance C t , k c , 3 D i at the transmit instant $k_c'$ to the neighboring cameras C j without time propagation from k c to $k_c'$. At various reception instants $k_c''$, C i receives ( z t , k c , 3 D j , C t , k c , 3 D j ) from the neighboring cameras, only once from each C j . Here as well, we use the Euclidean distance applied to the 3D position measurements to indicate whether the target is the same. Fusion occurs in this filter by propagating the particles' state estimates from the capture instant k c to the distinct reception instants $k_c''$ and sequentially updating the respective weights with ( z t , k c , 3 D j , C t , k c , 3 D j ). At the end of the fusion window, C i updates the state estimate at k c and propagates it to the next frame capture instant k c + 1 .

3. Asynchronous Bayesian State Estimation

Figure 16 shows how we changed the timeline of the Batch Asynchronous Filter (BAF)-Lower Load Mode (LLM) [15] and the Sequential Asynchronous Particle Filter (SA-PF) [17] to accommodate further fusion. Our purpose here is to investigate whether improved target tracking across the asynchronous camera network results from further communication other than just exchanging LoS vectors among neighboring cameras to compute fused 3D position measurements and then use local estimation.
Neighboring cameras transmit either the information pairs' estimates (BAF) or the fused 3D position measurement and covariance (SA-PF) at the transmit instant $k_c'$ at the end of the time window K k i , and a fusion window of configurable duration p is introduced after K k i . At reception instants $k_c''$ within the fusion window, neighboring cameras asynchronously receive either information pairs' estimates (BAF) or fused 3D position measurements and covariances (SA-PF).
Relative to local estimation with fused 3D position measurements, we investigate (a) the KLA-based fusion of received information pairs’ estimates temporally aligned in batch with the local frame capture instant k c (BAF), or (b) the sequential update of particles’ weights by sampling from a likelihood density function with the received fused 3D position measurements and covariances (SA-PF).
The following Algorithm 1 describes the steps to investigate the effect of additional communications and fusion other than exchanging LoS vectors among neighboring cameras for computing fused 3D position measurements:
Algorithm 1. Algorithm to investigate the effect on target tracking of further fusion beyond the exchanged LoS vectors and local estimation
Input: I–images taken by camera C i
Output: estimated state x t , k c , 3 D i and the root mean square error computation ϵ t , k c , 2 D i
for r Monte-Carlo simulation runs
for each camera
1-If t-th target detection is confirmed after three occurrences in images, its position estimate   z ^ t , k c , 3 D i is computed as follows:
for each detected target
2-Detect target centroids in image plane z t , k , 2 D i within the time window period K k i ;
3-Transform measurements z t , k , 2 D i into stored information vectors i t , k , i w = { k ; O c , i w ; L o S t , k , i w } ;
4-Asynchronous transmission/reception of information vectors I t , k c + τ k c i , i w = { k c + τ k c i ; O c , i w ; L o S t , k c + τ k c , i w } among cameras C i and C j , j ∈ N j , i , i ≠ j ;
5-Compute 3D position measurements z t , k , 3 D i and covariances C t , k , 3 D i within the time window K k i from received LoS vectors, and interpolate all of them into temporal alignment with the frame capture instant k c to account for frame processing delay and camera asynchrony;
6-Local estimation of the target position and velocity in 3D space x t , k c , 3 D i ;
7-Investigate the added fusion phase with either the BAF or the SA-PF: asynchronously transmit/receive information pairs' estimates (BAF) or fused 3D position measurements (SA-PF) among cameras C i and C j , j ∈ N j , i , j ≠ i ;
8- z ^ t , k c , 3 D i is the resulting 3D position component of the target state estimate;
9-Perspective projection of the 3D position estimate of each target being tracked onto the 2D image planes of each neighboring camera, expressed in pixel coordinates: z ^ t , k c , 3 D i z ^ t , k c , 2 D i ;
end for;
end for;
10-Average root mean square error $\epsilon_{t,k_c,2D}^{i} = \lVert \hat{z}_{t,k_c,2D,r}^{i} - z_{t,k_c,2D}^{i} \rVert_2^2$ over the Monte Carlo runs to compare performance in steps 6 and 7 (a code sketch of this projection and error computation follows the algorithm).
end for.
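Steps 9–10 (perspective projection of the 3D estimate onto each camera's image plane and the squared pixel error) can be sketched as follows; the simplified pinhole intrinsics (focal length in pixels and principal point) and all numerical values are illustrative assumptions rather than the calibration used in the paper.

```python
import numpy as np

def project_to_pixels(X_w, R_wc, C_w, f, c_xy):
    """Perspective projection of a 3D world point onto a camera image plane (pixels).

    X_w:  3D point in {W};  R_wc: rotation {W}->{C};  C_w: optical center in {W}
    f:    focal length in pixels;  c_xy: principal point (c_x, c_y) in pixels
    """
    X_c = R_wc @ (X_w - C_w)                 # point expressed in the camera frame {C}
    u = f * X_c[0] / X_c[2] + c_xy[0]
    v = f * X_c[1] / X_c[2] + c_xy[1]
    return np.array([u, v])

def squared_pixel_error(est_3d, truth_2d, R_wc, C_w, f, c_xy):
    """Squared 2-norm between the projected 3D estimate and the 2D reference (step 10);
    averaging over Monte Carlo runs and taking the square root gives the RMSE."""
    err = project_to_pixels(est_3d, R_wc, C_w, f, c_xy) - truth_2d
    return float(err @ err)

R_wc = np.eye(3)
print(squared_pixel_error(np.array([1.0, 0.5, 6.0]), np.array([420.0, 320.0]),
                          R_wc, np.zeros(3), f=800.0, c_xy=(400.0, 300.0)))
```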

3.1. Batch Asynchronous Filter (BAF)-Lower Load Mode (LLM) [15]

The initialization of each target involves the fused 3D position measurement z t , k c , 3 D i and respective covariance C t , k c , 3 D i . Each camera C i uses Kalman filtering in information form [7]. The fused 3D position measurement pair ( z t , k c , 3 D i , C t , k c , 3 D i ) is transformed into the measurement information pair $\big(H^{T} (C_{t,k_c,3D}^{i})^{-1} z_{t,k_c,3D}^{i},\; H^{T} (C_{t,k_c,3D}^{i})^{-1} H\big)$.
Local estimation uses the measurement information pair and the information pair ( y ^ t , k c , 3 D i , Y ^ t , k c , 3 D i ) predicted from the previous frame capture instant to locally update the information pair of the t-th target at camera C i to ( y t , k c , 3 D i , i , Y t , k c , 3 D i , i ), where $y_{t,k_c,3D}^{i,i} = \hat{y}_{t,k_c,3D}^{i} + H^{T} (C_{t,k_c,3D}^{i})^{-1} z_{t,k_c,3D}^{i}$ and $Y_{t,k_c,3D}^{i,i} = \hat{Y}_{t,k_c,3D}^{i} + H^{T} (C_{t,k_c,3D}^{i})^{-1} H$. Then, $\hat{z}_{t,k_c,3D}^{i} = H\, (Y_{t,k_c,3D}^{i,i})^{-1} y_{t,k_c,3D}^{i,i}$ is the 3D position estimate of the target for later use to evaluate the performance of step 6 of Algorithm 1 with the average root mean square error in the image plane.
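A minimal numerical sketch of this information-form update, with toy values for the predicted pair and the fused measurement; it mirrors the equations above but is not the authors' implementation.

```python
import numpy as np

def information_update(y_pred, Y_pred, z, C, H):
    """Information-form update with the fused 3D measurement pair (z, C).

    (y_pred, Y_pred): information pair predicted to the capture instant k_c
    Returns the updated pair (y, Y) and the 3D position estimate H Y^{-1} y.
    """
    C_inv = np.linalg.inv(C)
    y = y_pred + H.T @ C_inv @ z             # y = y_hat + H^T C^{-1} z
    Y = Y_pred + H.T @ C_inv @ H             # Y = Y_hat + H^T C^{-1} H
    z_hat = H @ np.linalg.solve(Y, y)        # 3D position component of the state estimate
    return y, Y, z_hat

# Toy numbers for a 6-state nearly-constant-velocity model.
H = np.zeros((3, 6)); H[0, 0] = H[1, 2] = H[2, 4] = 1.0
P_pred = np.diag([1.0, 0.5, 1.0, 0.5, 1.0, 0.5])
x_pred = np.array([2.0, 0.1, 1.0, 0.0, 1.7, 0.0])
Y_pred = np.linalg.inv(P_pred)
y_pred = Y_pred @ x_pred
z, C = np.array([2.2, 1.1, 1.65]), 0.04 * np.eye(3)
print(information_update(y_pred, Y_pred, z, C, H)[2])
```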
To investigate the added fusion phase in step 7 of Algorithm 1, for the purpose of transmission, C i predicts the information pair at the transmit instant $k_c'$ within the fusion window illustrated in Figure 16 as follows:
$$\begin{aligned} \widetilde{Y}_{t,k_c',3D}^{i,i} &= \Big(F(k_c, k_c')\, \big(Y_{t,k_c,3D}^{i,i}\big)^{-1} F(k_c, k_c')^{T} + Q(k_c, k_c')\Big)^{-1} \\ \widetilde{y}_{t,k_c',3D}^{i,i} &= \widetilde{Y}_{t,k_c',3D}^{i,i}\; F(k_c, k_c')\, \big(Y_{t,k_c,3D}^{i,i}\big)^{-1} y_{t,k_c,3D}^{i,i} \end{aligned} \qquad (14)$$
C i transmits the predicted information pair ($\widetilde{y}_{t,k_c',3D}^{i,i}$, $\widetilde{Y}_{t,k_c',3D}^{i,i}$) to all cameras j ∈ N j , i . Furthermore, C i receives at the local reception instant $k_c''$ within the fusion window the information pair ( y t , k c , 3 D j , j , Y t , k c , 3 D j , j ) from a neighboring camera C j , as depicted in Figure 16. Camera C i has no knowledge of the frame capture instant of the neighboring camera C j . Therefore, C i reads the information pair received from C j as ($y_{t,k_c'',3D}^{i,j}$, $Y_{t,k_c'',3D}^{i,j}$), which undergoes propagation backward in time to the local frame capture instant k c :
$$\begin{aligned} \widetilde{Y}_{t,k_c,3D}^{i,j} &= \Big(F(k_c'', k_c)\, \big(Y_{t,k_c'',3D}^{i,j}\big)^{-1} F(k_c'', k_c)^{T} + Q(k_c'', k_c)\Big)^{-1} \\ \widetilde{y}_{t,k_c,3D}^{i,j} &= \widetilde{Y}_{t,k_c,3D}^{i,j}\; F(k_c'', k_c)\, \big(Y_{t,k_c'',3D}^{i,j}\big)^{-1} y_{t,k_c'',3D}^{i,j} \end{aligned} \qquad (15)$$
Fusion of local and received information pairs at camera C i for each target at instant k c is obtained with the Kullback–Leibler Average (KLA) [20] as follows:
$$\begin{aligned} \widehat{Y}_{t,k_c,3D}^{i} &= \frac{1}{n(N_{j,i})+1}\Big(Y_{t,k_c,3D}^{i,i} + \sum_{j \in N_{j,i}}^{n(N_{j,i})} \widetilde{Y}_{t,k_c,3D}^{i,j}\Big) \\ \widehat{y}_{t,k_c,3D}^{i} &= \frac{1}{n(N_{j,i})+1}\Big(y_{t,k_c,3D}^{i,i} + \sum_{j \in N_{j,i}}^{n(N_{j,i})} \widetilde{y}_{t,k_c,3D}^{i,j}\Big) \end{aligned} \qquad (16)$$
where n ( N j , i ) indicates the number of neighboring cameras in the set N j , i whose information pairs of the t-th target are received within the fusion window of the i-th camera. Then, $\breve{z}_{t,k_c,3D}^{i} = H\, (\widehat{Y}_{t,k_c,3D}^{i})^{-1} \widehat{y}_{t,k_c,3D}^{i}$ is the fused 3D position estimate of the target in step 7 of Algorithm 1 for later use to evaluate performance with the average root mean square error in the image plane.
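The information-pair prediction of Equations (14)/(15) and the KLA fusion of Equation (16) reduce to the following sketch; the nearly-constant-velocity matrices and the toy information pairs are placeholders.

```python
import numpy as np

def predict_information_pair(y, Y, F, Q):
    """Propagate an information pair through the motion model (Equations (14)/(15))."""
    P = np.linalg.inv(Y)
    x = P @ y
    P_new = F @ P @ F.T + Q
    Y_new = np.linalg.inv(P_new)
    y_new = Y_new @ (F @ x)
    return y_new, Y_new

def kla_fuse(local_pair, received_pairs):
    """Kullback-Leibler Average of the local and received information pairs (Equation (16))."""
    y_i, Y_i = local_pair
    n = len(received_pairs)
    Y_f = (Y_i + sum(Y for _, Y in received_pairs)) / (n + 1)
    y_f = (y_i + sum(y for y, _ in received_pairs)) / (n + 1)
    return y_f, Y_f

def ncv(dt, s2=0.5):
    """Nearly-constant-velocity F and Q (Section 2.2), repeated here for self-containment."""
    fb = np.array([[1.0, dt], [0.0, 1.0]])
    qb = s2 * np.array([[dt ** 3 / 3, dt ** 2 / 2], [dt ** 2 / 2, dt]])
    return np.kron(np.eye(3), fb), np.kron(np.eye(3), qb)

F, Q = ncv(1.0)
Y_local = np.eye(6)
y_local = Y_local @ np.array([2.0, 0.1, 1.0, 0.0, 1.7, 0.0])
# A toy received pair, first aligned with the local capture instant via the model:
y_rx, Y_rx = predict_information_pair(y_local + 0.05, np.eye(6), F, Q)
print(kla_fuse((y_local, Y_local), [(y_rx, Y_rx)])[0])
```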
We define as strategy (a) the local estimation in step 6, with exchange of just the information vectors I t , k c + τ k c i , i w conveying LoS vectors. This strategy predicts from the updated pair ($y_{t,k_c,3D}^{i,i}$, $Y_{t,k_c,3D}^{i,i}$) to ($\hat{y}_{t,k_c+1,3D}^{i}$, $\hat{Y}_{t,k_c+1,3D}^{i}$) at the next capture instant k c + 1 and computes the next measurement information pair $\big(H^{T} (C_{t,k_c+1,3D}^{i})^{-1} z_{t,k_c+1,3D}^{i},\; H^{T} (C_{t,k_c+1,3D}^{i})^{-1} H\big)$ based on the fused 3D position measurement and covariance, which we compute with the LoS vectors received in the next time window K k + 1 i .
For comparison, we define strategy (b) in step 7 as adding KLA-based fusion with information pairs' estimates received from neighboring cameras within the fusion window, beyond the LoS vectors received within K k i . This strategy predicts from the KLA-fused information pair ($\hat{y}_{t,k_c,3D}^{i}$, $\hat{Y}_{t,k_c,3D}^{i}$) to ($\hat{y}_{t,k_c+1,3D}^{i}$, $\hat{Y}_{t,k_c+1,3D}^{i}$) at the next capture instant k c + 1 and, as in strategy (a), computes the next measurement information pair $\big(H^{T} (C_{t,k_c+1,3D}^{i})^{-1} z_{t,k_c+1,3D}^{i},\; H^{T} (C_{t,k_c+1,3D}^{i})^{-1} H\big)$ from the fused 3D position measurement and covariance obtained with the LoS vectors received in the next time window K k + 1 i . For clarification, see Figure 17.
The position estimates of the t-th target resulting from strategies (a) and (b) at camera $C_i$ at frame capture instant $k_c$ are, respectively, $\hat{z}_{t,k_c,3D}^{i} = H (Y_{t,k_c,3D}^{i,i})^{-1} y_{t,k_c,3D}^{i,i}$ and $\check{z}_{t,k_c,3D}^{i} = H (\hat{Y}_{t,k_c,3D}^{i})^{-1} \hat{y}_{t,k_c,3D}^{i}$.

3.2. Sequential Asynchronous Particle Filter [17]

The initial step for obtaining a fused 3D position measurement pair $(z_{t,k_c,3D}^{i}, C_{t,k_c,3D}^{i})$ at each camera $C_i$ is the same as previously described. Each camera has the same inter-frame period $T = \alpha_1 + \alpha_2$ between consecutive frame capture instants, as depicted in Figure 10. The motion model of the t-th target is the same as the one in Equation (10).
Define $k_c^-$ as the previous frame capture instant ($k_c^- = k_c - 1$) and $m = 1, \ldots, n(N_{j,i})$, where we recall that $n(N_{j,i})$ indicates the number of neighboring cameras $C_j$, $j \in N_{j,i}$, whose fused 3D position measurements $z_{t,k_c'',3D}^{j_m}$ of the t-th target are received within the fusion window of the i-th camera. Each camera $C_i$ receives the measurements $z_{t,k_c'',3D}^{j_m}$ at distinct instants $k_c''$ within the fusion window and computes the corresponding state estimates $\tilde{x}_{t,k_c'',3D}^{i,j_m}$. The updated state estimate in $C_i$ is then computed from $\tilde{x}_{t,k_c'',3D}^{i,j_m}$, $j_m \in \{j_1, j_2, \ldots, j_{n(N_{j,i})}\}$, where $j_0$ denotes camera $C_i$ itself. As an example, let camera 2 ($i = 2$) receive measurements from $n(N_{j,i}) = 3$ neighboring cameras, say, cameras 5, 7, and 9; then $m = 1, 2, 3$, $N_{j,2} = \{j_1, j_2, j_3\} = \{5, 7, 9\}$, and $j_0 = 2$.
In the prediction process of SA-PF, new particles $x_{t,k_c,3D}^{i,(l)}$ sample the state space from a proposal pdf $q(x_{t,k_c,3D}^{i,(l)} \mid x_{t,k_c^-,3D}^{i,(l)}, z_{t,k_c,3D}^{i})$ given the set of previous particles $x_{t,k_c^-,3D}^{i,(l)}$ and weights $\omega_{t,k_c^-}^{i,(l)}$, and a normalization yields weights $\hat{\omega}_{t,k_c}^{i,(l)}$ associated with the particles $x_{t,k_c,3D}^{i,(l)}$. Then follows the update process: when $C_i$ asynchronously receives a measurement $z_{t,k_c'',3D}^{j_m}$ from camera $C_{j_m}$ in the fusion window, the predicted particles $\tilde{x}_{t,k_c'',3D}^{i,j_m,(l)}$ are computed, and the corresponding weights $\hat{\omega}_{t,k_c}^{i,j_m,(l)}$ are updated sequentially using the measurements $z_{t,k_c'',3D}^{j_m}$ and the particles $\tilde{x}_{t,k_c'',3D}^{i,j_m,(l)}$ in the likelihood function. After receiving the measurement from the last of the neighboring cameras within the fusion window, the updated state estimate $x_{t,k_c,3D}^{i}$ is computed with the set of weights $\omega_{t,k_c}^{i,(l)}$ and particles $x_{t,k_c,3D}^{i,(l)}$.
Prediction. At $k_c = 0$, $L$ particles are initialized as $x_{t,0,3D}^{i,(l)}$, $l = 1, \ldots, L$, sampled from the known pdf $p(x_{t,0,3D}^{i})$. The weights are initialized equally as $\omega_{t,0}^{i,(l)} = 1/L$ for all $l$.
At instants $k_c \geq 1$, with $z_{t,k_c,3D}^{i}$ previously obtained, new particles $x_{t,k_c,3D}^{i,(l)}$ are propagated from the previous particles $x_{t,k_c^-,3D}^{i,(l)}$ with the deterministic part of the motion model in Equation (10). The corresponding weights $\tilde{\omega}_{t,k_c}^{i,(l)}$ use the fused 3D position measurement $z_{t,k_c,3D}^{i}$:
$$\tilde{\omega}_{t,k_c}^{i,(l)} = \omega_{t,k_c^-}^{i,(l)} \frac{p(z_{t,k_c,3D}^{i} \mid x_{t,k_c,3D}^{i,(l)})\; p(x_{t,k_c,3D}^{i,(l)} \mid x_{t,k_c^-,3D}^{i,(l)})}{q(x_{t,k_c,3D}^{i,(l)} \mid x_{t,k_c^-,3D}^{i,(l)}, z_{t,k_c,3D}^{i})} = \omega_{t,k_c^-}^{i,(l)}\, p(z_{t,k_c,3D}^{i} \mid x_{t,k_c,3D}^{i,(l)})$$
The above weights are normalized according to $\hat{\omega}_{t,k_c}^{i,(l)} = \tilde{\omega}_{t,k_c}^{i,(l)} / \sum_{l=1}^{L} \tilde{\omega}_{t,k_c}^{i,(l)}$. The set of particles and respective normalized weights is $\{ x_{t,k_c,3D}^{i,(l)}, \hat{\omega}_{t,k_c}^{i,(l)} \}_{l=1}^{L}$. This is the local estimation defined as strategy (a), wherein SA-PF reduces to a plain PF, which exchanges nothing other than LoS vectors; the state estimate is $\hat{x}_{t,k_c,3D}^{i} = \sum_{l=1}^{L} \hat{\omega}_{t,k_c}^{i,(l)} x_{t,k_c,3D}^{i,(l)}$, followed by propagation to the next frame capture instant $k_c + 1$ (step 6 of Algorithm 1).
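A compact sketch of this local PF cycle under strategy (a) is given below, assuming a bootstrap-style proposal so that the weight update reduces to the likelihood of the fused 3D measurement, as in the recursion above. All array names are placeholders, not the authors' variables.

```python
import numpy as np

def pf_local_step(particles, weights, F, H, z_meas, C_meas):
    """One local PF cycle under strategy (a) (illustrative sketch).

    particles : (L, n) previous particles
    weights   : (L,) previous normalized weights
    F         : state transition matrix (deterministic part of the model)
    z_meas, C_meas : fused 3D position measurement and its covariance
    """
    pred = particles @ F.T                              # propagate particles
    innov = z_meas - pred @ H.T                         # residuals vs. fused 3D measurement
    C_inv = np.linalg.inv(C_meas)
    loglik = -0.5 * np.einsum('li,ij,lj->l', innov, C_inv, innov)
    w = weights * np.exp(loglik - loglik.max())         # unnormalized weights
    w /= w.sum()                                        # normalization
    x_est = w @ pred                                    # weighted state estimate
    return pred, w, x_est
```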
Update. Strategy (b), i.e., step 7 of Algorithm 1, involves the fusion window: when $C_i$ asynchronously receives measurements $z_{t,k_c'',3D}^{j_m}$ from the neighboring cameras $C_{j_m}$ at distinct reception instants $k_c''$, SA-PF predicts the particles $\tilde{x}_{t,k_c'',3D}^{i,j_m,(l)}$ and updates the weights $\hat{\omega}_{t,k_c}^{i,(l)}$ as in Equations (18) and (19), respectively.
Initially, $\hat{\omega}_{t,k_c}^{i,j_0,(l)} = \hat{\omega}_{t,k_c}^{i,(l)}$ as described above, and the predicted particles $\tilde{x}_{t,k_c'',3D}^{i,j_m,(l)}$ are:
$$\tilde{x}_{t,k_c'',3D}^{i,j_m,(l)} = E\{ x_{t,k_c'',3D}^{j_m,(l)} \mid x_{t,k_c,3D}^{i,(l)} \} = F(k_c'',k_c)\, x_{t,k_c,3D}^{i,(l)}$$
The corresponding weights $\hat{\omega}_{t,k_c}^{i,j_m,(l)}$, $l = 1, \ldots, L$, are updated sequentially according to:
$$\tilde{\omega}_{t,k_c}^{i,j_m,(l)} = \hat{\omega}_{t,k_c}^{i,j_{m-1},(l)}\, p(z_{t,k_c'',3D}^{j_m} \mid \tilde{x}_{t,k_c'',3D}^{i,j_m,(l)}), \qquad m = 1, \ldots, n(N_{j,i})
where $p(z_{t,k_c'',3D}^{j_m} \mid \tilde{x}_{t,k_c'',3D}^{i,j_m,(l)})$ is the likelihood function in Equation (21). The weights in Equation (19) are normalized according to $\hat{\omega}_{t,k_c}^{i,j_m,(l)} = \tilde{\omega}_{t,k_c}^{i,j_m,(l)} / \sum_{l=1}^{L} \tilde{\omega}_{t,k_c}^{i,j_m,(l)}$. The set of particles and normalized weights $\{ \tilde{x}_{t,k_c'',3D}^{i,j_m,(l)}, \hat{\omega}_{t,k_c}^{i,j_m,(l)} \}_{l=1}^{L}$ represents the “partial posterior” $p(\tilde{x}_{t,k_c,3D}^{i(j_0:j_m)} \mid z_{t,k_c,3D}^{i(j_0:j_m)}, z_{t,1:k_c^-,3D})$. The conditioning is on all previous measurements $z_{t,1:k_c^-,3D}$ and on all current measurements $z_{t,k_c,3D}^{i(j_0:j_m)}$ received at camera $C_i$ during the fusion window at distinct instants $k_c''$, including the local $z_{t,k_c,3D}^{i}$ and up to the neighboring camera $C_{j_m}$, as indicated by the superscript $(j_0:j_m)$ [17].
This update in SA-PF is sequential until the last remaining measurement from the neighboring cameras is received within the fusion window, and then the “global posterior” weights [17] become $\omega_{t,k_c}^{i,(l)} = \hat{\omega}_{t,k_c}^{i,j_{n(N_{j,i})},(l)}$. The set of particles and respective normalized weights is $\{ x_{t,k_c,3D}^{i,(l)}, \omega_{t,k_c}^{i,(l)} \}_{l=1}^{L}$, and the state estimate is $x_{t,k_c,3D}^{i} = \sum_{l=1}^{L} \omega_{t,k_c}^{i,(l)} x_{t,k_c,3D}^{i,(l)}$. Finally, we propagate the particles $x_{t,k_c,3D}^{i,(l)}$ to the next frame capture instant $k_c + 1$ with the deterministic part of the motion model in Equation (10).
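The sequential weight update over the measurements received within the fusion window can be sketched as follows. Here `gaussian_lik` stands for a likelihood evaluator such as the Gaussian pdf of Equation (21), and the assumption that each received tuple carries the transition matrix from the capture to the reception instant, as well as all names, is hypothetical.

```python
import numpy as np

def sapf_sequential_update(particles, weights, H, received, gaussian_lik):
    """Sequential asynchronous weight update of SA-PF (illustrative sketch).

    particles : (L, n) local particles at the frame capture instant
    weights   : (L,) normalized local weights (initial w^{j0})
    received  : list of (F_prop, z_jm, C_jm) tuples, one per neighbor
                measurement, in order of arrival within the fusion window;
                F_prop propagates from the capture to the reception instant
    gaussian_lik : callable returning p(z | H x) for an array of particles
    """
    w = weights.copy()
    for F_prop, z_jm, C_jm in received:
        pred = particles @ F_prop.T              # particles at the reception instant
        w = w * gaussian_lik(pred @ H.T, z_jm, C_jm)
        w /= w.sum()                             # sequential normalization
    x_est = w @ particles                        # global posterior estimate at k_c
    return w, x_est
```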

4. Simulation Results

4.1. Evaluation Setup

We use MATLAB [42] and the PETS2009 database [27] with a camera network composed of $N_s = 8$ cameras. The targets move within the surveillance volume of a network of calibrated cameras, whose construction (intrinsic) parameters and poses are given in the header of the PETS2009 image sequence dataset file. This dataset is challenging for tracking tasks because targets change direction, cluster, and scatter frequently. Scene illumination and obstacles also hinder tracking. In this experiment, the frame sampling rate was 7 frames/s (an inter-frame period of approximately 143 ms), with a total of 794 frames per camera; each time step k therefore lasts approximately 143 ms. Figure 18 shows all FoVs projected on the ground reference plane for each of the eight cameras. Portions of some FoVs do not project on the ground and instead cover the scenario above ground. The surveillance area of the cameras is approximately 60 m × 90 m, as depicted in Figure 18, and their FoVs overlap. Moreover, the neighborhoods of each camera are defined by the network communication topology, also shown. The chosen topology reflects two camera neighborhoods (clusters), one with cameras having large FoV areas projected on the ground and the other consisting of cameras with smaller projections. Communication between the clusters occurs through cameras 4 and 8 and through cameras 1 and 5.
BAF assumes the information pdfs are Gaussian whereas PF makes no such assumption. They are evaluated under the two strategies (a) and (b) previously explained.
The random process modeling the processing delay in frame units is $\tau_k^i \sim U\{0, \tau_{max}\}$, where $U$ denotes the uniform distribution. A uniform distribution is also used for the offset $\alpha_{ij}$ between cameras: $\alpha_{ij} \sim U\{0, \alpha_{max}\}$. Denote $\Lambda_{i,r} \subset [1, 749]$ as the set of frame capture instants during the r-th run, $r = 1, \ldots, M$, of a Monte Carlo simulation of camera $C_i$ tracking a target, and $P_k^i \subseteq \{1, \ldots, N_t\}$ as the set of targets in the FoV of camera $C_i$ at capture instant $k_c$. The evaluation is based on the average root mean square error:
$$\varepsilon = \frac{1}{M N_c} \sum_{r=1}^{M} \sum_{i=1}^{N_c} \frac{1}{|\Lambda_{i,r}|} \sum_{k \in \Lambda_{i,r}} \frac{1}{|P_k^i|} \sum_{t \in P_k^i} \left\| \hat{z}_{t,k_c,2D,r}^{i} - z_{t,k_c,2D}^{i} \right\|_2^2$$
where $\hat{z}_{t,k_c,2D,r}^{i}$ is the estimated 3D position of the t-th target projected according to the calibrated pinhole model onto the image plane of camera $C_i$ at instant $k_c$ during the r-th run, and $z_{t,k_c,2D}^{i}$ corresponds to the actual measurement in the image plane as provided by the target detection algorithm. The latter does not change with $r$, since the image sequence dataset is given and processed beforehand by the target detection algorithm. We set $\sigma_u^2 = 4$ in Equation (12) and $R^i = C_{t,k_c,3D}^{i}$ in the information filter, with the measurement information pair $(H^{T}(C_{t,k_c,3D}^{i})^{-1} z_{t,k_c,3D}^{i},\; H^{T}(C_{t,k_c,3D}^{i})^{-1} H)$, and in the likelihood function in Equation (21). $F$ and $Q$ are given by the dynamic model of the target. The statistics of the error $\varepsilon$ converged with $M = 12$ runs.
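A sketch of how this average error can be accumulated over runs, cameras, capture instants, and targets is given below; the nested-dictionary layout of `err_2d` is a hypothetical data structure chosen for illustration, not the authors' implementation.

```python
import numpy as np

def average_error(err_2d):
    """Average image-plane error over runs, cameras, instants, and targets (sketch).

    err_2d : err_2d[r][i][k] -> (n_targets, 2) array of differences between
             projected 3D estimates and detected centroids for run r,
             camera i, capture instant k.
    """
    total, count = 0.0, 0
    for r, cams in err_2d.items():
        for i, instants in cams.items():
            cam_sum = 0.0
            for k, diffs in instants.items():
                cam_sum += np.mean(np.sum(diffs ** 2, axis=1))  # average over targets in FoV
            total += cam_sum / max(len(instants), 1)            # average over capture instants
            count += 1                                          # one term per (run, camera)
    return total / max(count, 1)
```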
We skip dataset frames to simulate asynchronous frame capture and frame processing delay. $\varepsilon$ is analyzed by increasing the maximum relative offset $0 \leq \alpha_{max} \leq 40$ (see Figure 19) while keeping the processing delay fixed at each i-th camera and r-th run, $\tau_k^i = \tau^i$. Each camera has $\tau^i$ sampled once from $U\{0, \tau_{max}\}$, $0 \leq \tau_{max} \leq 3$, and it remains the same throughout all runs.
The same procedure is applied to analyze the effect on $\varepsilon$ of increasing $0 \leq \tau_{max} \leq 40$ (see Figure 19) while keeping the relative offset fixed at each i-th camera and r-th run. Each camera has $\alpha^i$ sampled once from $U\{0, \alpha_{max}\}$, with $\alpha_{max} = 0$ (synchronous case) or $0 < \alpha_{max} \leq 4$ (asynchronous case), and it remains the same throughout all runs.
Concerning the particle filter, the likelihood function in Equation (17) is the multivariate Gaussian pdf [43]:
$$p(z_{t,k_c,3D}^{i} \mid x_{t,k_c,3D}^{i,(l)}) = \frac{1}{(2\pi)^{N/2} \det^{1/2}(C_{t,k_c,3D}^{i})} \exp\!\left[ -\tfrac{1}{2} \left( H x_{t,k_c,3D}^{i,(l)} - z_{t,k_c,3D}^{i} \right)^{T} (C_{t,k_c,3D}^{i})^{-1} \left( H x_{t,k_c,3D}^{i,(l)} - z_{t,k_c,3D}^{i} \right) \right],$$
where $N = 3$, since $z_{t,k_c,3D}^{i}$ and $C_{t,k_c,3D}^{i}$ refer to the fused 3D position measurement. The likelihood $p(z_{t,k_c'',3D}^{j_m} \mid \tilde{x}_{t,k_c'',3D}^{i,j_m,(l)})$ in Equation (19) shares the same structure as Equation (21), with the corresponding variables replaced. $L = 1000$ particles are used.
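For reference, the Gaussian likelihood of Equation (21), evaluated for a batch of particles, can be sketched as follows (placeholder names, $N = 3$):

```python
import numpy as np

def gaussian_likelihood(z_pred, z_meas, C):
    """Multivariate Gaussian likelihood used to weight particles (sketch).

    z_pred : (L, 3) predicted 3D positions H x for each particle
    z_meas : (3,) fused 3D position measurement
    C      : (3, 3) measurement covariance
    """
    d = z_meas - z_pred
    C_inv = np.linalg.inv(C)
    quad = np.einsum('li,ij,lj->l', d, C_inv, d)          # per-particle Mahalanobis term
    norm = 1.0 / np.sqrt(((2 * np.pi) ** 3) * np.linalg.det(C))
    return norm * np.exp(-0.5 * quad)
```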
Figure 19 shows the simulation results for strategies (a) and (b). Strategy (a) exchanges LoS vectors among neighboring cameras followed by local estimation using fused 3D position measurements (BAF and PF), whereas strategy (b) employs a fusion window after the time window $K_k^i$, with duration $p = \{1, 2, 4\}$ frames, to additionally process either exchanged information pairs' estimates (BAF) or fused 3D position measurements (SA-PF), beyond the LoS vectors.

4.2. Discussions

Figure 19 shows that the BAF and PF filters present similar performance when varying $\alpha_{max}$ and $\tau_{max}$ in all tested situations for strategy (a), where the LoS vectors exchanged among cameras in the time window $K_k^i$ generate the fused 3D position measurement $z_{t,k_c,3D}^{i}$ and both filters use it to estimate the tracked targets' positions and velocities. The BAF-LLM filter outperforms the SA-PF filter in strategy (b) when $\tau_{max}$ and $\alpha_{max}$ reach 16 and beyond. The initial weights $\hat{\omega}_{t,k_c}^{i,j_0,(l)} = \hat{\omega}_{t,k_c}^{i,(l)}$ in the update phase carry the weights computed at instant $k_c$ over to the reception instant $k_c''$, and the reception instant of the received measurement $z_{t,k_c'',3D}^{j_m}$ is taken as the frame capture instant of the neighboring camera, since SA-PF is not informed of the frame capture time at the neighboring camera. Moreover, just one significantly delayed measurement from a neighboring camera yields a small likelihood value, thus compromising the remaining particle weight updates in the sequential fusion step of SA-PF. The small likelihood in such a condition arises because the particle state estimate undergoes time propagation in Equation (19) from the capture instant to the reception instant, whereas the neighboring camera measurement received asynchronously in the fusion window carries a large delay relative to the local capture instant. Then, if $k_c'' \gg k_c$, the updated estimate is negatively affected by the delay. Therefore, for $\tau_{max}$ and $\alpha_{max}$ below 16 (small values of processing delay and asynchrony), the SA-PF filter produces an average root mean square error similar to that of the BAF-LLM filter; i.e., small $\alpha_{max}$ and $\tau_{max}$ do not degrade SA-PF performance in strategy (b).
As depicted in Figure 19 for all tested $p = \{1, 2, 4\}$, the average root mean square error of SA-PF undergoes sudden degradation relative to BAF-LLM when either the asynchrony or the frame processing delay increases significantly (blue and magenta curves, strategy (b)). BAF-LLM is not susceptible to this degradation because of its prediction step from the frame capture instant to the transmission instant.
One problem is validating the target information received by camera $C_i$ because, as stated previously, different IDs can be assigned to the same target in different cameras. BAF solves this by evaluating the Euclidean distance between the target 3D position estimate in camera $C_i$ and the position estimate received from the neighborhood. In the SA-PF filter, we use the Euclidean distance between the local measurement $z_{t,k_c,3D}^{i}$ and the measurement $z_{t,k_c'',3D}^{j_m}$ received from the neighborhood; i.e., for each information exchange between cameras in the fusion window concerning a potential target, a distance tolerance ensures that only data arising from the same target are fused.
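The distance gate itself is a one-line check; a hypothetical helper, with `tol` standing for the distance tolerance mentioned above, could look like this:

```python
import numpy as np

def same_target(p_local, p_received, tol):
    """Distance gate used to validate exchanged target data (sketch).

    p_local, p_received : 3D positions (estimates in BAF, fused measurements
                          in SA-PF) of the two candidate targets.
    tol : Euclidean distance tolerance; data are fused only if the candidates
          are closer than tol.
    """
    return np.linalg.norm(np.asarray(p_local) - np.asarray(p_received)) <= tol
```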
Target tracking by a camera is lost if the capture interval $T = \alpha_1 + \alpha_2$ exceeds the time the target remains within the camera FoV. Unlike [15], where a fixed number of targets remain within sight of the camera network during the experiment, our distributed target tracking algorithm can track a varying number of targets as long as they move within the FoVs of at least two cameras.
Cameras in close spatial proximity degrade the 3D estimation accuracy for targets at long range, since the 3D measurement from imaging proposed here borrows the concept of depth estimation from the baseline length and the epipolar constraint in stereoscopy.
Additionally, we conducted further evaluation of (1) the effect of camera pose uncertainty, with focus on camera attitude error, and (2) computational complexity of our proposed 3D position measurement approach.
The PETS2009 image sequence dataset is a real benchmark scenario for target tracking with fixed cameras whose calibration parameters and pose data are accurate. Nevertheless, we conduct a Monte Carlo simulation with $M = 20$ runs in which uncertainties are added to the extrinsic calibration parameters of each camera (rotation and translation). To the calibrated rotation angles $\theta_z, \theta_y, \theta_x$ of each camera, in radians, we add Gaussian uncertainty $N(\mu = 0.05\ \mathrm{rad}, \sigma = 0.001\ \mathrm{rad})$. Additive Gaussian uncertainty $N(\mu = 10\ \mathrm{cm}, \sigma = 1\ \mathrm{cm})$ in camera location is also applied. We evaluate the mean square error in Equation (20). The simulation investigated processing delays $\tau = (0, 2, 4, 6, 8, 10, 12, 14, 16, 18)$, maximum asynchrony among cameras $\alpha_{max} = 4$, and the particular asynchrony value of each camera sampled from the uniform distribution $U\{0, \alpha_{max}\}$, i.e., each camera received $0 \leq \alpha \leq \alpha_{max}$.
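A sketch of the per-run pose perturbation, using the uncertainty levels quoted above and a hypothetical helper name, is shown below.

```python
import numpy as np

def perturb_pose(angles, location, rng):
    """Add pose uncertainty for the Monte Carlo sensitivity study (sketch).

    angles   : (3,) calibrated rotation angles (theta_z, theta_y, theta_x) in rad
    location : (3,) calibrated camera location in metres
    rng      : numpy random Generator
    The perturbation levels follow the text: N(0.05 rad, 0.001 rad) on each
    angle and N(10 cm, 1 cm) on each coordinate of the camera location.
    """
    noisy_angles = angles + rng.normal(0.05, 0.001, size=3)
    noisy_location = location + rng.normal(0.10, 0.01, size=3)  # metres
    return noisy_angles, noisy_location
```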
Figure 20 indicates that uncertainties in the cameras' poses degrade the fused 3D position measurement of targets when compared with exact poses, as expected.
We evaluate the computational complexity numerically by increasing the number $n_i$ of targets observed in the field of view of the i-th camera and the number $n_j$ of targets received by the i-th camera from the j-th cameras, resulting in a total of $n_i \cdot n_j$ target combinations. The processing tasks considered for each of the targets are:
(a) Variable $n_i$: each locally detected target has 7 sigma points assigned to the local LoS vector pointing to the t-th target in the image plane;
(b) Variable $n_j$: each received target has 7 sigma points assigned to the received LoS vector pointing to the received t-th target in the image plane;
(c) Intersection search over pairs of LoS vectors from the set of 49 LoS vectors, providing points in 3D space (Equations (6) and (8));
(d) Check of which 3D points pass the distance criterion $d \leq 0.3$ m;
(e) Selection of the 7 LoS vectors, from the set of 49, that pass the above test with the smallest distance $d$; and
(f) UT computation: 3D position measurement mean and covariance obtained at a $k_c$ capture instant of the i-th camera.
We noticed three regions in Figure 21:
(a) From 1 up to 90 targets (i.e., from 1 × 1 up to 90 × 90 target combinations), the elapsed processing time shows an exponential tendency;
(b) From 91 up to 262 targets (i.e., from 91 × 91 up to 262 × 262 combinations), the elapsed processing time shows a linear tendency; and
(c) Above 262 × 262 combinations, the elapsed processing time again presents an exponential tendency.
In the PETS2009 image sequence dataset, the number of targets observed at any capture instant $k_c$ does not exceed 10 simultaneous targets. Therefore, our approach carries out the 3D position measurements with very low processing time.

5. Conclusions

We propose position measurement in 3D space with intersecting vector constraints derived from exchanged LoS vectors in a camera network with known relative poses. Solving for the 3D intersection of the vector constraints is based on computing the point of closest proximity between skew lines. Each camera does not need the construction parameters of the other cameras to obtain the 3D measurements. We then applied our approach to tracking simultaneous targets with a network of asynchronously communicating cameras subject to frame processing delays. The targets enter and leave the fields of view of the cameras in the image sequence dataset.
We use the average root mean square error across all image planes of the asynchronous camera network to measure tracking performance under two conditions: fused 3D position measurement from exchanged LoS vectors with local state estimation (strategy (a)), and fusion of additional information beyond the exchanged LoS vectors (strategy (b)).
BAF and PF present similar performance under strategy (a), which uses local estimation. With regard to strategy (b), involving further fusion, BAF-LLM is superior to SA-PF under large asynchrony and processing delay. Comparing strategies (a) and (b), BAF, PF, and BAF-LLM yield approximately the same performance over a wide range of asynchrony and processing delay. We remark that local estimation in BAF and PF demands less communication than the further fusion used in BAF-LLM and SA-PF. Further investigation should focus on the use of consensus to improve target tracking subject to asynchronous exchange in the camera network.

Author Contributions

Software, T.M.D.G.; Validation, T.M.D.G.; Investigation, T.M.D.G.; Writing—review & editing, J.W.; Supervision, J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
WCN: Wireless Communication Network
BAF: Batch Asynchronous Filter
BAF-LLM: Batch Asynchronous Filter – Lower Load Mode
SA-PF: Sequential Asynchronous Particle Filter
LoS: Line of Sight
FoV: Field of View

Appendix A

Point of minimum distance between skew lines:
The expressions describing lines r and s are as follows:
$$\begin{cases} r = P_1 + \lambda v_1 \\ s = P_2 + \beta v_2 \end{cases}$$
where $\lambda, \beta \in \mathbb{R}_+$ and $v_1, v_2$ are the direction vectors. The distance between any two points on lines r and s can be calculated as follows:
$$d(r,s) = \sqrt{(r_x - s_x)^2 + (r_y - s_y)^2 + (r_z - s_z)^2}.$$
To find the point of minimum distance between the lines, it is necessary to find the parameters $\lambda$ and $\beta$ that minimize the distance function $d(r,s)$. Therefore, $(\lambda^*, \beta^*) = \arg\min_{\lambda,\beta} d(\lambda,\beta)$ is obtained by setting the partial derivatives of $d(r,s)$ to zero:
$$\begin{cases} \dfrac{\partial d}{\partial \lambda} = 0 \\[4pt] \dfrac{\partial d}{\partial \beta} = 0 \end{cases}$$
$$(\mathrm{I})\quad \frac{\partial d}{\partial \lambda} = \frac{1}{2} \left[ (r_x - s_x)^2 + (r_y - s_y)^2 + (r_z - s_z)^2 \right]^{-\frac{1}{2}} \left[ 2 (r_x - s_x) \frac{\partial r_x}{\partial \lambda} + 2 (r_y - s_y) \frac{\partial r_y}{\partial \lambda} + 2 (r_z - s_z) \frac{\partial r_z}{\partial \lambda} \right] = 0,$$
where
$$\frac{\partial r_x}{\partial \lambda} = \frac{\partial (p_{1x} + \lambda v_{1x})}{\partial \lambda} = v_{1x}, \qquad \frac{\partial r_y}{\partial \lambda} = \frac{\partial (p_{1y} + \lambda v_{1y})}{\partial \lambda} = v_{1y}, \qquad \frac{\partial r_z}{\partial \lambda} = \frac{\partial (p_{1z} + \lambda v_{1z})}{\partial \lambda} = v_{1z}.$$
Then,
$$(r_x - s_x) v_{1x} + (r_y - s_y) v_{1y} + (r_z - s_z) v_{1z} = 0$$
$$\left[ p_{1x} + \lambda v_{1x} - (p_{2x} + \beta v_{2x}) \right] v_{1x} + \left[ p_{1y} + \lambda v_{1y} - (p_{2y} + \beta v_{2y}) \right] v_{1y} + \left[ p_{1z} + \lambda v_{1z} - (p_{2z} + \beta v_{2z}) \right] v_{1z} = 0$$
Arranging the terms of these equations:
$$(p_{1x} - p_{2x}) v_{1x} + (p_{1y} - p_{2y}) v_{1y} + (p_{1z} - p_{2z}) v_{1z} + \lambda v_{1x}^2 + \lambda v_{1y}^2 + \lambda v_{1z}^2 - \beta \left( v_{1x} v_{2x} + v_{1y} v_{2y} + v_{1z} v_{2z} \right) = 0$$
$$(\mathrm{A})\quad (P_1 - P_2) \cdot v_1 + \lambda \| v_1 \|^2 - \beta\, v_1 \cdot v_2 = 0$$
$$(\mathrm{II})\quad \frac{\partial d}{\partial \beta} = \frac{1}{2} \left[ (r_x - s_x)^2 + (r_y - s_y)^2 + (r_z - s_z)^2 \right]^{-\frac{1}{2}} \left[ -2 (r_x - s_x) \frac{\partial s_x}{\partial \beta} - 2 (r_y - s_y) \frac{\partial s_y}{\partial \beta} - 2 (r_z - s_z) \frac{\partial s_z}{\partial \beta} \right] = 0$$
$$(r_x - s_x) v_{2x} + (r_y - s_y) v_{2y} + (r_z - s_z) v_{2z} = 0$$
$$(p_{1x} - p_{2x}) v_{2x} + (p_{1y} - p_{2y}) v_{2y} + (p_{1z} - p_{2z}) v_{2z} - \beta \left( v_{2x}^2 + v_{2y}^2 + v_{2z}^2 \right) + \lambda \left( v_{1x} v_{2x} + v_{1y} v_{2y} + v_{1z} v_{2z} \right) = 0$$
$$(\mathrm{B})\quad (P_1 - P_2) \cdot v_2 - \beta \| v_2 \|^2 + \lambda\, v_1 \cdot v_2 = 0$$
From (A) and (B):
$$\lambda = \frac{\beta\, v_1 \cdot v_2 - (P_1 - P_2) \cdot v_1}{\| v_1 \|^2}, \qquad \beta = \frac{(P_1 - P_2) \cdot v_2 + \lambda\, v_1 \cdot v_2}{\| v_2 \|^2}$$
Substituting β in λ , we finally obtain:
$$\begin{cases} \lambda^* = \dfrac{(v_1 \cdot v_2)\left[ (P_1 - P_2) \cdot v_2 \right] - \| v_2 \|^2 \left[ (P_1 - P_2) \cdot v_1 \right]}{\| v_1 \|^2 \| v_2 \|^2 - (v_1 \cdot v_2)^2} \\[8pt] \beta^* = \dfrac{(P_1 - P_2) \cdot v_2 + \lambda^* (v_1 \cdot v_2)}{\| v_2 \|^2} \end{cases}$$
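A direct NumPy implementation of these closed-form parameters, offered as a hypothetical helper rather than the authors' code, is:

```python
import numpy as np

def closest_points(P1, v1, P2, v2):
    """Closest points between skew lines r(l) = P1 + l v1 and s(b) = P2 + b v2.

    Returns (lambda*, beta*) from Appendix A and the minimum distance
    between the two lines (illustrative sketch).
    """
    P1, v1, P2, v2 = map(np.asarray, (P1, v1, P2, v2))
    d12 = P1 - P2
    v1v2 = v1 @ v2
    denom = (v1 @ v1) * (v2 @ v2) - v1v2 ** 2       # zero only for parallel lines
    lam = (v1v2 * (d12 @ v2) - (v2 @ v2) * (d12 @ v1)) / denom
    beta = ((d12 @ v2) + lam * v1v2) / (v2 @ v2)
    r_star = P1 + lam * v1
    s_star = P2 + beta * v2
    return lam, beta, np.linalg.norm(r_star - s_star)
```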

References

  1. Sutton, Z.; Willet, P.; Bar-Shalom, Y. Target Tracking Applied to Extraction of Multiple Evolving Threats from a Stream of Surveillance Data. IEEE Transac. Comput. Soc. Syst. 2021, 8, 434–450. [Google Scholar] [CrossRef]
  2. Song, B.; Kamal, A.T.; Soto, C.; Ding, C.; Farrel, J.A.; Roy-Chowdhury, A.K. Tracking and Activity Recognition Through Consensus in Distributed Camera Networks. IEEE Transac. Imag. Proc. 2010, 19, 2564–2579. [Google Scholar] [CrossRef] [PubMed]
  3. Olfati-Saber, R. Distributed Kalman filtering for sensor networks. In Proceedings of the 2007 46th IEEE Conference on Decision and Control, New Orleans, LA, USA, 12–14 December 2007; pp. 5492–5498. [Google Scholar] [CrossRef]
  4. Solomon, C.; Breckon, T. Fundamentos de Processamento de Imagens: Uma Abordagem Prática com Exemplos em Matlab; LTC: Rio de Janeiro, Brazil, 2013. (In Portuguese) [Google Scholar]
  5. Chagas, R.; Waldmann, J. Difusão de medidas para estimação distribuída em uma rede de sensores. In Proceedings of the XI Symposium on Electronic Warfare, São José dos Campos, Brazil, 29 September–2 October 2009; pp. 71–75. (In Portuguese). [Google Scholar]
  6. Beaudeau, J.; Bugallo, M.F.; Djurić, P.M. Target tracking with asynchronous measurements by a network of distributed mobile agents. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012; pp. 3857–3860. [Google Scholar] [CrossRef]
  7. Casbeer, D.W.; Beard, R. Distributed information filtering using consensus filters. In Proceedings of the American Control Conference, St. Louis, MO, USA, 10–12 June 2009; pp. 1882–1887. [Google Scholar] [CrossRef]
  8. Garcia-Fernandez, A.; Grajal, J. Asynchronous Particle Filter for tracking using non-synchronous sensor networks. Sig. Proc. 2011, 91, 2304–2313. [Google Scholar] [CrossRef]
  9. Wymeersch, H.; Zazo, V.S. Belief consensus algorithms for fast distributed target tracking in wireless sensor networks. Sig. Process. 2014, 95, 149–160. [Google Scholar] [CrossRef] [Green Version]
  10. Ji, H.; Lewis, F.L.; Hou, Z.; Mikulski, D. Distributed information-weighted Kalman consensus filter for sensor networks. Automatica 2017, 77, 18–30. [Google Scholar] [CrossRef] [Green Version]
  11. Khaleghi, B.; Khamis, A.; Karray, F.O.; Razavi, N.S. Multisensor data fusion: A review of the state-of-the-art. Inf. Fusion 2013, 14, 28–44. [Google Scholar] [CrossRef]
  12. Taj, M.; Cavallaro, A. Distributed and Decentralized Multicamera Tracking. IEEE Sign. Process. Mag. 2011, 28, 46–58. [Google Scholar] [CrossRef]
  13. Hlinka, O.; Hlawatsch, F.; Djurić, P.M. Distributed Sequential Estimation in Asynchronous Wireless Sensor Networks. IEEE Sig. Proc. Lett. 2015, 22, 1965–1969. [Google Scholar] [CrossRef]
  14. Olfati-Saber, R. Kalman-Consensus Filter: Optimality, stability, and performance. In Proceedings of the 48h IEEE Conference on Decision and Control (CDC) Held Jointly with 2009 28th Chinese Control Conference, Shanghai, China, 15–18 December 2009; pp. 7036–7042. [Google Scholar] [CrossRef]
  15. Katragadda, S.; Cavallaro, A. A batch asynchronous tracker for wireless smart-camera networks. In Proceedings of the 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017; pp. 1–6. [Google Scholar] [CrossRef]
  16. Yang, R.; Bar-Shalom, Y. Full State Information Transfer Across Adjacent Cameras in a Network Using Gauss Helmert Filters. J. Advan. Inform. Fus. 2022, 17, 14–28. [Google Scholar]
  17. Zhu, G.; Zhou, F.; Xie, L.; Jiang, R.; Chen, Y. Sequential Asynchronous Filters for Target tracking in wireless sensor networks. IEEE Sens. J. 2014, 14, 3174–3182. [Google Scholar]
  18. Olfati-Saber, R.; Sandell, N.F. Distributed tracking in sensor networks with limited sensing range. In Proceedings of the 2008 American Control Conference, Seattle, WA, USA, 11–13 June 2008; pp. 3157–3162. [Google Scholar] [CrossRef]
  19. Julier, S.; Uhlmann, J.; Durrant-Whyte, H.F. A new method for the nonlinear transformation of means and covariances in filters and estimators. IEEE Trans. Autom. Cont. 2000, 45, 477–482. [Google Scholar] [CrossRef] [Green Version]
  20. Battistelli, G.; Chisci, L. Kullback-Leibler average, consensus on probability densities, and distributed state estimation with guaranteed stability. Automatica 2014, 50, 707–718. [Google Scholar] [CrossRef]
  21. Katragadda, S.; Regazzoni, C.S.; Cavallaro, A. Average consensus-based asynchronous tracking. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 4401–4405. [Google Scholar] [CrossRef]
  22. Kushwaha, M.; Koutsoukos, X. Collaborative 3D Target Tracking in Distributed Smart Camera Networks for Wide-Area Surveillance. J. Sens. Actuator Netw. 2013, 2, 316–353. [Google Scholar] [CrossRef] [Green Version]
  23. Matteo, Z.; Matteo, L.; Marangoni, T.; Mariolino, C.; Marco, T.; Manuela, G. 3D Tracking of Human Motion Using Visual Skeletonization and Stereoscopic Vision. Front. Bioeng. Biotech. 2020, 8. [Google Scholar] [CrossRef]
  24. Zivkovic, Z. Wireless smart camera network for real-time human 3D pose reconstruction. Comp. Vis. Im. Underst. 2010, 114, 1215–1222. [Google Scholar] [CrossRef]
  25. Giordano, J.; Lazzaretto, M.; Michieletto, G.; Cenedese, A. Visual Sensor Networks for Indoor Real-Time Surveillance and Tracking of Multiple Targets. Sensors 2022, 22, 2661. [Google Scholar] [CrossRef] [PubMed]
  26. Yan, J.; Liu, H.; Pu, W.; Bao, Z. Decentralized 3-D Target Tracking in Asynchronous 2-D Radar Network: Algorithm and Performance Evaluation. IEEE Sens. J. 2017, 17, 823–833. [Google Scholar] [CrossRef]
  27. PETS 2009 Benchmark Data. Available online: http://cs.binghamton.edu/~mrldata/pets2009 (accessed on 21 August 2022).
  28. Rottmann, M.; Maag, K.; Chan, R.; Hüger, F.; Schlicht, P.; Gottschalk, H. Detection of False Positive and False Negative Samples in Semantic Segmentation. In Proceedings of the 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 9–13 March 2020; pp. 1351–1356. [Google Scholar] [CrossRef]
  29. Fang, K.; Xiang, Y.; Li, X.; Savarese, S. Recurrent Autoregressive Networks for Online Multi-Object Tracking. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Lake Tahoe, NV, USA, 12–15 March 2018; pp. 466–475. [Google Scholar] [CrossRef]
  30. Cavallaro, A.; Aghajan, H. Multi-Camera Networks: Principles and Applications; Academic Press: Cambridge, MA, USA, 2009. [Google Scholar]
  31. Corke, P. Robotics, Vision, and Control; International Publishing AG: Cham, Switzerland, 2017. [Google Scholar]
  32. Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
  33. Lao, D.; Sundaramoorthi, G. Minimum Delay Object Detection from Video. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5096–5105. [Google Scholar] [CrossRef] [Green Version]
  34. Matsuyama, T.; Ukita, N. Real-time multitarget tracking by a cooperative distributed vision system. Proc. IEEE 2002, 90, 1136–1150. [Google Scholar] [CrossRef]
  35. Konstantinova, P.; Udvarev, A.; Semerdjiev, T. A study of a Target Tracking Algorithm Using Global Nearest Neighbor Approach. In Proceeding of the 4th International Conference on Computer Systems and Technology-CompSysTech, Rousse, Bulgaria, 19–20 June 2003; pp. 290–295. [Google Scholar]
  36. Munkres, J. Algorithms for the assignment and transportation problems. J. Soc. Indus. Appl. Mathem. 1957, 5, 32–38. [Google Scholar] [CrossRef] [Green Version]
  37. Tsai, R.Y. A Versatile Camera Calibration Techniques for High-Accuracy 3D Machine Vision Metrology Using Off-the-Shelf TV Cameras and Lenses. IEEE J. Robot. Autom. 1987, 3, 323–344. [Google Scholar] [CrossRef] [Green Version]
  38. Camargo, I.; Boulos, P. Geometria Analítica um Tratamento Vetorial, 3rd ed.; Pearson: São Paulo, Brazil, 2005. (In Portuguese) [Google Scholar]
  39. Ferreira, P.P.C. Cálculo e Análise Vetoriais com Aplicações Práticas; Moderna: Rio de Janeiro, Brazil, 2012; Volume 1. (In Portuguese) [Google Scholar]
  40. Wan, E.A.; Van Der Merwe, R. The unscented Kalman filter for nonlinear estimation. In Proceedings of the IEEE 2000 Adaptive Systems for Signal Processing, Communications, and Control Symposium (Cat. No.00EX373), Lake Louise, AB, Canada, 4 October 2000; pp. 153–158. [Google Scholar] [CrossRef]
  41. Bar-Shalom, Y.; Li, X.R.; Kirubarajan, T. Estimation with Applications to Tracking and Navigation: Theory, Algorithms and Software; Wiley Online Library: Hoboken, NJ, USA, 2002. [Google Scholar]
  42. Katragadda, S. Available online: http://www.eecs.qmul.ac.uk/~andrea/software.htm (accessed on 10 March 2022).
  43. Thrun, S.; Burgard, W.; Fox, D. Probabilistic Robotics; MIT Press: Cambridge, MA, USA, 2005. [Google Scholar]
Figure 1. Targets detected within the FoVs of Cameras 1, 2, 3, and 7 and the corresponding assigned identification numbers [27]. Red dots indicate the centroid of each target.
Figure 2. Pinhole camera model C i : {C} represents the camera coordinate system; {W} represents the world coordinate system; O c , i represents the optical center of the i-th camera expressed in the camera coordinate system; O w represents the origin of the world coordinate system; f is the focal length; P d is the distorted point given in pixels;   P u represents the corrected point given in pixels; P w is a point in the 3D space represented in world coordinates {W}; Z c = f is the camera image plane.
Figure 3. Camera optical center and LoS vector representation in the world coordinate system {W}. The constraint vector r t , k , i w results from summing the translation vectors t c w i and the extended LoS vector λ L o S t , k , i w ,   λ ,   λ > 1 .
Figure 4. LoS vector L o S 1 , k , 1 w and respective vector constraint r 1 , k , 1 w ( λ ) from Camera 1 to the detected target with I d = 1 . C 1 = t c w 1 : location of the optical center of Camera 1. Target 1: position of Target 1 in the image plane giving rise to LoS vector L o S t , k , i w .
Figure 5. Determining the distance between vector constraints r t , k , i w ( λ ) and r t , k , j w ( β ) . The tolerance d m i n δ indicates whether LoS vectors L o S t , k , i w and L o S t , k , j w point at the same target.
Figure 6. Top and front views of a detected target centroid trajectory within a time window K k i .
Figure 7. LoS vectors from Camera 1 to Target 1 L o S 1 , k , 1 w ,   k = 1 , , 4 in the image plane in {W} coordinates during the time window K k 1 and the interpolated trajectory. The data will be stored in the information vectors i 1 , k , 1 w = { k ; O c , 1 w ; L o S 1 , k , 1 w } for each k-th instant within the time window K k 1 .
Figure 8. Information vector storage buffer i t , k , i w of the t-th target in the i-th camera, k = k 1 , , k n 1 .
Figure 9. Buffer 1: locally obtained information vectors stored in the i-th camera. Buffer 2: information vectors received from neighboring cameras at different time instants within the time window period. The information vectors are temporally aligned for later 3D proximity metric computation. Reception of two or more information vectors may occur at the same time instant within the time window K k i , as shown at time k 3 . Colored arrows indicate temporal alignment of local information vectors with received information vectors.
Figure 10. Front view of the i-th camera with a detected target centroid trajectory. τ k c i denotes the image processing delay incurred measuring z t , k c , 3 D i and the local estimation phase. T = α 1 + α 2 is the inter-frame interval. K k i = [ k c α 1 , k c + α 2 ] : time window. Transmission of information vectors I t , k c + τ k c i , i w occurs at k c + τ k c i whereas the fusion phase occurs at the end of the time window, i.e., at k c + α 2 . Instants k are integer multiples of the frame capture period T.
Figure 11. Application of the UT transformation to approximate the 3D position measurement error statistics arising from the uncertainty in LoS vector L o S t , k , i w in the image plane. This uncertainty in the LoS vector affects the output error statistics of the proximity metric algorithm that determines the distance between constraint vector pairs r t , k , i w and r t , k , j w (the latter not shown for the sake of clarity) and consequently impacts the 3D position measurement. The resulting seven points in 3D space are approximated with a normal distribution N ( μ f k , C f k ) . Subscript f here means “final” in the UT 3D position mean and covariance determined from averaging with respect to the neighboring cameras tracking the same target centroid at the same time instant for each instant k within a temporal window. 3D motion of the imaged targets is not constrained to a ground reference plane. 3D position is in r t , k , i w ( λ * ) when the distance tolerance between r t , k , i w ( λ * ) and r t , k , j w ( β * ) is d m i n δ .
Figure 12. 7 Sigma points found after applying UT to noisy LoS vectors in the image plane (optical axis included) undergoing the proximity metric algorithm to determine the distance between vector constraint pairs and the corresponding set of points in 3D space with distribution N ( μ f k , C f k ) .
Figure 13. Set of 3D points resulting from proximity determination between constraints $r_{1,k,1}^{w}$ and $r_{t,k,j}^{w}$, $j = 2, 3, 4, 5$, for target Id = 1 in Camera 1. Information vectors received from neighboring cameras can include LoS vectors to different targets or to the same target but with a different Id from the one issued in Camera 1.
Figure 14. An instance of distinct UT means μ j k ,   j = 1 , , n in 3D space at instant k for the sets of 3D points shown in Figure 13.
Figure 15. Final 3D means μ f k at instants k = 13, k = 14, and k = 15 after averaging their respective simultaneous UT means μ j k seen in Figure 14.
Figure 16. After each time window, the algorithm enters the fusion phase, where information is exchanged asynchronously within a fusion window of size p. $k_c'$ is the transmission instant of $C_i$, and $k_c''$ represents the reception instants in $C_i$ from neighboring $C_j$ within the fusion window. Content exchanged within the fusion window: information pair $(\tilde{y}_{t,k_c',3D}^{i,i}, \tilde{Y}_{t,k_c',3D}^{i,i})$ in BAF, and fused 3D position measurement and covariance $(z_{t,k_c,3D}^{i}, C_{t,k_c,3D}^{i})$ in SA-PF.
Figure 17. In strategy (a) only the fused 3D position measurement and local estimation to the next time window K k + 1 i are involved. Three steps are shown for strategy (b) where the blue arrows indicate the actions the filter goes through. The fused 3D measurement z t , k c , 3 D i and covariance C t , k c , 3 D i correspond to the frame capture instant k c .
Figure 18. Top: network communications topology. Bottom: projection of the FoVs from each of the N s = 8 cameras on the ground reference plane.
Figure 19. Performance with variation of α m a x and   τ m a x with exchange of just LoS vectors for fused 3D position measurement followed by local estimation (strategy a) and exchange of additional content other than LoS vectors for further fusion (strategy b) in fusion window with duration p = {1,2,4} frames.
Figure 20. Mean square error performance ϵ in pixels with added uncertainties in cameras’ poses.
Figure 21. Elapsed time to process tasks (a)–(f) as a function of the total number of targets n i · n j .
Table 1. Analytical comparison.

| Ref. | Acronym | Processing Delays | Aligns Received Data to Local Capture Instant | Data Exchanged among Camera Nodes | Position Representation | Fusion Method |
|------|---------|-------------------|-----------------------------------------------|-----------------------------------|-------------------------|---------------|
| [16] | GHF | no | no | State estimates | Polar coordinates | No fusion |
| [25] | Decentralized EKF | no | no | State estimates | Cartesian coordinates | Sequential |
| [26] | DAT2TF | no | no | State estimates | Polar coordinates | Sequential |
| [15] | BAF-LLM | yes | yes | State estimates | Cartesian coordinates | Batch |
| [17] | SA-PF | no | no | Measurements | Cartesian coordinates | Sequential |
| This paper [15] | BAF-LLM, local estimation ¹ | yes | yes ³ | (1) LoS vectors to measure 3D positions | Cartesian coordinates | Batch |
| This paper [15] | BAF-LLM, with further fusion | yes | yes | (2) LoS vectors to measure 3D positions and information pairs' estimates | Cartesian coordinates | Batch |
| This paper [17] | PF, local estimation ² | no | yes ³ | (1) LoS vectors to measure 3D positions | Cartesian coordinates | Batch |
| This paper [17] | SA-PF, with further fusion | no | no | (2) LoS vectors to measure 3D positions and fused 3D position measurements | Cartesian coordinates | Sequential |

¹ Kalman Filter in information form. ² Particle Filter. ³ 3D position measurements from asynchronously received LoS vectors undergo interpolation to the local capture instant for the purpose of time alignment. See Section 2.1.6.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
