Visual Sensor Networks for Indoor Real-Time Surveillance and Tracking of Multiple Targets

The recent trend toward the development of IoT architectures has entailed the transformation of the standard camera networks into smart multi-device systems capable of acquiring, elaborating, and exchanging data and, often, dynamically adapting to the environment. Along this line, this work proposes a novel distributed solution that guarantees the real-time monitoring of 3D indoor structured areas and also the tracking of multiple targets, by employing a heterogeneous visual sensor network composed of both fixed and Pan-Tilt-Zoom (PTZ) cameras. The fulfillment of the twofold mentioned goal was ensured through the implementation of a distributed game-theory-based algorithm, aiming at optimizing the controllable parameters of the PTZ devices. The proposed solution is able to deal with the possible conflicting requirements of high tracking precision and maximum coverage of the surveilled area. Extensive numerical simulations in realistic scenarios validated the effectiveness of the outlined strategy.


Introduction
A Visual Sensor Network (VSN) is a multi-agent system constituted of a collection of spatially distributed smart cameras. In recent years, due to the ever-improving sensing and computational capabilities and the reduced cost of visual sensors, such architectures have gained popularity and are currently employed in many tasks, ranging from the more traditional surveillance and security scenarios to the cutting-edge IoT applications, e.g., in environmental monitoring, sports, and education contexts [1][2][3].
The ongoing IoT-driven incentive towards the development of cooperative distributed solutions leads to the progressive substitution of the traditional centralized VSNs made up of static devices, namely cameras having a fixed pose (i.e., position and orientation), in favor of more flexible, decentralized, and intelligent systems [4][5][6]. The modern VSNs do not generally envisage a central computing unit, but accomplish the assigned task by relying on a distributed approach. Moreover, they often include PTZ cameras, namely visual sensors having controllable pan and tilt angles and zoom parameters and, thus, characterized by variable orientations and fields of view. The introduction of these new visual sensors enhances the VSNs' scalability and robustness and, at the same time, allows reducing both the number of cameras needed to cover a given area and the communication and computational burden imposed by data exchange. On the other hand, to take full advantage of the PTZ cameras, it is necessary to dynamically optimize their parameters depending on some external factors, including their (fixed) position, the potential occlusions, the occurrence of failures, and/or targets to follow.
In this work, the attention is focused on a VSN made up of both fixed and PTZ cameras, required to monitor a given area with the purpose of detecting and tracking one or more targets. In particular, the targets need to be estimated in correspondence with the blind spots [1]. Then, the presence of multiple targets entails a tradeoff between the surveilled area coverage and the target view image resolution.
To conclude this non-exhaustive survey of the related works, many approaches to address target tracking are described in the literature. Most of them envisage the exploitation of filtering and prediction tools, such as particle filters, Bayesian estimation techniques, and (extended) Kalman Filters (KFs) [1]. In detail, the approaches based on the KF are extensively used when dealing with real-time applications and often require distributed computations [20,21]. For instance, in [22], a distributed KF was studied for sensor networks with a limited sensing range, and an extended version of the same approach was investigated in [16] to perform 2D tracking employing a PTZ camera network. In [23], instead, a solution based on particle filters was considered for decentralized tracking of groups of people or individuals. Finally, in [24], a decentralized framework was presented for cooperative self-localization and multi-target tracking via Gaussian filters.

Contributions
Accounting for a heterogeneous VSN made up of both fixed and PTZ cameras, this work presents a strategy aiming at ensuring the real-time surveillance of a structured indoor environment, as well as multi-target tracking. The outlined procedure envisages the (optimal) selection of the adjustable parameters of the dynamic devices composing the network, resting on the game theory approach proposed in [16].
With respect to [16], the novel aspects of this work derive from the focus on real-world scenarios, wherein it is desirable to limit both the costs (in terms of employed devices) and the task execution time. In [16], the case study consisted of a single unstructured 2D environment monitored by a broad set of PTZ cameras capable of performing both the distributed multi-target tracking and the iterative optimization of their parameters. In particular, the target tracking method proposed therein is based on an Extended Kalman Filter (EKF), while the parameter selection rests on the iterative solution of a computationally demanding optimization problem. In this work, instead, great attention is devoted to the real-world environment. The aforementioned twofold goal is, indeed, addressed by accounting for a 3D scenario consisting of a structured area composed of multiple connected rooms, and the available a priori information on the physical space partition is exploited in the solution process. In addition, the considered VSN involves a limited number of (highly expensive) PTZ cameras, while also including (low-cost) static devices. Specifically, the static devices permit efficiently managing the transitions between different areas, thus streamlining the PTZ parameter selection. In this direction, their presence also favors network scalability since the reconfiguration of the dynamic devices can be accomplished in parallel in the different rooms.
More in detail, in this work, the double monitoring and tracking problem is addressed by modeling the targets as 3D point particles whose position is characterized by a non-null uncertainty and overcoming the concept of a planar occupancy grid. The selection of the PTZ parameters implies the identification of the optimal values for both the pan and tilt angles, jointly with the zoom (when accounting for the 2D context, only the pan angle is generally considered in the PTZ parameter selection). Inspired by the game theory approach proposed in [16], we propose an update procedure for the orientation of the PTZ cameras based on the optimization of a certain utility function. In particular, the latter is defined accounting for some criteria that are new with respect to [16], since the intent is to reduce the computational complexity of the parameter selection process because of the real-time constraints on the task's execution. The utility function is maximized in a distributed manner via an iterative negotiation mechanism among some PTZ devices. In particular, differently from [16], the set of PTZ cameras involved in the mentioned negotiation procedure is determined based on the a priori knowledge of the environment in terms of structure and devices' placement. Such information is further exploited to allow for the parallelization of the parameter selection by multiple independent groups of PTZ cameras.
The principal advantage of the solution proposed in this work consists of its versatility and flexibility. Indeed, by conveniently choosing the weights that regulate the contributions in the utility function, it is possible to prioritize the monitoring task with respect to the tracking task, or vice versa. Along this line, the results of the conducted simulative campaign demonstrated its effectiveness in handling the tradeoff between the tracking precision and the image resolution, especially in the critical scenarios. Moreover, the heterogeneous nature of the network and the outlined distributed approach allow the parallelization of the parameter selection and tracking tasks, resulting in a framework that can be easily scaled up to larger and more complex environments.
As a final remark, we emphasize that, although the problems related to targets' detection and partial/complete loss are not directly taken into account in this work, some possible actions to face these issues are discussed.

Paper Structure
This paper is organized as follows. In Section 2, after the problem statement, the application scenario is described and modeled. In Section 3, the elements necessary to perform real-time multi-target tracking are outlined. In particular, Section 3.1 illustrates the distributed tracking algorithm based on the EKF solution, while Section 3.2 discusses the PTZ parameter selection. After that, Section 4 presents the application environment employed to validate the strategy outlined in the previous sections, and Section 5 reports the results of the multiple test scenarios considered. In Section 5.6, a discussion is provided about interesting aspects revealed from the simulation results, together with possible future improvements. Finally, Section 7 reports a summary of the study and contains some final considerations.

Problem Statement, Models, and Assumptions
This section aims at illustrating the application scenario taken into account in this work: a structured and cluttered indoor environment monitored by a heterogeneous VSN. We highlight the considered twofold goal, stating the problem and discussing the models and the assumptions adopted in the design of the proposed solution.

Problem Statement
In this work, the attention is focused on a structured indoor 3D environment $\mathcal{E} \subset \mathbb{R}^3$ composed of $n_R \geq 1$ rooms and characterized by $n_A \geq 1$ access points. This is supposed to be monitored by a VSN made up of $n_C \geq 2$ cameras, divided into $n_S \geq 1$ static visual sensors, i.e., fixed cameras, and $n_D \geq 1$ dynamic visual sensors, namely PTZ cameras. In turn, the fixed cameras are split into $n_{SHR} \geq 0$ high-resolution visual sensors and $n_{SWA} \geq 0$ wide-angle visual sensors.
In this context, we aimed at proposing an effective strategy to fulfill a twofold goal: to ensure the real-time surveillance of the described environment and to guarantee the efficient tracking of $n_T \geq 1$ targets, free to move in the supervised area.

Environment Modeling
To cope with the aforementioned goal, we define the following sets:
• The dynamic and static visual sensors sets $\mathcal{C}^D = \{C^D_1 \ldots C^D_{n_D}\}$ and $\mathcal{C}^S = \{C^S_1 \ldots C^S_{n_S}\}$, respectively, of which the latter is the direct sum of $\mathcal{C}^{SHR}$ and $\mathcal{C}^{SWA}$;
• The visual sensors set $\mathcal{C} = \{C_1 \ldots C_{n_C}\}$ resulting from the direct sum of $\mathcal{C}^D$ and $\mathcal{C}^S$;
• The targets set $\mathcal{T} = \{T_1 \ldots T_{n_T}\}$.
In addition, we also introduce the virtual environment partitions set $\mathcal{P} = \{P_1 \ldots P_{n_P}\}$ and the handout zones set $\mathcal{H} = \{H_1 \ldots H_{n_H}\}$. The former set consists of $n_P \geq n_R$ virtual partitions of the supervised environment; formally, we have that $P_k \subset \mathbb{R}^3$, $\bigcup_{k=1}^{n_P} P_k = \bigcup_{h=1}^{n_R} R_h = \mathcal{E}$, and $P_k \cap P_\kappa = \emptyset$ (disjoint sets) with $k, \kappa \in \{1 \ldots n_P\}$, $k \neq \kappa$. We emphasize that each element of $\mathcal{P}$ can correspond either to a physical room ($n_P = n_R$) or to a part of a physical room ($n_P > n_R$). On the other side, the handout zones set is composed of $n_H \geq n_P$ environment portions corresponding to the transition areas between two adjacent virtual partitions. From a mathematical perspective, it thus holds that $H_p \subset \mathbb{R}^3$, $H_p \cap H_\rho = \emptyset$ with $p, \rho \in \{1 \ldots n_H\}$, $p \neq \rho$, and $H_p \subseteq (P_k \cup P_\kappa)$, $P_k$ and $P_\kappa$ being adjacent virtual partitions. In particular, hereafter, we use the notation $H_{k\kappa}$ ($= H_p$) to indicate the handout zone between the $k$-th and the $\kappa$-th virtual partition; more specifically, we have that $H_{k\kappa} = H^k_{k\kappa} \oplus H^\kappa_{k\kappa}$, with $H^k_{k\kappa} = H_{k\kappa} \cap P_k$ and $H^\kappa_{k\kappa} = H_{k\kappa} \cap P_\kappa$, $k, \kappa \in \{1 \ldots n_P\}$, $k \neq \kappa$. All the introduced sets are listed in Table 1, where we also report the assumptions about their cardinality, many of which derive from the following statements regarding the considered scenario.
a. The high-resolution fixed cameras were used only to monitor the access points in order to guarantee the quick and effective detection of the targets when they enter the environment, thanks to their increased performance, but also because of their cost. Thus, it holds that $n_{SHR} = n_A$;
b. Characterized by ample FOVs at the cost of low resolution, the wide-angle fixed cameras were instead exploited to ensure the best coverage. Hence, we assumed that at least one visual sensor in the set $\mathcal{C}^{SWA}$ is placed in each virtual partition, and in particular, this is located so as to monitor the related handout zones. Consequently, it follows that $n_{SWA} \geq n_P$;
c. Finally, the PTZ cameras were employed to enhance the VSN target tracking capabilities. Therefore, it is reasonable to assume that $n_D \geq n_P$.
At the same time, we highlight that no assumption is stated as regards the specific camera placement within the environment: several existing algorithms allow determining the best sensor location [9,25-27].

Table 1. Principal sets and main assumptions considered in this work (columns: Sets; Assumptions).
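The cardinality assumptions stated above lend themselves to a simple consistency check on a candidate network layout. The following is a minimal sketch, not part of the original work: the `VsnLayout` class and its field names are hypothetical, and the sample values are those of the toy example of Figure 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VsnLayout:
    """Cardinalities of the sets used in the text (field names are illustrative)."""
    n_rooms: int       # n_R
    n_access: int      # n_A
    n_partitions: int  # n_P
    n_handouts: int    # n_H
    n_static_hr: int   # n_SHR
    n_static_wa: int   # n_SWA
    n_ptz: int         # n_D


def violated_assumptions(layout):
    """Return the names of the cardinality assumptions that the layout violates."""
    checks = {
        "n_R >= 1": layout.n_rooms >= 1,
        "n_A >= 1": layout.n_access >= 1,
        "n_P >= n_R": layout.n_partitions >= layout.n_rooms,
        "n_H >= n_P": layout.n_handouts >= layout.n_partitions,
        "n_SHR = n_A": layout.n_static_hr == layout.n_access,
        "n_SWA >= n_P": layout.n_static_wa >= layout.n_partitions,
        "n_D >= n_P": layout.n_ptz >= layout.n_partitions,
    }
    return [name for name, ok in checks.items() if not ok]


# Toy example of Figure 1: 4 rooms, 5 partitions, 5 handout zones, 1 access point.
toy = VsnLayout(n_rooms=4, n_access=1, n_partitions=5, n_handouts=5,
                n_static_hr=1, n_static_wa=5, n_ptz=8)
print(violated_assumptions(toy))  # → []
```

A layout with too few wide-angle cameras would instead be reported through the `"n_SWA >= n_P"` entry.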

Targets' Modeling
Adopting a control system approach, we modeled any target as a point particle acting in 3D space, thus characterized by a (time-varying) position in the global inertial frame $\mathcal{F}_W$, hereafter termed the world frame. Formally, the position of any $j$-th target, $j \in \{1 \ldots n_T\}$, at time $t$ is identified by the vector $p_j(t) \in \mathbb{R}^3$.
Assuming then a first-order dynamics for all the targets, we have that the $j$-th target state at time $t$ is described by the vector $x_j(t) = [p_j^\top(t) \;\; \dot{p}_j^\top(t)]^\top \in \mathbb{R}^6$, stacking its position and velocity in $\mathcal{F}_W$. Moreover, inspired by [16], we assumed that the introduced state evolves according to the following discrete-time dynamics with sampling time $T \in \mathbb{R}^+$:
$$x_j(t+1) = \begin{bmatrix} I_{3\times3} & T\, I_{3\times3} \\ 0_{3\times3} & I_{3\times3} \end{bmatrix} x_j(t) + w_j(t), \tag{1}$$
where $0_{3\times3} \in \mathbb{R}^{3\times3}$ and $I_{3\times3} \in \mathbb{R}^{3\times3}$ denote the square null and identity matrix, respectively. The vector $w_j(t) \in \mathbb{R}^6$ in (1) represents an additive Gaussian noise; in particular, we assumed that $w_j(t) \sim \mathcal{N}(0_6, W_j)$ with $0_6 \in \mathbb{R}^6$ identifying the zero mean vector and $W_j \in \mathbb{R}^{6\times6}$ representing the $j$-th target (known) covariance matrix. The point particle modeling assumption, even if it appears simplistic, allows capturing the basic behavior of a moving subject and focusing on other aspects of interest such as the coordination and cooperation among the VSN devices. We further emphasize that, although depending on the specific case, it is generally possible to relate the output of any object detection algorithm to the assumed representation. For example, if a detected object is modeled by a bounding box, then the center of such a box can be exploited in the point particle model. The uncertainty characterizing this operation can be included in a comprehensive error term affecting the cameras' observations.
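As an illustration of the target model, the sketch below propagates a position-velocity state over one sampling period. It assumes the standard constant-velocity form of the dynamics (position advanced by T times the velocity, velocity unchanged) and, as a further simplification, a diagonal noise covariance described by a single hypothetical `noise_std` value.

```python
import random


def propagate(state, T, noise_std=0.0, rng=random):
    """One step of the assumed constant-velocity dynamics.

    state is [px, py, pz, vx, vy, vz]; noise_std is a hypothetical
    per-component standard deviation (i.e., a diagonal covariance).
    """
    position, velocity = state[:3], state[3:]
    next_position = [p + T * v for p, v in zip(position, velocity)]
    next_state = next_position + list(velocity)
    if noise_std > 0.0:
        # Additive zero-mean Gaussian noise on every state component.
        next_state = [x + rng.gauss(0.0, noise_std) for x in next_state]
    return next_state


# Noise-free step with T = 0.5 s.
print(propagate([0.0, 1.0, 2.0, 2.0, 0.0, -1.0], T=0.5))
# → [1.0, 1.0, 1.5, 2.0, 0.0, -1.0]
```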

Cameras Modeling
In this work, every camera composing the given VSN was modeled as a rigid body having (possibly time-varying) position and orientation, i.e., a pose, in the world frame. Note that these quantities are often referred to as camera extrinsic parameters in the literature. In detail, we denote by $\mathcal{F}_B$ the local frame attached to the device so that the x-axis points upward, the y-axis points to the right, and the z-axis points forward and is aligned with the device optical axis. Then, the $i$-th camera (time-invariant) position in $\mathcal{F}_W$ is identified by the vector $t_{W,i} \in \mathbb{R}^3$, while its orientation with respect to the world frame is represented by the rotation matrix $R_{W2B,i}(t) \in SO(3)$, (potentially) depending on the time $t$. In particular, we assumed that $R_{W2B,i}(t)$ results from the composition of three subsequent rotations around the axes of $\mathcal{F}_B$.
For all the static visual sensors, the orientation is fixed and constant over time, namely $R_{W2B,i}(t) = R_{W2B,i}$, $\forall i \in \{1 \ldots n_S\}$. On the other hand, the dynamic visual sensors are characterized by a partially time-varying orientation. Indeed, a PTZ camera can modify its orientation through a pan and/or a tilt movement, namely through a rotation around the x-axis and/or the y-axis of its $\mathcal{F}_B$ by a certain controllable pan and/or tilt angle, respectively. From a mathematical perspective, we have that $R_{W2B,i}(t) = R_{W2B,i}(\alpha_i(t), \beta_i(t))$, $\forall i \in \{1 \ldots n_D\}$, with $\alpha_i(t)$ and $\beta_i(t)$ hereafter referred to as the $i$-th camera pan and tilt angle, respectively.
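To make the pan and tilt parametrization concrete, the following sketch builds a rotation matrix from a pan angle (about the x-axis) and a tilt angle (about the y-axis), as described above. The composition order used here (pan first, then tilt) is an assumption made for illustration only, since the text does not fix it, and all function names are hypothetical.

```python
import math


def rot_x(angle):
    """Elementary rotation about the x-axis (the pan axis in the text)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def rot_y(angle):
    """Elementary rotation about the y-axis (the tilt axis in the text)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[r][k] * b[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]


def pan_tilt_rotation(pan, tilt):
    """Rotation obtained by applying the pan and then the tilt (assumed order)."""
    return matmul(rot_y(tilt), rot_x(pan))


R = pan_tilt_rotation(math.radians(30.0), math.radians(-10.0))
```

Whatever the order, the result is a proper rotation, so its transpose is also its inverse; this is what allows switching between world and camera coordinates.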
Without loss of generality, some standard assumptions were made also on the intrinsic parameters of all the considered visual sensors: for any camera, the focal length $f$ was assumed to be unitary; no distortion was taken into account; the FOV is defined by a pair of angles determining its height and width. We remark that the PTZ cameras can also dynamically vary their zoom settings; hence, these are characterized by three controllable degrees of freedom. Hereafter, the zoom parameter of the $i$-th dynamic visual sensor, $i \in \{1 \ldots n_D\}$, is referred to as $\zeta_i \geq 0$. In addition, we took into account the maximum distance at which a target can be detected with a satisfactory quality level. Hereafter, this is associated with a minimum pixel density value. More in detail, we assumed that any camera computes the pixel density characterizing the FOV at the distance (along the optical axis) of a certain target: when such a density is lower than a minimum threshold, the target is regarded as not visible.
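The visibility rule based on the pixel density can be sketched as follows. The pinhole footprint formula and the image width in pixels (`n_pixels`) are assumptions introduced here for illustration; the text only states that a target is not visible when the density at its distance falls below a threshold.

```python
import math


def pixel_density(distance, fov_angle, n_pixels):
    """Pixels per unit length across the FOV at a given distance.

    Assumes a pinhole camera: the FOV footprint at the given distance
    along the optical axis is 2 * distance * tan(fov_angle / 2).
    """
    footprint = 2.0 * distance * math.tan(fov_angle / 2.0)
    return n_pixels / footprint


def is_visible(distance, fov_angle, n_pixels, min_density):
    """A target is regarded as visible only if the density is high enough."""
    if distance <= 0.0:
        return False  # target behind the camera or at the optical center
    return pixel_density(distance, fov_angle, n_pixels) >= min_density


# A 90-degree FOV with 1000 pixels: about 100 px per unit length at distance 5.
print(is_visible(5.0, math.radians(90.0), 1000, min_density=50.0))   # → True
print(is_visible(20.0, math.radians(90.0), 1000, min_density=50.0))  # → False
```

Under this model, zooming in narrows the FOV angle and therefore raises the density at a given distance, which extends the range at which a target remains visible.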
Then, when the $j$-th target, $j \in \{1 \ldots n_T\}$, is observed by the $i$-th visual sensor, $i \in \{1 \ldots n_C\}$, at time $t$, its position is projected on the camera image plane. Formally, introducing the (nonlinear) function $h(\cdot) : \mathbb{R}^3 \to \mathbb{R}^2$ mapping any vector $x = [x_1 \; x_2 \; x_3]^\top \in \mathbb{R}^3$ into its projection onto the 2D plane $h(x) = [x_1/x_3 \;\; x_2/x_3]^\top \in \mathbb{R}^2$, we have that the position $z_{ij}(t) \in \mathbb{R}^2$ of the $j$-th target in the $i$-th camera image plane evolves as follows:
$$z_{ij}(t) = h\big(R_{W2B,i}(t)\,(p_j(t) - t_{W,i})\big) + v_{ij}(t), \tag{2}$$
where the vector $v_{ij}(t) \in \mathbb{R}^2$ represents the additive noise deriving from the projection and measurement errors for the $i$-th camera. We assumed that $v_{ij}(t) \sim \mathcal{N}(0_2, V_i(t))$ with $0_2 \in \mathbb{R}^2$ and $V_i(t) \in \mathbb{R}^{2\times2}$; in particular, the covariance matrix was modeled as a diagonal matrix whose trace decreases proportionally to the zoom magnitude when considering PTZ cameras. Note that the camera orientation $R_{W2B,i}$ in (2) is reported as a time-varying quantity since the provided observation model is valid both for the static and dynamic visual sensors.
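The projection step can be illustrated with a few lines of code. This is a noise-free sketch that assumes the measurement is obtained by expressing the target position in the camera frame (through the world-to-camera rotation and the camera position) and then applying the function h with unitary focal length; the function names are hypothetical.

```python
def project(x):
    """Perspective projection h(x) = [x1/x3, x2/x3] with unitary focal length."""
    if x[2] <= 0.0:
        raise ValueError("the point must lie in front of the camera (x3 > 0)")
    return [x[0] / x[2], x[1] / x[2]]


def observe(rotation, camera_position, target_position):
    """Noise-free image-plane position of a target.

    rotation is the 3x3 world-to-camera rotation (nested lists),
    camera_position and target_position are 3D points in the world frame.
    """
    offset = [p - t for p, t in zip(target_position, camera_position)]
    in_camera = [sum(rotation[r][c] * offset[c] for c in range(3))
                 for r in range(3)]
    return project(in_camera)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(observe(identity, [0.0, 0.0, 0.0], [1.0, 2.0, 4.0]))  # → [0.25, 0.5]
```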

VSN Modeling
Motivated by the intent of proposing a distributed solution, we assumed that any $i$-th visual sensor, $i \in \{1 \ldots n_C\}$, composing the given VSN can communicate with the set of cameras placed in the same partition and in the adjacent ones. Formally, defining $\mathcal{C}^S_{P_k}$ and $\mathcal{C}^D_{P_k}$ as the sets of static and dynamic visual sensors located in the $k$-th partition, respectively, we have that all the devices placed in $P_k$ constitute the set $\mathcal{C}_{P_k} = \mathcal{C}^S_{P_k} \oplus \mathcal{C}^D_{P_k}$, $k \in \{1 \ldots n_P\}$. Then, assuming that $C_i \in \mathcal{C}_{P_k}$, we have that the cameras interacting with the $i$-th one at time $t$ correspond to the set $\mathcal{C}_i(t) \subseteq \mathcal{C}_{P_k} \cup \mathcal{C}_{P_\kappa}$, $P_k$ and $P_\kappa$ being adjacent partitions, $k, \kappa \in \{1 \ldots n_P\}$, $k \neq \kappa$. Note that we implicitly made the assumption that all cameras in the network are aware of the partition wherein they are located and also of the related handout zones. Figure 1 aims at clarifying the introduced communication setup through a toy example fulfilling all the assumptions of the scenario taken into account in this work. In the following, the 3D simulation environment is shown using its projection on the 2D floor. One can, indeed, observe that the reported example envisages a structured environment having a single access point, composed of $n_R = 4$ rooms (corresponding to the blue, red, green, and orange areas) and virtually divided into $n_P = 5$ partitions connected by $n_H = 5$ handout zones (dashed portions). We emphasize that the partitions $P_2$ and $P_3$ jointly cover the area associated with the orange room. The VSN is made up of $n_{SHR} = 1$ high-resolution fixed camera (represented by the magenta square) monitoring the unique access point, $n_{SWA} = 5$ wide-angle static visual sensors placed in the environment in order to guarantee the maximum area coverage (identified by the cyan squares), and $n_D = 8$ PTZ cameras located with the purpose of enhancing the network tracking capabilities (denoted by the gray circles).
We point out that the FOV of the visual sensor $C^{SWA}_2$ can cover a portion of both the partitions $P_2$ and $P_3$; in addition, we remark that all the handout zones are potentially monitored by at least one dynamic visual sensor. On the right panel of Figure 1, we highlight the devices' interaction in terms of information exchange: as illustrated, the cameras physically placed in the same partition can communicate among themselves, and these can also share data with the visual sensors located in the adjacent partitions. To conclude, we emphasize that the communication graph is imposed by the considered VSN structure. Similar graph-based descriptions, but with different connection roles among nodes, were employed in [28,29] to characterize the environment structure.
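The communication rule described above (same partition plus adjacent partitions) can be sketched with two plain dictionaries. The camera names, the partition indices, and the adjacency relation below are hypothetical and do not reproduce the layout of Figure 1.

```python
def neighbours(camera, partition_of, adjacent):
    """Cameras that can exchange data with the given camera.

    partition_of maps each camera to its partition index; adjacent maps a
    partition index to the set of partitions sharing a handout zone with it.
    """
    own = partition_of[camera]
    reachable = {own} | adjacent.get(own, set())
    return {other for other, part in partition_of.items()
            if part in reachable and other != camera}


# Hypothetical three-partition corridor: 1 <-> 2 <-> 3.
partition_of = {"PTZ1": 1, "WA1": 1, "PTZ2": 2, "WA2": 2, "PTZ3": 3}
adjacent = {1: {2}, 2: {1, 3}, 3: {2}}

print(sorted(neighbours("PTZ1", partition_of, adjacent)))
# → ['PTZ2', 'WA1', 'WA2']
```

Because the neighbour sets depend only on the fixed placement of the cameras, they can be computed once, which is consistent with the communication graph being imposed by the VSN structure.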

Real-Time Surveillance and Multi-Target Tracking
Accounting for the scenario described in the previous section, we present here a distributed strategy aiming at ensuring the efficient real-time surveillance of the considered environment and the tracking of the $n_T$ targets, by means of the given VSN.
The designed procedure involves three principal actions:
• The targets' detection, executed by both the static and dynamic visual sensors with the intent of extracting information regarding the presence of one or more targets;
• The targets' position estimation and prediction, performed by all the fixed and PTZ cameras having detected one or more targets, mainly to identify the devices involved in the tracking task in the near future;
• The PTZ parameter selection, carried out by all the dynamic visual sensors that are already, or will soon be, engaged in the tracking task, with the purpose of optimizing the real-time performance.
We specify that the first and second actions were performed at a frequency of $1/T$, while the PTZ parameters were optimized every $\ell \geq 1$ steps of duration $T$, namely at a slower frequency of $1/(\ell T)$ with respect to the previous ones. The parameter $\ell$ was selected in order to respect the computational limits of the system while guaranteeing its promptness in reacting to tracking requirements.
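The two execution rates can be visualized with a small scheduling sketch. The name `ell` stands for the number of steps between two consecutive PTZ parameter selections, and running a selection at step 0 is an assumption made here for simplicity.

```python
def schedule(n_steps, ell):
    """Yield (step, run_selection) pairs.

    Detection and estimation run at every step (frequency 1/T), whereas the
    PTZ parameter selection runs once every `ell` steps (frequency 1/(ell*T)).
    """
    if ell < 1:
        raise ValueError("ell must be at least 1")
    for step in range(n_steps):
        yield step, step % ell == 0


selection_steps = [step for step, run in schedule(7, ell=3) if run]
print(selection_steps)  # → [0, 3, 6]
```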
In the rest of the section, the attention is focused on the outlined methods for the estimation and prediction of the targets' position and for the determination of the most suitable parameters for the PTZ cameras. Conversely, we do not explicitly account for the targets' detection, assuming that this action is accurately performed by resting upon one of the existing and well-proven techniques. On the other hand, we remark that the designed solution permits the parallelization of the computations. In detail, observing that both the targets' detection and the PTZ parameter selection entail a high computational burden, these two operations can be concurrently executed by distributing the workload between two computing cores, if possible. Indeed, the optimization process depends only on the information gathered at every $\ell$-th step about the predicted target state.

Targets' Position Estimation and Prediction
To efficiently fulfill the tracking task, it is well known that a fundamental step consists of the accurate estimation of the current position of the targets and also of their future trajectory. In the proposed strategy, we address this issue by suitably extending the distributed consensus-based EKF approach presented in [16] to the case of targets moving in 3D space (rather than 2D).
To better clarify the adopted approach, summarized in Algorithm 1, we focus on a generic $j$-th target, $j \in \{1 \ldots n_T\}$, assuming that this is detected by a set $\mathcal{C}_{T_j}(t) \subset \mathcal{C}$ of cameras in the network. Observe that, according to (2), any $i$-th device in the aforementioned set can retrieve the projection $z_{ij}(t)$ of the target position onto its image plane jointly with the corresponding covariance $V_i$ (Line 2). This then allows computing the quantities $r_{ij}$ and $U_{ij}$ introduced in [16] (Lines 4-5). These are subsequently communicated to the devices set $\mathcal{C}_i(t) \subseteq \mathcal{C}$, distinguishing between the following situations (Lines 6-8):
• If the target is in $P_k \setminus H_{k,\kappa}$, $\forall H_{k,\kappa} \in \mathcal{H}$, then the $i$-th device communicates with all the other cameras in the same partition, namely $\mathcal{C}_i(t) = \mathcal{C}_{P_k}$;
• If the target is in $H^k_{k,\kappa}$ for any $\kappa \in \{1 \ldots n_H\}$, then the $i$-th device communicates with all the cameras in the same and in the adjacent partition $P_\kappa$, i.e., $\mathcal{C}_i(t) = \mathcal{C}_{P_k} \oplus \mathcal{C}_{P_\kappa}$.

Algorithm 1 DISTRIBUTED CONSENSUS-BASED EKF
1: for any detected target $T_j$ do
2: compute $z_{ij}(t)$ as in (2) and the corresponding $V_i$
3: information fusion
10: EKF a posteriori estimation
13: compute $M_{ij}(t) = (P_{ij}^{-1}(t) + S_{ij}(t))^{-1}$ (error covariance matrix)
14: (target state estimate)
15: EKF a priori estimation
16: update

The exchanged data are required by all the cameras in $\mathcal{C}_{T_j}(t)$ to initialize and/or update an EKF needed to retrieve a suitable estimate $\hat{x}_{ij}(t) \in \mathbb{R}^6$ of the $j$-th target state at time $t$ (Lines 12-17). Note that the filter initialization can be performed exploiting either the received information or the environment knowledge, such as the size and position of the rooms' access points. Clearly, the accuracy of such an estimate is affected by the number of cameras in $\mathcal{C}_{T_j}(t)$. Moreover, it is possible to prove that, relying on the consensus approach, it holds that $\hat{x}_{ij}(t) = \hat{x}_j(t)$ for any $C_i \in \mathcal{C}_{T_j}(t)$, namely the target state estimate converges to the same value for all the devices detecting $T_j$. For this reason, hereafter, we drop the dependence on the $i$-th camera when referring to the EKF target state estimate.
Our solution entails the exploitation of the computed target state estimate to determine (even roughly) a prediction of its evolution after $\ell > 1$ time steps. Indeed, exploiting the target dynamics (1), we obtain:
$$\hat{x}^{\ell}_j(t) = \begin{bmatrix} I_{3\times3} & \ell T\, I_{3\times3} \\ 0_{3\times3} & I_{3\times3} \end{bmatrix} \hat{x}_j(t), \tag{3}$$
where $\hat{x}^{\ell}_j(t) = [\hat{p}^{\ell\top}_j(t) \;\; \dot{\hat{p}}^{\ell\top}_j(t)]^\top \in \mathbb{R}^6$ is the $\ell$-steps ahead $j$-th target state prediction. Note that if the predicted target position $\hat{p}^{\ell}_j(t) \in \mathbb{R}^3$ exits from the surveilled environment or if it changes partition without being in the corresponding handout zone, then the prediction is considered not valid and is substituted by the actual estimated position.
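The prediction step and its validity check can be sketched as follows. The representation of partitions and handout zones as axis-aligned boxes, the dictionary layout, and all names are assumptions made for illustration; the prediction itself assumes the constant-velocity model, so that the position is advanced by the horizon (number of steps times T) multiplied by the velocity.

```python
def inside(box, point):
    """True if the point lies in the axis-aligned box ((min xyz), (max xyz))."""
    low, high = box
    return all(l <= x <= h for l, x, h in zip(low, point, high))


def partition_of(point, partitions):
    """Index of the first partition containing the point, or None."""
    for index, box in partitions.items():
        if inside(box, point):
            return index
    return None


def predict_position(state, ell, T, partitions, handouts):
    """Predicted position after `ell` steps, or the current one if not valid."""
    position, velocity = state[:3], state[3:]
    predicted = [p + ell * T * v for p, v in zip(position, velocity)]
    now, later = partition_of(position, partitions), partition_of(predicted, partitions)
    if later is None:
        return list(position)  # the prediction exits the environment
    if later != now:
        zone = handouts.get(frozenset((now, later)))
        if zone is None or not inside(zone, predicted):
            return list(position)  # partition change outside the handout zone
    return predicted


# Two hypothetical adjacent partitions sharing a handout zone around x = 4.
partitions = {1: ((0, 0, 0), (4, 4, 3)), 2: ((4, 0, 0), (8, 4, 3))}
handouts = {frozenset((1, 2)): ((3, 1, 0), (5, 3, 3))}

print(predict_position([3.5, 2, 1, 1, 0, 0], 2, 0.5, partitions, handouts))
# → [4.5, 2.0, 1.0]
```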
To conclude, we point out that the tracking performance is affected by the selected sampling time. Indeed, small values of T might imply extremely high computational burden, whereas large values of T might compromise the system promptness in the case of fast-moving targets. A good choice is to select the sampling time taking into account the average speed of the targets.

PTZ Parameter Selection
One of the most original aspects of the proposed surveillance and tracking solution rests upon the use of a heterogeneous VSN, through a smart exploitation of the adjustable parameters of the PTZ cameras. Hereafter, we illustrate the PTZ parameter selection procedure designed to determine both the orientation and the zoom value of the dynamic visual sensors in the network, with the purpose of improving the tracking capability of the whole camera group. In detail, inspired by [16], we tackled the selection of the PTZ cameras' parameters through the iterative solution of a suitable maximization problem. Clearly, it is convenient to consider only a finite discrete number of PTZ parameters values since small changes do not yield relevant differences in the cameras' FOV.

PTZ Parameter Selection Procedure
To provide a clearer explanation, we first focus on the generic single $j$-th target, $j \in \{1 \ldots n_T\}$. Based on the computed prediction (3) and exploiting the information on the network topology, it is possible to identify the set of both fixed and PTZ cameras that could potentially detect the considered target at the following $\ell$-th time step. We indicate such a set as $\mathcal{C}^{\ell}_{T_j}(t) = \mathcal{C}^{\ell,S}_{T_j}(t) \oplus \mathcal{C}^{\ell,D}_{T_j}(t)$, distinguishing between the static and dynamic visual sensors subsets. In particular, we specify that if the target predicted position $\hat{p}^{\ell}_j(t)$ is in $P_k \setminus H_{k,\kappa}$, $\forall H_{k,\kappa} \in \mathcal{H}$, $k, \kappa \in \{1 \ldots n_P\}$, $k \neq \kappa$, then the set $\mathcal{C}^{\ell}_{T_j}(t)$ includes only cameras placed in the partition $P_k$. Instead, if after $\ell$ time steps, the target is estimated to be in $H^k_{k,\kappa}$, then the set $\mathcal{C}^{\ell}_{T_j}(t)$ contains all visual sensors located in $P_k$ and only the static ones of $P_\kappa$. Formally, in the former scenario, we have that $\mathcal{C}^{\ell}_{T_j}(t) \subseteq \mathcal{C}_{P_k}$, while in the latter one, $\mathcal{C}^{\ell}_{T_j}(t) \subseteq \mathcal{C}_{P_k} \cup \mathcal{C}^S_{P_\kappa}$.
The PTZ parameter selection process initially requires the communication among all the cameras in $\mathcal{C}^{\ell}_{T_j}(t)$. The involved devices share information about their position, orientation, as well as zoom value in the case of PTZ cameras. Subsequently, all the PTZ cameras in $\mathcal{C}^{\ell,D}_{T_j}(t)$ compute a certain utility function depending on the received information. Then, for a fixed number $m \geq 1$ of consecutive iterations, a single dynamic visual sensor at a time, randomly selected from a uniform distribution over the set $\mathcal{C}^{\ell,D}_{T_j}(t)$, computes the optimal values for its PTZ parameters (maximizing the utility function) and broadcasts this information to the other cameras in $\mathcal{C}^{\ell,D}_{T_j}(t)$, which correspondingly update their utility function. The whole iterative procedure can be interpreted as a negotiation phase.
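The negotiation phase described above can be sketched as a sequence of best responses over a finite set of configurations. This is a simplified illustration, not the authors' implementation: the `negotiate` function, the toy utility (number of distinct targets covered by the group), and the camera and target names are all hypothetical, and the broadcast is modeled by updating a shared dictionary.

```python
import random


def negotiate(configs, utility, initial, m, rng):
    """Run m negotiation iterations and return the final assignment.

    configs maps each PTZ camera to its finite list of candidate
    configurations; utility(camera, assignment) scores a joint assignment
    from the point of view of one camera; initial is the starting assignment.
    """
    assignment = dict(initial)
    cameras = sorted(configs)
    for _ in range(m):
        camera = rng.choice(cameras)  # uniform random selection
        # Best response: maximize the utility with the other cameras fixed.
        best = max(configs[camera],
                   key=lambda c: utility(camera, {**assignment, camera: c}))
        assignment[camera] = best  # "broadcast" the new configuration
    return assignment


def coverage_utility(camera, assignment):
    """Toy utility: number of distinct targets covered by the whole group."""
    return len(set(assignment.values()))


# Two PTZ cameras, each able to point at target "A" or target "B".
configs = {"ptz1": ["A", "B"], "ptz2": ["A", "B"]}
start = {"ptz1": "A", "ptz2": "A"}
result = negotiate(configs, coverage_utility, start, m=5, rng=random.Random(0))
print(sorted(result.values()))  # → ['A', 'B']
```

In this toy case, the first selected camera moves to the uncovered target and no later best response undoes that choice, so the group ends up covering both targets whichever camera is drawn first.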
For sufficiently large values of m, such a negotiation phase allows the selection process to converge at least towards a local maximum. In particular, as proven in [16], the convergence is ensured by the game theory results on the Nash equilibrium. When both the number of cameras in a partition and the number of PTZ parameters to select are high, it could be advantageous to rely on a stochastic method for the PTZ parameter selection (for example, in [16], at each negotiation step, a softmax function and a temperature variable were used to generate a probability distribution over the utilities of the available configurations of the selected camera). This helps to avoid local maxima, but turns out to be computationally demanding. On the contrary, when both the number of visual sensors and the number of PTZ parameters to select are low (and therefore, the risk of incurring a local maximum is low) or when a sub-optimal solution is accepted to the benefit of a faster convergence, then a greedy choice method could be preferred.
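The difference between the greedy and the stochastic choice can be illustrated with a short sketch. The softmax with temperature shown here is the generic textbook form and is not claimed to match the exact formulation of [16]; the function names are hypothetical.

```python
import math
import random


def softmax_probabilities(utilities, temperature):
    """Probability of each configuration, proportional to exp(u / temperature)."""
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    top = max(utilities)  # subtracted for numerical stability
    weights = [math.exp((u - top) / temperature) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def choose_config(utilities, temperature=None, rng=random):
    """Index of the chosen configuration.

    With temperature=None the choice is greedy (highest utility); otherwise
    the index is sampled from the softmax distribution.
    """
    indices = range(len(utilities))
    if temperature is None:
        return max(indices, key=utilities.__getitem__)
    probabilities = softmax_probabilities(utilities, temperature)
    return rng.choices(indices, weights=probabilities, k=1)[0]


utilities = [1.0, 3.0, 2.0]
print(choose_config(utilities))  # → 1
```

A high temperature spreads the probability over all configurations (more exploration), whereas a temperature close to zero concentrates it on the best one, so the stochastic rule approaches the greedy one.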
Note also that, for the PTZ parameter selection, the dynamic visual sensors do not exchange data with the PTZ cameras placed in other partitions. This implies that the negotiation phase can be simultaneously performed in more than one partition, coping with the presence of multiple targets.
Finally, we emphasize that the performance of the selection process is conditioned by the value assigned to $\ell$: the number of prediction time steps needs to be compatible with the value of $m$ and with the cameras' computational and actuation time.

Utility Function Definition
The determination of the PTZ parameters by any dynamic camera in $\mathcal{C}^{\ell,D}_{T_j}(t)$ relies on the evaluation of the aforementioned utility function. Such a function is computed taking into account all the targets that the device is supposed to detect at the following $\ell$-th time step. Denoting this targets set as $\mathcal{T}^{\ell}_i(t) \subseteq \mathcal{T}$, we define the $i$-th camera utility function as the weighted sum:
$$\sum_{T_j \in \mathcal{T}^{\ell}_i(t)} q_j\, f_j(\alpha_i(t), \beta_i(t), \zeta_i(t)), \tag{4}$$
where the triplet $(\alpha_i(t), \beta_i(t), \zeta_i(t))$ summarizes the PTZ camera parameters, the scalar $q_j \geq 0$ constitutes the weight assigned to the $j$-th target in order to prioritize (or not) its tracking, and the function $f_j(\alpha_i(t), \beta_i(t), \zeta_i(t))$ depends only on the $j$-th target and on the position, orientation, and, where applicable, zoom value of the visual sensors belonging to the set $\mathcal{C}^{\ell}_{T_j}(t)$. The function $f_j(\cdot)$ is defined as the weighted sum of the terms deriving from the adoption of $l \geq 0$ different criteria, namely:
$$f_j(\alpha_i(t), \beta_i(t), \zeta_i(t)) = \sum_{l} r_l\, g_l(\alpha_i(t), \beta_i(t), \zeta_i(t)), \tag{5}$$
with $r_l \geq 1$ and $g_l(\alpha_i(t), \beta_i(t), \zeta_i(t))$ inferred as explained in the following.
In addition to those proposed in [16], in this work, we account for these criteria:
1. Distance from the center: This criterion implies the evaluation of the distance of the predicted position of the target from the center of the camera image plane. As a consequence, in this case, we have that: In (6), the scalar $a \in [0, 1]$ weights the importance given to the distance $d(\cdot)$, and it is set to one when the $i$-th camera can detect the target without modifying its PTZ parameters (Case a). On the other hand, the condition $a < 1$ is in place when the $i$-th camera needs to modify its current orientation and/or zoom parameter in order to detect the target (Case b). Note that in this last scenario, a penalty $p > 0$ is also assigned to the device through the indicator function $\chi_\iota$, which takes a value of one in Case b and zero in Case a;
2. View quality: This criterion intends to favor zoomed frames up to a minimum FOV height $h_{min} > 0$, which is measured on the plane orthogonal to the optical axis and intersecting the $j$-th target position. Hence, we account for:
3. Number of cameras per target: This criterion aims at assigning a penalty $p > 0$ when the $j$-th target is estimated to be detected by fewer than $n_{min}$ cameras or more than $n_{max}$ cameras. Hence, we have that:
4. Minimum parameter adjustments: According to this criterion, minimum parameter adjustments with respect to the previous update step are advisable. Introducing the vector $[\alpha_i \; \beta_i \; \zeta_i]$ stacking the PTZ parameters of the $i$-th camera obtained at the previous selection step, it follows that:
To conclude, we remark that, when accounting for different rooms, only static visual sensors are allowed to communicate. For this reason, it is possible to perform the selection of the PTZ cameras in separate partitions simultaneously.
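The nested weighted-sum structure of the utility function (weights $q_j$ over the targets, weights $r_l$ over the criteria) can be sketched in Python as follows. This is only an illustration: the two criterion functions below are hypothetical stand-ins and do not reproduce the expressions in (6) and (7), and the one-dimensional target description (a bearing angle in degrees) is an assumption made to keep the example short.

```python
def target_utility(criteria, weights, ptz, target):
    """Weighted sum over the criteria for one target: sum_l r_l * g_l."""
    return sum(r * g(ptz, target) for r, g in zip(weights, criteria))


def camera_utility(criteria, weights, ptz, targets, priorities):
    """Weighted sum over the targets: sum_j q_j * f_j."""
    return sum(q * target_utility(criteria, weights, ptz, tgt)
               for q, tgt in zip(priorities, targets))


def best_configuration(candidates, criteria, weights, targets, priorities):
    """Greedy pick among the candidate (pan, tilt, zoom) triplets."""
    return max(candidates,
               key=lambda ptz: camera_utility(criteria, weights, ptz,
                                              targets, priorities))


# Hypothetical stand-ins for two criteria (NOT equations (6) and (7)).
def centering(ptz, target_bearing_deg):
    pan, _tilt, _zoom = ptz
    return -abs(pan - target_bearing_deg) / 7.5  # penalize off-center targets


def zoom_reward(ptz, _target):
    return float(ptz[2])  # favor zoomed views


# Candidate configurations: three pan angles and three zoom levels.
candidates = [(pan, 0.0, zoom) for pan in (-7.5, 0.0, 7.5) for zoom in (1, 2, 3)]
best = best_configuration(candidates, [centering, zoom_reward], [1.0, 1.0],
                          targets=[7.5], priorities=[1.0])
print(best)
```

Keeping the criteria as interchangeable callables mirrors the idea that the network behavior is shaped only by the choice of the utility terms and of their weights.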

Application Scenario
We note that the framework proposed in this work allows coping with a wide range of different (and potentially conflicting) objectives. This is possible through a convenient choice of the utility function terms and of the corresponding weights. Nonetheless, in this section, the attention is focused on the description of a specific application scenario. This is motivated by the intent of investigating the performance of the designed solution. In detail, we considered an application case wherein the necessary tradeoff between the high-resolution and high-precision tracking requirements emerges.
We considered the indoor environment depicted in Figure 1. This is physically divided into $n_R = 4$ portions, namely a corridor (where the main entrance to the surveilled area is located) and three rooms accessible from the corridor. However, $n_P = 5$ virtual partitions were taken into account. Indeed, two virtual partitions are associated with the environment portion physically corresponding to the corridor; this choice was motivated by the space geometry and by the intent of preventing camera view occlusions. Figure 1 reports also the assumed cameras' placement in the environment. We highlight that the outlined framework allows verifying the simultaneous PTZ parameter selection for dynamic devices associated with different partitions.

VSN Insights
We emphasize that partitions $P_1$, $P_2$, and $P_3$ are populated by the minimum number of visual sensors to guarantee multi-target tracking; partition $P_4$ is surveilled only by a (wide-angle) static and a dynamic device; four cameras are located in correspondence to partition $P_5$. Note that $P_4$ and $P_5$ represent the most critical and the most favorable situations in the considered framework, respectively.
More in detail, one can observe that, as highlighted in Figure 2a, the static visual sensors were placed in order to guarantee the monitoring of the whole environment. Nonetheless, different features were assumed for these cameras, as reported in Table 2.
Observe that the high-resolution device aimed at monitoring the environment access point is characterized by a limited FOV. Concerning the PTZ cameras, instead, these are supposed to be located so as to ensure that the volume of each partition can be approximately entirely covered by at least a couple of these devices, except for partition $P_4$, as depicted in Figure 2b. The considered PTZ cameras were not all identical and differ, as reported in Table 3, where the pan and tilt ranges identify the extreme achievable angles when moving with respect to the initial configuration. Note that the sensors placed in the corridor are characterized by a smaller pan range as compared to the ones in the other rooms, which can span a larger area.
In the simulation, the FOV of each camera is also characterized by a maximum distance at which a target can be seen. This value can be computed starting from each camera's resolution, its horizontal and vertical FOV angles, and the minimum pixel density at which a target is considered to be detectable. Clearly, for dynamic cameras, this maximum distance changes depending on the zoom magnitude. The minimum density considered here was 3 pixels per cm$^2$ (ppcm).
Taking into account a maximum velocity of 4 m/s for the targets, we assume that all the cameras composing the VSN acquire new data every $T = 50$ ms. In addition, for any $i$-th static sensor, $i \in \{1 \ldots n_S\}$, we select the covariance matrix of its observation error in (2) as the diagonal matrix $V_i(t) = 10^{-3} I_{3 \times 3}$. In doing this, when projecting the target position on the camera image plane, we have that the maximum error is approximately 9.5 cm for a target at a distance of 1 m. Conversely, for any $\iota$-th PTZ camera, $\iota \in \{1 \ldots n_D\}$, the covariance matrix depends on the value of its zoom parameter as $V_\iota(t) = (10^{-3}/\zeta_\iota(t)) I_{3 \times 3}$. Note that we suppose that the zoom parameter can vary in the range $[1, 3]$ with unitary step for all the dynamic devices (Table 3). The pan and tilt angles, instead, can be updated in different ranges for the various cameras, although the update step is fixed to 7.5 degrees for all the devices. We also assume that such angular movements are achieved in 0.5 s per step: this constitutes an arbitrary choice, even though, in the following, it is shown that reasonably longer movement times do not considerably affect the tracking performance.
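As a rough illustration of how the maximum detection distance can follow from the resolution, the FOV angles, and the minimum pixel density, the Python sketch below assumes an ideal pinhole camera whose image footprint at distance $d$ is a rectangle of size $2d\tan(\theta_h/2) \times 2d\tan(\theta_v/2)$. This geometric model, the function names, and the example camera (1920 x 1080 pixels, 60 x 35 degrees) are assumptions made for the example and are not taken from Tables 2 and 3.

```python
import math


def footprint_density(distance_cm, res_px, fov_deg):
    """Pixels per cm^2 on the plane orthogonal to the optical axis."""
    width_px, height_px = res_px
    h_fov, v_fov = (math.radians(a) for a in fov_deg)
    width_cm = 2.0 * distance_cm * math.tan(h_fov / 2.0)
    height_cm = 2.0 * distance_cm * math.tan(v_fov / 2.0)
    return (width_px * height_px) / (width_cm * height_cm)


def max_detection_distance_m(res_px, fov_deg, min_density_ppcm=3.0):
    """Largest distance (in meters) at which the density is still sufficient."""
    width_px, height_px = res_px
    h_fov, v_fov = (math.radians(a) for a in fov_deg)
    area_factor = 4.0 * math.tan(h_fov / 2.0) * math.tan(v_fov / 2.0)
    distance_cm = math.sqrt(width_px * height_px / (min_density_ppcm * area_factor))
    return distance_cm / 100.0


# Hypothetical camera: 1920 x 1080 pixels, 60 x 35 degree FOV.
d_max = max_detection_distance_m((1920, 1080), (60.0, 35.0))
print(round(d_max, 2))
```

Under this model, a narrower FOV (i.e., a higher zoom) yields a larger maximum distance, which is in line with the dependence on the zoom magnitude noted above.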

PTZ Parameter Selection Insights
In the following, the PTZ parameter selection is performed by relying only on two of the criteria presented in Section 3.2, namely the distance from the center and the view quality criteria. This choice was motivated by the results of a preliminary comparison of all the proposed criteria, jointly with those described in [16]: the two selected ones constitute the best tradeoff in terms of both effectiveness and computational burden. We remark that the tracking criterion introduced in [16] and the outlined distance from the center criterion serve the same purpose. Nonetheless, the former requires complex computations to minimize the covariance matrix of the target state estimate, while the latter ensures good tracking performance just by trying to keep the targets as centered as possible in the image plane, which only involves the computation of the distance from the optical axis through a norm. As regards the distance from the center criterion, the penalty term and the weight factor introduced in (6) were set to $p = 10^3$ and $a = 0.5$, respectively. As far as the view quality criterion is concerned, instead, the parameter $h_{min}$ in (7) was fixed to 1 m. We highlight that the purpose of the studied scenario was to obtain high-resolution shots of the targets while maintaining a good tracking performance for all of them. Since the view quality criterion favors zoomed framings of the targets and the distance from the center criterion improves the tracking precision, just by combining these two simple rules, it is possible to obtain the desired network behavior.
In light of the given premises, the utility function computed by any $i$-th PTZ camera, $i \in \{1 \ldots n_D\}$, in correspondence to the generic $j$-th target, $j \in \{1 \ldots n_T\}$, results in being:
$$f_j(\alpha_i(t), \beta_i(t), \zeta_i(t)) = r_1 \, g_1(\alpha_i(t), \beta_i(t), \zeta_i(t)) + r_2 \, g_2(\alpha_i(t), \beta_i(t), \zeta_i(t)),$$
with $g_1(\alpha_i(t), \beta_i(t), \zeta_i(t))$ and $g_2(\alpha_i(t), \beta_i(t), \zeta_i(t))$ defined as in (6) and (7), respectively, and $r_1 = r_2 = 1$, namely without prioritizing any of the two selected criteria. The utility function that the $i$-th PTZ camera is required to maximize in order to determine its next PTZ parameters is thus $f_i(\alpha_i(t), \beta_i(t), \zeta_i(t))$ defined in (4). In particular, hereafter, we assume $q_j = 1$ for any $j \in \{1 \ldots n_T\}$. We remark that the PTZ parameter selection can be simultaneously performed by devices corresponding to different partitions. Moreover, since in our simulation framework at most three dynamic visual sensors are placed in each partition, for the reasons explained in Section 3.2.1, a greedy selection policy over the PTZ parameters was employed. Therefore, it is possible to use a relatively small number of negotiation steps, i.e., $m = 6$. Observe that partitions $P_1$, $P_2$, and $P_4$ are all monitored by a single dynamic sensor: in these cases, the optimal PTZ parameters are directly determined and the negotiating process is not required. To conclude, we highlight that the computational time for parameter selection is proportional to the number of targets inside a partition. Observing also that the network tracking capabilities are strongly influenced by the PTZ parameter selection time, we made the following design choices as regards the heterogeneous network. First, the dynamic devices are supposed to be able to perform a discrete pan and/or tilt movement in 0.5 s. In addition to this time, we need to consider also a small time interval during which the cameras stand still in order to acquire images without motion blur. We chose this interval to be at least 0.5 s, during which we also computed the new PTZ parameters for the dynamic cameras.
It follows that, if the computational time exceeds this value, the mentioned time interval needs to be longer. As a consequence, it turns out to be convenient to predict the targets' state at least 1 s in the future. Having assumed $T = 0.05$ s, this leads to selecting at least 20 prediction time steps, 10 of which are introduced to cover the camera movement time.
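The arithmetic behind the number of prediction time steps can be sketched as follows. The helper name and the choice of expressing all durations in integer milliseconds (to avoid floating-point rounding in the ceiling division) are assumptions made for the example.

```python
def prediction_steps(move_ms, still_ms, sample_ms=50):
    """Number of sampling periods covering movement plus standstill time."""
    if sample_ms <= 0:
        raise ValueError("sample_ms must be positive")
    total_ms = move_ms + still_ms
    return -(-total_ms // sample_ms)  # ceiling division on integers


# 0.5 s of movement plus 0.5 s of standstill, sampled every 50 ms.
print(prediction_steps(500, 500))  # 20
```

With a standstill interval of 1 s or 2 s, the same computation gives 30 and 50 steps, respectively.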

Targets' Insights
In the designed simulation framework, we accounted for multiple possible targets moving in the environment according to (1). In detail, we selected the $j$-th target state noise covariance matrix, $j \in \{1 \ldots n_T\}$, as: where we indicate with $p_j \geq 0$ the average target velocity expressed in m/s. In addition, we took into account three main possible trajectories for all the targets: these are reported in Figure 3, neglecting the hypothesis of additional noise. The path in Figure 3a, i.e., Trajectory T1, refers to a non-elusive target going through partitions $P_1$, $P_2$, and $P_3$, i.e., the environment portion characterized by a minimum number of cameras to properly ensure the target tracking. The trajectories in Figure 3b,c, namely T2 and T3, account for the behavior of a non-elusive and an elusive target, respectively, crossing partitions $P_4$ and $P_5$. We recall that these two partitions represent the most critical and the most favorable scenarios, respectively, in terms of cameras available to guarantee the target tracking. In all three cases, targets were generally assumed to move at a constant speed of 1 m/s, although variations up to 4 m/s were also studied for the trajectory in Figure 3a. In addition, in the following, we considered scenarios wherein 4, 8, and 12 targets were simultaneously present in the environment. In these cases, the distance among them was reduced in order to have all of the subjects concurrently present in $P_1$ at some point during the simulation.

Simulation Results
Accounting for the simulation framework outlined in the previous section, hereafter, we investigate the performance of the solution designed, by studying different scenarios.

Performance Evaluation Criteria
To do this, the following performance indexes were taken into account:
• The target state estimation error (precision) and the tracking confidence (accuracy). For any $j$-th target, $j \in \{1 \ldots n_T\}$, at time $t$, the former is simply the difference between the true and the estimated state. The latter is computed from the elements on the main diagonal of the covariance matrix $P_j(t)$. Formally, we distinguish between the position estimation accuracy $\delta_p(t) = \sqrt{3(p_{1,1}(t) + p_{2,2}(t) + p_{3,3}(t))} \in \mathbb{R}$ and the velocity estimation accuracy $\delta_v(t) = \sqrt{3(p_{4,4}(t) + p_{5,5}(t) + p_{6,6}(t))} \in \mathbb{R}$. Note that the accuracy indexes $\delta_p(t)$ and $\delta_v(t)$ correspond to the $3\sigma$ C.I., where $\sigma$ is the square root of the average variance of the position and velocity components of the target state, respectively;
• The (maximum, minimum, mean, and 75th percentile) resolution at which any target is observed. This is expressed in pixels per cm$^2$ (ppcm) and computed by evaluating the pixel density per 1 m$^2$ on the plane orthogonal to the optical axis and intersecting the estimated target position;
• The number of cameras detecting any target at each time step $T$;
• The time required for the PTZ parameter selection depending on the number of targets in the partition.
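Following the verbal definition of the accuracy indexes (three times the square root of the average variance of the position or velocity components), a minimal Python sketch of their computation from a 6 x 6 state covariance matrix is given below. The function names and the example matrix are assumptions made for the illustration.

```python
import math


def three_sigma(variances):
    """Three times the square root of the average of the given variances."""
    return 3.0 * math.sqrt(sum(variances) / len(variances))


def accuracy_indexes(P):
    """Return (delta_p, delta_v) from a 6x6 covariance given as nested lists.

    The first three diagonal entries are assumed to refer to the position
    and the last three to the velocity.
    """
    position_vars = [P[k][k] for k in range(3)]
    velocity_vars = [P[k][k] for k in range(3, 6)]
    return three_sigma(position_vars), three_sigma(velocity_vars)


# Hypothetical diagonal covariance: 1e-3 on position, 4e-3 on velocity.
P = [[0.0] * 6 for _ in range(6)]
for k in range(3):
    P[k][k] = 1e-3
    P[k + 3][k + 3] = 4e-3
delta_p, delta_v = accuracy_indexes(P)
print(delta_p, delta_v)
```

For three components, this quantity coincides with the square root of three times the sum of the variances.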
Note that only the last mentioned performance index involves a temporal quantity and, specifically, refers to the computational time employed in the PTZ parameter selection. This choice was motivated by the fact that the workload associated with all the other operations, as, e.g., the EKF target state estimation, is negligible with respect to the PTZ parameter selection process.
Furthermore, to better highlight the advantages of heterogeneous VSNs including also PTZ cameras, we introduce the so-called static simulation framework (SSF). This differs from the simulation framework described in Section 4, hereafter termed the dynamic simulation framework (DSF), since all the dynamic visual sensors are assumed to be substituted with static devices. In particular, in the SSF, we fixed the orientation of the (substituted) cameras in order to have at least a couple of sensors monitoring the whole volume of each partition, except for partition $P_4$, as depicted in Figure 4. Since we did not deal with occlusions, one can realize that in the SSF, the network tracking performance does not scale with the number of targets. For this reason, in the following, we investigate the performance of the SSF by accounting only for a single target moving in the environment. Nonetheless, the achieved results were then used as a benchmark for comparing the performance of the designed solution in the given DSF, specifically addressing also the multi-target case. All the simulations were run on a Windows laptop equipped with an Intel Core i7-6700HQ.
To conclude, we emphasize that, to provide a fair evaluation of the designed solution performance, 10 independent trials were run for each testing scenario and the average of all the aforementioned metrics was computed.

Single Target
The first intent is to remark on the advantages deriving from the use of a heterogeneous VSN. In doing this, we focus on the network tracking capabilities both in the SSF and in the DSF, by considering a single target that follows the trajectories described in Section 4.3 with a constant speed equal to 1 m/s. First, note that in this case, the choice of 20 prediction steps (Section 4.2) is sufficient. Indeed, as shown in Table 4, the computational time required for the PTZ parameter selection in the case of a single target following the trajectory T3 (in Figure 3c) never exceeded 0.5 s. We remark that the PTZ parameter selection procedure took into account not only the visible targets, but also the ones potentially visible in the near future. For this reason, the computation time turned out to be strongly affected also by the environment structure: a PTZ camera located in a room having a complex geometry in terms of walls and obstacles is penalized. Focusing on the target tracking of trajectory T3 (the most challenging one for the camera network), in Figure 5, we report the performance indexes in correspondence to both the SSF and the DSF. In addition, Table 5 summarizes the performance in correspondence to all the target trajectories depicted in Figure 3.
It is possible to notice that the estimation error and tracking confidence, namely the position and velocity precision and accuracy, are comparable for all the trajectories, with a slight improvement when considering the DSF. This improvement can be explained by observing the mean number of cameras on the target. Indeed, accounting for Figure 5b, related to the case of a target following trajectory T3, we note that in several steps of the path, the number of cameras framing the target was larger for the DSF, as compared to the SSF. Moreover, the use of PTZ cameras allows considerably improving the resolution at which the target is seen (see Table 5 and Figure 5b). Observing also Figure 6, one can realize that the frame distribution in terms of ppcm is shifted toward higher values in the DSF.
We observed that some spikes affected the trend of the estimation error reported in Figure 5a: this fact can be explained by the changes of direction in the considered trajectory, approximated with instantaneous variations. However, the overall performance was not compromised by this behavior: the maximum value of the tracking precision and accuracy was, indeed, larger than its 75th percentile. We emphasize that in real-world scenarios, this issue is less relevant, since changes of direction usually happen more smoothly; nonetheless, the proposed approach is useful to test the robustness of the network tracking capability. In the elusive target scenario corresponding to trajectory T3 (Figure 5), we highlight that the target tries to exploit the blind areas of the VSN, and this turns out to be particularly problematic in partition $P_4$, where the visual sensors are not capable of ensuring the tracking over the entire environment portion. This fact can be noticed in Figure 5a: in correspondence to the part of the trajectory associated with partition $P_4$, namely between the 350th and the 650th time step, the accuracy and precision are compromised, especially when considering the SSF, since the target is framed by only a single camera for a long time (Figure 5b). In the DSF, the presence of dynamic devices allows partially counteracting this problem, with the limiting factor given by the movement capabilities of the PTZ cameras.
To conclude, to test the robustness inherited from the distributed approach, two scenarios were analyzed where cameras C D 7 and C D 8 were respectively considered as not working and a target was following trajectory T2. In both cases, thanks to the negotiation among the remaining cameras, the VSN was able to autonomously adapt to the new situation and to limit the loss of performance to a slight decrease in the quality of view performance (mean resolution values), as can be observed by comparing the data in Table 5 (DSF, T2) and Table 6.

Single Target vs. Four Targets
Accounting for the DSF, in Figure 7, we compare the single-target case discussed in the previous subsection with the case in which four targets move in the environment following trajectory T1 and thus crossing partitions $P_1$, $P_2$, and $P_3$. The intent, here, is to evaluate the heterogeneous VSN performance in a scenario wherein the number of cameras is the minimum to guarantee a successful target tracking over the entire area.
First of all, from Table 7, it is possible to notice that also with four targets, the computational time necessary to perform the PTZ parameter selection did not exceed 0.5 s; thus, the assumption of 20 prediction time steps for the target state is again valid.
Based on the accuracy and precision values reported in Table 8, we point out that the increase in the number of targets did not affect the network tracking performance. However, when multiple targets were simultaneously present in the same partition, the PTZ cameras leaned toward the reduction of their zoom value in order to frame as many targets as possible, consequently compromising the (mean) resolution at which the targets were observed. This was due to the fact that the utility function (4) was designed so that all the dynamic devices tend to keep all targets at the maximum possible resolution, but also centered in their camera image plane. The described situation is more probable in larger partitions, as can be noticed from the ppcm trend in Figure 7: between the 400th and the 700th time step, we detected the highest discrepancy between the single- and multi-target cases, and this corresponds to the part of the target trajectory in $P_3$. We also observed that the utility function maximization sometimes led the PTZ cameras to focus on a single target (thus, increasing their zoom value), rather than framing more targets, especially when these were already monitored by other visual sensors. This fact explains why the mean number of cameras on the targets was slightly lower in the DSF when accounting for four targets as compared both to the one-target case and to the SSF (see Table 8). In general, it is possible to balance the amount of zoomed PTZ configurations with a good tracking performance by tuning the weights of the target utility function.
To conclude, we highlight that Figure 6 and Tables 5 and 8 show that, even when four targets are assumed to be present in the environment, the use of PTZ cameras brings a benefit in terms of target resolution. In fact, the average pixel density on the target was higher with respect to the one obtained in SSF.

Multiple Targets: Limit Behavior
Hereafter, we analyze how the average target resolution changes when the number of targets increases, and in particular when the use of PTZ cameras in the VSN does not provide any improvement with respect to the SSF. In doing this, we considered two scenarios involving 8 and 12 targets, respectively, ensuring their simultaneous passage through the partition $P_1$ at some time instant. It is straightforward that the different partitions have different tracking limits, depending on their topology and on the number of corresponding visual sensors. To provide a fair comparison, we considered trajectory T1 as in Section 5.3.
It turned out that the tracking limit in the DSF was 12 targets. Indeed, as can be observed comparing Tables 5 and 9, the advantage given by the exploitation of a heterogeneous network in this case was minimal. As a consequence, we can conjecture that for a higher number of targets, the network performance might degrade until the use of PTZ devices is no longer beneficial. This happened, for instance, in partitions $P_2$ and $P_3$ in correspondence to a lower number of targets because of the presence of a single dynamic device. On the other hand, in partition $P_1$, the adoption of a heterogeneous VSN turned out to be advantageous due to the higher number of available PTZ cameras. For the 12-target case, thus, the DSF performance in terms of mean ppcm was the same as can be obtained by considering a network of static cameras, with higher resolution in critical points of the surveilled environment. Moreover, Figure 8a,b highlights that the computational time grew approximately linearly with the number of targets in a partition and the trend slope changed according to the number of devices placed in the partition. One can also observe that in these cases, the computational time to select the PTZ parameters of all the dynamic cameras was higher than 0.5 s (see Tables 10 and 11). This implies that a longer time period needs to be taken into account for computations before the cameras' movements. In particular, we considered intervals of 1 s for the 8-target case and of 2 s for the 12-target case, which correspond to 30 and 50 steps ahead prediction, respectively. Finally, we noticed that, under the target velocity assumption of 1 m/s, the target state estimation precision was not compromised (see Table 9), suggesting that the designed solution was capable of coping with situations involving a lower number of targets and a longer time for the camera movements.

Multiple Targets with Different Velocities
We investigate now the system performance in the presence of targets having different (increasing) velocities. Specifically, we assumed dealing with four targets moving at the constant speeds of 1 m/s, 2 m/s, and 4 m/s. The last case is extreme for walking targets, but this was studied with the purpose of pushing the system performance and analyzing the provided solution robustness. Furthermore, in this scenario, the targets were supposed to follow trajectory T1.
In Figure 9, we report the results of the comparison of the tracking precision and accuracy for the second of the four targets, while moving at the different velocities. We observed that, in all the cases, the distributed EKF generally allowed estimating the target position with a small error, except in correspondence to direction changes wherein the tracking error grew proportionally to the velocity. In particular, we note that for higher velocities, a longer time was required to correct the estimates. These considerations are confirmed by the results in Table 12. Nonetheless, we also remark that abrupt changes in direction are unlikely in a real-case scenario, especially in the case of high velocities. As concerns the resolution at which the targets were observed, from Table 12, one can observe that generally, the mean ppcm value was lower for higher target velocities. In these cases, in fact, less zoomed PTZ configurations are preferable since the movement that any camera can accomplish in one time step is not sufficient to follow a relatively close target that is moving fast.
Finally, we emphasize that, as the target velocity increases, the mean number of cameras on the target decreases. Indeed, in correspondence to faster targets, it is more likely for the PTZ cameras to temporarily lose the target. In these cases, the target state estimate only relies on the measurements of the static devices. In the worst case, for a brief time interval, the target is viewed by a single fixed camera; therefore, its state estimates can become less accurate until a PTZ camera frames it again.

Multiple-Target Real-World Scenario
To conclude the assessment of the proposed solution, we accounted for a real-world scenario wherein several targets access the surveilled environment with small time intervals between each other. They are all supposed to move at 1 m/s, following different trajectories in order to cover the whole environment and to explore also the blind areas for the VSN. In this real-world scenario, the targets follow more natural paths with respect to the ones considered in the previous cases; the result is a more random distribution of the tracked subjects over the simulation area. We observed that the PTZ parameter selection computational time here was less than 1 s for the partitions of the corridor and less than 0.5 s for the other partitions; hence, 30 prediction time steps were considered in correspondence to partitions $P_2$ and $P_3$ and 20 for the other ones.
In this scenario also, particularly challenging from a surveillance point of view, the designed solution based on the exploitation of a heterogeneous VSN resulted in being more effective with respect to the SSF, especially in terms of the resolution at which targets were seen. This fact is confirmed when comparing the SSF and DSF performances reported in Table 13: the ppcm index almost doubled in correspondence to any partition, with the exception of $P_2$ and $P_3$, where the improvement was slightly less due to the constrained topology of the corridor.
Furthermore, we remark that when multiple targets are present in a single partition, the number of PTZ cameras on a specific target tends to diminish. This behavior can be explained by considering that, based on the maximization of its utility function, for some dynamic device, it may be more convenient to set its PTZ parameters in order to focus on a specific target and obtain some high-resolution shots instead of framing a larger area. The most challenging situation occurs when the maximum number of targets occupies a single partition and, in particular, these are spread over the entire partition area, e.g., when eight targets are present in partitions $P_2$ and $P_3$ and/or when five targets are present in partitions $P_1$, $P_4$, and $P_5$, as shown in Figure 10. In this case, it turns out to be preferable not to focus on a single target for a long period, since the risk of losing the others exists. The results reported in Table 14 highlight how, despite the difficult situation, the solution proposed in this work was able to improve considerably the resolution at which targets were seen as compared to the traditional VSN characterizing the SSF.

Discussion
Accounting for the critical aspects of the proposed solution highlighted in the previous sections, we discuss here some possible strategies to improve the system performance.
First of all, we observed that the parameter selection computational time linearly depends on the number of targets, while the mean resolution at which the targets are seen turns out to be inversely proportional to it. To address the first issue, it could be convenient to consider an adaptive rate for the PTZ parameter selection, namely to change the interval between consecutive selection procedures. To do so, it is sufficient to estimate the duration of the selection procedure depending on the number of targets in a partition and therefore to select an adequate number of steps for the prediction of the target state. Indeed, when the partition is populated by a small number of targets, a high PTZ parameter update rate can be adopted, thus improving the network performance. Note that the number of prediction steps can be different for each partition, since the selection procedures can be carried out independently. Moreover, when the number of targets in the surveilled environment is such that the mean resolution obtained on them with a heterogeneous VSN is comparable to the one obtained with a static network, it could be useful to temporarily switch to a predefined configuration for the PTZ cameras that allows covering the entire area well, without trying to maximize the utility function.
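A possible realization of such an adaptive rate is sketched below in Python, assuming a linear model for the selection time as a function of the number of targets in the partition. The model coefficients are illustrative placeholders and not values measured in our simulations; the function names and the rounding of the interval to multiples of 0.5 s are likewise assumptions.

```python
def ceil_div(a, b):
    """Ceiling division on non-negative integers."""
    return -(-a // b)


def selection_interval_ms(n_targets, base_ms, per_target_ms, slot_ms=500):
    """Estimated selection time, rounded up to a multiple of slot_ms."""
    estimated_ms = base_ms + per_target_ms * n_targets
    return max(1, ceil_div(estimated_ms, slot_ms)) * slot_ms


def adaptive_prediction_steps(n_targets, base_ms, per_target_ms,
                              move_ms=500, sample_ms=50):
    """Prediction steps covering camera movement plus the selection interval."""
    interval_ms = selection_interval_ms(n_targets, base_ms, per_target_ms)
    return ceil_div(move_ms + interval_ms, sample_ms)


# Placeholder coefficients of the linear time model (NOT measured values).
BASE_MS, PER_TARGET_MS = 50, 110
for n in (1, 4, 8, 12):
    print(n, adaptive_prediction_steps(n, BASE_MS, PER_TARGET_MS))
```

Since the partitions are handled independently, each partition could evaluate such a rule on its own number of targets.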
Furthermore, we remark that, as the targets' velocity increases, less zoomed configurations are preferred to reduce the possibility of losing the target. An improvement rests upon the introduction of an adaptive zoom based on the estimated velocity of the target, namely a procedure that lowers the maximum zoom magnitude as the velocity increases, which also reduces the number of parameter values considered in the selection process.
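The adaptive zoom idea can be sketched as a simple lookup that caps the zoom level according to the estimated target speed. The speed thresholds below are hypothetical and would need to be tuned; only the zoom range $[1, 3]$ with unitary step reflects the simulation setup.

```python
def max_zoom_for_speed(speed_mps, thresholds=((1.0, 3), (2.0, 2))):
    """Return the largest admissible zoom for a given estimated speed.

    thresholds is a sequence of (speed limit, zoom cap) pairs sorted by
    increasing speed limit; above the last limit the zoom is capped at 1.
    """
    for limit, zoom in thresholds:
        if speed_mps <= limit:
            return zoom
    return 1


def candidate_zooms(speed_mps):
    """Zoom values to be explored in the PTZ parameter selection."""
    return list(range(1, max_zoom_for_speed(speed_mps) + 1))


for speed in (1.0, 2.0, 4.0):
    print(speed, candidate_zooms(speed))
```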
Another crucial aspect is the management of the target direction changes, constituting the major source of error in the tracking process. To face this issue, it is possible to identify specific zones in which these situations are more likely to occur, such as the corners of a room or of the corridor, as well as the intersections between different rooms. When the targets are in these zones, it could be useful to consider higher values for the variance of the noise $w_j(t)$. This would allow dealing better with the uncertainty related to the changes of direction, which in these areas are bound to happen with a very high probability. Another more heuristic approach to this problem could be to consider only configurations with a wide FOV for the cameras framing targets that cross these critical zones, so that these cameras would be able to cover all the potential target movements.
Finally, we emphasize that target occlusions still represent an open challenge in camera network design and in the proposed solution. It could be possible to select the PTZ cameras' parameters according to the potential detectable obstacles present in the environment. Along this line, it would be sufficient to weight the contribution of each camera to a given target utility function according to the occlusion information. In other words, once a visual sensor realizes that a target is occluded by using its visual information jointly with the target state estimation, it could lower its contribution to the utility function with respect to the occluded target. In this way, the PTZ parameters of such a camera will be set by the algorithm to focus only on the targets that are visible from its point of view.
One further possible refinement could be to optimize the number, position, and type of cameras located inside the surveilled environment, for example by using the solutions proposed in [7][8][9][27]. This would allow starting from more advantageous PTZ parameter configurations as concerns the utility function maximization.

Conclusions
In this work, we proposed an original real-time surveillance and multi-target tracking solution for an indoor environment, based on the exploitation of a heterogeneous VSN composed of both fixed and PTZ cameras. The environment topology was exploited to support the implementation of a distributed approach and thus obtain a robust, flexible, and scalable network. Indeed, the outlined structure allows separating the surveilled area into multiple independent partitions that can be handled in parallel.
The described surveillance solution consists of two main parts: a distributed EKF and a PTZ parameter selection algorithm that is based on a game theoretic approach. The former aims at estimating and predicting the state of the targets moving in the surveilled area. The latter, instead, given the predicted targets' states, tries to maximize a utility function with the aim of finding the best configuration for the dynamic cameras in the VSN. Such a framework allows realizing a wide range of different and potentially conflicting objectives by simply choosing the proper utility function terms and weights.
In the simulation part, we evaluated the performance of the designed solution considering a specific case where the objective was to obtain a tradeoff between high-resolution views and good tracking ability. Multiple scenarios were investigated, by considering different numbers of targets following multiple trajectories at different velocities. The results confirmed the effectiveness of the solution in obtaining the desired tradeoff. In addition, a linear relationship emerged between the number of targets in a partition and the computational time. Finally, the threshold (in terms of target number) at which it is more convenient to temporarily switch to a static camera network configuration was also identified.