To estimate the human position in a factory cell, Ragaglia et al. [8] propose a stochastic methodology that merges external conditions and a human gait model with human recognition and tracking algorithms. The robotic factory cell includes several cameras and is divided into co-existence, co-operation, and interference areas. The developed algorithm uses a predictive model, based on an offline unsupervised classification of human paths, to estimate the area the human operator will move to, associating a likelihood with each area. The human being is modeled with a rectangular box. Morato et al. [9], on the other hand, use a skeleton-like model to track the human, estimating the position of the skeleton joints with a Kalman filter that merges the information of multiple Microsoft Kinect cameras (Microsoft Corp., Redmond, Washington, USA). Humans are modeled using virtual spheres centered at the midpoint between two consecutive human joints: the diameter of each sphere is equal to the distance between the two joints. Modeling human bodies with spheres has also been proposed by Kulic et al. in [10], where this information is merged with physiological sensors to estimate the user's affective state. Safeea et al. proposed to reconstruct humans and manipulators using cylindrical primitives, presenting a closed-form solution to compute the distances between them [11]. In their case study, the human is represented by a single cylindrical primitive and the manipulator by two. Using more cylindrical primitives increases the fidelity of the human and manipulator reconstruction, but it also increases the model complexity, leading to a computationally expensive representation.
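The sphere decomposition used by Morato et al. [9] can be sketched as follows: each body segment between two consecutive skeleton joints becomes a sphere centered at their midpoint with diameter equal to the joint-to-joint distance. The joint coordinates and the robot sphere below are illustrative values, not data from the paper.

```python
import math

def link_sphere(joint_a, joint_b):
    """Bounding sphere for the body segment between two consecutive
    skeleton joints: centered at the midpoint, diameter equal to the
    joint-to-joint distance (as in Morato et al. [9])."""
    center = tuple((a + b) / 2.0 for a, b in zip(joint_a, joint_b))
    return center, math.dist(joint_a, joint_b) / 2.0

def sphere_clearance(s1, s2):
    """Signed clearance between two spheres: negative means overlap."""
    (c1, r1), (c2, r2) = s1, s2
    return math.dist(c1, c2) - (r1 + r2)

# Illustrative joints (meters): shoulder -> elbow, plus one robot sphere.
upper_arm = link_sphere((0.0, 0.0, 1.4), (0.0, 0.3, 1.4))
robot = ((0.0, 0.8, 1.4), 0.10)
print(sphere_clearance(upper_arm, robot))  # positive: no collision
```

Because each primitive is a sphere, the human-robot distance reduces to cheap center-to-center computations, which is why such decompositions are popular in real-time HRC.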

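For the cylindrical primitives of Safeea et al. [11], a closed-form minimum distance can be sketched by treating each primitive as a capsule (a line segment with a radius) and using the standard closed-form segment-segment distance; this is an approximation of the authors' formulation, and all dimensions below are assumed for illustration.

```python
import numpy as np

def segment_distance(p1, q1, p2, q2):
    """Closed-form minimum distance between segments p1q1 and p2q2
    via clamped closest-point computation (assumes non-degenerate
    segments, i.e. both have nonzero length)."""
    d1, d2 = q1 - p1, q2 - p2
    r = p1 - p2
    a, e = d1 @ d1, d2 @ d2
    b, c, f = d1 @ d2, d1 @ r, d2 @ r
    denom = a * e - b * b
    s = np.clip((b * f - c * e) / denom, 0.0, 1.0) if denom > 1e-12 else 0.0
    t = (b * s + f) / e
    if t < 0.0:
        t, s = 0.0, np.clip(-c / a, 0.0, 1.0)
    elif t > 1.0:
        t, s = 1.0, np.clip((b - c) / a, 0.0, 1.0)
    return np.linalg.norm((p1 + s * d1) - (p2 + t * d2))

def capsule_clearance(seg_a, r_a, seg_b, r_b):
    """Distance between two cylinder-like (capsule) primitives:
    axis-to-axis distance minus the two radii; negative => contact."""
    return segment_distance(*seg_a, *seg_b) - (r_a + r_b)

# Illustrative primitives (meters): a human torso and one robot link.
human = (np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.7]))
link = (np.array([1.0, 0.0, 0.5]), np.array([1.0, 0.0, 1.5]))
print(capsule_clearance(human, 0.20, link, 0.08))
```

Each additional primitive adds one such pairwise evaluation per robot link, which is the complexity/fidelity trade-off noted above.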
All the techniques mentioned so far aim at reconstructing humans by exploiting convex approximations of their shape. Phan and Ferrie [12], instead, propose a geodesic distance graph model to compute a high-resolution reconstruction of humans. However, the algorithm requires a computational time of 0.6 s, which is not always suitable for applications demanding shorter reaction times, such as HRC. To reduce the computational time, Cefalo et al. [13] propose a real-time collision detection algorithm based on the parallel computing capabilities of Graphics Processing Units (GPUs). The algorithm subtracts the depth image acquired from a camera from a virtual depth image built using CAD models. A collision is detected whenever there is a difference between the two images, i.e., whenever the robot touches or is covered by an obstacle. Since Cefalo et al. [13] use a single camera, their results may be affected by shadow problems. Whenever an object is observed by a camera, a blind volume, where the camera is essentially sightless, is generated. This phenomenon is known as the shadow effect. For this reason, the entire shadow cone generated by an obstacle is often considered part of the obstacle itself. This assumption leads to a substantial reduction of the manipulator workspace, degrading its performance. Multiple depth cameras can be employed to overcome this problem, adding points of view and, therefore, mitigating the shadow problem. Although a multi-camera approach can solve the issue of shadow effects, it requires a high computational cost due to the fusion of the information coming from the different sensors. Fisher et al. in [14] propose a master-slave architecture of distributed cameras. The architecture has five processing layers that correspond to the type of data transferred between slave and master cameras. These processing layers vary from 0 to 5 (from raw image acquisition up to minimum distance calculation). The proposed approach first remaps the depth sensor information into Cartesian space, then clusters it and computes the convex hull of each object, which is used to compute the distance. The quality of the approximation of the detected objects depends on the distributed processing level: the lower the level, the higher the quality. The average computational time of the algorithm with the lowest quality of approximation (i.e., level 5) is about 80 ms (i.e., 12.5 Hz). Moreover, the computational time increases proportionally with the number of depth sensors when the distributed processing level is between 0 and 4, hampering the use of additional sensor units. In this regard, Flacco et al. addressed the integration of multiple depth sensors in [15], extending their previous work [16], where the distances between control points on the manipulator surfaces and moving obstacles (including humans) were evaluated based on the concept of depth space. The presented approach assumes that a main camera covers the entire collaborative workspace. Secondary cameras, monitoring sub-regions of the workspace, are queried only when the information from the main camera is not sufficient. However, it is worth mentioning that this assumption may not be feasible in a real environment.
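The depth-image subtraction test of Cefalo et al. [13] described above can be sketched with NumPy on a tiny synthetic image; the paper uses GPU rendering of CAD models, while the image size, depth values, and tolerance below are assumptions made for illustration.

```python
import numpy as np

# Virtual depth image rendered from the robot's CAD model (meters):
# each pixel stores the depth of the robot surface, inf where no robot
# is visible. (Assumed 4x4 image; the real system renders full frames.)
virtual = np.full((4, 4), np.inf)
virtual[1:3, 1:3] = 1.00  # robot occupies the image center at 1 m

# Measured depth image from the real camera, initially matching.
measured = virtual.copy()

def robot_occluded(measured, virtual, tol=0.05):
    """Collision test in the spirit of Cefalo et al. [13]: flag a
    contact when some measured pixel is closer to the camera than the
    rendered robot surface, i.e. an obstacle touches or covers the
    robot. The tolerance tol absorbs sensor noise (assumed value)."""
    robot_pixels = np.isfinite(virtual)
    return bool(np.any(measured[robot_pixels] < virtual[robot_pixels] - tol))

print(robot_occluded(measured, virtual))  # False: nothing in front

measured[2, 2] = 0.80  # an obstacle appears 20 cm in front of the robot
print(robot_occluded(measured, virtual))  # True: difference detected
```

The per-pixel comparison is embarrassingly parallel, which is why the original method maps naturally onto a GPU; it also shows why a single viewpoint suffers from shadows, since pixels behind the obstacle carry no usable depth.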