3D Human Pose Estimation with a Catadioptric Sensor in Unconstrained Environments Using an Annealed Particle Filter

The purpose of this paper is to investigate the problem of 3D human tracking in complex environments using a particle filter with images captured by a catadioptric vision system. This issue has been widely studied in the literature on RGB images acquired from conventional perspective cameras, while omnidirectional images have seldom been used and published research works in this field remain limited. In this study, Riemannian geometry was used to compute the gradient on spherical images and generate a robust descriptor used along with an SVM classifier for human detection. Original likelihood functions associated with the particle filter are proposed, using both geodesic distances and overlapping regions between the silhouette detected in the images and the projected 3D human model. Our approach was experimentally evaluated on real data and showed favorable results compared to machine-learning-based techniques in terms of 3D pose accuracy. The Root Mean Square Error (RMSE) was measured by comparing estimated 3D poses against ground-truth data, resulting in a mean error of 0.065 m for the walking action.


Introduction
Catadioptric sensors are widely used in robotics and computer vision. Their popularity is mainly due to their ability to acquire 360° images with a single shot. They have been used for 3D reconstruction of large environments, robotics, and video surveillance. Meanwhile, 3D human tracking in complex and cluttered environments remains a difficult and challenging problem despite the extensive research work carried out in the literature. In order to get a panoramic view of the environment, several solutions have been proposed using synchronized cameras [1]. However, this kind of system is difficult to implement, especially when the workspace is uncontrolled and cluttered. In this research work, we propose to estimate, through a particle filter, the 3D human pose from images provided by a catadioptric camera. Our main contribution consists in developing robust likelihood functions that take into account the intrinsic properties of spherical images. As a result, the particle filter propagates the particles in a better manner, which makes it more stable and accurate. We describe the architecture of the proposed approach in detail and present in-depth experimental results to demonstrate its effectiveness.
The rest of the paper is organized as follows. Section 2 provides the related work. Section 3 describes the proposed particle filter-based 3D tracking approach. Section 4 details the experimental framework undertaken to validate the performance of the proposed algorithm. Finally, some conclusions and future works are drawn in Section 5.

We used HOG (histogram of oriented gradients) descriptors to extract human features because they effectively describe the local distribution of the human body and are invariant to illumination changes and small movements in the images. Moreover, linear Support Vector Machines (SVM) trained on HOG features have demonstrated excellent performance for human detection [25]. Thus, the HOG descriptors were adapted to omnidirectional images before being combined with an SVM classifier. For that, the image gradient is computed in the Riemannian space [2]. The obtained results clearly demonstrate the effectiveness of the catadioptric-adapted gradient compared to conventional methods computed directly in the pixel space. Once the tracking is initialized, the particle filter generates several hypotheses of the 3D human posture through its particle propagation process around the current pose. Each generated particle corresponds to a probable posture of the 3D human body model in the current image; it takes into account the mechanical and kinematic constraints of the movement due to the articulated nature of the human body. In order to account for the distortion caused by the catadioptric sensor, the weight assigned to each particle is computed according to several likelihood functions. The calculation of these functions is given in the following subsections.
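As an illustration of the detection stage, the following minimal sketch shows a conventional HOG + linear SVM pipeline in Python (scikit-image / scikit-learn). The window lists `positives`/`negatives` and the parameter values are hypothetical; the paper's Riemannian-adapted gradient would replace the standard image gradient used inside this off-the-shelf descriptor.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def extract_hog(window):
    """Compute a HOG descriptor for a grayscale detection window."""
    return hog(window, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys')

def train_detector(positives, negatives):
    """Train a linear SVM on HOG features, as in Dalal-Triggs [25].
    'positives'/'negatives' are hypothetical lists of training crops."""
    X = np.array([extract_hog(w) for w in positives + negatives])
    y = np.array([1] * len(positives) + [0] * len(negatives))
    clf = LinearSVC(C=0.01)
    clf.fit(X, y)
    return clf

def detect(clf, window):
    """Return True if the window is classified as containing a person."""
    return clf.decision_function([extract_hog(window)])[0] > 0
```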

The 3D Human Model
In state-of-the-art research, the human body is often represented by an articulated 3D model whose number of degrees of freedom (DOF) differs according to the application; for example, it is equal to 82 in [26], 14 in [27], and 32 in [28]. The number of DOF directly impacts the behavior of the 3D tracking algorithm, since it corresponds to the size of the vector of parameters to be estimated. A high number of DOF would increase the estimation time but would allow us to model complex human postures. Recently, more flexible and parameterizable 3D human models have been developed, such as SMPL [29], which allows the representation of different body shapes that deform naturally with the pose, like a real human body. However, this kind of model needs to be trained on thousands of aligned scans of different people in different poses. Their use in our case is not appropriate, as we want to develop a low-cost real-time tracking solution. Thus, we opted for cylinders to model the head and trunk of the human body, and truncated cones for the upper and lower limbs (Figure 2). This representation has the advantage of being simple to handle (few parameters to define a cylinder/cone) [30,31] and easy to project into images. Our model has 34 degrees of freedom, composed of 11 parts: pelvis, torso, head, arms, forearms, legs, and thighs. The model shape is represented by the length and width of the upper/lower limbs and trunk, while the 3D posture is defined through 30 parameters that give the position and orientation of the pelvis as well as the angles at the joints between the different body parts. In the end, all these parameters were grouped into a single vector x = [x(1), x(2), ..., x(29), x(30)] that defines a complete 3D model of the human body.
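As an illustration of the state parameterization, here is a hypothetical layout of the 30-parameter pose vector; the paper does not specify the ordering, so the index ranges below are assumptions.

```python
import numpy as np

POSE_DIM = 30  # as specified in the paper; the layout below is assumed

def split_pose(x):
    """Split a 30-D pose vector into its assumed components."""
    assert x.shape == (POSE_DIM,)
    pelvis_position    = x[0:3]    # global translation (m)
    pelvis_orientation = x[3:6]    # global rotation (rad)
    joint_angles       = x[6:30]   # 24 articular angles (rad)
    return pelvis_position, pelvis_orientation, joint_angles

x = np.zeros(POSE_DIM)             # neutral pose as a starting point
```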

In addition, we used the unified projection model to take into account the geometry of the catadioptric sensor when projecting the 3D model into the current image. Under this model, the projection of a straight-line segment is a conic on the image plane (Figure 3).

The Filtering
Filtering consists in estimating the current state x_t taking into account all past measurements y_{1:t} ≡ {y_1, ..., y_t} [32]. From a mathematical point of view, this amounts to estimating the posterior distribution of the current state p(x_t | y_{1:t}). In our case, the state vector includes all the parameters describing the 3D posture of the human body, as explained in the previous section, and the measurements that feed the filter at each iteration correspond to visual primitives extracted from the current image. The posterior distribution of the current state p(x_t | y_{1:t}) can be recursively computed from the distribution of the previous state p(x_{t−1} | y_{1:t−1}) in two steps: a prediction step followed by a correction step, detailed below.
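For reference, the two steps correspond to the standard Bayesian filtering recursion, reconstructed here from the surrounding definitions (the paper's Equations (1) and (2) are not reproduced verbatim):

```latex
% Prediction (Chapman--Kolmogorov equation):
p(x_t \mid y_{1:t-1}) \;=\; \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid y_{1:t-1})\,\mathrm{d}x_{t-1}
% Correction (Bayes update with the likelihood):
p(x_t \mid y_{1:t}) \;\propto\; p(y_t \mid x_t)\, p(x_t \mid y_{1:t-1})
```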

• Prediction step: In Equation (1), the temporal diffusion model p(x_t | x_{t−1}) is used to compute the predicted state. In this study, we use a random walk model, which gives the best results when the standard deviations are set to 0.1 m for translation and 1.5° for rotation.
• Correction step: The filtered solution (posterior distribution) corresponds to the predicted pose weighted by the likelihood function p(y_t | x_t), i.e., the observation probability conditioned on the estimated pose.

It is known that the filtering equations can generally not be solved in closed form, except for linear Gaussian systems, where the Kalman Filter (KF) provides the exact solution [24]. A large amount of research has been carried out to generalize the KF to non-linear systems, leading to numerical methods such as the EKF (Extended Kalman Filter). In this work, we used the particle filter framework for its simple implementation and its effectiveness in handling complex and random motion. Specifically, we implemented an annealed particle filter (APF), which builds on Sequential Importance Resampling (SIR) algorithms [33,34] and the CONDENSATION algorithm [35]. The APF was developed by Deutscher et al. [36] to solve the problem of articulated body motion tracking with a large number of degrees of freedom. The basic principle of the APF is to use annealing iteratively in order to better locate the peaks of the probability density. At each time step, the APF proceeds through a set of "layers", from layer M down to layer 1, each of which updates the probability density over the state parameters. A series of weighting functions is employed in which each w_m differs only slightly from w_{m+1}, and w_M is designed to be very broad, guiding the overall direction of the search. The posterior distribution after each layer m + 1 of an annealing run is represented by a set of N weighted particles. For the prediction step at layer m, a Gaussian diffusion model is implemented: a "Monte Carlo sampling with replacement" procedure generates the new hypotheses at layer m from the posterior density at the previous layer m + 1. The sampling covariance matrix C controls the extent of the search space at each layer, where a large covariance matrix allows for a more widespread distribution of the sampled particles. The parameter α is used to gradually reduce the covariance matrix C in the lower layers in order to guide the particles towards the modes of the posterior distribution. In our case, α is set to 0.4. Sampled poses that do not respect the geometric constraints of the articulated model of the human body (articular angle limits exceeded or interpenetration of the limbs) are rejected and are not resampled within a layer. New normalized weights are then assigned to the remaining particles based on an "annealed" version of the likelihood function, of the form w_m(y_t, x_t) = p(y_t | x_t)^{β_m}. The value of β_m determines the annealing rate at each layer. Generally, β_m is set so that about half of the particles are propagated to the next layer by Monte Carlo sampling.
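To make the layer structure concrete, here is a minimal sketch of one APF time step under stated assumptions: `likelihood` stands for one of the likelihood functions of the following subsections, `valid_pose` for the articular-limit and interpenetration checks, and the β-selection scheme is a simple grid search, since the paper does not detail it.

```python
import numpy as np

def find_beta(particles, likelihood, target_survival=0.5, grid=None):
    """Pick beta so the survival rate (effective sample size ratio)
    falls near the target; a grid search stands in for whatever
    scheme the paper actually uses."""
    grid = np.logspace(-2, 1, 30) if grid is None else grid
    for beta in grid:                          # increasing sharpness
        w = np.array([likelihood(x) ** beta for x in particles])
        w /= w.sum()
        if 1.0 / (len(w) * np.sum(w ** 2)) < target_survival:
            return beta
    return grid[-1]

def apf_step(particles, likelihood, valid_pose, C, alpha=0.4, n_layers=5):
    """One time step of the annealed particle filter (layers M down to 1)."""
    N, d = particles.shape
    cov = C.copy()
    for m in range(n_layers, 0, -1):
        # Annealed, normalized weights: w_m(x) = p(y_t | x)^beta_m.
        beta = find_beta(particles, likelihood)
        w = np.array([likelihood(x) ** beta for x in particles])
        w /= w.sum()
        # Monte Carlo resampling with replacement.
        idx = np.random.choice(N, size=N, p=w)
        # Gaussian diffusion; the covariance shrinks by alpha per layer
        # to focus the search on the modes of the posterior.
        noise = np.random.multivariate_normal(np.zeros(d), cov, size=N)
        candidates = particles[idx] + noise
        # Invalid candidates fall back to their parent sample here; the
        # paper instead rejects them from resampling (a simplification).
        ok = np.array([valid_pose(x) for x in candidates])
        particles = np.where(ok[:, None], candidates, particles[idx])
        cov = alpha * cov
    return particles
```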

Likelihood Functions
The likelihood of each particle in the posterior distribution measures how well the projection of a given body pose fits the observed image. Therefore, it is important to choose correctly which image features are used to construct the weighting function. Many image features could be used, including appearance models and optical flow constraints. In our case, we use edge and silhouette features for their simplicity (easy and efficient to extract) and their degree of invariance to imaging conditions, particularly with omnidirectional images.

Edge-Based Likelihood Function
The image gradient is first used to detect the edges in the omnidirectional images. Then, we propose to use geodesic metrics to process spherical images and measure the distance between a pixel and the edge. For that, a gradient mapping on the Riemannian manifold [30,31] is considered. Let S be a parametric surface in R^3 with an induced Riemannian metric g_ij that encodes the geometrical properties of the manifold. A point on the unit sphere can be represented in Cartesian or polar coordinates by (x, y, z) = (sin θ sin φ, sin θ cos φ, cos θ). The Riemannian inverse metric is then expressed in terms of a projection parameter ξ that takes into account the shape of the mirror; when ξ = 0, we recover the pinhole model. This Riemannian metric is then used as a weighting function applied to the classical gradient computed on the omnidirectional image and on the spherical images, where θ and φ denote the colatitude and longitude angles, respectively, and e_θ, e_φ are the corresponding unit vectors.
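The spherical-gradient expression presumably takes the standard form of the gradient in spherical coordinates; the following is a reconstruction under that assumption, not a verbatim copy of the paper's equation:

```latex
\nabla_{S^2} I(\theta, \varphi) \;=\;
  \frac{\partial I}{\partial \theta}\, e_{\theta}
  \;+\; \frac{1}{\sin\theta}\,\frac{\partial I}{\partial \varphi}\, e_{\varphi}
```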
For each pose hypothesis (defined by a particle of the APF filter), the 3D human model is projected into the generated gradient image. Then the distance between the projected model and the contour is determined. In omnidirectional images, the distance between two neighboring pixels differs according to the image region under consideration, so using the Cartesian distance is not suitable. We therefore opted for the geodesic distance to build the distance map. The geodesic distance between two points in a spherical image, x_1 = (θ_1, φ_1) and x_2 = (θ_2, φ_2), is the great-circle distance between them (see the sketch below). An edge distance map M^e_t is then constructed for each image. The likelihood is estimated by projecting the complete model into the edge map and computing the mean squared error, where ξ^e_{x_t}(j) represents the coordinates of the image points corresponding to the projected 3D model points along all the body parts, using the estimated 3D pose x_t. In order to improve the computing speed, we calculate the geodesic distance along a given direction. Thus, the great circle that passes through the ends of each 3D model cylinder is determined. Then, several circles belonging to planes perpendicular to this great circle are generated in order to sample the projected 3D model. The points of intersection between these circles and the cylinder contour correspond to the sample points of the projected 3D model (Figure 4). This reduces the number of pixels whose distance to the edge must be calculated. Indeed, unlike the case of perspective images, the complexity of the distance map calculation is very high when spherical images are considered.
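A minimal sketch of the great-circle distance and of a plausible edge-likelihood evaluation, assuming a precomputed edge distance map `M_e` indexed by pixel coordinates; the exact scoring form and `sigma` are assumptions, not the paper's equation:

```python
import numpy as np

def geodesic_distance(theta1, phi1, theta2, phi2):
    """Great-circle distance on the unit sphere between two points in
    spherical coordinates (theta = colatitude, phi = longitude)."""
    cos_d = (np.cos(theta1) * np.cos(theta2)
             + np.sin(theta1) * np.sin(theta2) * np.cos(phi1 - phi2))
    return np.arccos(np.clip(cos_d, -1.0, 1.0))

def edge_likelihood(sample_points, M_e, sigma=1.0):
    """Score a pose from the mean squared geodesic distance between the
    sampled projected-model points and their nearest edge, read from a
    precomputed edge distance map M_e (assumed layout: M_e[row, col])."""
    d = np.array([M_e[int(v), int(u)] for (u, v) in sample_points])
    return np.exp(-np.mean(d ** 2) / (2 * sigma ** 2))
```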

Silhouette-Based Likelihood Function
The scene background is estimated using a Gaussian mixture model and subtracted at each time step to generate the binary foreground silhouette map M^s_t. The silhouette likelihood function is then estimated by projecting the model into this map. However, this function constrains the body to lie inside the image silhouette. In order to correct this defect, we define a new silhouette likelihood that penalizes non-overlapping regions. Let R^0_t denote the region of the image silhouette not covered by the projected model, and R^1_t the region of the projected model lying outside the image silhouette. The size of each region can be computed by summing the corresponding image pixels. The dual likelihood function is then defined as a linear combination of these regions. Finally, we use a multiplicative formulation to combine the different likelihood functions, p(y_t | x_t) = Π_{l∈L} p^l(y_t | x_t), where y_t denotes the image observations obtained at time t and L ∈ {e, s, sd} is the set of the proposed likelihood functions.
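A sketch of the dual silhouette likelihood under stated assumptions: `M_s` and `M_p` are binary masks for the image silhouette and the projected model, and the weights `w0`/`w1`, the normalization, and the exponential form are illustrative choices rather than the paper's exact equation.

```python
import numpy as np

def dual_silhouette_likelihood(M_s, M_p, w0=0.5, w1=0.5, sigma=1.0):
    """Penalize both directions of silhouette/model mismatch."""
    M_s, M_p = M_s.astype(bool), M_p.astype(bool)
    R0 = np.sum(M_s & ~M_p)        # silhouette not explained by the model
    R1 = np.sum(M_p & ~M_s)        # model lying outside the silhouette
    area = max(np.sum(M_s), 1)     # normalize by the silhouette size
    penalty = (w0 * R0 + w1 * R1) / area
    return np.exp(-penalty / sigma)

def combined_likelihood(likelihoods):
    """Multiplicative combination over the selected cues L in {e, s, sd},
    passed as a dict of per-cue likelihood values."""
    return np.prod(list(likelihoods.values()))
```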

Experimental Results
In this section, we detail the experiments carried out under real conditions to study the behavior of our 3D tracking algorithm and to evaluate its performance. We used the SmartTrack motion capture system [37] to generate the ground truth of the 3D body poses. We first detail the experimental protocol put in place, as well as the construction of our test database; then we present the evaluation criteria used and discuss the obtained results.

Acquisition System Setup
The acquisition system is composed of the SmartTrack device and an omnidirectional camera built by combining a hyperbolic mirror with a perspective camera, as shown in Figure 5. The SmartTrack is an integrated tracking system: inside the small housing are integrated not only two tracking cameras but also the controller, which performs all calculations and generates the data output stream. It is composed of two infrared (IR) cameras with a field of view of approximately 100 degrees horizontally and 84 degrees vertically. The IR cameras allow the tracking of targets fitted with retro-reflective markers. Indeed, these markers reflect the incoming IR radiation back towards the direction of the incoming light; more precisely, the IR radiation is reflected into a narrow range of angles around the (opposite) direction of the incoming light. Passive markers are mostly spheres covered with retro-reflective foils, but they can also be stickers made from retro-reflective material. In our experiment, we placed the passive markers on the person's pelvis and head to record their 3D position and orientation in real time. We used a WIA (Windows Image Acquisition) server to synchronize the data provided by the SmartTrack device with the images acquired from the omnidirectional camera.

Figure 5. Data acquisition setup. The SmartTrack device and the omnidirectional camera are mounted on a tripod. A calibration process was carried out to determine the rigid transformation between the two systems.

Database Construction
Thanks to the acquisition system, we built a database composed of four sequences. The first one shows a person moving slowly around the sensor (Figure 6a). In sequence 2, the person follows the same trajectory as in sequence 1 with an oscillating movement of the arms. In the third sequence, a movement around the sensor with a forward/backward motion was performed (Figure 6b). The fourth sequence presents a more complex scenario where the person rotates about himself and climbs stairs. This sequence allows us to evaluate the robustness of the algorithm against the self-occlusion problem. The video sequences were captured at a frame rate of 25 frames per second. The characteristics of the collected dataset are summarized in Table 1.

Performance Criteria
We use two evaluation metrics based on the mean square error (MSE) [38,39] to compare the estimated body poses given by our algorithm with the ground-truth data. The first one computes the average Euclidean distance between the markers placed on the joints and extremities of the limbs and the corresponding points of the estimated poses, where m_i(x) ∈ R^3 are the locations of the markers corresponding to the 3D ground truth and m_i(x̂) are the 3D joint positions induced by the estimated pose x̂.
The second criterion is a pixel distance measured in the images. To do this, we manually annotated the videos in the dataset with extra information representing the ground truth of the body posture in the image sequence. Thus, for each frame of each video, we annotated the positions of 11 extremities of the human silhouette. For the evaluation, we project the human body model into the images and then compute the 2D distance between the projected extremities and the annotated dataset, where p_i(x) are the 2D points annotated in the reference image of the database and d_i(x̂) ∈ R^2 is the projection in the image of the 3D coordinates of target i given the predicted pose x̂.
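Both criteria reduce to average Euclidean distances; a minimal sketch, with array shapes assumed to be consistent with the definitions above:

```python
import numpy as np

def error_3d(markers_gt, joints_est):
    """Average 3D error over the K markers.
    markers_gt, joints_est: (K, 3) arrays of corresponding 3D points."""
    return np.mean(np.linalg.norm(markers_gt - joints_est, axis=1))

def error_2d(points_gt, points_proj):
    """Average 2D error over the 11 annotated silhouette extremities.
    points_gt, points_proj: (11, 2) arrays of image points (pixels)."""
    return np.mean(np.linalg.norm(points_gt - points_proj, axis=1))
```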

Evaluation of the APF Parameters
Given the stochastic nature of our 3D tracking approach, running the same experiment with the same APF configuration parameters often yields slightly different results. Thus, to obtain consistent and repeatable performance measurements, each experiment is run 10 times for each sequence. We compute the average of the errors (3D or 2D) obtained at each time step over all the estimated poses. First, we evaluated the effect of the resampling parameter α used in the APF to limit the spread of particles from layer M to layer M − 1. As shown in Figure 7, this parameter has an important influence on the obtained results, especially when the number of particles is low.
Figure 7. Influence of the parameter α (sequence 2). The results suggest that a good choice of the parameter α can improve the performance of the annealed particle filter (APF) and consequently increase the accuracy of 3D tracking.
Thus, we varied the value of the parameter α from 0.2 to 0.7 and computed the average 2D error over all sequences. The obtained results are summarized in Table 2. We can see that the value α = 0.4 yields the best performance for all sequences. Indeed, this value sufficiently constrains the propagation space from one layer to the next when the human movements are significant. This is the case with the arms in sequence 2, where the system otherwise loses track of the joints that have undergone a large movement. For example, the 2D error obtained with α = 0.4 is about 4.15 ± 0.73 pixels for sequence 2, while α = 0.6 gives the poorest results with an error of 5.06 ± 1.16 pixels. Therefore, an appropriate choice of the parameter α can improve the tracking performance by 22%.

Comparing of Likelihood Functions
In this section, the effect of the likelihood function on the performance of the proposed 3D tracking algorithm is studied. Four likelihood functions are considered: Spherical Gradient with Geodesic Distance (GG) (defined by Equation (10)), Omnidirectional Gradient (OG), Dual Silhouette (DS) (defined by Equation (11)), and a combination of the DS and GG likelihood functions (given by Equation (15)). As a reminder, the OG likelihood function uses the classical gradient function (Equation (7)) weighted by the Riemannian metric and calculated on the omnidirectional image. The results obtained when applying our approach to sequences 1 and 2 show that the GG likelihood function performs better than the OG function, improving the accuracy by 11%. This demonstrates that handling omnidirectional images in spherical space and using the geodesic distance increases the pose estimation quality. The second clear result is that the combination of the DS and GG likelihood functions always gives the best results. Figure 8 shows the results obtained for sequence 4 using the DS + GG likelihood function; we found an average error of 15 pixels per image. This is due to the complexity of sequence 4, which presents many self-occlusions of the upper and lower limbs.
Figure 8. Results obtained on sequence 4 using the combined likelihood function (DS + GG): average 2D distance between the projected 3D model and the annotated data. This error increases significantly when the tracking of the upper limbs is lost due to self-occlusion, as is the case between frames 40 and 50.

Table 3 summarizes, for each sequence, the average pixel error obtained for the proposed likelihood functions (computed using Equation (17)). It can be seen that this error is in the range of 4.15 to 7.95 pixels for sequences 1, 2 and 3, whereas it reaches 22 pixels for sequence 4. This can be explained by the fact that sequence 4 exhibits self-occlusion of the upper limbs: when the person rotates about himself and the arms remain stuck along the body, neither the contour nor the silhouette can provide enough information to detect the person's rotation. Table 3. Mean localization error in the image (pixels) for the different sequences in the database.

Figure 9 shows the tracking results for the body extremities: head, hands and feet. We note that the head is the part of the body that is best tracked, while the feet are tracked less well. Indeed, the positions of the feet in the omnidirectional images are close to the image center, which reduces their size and makes their detection more difficult.

Figure 10 illustrates an example of head tracking compared to ground truth. The blue and red trajectories on the image correspond to the history of the estimated and real head positions, projected into the current image. We can see that the head displacement estimated by our tracking algorithm matches the real trajectory recorded by the SmartTrack system. This demonstrates the accuracy of our approach and its effectiveness when processing real data.

Evaluation of the Computation Time
The computation time is directly proportional to the number of particles as well as to the number of layers of the APF filter. It also depends on the likelihood functions. Table 4 summarizes the computation times obtained for the slowest case, when a combination of two likelihood functions (gradient with geodesic distance and dual silhouette) is used with 100 particles for a single layer (the computation time of the propagation and likelihood functions is multiplied by the number of layers). The computation time to perform the 3D tracking on one frame of 800 × 600 pixels is about 0.79 s using a 3 GHz Intel Core i7 with a Matlab implementation. We note that the time required for image pre-processing (calculation of the gradient and geodesic distance) represents about 57% of the total computing time. This high cost is mainly due to the multiple projections from the omnidirectional image to the spherical space. In our case, we limit the calculations to a restricted image region thanks to the HOG detection window. In addition, the time required to estimate the likelihood functions represents 37% of the overall computation time, while the time required to propagate the particles of the APF filter and subtract the background is relatively small, representing only 1% of the total time.


Comparison with Other Works
For completeness, we present a qualitative analysis that compares our results against other 3D human pose estimation methods. This is only meant to be indicative, as the considered methods are evaluated differently. Indeed, public omnidirectional image datasets are unfortunately not available, which prevented us from carrying out a quantitative comparison with state-of-the-art techniques. We evaluate the accuracy of 3D human pose estimation in terms of the average Euclidean distance between the predicted and ground-truth 3D joint and head positions. We compare the results obtained for the "walking" action in our investigation with recent state-of-the-art approaches tested on the walking action of popular public datasets such as Human3.6M and HumanEva-I. The walking action in our database corresponds to one person's movement towards the camera, with a coherent swing of the left (right) arm and the right (left) leg, which is quite similar to the walking action of the Human3.6M and HumanEva-I databases. The results are reported in Table 5. We can see that the performance of our approach is similar to state-of-the-art methods, validating the effectiveness of our tracking scheme. Nevertheless, it would be interesting to generalize this result by testing the robustness of our approach under more challenging conditions with complex human actions. Table 5. 3D errors (mm) of 3D human pose estimation methods on the walking action.

Conclusions
This paper presents a new approach for human pose estimation using a catadioptric vision system within the context of Bayesian filtering. We developed original likelihood functions in Riemannian/spherical space to take into account the geometrical properties of omnidirectional images. The spherical image derivatives were used to adapt the gradient computation to this space, and the geodesic distance was considered when generating the distance map. Numerous experiments were carried out on real image sequences to evaluate the performance of the proposed approach. We used the MSE criteria to measure the quality of the estimated 3D pose against the ground-truth data. The results show that the performance is further improved when using the combined silhouette/edge likelihood function. Indeed, our algorithm converges in less than 1 s in most cases, while the 3D pose estimation error generally remains below 10 cm. However, we have observed that the APF filter sometimes has limitations, in particular when the body extremities are partially occluded or when the person is more than 5 m away from the sensor. As future work, we plan first to explore the use of additional information provided by other sensors, such as Kinect and IMU (Inertial Measurement Unit) devices, to improve the estimation accuracy, and second to use deep learning approaches such as those that have produced remarkable results for classical 3D object localization.