A 3D Human Skeletonization Algorithm for a Single Monocular Camera Based on Spatial–Temporal Discrete Shadow Integration

Three-dimensional (3D) human skeleton extraction is a powerful tool for activity acquisition and analysis, with a variety of applications in somatosensory control, virtual reality and many other growing fields. However, 3D human skeletonization relies heavily on RGB-Depth (RGB-D) cameras, expensive wearable sensors and specific lighting conditions, which greatly limits its outdoor applications. This paper presents a novel 3D human skeleton extraction method designed for monocular cameras in large-scale outdoor scenarios. The proposed algorithm aggregates spatial–temporal discrete joint positions extracted from the human shadow on the ground. Firstly, the projected silhouette information is recovered from the human shadow on the ground for each frame, followed by the extraction of two-dimensional (2D) projected joint positions. Then the extracted 2D joint positions are categorized into different sets according to activity silhouette categories. Finally, spatial–temporal integration of same-category 2D joint positions is carried out to generate 3D human skeletons. Several comparisons with the traditional RGB-D method show the proposed method to be accurate and efficient in outdoor human skeletonization. The application of the proposed method to RGB-D skeletonization enhancement is also discussed.


Introduction
The development of three-dimensional (3D) human skeleton extraction contributes enormously to growing fields such as virtual reality and somatosensory human-computer interaction. However, current 3D human skeletonization algorithms require specialized acquisition equipment, including RGB-Depth (RGB-D) cameras and wearable sensors, or a specific experimental setup such as a ring illuminator array. RGB-D cameras like Microsoft Kinect are designed to perform human skeletonization at short range [1,2]. Wearable sensors only perform effective skeletonization on human subjects wearing experimental tags. A ring illuminator array requires precise subject positioning and illuminator array setup during the 3D modelling and skeletonization procedures. These setup restrictions of traditional human skeletonization methods greatly limit outdoor applications.
Instead of deploying algorithms on traditional specialized platforms, this work pays attention to the most common projection of the human body: its shadow on the ground. A shadow is the projection of an opaque object on a certain surface, containing single-view silhouette information of the object. Multiple methods have been developed to extract information from shadows. Current methods mainly focus on the recovery of mesh models [3][4][5] or point clouds [6,7] of static objects based on partial shadow information [8]. In this paper, a silhouetted shadow-based skeleton extraction (SSSE) method is proposed. The proposed SSSE method applies shadow information extraction algorithms to the field of human skeletonization [9][10][11].
Based on the proposed SSSE method, six 3D joint positions in the human skeleton can be precisely extracted in outdoor scenarios with a normal monocular camera. Compared with current indoor 3D human skeleton extraction methods based on RGB-D cameras like Kinect, the proposed SSSE method reduces constraints on input device choice and application environment setup.
This work is motivated by the procedure of taking a silhouette photo. During this procedure, the human body blocks part of the light from reaching the film or sensor, leaving a body sketch on the silhouette photo. From the aspect of silhouette imaging, the human shadow on the ground can be regarded as a silhouette photo of the human body on a special giant film, with the ground surface playing the role of the film. For captured frames containing human shadows, each shadow on the ground can provide extra human contour information from a unique observation angle other than the camera view.
This paper mainly focuses on the extraction and aggregation of the extra silhouette information from spatial-temporal discrete human shadows on the ground, aiming to perform 3D human skeletonization with a monocular camera in outdoor scenarios. Based on the aggregation of multiple shadows from discrete spatial-temporal coordinates, SSSE is capable of performing 3D human skeletonization even in outdoor scenes where the scale is too large for traditional methods [12][13][14] to handle [6,15,16]. The main contributions of this paper relate to three aspects:
(1) 3D human skeletonization is realized with a normal monocular camera based on the proposed SSSE method.
(2) The proposed SSSE method achieves 3D human skeletonization in a large-scale outdoor scene.
(3) The proposed SSSE method deploys the aggregation of temporal-spatial discrete two-dimensional (2D) shadow information in a 3D human skeletonization procedure.
The remaining sections of this paper are organized as follows. In Section 2, the basic theory for shadow-based single-frame human skeletonization is introduced first, followed by the advanced SSSE method, which aggregates temporal-spatial discrete shadow information to recover complete skeleton sequences. In Section 3, a five-step method is introduced to deploy the proposed SSSE method in large-scale outdoor scenarios with a monocular camera. In Section 4, the effective range and precision of the SSSE skeletonization results are evaluated in comparison with the skeletonization results of the traditional RGB-D method. Additionally, a fusion application of the SSSE and RGB-D skeletonization methods is presented in Section 5, providing a much wider effective range in outdoor scenarios. Finally, the advantages and potential applications of the proposed SSSE method are discussed in Section 6.

Basic Theory
This section presents the basic theory of the silhouetted shadow-based 3D human skeletonization method. To illustrate our method clearly, the basic theory under multiple light source scenarios is introduced first. Then the advanced theory designed to aggregate temporal-spatial discrete shadow information is introduced to achieve skeleton recovery under single light source scenarios.

Skeleton Simulation in Multi-Light-Source Scenarios
In a multiple light source scenario, the contour of each human shadow on the ground is decided by two factors: (1) the human contour shape; (2) the positional relationship between the light source and the human.
Since each single shadow on the ground is restricted to a 2D plane, it is impossible to reproduce 3D information from any single shadow image. However, multiple shadows generated by different light sources can carry contour information from multiple 3D view angles, allowing the reproduction of 3D information.
A 3D voxel model of an object can be simulated from shadows generated by an annular set of light sources [6]. However, outside the laboratory environment, accurate manual arrangement of light source positions is hard to achieve. Thus SSSE is designed to adapt to an arbitrary combination of light source positions determined after capture. With two or more shadows generated by different light sources, our method is capable of simulating 3D human skeleton information.
In a multiple light source scenario shown in Figure 1a

3D Scenario Reproduction
The silhouettes of human shadows are projected on the ground. To locate the 2D shadow areas corresponding to different human joints, 3D scenario reproduction is launched to extract the original 2D silhouette information $Sh$ from the corresponding images $Sh_c$ captured by camera $C$. The extraction is launched through a two-step perspective transformation between the ground surface plane $S(u, v)$ and the camera coordinate plane $C(x', y')$. Because roads are built progressively and patched locally, the height levels of different road parts are normally discontinuous. Instead of deploying a global perspective transformation between the ground surface plane $S(u, v)$ and the image coordinate plane $Im(x, y)$, this work proposes a block matrix-based projection transformation optimized for uneven road surfaces.
Before the 3D scenario reproduction procedure, the projection transformation parameter matrices are calculated once. Then frame-by-frame block matrix-based projection transformations are launched to extract the original silhouette information.
In order to illustrate the block matrix-based projection transformation clearly, the traditional plane-to-plane projection transformation is presented first. Then the block matrix-based projection transformation is introduced along with the optimized extraction solution for the parameter matrices. Based on the extracted parameter matrices, the simplified Equation (15) for frame-by-frame projection transformation is presented.

Plane-to-Plane Projection Transformation
During the imaging process of a monocular camera, the projection transformation from the ground surface plane to the image coordinate plane is carried out in two steps. Firstly, each point $(u, v)$ on the ground surface plane is projected to the corresponding coordinates $(x', y')$ on the camera coordinate plane.
Secondly, a linear transformation happens inside the camera, transforming the coordinates $(x', y')$ to pixel coordinates $(x, y)$ on the image coordinate plane.
The projection transformation between the point coordinates $(u, v)$ on the ground surface plane and the corresponding coordinates $(x', y')$ on the camera coordinate plane is presented as below:
$$(x', y', w') = (u, v, 1)\,A \quad (1)$$
Here $w'$ is a fixed camera internal parameter that affects the linear transformation from the camera coordinate plane to the image coordinate plane. Noticeably, $A$ is the projection transformation calibration matrix that defines the relationship between the ground surface plane and the camera coordinate plane. Multiple transformations are taken into consideration in constructing the projection transformation calibration matrix $A$.
• Rotation transformation. Most surveillance cameras are not precisely set up at the horizontal angle that is parallel with the ground surface. The non-horizontal installation attitude brings a rotated field of view. The rotation transformation is introduced to calibrate the rotated field of view, ensuring that the calibrated field of view is parallel with the ground surface.
• Scale transformation. The coordinate system of the ground surface plane is measured in centimeters. However, the pixel is the basic unit of measurement in the image coordinate plane. Thus the scale transformation is introduced to bridge the two different units of measurement, extracting ground surface plane coordinates from the pixel coordinates. Both rotation transformation and scale transformation are linear transformations. The parameters of both transformations are combined into the linear parameter matrix $L$.
• Translation transformation. For the image coordinate plane, the origin of the coordinate system is fixed at the bottom left corner. For each captured frame, the origin of the coordinate system on the ground surface plane does not necessarily coincide with the origin of the image coordinate plane. The translation transformation is introduced to calibrate the translation between the two coordinate systems. The detailed parameters for translation transformation are given in parameter matrix $T$.
• Perspective transformation. Instead of a flat view, a perspective view is captured by each monocular surveillance camera in each frame. Thus, the perspective transformation is introduced to recover the flat ground surface plane from the captured perspective view. The detailed perspective transformation parameters are given in parameter matrix $P$.
The linear transformation parameter matrix $L$ includes the scale transformation parameters $c_x$ and $c_y$ and the rotation angle $\theta$.
The translation transformation parameter matrix $T$ is made up of the translation values $t_x$ and $t_y$ in the two axis directions.
The perspective transformation parameter matrix $P$ is made up of the perspective values $p_x$ and $p_y$ in the two axis directions.
Based on the detailed transformation parameter matrices $T$, $L$ and $P$, the projection transformation matrix $A$ can be presented as in Equation (5). Based on Equations (1) and (5), the coordinates $x'$, $y'$ and $w'$ on the camera coordinate plane can be expressed in terms of the coordinates $u$, $v$ on the ground surface plane and the sub-parameters $a_{ij}$ of matrix $A$.
$$x' = a_{11}u + a_{21}v + a_{31} \quad (6a)$$
Then, a linear transformation is carried out to calculate the pixel coordinates $x$ and $y$ in the image coordinate plane. The transformation is controlled by the camera internal parameter $w'$.
$$(x, y)^T = \left(\frac{x'}{w'}, \frac{y'}{w'}\right)^T \quad (7)$$
Eventually, the pixel coordinates $x$ and $y$ can be expressed in terms of the ground surface plane coordinates $u$, $v$ and the sub-parameters of the projection transformation calibration matrix $A$.
Additionally, if the human shadow pixel coordinates $x$ and $y$ and the projection calibration matrix $A$ are known, the real-world coordinates $u$ and $v$ of the human shadow can be extracted by solving Equations (8a) and (8b). The procedure of solving for the real-world coordinates $u$ and $v$ is simplified in Equation (9).
$$(u, v)^T = f((x, y)^T, A) \quad (9)$$

Block Matrix-Based Projection Transformation Parameter Calculation
The traditional plane-to-plane projection transformation is designed for ideal scenarios with a continuous flat ground surface. Nevertheless, realistic scenarios contain uneven ground surfaces with discontinuous pavement levels. Thus, a single projection transformation calibration matrix $A$ is not capable of ensuring a precise projection transformation for all sub-blocks of the uneven ground surface.
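The single-matrix transformation and its inverse $f$ can be sketched in a few lines. This is a minimal illustration, assuming the row-vector convention suggested by Equation (6a), i.e. $(x', y', w') = (u, v, 1)A$; the function names and the numerical matrix `A` are hypothetical and chosen only for the example.

```python
import numpy as np

def ground_to_pixel(u, v, A):
    """Map ground-plane coordinates (u, v) to pixel coordinates (x, y).

    Assumption: row-vector convention [x', y', w'] = [u, v, 1] @ A,
    followed by division by w'.
    """
    xp, yp, wp = np.array([u, v, 1.0]) @ A
    return xp / wp, yp / wp

def pixel_to_ground(x, y, A):
    """Recover (u, v) from pixel coordinates, playing the role of f in Equation (9)."""
    # Multiplying by the inverse of A undoes the projective map up to a scale
    # factor, which is removed by dividing by the third homogeneous component.
    up, vp, s = np.array([x, y, 1.0]) @ np.linalg.inv(A)
    return up / s, vp / s

# Hypothetical calibration matrix, for illustration only.
A = np.array([[2.0, 0.1, 0.001],
              [0.2, 1.5, 0.002],
              [30.0, 40.0, 1.0]])

x, y = ground_to_pixel(100.0, 50.0, A)   # ground point in centimeters
u, v = pixel_to_ground(x, y, A)          # round trip back to the ground plane
```

The round trip returns the original ground point up to floating-point error, which is a convenient sanity check when calibrating a real matrix.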
In order to deploy the projection transformation on realistic scenarios with high precision, a block matrix-based projection transformation is proposed in this part. Instead of deploying an imprecise global plane-to-plane transformation, the proposed method launches a set of precise sub-transformations. Each sub-transformation covers only one locally flat sub-block on the ground surface, ensuring a precise projection transformation between a surface sub-block and the corresponding image subset. Each sub-block has its own projection transformation calibration matrix $A_{sub}$, which differs from the parameter matrices of the other sub-blocks.
The parameter matrices $A_{sub}$ of different sub-blocks are calculated separately based on Equations (8a) and (8b). To solve the unique calibration matrix of each sub-block, four pairs of marked point coordinates on the ground surface plane and their corresponding pixel coordinates on the image coordinate plane are required. However, placing a large number of markers to calculate the parameter matrices of all sub-blocks would bring a heavy workload.
In order to simplify the setup, the optimized block matrix-based parameter calculation procedure is designed so that marker coordinates can be reused and the computation can be parallelized. From the top-view angle, the ground surface is divided into a dense grid of square sub-blocks as shown in Figure 1b.
Each sub-block is a unit square area $Sq_{sub}$ defined by four corner markers, occupying an area of one square meter on the ground surface as shown in Figure 1b. The coordinate set of the four markers on the ground surface is defined as $M^{S}_{sub} = \{(u_i, v_i),\ i = 1, 2, 3, 4\}$. For each sub-block $Sq_{sub}$, a set of auxiliary parameters is introduced to simplify the calculation of the parameter matrix $A_{sub}$ based on Equation (10). The scale auxiliary parameter set includes $\Delta x_1$, $\Delta x_2$, $\Delta y_1$ and $\Delta y_2$.
Additionally, the parallel auxiliary parameters $\Delta x_3$ and $\Delta y_3$ are introduced in Equation (10) as well. If both auxiliary parameters $\Delta x_3$ and $\Delta y_3$ approach zero, the camera field of view is regarded as parallel with the sub-block.
The translation parameter $T_{sub}$, perspective parameter $P_{sub}$ and linear parameter $L_{sub}$ in each calibration matrix $A_{sub}$ can be solved as: The extraction procedure of the block matrix-based projection transformation calibration matrix can be simplified as:
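As an illustration of how one calibration matrix can be obtained from four marker pairs, the sketch below solves the eight unknown entries of $A_{sub}$ with a direct linear solve. This is a generic substitute for the closed-form solution based on the auxiliary parameters of Equation (10), not a reproduction of it. It assumes the row-vector convention $(x', y', w') = (u, v, 1)A_{sub}$ with $a_{33} = 1$; `solve_calibration`, `project` and the numerical values are hypothetical.

```python
import numpy as np

def solve_calibration(ground_pts, pixel_pts):
    """Estimate a 3x3 matrix A_sub from four (u, v) <-> (x, y) marker pairs.

    Assumptions: [x', y', w'] = [u, v, 1] @ A_sub and a33 fixed to 1.
    Each marker pair contributes two linear equations in the unknowns
    (a11, a21, a31, a12, a22, a32, a13, a23).
    """
    rows, rhs = [], []
    for (u, v), (x, y) in zip(ground_pts, pixel_pts):
        rows.append([u, v, 1, 0, 0, 0, -x * u, -x * v])
        rhs.append(x)
        rows.append([0, 0, 0, u, v, 1, -y * u, -y * v])
        rhs.append(y)
    sol = np.linalg.solve(np.array(rows, dtype=float), np.array(rhs, dtype=float))
    a11, a21, a31, a12, a22, a32, a13, a23 = sol
    return np.array([[a11, a12, a13],
                     [a21, a22, a23],
                     [a31, a32, 1.0]])

def project(u, v, A):
    """Forward projection used here only to synthesize marker pixels."""
    xp, yp, wp = np.array([u, v, 1.0]) @ A
    return xp / wp, yp / wp

# Hypothetical ground-truth matrix and a one-meter sub-block (in centimeters).
A_true = np.array([[2.0, 0.1, 0.001],
                   [0.2, 1.5, 0.002],
                   [30.0, 40.0, 1.0]])
markers_ground = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0), (0.0, 100.0)]
markers_pixel = [project(u, v, A_true) for u, v in markers_ground]

A_est = solve_calibration(markers_ground, markers_pixel)
```

Because neighboring sub-blocks share corner markers, the same marker coordinates appear in the systems of up to four adjacent sub-blocks, and each system can be solved independently of the others.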

Block-Matrix Based Projection Transformation Deployment
Based on Equation (9) and the calculated parameters in matrix $A_{sub}$, the real-world coordinates $(u, v)$ of each point in one sub-block area can be calculated from the corresponding pixel coordinates $(x, y)$.
The extraction procedure can be simplified as:
$$(u, v)^T = f((x, y)^T, A_{sub}) \quad (13)$$
Noticeably, unlike the original global calibration matrix $A$, each sub-block calibration matrix $A_{sub}$ is only applied to the regional transformation between the sub-block area on the ground and the corresponding pixel range in the image.
Once all parameter matrices $A_{sub}$ for the different sub-blocks are extracted through the block matrix-based parameter calculation procedure, the coordinates $(x, y)$ of pixels belonging to different sub-blocks can be transformed to the corresponding real-world coordinates $(u, v)$ inside the sub-block $Sq_{sub}$. The block matrix-based projection transformation is deployed as a parallel computation of the sub-transformations illustrated in Equation (13). The deployment algorithm of a sub-transformation is illustrated in Algorithm 1.

Algorithm 1: Block matrix-based projection transformation deployment algorithm
Input: $M^{S}_{sub} = \{(u_i, v_i),\ i = 1, 2, 3, 4\}$: coordinate set of marker positions for sub-block $Sq_{sub}$; $M^{Im}_{sub} = \{(x_i, y_i),\ i = 1, 2, 3, 4\}$: corresponding pixel coordinate set of $M^{S}_{sub}$ on the image coordinate plane $Im(x, y)$; $(x, y)$: image coordinates of a captured pixel in the human shadow silhouette.
Output: $(u, v)$: corresponding real-world coordinates of $(x, y)$.

For each sub-block area $Sq_{sub}$, a distinct sub-transformation thread is launched based on the specific calibration matrix $A_{sub}$. The parallel computation of the block matrix-based projection transformation contains multiple sub-transformation threads. To simplify the presentation of the parallel computation, $A_{mat}$ is introduced as the collection of all calibration sub-matrices $\{A_{sub}\}$ for the different sub-blocks. The overall transformation is simplified as Equation (14).
$$(u, v)^T = F((x, y)^T, A_{mat}) \quad (14)$$
Based on Equation (14), the real-world coordinates $(u, v)$ of the human shadow silhouette $Sh$ can be extracted from the corresponding pixel coordinates $(x', y') \in Sh_c$ captured by a monocular camera. The block matrix-based projection transformation between the captured human shadow silhouette $Sh_c$ and the corresponding real-world shadow silhouette $Sh$ is illustrated in Equation (15).
The benefits of the block matrix-based projection transformation are obvious:
• The positions of markers can be reused to simplify the scenario setup. For a scenario containing an m × n square meter area, the number of markers is reduced from (4
• Parallel sub-transformations on different sub-blocks can be processed synchronously to accelerate the overall projection transformation procedure.
• Only when the position of the camera is moved or the ground surface is repaved will partial recalibration be necessary for the affected sub-block $Sq_{sub}$.
Overall, all parameter matrices $A_{sub}$ for the different sub-blocks only need to be calculated once. Then all pixel coordinates in video frames can be transformed into real-world coordinates on the ground surface plane. The block matrix-based structure also simplifies the parameter maintenance procedure when changes occur in the scenario.
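The deployment step can be sketched as a lookup over the collection $A_{mat}$. The sketch below is illustrative only: it assumes the row-vector convention used with Equation (6a), stores $A_{mat}$ as a dictionary keyed by the ground-plane origin of each sub-block (a hypothetical data layout), and selects the sub-block by checking that the recovered point falls inside that sub-block, whereas a practical system would more likely select it from the pixel region. The loop is sequential here; each iteration is independent and could be run as a parallel thread.

```python
import numpy as np

def pixel_to_ground(x, y, A):
    """Inverse projection f((x, y), A) under [x', y', w'] = [u, v, 1] @ A."""
    up, vp, s = np.array([x, y, 1.0]) @ np.linalg.inv(A)
    return up / s, vp / s

def block_transform(x, y, A_mat, block_size=100.0):
    """Sketch of F((x, y), A_mat): try each sub-block matrix in turn.

    A_mat maps the ground origin (u0, v0) of a sub-block to its 3x3 matrix.
    The first sub-block whose recovered (u, v) lies inside its own square
    is accepted; None is returned if the pixel belongs to no sub-block.
    """
    for (u0, v0), A_sub in A_mat.items():
        u, v = pixel_to_ground(x, y, A_sub)
        if u0 <= u <= u0 + block_size and v0 <= v <= v0 + block_size:
            return u, v
    return None

# Two hypothetical neighboring sub-blocks; the second has a small pixel
# offset standing in for a step in pavement height.
A1 = np.array([[2.0, 0.1, 0.001],
               [0.2, 1.5, 0.002],
               [30.0, 40.0, 1.0]])
A2 = A1.copy()
A2[2, 1] += 2.0
A_mat = {(0.0, 0.0): A1, (100.0, 0.0): A2}

# Pixel produced by the ground point (150, 50), which lies in the second block.
xp, yp, wp = np.array([150.0, 50.0, 1.0]) @ A2
result = block_transform(xp / wp, yp / wp, A_mat)
```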

Silhouette Information Extraction
For the extracted human shadow contour $Sh$ on the ground surface, joint positions are extracted through an optimized method based on seeking extreme points of the silhouette contour. Compared with traditional human segmentation methods, only silhouette information is available for shadow contour segmentation in our work. In order to perform efficient joint position extraction based on precise silhouette contour segmentation [17], a two-step algorithm is presented in this section.

Human Shadow Silhouette Contour Preprocess
Firstly, a search for global peak points on the shadow contour is launched to locate the most obvious joint positions on the human shadow contour. In this step, the gravity center coordinate $(\bar{u}, \bar{v})$ of the human shadow contour $Sh$ is calculated first. For a human shadow contour $Sh$ containing $N$ contour points $(u_m, v_m)$, the gravity center $(\bar{u}, \bar{v})$ can be extracted based on Equation (16).
Then, the distance curve $D$ between the contour points $(u_m, v_m) \in Sh$ and the gravity center $(\bar{u}, \bar{v})$ is calculated for the localization of global peak points. The value of each point on the distance curve $D$ is calculated based on Equation (17). The Cartesian distance is applied in Equation (17) as a linearized approximation for the value of each point on the distance curve.
In order to reduce the interference of a grainy ground surface in the joint position extraction procedure, the distance curve $D$ is denoised based on Equation (18). The smoothing length $\eta$ is set to 10 in our experiment. In the next step, the localization procedure of major joint positions is based on the denoised distance curve $D$.
The global peak points, including the head and both feet, appear at the maxima of the distance curve. Based on the denoised distance curve $D$, the major joint positions can be located by seeking peak points. The normalized distance curve extraction procedure is illustrated in Equations (16) to (18) and simplified in the stage Equation (19). In order to simplify the subsequent presentation, the function $Pre$ is introduced to cover the extraction procedure for the normalized distance curve $D$ based on the human shadow contour $Sh$.
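The preprocessing stage can be sketched as follows. Equations (16) to (18) are not reproduced in this text, so the sketch makes explicit assumptions: the gravity center is taken as the mean of the contour points, the distance is the ordinary point-to-center distance, and the denoising is a circular moving average of length $\eta$ (one plausible reading of the smoothing step). The function name `distance_curve` is hypothetical.

```python
import numpy as np

def distance_curve(contour, eta=10):
    """Return the gravity center and a smoothed contour-to-center distance curve.

    contour: array-like of shape (N, 2) with ordered contour points (u_m, v_m),
             N >= eta. The contour is closed, so smoothing wraps around.
    eta:     smoothing length (10 in the experiment described above).
    """
    contour = np.asarray(contour, dtype=float)
    center = contour.mean(axis=0)                    # assumed form of Eq. (16)
    d = np.linalg.norm(contour - center, axis=1)     # assumed form of Eq. (17)
    # Assumed form of Eq. (18): moving average on the closed curve. The curve
    # is padded with its own ends so that the window wraps around.
    kernel = np.ones(eta) / eta
    padded = np.concatenate([d[-eta:], d, d[:eta]])
    smooth = np.convolve(padded, kernel, mode="same")[eta:-eta]
    return center, smooth

# Sanity check on a synthetic circular contour of radius 5.
angles = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
circle = np.column_stack([5.0 * np.cos(angles), 5.0 * np.sin(angles)])
center, D = distance_curve(circle)
```

For a circle every contour point is equally far from the center, so the curve is flat; on a human shadow the head and feet show up as the three highest peaks.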

Localization of Major Joint Positions on Human Shadow Silhouette Contour
In the second step, a quick localization of global maximum peaks is launched first to locate the positions of the head and both feet, then an elaborate local search for major joints including hands, shoulders and knees is carried out.
(1) Localization of Global Convex Areas
The three global maximum peaks of the denoised curve $D$ are marked at the corresponding positions in Figure 2a with square symbols. The marked positions indicate the precise global convex areas on the human shadow silhouette contour, including the head $Sp^{head}$, left foot $Sp^{foot}_{left}$ and right foot $Sp^{foot}_{right}$. As shown in Figure 2b, the area containing the head is marked in red, and the areas containing the feet are marked in green.
(2) Localization of Auxiliary Anchor Points
Based on the known major joint positions, including the head and feet, the positions of the remaining joints are calculated by locating the local peak and nadir points.
Based on the three major joint positions, the shadow contour is divided into three sub-curves. Each sub-curve contains one auxiliary anchor point at the corresponding local nadir position on curve $D$. The auxiliary anchor points are marked with star symbols in Figure 2a.

• The sub-curve between the two feet joints contains the position of the hip center $Sp^{hip}$ at the local nadir position.
• The sub-curves between the head position and the two feet positions contain the positions of the two oxters (armpits) at local nadir positions, respectively.
The major joint position localization procedure is illustrated in the three steps above and simplified in the stage Equation (20). In order to simplify the subsequent presentation, the function $Loc$ is introduced to cover the localization procedure for the major joint position set $Sp^{joint}$ based on the human shadow contour $Sh$ and the corresponding distance curve $D$.
Noticeably, the joint position localization procedure can also be adopted for joint position extraction from a normal human pose contour. The human pose classification illustrated in Section 2.2.2 is based on the joint position extraction procedure illustrated in Equation (20).
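The peak and nadir search on the closed distance curve can be sketched as below. This is a simplified illustration, not the $Loc$ function itself: it returns the indices of the three highest local maxima and of the lowest point on each sub-curve between consecutive peaks, and it does not decide which peak is the head and which are the feet. The function names are hypothetical.

```python
import numpy as np

def local_maxima(d):
    """Indices of local maxima on a closed (circular) curve."""
    prev, nxt = np.roll(d, 1), np.roll(d, -1)
    return np.flatnonzero((d > prev) & (d >= nxt))

def locate_peaks_and_nadirs(d, n_peaks=3):
    """Return (peak indices, nadir indices) on a closed distance curve.

    Peaks:  the n_peaks highest local maxima (head and feet candidates).
    Nadirs: the lowest point strictly between each pair of consecutive
            peaks (hip center and oxter candidates), wrapping around.
    """
    d = np.asarray(d, dtype=float)
    n = len(d)
    candidates = local_maxima(d)
    peaks = np.sort(candidates[np.argsort(d[candidates])[-n_peaks:]])
    nadirs = []
    for i, start in enumerate(peaks):
        end = peaks[(i + 1) % len(peaks)]
        length = (end - start) % n
        arc = (start + np.arange(1, length)) % n   # indices between the two peaks
        nadirs.append(int(arc[np.argmin(d[arc])]))
    return peaks.tolist(), nadirs

# Synthetic closed curve with three equal peaks at 0, 100 and 200.
idx = np.arange(300)
curve = np.cos(3 * 2.0 * np.pi * idx / 300)
peaks, nadirs = locate_peaks_and_nadirs(curve)
```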

3D Joint Position Estimation and Skeleton Synthesis
In a multiple light source scenario, more than one human shadow is projected on the ground surface at the same time. In order to identify the shadow areas generated by different light sources, the 2D human shadow contour $Sh$ and the joint position region $Sp^{joint}$ are indexed with the corresponding light source identifier $i$ as shown in Equation (21). Additionally, the point coordinates $(u, v)$ inside each region $Sp^{joint}_{i}$ are indexed as (u ).
In order to estimate the 3D joint position $Mp^{joint}$ of each major joint, the light beams $L^{joint}_{i}$ from the different light sources $S_i$ blocked by $Mp^{joint}$ are reconstructed first. Then, the 3D position of $Mp^{joint}$ is calculated by locating the voxel area shared by the multiple reconstructed light beams $L^{joint}_{i}$. Finally, the human skeleton is synthesized based on the calculated 3D joint position set $\{Mp^{joint}\}$.
For the first step, each light beam is generated as a 3D cone with its vertex at the light source position $S_i = (u_i, v_i, h_i)$. The base of each cone is the joint area $Sp^{joint}_{i}$.
The 3D light beam shape extraction procedure presented in Section 2.1.3 is simplified in the stage Equation (23). To simplify the subsequent presentation, the function $Occ$ is introduced to cover the 3D light beam shape extraction procedure for the occupied 3D cone shape $L^{joint}_{i}$.
Figure 3b presents the skeleton synthesis procedure of a human being based on human shadow information under a multiple light source scenario. The illustrated human skeleton synthesis procedure is presented in Algorithm 2 and simplified in Equation (25).
However, there are two restrictions for the deployment of the basic theory:
Input: $\{S_i\}$: 3D positions of multiple light sources $\{(u_i, v_i, h_i)\}$.
Output: $Sk$: 3D human skeleton synthesis based on seven major joint positions.
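The geometric idea behind the joint estimation can be illustrated with a simplified stand-in. Instead of intersecting light cones in a voxel grid as described above, the sketch below shrinks each cone to its central ray (from the light source to the center of the joint's shadow area) and computes the point closest to all rays in the least-squares sense. The function names and the numerical scene are hypothetical.

```python
import numpy as np

def cast_shadow(source, joint):
    """Ground point (height 0) hit by the ray from a light source through a joint."""
    source, joint = np.asarray(source, float), np.asarray(joint, float)
    t = source[2] / (source[2] - joint[2])   # ray parameter where the height reaches 0
    return (source + t * (joint - source))[:2]

def triangulate_joint(light_sources, shadow_points):
    """Least-squares intersection of the rays light source -> shadow point.

    light_sources: iterable of (u_i, v_i, h_i).
    shadow_points: iterable of (u, v) shadow positions on the ground plane.
    Requires at least two non-parallel rays.
    """
    M, b = np.zeros((3, 3)), np.zeros(3)
    for s, g in zip(light_sources, shadow_points):
        s = np.asarray(s, dtype=float)
        g3 = np.array([g[0], g[1], 0.0])
        n = (g3 - s) / np.linalg.norm(g3 - s)   # unit direction of the ray
        P = np.eye(3) - np.outer(n, n)          # projector orthogonal to the ray
        M += P
        b += P @ s
    return np.linalg.solve(M, b)

# Hypothetical scene in centimeters: one joint, two light sources.
joint_true = np.array([10.0, 20.0, 150.0])
sources = [(0.0, 0.0, 500.0), (300.0, -100.0, 600.0)]
shadows = [cast_shadow(s, joint_true) for s in sources]
joint_est = triangulate_joint(sources, shadows)
```

With noise-free shadows the two rays meet exactly at the joint; with real data the result is the point that minimizes the summed squared distance to the rays.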

Skeleton Simulation in Single-Light-Source Scenario
The basic theory introduced in Section 2.1 is only effective in scenes containing two or more shadows generated by multiple light sources. For single light source scenarios, only one shadow is generated in each captured frame. In order to extend the proposed basic theory to single light source scenarios, a video sequence instead of a single frame is taken into consideration. Human shadow contours are indexed with the time coordinate $t$ in this part. The extension solution is introduced below.

Theoretic Proof of the Extension Solution in a Single Light Source Scenario
For every video sequence, the extension solution is based on two facts:

Temporal Distinguished Relative Position between Light Source and Human Body
In a sequence, the relative position between a moving human and a fixed light source keeps changing. In other words, temporally discrete shadows $Sh_t$ are generated by the light source from different relative positions $\theta_t$ towards the human. The temporally neighboring human shadows $Sh_t$ and $Sh_{t+1}$ are distinguished from each other because of the different relative positions between the light source $S$ and the human body. For neighboring frames at time coordinates $t$ and $t + 1$, it is clear that $\theta_t \neq \theta_{t+1}$ and $Sh_t \neq Sh_{t+1}$.

Temporal Discrete Shadows for Same Human Pose
In order to categorize different frames based on the human poses, the 2D contour of the human body captured by a monocular camera is regarded as the human pose $P_t$ at the time coordinate $t$. As shown in Figure 4b, the same human pose $P_0$ appears repeatedly during an activity sequence. Since the relative position between the light source and the human body differs frame by frame, the multiple frames sharing the same human pose $P_0$ each have a different shadow $Sh_t$ and projection angle $\theta_t$. In an activity sequence, all the human shadows $Sh_t$ sharing the same human pose $P_0$ are categorized into the set $\{Sh_t\}$. For different human shadows $Sh_t \in \{Sh_t\}$, the corresponding relative position angles $\theta_t$ are distinguished from each other. If multiple human shadows in $\{Sh_t\}$ with different projection angles $\theta_t$ are integrated in one single frame as shown in Figure 4b, condition (2) of launching the basic theory proposed in Section 2.1 is satisfied. By applying translation transformations to each integrated frame so that all human poses $P_t$ spatially coincide with the central pose $P_{j_c}$, an artificial multiple light source scenario satisfying conditions (1) and (2) is established as shown in Figure 4c. The simulated scenario makes it feasible to recover the skeleton of the shared pose $P_0$ in a single light source scenario based on the basic theory proposed in Section 2.1.

Temporal-Spatial Aggregation Method
Before the deployment of human skeletonization, it is necessary to find shadows that share the same human pose $P_0$ yet have distinct projection angles $\theta_t$. Human pose classification and temporal-spatial shadow aggregation are deployed to fit the spatial coordinates of the shadows in $\{Sh_t\}$ to the chosen central pose position $P_{t_c}$.

Human Pose Classification
In order to analyze the human pose $P_{t_i}$ at each time coordinate $t_i$, the denoised distance curve $D_{t_i}$ between the human pose contour and the human pose gravity center is extracted based on the same method illustrated in Equation (19). Similarly, the stage Equation (26) covers the normalized distance curve extraction procedure illustrated in Equations (16) to (18). The function $Pre$ represents the extraction procedure for the normalized distance curve $D_{t_i}$ based on the contour curve $P_{t_i}$.
Based on the distance curve $D_{t_i}$, the major peak point set $\{Jp^{k}_{t_i} \mid k = 1, 2, 3\}$, including the head and both feet, is extracted from the captured human contour based on the same procedure presented in Equation (20). Similar to the stage Equation ( 20
The positions of the three peak points of the human pose contour are combined into a star feature to describe the human pose in each frame [10,18]. Then unsupervised classification is adopted to assign each frame the corresponding pose category label based on the star feature [19].
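A possible form of the star feature and of the frame grouping is sketched below. The exact feature definition and the unsupervised classifier of [19] are not given in this text, so both parts are assumptions: the feature is taken as the offsets of the three peak points from the gravity center, normalized by their mean length, and the grouping uses a simple leader clustering with a hypothetical distance threshold as a stand-in for the actual classifier.

```python
import numpy as np

def star_feature(center, peaks):
    """6-D pose descriptor built from the head and feet peak points.

    Assumed definition: offsets of the three peaks from the gravity center,
    divided by their mean length so that the feature is invariant to
    translation and scale.
    """
    offsets = np.asarray(peaks, dtype=float) - np.asarray(center, dtype=float)
    scale = np.linalg.norm(offsets, axis=1).mean()
    return (offsets / scale).ravel()

def leader_clustering(features, threshold=0.3):
    """Assign each feature to the nearest existing leader, or start a new class.

    Stand-in for the unsupervised classification step; the threshold is a
    hypothetical tuning parameter.
    """
    leaders, labels = [], []
    for f in features:
        dists = [np.linalg.norm(f - leader) for leader in leaders]
        if dists and min(dists) < threshold:
            labels.append(int(np.argmin(dists)))
        else:
            leaders.append(f)
            labels.append(len(leaders) - 1)
    return labels

# Three hypothetical frames: standing, striding, and standing again
# (the third is the first pose shifted and scaled by two).
frames = [
    ((0, 0), [(0, 10), (-1, -10), (1, -10)]),
    ((0, 0), [(0, 10), (-5, -9), (5, -9)]),
    ((50, 30), [(50, 50), (48, 10), (52, 10)]),
]
labels = leader_clustering([star_feature(c, p) for c, p in frames])
```

The first and third frames receive the same category label, so their shadows would be aggregated for the same pose.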

Temporal-Spatial Shadow Aggregation
During the temporal-spatial shadow aggregation procedure shown in Figure 4c, temporally discrete light sources are aggregated in a single frame. Normally, for multiple human poses, the human pose $P_{t_i}$ with the median time coordinate $t_i$ is chosen as the central pose $P_{j_c}$.
For each human pose $P_{t_i} \in \{P_j\}$, the translation transformation parameter $T_{t_i \to j_c}$ is defined by the vector between corresponding joint points in $P_{t_i}$ and $P_{j_c}$, satisfying the spatial transformation from $P_{t_i}$ to $P_{j_c}$.
Noticeably, the joint positions $Jp^{k}_{t_i}$ and $Jp^{k}_{j_c}$ are captured in the image coordinate plane. Before calculating the translation transformation $T_{t_i \to j_c}$ in real-world coordinates, it is necessary to transform the joint coordinates into the real-world coordinates $Sp^{k}_{t_i}$ and $Sp^{k}_{j_c}$ based on the stage projection transformation $F((x, y), A_{mat})$ presented in Equation (14).
As shown in Equation (29a), $Sp^{k}_{t_i}$ denotes the extracted real-world human joint coordinates at the original captured position. In Equation (29b), $Sp^{k}_{j_c}$ denotes the human joint coordinates at the destination position. Then the translation vector $\overrightarrow{T_{t_i \to j_c}}$ is calculated as the average of the horizontal translation vectors from the original position to the destination position. As shown in Equation (29c), $\alpha$ and $\beta$ are two perpendicular unit vectors of the real-world coordinate system on the ground surface. The translation vector $\overrightarrow{T_{t_i \to j_c}}$ is presented as a combination of the translation components $\Delta u_{(t_i, j_c)}$ and $\Delta v_{(t_i, j_c)}$ in the two perpendicular directions. Based on the translation components $\Delta u_{(t_i, j_c)}$ and $\Delta v_{(t_i, j_c)}$, a translation transformation matrix $A_{t_i \to j_c}$ can be established for the translation calculation as shown in Equation (29d).
For the convenience of further illustration, the extraction procedure of matrix $A_{t_i \to j_c}$ is simplified in Equation (30). The function $Par$ is introduced to represent the translation matrix extraction procedure illustrated in Equations (29a) to (29d).
In order to maintain a consistent expression system, the translation transformation is presented in the same format with Equations ( 14) and (15).When the translation transformation in Equations ( 31) and ( 32) is deployed synchronously on the light source S t i and human shadow contour Sh t i for each frame, all the transformed human shadows Sh t i fit the spatial coordinates of the central human pose P t c in each simulated scenario.
The position of light source S t i applies the same transformation T t i →j c along with the related shadow Sh t i , simulating multiple light sources S t i in the single frame.
The temporal-spatial aggregation procedure illustrated above is presented in Algorithm 3. With more than two positionally distinct light sources simulated in the same frame, the skeleton synthesis procedure presented in Equation (24) can be applied to the simulated human shadow set $\{Sh_{t_i}\}$ and the corresponding light source set $\{S_{t_i}\}$. Based on Equation (33), the skeleton $Sk_{j_c}$ of pose $P_{j_c}$ can be synthesized under a single light source scenario, as shown in Figure 4d. The detailed human skeleton synthesis procedure under a single light source scenario is illustrated in Section 3.

Proposed Method
Based on the basic theory and its extension introduced in Section 2, an ordinary single light source scenario can support 3D human skeletonization. In this section, a five-step algorithm is proposed according to that theory, as shown in Figure 5. The procedure of the proposed method is given in Algorithm 4.

Pose Classification
In a human activity sequence captured under a single light source scenario, frames at different time coordinates are classified according to the human poses [8] they contain. For each captured human pose $P_{t_i}$ at time coordinate $t_i$, the denoised distance curve between the contour points and the gravity center of $P_{t_i}$ is extracted using the method presented in Equation (19). The application of this extraction method to the human pose $P_{t_i}$ is presented in Equation (34).
Based on the method presented in Equation (20), a major peak joint position set $\{Jp^k_{t_i}\}$ is extracted from the human contour $P_{t_i}$, including the head position $Jp^1_{t_i}$, the left foot position $Jp^2_{t_i}$ and the right foot position $Jp^3_{t_i}$. Based on the normalized peak joint positions, raw frames containing same-class human poses $P_{t_i}$ are aggregated into the human pose category $P_j$ using the automatic unsupervised clustering illustrated in Equation (28). $C_{t_i}$ is the category label of human pose $P_{t_i}$, as shown in Equation (36).
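To make the grouping idea concrete, the sketch below normalizes the peak joints relative to the gravity center and assigns each pose to the first category whose representative is close enough. The greedy threshold rule, the `threshold` value and all function names are assumptions for illustration only. They are a simple stand-in and not the clustering defined in Equation (28).

```python
import math


def normalize(joints, center):
    """Express joints relative to the gravity center, scaled by the largest radius."""
    rel = [(x - center[0], y - center[1]) for x, y in joints]
    scale = max(math.hypot(x, y) for x, y in rel) or 1.0
    return [(x / scale, y / scale) for x, y in rel]


def pose_distance(a, b):
    """Mean Euclidean distance between corresponding normalized joints."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def classify_poses(poses, threshold=0.2):
    """Greedy grouping: reuse a category if its representative pose is close enough."""
    representatives, labels = [], []
    for pose in poses:
        for label, rep in enumerate(representatives):
            if pose_distance(pose, rep) < threshold:
                labels.append(label)
                break
        else:
            # No existing category matches, so this pose starts a new one.
            representatives.append(pose)
            labels.append(len(representatives) - 1)
    return labels
```

Because the joints are normalized before comparison, two frames showing the same pose at different positions and apparent sizes fall into the same category.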

Preprocess
The preprocessing procedure transforms the captured shadow contour pixel coordinates $Shc_{t_i}$ into the real-world coordinates $Sh_{t_i}$.
Before the first shadow contour $Shc_{t_i}$ is preprocessed, all $A_{sub} \in A_{mat}$ are calculated and saved for subsequent preprocessing. For each square unit area $Sq_{sub}$, the related projection parameter matrix $A_{sub}$ is calculated from four real-world coordinates $\{M^{S}_{sub}\}$ and their corresponding image coordinates $\{M^{Im}_{sub}\}$ using Equation (12). With the calibration matrix set $A_{mat} = \{A_{sub}\}$, the global projection transformation $F(Shc_{t_i}, A_{mat})$ can be determined. Using the projection transformation presented in Equation (37), the captured human shadow contour pixel coordinates $Shc_{t_i}$ are extracted from each raw frame and transformed into the real-world coordinates $Sh_{t_i}$.
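As an illustration of the per-block calibration, the sketch below fits one projective matrix from four marker correspondences and uses it to map a pixel to ground coordinates. It assumes that each $A_{sub}$ behaves like a standard 3x3 planar homography with its last entry fixed to 1. The exact parameterization of Equation (12) is not reproduced here, and `fit_block_projection` and `project` are hypothetical names.

```python
import numpy as np


def fit_block_projection(image_pts, world_pts):
    """Solve the 3x3 projective matrix mapping four image points to four ground points."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(image_pts, world_pts):
        # u = (h0*x + h1*y + h2) / (h6*x + h7*y + 1), rearranged to be linear in h.
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        # v = (h3*x + h4*y + h5) / (h6*x + h7*y + 1), rearranged likewise.
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    h = np.linalg.solve(np.array(rows, dtype=float), np.array(rhs, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)


def project(a_sub, point):
    """Map an image point (x, y) to ground-plane coordinates with one block matrix."""
    x, y = point
    u, v, w = a_sub @ np.array([x, y, 1.0])
    return np.array([u / w, v / w])
```

In a full pipeline, one such matrix would be stored per square unit area, and each contour pixel would be projected with the matrix of the block that contains it.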

Temporal-Spatial Aggregation
Preprocessed shadow contours $Sh_{t_i}$ are aggregated according to the category $P_j$ of the corresponding human pose $P_{t_i}$. Nevertheless, the real-world coordinates of $P_{t_i} \in P_j$ are spatially dispersed because of human movement, as shown in Figure 4a. It is therefore necessary to aggregate the shadow contours $Sh_{t_i}$ onto the same central human pose $P_{j_c}$ before precise joint position estimation can be carried out.
For each pose category, one central human pose $P_{j_c}$ is chosen as the aggregating destination for the other human shadows $Sh_{t_i}$ related to $P_{t_i} \in P_j$.
The translation of each human shadow $Sh_{t_i}$ is based on the translation transformation calibration matrix $A_{t_i \to j_c}$, which is calculated using Equation (30). Since the major peak joint position sets $\{Jp^k_{t_i}\}$ and $\{Jp^k_{j_c}\}$ are obtained in the pose classification step, the translation transformation calibration matrix $A_{t_i \to j_c}$ can be extracted as shown in Equation (38).
Along with the 2D translation of each $Sh_{t_i}$, the corresponding 3D light source position $S_{t_i}$ is moved by the identical translation, as shown in Equation (39b). The aggregated human shadows $Sh_{t_i}$ and light sources $S_{t_i}$ provide the ideal multiple light source situation for 3D joint position estimation.
As shown in Figure 4b, when $P_{t_2}$ is set up as the aggregating destination, the other $Sh_{t_i}$ are aggregated to the aggregating destination through the 2D translation shown in Equation (39a).
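The aggregation step can be sketched as below: every frame of a pose category is shifted onto the central pose, and the single light source is shifted by the same ground-plane offset, which yields one simulated light source per frame. The data layout (dicts with `joints` and `shadow` keys) and the function names are assumptions for illustration. The light source height is left unchanged because the translation acts only along the ground plane.

```python
def mean_shift(src, dst):
    """Average ground-plane displacement that moves src joints onto dst joints."""
    n = len(src)
    du = sum(d[0] - s[0] for s, d in zip(src, dst)) / n
    dv = sum(d[1] - s[1] for s, d in zip(src, dst)) / n
    return du, dv


def aggregate_frames(frames, light, central):
    """Shift each frame's shadow, and a copy of the light source, onto the central pose.

    frames: list of dicts with 'joints' and 'shadow' in ground coordinates.
    light: (x, y, z) position of the single real light source.
    central: index of the frame chosen as the aggregating destination.
    """
    target = frames[central]["joints"]
    scenario = []
    for frame in frames:
        du, dv = mean_shift(frame["joints"], target)
        shadow = [(u + du, v + dv) for u, v in frame["shadow"]]
        # Same shift for the light source; its height z is unaffected.
        moved_light = (light[0] + du, light[1] + dv, light[2])
        scenario.append((shadow, moved_light))
    return scenario
```

The central frame itself receives a zero shift, so its shadow and light source stay in place while the others are moved around it.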

Joint Position Estimation and Skeleton Synthesis
For each aggregated human shadow contour $Sh_{t_i}$, joint area estimation is carried out using the algorithm introduced in the basic theory section. First, the gravity center $G_{t_i}$ of curve $Sh_{t_i}$ is calculated. Then, the denoised distance curve $D_{t_i}$ between each point $(x_{t_i}, y_{t_i}) \in Sh_{t_i}$ and $G_{t_i}$ is obtained using the preprocessing procedure illustrated in Section 2.1.2.
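A minimal sketch of the distance curve computation follows. It assumes that the gravity center is the mean of the contour points and that denoising is a circular moving average. Both are simplifications, and the window size and function names are illustrative rather than taken from Section 2.1.2.

```python
import numpy as np


def distance_curve(contour, window=5):
    """Return the gravity center and the smoothed point-to-center distance curve."""
    pts = np.asarray(contour, dtype=float)
    center = pts.mean(axis=0)
    dist = np.linalg.norm(pts - center, axis=1)
    # The contour is closed, so pad by wrapping around before averaging.
    half = window // 2
    padded = np.pad(dist, half, mode="wrap")
    kernel = np.ones(window) / window
    smooth = np.convolve(padded, kernel, mode="valid")
    return center, smooth


def local_peaks(curve):
    """Indices of local maxima on a circular curve (candidate extremity joints)."""
    prev_vals = np.roll(curve, 1)
    next_vals = np.roll(curve, -1)
    return np.flatnonzero((curve > prev_vals) & (curve >= next_vals))
```

The window should be odd so that the smoothed curve keeps one value per contour point.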
The 2D positions of the major joint areas, including head $Sp^1_i$, neck $Sp^2_i$, hip center $Sp^3_i$, left knee $Sp^4_i$, right knee $Sp^5_i$, left foot $Sp^6_i$ and right foot $Sp^7_i$, can be obtained by locating the peak and nadir points in $D_{t_i}$. In each simulated scenario, silhouette information extraction is applied to each joint area $Sh_{i\_k}$. To estimate the 3D joint position based on $Sp_{t_i}$, the ray set $L^k_{t_i}$ connecting the light source $S_{t_i}$ and the joint shadow area $Sp_{t_i}$ is simulated.
Since more than two simulated light sources $S_{t_i}$ exist in the scenario, the silhouette information of a single joint area is extracted separately for each light source. From all ray sets $L_{t_i\_k}$ targeting the same joint, the 3D joint position $Mp^k_j$ can be calculated using Equation (43).
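The geometric idea can be illustrated with a simplified point version: each ray set is collapsed to a single centerline from the light source to the center of the joint shadow area, and the joint is taken as the point closest to all centerlines in the least-squares sense. This replaces the cone intersection of the basic theory with a line intersection, so it is an assumption-laden stand-in for Equation (43), and `triangulate_joint` is a hypothetical name.

```python
import numpy as np


def triangulate_joint(lights, shadow_points):
    """Least-squares point nearest to every ray from a light to its shadow point.

    lights: 3D light source positions, one per simulated scenario.
    shadow_points: 3D positions of the joint shadow centers (z = 0 on the ground).
    """
    a = np.zeros((3, 3))
    b = np.zeros(3)
    for light, shadow in zip(lights, shadow_points):
        origin = np.asarray(light, dtype=float)
        direction = np.asarray(shadow, dtype=float) - origin
        direction /= np.linalg.norm(direction)
        # Projector onto the plane orthogonal to the ray direction.
        proj = np.eye(3) - np.outer(direction, direction)
        a += proj
        b += proj @ origin
    # Normal equations of the summed squared point-to-line distances.
    return np.linalg.solve(a, b)
```

At least two rays with different directions are needed, otherwise the system is singular. This matches the requirement that the light sources sit at different angular positions relative to the body.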
By repeating the steps above for each major joint, the 3D joint position set $\{Mp^k_j\}$ containing all joint positions is obtained. The joint positions are then combined into a skeleton using Equation (44).
To simplify the presentation in Algorithm 4, the joint position estimation and skeleton synthesis procedure illustrated above is condensed into Equation (45).

Frame Integration
By repeating the above steps, synthesized 3D human skeletons $Sk_{j_c}$ can be generated for all human poses, category by category. The kinematic model of skeleton $Sk_{p_j}$ contains seven major joints: head, neck, hip, both knees and both feet. The bones connecting particular joints are regarded as rigid objects. Based on the pose classification result in step (1), the time coordinate $t_i$ of each $P_{t_i} \in P_j$ can be tracked. The synthesized human skeleton $Sk_{j_c}$ is then reassigned to frame $t_i$ as $Sk_{t_i}$ through the reverse translation transformation.

Experimental Validation
In this section, the experimental data source and settings are described first. Then the effective range and precision of the proposed method are validated in comparison with the RGB-D-based method.

Data Source Description and Experimental Settings
The experiments are conducted on data captured by a Kinect RGB-D camera, containing daily human activities. The captured sequences include both RGB frames and normal depth frames captured by Kinect. Kinect extracts the human skeleton automatically from the combined information of RGB frames and depth frames [15]. SSSE, however, is applied only to the RGB frames captured by the monocular RGB camera on Kinect.
In each sequence captured for effective range validation, Kinect is set up at a fixed distance from the human subject. The photographic distance increases from 1 m to 20 m in steps of 1 m. Sampled skeletonization results of both methods at different distances are presented in Figure 6a.

Effective Range and Precision Analyses
To validate the effectiveness of the proposed SSSE method and the traditional RGB-D method, two aspects are evaluated: effective range and precision. In the following, the effective distance range is determined first. Then, the precision of six major 3D joint positions extracted by SSSE is evaluated.

Effective Range
Effective range is defined as the range of sensor-to-subject distances at which human skeleton extraction is effective, that is, at which it generates valid human joint positions. For the RGB-D-based method, each extracted joint position comes with a confidence index, and valid joints are joints with confidence above 0.7. For the SSSE method, valid joints are extracted joints not affected by occlusion. In the following experiments, frames in which all simulated human joints are valid are defined as effective frames. To obtain the effectiveness-distance relationship of both methods, the shares of effective frames at different distance levels are measured. For each method, 1000 to 1200 frames containing 3D human skeletons are sampled and evaluated at each photographic distance from 1 to 20 m. For the RGB-D-based skeletonization procedure, effective frames are automatically labeled based on the corresponding joint confidence. For the SSSE-based procedure, effective frames are chosen based on the number of valid joints in each skeleton. The effectiveness of each method at a given distance is represented by the share of effective frames among all frames. For each method, the effective range covers the photographic distances whose effectiveness exceeds a specified threshold.
The official specification of Kinect [1,15] indicates that the effective range of the state-of-the-art Kinect is from 0.8 m to 3.5 m. Accordingly, the range of distances where effectiveness is above 0.8 is regarded as the effective range. As shown in Figure 6b, the effective range of SSSE is 7-10 m. Note that the effectiveness of SSSE decreases when the photographic distance exceeds 10 m because of the limited camera resolution. The experimental result in Figure 6a shows that SSSE can provide reliable 3D human skeletonization at an effective range of 7-10 m, while Kinect is unable to extract human skeleton information when the photographic distance exceeds 5 m.
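The effectiveness measure described above reduces to a per-distance share of effective frames followed by a threshold test. The sketch below uses the 0.7 confidence limit and the 0.8 effectiveness threshold from the text. The input layout and the function names are assumptions for illustration.

```python
def kinect_frame_effective(confidences, min_conf=0.7):
    """A Kinect frame is effective when every joint confidence is above the limit."""
    return all(c > min_conf for c in confidences)


def effectiveness_by_distance(frames):
    """frames: iterable of (distance_m, is_effective). Returns {distance: share}."""
    totals, hits = {}, {}
    for distance, effective in frames:
        totals[distance] = totals.get(distance, 0) + 1
        hits[distance] = hits.get(distance, 0) + (1 if effective else 0)
    return {d: hits[d] / totals[d] for d in sorted(totals)}


def effective_range(shares, threshold=0.8):
    """Distances whose share of effective frames exceeds the threshold."""
    return [d for d, share in shares.items() if share > threshold]
```

For SSSE, the `is_effective` flag would come from the occlusion check on the extracted joints rather than from a confidence index.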

Precision Evaluation
As the effectiveness evaluation above shows, the RGB-D-based method and SSSE provide effective skeleton extraction results at different distance ranges. The precision of each joint extracted by SSSE is determined by its deviation from the corresponding ground truth joint position. In the precision evaluation procedure, two Kinects are set up for different purposes. Kinect No. 1 is set up 9 m away from the human subject, capturing RGB frames for human skeletonization based on SSSE. Kinect No. 2 is set up 3 m away from the human subject, simultaneously capturing RGB-D frames along with 3D human skeletons. Since 3 m is inside the effective range of RGB-D-based 3D skeletonization, the 3D joint positions captured by Kinect No. 2 are valid joints and provide the ground truth for the deviation calculation. With this experimental setup, 1546 frames are captured simultaneously for both methods, of which 1345 effective frames are evaluated.
For each skeleton extracted from an effective frame, joint positions are normalized relative to the hip center, which avoids deviation introduced by the different shooting distances.
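The normalization and deviation measurement can be sketched as follows. The sketch assumes that both skeletons are given as dictionaries of 3D joint positions in a common orientation and the same length unit, and that the deviation is the Euclidean distance between hip-relative positions. The joint names and function names are illustrative.

```python
import math


def normalize_to_hip(skeleton, hip="hip"):
    """Express every joint relative to the hip center of the same skeleton."""
    hx, hy, hz = skeleton[hip]
    return {name: (x - hx, y - hy, z - hz) for name, (x, y, z) in skeleton.items()}


def joint_deviations(estimate, truth, hip="hip"):
    """Per-joint Euclidean deviation between two hip-normalized skeletons."""
    est = normalize_to_hip(estimate, hip)
    ref = normalize_to_hip(truth, hip)
    return {name: math.dist(est[name], ref[name]) for name in ref if name != hip}
```

Averaging these per-joint deviations over all effective frames gives one value per joint, which is the kind of summary a per-joint precision plot reports.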
Six major joints are considered in the evaluation: head, spine, both knees and both feet. Figure 7 depicts the averaged precision evaluation result.
As presented in Figure 7, because of the larger scale of the upper body shadow on the ground, relatively high deviations appear at the head and spine joints, where the averaged deviations reach 14.5 cm and 12.1 cm, respectively. For the remaining joints, the averaged deviations are around 4 cm and the highest deviation remains below 8 cm. In summary, SSSE extracts joint positions with reasonable precision at 9 m from the target human, compared with the ground truth Kinect skeletonization result obtained at a position 6 m closer to the subject.

Discussion
The experimental results in Section 4.2 show that the effective ranges of the proposed SSSE and the traditional RGB-D method are highly complementary. A fusion of SSSE and the traditional RGB-D method can therefore provide wide-range human skeletonization for indoor and outdoor scenarios. In the fusion method, the traditional RGB-D method and SSSE are deployed under different scenarios. For humans inside the effective range of RGB-D cameras, the traditional RGB-D-based skeletonization method provides solid human skeleton extraction. For humans outside the effective range of RGB-D cameras, the SSSE method can redress the unreliable 3D joint positions that appear in the RGB-D skeletonization result. To evaluate the effectiveness of the fusion application, this section compares the reliable joint percentage of the original skeletons extracted by Kinect with that of the redressed skeletons processed by SSSE. Reliable joints are defined as joints generated by SSSE that are not affected by occlusion, and joints generated by Kinect with a confidence index above 0.7. Conversely, unreliable joints are joints affected by occlusion in the SSSE method, or joints generated by Kinect with a confidence index under 0.7. For a better evaluation of the fusion application, Kinect is set up to skeletonize a human subject outside its effective range.
The unreliable 3D joint positions in the Kinect skeletonization results are redressed by SSSE simultaneously. In total, 20 sets of experiments have been conducted to evaluate the reliable joint percentage enhancement.
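The per-joint fusion rule described above can be sketched as below, using the 0.7 confidence limit from the text. The sketch assumes that both methods report joints under shared names and that an occluded SSSE joint is represented by `None`. The function name and data layout are hypothetical.

```python
def fuse_skeleton(kinect_joints, ssse_joints, min_conf=0.7):
    """Keep reliable Kinect joints and redress unreliable ones with SSSE estimates.

    kinect_joints: {name: (position, confidence)}
    ssse_joints: {name: position, or None when the joint is occluded}
    """
    fused = {}
    for name, (position, confidence) in kinect_joints.items():
        if confidence > min_conf:
            fused[name] = position           # reliable Kinect joint, keep it
        elif ssse_joints.get(name) is not None:
            fused[name] = ssse_joints[name]  # redress with the SSSE estimate
        else:
            fused[name] = None               # unreliable in both sources
    return fused
```

Joints that remain `None` after fusion are the ones neither method could recover reliably in that frame.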

Reliable Joint Percentage Enhancement
The enhancement of the reliable joint percentage is evaluated by determining the precisely recovered joint rate $J_R$ and the precisely recovered frame rate $F_R$. As shown in Equations (6a)-(6c), $N_{Ej}$ and $N_{Ef}$ are the number of unreliable joints and the number of frames they affect, respectively. $N_{rj}$ is the total number of unreliable joint positions recovered after deploying the SSSE procedure, while $E_{rj}$ is the number of inaccurately recovered joints. In terms of frame statistics, $N_{rf}$ is the total number of recovered frames and $E_{rf}$ is the number of frames containing inaccurately recovered joints.
The 20 test sets presented in Table 1 indicate that more than four-fifths of all unreliable joints are successfully redressed by the proposed SSSE method, and more than three-quarters of all frames containing unreliable joint skeletonization results are accurately fixed. Based on these experimental results, the fusion of SSSE and the traditional RGB-D method proves effective in enhancing the reliable joint percentage.

Computational Cost Evaluation
The simultaneous collaboration between the RGB-D skeletonization method and the proposed SSSE method is crucial for the real-time deployment of the fusion application. Limiting the computational cost is therefore essential for the effectiveness of the fusion method. The test platform is a mainstream personal laptop connected to a first-generation Kinect, equipped with one Intel Core i7 central processing unit (CPU) and 16 gigabytes of random access memory (RAM). Two indicators, namely the maximum process capability per second and the single-frame delay, are considered in evaluating the computational cost. The test processes as many frames as the computational capability allows with the proposed method. The computational cost efficiency of the fusion application is determined by the number of frames processed per second. As shown in Figure 8, the stable maximum process capability remains around 25 frames per second after an initial stage in which fewer than 10 frames are processed per second. This result indicates that the fusion application is feasible for real-time deployment.

Conclusions
In this paper, we proposed a shadow silhouette-based skeleton extraction (SSSE) method. SSSE extracts a three-dimensional human skeleton from the human shadow information on the ground. Specifically, the proposed SSSE method comprises the following: (1) A block matrix-based projection transformation is proposed, allowing the reconstruction of precise shadow silhouette information from the human shadow captured by a monocular camera. (2) A silhouette shadow-based human skeleton extraction method is proposed. The proposed SSSE method extracts the 3D positions of seven major joints in the human skeleton based on the reconstructed human shadow silhouette information and the light source position. (3) A temporal-spatial integration algorithm for discrete shadow silhouette information is proposed, enabling SSSE-based human skeletonization in a single light source scenario.
As shown in Table 2, compared with the traditional RGB-D human skeletonization method and other mono-RGB methods, the proposed SSSE method has the following advantages: (1) The SSSE method can be deployed in large-scale outdoor scenarios where traditional 3D human skeletonization algorithms are not effective. (2) The SSSE method is capable of extracting human skeletons from frames shot by any ordinary monocular camera. (3) The SSSE method can be deployed to extend the effective range of the traditional RGB-D skeletonization method in the fusion application.
For traditional outdoor surveillance systems, the limited 8-bit color depth in the analog transmission system restricts the precision of depth information. Based on the proposed SSSE method, precise 3D human skeleton activities can be extracted at any monitoring terminal. The extracted 3D human skeleton activities will enrich the information for surveillance video analyses, enabling convenient 3D scenario reproduction. Because of its simple device requirements and its compatibility with traditional surveillance networks, the proposed SSSE is an ideal upgrade solution for a traditional surveillance system without extra hardware expenditure.
In conclusion, SSSE offers an extra choice for 3D human skeletonization other than depth cameras, wearable sensors, or illuminator arrays, laying down a milestone for deploying in-lab human skeleton-related methods [6,16,20] in outdoor scenarios with ordinary photographic devices. Based on the unique outdoor merits provided by SSSE, we will focus our future research on applications of SSSE to outdoor surveillance and unmanned aerial vehicle navigation.

Figure 1. Demo of silhouetted shadow-based skeleton extraction (SSSE) in a multi light source scenario: (a) A simulated dual light source scenario; (b) Scenario reconstruction.

Figure 2. Silhouette information analyses and joint position extraction. (a) Sub-curve segmentation; (b) two-dimensional (2D) joint position extraction on the shadow area.

When light source $S_i$ is blocked by a certain joint part $Mp^{joint}$ of human body $M$, the joint shadow area $Sp^{joint}_i$ is produced on the ground. Thus, the blocked light beam $L^{joint}_i$ leads to the shadow area $Sp^{joint}_i \subset Sh_i$, passing through the human body part $Mp^{joint}$. If $w \in [0, h_i]$ is introduced as the height component in the cone expression of $L^{joint}_i$, the 3D space occupied by $L^{joint}_i$ can be expressed as Equation (22).
Figure 3a demonstrates the recovery of the neck joint area $Mp^{neck}$ based on two related shadow areas $Sp^{neck}_a$ and $Sp^{neck}_b$ generated by light sources $S_a$ and $S_b$. Finally, the 3D human skeleton $Sk$ with multiple joint positions is synthesized by calculating $Mp^{joint}$ joint by joint. Figure 3b presents the skeleton synthesis procedure of a human being based on human shadow information under a multi light source scenario. The illustrated human skeleton synthesis procedure is presented in Algorithm 2 and simplified in Equation (25).

• Condition (1): Two or more light sources are required in the scene.
• Condition (2): The relative angular positions between the human body and the different light sources should be different.

Figure 3. Demo of SSSE in a multiple light source scenario: (a) A simulated dual light source scenario. $Sp^{neck}_a$ and $Sp^{neck}_b$ are the joint areas of the neck position in the shadows projected by light sources a and b, respectively. Similarly, light beams $L^{neck}_a$ and $L^{neck}_b$ are generated by light sources a and b, respectively. The enclosure $Mp^{neck}$ is the intersection area of $L^{neck}_a$ and $L^{neck}_b$. (b) Scenario reconstruction.

Figure 4. Demo of SSSE procedure in a single light source scenario. (a) Pose classification based on major joint positions; (b) Spatial-temporal discrete human poses belonging to the same class; (c) Temporal-spatial aggregation; (d) Three-dimensional (3D) human skeletonization.

Figure 5. The flow chart of the skeleton synthesis procedure based on SSSE.

Figure 6. Experimental results. (a) A comparison of tracking results; (b) Effective ranges of RGB-D-based results and SSSE-based results.

Equation (27) covers the major joint position localization procedure illustrated in Section 2.1.2. The function Loc denotes the human joint position extraction procedure for the major joint position set $\{Jp^k_{t_i}\}$ based on the human contour $Sh_{t_i}$ and the corresponding distance curve $D_{t_i}$.
$\{Jp^k_{t_i}\} = Loc(Sh_{t_i}, D_{t_i})$ (27)

Algorithm 3: Temporal-spatial aggregation procedure
Input: $t_i$: time coordinate for each frame; $Sh_{t_i}$: human shadow on the ground surface in frame $t_i$; $P_{t_i}$: human pose in frame $t_i$; $S$: light source position; $\{Jp^k_{j_c}\}$: joint position set of the central pose at the aggregation destination.
Output: $Sh_{t_i}$: integrated human shadow in the simulated scenario; $S_{t_i}$: integrated light source position in correspondence with $Sk_{t_i}$.
1 foreach time coordinate $t_i$ do

Algorithm 4: Skeleton synthesis procedure
Input: $t_i$: time coordinate for each frame; $Shc_{t_i}$: captured human shadow in frame $t_i$; $P_{t_i}$: human pose in frame $t_i$; $S$: light source position; $\{Sq_{sub}\}$: the set of sub-blocks on the ground surface plane $S$; $\{M^{S}_{sub}\}$: the marker position coordinate sets for sub-block $Sq_{sub}$ on $S$; $\{M^{Im}_{sub}\}$: the pixel coordinate set of $\{M^{S}_{sub}\}$ on the image coordinate plane.
Output: $Sk_{t_i}$: 3D human skeleton corresponding to $Sh_{t_i}$ at time coordinate $t_i$.
1 foreach $Sq_{sub}$ do

The 2D positions of the major joint areas (head $Sp^1_i$, neck $Sp^2_i$, hip center $Sp^3_i$, left knee $Sp^4_i$, right knee $Sp^5_i$, left foot $Sp^6_i$ and right foot $Sp^7_i$) can be obtained by locating the peak and nadir points in $D_{t_i}$.

Table 1. Result of unreliable joint position redress. $J_R$ is the precisely recovered joint rate. $F_R$ is the precisely recovered frame rate. $N_{Ej}$ is the number of unreliable joints. $N_{Ef}$ is the number of frames affected by unreliable joints. $N_{rj}$ is the total number of recovered unreliable joints. $E_{rj}$ is the number of inaccurately recovered joints. $N_{rf}$ is the total number of recovered frames. $E_{rf}$ is the number of frames containing inaccurately recovered joints.

Table 2. Comparison between external sensor information-based quadcopter monitoring methods.
Method | Device | Effective range $R_e$ | Output | Number of joints
Proposed SSSE | Monocular RGB Camera | 7.0 m < $R_e$ < 10 m | Human Skeleton | 7
Traditional RGB-D Method [20] | RGB-D Camera | 0.8 m < $R_e$ < 3.5 m | Human Skeleton | 20
SSSE and RGB-D Fusion | RGB-D Camera | 0.8 m < $R_e$ < 10 m | Human Skeleton | 7 to 20
Jafari's RGB-D method [16] | RGB-D Camera | Not Available (N/A) | Human Voxel | 0
Yang's mono-RGB method [6] | Multiple RGB Cameras | N/A | Partial Voxels | 0