Sensors
  • Article
  • Open Access

11 January 2019

DeepMoCap: Deep Optical Motion Capture Using Multiple Depth Sensors and Retro-Reflectors

1 Centre for Research and Technology Hellas, Information Technologies Institute, 6th km Charilaou-Thermi, 57001 Thermi, Thessaloniki, Greece
2 National Technical University of Athens, School of Electrical and Computer Engineering, Zografou Campus, Iroon Polytechniou 9, 15780 Zografou, Athens, Greece
3 School of Computer Science, University of Lincoln, Brayford LN67TS, UK
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Depth Sensors and 3D Vision

Abstract

In this paper, a marker-based, single-person optical motion capture method (DeepMoCap) is proposed using multiple spatio-temporally aligned infrared-depth sensors and retro-reflective straps and patches (reflectors). DeepMoCap explores motion capture by automatically localizing and labeling reflectors on depth images and, subsequently, in 3D space. Introducing a non-parametric representation to encode the temporal correlation among pairs of colorized depthmaps and 3D optical flow frames, a multi-stage Fully Convolutional Network (FCN) architecture is proposed to jointly learn reflector locations and their temporal dependency among sequential frames. The extracted reflector 2D locations are spatially mapped in 3D space, resulting in robust 3D optical data extraction. The subject’s motion is efficiently captured by applying a template-based fitting technique on the extracted optical data. Two datasets have been created and made publicly available for evaluation purposes; one comprising multi-view depth and 3D optical flow annotated images (DMC2.5D), and a second consisting of spatio-temporally aligned multi-view depth images along with skeleton, inertial and ground truth MoCap data (DMC3D). The FCN model outperforms its competitors on the DMC2.5D dataset using the 2D Percentage of Correct Keypoints (PCK) metric, while the motion capture outcome is evaluated against RGB-D and inertial data fusion approaches on DMC3D, outperforming the next best method by 4.5% in total 3D PCK accuracy.

1. Introduction

Human pose tracking, also known as motion capture (MoCap), has been studied for decades and is still a very active and challenging research topic. MoCap is widely used in industries such as gaming, virtual/augmented reality, filmmaking and computer graphics animation, among others, as a means to provide body (and/or facial) motion data for virtual character animation, humanoid robot motion control, computer interaction, and more. To date, a specialized computer vision and marker-based MoCap technique, called Optical Motion Capture [], constitutes the gold standard for accurate and robust motion capture []. Optical MoCap solutions [,,] employ multiple optical sensors and passive or active markers (passive markers are coated with retro-reflective material to reflect light, while active markers are powered to emit it; for passive marker-based MoCap systems, IR emitters are also used to cast IR light on the markers) placed on the body of the subject to be captured. The 3D positions of the markers are extracted by intersecting the projections of two or more spatio-temporally aligned optical sensors. These solutions precisely capture the body movements, i.e., the body joint 3D positions and orientations per frame, at high frequencies (ranging from 100 to 480 Hz).
In the last decade, professional optical MoCap technologies have seen rapid development due to high demand from industry and the strong presence of powerful game engines [,,], allowing for immediate and easy consumption of motion capture data. However, difficulties in using traditional optical MoCap solutions still exist. Purchasing a professional optical MoCap system is extremely expensive, while the equipment is cumbersome and sensitive. With respect to its setup, several steps should be carefully followed, ideally by a technical expert, to appropriately set up the required hardware/software and to rigidly install the optical MoCap cameras on walls or other static objects [,]. In addition, time-consuming and non-trivial post-processing is required for optical data cleaning and MoCap data production []. To this end, there still exists an imperative need for robust MoCap methods that overcome the aforementioned barriers.
In this paper, a low-cost, fast motion capture method is proposed, namely DeepMoCap, approaching marker-based optical MoCap by combining Infrared-Depth (IR-D) imaging, retro-reflective materials (similar to passive marker usage) and fully convolutional neural networks (FCN). In particular, DeepMoCap deals with single-person, marker-based motion capture using a set of retro-reflective straps and patches (reflector-set: a set of retro-reflective straps and patches, called reflectors for the sake of simplicity) made from off-the-shelf materials (retro-reflective tape) and relying on the feed of multiple spatio-temporally aligned IR-D sensors. Placing reflectors on and IR-D sensors around the subject, the body movements are fully captured, overcoming one-side view limitations such as partial occlusions or corrupted image data. The rationale behind using reflectors is the exploitation of the intense reflections they produce in the IR streams of the IR-D sensors [,], enabling their detection on the depth images. FCNs [], which avoid computationally expensive fully connected layers, are applied to the multi-view IR-D data, resulting in reflector 2D localization and labeling. Spatially mapping and aligning the detected 2D points to 3D Cartesian coordinates with the use of depth data and intrinsic and extrinsic IR-D camera parameters enables frame-based 3D optical data extraction. Finally, the subject’s motion is captured by fitting an articulated template model to the sequentially extracted 3D optical data.
The main contributions of the proposed method are summarized as follows:
  • A low-cost, robust and fast optical motion capture framework is introduced, using multiple IR-D sensors and retro-reflectors. Contrary to the gold-standard marker-based solutions, the proposed setup is flexible and simple, the required equipment is low-cost, the 3D optical data are automatically labeled and the motion capture is immediate, without the need for post-processing.
  • To the best of our knowledge, DeepMoCap is the first approach that employs fully convolutional neural networks for automatic 3D optical data localization and identification based on IR-D imaging. This process, denoted as “2D reflector-set estimation”, replaces the manual marker labeling and tracking tasks required in traditional optical MoCap.
  • The convolutional pose machines (CPM) architecture proposed in [] has been extended, inserting the notion of time by adding a second 3D optical flow input stream and using 2D Vector Fields [] in a temporal manner.
  • A pair of datasets consisting of (i) multi-view colorized depth and 3D optical flow annotated images and (ii) spatio-temporally aligned multi-view depthmaps along with Kinect skeleton, inertial and ground truth MoCap [] data, have been created and made publicly available (https://vcl.iti.gr/deepmocap/dataset).
The remainder of this paper is organized as follows: Section 2 overviews related work; Section 3 explains in detail the proposed method for 2D reflector-set estimation and motion capture; Section 4 presents the published datasets; Section 5 gives and describes the experimental frameworks and results; finally, Section 6 concludes the paper and discusses future work.

3. Proposed Motion Capture Method

DeepMoCap constitutes an online, close to real-time marker-based approach that consumes multi-view IR-D data and results in single-person 3D motion capture. The pipeline of the proposed method, depicted in Figure 1, is summarized as follows:
Figure 1. DeepMoCap pipeline overview. After the placement of the reflectors on the subject’s body (1), spatio-temporally aligned IR-D streams are acquired and processed (2) to feed the FCN with colorized depth and 3D optical flow images (3). The FCN outcome, i.e., the multi-view 2D reflector-set estimates, is fused to extract the 3D optical data (4) and, finally, yield the subject’s 3D pose for motion capture (5).
  • A set of retro-reflective straps and patches is placed on the subject’s body.
  • Placing multiple calibrated and synchronized IR-D sensors around the subject to fully capture the body movements, IR-D raw data are acquired and processed, giving multi-view pairs of colorized depth and 3D optical flow.
  • Each pair is fed to a FCN model, resulting in 2D reflector-set estimation per view.
  • The reflector-set estimates are spatially mapped in 3D space using the depth data and the intrinsic and extrinsic calibration camera parameters. The resulting 3D point sets are fused, yielding the 3D optical data.
  • A template-based articulated structure is registered and fitted to the subject’s body. 3D motion capture is achieved by applying forward kinematics to this structure based on the extracted 3D optical data.
The pipeline steps are accordingly described in the following sections.

3.1. Reflector-Set Placement

The reflector-set placement has been designed to provide robust and highly informative motion capture data, i.e., to capture a large number of degrees of freedom (DoFs). The selected placement is shown in Figure 2, consisting of a set of 26 reflectors $R_i \in \{R_1, \ldots, R_{26}\}$, 16 patches and 10 straps, enabling motion capture by fitting an articulated body structure of 40 DoFs. Both straps and patches are used because the straps are 360°-visible on cylindrical body parts (i.e., limbs), while patches are used on the body parts where strap placement is not feasible, i.e., torso, head and hands. To highlight the distinction between the front and the back side of the body, the reflective patches are not symmetrically placed. On the front side, two reflector patches are placed on the head, two on the chest and one on the spine middle, while on the back side, one is placed on the head, one on the back and one on the spine middle. The retro-reflective material used to create the reflector-set is the off-the-shelf 2-inch reflective tape used in protective clothing []. Following the matching between the reflectors and the body parts of the subject as depicted in Figure 2, the reflector-set placement is a fast procedure (it lasts approximately 2 min), since the sticky straps and patches are effortlessly placed and do not require high placement precision (it is enough to place them approximately at the body part locations shown in Figure 2).
Figure 2. Proposed reflector-set placement. Reflective straps (orange) and patches (blue) placement on the subject’s body.

3.2. Raw Data Acquisition and Processing

After the reflector-set placement, the subject is ready to be captured. Let us consider the use of $N$ IR-D devices; thus, $N$ is also the number of views $v \in \{1, \ldots, N\}$. Using the multi-Kinect for Xbox One capturing setup proposed in [], spatio-temporally aligned multi-view IR-D data are acquired. All reflector regions have distinguishable pixel values on the IR images $I^v_{IR}$ (Figure 3a); thus, applying binary hard-thresholding, the binary mask $I^v_{IRm}$ of the reflectors is extracted (Figure 3b). The corresponding regions on the raw depth images $I^v_{D}$ (Figure 3c) have zero values due to the retro-reflections.
Figure 3. Raw depth and infrared data.
The multi-view IR-D raw data are processed before feeding the FCN. On the one hand, the IR-D frames are jointly processed in order to compute the 3D optical flow $I^v_{F}$ of the body movements. In the present work, the primal-dual algorithm proposed in [] is considered due to its demonstrated efficiency on relevant computer vision tasks, such as interaction-based RGB-D object recognition [] and 3D action recognition []. In particular, the 3D motion vectors between two pairs of IR-D images and their magnitude are computed. The 3D flow and its magnitude are then colorized by normalizing the values of each axis and transforming the 3D motion vectors into a three-channel image. On the other hand, the depth images $I^v_{D}$ are colorized by applying a JET color map conversion. Finally, the reflector mask $I^v_{IRm}$ is subtracted from the colorized depth images, resulting in colorized depth with black reflector regions, $I^v_{CD}$, facilitating the detection of the reflectors. The colorization step for both streams is required in order to allow the usage of the proposed FCN, initialized by the first 10 layers of VGG-19 []. An example of the processed multi-view outcome is shown in Figure 4.
Figure 4. Multiview two-stream input. (Up) Colorized depth, mask subtracted. (Down) Colorized 3D optical flow.
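As a concrete illustration of this pre-processing step, the following minimal Python/OpenCV sketch thresholds the IR frame to obtain the reflector mask, JET-colorizes the raw depth and blacks out the reflector regions. The threshold level and normalization are illustrative assumptions, not the authors' exact implementation.

```python
import cv2
import numpy as np

def preprocess_depth_view(ir_frame, depth_frame, ir_threshold=40000):
    """Illustrative per-view pre-processing (Section 3.2).

    ir_frame:    16-bit IR image; retro-reflectors appear as saturated regions.
    depth_frame: 16-bit raw depth image (zero at reflector pixels).
    ir_threshold is a hypothetical level; the paper applies binary
    hard-thresholding without the exact value being given here.
    """
    # Binary reflector mask I_IRm from the IR stream.
    reflector_mask = (ir_frame >= ir_threshold).astype(np.uint8)

    # Colorize raw depth with a JET color map (normalize to 8 bit first).
    depth_8bit = cv2.normalize(depth_frame, None, 0, 255,
                               cv2.NORM_MINMAX).astype(np.uint8)
    colorized = cv2.applyColorMap(depth_8bit, cv2.COLORMAP_JET)

    # Subtract the mask: reflector regions become black in I_CD.
    colorized[reflector_mask == 1] = 0
    return colorized, reflector_mask
```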

3.3. 2D Reflector-Set Estimation Using FCN

The major challenge of the proposed method is the efficient localization and identification of the reflectors placed on the subject’s body. In the recent literature on 2D localization in RGB images, deep neural networks have proven remarkably effective in complex tasks such as articulated 2D pose estimation and are, therefore, considered appropriate for the present challenge. To this end, a deep learning approach is introduced, extending the multi-stage CPM architecture in order to localize and identify the reflectors on the body. In particular, a multi-stage fully convolutional network is trained to directly operate on intermediate confidence maps and optical flow 2D vector fields, instead of Part Affinity Fields (PAF) between 2D keypoints, implicitly learning image-dependent spatial models of the reflector locations among sequential frames.
Despite the similarities between 2D pose and reflector-set estimation, both being 2D localization problems on color images, there exist noteworthy differences. Cao et al. [] efficiently address the problem of 2D pose estimation in large, “in-the-wild” RGB datasets [], resulting in accurate estimates on a variety of data showing multiple people in different environmental and lighting conditions. In contrast, the reflector-set estimation is applied in more “controlled” conditions; (i) the input depth data lie within a narrow range, (ii) the reflector regions have clearly distinguishable pixel values on IR images, (iii) there is only one subject to be captured and, in most cases, (iv) the subject is acting at the center of the scene. On the other hand, the reflector-set estimation task is more complicated with respect to (i) the estimation of a larger number of reflectors in comparison with the keypoints detected by CPM approaches and (ii) the fact that the reflector patches are visible from one side only. For instance, the reflectors $R_{15}$ and $R_{14}$ are both placed on the right shoulder, but on the front and the back side of the body respectively, while the right shoulder keypoint in CPM 2D poses is unique for all views.
The overall FCN method is illustrated in Figure 5. A pair of images, the colorized depth $I^v_{CD}$ and the corresponding 3D optical flow $I^v_{F}$, are given as input. A FCN simultaneously predicts a set of 2D confidence maps $S$ of the reflector locations and a set of 2D vector fields $L$; the latter corresponds to the optical flow fields (OFFs) from the previous frame to the next one, encoding the temporal correlation between sequential frames. Both sets contain $R = 26$ elements, one per reflector $R_i \in \{R_1, \ldots, R_{26}\}$: the set $S = (S_{R_1}, S_{R_2}, \ldots, S_{R_{26}})$, $S_{R_i} \in \mathbb{R}^{w \times h}$, where $w$ and $h$ are the width and height of the input images respectively, and the set $L = (L_{R_1}, L_{R_2}, \ldots, L_{R_{26}})$, where $L_{R_i} \in \mathbb{R}^{2 \times w \times h}$. Finally, a greedy inference step is applied on the extracted confidence maps and OFFs, resulting in 2D reflector-set estimation.
Figure 5. Overall 2D reflector-set estimation from confidence maps and optical flow 2D vector fields. The 2D location of reflector $R_{13}$ is estimated taking into account its predicted heat map and the optical flow between the predictions $r_{13,f-1}$ and $r_{13,f}$.

3.3.1. FCN Architecture

The FCN architecture, shown in Figure 6, introduces a new two-stream, two-branch, multi-stage CPM-based approach which consumes colorized depth $I^v_{CD}$ and 3D optical flow $I^v_{F}$ images. Both input streams are separately processed by a convolutional network of 10 layers (the first 10 layers of VGG-19 []), generating two sets of feature maps, $F_D$ and $F_{OF}$, correspondingly. Subsequently, an early-stage fusion takes place, concatenating the feature maps of the two streams, $F = F_{OF} \oplus F_D$. Let $t$ denote the stage of the network. At the first stage ($t = 1$), the fused feature set $F$ is given to both branches, producing confidence maps, $S^1 = \rho^1(F)$, and 2D vector fields, $L^1 = \varphi^1(F)$, where $\rho^t$ and $\varphi^t$ denote the inference of each FCN branch at stage $t$. For all the subsequent $T-1$ stages, where $T$ denotes the total number of stages, the predictions from both branches in the previous stage, along with the feature set $F$, are fused and used to produce refined predictions based on:
$$S^t = \rho^t\left(F, S^{t-1}, L^{t-1}\right), \quad \forall t \in \{2, \ldots, T\}$$
$$L^t = \varphi^t\left(F, S^{t-1}, L^{t-1}\right), \quad \forall t \in \{2, \ldots, T\}$$
where the number of stages $T$ is equal to 6, experimentally set by evaluating the results on the validation dataset for $T = 3$ and $T = 6$ stages, as proposed in [,], respectively. At the end of each stage, two $L_2$ loss functions, $L^t_S$ and $L^t_L$, between the predictions and the ground truth are applied to guide the network branches to predict confidence maps and OFFs, respectively. At stage $t$, for a 2D location $p = (x, y)$, $p \in \mathbb{R}^2$, the loss functions are given by:
$$L^t_S = \sum_{r=1}^{R} \sum_{p} \left\| S^t_r(p) - S^*_r(p) \right\|^2_2$$
$$L^t_L = \sum_{r=1}^{R} \sum_{p} \left\| L^t_r(p) - L^*_r(p) \right\|^2_2$$
Figure 6. DeepMoCap two-stream, multi-stage FCN architecture. The outcome of each stage $t \in \{2, \ldots, T\}$ and the feature set $F$ are fused and given as input to the next stage.
In that way, the vanishing gradient problem is addressed by the intermediate supervision at each stage, replenishing the gradient periodically. The overall loss function $L$ of the network is given by:
$$L = \sum_{t=1}^{T} \left( L^t_S + L^t_L \right)$$
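For clarity, the following PyTorch-style sketch outlines the two-stream, two-branch, multi-stage structure and the stage recursion described above. The backbone and stage blocks are simplified placeholders (the paper uses the first 10 VGG-19 layers and CPM-style stage blocks); channel counts and layer shapes are assumptions for illustration only.

```python
import torch
import torch.nn as nn

R = 26  # number of reflectors

def stage_block(in_ch, out_ch):
    # Placeholder stage head; the paper uses CPM/OpenPose-style stage blocks.
    return nn.Sequential(
        nn.Conv2d(in_ch, 128, 7, padding=3), nn.ReLU(inplace=True),
        nn.Conv2d(128, 128, 7, padding=3), nn.ReLU(inplace=True),
        nn.Conv2d(128, out_ch, 1))

class TwoStreamMultiStageFCN(nn.Module):
    def __init__(self, stages=6, feat_ch=128):
        super().__init__()
        # Two feature extractors (stand-ins for the first 10 VGG-19 layers).
        def backbone():
            return nn.Sequential(
                nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, feat_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.depth_stream = backbone()
        self.flow_stream = backbone()
        fused = 2 * feat_ch
        # Stage 1 sees F alone; later stages also see the previous S and L.
        self.S_branches = nn.ModuleList(
            [stage_block(fused, R)] +
            [stage_block(fused + R + 2 * R, R) for _ in range(stages - 1)])
        self.L_branches = nn.ModuleList(
            [stage_block(fused, 2 * R)] +
            [stage_block(fused + R + 2 * R, 2 * R) for _ in range(stages - 1)])

    def forward(self, colorized_depth, optical_flow):
        # Early fusion: concatenate the two feature sets, F = F_OF (+) F_D.
        F = torch.cat([self.depth_stream(colorized_depth),
                       self.flow_stream(optical_flow)], dim=1)
        S, L = self.S_branches[0](F), self.L_branches[0](F)
        outputs = [(S, L)]
        for rho, phi in zip(self.S_branches[1:], self.L_branches[1:]):
            x = torch.cat([F, S, L], dim=1)  # fuse F with previous predictions
            S, L = rho(x), phi(x)
            outputs.append((S, L))
        # Intermediate supervision: an L2 loss is applied to every stage output.
        return outputs
```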

3.3.2. Confidence Maps and Optical Flow Fields

$L^t_S$ and $L^t_L$ are evaluated during training by generating the ground truth confidence maps $S^*_{R_i}$ (Equation (4)) and vector fields $L^*_{R_i}$ (Equation (5)), respectively.
Confidence maps: Each confidence map is a 2D representation of the belief that a reflector occurs at each pixel location. The proposed method performs single-person motion capture; therefore, a single peak should exist in each confidence map. Let $x_{R_i,f} \in \mathbb{R}^2$ be the ground truth 2D location of the reflector $R_i$ on the image at frame $f$. For every 2D location $p \in \mathbb{R}^2$, the ground truth value of $S^*_{R_i,f}$ is given by:
$$S^*_{R_i,f}(p) = \exp\left( -\frac{\left\| p - x_{R_i,f} \right\|^2_2}{\sigma^2} \right)$$
where $\sigma$ controls the spread of the peak. At test time, non-maximum suppression is applied on the predicted confidence maps to localize the reflectors, assigning the confidence map peak value to each reflector prediction confidence, $E^S_{R_i,f}$.
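As an illustration, a minimal NumPy sketch of the ground-truth confidence map of Equation (4); the value of $\sigma$ used here is arbitrary.

```python
import numpy as np

def confidence_map(x_gt, width, height, sigma=7.0):
    """Ground-truth confidence map of Equation (4) for one reflector.

    x_gt:  (x, y) ground-truth 2D reflector location in pixels.
    sigma: spread of the peak (illustrative value).
    """
    xs, ys = np.meshgrid(np.arange(width), np.arange(height))
    d2 = (xs - x_gt[0]) ** 2 + (ys - x_gt[1]) ** 2
    return np.exp(-d2 / sigma ** 2)  # peak value 1 at the reflector location
```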
Optical Flow Fields: In this work, the feature representation of 2D vector fields proposed in [] is used in a temporal manner. Preserving both 2D location and orientation information across a region, a 2D vector field for each reflector is defined by connecting the reflector 2D locations between frames $f-1$ and $f$. Let $x_{R_i,f-1}, x_{R_i,f} \in \mathbb{R}^2$ be the ground truth 2D locations of the reflector $R_i$ at frames $f-1$ and $f$, respectively. The ground truth value for every 2D location $p \in \mathbb{R}^2$ of $L^*_{R_i,f}$ is given by:
$$L^*_{R_i,f}(p) = \begin{cases} v, & \text{if } p \text{ on optical flow field} \\ 0, & \text{otherwise} \end{cases} \qquad v = \frac{x_{R_i,f} - x_{R_i,f-1}}{\left\| x_{R_i,f} - x_{R_i,f-1} \right\|_2}$$
The set of points that belong to the optical flow field includes the points within a distance threshold from the line segment between the reflector 2D locations, given by:
$$0 \le v \cdot \left( p - x_{R_i,f-1} \right) \le d_{R_i} \quad \text{and} \quad \left| v_{\perp} \cdot \left( p - x_{R_i,f-1} \right) \right| \le \sigma_{R_i}$$
where $\sigma_{R_i}$ is the width of the field in pixels, $d_{R_i}$ is the Euclidean distance between the 2D locations of reflector $R_i$ in sequential frames in pixels, i.e., $d_{R_i} = \left\| x_{R_i,f} - x_{R_i,f-1} \right\|_2$, and $v_{\perp}$ is a vector perpendicular to $v$. As an example, the 2D optical flow field for the reflector $R_{13}$ is illustrated in Figure 5.
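The ground-truth field of Equations (5) and (6) can be generated, for example, as in the following NumPy sketch; the field width is an illustrative value.

```python
import numpy as np

def optical_flow_field(x_prev, x_curr, width, height, field_width=4.0):
    """Ground-truth optical flow field (Equations (5) and (6)) for one reflector.

    x_prev, x_curr: reflector 2D locations at frames f-1 and f (pixels).
    field_width:    sigma_Ri, the width of the field in pixels (illustrative).
    Returns a (2, height, width) array holding the unit vector v on the field,
    and zero elsewhere.
    """
    v = np.asarray(x_curr, dtype=np.float64) - np.asarray(x_prev, dtype=np.float64)
    d = np.linalg.norm(v)
    field = np.zeros((2, height, width))
    if d < 1e-6:
        return field
    v /= d
    v_perp = np.array([-v[1], v[0]])
    xs, ys = np.meshgrid(np.arange(width), np.arange(height))
    rel = np.stack([xs - x_prev[0], ys - x_prev[1]], axis=-1)
    along = rel @ v          # projection along the segment
    across = rel @ v_perp    # signed distance from the segment
    on_field = (along >= 0) & (along <= d) & (np.abs(across) <= field_width)
    field[0][on_field] = v[0]
    field[1][on_field] = v[1]
    return field
```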
During inference, the optical flow, encoding the temporal correlations, is measured by computing the line integral over the corresponding optical flow field along the line segment connecting the candidate reflector locations between two sequential frames. Let $r_{R_i,f}$ and $r_{R_i,f-1}$ be the predicted locations for the reflector $R_i$ at the current frame $f$ and the previous one $f-1$, correspondingly. The predicted optical flow field $L_{R_i,f}$ is sampled along the line segment to measure the temporal correlation confidence between the predicted reflector positions in time by:
$$E^L_{R_i,f} = \int_{0}^{1} L_{R_i,f}\left( p(u) \right) \cdot \frac{r_{R_i,f} - r_{R_i,f-1}}{\left\| r_{R_i,f} - r_{R_i,f-1} \right\|_2} \, du$$
where $p(u)$ interpolates the reflector positions $r_{R_i,f}$ and $r_{R_i,f-1}$ between sequential frames, as given by:
$$p(u) = (1-u) \cdot r_{R_i,f-1} + u \cdot r_{R_i,f}$$
In other words, the integral is approximated by sampling and summing uniformly spaced values of u.
Greedy Inference: Finally, a greedy inference step is introduced, taking into consideration the temporal correlations between temporally sequential 2D reflector estimates. The confidence values $E^S_{R_i,f}$ and $E^L_{R_i,f}$, given by the confidence maps and the optical flow fields correspondingly, are summed in a weighted manner in order to give the fused confidence $E_{R_i,f}$ for each reflector estimate. In detail, the major component of $E_{R_i,f}$ is $E^S_{R_i,f}$; however, we weight the confidence $E^L_{R_i,f}$ by a factor $w^L_{R_i,f} = \left(1 - E^S_{R_i,f}\right)$ that increases when $E^S_{R_i,f}$ decreases:
$$E_{R_i,f} = E^S_{R_i,f} + w^L_{R_i,f} \cdot E^L_{R_i,f}$$
In that way, when a confidence map prediction results in a low confidence estimate $E^S_{R_i,f}$, the total confidence $E_{R_i,f}$ is strongly affected by the optical flow confidence, if high. The final outcome of the 2D reflector-set estimation is given by applying non-maximum suppression on the reflector predictions based on the total confidence $E_{R_i,f}$.
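The sampled approximation of the line integral and the weighted fusion described above can be sketched as follows; the number of samples and the nearest-pixel lookup are simplifications.

```python
import numpy as np

def flow_confidence(L_pred, r_prev, r_curr, samples=10):
    """Approximate the line integral above by sampling uniformly spaced u values.

    L_pred: predicted optical flow field for one reflector, shape (2, H, W).
    r_prev, r_curr: candidate reflector locations at frames f-1 and f (pixels).
    """
    r_prev, r_curr = np.asarray(r_prev, float), np.asarray(r_curr, float)
    seg = r_curr - r_prev
    norm = np.linalg.norm(seg)
    if norm < 1e-6:
        return 0.0
    direction = seg / norm
    total = 0.0
    for u in np.linspace(0.0, 1.0, samples):
        p = (1.0 - u) * r_prev + u * r_curr          # interpolated position p(u)
        x = min(max(int(round(float(p[0]))), 0), L_pred.shape[2] - 1)
        y = min(max(int(round(float(p[1]))), 0), L_pred.shape[1] - 1)
        total += L_pred[:, y, x] @ direction          # dot product with unit direction
    return total / samples

def fused_confidence(E_S, E_L):
    """Greedy-inference fusion: E = E_S + (1 - E_S) * E_L."""
    return E_S + (1.0 - E_S) * E_L
```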

3.4. 3D Optical Data

3.4.1. 2D-to-3D Spatial Mapping

Given the reflector-set 2D locations on the depth image $I^v_{D}$, a 3D spatial mapping technique is applied to precisely extract the corresponding 3D positions. Considering that the reflector locations are given exclusively when the reflectors are clearly visible, a reflector estimate is considered valid only if it belongs to a region of more than $b_{min}$ black pixels in $I^v_{CD}$; otherwise it is dropped. The minimum accepted region size was set to $b_{min} = 5$ pixels, since the same region size was used for the annotation of the data.
In Figure 7, two of the potential cases with respect to the reflector spatial mapping are shown. In the first case (Figure 7a), the simplest and most common one, $E_0 \subset I^v_{CD}$ is the detected region for a reflector $R_i \in \{R_1, \ldots, R_{26}\}$. Retrieving a pixel set $P^v_{R_i}$ of the $E_0$ region contour in $I^v_{CD}$ (red pixels in Figure 7a) and mapping its points to $I^v_{D}$, the corresponding raw depth values of $P^v_{R_i}$ are given. Using only the non-zero depth values of $P^v_{R_i}$, the median value $d_{R_i}$ is considered the distance of the reflector $R_i$ from the sensor view $v$.
Figure 7. Contour detection (red pixels) of the reflector regions for depth estimation and 3D mapping. (a) Contour of single-reflector region, $E_0$. (b) Contour of multi-reflector region, $E_1$.
The second case is illustrated in Figure 7b, where two or more (although not usual) reflectors belong to the same region $E_1$. Examining the overlap between the reflector areas, i.e., when $n > 1$ reflectors share the same black region, the pixels of the contour are clustered into $n$ clusters, based on the 2D pixel coordinates and the depth values, and then mapped to the corresponding reflectors. Subsequently, $d_{R_i}$ is determined for each reflector $R_i$ based on the clustered pixel set.
After one-to-one mapping between reflectors and regions, the central 2D point $p_{R_i}$ of each region $E_{R_i}$ is spatially mapped to 3D coordinates using the depth distance value $d_{R_i}$ and the intrinsic parameters of the corresponding IR-D sensor, giving the 3D position $P^v_{R_i}$ of the reflector $R_i$ from viewpoint $v$.
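A standard pinhole back-projection, as commonly used with Kinect-style intrinsics, can realize this 2D-to-3D mapping; the sketch below assumes depth in millimetres and is illustrative rather than the authors' exact code.

```python
import numpy as np

def backproject(p_2d, depth_mm, fx, fy, cx, cy):
    """Map a reflector's central pixel and depth to 3D camera coordinates.

    p_2d:     (u, v) central pixel of the reflector region.
    depth_mm: median non-zero raw depth d_Ri of the region contour, in millimetres.
    fx, fy, cx, cy: intrinsic parameters of the IR-D sensor.
    """
    u, v = p_2d
    z = depth_mm / 1000.0                      # metres
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])
```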

3.4.2. 3D Point Sets Fusion

Using the extrinsic calibration parameters of each sensor, the extracted 3D positions are spatially aligned to a global 3D coordinate system, as shown in Figure 8. For the reflective patches, which are visible to the sensors from one side only, the same retro-reflective region is captured by all IR-D sensors and, therefore, the 3D mapping yields slightly different estimates. To this end, the 3D points $P^v_{R_i}$ of a patch reflector $R_i$ for all views $v$ are fused into a single 3D point $P_{R_i}$, taking into account the confidence value $E^v_{R_i}$ of the FCN reflector estimation. The 3D point $P_{R_i}$ is estimated as the weighted central position of the 3D points from all views by:
$$w^v_{R_i} = \frac{E^v_{R_i}}{\sum_{v=1}^{N} E^v_{R_i}}, \qquad P_{R_i} = \sum_{v=1}^{N} w^v_{R_i} \, P^v_{R_i}$$
Figure 8. Multiple 2D reflector-set estimates (left) are spatially mapped in a global 3D Cartesian coordinate system, resulting in 3D optical data extraction (right).
For the reflective straps, since different parts of the reflectors are visible to each sensor, the 3D points are estimated at different 3D locations around the part of the limb where the strap is placed. In this case, the desired 3D point is approximately located at the center of the “circle” on which these 3D points lie, extracted by the method presented in Appendix A.
The full set of the global retro-reflector 3D positions per multi-view group frame $f$, i.e., the extracted 3D optical data, is denoted as $P_f$, while a total confidence value $C^{R_i}_f$ for each reflector $R_i$ is considered as the average value of the reflector estimation confidence $E^v_{R_i}$ for each view $v \in \{1, \ldots, N\}$, i.e., $C^{R_i}_f = \frac{1}{N} \sum_{v=1}^{N} E^v_{R_i}$. To refine the quality and stability of the 3D point estimates in time, Kalman filtering [] is applied to the extracted 3D optical data.
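A minimal sketch of the per-patch multi-view fusion (confidence-weighted centroid, as reconstructed above) and of the per-reflector total confidence; the array shapes are assumptions.

```python
import numpy as np

def fuse_patch_views(points, confidences):
    """Confidence-weighted fusion of per-view 3D estimates for a patch reflector.

    points:      (N, 3) array, one globally aligned 3D estimate P^v_Ri per view.
    confidences: (N,) FCN confidences E^v_Ri for those estimates.
    Returns the fused 3D point P_Ri and the total confidence C_f^Ri (mean of E^v_Ri).
    """
    points = np.asarray(points, dtype=float)
    conf = np.asarray(confidences, dtype=float)
    weights = conf / conf.sum()                     # weights w^v_Ri
    fused_point = (weights[:, None] * points).sum(axis=0)
    total_confidence = conf.mean()
    return fused_point, total_confidence
```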

3.5. Motion Capture

The final stage of the proposed method targets 3D motion capture based on the 3D optical data. At this point, the extracted 3D optical data $P_f$ are mapped to a relative motion representation consisting of joint 3D positions and orientations. Similarly to [,], a body structure calibration technique is proposed, adapting an articulated humanoid template model to the real body structure of the subject. Subsequently, the calibrated articulated body is jointly moved by applying forward kinematics.
The proposed articulated body structure consists of 20 joints, $j \in J$, including $D_{DoF} = 40$ DoFs. It contains hierarchical joint levels $L_i \in L$, $i \in \{0, \ldots, 6\}$, and the bones of the structure are registered to particular reflectors. To this end, a reflector subset $RS_j \subset \{R_1, \ldots, R_{26}\}$, $j \in J$, moves the body joint $j \in J$ according to the body joint hierarchy. The correspondence between the joints and the reflectors is depicted in Figure 9a, while the reflector-to-body-part mapping is illustrated in Figure 9b.
Figure 9. (a) Template model joints, hierarchical level, DoFs and correspondence to the reflector subsets. (b) Reflector mapping to body structure body parts.
Initially, the template is coarsely scaled based on the first batch of optical data frames $P_f$. Then, given the 3D positions $P_j$, $j \in J$, per frame, a per-bone alignment process is performed to precisely fit the body parts of the template to the subject’s real body lengths. This step is sequentially performed on the bones following the joint hierarchy levels, from $L_0$ to $L_6$. More specifically, after fitting the template to the optical data, the body root 3D position $P_{HIPS}$ with hierarchical level $L_0$ is given, enabling the sequential estimation of the rest of the bone lengths. Based on the assumption that the bone lengths are constant (rigid bones) and using exclusively high-confidence ($C^{R_i}_f > 0.6$, experimentally set) optical data $P_f$, a random particle generation step is applied per level in $L-1$ phases. More specifically, a set $G_j$ of $G = 500$ particles (experimentally set) is generated around the $j$-joint location $P_j$ given by the spatially aligned template placement using $P_f$. After the particle generation, the $G_j$ particles relatively follow the optical data by applying forward kinematics. The particle $g_j \in G_j$ that moves most rigidly between $P_{j_l}$ and $P_{j_{l+1}}$ is considered the closest one to the real relative position of joint $j_{l+1}$. The objective function that estimates this particle is given by:
$$D_0 = \left\| P_{j_l,0} - P_{g_{j_{l+1}},0} \right\|_2 + \left\| P_{j_{l+1},0} - P_{g_{j_{l+1}},0} \right\|_2$$
$$\arg\min_{g_{j_{l+1}}} D\left(g_{j_{l+1}}\right) = \frac{1}{F} \sum_{f=1}^{F} \left| D_0 - \left( \left\| P_{j_l,f} - P_{g_{j_{l+1}},f} \right\|_2 + \left\| P_{j_{l+1},f} - P_{g_{j_{l+1}},f} \right\|_2 \right) \right|$$
where $D_0$ denotes the sum of the 3D Euclidean distances, at the initial frame of the $(l)-(l+1)$ level alignment, between the 3D position of the particle $g_{j_{l+1}}$ and the joints $j_l$ and $j_{l+1}$, as given by the latest template fitting; $P_{g_{j_{l+1}},f}$ denotes the 3D position of the particle $g_{j_{l+1}}$ at frame $f \in \{1, \ldots, F\}$, where $F$ is the total number of frames in the window used to align a body part of level $l+1$. In a similar fashion, the next-level joints and the corresponding bones are calibrated. Angular body part movements, especially elbow and knee flexions, enable faster and more efficient convergence of the per-bone alignment process.

4. Evaluation Datasets

Two public datasets have been created (https://vcl.iti.gr/deepmocap/dataset) consisting of subjects wearing retro-reflectors on their bodies. These datasets are exploited for: (i) motion capture evaluation in comparison with recent MoCap methods and ground truth and (ii) 2D reflector-set estimation FCN training and testing.

4.1. DMC3D Dataset

The DMC3D dataset consists of multi-view depth and skeleton data as well as inertial and ground truth motion capture data. Specifically, 3 Kinect for Xbox One sensors were used to capture the depth and Kinect skeleton data along with 9 XSens MT [] inertial measurement units (IMU) to enable the comparison between the proposed method and inertial MoCap approaches based on []. Furthermore, a PhaseSpace Impulse X2 [] solution was used to capture ground truth MoCap data. PhaseSpace Impulse X2 is an optical marker-based MoCap system considered appropriate for the scope of this dataset due to the usage of active instead of passive retro-reflective markers that would have interfered with the retro-reflectors. The preparation of the DMC3D dataset required the spatio-temporal alignment of the modalities (Kinect, PhaseSpace, XSens MTs). The setup [] used for the Kinect recordings provides spatio-temporally aligned depth and skeleton frames. For the Kinect – IMU synchronization, a global clock was used to record depth and inertial data with common timestamps. Additionally, given the timestamps and using the robust recording frequency of PhaseSpace Impulse X2 as reference clock, the spatio-temporal alignment of the ground truth data was manually achieved.
With respect to the amount and the variety of data, 10 subjects, 2 females and 8 males, wearing retro-reflectors, inertial sensors and active markers (LEDs) on the body, were recorded performing 15 physical exercises, presented in Table 1. The full dataset contains more than $80 \times 10^3$ three-view depth and skeleton frames, the extrinsic calibration parameters and the corresponding inertial and MoCap data.
Table 1. Data captured per subject in the DMC3D dataset.

4.2. DMC2.5D Dataset

A second set comprising 2.5D data (DMC2.5D dataset) was captured in order to train and test the proposed FCN. Using the recorded IR-D and MoCap data, colorized depth and 3D optical flow data pairs per view were created, as described in Section 3.2. The samples were randomly selected from 8 of the 10 subjects, excluding 2 of them for the evaluation of the MoCap outcome. More specifically, $25 \times 10^3$ single-view pair samples were annotated with over $300 \times 10^3$ total keypoints (i.e., reflector 2D locations of the current and previous frames on the image), aiming to cover a variety of poses and movements in the scene. $20 \times 10^3$, $3 \times 10^3$ and $2 \times 10^3$ samples were used for training, validating and testing the FCN model, respectively. The annotation was realized by applying image processing and 3D vision techniques on the IR-D and MoCap data. In particular, applying blob detection on the IR binary image $I^v_{IRm}$ yielded the 2D locations of the reflectors, while the corresponding 3D positions were then estimated by applying 3D spatial mapping (Section 3.4.1). Finally, the reflectors were labeled by comparing the Euclidean 3D distance per frame between the extracted 3D positions and the joint 3D positions of the ground truth data. However, the automatic labeling was erroneous in cases where the reflector regions were merged (Figure 10) or the poses were complex. The complex pose issues occurred due to the positional offset between the reflector and the joint 3D positions, confusing the labeling process. Thus, further processing was required in order to manually refine the dataset.
Figure 10. Reflector annotation using blob tracking on the IR binary mask, visualized here on the $I^v_{IR}$ IR data for the sake of better understanding. Erroneous estimates occur when reflector regions are merged.

5. Evaluation

For the evaluation of the proposed method, two types of experiments were conducted, presented and discussed in this section. The first experiment concerns the evaluation of the 2D reflector-set estimation on the DMC2.5D dataset, while, in the second one, the motion capture results are compared against robust MoCap solutions and ground truth on the DMC3D dataset. At first, FCN architectures are evaluated, highlighting the superior performance of the proposed FCN model. Accurate 2D reflector-set estimation reduces the errors in 3D optical data extraction and, consequently, in the final motion capture outcome. The proposed FCN model is then applied to the DMC3D testing dataset, extracting 3D optical data from multi-view sequences which are accordingly used for motion capture.

5.1. 2D Reflector-Set Estimation on DMC2.5D

5.1.1. FCN Implementation

With respect to the training of the proposed FCN, random data augmentation takes place to increase the variation of the input by scaling and rotating the data. For each batch of frames fed into the network per iteration, the transformation is consistent; thus, a lower batch size results in higher variation between iterations. Both input images are randomly scaled by $f_s \in [0.6, 1.1]$, randomly rotated by $f_\theta \in [-40°, 40°]$ and flipped with binary randomness. Finally, all images are cropped to $368 \times 368$ resolution, also placing the subject’s body at the image center. Regarding the method parameterization, the number of stages is $T = 6$, using stochastic gradient descent with momentum $\alpha_m = 0.9$ and weight decay $w_d = 5 \times 10^{-4}$ to optimize the learning process. The batch size is equal to 10, while the initial learning rate is $l_r = 2 \times 10^{-5}$ and drops by a factor $f_g = 3.33 \times 10^{-1}$ every $30 \times 10^3$ iterations.
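For illustration, a possible OpenCV realization of the consistent per-pair augmentation described above; the cropping strategy here is a simplification of the subject-centered 368 × 368 crop.

```python
import cv2
import numpy as np

def augment_pair(depth_img, flow_img, rng=np.random):
    """Consistent random augmentation of a colorized-depth / 3D-flow pair.

    The same scale, rotation and flip are applied to both images of the pair;
    keypoint annotations would need the same transform (omitted in this sketch).
    """
    h, w = depth_img.shape[:2]
    scale = rng.uniform(0.6, 1.1)
    angle = rng.uniform(-40.0, 40.0)
    flip = rng.rand() < 0.5

    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    out = []
    for img in (depth_img, flow_img):
        warped = cv2.warpAffine(img, M, (w, h))
        if flip:
            warped = cv2.flip(warped, 1)
        # Bring to the 368x368 network input (simplified: plain resize here,
        # instead of a crop centered on the subject).
        out.append(cv2.resize(warped, (368, 368)))
    return out[0], out[1]
```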

5.1.2. Experimental Framework

The introduced FCN architecture approaches 2D reflector-set estimation, a similar, yet different task in comparison with 2D pose estimation. Aiming to evaluate the present FCN approach and the introduced extension with respect to the temporal correlations between the reflector 2D positions from frame to frame, existing methods for keypoint estimation are adapted and trained to address the present challenge. In detail, the methods by Wei et al. [] and Cao et al. [], included in the OpenPose (https://github.com/CMU-Perceptual-Computing-Lab/openpose) library, have been adapted and trained with the DMC2.5D dataset.
For the adaptation, the body parts have been replaced by the reflectors, while it is worth noting that the Part Affinity Fields (PAFs) have been altered due to the difference in the reflector sub-set placement between the front and the back side of the body. The adapted associations between the reflectors are illustrated in Figure 11. Moreover, since the proposed method has been developed for single-person motion capture, even though the PAF branch contributes to the final feature space for the confidence map prediction, its output is not taken into account for the final reflector-set estimation. Finally, a two-branch, two-stream colorized depth and 3D optical flow approach similar to [] is evaluated ([] + 3D OF), showing remarkable results.
Figure 11. The red arrows illustrate the directional associations between the reflectors to adapt the Part Affinity Fields, as proposed in []. The orange and blue colored labels indicate the reflective straps and patches, respectively.
With respect to the evaluation metrics, the proposed FCN is evaluated on the DMC2.5D dataset measuring Average Precision (AP) per reflector and mean Average Precision (mAP) for the entire set, based on Percentage of Correct Keypoints (PCK) [] thresholds, i.e., a prediction is considered true if it falls within a pixel region around the ground-truth keypoint. This region is defined by multiplying the width and height of the bounding box that contains the subject by a factor α that controls the relative threshold for correctness consideration.
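One possible reading of this PCK correctness test, sketched in NumPy; the exact boundary convention of the region is not fully specified here, so the check below is illustrative.

```python
import numpy as np

def pck_correct(pred, gt, bbox_w, bbox_h, alpha=0.05):
    """PCK correctness test used for the AP/mAP evaluation (illustrative).

    A prediction counts as correct if it falls within a region around the
    ground-truth keypoint whose size is the subject's bounding box width and
    height scaled by alpha.
    """
    thr = alpha * np.array([bbox_w, bbox_h], dtype=float)
    return bool(np.all(np.abs(np.asarray(pred, float) - np.asarray(gt, float)) <= thr))
```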
Setting $\alpha = 0.05$, the validation set of the DMC2.5D dataset was used to indicate the optimum minimum confidence threshold $c_{min}$ for the highest mAP per method, aiming at a fair comparison between them. $c_{min}$ corresponds to the minimum threshold of confidence for a reflector prediction to be considered as valid, i.e., $E_{R_i,f} > c_{min}$. The results are presented in Figure 12, showing the method mAP against the confidence threshold. Maximum mAP on the validation set was achieved for $c_{min} = 0.4$ for all methods, thus considered optimum for the experiments on the DMC2.5D testing set.
Figure 12. Mean Average Precision based on Percentage of Correct Keypoints thresholds ($\alpha = 0.05$) against confidence threshold, $mAP(c_{min})$.

5.1.3. Results and Discussion

Evaluating the AP results per reflector, shown in Table 2, the efficiency of the methods in reflector-set estimation can be assessed. The proposed FCN method outperforms the rest of the methods for 80.7% of the reflectors (i.e., 21 out of 26). In particular, the AP is improved for the end-reflectors, i.e., the reflectors placed on the hands and the feet ($R_{13}$, $R_{18}$, $R_{22}$, $R_{26}$), and their linked reflectors, i.e., the wrists and the ankles ($R_{12}$, $R_{17}$, $R_{21}$, $R_{25}$), which are placed on the body parts with the highest mobility and, therefore, the most rapid movements. From these results, we conclude that the temporal information implicitly encoded in the proposed FCN model improves the distinction among these reflectors, while for the reflectors for which the AP is slightly lower ($R_8$, $R_{10}$, $R_{13}$, $R_{14}$, $R_{20}$), we can assume that the predicted optical flow was not accurate or informative enough to boost the prediction confidence and, therefore, the accuracy of the estimates.
Table 2. AP for PCK with $\alpha = 0.05$ for each of the 26 reflectors. The proposed FCN method outperforms the rest of the methods for 80.7% of the reflectors (i.e., 21 out of 26).
However, for all the respective methods, the prediction of the end-reflectors is weak in comparison with the rest of the reflectors. This is probably due to the fact that these reflectors are often not visible and are placed close to their linked reflectors. To highlight this difference, mAP results are presented in Table 3 with and without the end-reflectors.
Table 3. mAP for PCK with $\alpha = 0.05$, with and without end-reflectors.
The proposed approach outperforms the competitive methods, presenting an absolute increase of mAP equal to 1.47%, 0.94% and 0.89% with end-reflectors, and 1.5%, 1.16% and 0.9% without them, in comparison with [,] and [] + 3D OF, respectively. It is worth noting that the two-stream approach which takes into account the 3D optical flow ([] + 3D OF) achieves higher mAP than [,], meaning that the temporal information given by the 3D optical flow stream is encoded in the feature space of the model, resulting in higher localization accuracy.
Finally, before feeding the motion capture method with the 2D reflector-set estimates, a filtering process is applied based on two fundamental considerations of the task. First, the reflectors are detected only when visible; (i) if the region where a reflector estimate is located does not belong to a black region of size greater than or equal to $b_{min} = 5$ pixels, this estimate is discarded, and (ii) when more than one reflector is detected at the same location (absolute distance less than 3 pixels, experimentally set), the reflectors with lower confidence are dropped. Second, the reflectors are unique on an image, since we approach single-person MoCap; if more than one estimate of the same reflector is detected, only the one with the highest confidence is considered valid.
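The filtering rules above can be sketched as follows; the data structures (estimate tuples and a region-size lookup) are hypothetical helpers for illustration.

```python
import numpy as np

def filter_estimates(estimates, black_region_sizes, min_region=5, min_dist=3.0):
    """Filtering of 2D reflector-set estimates for one view (illustrative sketch).

    estimates:          list of (reflector_id, (x, y), confidence) tuples.
    black_region_sizes: hypothetical lookup mapping (x, y) -> size in pixels of
                        the black region containing that estimate (0 if none).
    """
    # (i) keep only estimates lying on a black region of at least min_region pixels
    kept = [e for e in estimates
            if black_region_sizes.get(tuple(e[1]), 0) >= min_region]

    # (ii) when estimates fall on (almost) the same location, keep the more confident
    kept.sort(key=lambda e: e[2], reverse=True)
    deduped = []
    for rid, p, conf in kept:
        if all(np.linalg.norm(np.asarray(p, float) - np.asarray(q, float)) >= min_dist
               for _, q, _ in deduped):
            deduped.append((rid, p, conf))

    # single-person capture: at most one estimate per reflector id (highest confidence)
    best = {}
    for rid, p, conf in deduped:
        if rid not in best or conf > best[rid][2]:
            best[rid] = (rid, p, conf)
    return list(best.values())
```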
The AP results per reflector after filtering are shown in Table 4. As shown, the results for all reflectors and for all methods are equal to or greater than the corresponding results before filtering. In this experiment, the proposed FCN outperforms the rest of the methods for 14 of the 26 reflectors (53.84%).
Table 4. AP for PCK with $\alpha = 0.05$ for each of the 26 reflectors, after filtering.
The mAP results for the total set of reflectors after filtering, with and without the end-reflectors, are presented in Table 5, all showing higher accuracy than the corresponding values before filtering. The proposed method outperforms [,] and [] + 3D OF, presenting an absolute increase equal to 1.25%, 0.49% and 0.37% with end-reflectors, and 1.06%, 0.47% and 0.61% without them, respectively.
Table 5. mAP for PCK with $\alpha = 0.05$, with and without end-reflectors, after filtering.
In that way, the outcome of the 2D reflector-set estimation allows us to detect the 2D locations of the reflectors and, subsequently, the corresponding 3D optical data in a global coordinate system. Qualitative results of the proposed FCN outcome on sequential input data are illustrated in Figure 13, while more qualitative results have been made publicly available (https://vcl.iti.gr/deepmocap).
Figure 13. Visualization of the proposed FCN outcome overlaid on sequential multi-view depth frames. Five multi-view sequential frames, from frame $f-10$ to $f+10$ with a frame step equal to 5, are horizontally presented.

5.2. Motion Capture on DMC3D

5.2.1. Experimental Framework

After the introduction of an efficient FCN model for 2D reflector-set estimation, the final MoCap outcome is evaluated. Applying the qualified FCN model to multi-view sequences, the 3D optical data are extracted and fed to the proposed MoCap method. For these experiments, several sequences of approximately $6 \times 10^3$ frames in total were selected from the 2 subjects that were excluded from the dataset used to train the FCN model. Considering the ground truth data of the DMC3D dataset, i.e., the motion data captured with PhaseSpace Impulse X2 [], the motion capture outcome is compared against the Kinect for Xbox One skeleton tracking with the highest quality, an inertial MoCap approach that fuses Kinect skeleton and inertial data from 9 IMUs (Fusion) [], and a second robust inertial method in a similar fashion to [] (Fusion++) that fuses ground truth data for initialization and root positioning instead of Kinect skeleton tracking. It is worth noting that jerky Kinect skeleton tracking estimates that cause highly erroneous root 3D position estimates have been excluded from the testing sequences, keeping only estimates meaningful for comparison.
Inertial MoCap methods were considered appropriate for a fair comparison due to their robust way of capturing, independent of self-occlusions. Multiple RGB-D- or 3D-based MoCap approaches were not taken into account due to the missing parts of depth and, therefore, missing 3D data on the body parts of the subject where the reflectors were placed, which would have resulted in an unfair comparison. Finally, 3D MoCap methods from monocular RGB were considered out of scope due to their one-view and less informative input, while motion capture methods from multiple RGB sources were not considered suitable for comparison due to the blurry RGB images during rapid movements.
Regarding the evaluation metrics, DeepMoCap is evaluated on the DMC3D dataset using Mean Absolute Error (MAE), Root Mean Squared Error (RMSE) and 3D PCK @ $\alpha_{3D} = 20$ cm metrics for the 3D Euclidean distance between the outcome of the methods and the ground truth on 12 joints spanned by the shoulders, elbows, wrists, hips, knees and ankles. In 3D PCK, an estimate is considered correct when the 3D Euclidean distance is less than $\alpha_{3D}$.

5.2.2. Results and Discussion

The overall results of the comparison between the MoCap methods are presented in Table 6, showing that DeepMoCap outperforms the rest of the methods. The total MAE, RMSE and 3D PCK over all sequences are 9.02 cm, 10.06 cm and 92.25%, respectively, the best results among all evaluated methods. Fusion++ [], Fusion [] and Best Kinect [] follow the proposed method, presenting 88.75%, 85.93% and 83.37% 3D PCK accuracy, respectively. Additionally, it is worth mentioning that the proposed method presents a total mean absolute error (MAE) lower than 10 cm.
Table 6. Comparative evaluation of the motion capture results of the respective methods, presenting total MAE, RMSE and 3D PCK ($\alpha_{3D} = 20$ cm) metrics.
In Table 7, the 3D PCK accuracy results are shown per exercise, giving evidence with respect to the strengths and the weaknesses of the methods for different types of body movements. DeepMoCap outperforms the rest of the methods in 12 of the 15 exercises in total (80%), efficiently capturing most of them. In detail, Walking on the spot, a simple and slow motion, is efficiently captured by all the methods. Elbow and Knee flexion exercises, which are simple rotational joint movements, are captured with high precision by all methods, especially by the inertial MoCap approaches. It is worth noting that DeepMoCap presents lower accuracy for the Side-steps exercise than Fusion++, probably due to the hands being placed over the hip reflectors, resulting in merged reflector regions, which complicated the detection and identification of the involved reflectors. However, in more complex exercises such as Butt kicks left-right and Forward lunge left-right, where there are occlusions for Kinect and body stretching for the inertial sensors placed on the body, DeepMoCap presents approximately 5% higher absolute 3D PCK accuracy than Fusion++, which follows. For Jumping jacks, a fast and complex exercise where all body parts are fully involved, DeepMoCap achieves 96.05% 3D PCK accuracy followed by Best Kinect [], while the inertial MoCap approaches fail to properly capture the shoulder 3D positions due to the rigid body movement of the torso, showing lower accuracy in such exercises. For Alternate side reaches and Kick-box kicking, which are the most challenging exercises, the 3D PCK accuracy of the proposed method is 4.84% and 8.33% higher, respectively, in comparison with the second best method (Fusion++). Furthermore, it should be noted that all exercises are captured by DeepMoCap with 3D PCK accuracy higher than 82.53%, showing low variation between different types of body movements.
Table 7. Comparative evaluation per exercise using the 3D PCK ($\alpha_{3D} = 20$ cm) metric.
In the plot presented in Figure 14, the total 3D PCK accuracy is given against the $\alpha_{3D}$ threshold value. DeepMoCap shows higher efficiency for all $\alpha_{3D}$ values, presenting high 3D PCK accuracy already at low thresholds (e.g., 32.25% and 63.27% for $\alpha_{3D} = 5$ cm and $\alpha_{3D} = 10$ cm, respectively), in comparison with the next best method, which presents 16.38% and 54.36%, respectively. Given the fact that joint positioning varies between different motion capture approaches, resulting in a positional offset between the estimated 3D joint positions, we conclude that DeepMoCap shows high efficiency, with 32.25% of the estimates on average being closer than 5 cm to the ground truth.
Figure 14. Comparative evaluation of the motion capture methods using total 3D PCK results for different $\alpha_{3D}$ threshold values in cm.
In Table 8, the Euclidean MAE and RMSE are presented per joint. It can be observed that the proposed method has the lowest errors for 9 of the 12 joints (75%) for both MAE and RMSE. For the Shoulders and the Right Elbow, Fusion++ [] shows slightly better results than DeepMoCap, probably due to better skeleton structure positioning. The lower body joints (hips, knees and feet) are captured with 6.05 cm and 7.08 cm mean absolute and root mean squared errors, respectively.
Table 8. Experimental results of the respective motion capture methods using MAE and RMSE metrics per joint (in cm).
Qualitative results depicting the 3D outcome of the proposed approach are presented in Figure 15. In particular, the multi-view input overlaid with the reflector-set estimates and the corresponding 3D motion capture along with optical data results are illustrated. As shown, DeepMoCap approaches motion capture similarly to the way traditional optical MoCap solutions work, albeit in a more flexible and low-cost manner. More qualitative results are publicly available (https://vcl.iti.gr/deepmocap).
Figure 15. Five samples of the method results are illustrated in rows. At the left side of the figure, the multi-view input along with the FCN reflector-set estimates are presented, while, at the right side, their corresponding 3D optical and motion capture outcomes are shown.

5.3. Performance Analysis

The runtime performance analysis was conducted by measuring the total time required to capture the motion data from the testing sequences. For approximately $6 \times 10^3$ three-view pairs of colorized depth and 3D optical flow frames, thus $18 \times 10^3$ single-view pairs, raw data pre-processing lasted 1796 s (∼100 ms per sample), while FCN model prediction required 3136 s (∼174 ms per sample). Thus, the proposed method achieves 2D reflector-set estimation at approximately 6 frames per second with ∼100 ms latency, while the motion tracking from optical data is real-time, requiring less than 10 ms. With respect to the input, the original frame size is $424 \times 512$, re-sized to $368 \times 444$ during testing to fit in GPU memory. Thus, DeepMoCap performs motion capture at approximately 2 Hz for 3-view input of $368 \times 444$, while the performance complexity against the number of views, i.e., input image pairs, is $O(n)$. The runtime analysis was performed on a desktop machine equipped with one NVIDIA GeForce Titan X GPU. Code (https://github.com/tofis/deepmocap) and dataset tools of the proposed method are publicly available to encourage further research in the area.

6. Conclusions

In the present work, a deep marker-based optical motion capture method is introduced, using multiple IR-D sensors and retro-reflectors. DeepMoCap constitutes a robust, fast and flexible approach that automatically extracts labeled 3D optical data and performs immediate motion capture without the need for post-processing. For this purpose, a novel two-stream, multi-stage CPM-based FCN is proposed that introduces a non-parametric representation to encode the temporal correlation among pairs of colorized depthmaps and 3D optical flow frames, resulting in retro-reflector 2D localization and labeling. This step enables the 3D optical data extraction from multiple spatio-temporally aligned IR-D sensors and, subsequently, motion capture. For research and evaluation purposes, two new datasets were created and made publicly available. The proposed method was evaluated with respect to the 2D reflector-set estimation and the motion capture accuracy on these datasets, outperforming recent and robust methods in both tasks. Taking into consideration this comparative evaluation, we conclude that the joint usage of traditional marker-based optical MoCap rationale and recent deep learning advancements in conjunction with 2.5D and 3D vision techniques can significantly contribute to the MoCap field, introducing a new way of approaching the task.
With respect to limitations, side-view capturing and highly complex body poses that occlude or merge reflectors in the image views constitute the main barriers. These limitations can be mitigated by increasing the number of IR-D sensors around the subject, at the expense of increased cost and complexity of the method. Next steps of this research include the study of recent deep learning approaches in 3D pose recovery and motion capture, investigating key features that will allow us to address main MoCap challenges such as real-time performance, efficient multi-person capturing, operation in outdoor environments, capturing more degrees of freedom of the body, and higher accuracy.

Supplementary Materials

Supplementary File 1

Author Contributions

Conceptualization, ALL; methodology, A.C.; software, A.C.; validation, ALL; formal analysis, A.C.; investigation, A.C.; resources, A.C., D.Z., P.D.; data curation, A.C.; writing–original draft preparation, A.C.; writing–review and editing, D.Z., S.K., P.D.; visualization, A.C.; supervision, D.Z., S.K., P.D.; funding acquisition, D.Z., P.D. The research in this work was mainly conducted by A.C., PhD candidate in National Technical University of Athens and Research Associate in Centre for Research and Technology – Hellas. The research was supervised by D.Z., S.K. and P.D. They provided ideas, inputs and feedback to improve the quality of the present work.

Funding

This research and the APC were funded by VRTogether, Horizon 2020 Framework Programme, grant number 762111.

Acknowledgments

This work was supported by the EU funded project VRTogether H2020 under the grant agreement 762111. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X GPU used for this research.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Appendix Strap Retro-Reflector Point-Set Fusion

Let $n^{v_i}_{R_i}$ and $n^{v_j}_{R_i}$ be the 3D normal vectors of the reflector regions for views $v_i$ and $v_j$, $v_i, v_j \in \{1, \ldots, N\}$, respectively. These vectors are defined as the normal vectors of the 3D points given by mapping the corresponding pixel sets $P^{v_i}_{R_i}$ and $P^{v_j}_{R_i}$ to 3D Cartesian coordinates. The closest 3D points $P^{v_i,j}_{R_i}$ and $P^{v_j,i}_{R_i}$ between the normal vector lines of $n^{v_i}_{R_i}$ and $n^{v_j}_{R_i}$, for a view pair $v_i - v_j$, are given by applying the equations:
$$b = n^{v_i}_{R_i} \cdot n^{v_j}_{R_i}, \quad c = n^{v_i}_{R_i} \cdot \left( P^{v_i}_{R_i} - P^{v_j}_{R_i} \right), \quad f = n^{v_j}_{R_i} \cdot \left( P^{v_i}_{R_i} - P^{v_j}_{R_i} \right), \quad d = 1 - b^2$$
$$s = \frac{b \cdot f - c}{d}, \quad t = \frac{f - c \cdot b}{d}, \quad P^{v_i,j}_{R_i} = P^{v_i}_{R_i} + s \cdot n^{v_i}_{R_i}, \quad P^{v_j,i}_{R_i} = P^{v_j}_{R_i} + t \cdot n^{v_j}_{R_i}$$
Applying Equation (10) for all $N_{R_i} = 2 \cdot m_{R_i}$ extracted 3D points, where $m_{R_i}$ is the total number of view pairs, the 3D position of the reflective strap $R_i$ is estimated at the center of the body part. Even when only one sensor captures a strap reflector, if the template-based structure is calibrated, the 3D position can be estimated using the normal vector and the body part (limb) radius.
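For reference, a NumPy sketch of the closest-point computation of Equation (10); the near-parallel fallback is an added guard, not part of the original formulation.

```python
import numpy as np

def closest_points_between_normals(P_i, n_i, P_j, n_j):
    """Closest points between the two normal-vector lines of Equation (10).

    P_i, P_j: 3D points of the strap reflector as seen from views v_i and v_j.
    n_i, n_j: unit normal vectors of the corresponding reflector regions.
    """
    P_i, n_i = np.asarray(P_i, float), np.asarray(n_i, float)
    P_j, n_j = np.asarray(P_j, float), np.asarray(n_j, float)
    b = n_i @ n_j
    c = n_i @ (P_i - P_j)
    f = n_j @ (P_i - P_j)
    d = 1.0 - b * b
    if abs(d) < 1e-9:                 # nearly parallel normals: no unique solution
        return P_i, P_j
    s = (b * f - c) / d
    t = (f - c * b) / d
    return P_i + s * n_i, P_j + t * n_j
```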

References

  1. Guerra-Filho, G. Optical Motion Capture: Theory and Implementation. RITA 2005, 12, 61–90. [Google Scholar]
  2. Merriaux, P.; Dupuis, Y.; Boutteau, R.; Vasseur, P.; Savatier, X. A study of Vicon system positioning performance. Sensors 2017, 17, 1591. [Google Scholar] [CrossRef] [PubMed]
  3. VICON. Available online: https://www.vicon.com/ (accessed on 3 August 2018).
  4. OptiTrack. Available online: http://optitrack.com/ (accessed on 3 August 2018).
  5. PhaseSpace. Available online: http://www.phasespace.com/ (accessed on 3 August 2018).
  6. Unreal Engine. Available online: https://www.unrealengine.com/ (accessed on 5 August 2018).
  7. Unity3D. Available online: https://unity3d.com/ (accessed on 5 August 2018).
  8. CRYENGINE. Available online: https://www.cryengine.com/ (accessed on 5 August 2018).
  9. Chen, X.; Davis, J. Camera Placement Considering Occlusion for Robust Motion Capture; Technical Report; Computer Graphics Laboratory, Stanford University: Stanford, CA, USA, 2000; Volume 2, p. 2. [Google Scholar]
  10. Bodenheimer, B.; Rose, C.; Rosenthal, S.; Pella, J. The process of motion capture: Dealing with the data. In Proceedings of the Computer Animation and Simulation’97, 2–3 September 1997; Springer: Budapest, Hungary, 1997; pp. 3–18. [Google Scholar]
  11. Ye, M.; Yang, C.; Stankovic, V.; Stankovic, L.; Kerr, A. A depth camera motion analysis framework for tele-rehabilitation: Motion capture and person-centric kinematics analysis. IEEE J. Sel. Top. Signal Process. 2016, 10, 877–887. [Google Scholar] [CrossRef]
  12. Cippitelli, E.; Gasparrini, S.; Spinsante, S.; Gambi, E. Kinect as a tool for gait analysis: Validation of a real-time joint extraction algorithm working in side view. Sensors 2015, 15, 1417–1434. [Google Scholar] [CrossRef] [PubMed]
  13. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  14. Cao, Z.; Simon, T.; Wei, S.E.; Sheikh, Y. Realtime multi-person 2D pose estimation using Part Affinity Fields. In Proceedings of the CVPR, Honolulu, HI, USA, 21–26 July 2017; Volume 1, p. 7. [Google Scholar]
  15. Ramakrishna, V.; Munoz, D.; Hebert, M.; Bagnell, J.A.; Sheikh, Y. Pose machines: Articulated pose estimation via inference machines. In Proceedings of the European Conference on Computer Vision, 6–12 September 2014; Springer: Zurich, Switzerland, 2014; pp. 33–47. [Google Scholar]
  16. Wei, S.E.; Ramakrishna, V.; Kanade, T.; Sheikh, Y. Convolutional Pose Machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June 2016–1 July 2016; pp. 4724–4732. [Google Scholar]
  17. Insafutdinov, E.; Pishchulin, L.; Andres, B.; Andriluka, M.; Schiele, B. Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In Proceedings of the European Conference on Computer Vision, 11–14 October 2016; Springer: Amsterdam, The Netherlands, 2016. [Google Scholar]
  18. Song, J.; Wang, L.; Van Gool, L.; Hilliges, O. Thin-slicing network: A deep structured model for pose estimation in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; Volume 2. [Google Scholar]
  19. Luo, Y.; Ren, J.; Wang, Z.; Sun, W.; Pan, J.; Liu, J.; Pang, J.; Lin, L. LSTM Pose Machines. arXiv, 2017; arXiv:1712.06316. [Google Scholar]
  20. Mehta, D.; Rhodin, H.; Casas, D.; Fua, P.; Sotnychenko, O.; Xu, W.; Theobalt, C. Monocular 3D human pose estimation in the wild using improved CNN supervision. In Proceedings of the International Conference on 3D Vision (3DV), 10–12 October 2017; IEEE: Qingdao, China, 2017; pp. 506–516. [Google Scholar]
  21. Tome, D.; Russell, C.; Agapito, L. Lifting from the deep: Convolutional 3D pose estimation from a single image. In Proceedings of the CVPR, Honolulu, HI, USA, 21–26 July 2017; pp. 2500–2509. [Google Scholar]
  22. Liu, M.; Han, S.; Lee, S. Tracking-based 3D human skeleton extraction from stereo video camera toward an on-site safety and ergonomic analysis. Constr. Innov. 2016, 16, 348–367. [Google Scholar] [CrossRef]
  23. Pavlakos, G.; Zhou, X.; Derpanis, K.G.; Daniilidis, K. Coarse-to-fine volumetric prediction for single-image 3D human pose. In Proceedings of the Computer Vision and Pattern Recognition (CVPR), 21–26 July 2017; IEEE: Honolulu, HI, USA, 2017; pp. 1263–1272. [Google Scholar]
  24. Zhou, X.; Zhu, M.; Pavlakos, G.; Leonardos, S.; Derpanis, K.G.; Daniilidis, K. Monocap: Monocular human motion capture using a CNN coupled with a geometric prior. IEEE Trans. Pattern Anal. Mach. Intell. 2018. [Google Scholar] [CrossRef] [PubMed]
  25. Mehta, D.; Sridhar, S.; Sotnychenko, O.; Rhodin, H.; Shafiei, M.; Seidel, H.P.; Xu, W.; Casas, D.; Theobalt, C. Vnect: Real-time 3D human pose estimation with a single RGB camera. ACM Trans. Graph. (TOG) 2017, 36, 44. [Google Scholar] [CrossRef]
  26. Rogez, G.; Schmid, C. Mocap-guided data augmentation for 3D pose estimation in the wild. In Advances in Neural Information Processing Systems; The MIT Press: Barcelona, Spain, 2016; pp. 3108–3116. [Google Scholar]
  27. Shotton, J.; Fitzgibbon, A.; Cook, M.; Sharp, T.; Finocchio, M.; Moore, R.; Kipman, A.; Blake, A. Real-time human pose recognition in parts from single depth images. In Proceedings of the Computer Vision and Pattern Recognition (CVPR), 20–25 June 2011; IEEE: Colorado Springs, CO, USA, 2011; pp. 1297–1304. [Google Scholar]
  28. Khoshelham, K.; Elberink, S.O. Accuracy and resolution of kinect depth data for indoor mapping applications. Sensors 2012, 12, 1437–1454. [Google Scholar] [CrossRef] [PubMed]
  29. Plantard, P.; Auvinet, E.; Pierres, A.S.L.; Multon, F. Pose estimation with a kinect for ergonomic studies: Evaluation of the accuracy using a virtual mannequin. Sensors 2015, 15, 1785–1803. [Google Scholar] [CrossRef] [PubMed]
  30. Asteriadis, S.; Chatzitofis, A.; Zarpalas, D.; Alexiadis, D.S.; Daras, P. Estimating human motion from multiple kinect sensors. In Proceedings of the 6th International Conference on Computer Vision/Computer Graphics Collaboration Techniques and Applications, 6–7 June 2013; ACM: Berlin, Germany, 2013; p. 3. [Google Scholar]
  31. Oikonomidis, I.; Kyriazis, N.; Argyros, A.A. Efficient model-based 3D tracking of hand articulations using Kinect. In Proceedings of the BMVC, Dundee, UK, 29 August–2 September 2011; Volume 1, p. 3. [Google Scholar]
  32. Zimmermann, C.; Welschehold, T.; Dornhege, C.; Burgard, W.; Brox, T. 3D Human Pose Estimation in RGBD Images for Robotic Task Learning. arXiv, 2018; arXiv:1803.02622. [Google Scholar]
  33. Haque, A.; Peng, B.; Luo, Z.; Alahi, A.; Yeung, S.; Li, F. Towards viewpoint invariant 3D human pose estimation. In Proceedings of the European Conference on Computer Vision, 11–14 October 2016; Springer: Amsterdam, The Netherlands, 2016; pp. 160–177. [Google Scholar]
  34. Rafi, U.; Gall, J.; Leibe, B. A semantic occlusion model for human pose estimation from a single depth image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  35. Joo, H.; Simon, T.; Sheikh, Y. Total Capture: A 3D Deformation Model for Tracking Faces, Hands, and Bodies. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8320–8329. [Google Scholar]
  36. Liu, Z.; Huang, J.; Han, J.; Bu, S.; Lv, J. Human motion tracking by multiple RGBD cameras. IEEE Trans. Circuits Syst. Video Technol. 2017, 27, 2014–2027. [Google Scholar] [CrossRef]
  37. Carraro, M.; Munaro, M.; Burke, J.; Menegatti, E. Real-time marker-less multi-person 3D pose estimation in RGB-Depth camera networks. arXiv, 2017; arXiv:1710.06235. [Google Scholar]
  38. Shafaei, A.; Little, J.J. Real-Time Human Motion Capture with Multiple Depth Cameras. In Proceedings of the 13th Conference on Computer and Robot Vision, CIPPRS, Victoria, BC, Canada, 1–3 June 2016. [Google Scholar]
  39. Shuai, L.; Li, C.; Guo, X.; Prabhakaran, B.; Chai, J. Motion capture with ellipsoidal skeleton using multiple depth cameras. IEEE Trans. Vis. Comput. Graph. 2017, 23, 1085–1098. [Google Scholar] [CrossRef] [PubMed]
  40. Cahill-Rowley, K.; Rose, J. Temporal–spatial reach parameters derived from inertial sensors: Comparison to 3D marker-based motion capture. J. Biomech. 2017, 52, 11–16. [Google Scholar] [CrossRef] [PubMed]
  41. von Marcard, T.; Rosenhahn, B.; Black, M.J.; Pons-Moll, G. Sparse inertial poser: Automatic 3D human pose estimation from sparse imus. Comput. Graph. Forum 2017, 36, 349–360. [Google Scholar] [CrossRef]
  42. Zhang, X.; Xiao, W. A Fuzzy Tuned and Second Estimator of the Optimal Quaternion Complementary Filter for Human Motion Measurement with Inertial and Magnetic Sensors. Sensors 2018, 18, 3517. [Google Scholar] [CrossRef]
  43. Szczęsna, A.; Skurowski, P.; Lach, E.; Pruszowski, P.; Pęszor, D.; Paszkuta, M.; Słupik, J.; Lebek, K.; Janiak, M.; Polański, A.; et al. Inertial motion capture costume design study. Sensors 2017, 17, 612. [Google Scholar] [CrossRef]
  44. Destelle, F.; Ahmadi, A.; O’Connor, N.E.; Moran, K.; Chatzitofis, A.; Zarpalas, D.; Daras, P. Low-cost accurate skeleton tracking based on fusion of kinect and wearable inertial sensors. In Proceedings of the European Signal Processing Conference (EUSIPCO), 1–5 September 2014; IEEE: Lisbon, Portugal, 2014; pp. 371–375. [Google Scholar]
  45. Zheng, Z.; Yu, T.; Li, H.; Guo, K.; Dai, Q.; Fang, L.; Liu, Y. HybridFusion: Real-Time Performance Capture Using a Single Depth Sensor and Sparse IMUs. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  46. Riaz, Q.; Tao, G.; Krüger, B.; Weber, A. Motion reconstruction using very few accelerometers and ground contacts. Graph. Models 2015, 79, 23–38. [Google Scholar] [CrossRef]
  47. Barnes, F.P. Illuminated Protective Clothing. U.S. Patent No. 5,249,106, 28 September 1993. [Google Scholar]
  48. Alexiadis, D.S.; Chatzitofis, A.; Zioulis, N.; Zoidi, O.; Louizis, G.; Zarpalas, D.; Daras, P. An integrated platform for live 3D human reconstruction and motion capturing. IEEE Trans. Circuits Syst. Video Technol. 2017, 27, 798–813. [Google Scholar] [CrossRef]
  49. Jaimez, M.; Souiai, M.; Gonzalez-Jimenez, J.; Cremers, D. A primal-dual framework for real-time dense RGB-D scene flow. In Proceedings of the International Conference on Robotics and Automation (ICRA), 26–30 May 2015; IEEE: Seattle, WA, USA; pp. 98–104. [Google Scholar]
  50. Thermos, S.; Papadopoulos, G.T.; Daras, P.; Potamianos, G. Deep Affordance-grounded Sensorimotor Object Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  51. Wang, P.; Li, W.; Gao, Z.; Zhang, Y.; Tang, C.; Ogunbona, P. Scene flow to action map: A new representation for rgb-d based action recognition with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  52. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv, 2014; arXiv:1409.1556. [Google Scholar]
  53. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, 6–12 September 2014; Springer: Zurich, Switzerland; pp. 740–755. [Google Scholar]
  54. Grewal, M.S. Kalman Filtering. In International Encyclopedia of Statistical Science; Springer: Berlin/Heidelberg, Germany, 2011; pp. 705–708. [Google Scholar]
  55. Schubert, T.; Gkogkidis, A.; Ball, T.; Burgard, W. Automatic initialization for skeleton tracking in optical motion capture. In Proceedings of the International Conference on Robotics and Automation (ICRA), 26–30 May 2015; IEEE: Seattle, WA, USA; pp. 734–739. [Google Scholar]
  56. Zordan, V.B.; Van Der Horst, N.C. Mapping optical motion capture data to skeletal motion using a physical model. In Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation, 26–27 July 2003; Eurographics Association: San Diego, CA, USA; pp. 245–250. [Google Scholar]
  57. Paulich, M.; Schepers, M.; Rudigkeit, N.; Bellusci, G. Xsens MTw Awinda: Miniature Wireless Inertial-Magnetic Motion Tracker for Highly Accurate 3D Kinematic Applications; Xsens: Enschede, The Netherlands, 2018. [Google Scholar]
  58. Yang, Y.; Ramanan, D. Articulated pose estimation with flexible mixtures-of-parts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 20–25 June 2011; IEEE: Colorado Springs, CO, USA; pp. 1385–1392. [Google Scholar]
