High-Speed Dynamic Projection Mapping onto Human Arm with Realistic Skin Deformation †

Abstract: Dynamic projection mapping for a moving object according to its position and shape is fundamental for augmented reality to create the perception of changes on a target surface. For instance, augmenting the human arm surface via dynamic projection mapping can enhance applications in fashion, user interfaces, prototyping, education, medical assistance, and other fields. For such applications, however, conventional methods neglect skin deformation and have a high latency between motion and projection, causing noticeable misalignment between the target arm surface and projected images. These problems degrade the user experience and limit the development of more applications. We propose a system for high-speed dynamic projection mapping onto a rapidly moving human arm with realistic skin deformation. With the developed system, the user does not perceive any misalignment between the arm surface and projected images. First, we combine a state-of-the-art parametric deformable surface model with efficient regression-based accuracy compensation to represent skin deformation. Through compensation, we modify the texture coordinates to achieve fast and accurate image generation for projection mapping based on joint tracking. Second, we develop a high-speed system that provides a latency between motion and projection below 10 ms, which is generally imperceptible to human vision. Compared with conventional methods, the proposed system provides more realistic experiences and increases the applicability of dynamic projection mapping.


Introduction
Augmented reality is being rapidly developed to enhance the user experience in the real world and has attracted much attention in research and industry. A fundamental approach to realizing augmented reality is projection mapping, also called spatial augmented reality [1], which has been widely used for a variety of applications. Projection mapping aims to create the perception of changes in the materials and shape of a target surface by overlaying images according to the target position and shape. Compared with other augmented reality techniques relying on handheld devices or head-mounted displays, projection mapping requires no handheld or worn devices, and the information presented via projection mapping can be shared across multiple users. Various applications have demonstrated the applicability of projection mapping. For instance, theme parks and other entertainment environments have been enhanced and energized by adopting projection mapping [2].
Existing projection mapping techniques can be classified according to the dynamics and shapes of the target objects. In traditional projection mapping systems, beginning with the shader lamps proposed by Raskar et al. [3], the targets were limited to static and rigid objects.
To overcome these limitations, studies on dynamic projection mapping have been subsequently conducted considering moving and non-rigid targets to achieve highly immersive visual experiences. Siegl et al. [4] proposed a dynamic projection mapping system using a depth sensor. Narita et al. [5] proposed a non-rigid projection mapping system using a deformable dot cluster marker. Miyashita et al. [6] introduced the MIDAS Projection, a marker-less and model-less non-rigid dynamic projection system to represent the appearance of materials using real-time measurements in the infrared region. Nomoto et al. [7] addressed limitations of dynamic projection mapping using multiple projectors and a pixel-parallel algorithm.
Projection mapping can also be performed on the human body, which is a complex non-rigid surface. For instance, accurate face projection mapping systems have been proposed [8,9]. Bermano et al. [8] achieved projection mapping onto a human face based on non-rigid marker-less face tracking. Siegl et al. [9] proposed the FaceForge system to improve the applicability and quality of face projection mapping using multiple projectors. Similarly, other parts of the human body can be augmented by projection mapping. Augmenting the human arm surface via dynamic projection mapping can notably improve applications related to activities of daily living. In addition, projection mapping has found applications in a variety of fields such as fashion [10,11], user interfaces [12–15], prototyping [16,17], education [18], and medical assistance [19]. For instance, Gannon et al. [16] presented ExoSkin, a hybrid fabrication system for designing and projecting digital objects directly on the body through projection mapping and marker-based joint tracking. Xiao et al. [12] introduced LumiWatch, a skin-projected wristwatch, to enable coarse touch screen capabilities on the arm surface. However, conventional projection mapping systems for the human arm surface mostly neglect skin deformation and have a high latency between motion and projection, consequently causing noticeable misalignments between the arm surface and projected images. These problems degrade the user experience and limit the development of more applications.
We focus on high-speed dynamic projection mapping onto the human arm surface with realistic skin deformation for the user not to perceive any misalignment between the arm and projected images. We address two main challenges: reducing the overall latency of the system and reconstructing realistic skin deformation. Ng et al. [20] developed a projection-based touch input device with a response latency of 1 ms and determined that users cannot perceive an average projection latency below 6.04 ms. Therefore, a projection mapping system should achieve a latency of a few milliseconds to avoid perceivable misalignments.
To provide an accurate and realistic experience, we propose a dynamic projection mapping system for the human arm surface with the following two contributions [21]. First, we combine a state-of-the-art parametric deformable surface model with an efficient regression-based accuracy compensation method to describe skin deformation. The compensation method modifies the texture coordinates in parallel with the 3D shape reconstruction to achieve fast and accurate image generation for projection mapping based on joint tracking. Second, we develop a high-speed system that provides a latency between motion and projection below 10 ms, which can hardly be perceived by human vision. Compared with conventional methods, the proposed system provides more realistic experiences and broadens the applicability of dynamic projection mapping.

Related Work
In this section, we discuss a variety of studies on human skin surface reconstruction.

Non-Rigid Registration
Non-rigid registration has been widely studied in recent years and is a suitable method for human skin surface reconstruction. In essence, it deforms the vertices of a model to fit the observations at each time step using a non-rigid deformation [22–24]. The as-rigid-as-possible method [22] is commonly used to regularize deformable surfaces [25–31]. Zollhöfer et al. [26] proposed a novel GPU (graphics processing unit) pipeline for non-rigid registration of live RGB-D (color and depth) data to a smooth template using an extended nonlinear as-rigid-as-possible framework. Gao et al. [30] proposed SurfelWarp for real-time reconstruction using the as-rigid-as-possible method.
However, pure non-rigid deformation suffers from error accumulation and usually fails when tracking long motion sequences [31]. Guo et al. [28] introduced robust tracking of complex human bodies using non-rigid motion tracking techniques. Li et al. [29] developed a robust template-based non-rigid registration method for the human body using local-coordinate regularization, thus improving the registration speed. Nonetheless, these non-rigid registration methods for the human body cannot achieve real-time performance; their processing rate usually remains below 1 fps.

Learned Human Body Model
In computer graphics, the generation of realistic representations of human bodies has been pursued to describe different body shapes and natural deformations according to the pose. For instance, blend skinning [32,33] and blend shapes [34,35] are widely used in animation. In recent years, higher-quality models based on blend skinning and blend shapes have been proposed, such as SCAPE [36], BlendSCAPE [37], and SMPL [38]. SMPL, the skinned multi-person linear model, is a realistic 3D model of the human body learned from thousands of 3D body scans. The SMPL model is suitable to represent multiple persons and provides fast rendering and ease of use.

3D Human Surface Estimation
In computer vision, remarkable results have been achieved in the reconstruction of the human body surface based on parametric deformable surface models, such as SMPL [38], without using markers.
Early works aimed to fit parametric models to 2D image observations through iterative optimization, which implies time-consuming computations and requires careful initialization. Bogo et al. [39] proposed the SMPLify framework to automatically estimate the SMPL parameters using human landmarks. Similarly, Lassner et al. [40] introduced a method for estimating the SMPL parameters using silhouettes, and Joo et al. [41] proposed the Total Capture system to estimate the SMPL parameters using a multi-camera system.
Convolutional neural networks have been widely studied with the development of deep learning. Kanazawa et al. [42] presented an end-to-end adversarial learning method, HMR, to estimate the SMPL parameters in approximately 40 ms using a single GPU (NVIDIA GeForce GTX 1080 Ti; NVIDIA, Santa Clara, CA, USA). Similarly, Pavlakos et al. [43] introduced an end-to-end framework to estimate the SMPL parameters in 50 ms using a single GPU (NVIDIA Titan X). In these methods, the process is decomposed into two stages: first, regression of some type of 2D representation, such as a joint heatmap, mask, or 2D segmentation; second, estimation of the model parameters from the intermediate results.
As the performance of two-stage methods relies heavily on the accuracy of the intermediate results, the input information is not fully utilized. Recent studies have reported the higher efficiency and effectiveness of estimating meshes representing human bodies in a single stage instead of two stages [44–46]. DenseBody, proposed by Yao et al. [46], achieves state-of-the-art speed by introducing a new representation for 3D objects that is more suitable for convolutional neural networks. Its runtime is approximately 5 ms for images of 256 × 256 pixels using a single GPU (NVIDIA GeForce GTX 1080 Ti). Although a runtime of 5 ms may be appropriate, the accuracy of DenseBody is still a critical problem, especially for representing the limbs. The main reason for the limited accuracy of DenseBody is its low resolution, as the limbs account for only a small portion of an input image of the whole body. Thus, the skin deformation accuracy on the limbs, including the arms, is too low for dynamic projection mapping. Moreover, the reconstruction of realistic skin deformation without markers while achieving high projection accuracy and speed remains to be solved.

Figure 1 shows the system configuration and method pipeline. The proposed system is composed of three main parts: tracking, rendering, and projection. During tracking, arm poses including wrist pose $P_{\mathrm{wrist}}$ and elbow pose $P_{\mathrm{elbow}}$ are obtained by a marker-based motion tracking system in real time and used for rendering. Gray dots on the arm in Figure 1 indicate markers. During rendering, shape reconstruction outputs a 3D mesh of the arm surface based on the acquired real-time joint poses and predefined shape parameters, while accuracy compensation outputs the 2D texture coordinates of the arm model based on the joint poses only. These two processes are performed in parallel to accelerate rendering. The outputs are then combined by texture mapping to generate an image for projection. During projection, a high-speed projector shows the image on the arm of a user.

High-Speed Motion Tracking
As mentioned in Section 2, it is difficult to reconstruct skin deformation with high accuracy and speed using marker-less methods. However, fast and accurate methods are required for applications such as prototyping and medical assistance. Although they increase the setup effort and may cause some discomfort to the user, marker-based methods are more suitable than their marker-less counterparts to track joint poses and reconstruct the arm surface using a parametric deformable surface model, such as SMPL [38]. For instance, the system in [16] used marker-based tracking to obtain accurate arm poses.
The SMPL model represents a realistic human body using 10-dimensional body shape parameter β and 72-dimensional pose parameter θ [38]. As the user's body shape does not vary in a short time, β can be determined in advance using marker-less methods, such as those in [39–43]. In contrast, the user's pose changes rapidly and notably contributes to skin deformation. We consider the forearm (i.e., the region between the elbow and wrist) for the proposed system. According to the definition of blend skinning, each vertex of the mesh is transformed by a weighted combination of the transformations of its neighboring joints. Hence, only the poses of the wrist and elbow are required to represent the forearm.
Commercial marker-based motion tracking systems, such as OptiTrack (NaturalPoint, Corvallis, OR, USA) [47] and Vicon (Oxford Metrics, Yarnton, UK) [48], can offer robust pose information of rigid bodies with a very short latency. In the proposed system, wrist pose $P_{\mathrm{wrist}}$ and elbow pose $P_{\mathrm{elbow}}$ are obtained from a motion tracking system. Joint pose $P$ is a six-dimensional vector that includes the joint position, $J \in \mathbb{R}^3$, and joint orientation, $\gamma \in \mathbb{R}^3$.
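As an illustration of why two joint poses suffice for the forearm, the following minimal sketch applies linear blend skinning with only an elbow and a wrist transform. It is not the SMPL implementation: the function names, the interpretation of γ as a world-frame axis-angle vector, and the per-vertex weights are assumptions made for this example.

```python
import numpy as np


def rodrigues(gamma):
    """Convert an axis-angle vector (3,) into a 3x3 rotation matrix."""
    gamma = np.asarray(gamma, dtype=float)
    angle = np.linalg.norm(gamma)
    if angle < 1e-12:
        return np.eye(3)
    k = gamma / angle
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)


def joint_transform(pose, rest_position):
    """Build the 4x4 transform that carries rest-pose points along with a joint.

    pose: six-vector (J, gamma), assumed here to be the world position and
          world-frame axis-angle orientation of the joint.
    rest_position: location of the same joint in the rest pose.
    """
    pose = np.asarray(pose, dtype=float)
    rest_position = np.asarray(rest_position, dtype=float)
    R = rodrigues(pose[3:])
    T = np.eye(4)
    T[:3, :3] = R
    # x' = R (x - c) + J  =  R x + (J - R c)
    T[:3, 3] = pose[:3] - R @ rest_position
    return T


def blend_skinning(rest_vertices, weights, transforms):
    """Linear blend skinning.

    rest_vertices: (N, 3) rest-pose vertices.
    weights: (N, K) skinning weights, each row summing to 1.
    transforms: (K, 4, 4) joint transforms (here K = 2: elbow, wrist).
    """
    n = rest_vertices.shape[0]
    homogeneous = np.hstack([rest_vertices, np.ones((n, 1))])
    blended = np.einsum("nk,kij->nij", weights, transforms)
    posed = np.einsum("nij,nj->ni", blended, homogeneous)
    return posed[:, :3]
```

With a vertex weighted entirely to the wrist, a 90° rotation of the wrist about the forearm axis rotates that vertex about the axis, while a vertex weighted entirely to an unmoved elbow stays in place; intermediate weights interpolate between the two transforms.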

High-Speed Rendering
The arm surface, which is represented by 3D mesh $T \in \mathbb{R}^{3N}$ with $N$ vertices, is reconstructed using the SMPL model [38]. Although the SMPL model provides high-quality reconstruction, it suffers from various problems. For instance, despite the accurate shape representation [41], small details of skin deformation are missing in the SMPL model due to its linearity [49].
The first row of Figure 2 illustrates the accuracy problems of the SMPL model. The misalignment between the projected patterns and black crosses is clear. This is a critical problem in projection mapping because users are sensitive to misalignments between the deformed skin and projected images. The problem is all the more noticeable because the users observe the projected image directly on their arms instead of on a separate display.

Regarding the arm, wrist pronation and supination tend to cause the most obvious skin deformations and hence the most severe misalignments, as shown in the first row of Figure 2. Pronation and supination are rotations around the axis of the forearm. Thus, we assume that the misalignments caused by these motions are mainly perpendicular to the axis of the forearm.

To improve the accuracy of describing skin deformation in the SMPL model for dynamic projection mapping, we propose an efficient regression-based compensation method. The method modifies the texture coordinates of the model in real time according to Equation (1), where $(u^*, v^*)$ are the initial texture coordinates provided by the arm-surface model before compensation. The u-axis is perpendicular to the axis of the forearm, and the v-axis is parallel to it. Regression model $g(v^*, \gamma_{\mathrm{wrist}})$ is prepared in advance using polynomial curve fitting of user-specific data. The compensation method provides $u$ and $v$, the texture coordinates of the model that are consistent with the reconstruction of the arm surface, $T$, obtained by the SMPL model. In the proposed system, the SMPL model serves as the shape reconstruction method, and the compensation method runs in parallel with it to accelerate processing, as shown in Figure 3. The reconstruction and compensation outputs are then combined by texture mapping, which maps an image onto a 3D shape.
The required processing time of the proposed compensation method is shorter than that of the arm-surface reconstruction due to the simple polynomial regression. Thus, the proposed compensation method does not increase the overall system latency and improves the accuracy of dynamic projection mapping.
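The compensation step can be sketched in a few lines. The sketch below assumes that the correction acts only along the u-axis, i.e., u = u* + g(v*, γ_wrist) and v = v*, which matches the assumption that the misalignments are mainly perpendicular to the forearm axis. The specific form of g used here (the wrist twist angle multiplied by a polynomial in v*) and the name `compensate_uv` are hypothetical and chosen only for illustration.

```python
import numpy as np


def compensate_uv(uv_star, gamma_wrist_x, coeffs):
    """Shift texture coordinates along u as a function of v* and wrist twist.

    uv_star: (N, 2) initial texture coordinates (u*, v*).
    gamma_wrist_x: wrist rotation about the forearm axis (radians).
    coeffs: polynomial coefficients in v*, highest degree first
            (hypothetical stand-in for the fitted regression model g).
    Returns an (N, 2) array of compensated coordinates (u, v).
    """
    uv_star = np.asarray(uv_star, dtype=float)
    u_star, v_star = uv_star[:, 0], uv_star[:, 1]
    # Assumed form of g: twist angle times a polynomial in v*.
    g = gamma_wrist_x * np.polyval(coeffs, v_star)
    # Only u is modified; v is passed through unchanged.
    return np.column_stack([u_star + g, v_star])
```

Because the computation is a single polynomial evaluation per vertex, it is cheap compared with mesh reconstruction, which is consistent with the observation above that compensation does not add to the overall latency.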

[Figure 3: shape reconstruction and accuracy compensation, performed in parallel]

High-Speed Projection
To perform dynamic projection mapping within milliseconds [20], it is necessary to reduce the latency between image transmission and projection. Commercial off-the-shelf projectors usually have a latency above 10 ms after the projected image is delivered by the computer, which makes them unsuitable for the proposed system. Alternatively, projectors based on Digital Light Processing (DLP) chipsets have been proposed to achieve extremely high projection speeds. For instance, Watanabe et al. [50] developed a single-chip DLP high-speed monochrome projector with a rate of 1000 fps and a latency of 3 ms. Such high performance is achieved by the synchronized operation of a digital micromirror device and an LED (light emitting diode) using a specialized image transmission module. To project RGB images using DLP projectors, three-chip DLP architectures have been combined to project the red, green, and blue image channels [8,51]. For instance, a customized 24-bit RGB projector at 480 fps has been devised using a three-chip DLP configuration [8]. However, this configuration requires complicated optical systems that are costly and bulky. Watanabe et al. [52] then developed a single-chip DLP 24-bit RGB projector with fast response and high brightness.
In the proposed system, considering the system configuration and latency, we use a state-of-the-art high-speed single-chip projector that displays 24-bit XGA (Extended Graphics Array) images at a maximum rate of 947 fps [52]. As a result, the latency in image transmission from generation on the computer until projection can be reduced to less than 3 ms.

Figure 4 shows the system configuration used for the experiments to test the proposed system. Eight OptiTrack Prime 17W cameras (360 fps, 1664 × 1088 resolution, 70° field of view) were used to determine the joint poses within 4 ms [47]. In addition, a 24-bit high-speed projector was used. The projector achieves a maximum rate of 947 fps and 3 ms latency [52]. For the experiments, we used a computer equipped with dual Intel Xeon Gold 6136 processors (3.00 GHz, 24 cores; Intel, Santa Clara, CA, USA), an NVIDIA GeForce RTX 2080 Ti (VRAM 11.0 GB) GPU, and 80.0 GB (2666 MHz) memory.

[Figure 4: system configuration, with high-speed cameras (360 fps), image generation, and a high-speed projector (947 fps)]

The stereo calibration between the projector and tracking system was performed in advance by collecting several pairs of corresponding 2D and 3D data points from the projector and tracking system, respectively. The motion tracking system can use both passive and active markers to obtain the 3D position and orientation. For simplicity in the setup on the skin surface, we used passive markers to test the proposed system. We directly attached the markers to the skin surface around the wrist and elbow, using four markers per body part. The markers were registered as two rigid bodies in the motion tracking system to accurately determine wrist pose $P_{\mathrm{wrist}}$ and elbow pose $P_{\mathrm{elbow}}$ in real time. The origin and local coordinate system of each rigid body were manually aligned with those of the corresponding joint.
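The calibration from 2D-3D correspondences described above can be illustrated with the direct linear transform (DLT), a standard technique for estimating a 3 × 4 projection matrix. The text does not state which calibration algorithm was used, so the DLT, the distortion-free pinhole model of the projector, and the function names below are assumptions of this sketch.

```python
import numpy as np


def estimate_projection_matrix(points_3d, points_2d):
    """Estimate a 3x4 projection matrix (up to scale) with the DLT.

    points_3d: (n, 3) positions in the tracking-system frame.
    points_2d: (n, 2) corresponding projector pixel coordinates.
    At least six non-coplanar correspondences are required.
    """
    X = np.asarray(points_3d, dtype=float)
    x = np.asarray(points_2d, dtype=float)
    n = X.shape[0]
    if n < 6:
        raise ValueError("DLT needs at least six correspondences")
    Xh = np.hstack([X, np.ones((n, 1))])
    A = np.zeros((2 * n, 12))
    for i in range(n):
        u, v = x[i]
        # u * (p3 . X) = p1 . X   and   v * (p3 . X) = p2 . X
        A[2 * i, 0:4] = Xh[i]
        A[2 * i, 8:12] = -u * Xh[i]
        A[2 * i + 1, 4:8] = Xh[i]
        A[2 * i + 1, 8:12] = -v * Xh[i]
    # The solution is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    return vt[-1].reshape(3, 4)


def project(P, points_3d):
    """Project 3D points to pixel coordinates with projection matrix P."""
    X = np.asarray(points_3d, dtype=float)
    Xh = np.hstack([X, np.ones((X.shape[0], 1))])
    x = Xh @ P.T
    return x[:, :2] / x[:, 2:3]
```

In practice, the reprojection error of `project` on held-out correspondences gives a quick check of calibration quality; with noisy measurements, coordinate normalization and nonlinear refinement are commonly added on top of this linear estimate.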

Model Preparation
The SMPL model is described by pose parameter θ and shape parameter β [38]. Whereas pose parameter θ must be tracked in real time, shape parameter β can be obtained in advance for each user. We used the open-source frameworks SMPLify [39] and OpenPose [53] to estimate shape parameter β. OpenPose estimates a 2D human pose from a single image. Then, SMPLify estimates the SMPL parameters from the single image together with the 2D human pose.
The proposed accuracy compensation method was applied in the 2D space of the texture. However, the default UV map of the SMPL model hinders compensation [54]. The default UV map of the SMPL model for a whole human body including the arm is shown in the left graph of Figure 5. We used Blender [55], an open-source 3D creation software, to modify the whole-body UV map and extract the forearm region. The extracted UV map was modified such that the u-axis and v-axis were respectively perpendicular and parallel to the axis of the forearm, and the Euclidean distances in 2D texture coordinate space between all vertices were proportional to their Euclidean distances in 3D world space, obtaining the result shown in the right graph of Figure 5. The vertices $(u^*_{\mathrm{wrist}}, v^*_{\mathrm{wrist}}) \in (u^*, v^*)$ are those around the wrist, i.e., the vertices closest to the wrist among $(u^*, v^*)$. In the experiment, we set $v^*_{\mathrm{wrist}}$ to be the zero vector.

Data Preparation
We determined the regression model in Equation (1) in advance using polynomial curve fitting of user-specific data. Six markers for data collection were placed at discrete positions along the forearm, as shown in Figure 6. The 3D position $M_i^t$ of the i-th marker at time t was obtained by the motion tracking system. The data were processed according to Equation (2), where $o_i^t$ is the cumulative offset of the i-th marker at time t, perpendicular to the axis of the forearm, and $d_i^t$ is the Euclidean distance between the i-th marker and the wrist at time t. Over 10,000 samples of $(o_i^t, d_i^t)$ were collected for a user in the experiments, and a sixth-degree polynomial was fitted to them, giving Equation (3), where the x-axis is parallel to the axis of the forearm in the wrist coordinate system; $d \in \mathbb{R}^N$ is a vector that contains the Euclidean distances in 3D world space from the arm-surface model's vertices to the wrist; $o^* \in \mathbb{R}^N$ is a vector that contains the initial values of the arm-surface model's vertices before the cumulative offset is calculated; and $o \in \mathbb{R}^N$ is a vector that contains the cumulative offsets of the arm-surface model's vertices according to the corresponding $\gamma^x_{\mathrm{wrist}}$, $d$, and $o^*$.
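The curve-fitting step can be sketched as an ordinary least-squares fit of a sixth-degree polynomial. The sketch below fits the offset as a function of the distance to the wrist for a single, fixed wrist rotation; how the wrist angle enters the fitted model is not specified here, so that simplification, the function name, and the units are assumptions.

```python
import numpy as np


def fit_offset_polynomial(d, o, degree=6):
    """Least-squares fit of offset o (mm) against distance to the wrist d (mm).

    Returns a numpy.polynomial.Polynomial, which can be called on new
    distances. Polynomial.fit rescales the input domain internally, which
    keeps a sixth-degree fit numerically well conditioned.
    """
    d = np.asarray(d, dtype=float)
    o = np.asarray(o, dtype=float)
    if d.size <= degree:
        raise ValueError("need more samples than the polynomial degree")
    return np.polynomial.Polynomial.fit(d, o, deg=degree)
```

A fitted model `p = fit_offset_polynomial(d_samples, o_samples)` can then be evaluated per vertex as `p(d_vertices)`; repeating the fit for several wrist angles, or adding the angle as a second regressor, would be one way to cover the full range of pronation and supination.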
Since $o$ and $o^*$ are perpendicular to the axis of the forearm and $d$ is approximately parallel to the axis of the forearm, we accordingly obtained the conversion relationship in Equation (4), where $k$ is the parameter that converts pixels to millimeters, and $v^*_0 \in \mathbb{R}^N$ is a vector that contains, for each vertex $(u^*, v^*)$, the v-axis value of the wrist vertex $(u^*_{\mathrm{wrist}}, v^*_{\mathrm{wrist}})$ that has the minimum Euclidean distance to it in 2D texture coordinate space.
Owing to the modification of the UV map in Section 4.2, $v^*_0$ is the zero vector, and Equation (4) simplifies to Equation (5). Hence, by combining Equations (3) and (5), the regression model $g(v^*, \gamma_{\mathrm{wrist}})$ was obtained as Equation (6).
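For orientation, one set of relations consistent with the definitions in this section is written out below. It is a sketch under stated assumptions, not the exact published form: it assumes that texture coordinates are measured in pixels, that world-space quantities are in millimeters so that a length in millimeters equals $k$ times a length in pixels, and that the compensation acts only along the u-axis.

```latex
\begin{align*}
  % Assumed unit conversion between world space (mm) and texture space (px)
  o &= k\,u, & o^{*} &= k\,u^{*}, & d &= k\,\bigl(v^{*} - v^{*}_{0}\bigr), \\
  % With v*_0 = 0 after the UV-map modification
  d &= k\,v^{*}, \\
  % Texture-space correction implied by the fitted world-space offset
  g\bigl(v^{*}, \gamma_{\mathrm{wrist}}\bigr) &= u - u^{*}
     = \frac{1}{k}\,\bigl(o - o^{*}\bigr)\Big|_{d \,=\, k\,v^{*}} .
\end{align*}
```

Under these assumptions, the world-space polynomial fitted from marker data is evaluated at $d = k\,v^{*}$ and divided by $k$ to give the shift applied to each vertex in texture space.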

Results
Figure 2 (first and second rows) presents the projection results obtained before and after accuracy compensation. Black patterns were projected on the forearm surface, on which five black crosses had been drawn using an ink marker as the ground truth of skin deformation. The misalignment between the projected patterns and black crosses in the first row of the figure is clear. The second row of the figure shows the improvement in projection accuracy achieved by applying the proposed compensation method, which suppresses the misalignment. We evaluated the accuracy before and after accuracy compensation by counting, over the time sequence, the number of crosses that lay between the projected black patterns. The evaluation was conducted for 5 s, with 150 frames. The user performed wrist supination and pronation once during the evaluation. The more crosses that lie between the projected black patterns, the higher the accuracy. Figure 2 (bottom) presents the evaluation results. All five black crosses stay between the projected black patterns before wrist supination and pronation. Before accuracy compensation, the number of crosses between the projected black patterns drops to 2 and 0 after wrist supination and pronation, respectively. After accuracy compensation, the number of crosses remains 5 during both supination and pronation.
The developed system achieved a rate of 360 fps, with the tracking cameras being the bottleneck. The tracking part causes a latency of 4 ms, as reported by the OptiTrack software [47]. The rendering part causes a latency of 3 ms, measured directly in the program. The projection part causes a latency of 3 ms, as reported in [52]. Therefore, by adding the latencies of the three parts, we deduce that the overall latency between motion and projection was approximately 10 ms. At this latency, users could hardly notice any misalignment between the projected image and arm surface, as illustrated in Figure 7. Thus, the proposed system provides fast and accurate projection.

Figure 8 presents an example of tattoo image projection on the forearm with varying skin deformation using the developed system. The projected image can be selected freely by the user and is deformed realistically as the wrist and elbow joints rotate. The developed system can also enable flexible on-body user interfaces because both sides of the arm surface can be used. In the left photograph of Figure 9, the inner side of the forearm serves as a menu panel, and, in the right photograph, the outer side of the forearm serves as a music player interface. Figure 10 presents another example, wound effects rendered using a physically based rendering texture.
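The latency budget reported in this section can be checked with simple arithmetic. The short sketch below sums the three stage latencies, derives the frame period at the camera-limited rate, and estimates the spatial misalignment that a given latency produces; the arm speed of 1 m/s used in the example is a hypothetical value chosen for illustration, not a measurement.

```python
# Stage latencies stated in the text, in milliseconds.
STAGE_LATENCY_MS = {"tracking": 4.0, "rendering": 3.0, "projection": 3.0}


def total_latency_ms(stages):
    """Motion-to-projection latency as the sum of the stage latencies."""
    return sum(stages.values())


def frame_period_ms(fps):
    """Time between consecutive frames at a given rate."""
    return 1000.0 / fps


def system_rate_fps(stage_rates_fps):
    """Throughput of a pipeline is limited by its slowest stage."""
    return min(stage_rates_fps)


def misalignment_mm(speed_m_per_s, latency_ms):
    """Distance the surface moves during the latency (1 m/s equals 1 mm/ms)."""
    return speed_m_per_s * latency_ms


if __name__ == "__main__":
    latency = total_latency_ms(STAGE_LATENCY_MS)
    print(f"total latency: {latency:.1f} ms")
    print(f"system rate: {system_rate_fps([360, 947])} fps")
    print(f"frame period at 360 fps: {frame_period_ms(360):.2f} ms")
    # Hypothetical arm speed of 1 m/s.
    print(f"misalignment at 1 m/s: {misalignment_mm(1.0, latency):.1f} mm")
```

The same arithmetic shows why a conventional pipeline with a latency of several tens of milliseconds produces offsets of centimeters on a fast-moving arm.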

Discussion
Herein, we propose a high-speed dynamic projection mapping system for the human arm with realistic skin deformation. The developed system achieves a latency of 10 ms between motion and projection, which is almost imperceptible to human vision even under high-speed rotations and translations, as shown in Figure 7. To our knowledge, this is the first projection mapping system that can realistically handle human arm surfaces with a very low latency, although the achieved latency is still a few milliseconds higher than the perception threshold reported in [20]. We believe that our system can be beneficial for applications in many fields such as fashion, user interfaces, and education. For example, a tattoo is difficult to remove once applied; with the proposed system, however, people can enjoy a realistic tattoo digitally and intuitively find their favorite tattoo design, as shown in Figure 8. On-body interfaces can become more flexible and interactive owing to realistic skin deformation and low latency, as shown in Figure 9. Special-effect makeup, typically considered complicated and costly, can be easily applied in the film industry, as shown in Figure 10. Physiotherapy students can gain a better understanding of anatomy when anatomical structures are projected directly onto the body.
Nevertheless, the proposed system has various limitations that remain to be addressed. For instance, the system uses a marker-based method, which complicates the setup. The registration of the joints requires careful manual operation to ensure that the coordinate systems of the registered rigid bodies and the joints are consistent. The error between the joints and the registered rigid bodies may lead to severe misalignments in the projection and error of the skin deformation. We believe that the problem can be alleviated if more joints are registered in the tracking system. The relationship of the skeletal structure can provide information to improve the registration accuracy.
In addition, we focused only on the arm surface in this study, and the proposed system should be extended to other body parts to widen its applicability. For this purpose, the proposed accuracy compensation method should be carefully redesigned because the neighboring joints of the vertices of other body parts can differ from those of the forearm. For example, the vertices of the torso have more than two neighboring joints.
The system described herein is user-specific in that the regression model was prepared in advance for a specific user. This limitation narrows the applications, and the regression model should also be extended to multiple persons. It is known that the SMPL model represents different body shapes through shape parameters [38]. If we collect the data of multiple persons with different shape parameters and regress the collected data with the shape parameters, the regression model can possibly be extended to multiple persons. In future work, we will implement the system using a marker-less method and widen its application to multiple persons simultaneously using learning-based methods.