1. Introduction
In recent years, video-based content has grown explosively, driving demand for live streaming services with ever-higher quality expectations. Supporting such real-time streaming requires both high-performance computing and efficient network management. At the same time, interest in virtual content has expanded across industries such as entertainment and gaming, where visual effects (VFX) and motion capture (MoCap) are widely used to create immersive experiences and digital characters [1]. More recently, commercially viable technologies have emerged that combine these elements, enabling the 3D reconstruction of real people and their integration into diverse virtual environments [2].
Two primary technologies support realistic 3D human modeling. The first, volumetric capture, uses arrays of cameras to generate high-resolution 3D meshes and textures. While this method produces highly detailed results, it requires extensive camera setups, significant processing power, and high bandwidth, making it impractical for real-time applications. The second, motion capture (MoCap), focuses on recording movement rather than texture. MoCap can be implemented with wearable sensors or camera-based systems, producing less data and enabling faster processing, though at the cost of realistic texture reconstruction and visual fidelity.
Volumetric methods are generally infrastructure-heavy and not optimized for real-time applications [3], whereas camera-based MoCap systems are more scalable and better suited for interactive use cases [4]. Such systems have been increasingly applied in gaming, online avatars, and virtual presenters [5]. They offer immersive experiences while protecting user privacy by abstracting personal identity.
The rapid progress of computer vision and artificial intelligence (AI) has significantly advanced MoCap technology [6]. Historically, marker-based systems dominated in gaming and entertainment, requiring performers to wear sensor-equipped suits or rely on specialized cameras. Recent developments in deep learning, particularly convolutional neural networks (CNNs), have enabled posture estimation using only RGB cameras. This approach has broadened accessibility, reduced hardware dependence, and expanded adoption across multiple fields.
As the demand for precise motion tracking grows, 3D human pose estimation has become a key research area. Methods are commonly categorized into multi-view and single-view models, with multi-view approaches achieving higher accuracy by integrating inputs from multiple cameras [7]. These advancements have opened new opportunities for interactive, immersive environments.
Recent trends in digital cultural heritage have increasingly emphasized immersive, interactive, and data-driven methods for documenting and disseminating cultural assets. IoT–XR–based frameworks have been explored for large-scale reconstruction, restoration support, and intelligent guided-tour experiences, demonstrating how emerging technologies can broaden public access to cultural heritage [8,9]. In parallel, studies on virtual and metaverse-based museums have examined user trust, engagement, and social cognition in digitally mediated heritage experiences [10,11,12].
At the same time, research on camera-based posture analysis has expanded rapidly in both digital-ergonomics and cultural-heritage applications. OpenPose-style 2D keypoint pipelines are widely used in low-cost, multi-camera settings for quantifying human movement, while depth-enabled systems such as ZED provide geometric cues that improve pose disambiguation in heritage documentation and motion analysis tasks. These technological developments collectively highlight the growing need for real-time, high-fidelity, and culturally contextualized 3D representations of intangible heritage—particularly traditional performing arts—motivating the direction of the present study.
This study presents a real-time live streaming platform for digitizing and animating traditional performing arts, demonstrated through Korean traditional dance (Figure 1). The platform supports multi-performer scenarios and leverages a distributed edge computing infrastructure. It integrates multi-view 3D pose estimation with 3D coordinate synthesis, enabling accurate, coherent tracking of multiple dancers in dynamic environments. Reconstructed motion data are mapped onto pre-modeled, rigged 3D avatars, generating real-time animated virtual performances that can be enriched with virtual effects and cultural backdrops.
Despite these technological advances, applying real-time 3D human digitization to intangible cultural heritage—especially multi-performer traditional dance—remains technically challenging. Specifically, this study focuses on addressing the following key challenges: (1) latency and bandwidth constraints for real-time multi-camera 3D motion capture and streaming; (2) multi-performer robustness, including occlusion handling, depth-ambiguity resolution, and identity consistency during complex ensemble choreography; and (3) cultural authenticity, ensuring that reconstructed movements, costumes, and symbolic gestures faithfully preserve the characteristics of traditional performances.
These points define the technical and cultural challenges that our framework seeks to address. In summary, our main contributions are as follows:
We propose a lightweight, edge-based platform that integrates AI-driven pose estimation with avatar-based animation, validated through demonstrations of Korean traditional dance.
We achieve bandwidth-efficient real-time streaming, reducing raw HD video transmission (~1 Gbps) to skeleton-based motion data (~64 Kbps) while maintaining 60 fps rendering quality.
We ensure high realism and interactivity through multi-device synchronization across a distributed edge network, enabling smooth, immersive delivery of cultural performances in digital form.
3. Proposed Framework
The proposed real-time streaming platform, extended and restructured from our prior work [24,25] and illustrated in Figure 2, comprises two primary components: a motion capture system that acquires real-time 3D motion data from performers using multiple synchronized cameras, and a 3D reconstruction system that integrates pre-modeled character avatars. The platform generates and streams 3D content by combining real-time motion data with pre-built 3D character models.
Unlike our previous work [25], which focused on technical performance evaluation of multi-view 3D pose estimation and synchronization accuracy, the present study extends that validated backbone toward a real-time, interactive platform for cultural-heritage applications. In particular, the system newly integrates edge-based lightweight inference, skeleton-only data transmission for bandwidth reduction, and real-time avatar retargeting with physics-based simulation, enabling expressive reproduction of traditional performances in virtual environments.
The motion capture system consists of multiple cameras connected to edge AI devices and a central server. Edge devices detect 2D human poses and extract appearance features from RGB images, transmitting only this lightweight data to the central server. Because only pose information is transmitted, network bandwidth requirements are typically on the order of tens of kbps.
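To make this figure concrete, the following minimal sketch packs one hypothetical per-frame message in Python, ignoring the appearance-feature payload and assuming 25 keypoints quantized to 16-bit pixel coordinates with an 8-bit confidence; the actual wire format of our implementation may differ.

```python
import struct

NUM_KEYPOINTS = 25   # assumed skeleton size (e.g., a Body-25-style layout)
FPS = 60             # streaming/reconstruction rate used in the demonstrations

def pack_pose_message(device_id: int, timestamp_ms: int, keypoints) -> bytes:
    """Serialize one per-camera 2D pose: (u, v, confidence) per keypoint."""
    header = struct.pack("<Hq", device_id, timestamp_ms)   # 2 + 8 bytes
    body = b"".join(
        struct.pack("<HHB", u, v, conf)                    # 2 + 2 + 1 bytes
        for (u, v, conf) in keypoints
    )
    return header + body

# Back-of-envelope bandwidth for one camera observing one performer.
frame = pack_pose_message(1, 0, [(640, 360, 255)] * NUM_KEYPOINTS)
kbps = len(frame) * 8 * FPS / 1000
print(f"{len(frame)} bytes/frame -> ~{kbps:.1f} kbps")     # ~64.8 kbps
```

Under these assumptions, a single camera view of one performer requires roughly 65 kbps, several orders of magnitude below raw HD video rates and in line with the skeleton-stream figures reported above.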
At the central server, 2D pose data are organized by timestamp and reconstructed into 3D poses using geometric triangulation. Compared with conventional systems, this architecture is better suited for real-time applications because deep learning tasks are distributed across edge devices, substantially reducing network traffic. The server groups 2D poses based on geometric similarity and reconstructs them into 3D coordinates through multi-view triangulation.
The platform employs pre-modeled 3D avatars representing the target characters. Each avatar, composed of 3D meshes and texture data, is animated in real time by combining it with motion data from the capture system. The resulting avatars mimic the performer’s movements naturally within a virtual environment. Animated avatars are synchronized with background effects and rendered as part of the live streaming output.
Models for 3D motion capture [6,7,13,14] achieve high accuracy in monocular 3D pose estimation through deep temporal learning, while models for avatar generation [20,21,22,23] demonstrate advanced neural rendering for photorealistic avatar generation. However, these methods are generally optimized for offline reconstruction or single-person settings, limiting their applicability to real-time multi-performer cultural scenarios.
In contrast, our proposed Hybrid System integrates the efficiency of keypoint-based motion capture with the realism of pre-scanned avatars, achieving low latency and high fidelity suitable for live streaming of traditional performances. The comparative positioning of these approaches is summarized in Table 1, which situates the proposed method between motion-capture- and avatar-generation-based paradigms.
4. System Implementation
The proposed platform is a real-time live streaming system that integrates three key components: (1) edge AI-based 3D motion capture, (2) 3D avatar creation, and (3) real-time motion-to-avatar animation. It uses a distributed network of cameras and edge processors to extract human motion, which is combined with pre-modeled avatars to generate and stream realistic virtual performances.
To clarify the computational workflow, the edge devices do not perform model splitting or distributed inference in the strict sense. Instead, each edge node independently executes lightweight 2D pose estimation on its paired camera stream and transmits only the detected 2D keypoints to the central server. Because multi-view fusion is performed only once at the server side, this design naturally distributes the computational load: the edge devices handle dense per-frame inference, while the server performs triangulation, identity matching, and final 3D reconstruction. This architecture significantly reduces server-side processing requirements and removes the need for transmitting high-bandwidth video streams, enabling stable real-time operation even with multiple performers.
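The resulting division of labor can be summarized in the short sketch below, where match_identities and triangulate_dlt are illustrative placeholders for the server-side modules described in Section 4.1 rather than the actual implementation.

```python
from collections import defaultdict

DELTA_T_MS = 20  # acceptable synchronization window in ms (assumed; see Section 4.1)

def server_loop(incoming_messages, num_views, match_identities, triangulate_dlt):
    """Fuse per-camera 2D poses into 3D skeletons on the central server.

    incoming_messages yields (camera_id, timestamp_ms, poses_2d) tuples produced
    by the edge devices, which have already run the dense 2D inference locally.
    """
    windows = defaultdict(dict)                 # window index -> {camera_id: poses}
    for camera_id, t_ms, poses_2d in incoming_messages:
        key = t_ms // DELTA_T_MS                # messages in the same window co-occur
        windows[key][camera_id] = poses_2d
        if len(windows[key]) == num_views:      # in practice any >= 2 views suffice
            group = windows.pop(key)
            people = match_identities(group)    # cluster 2D poses per performer
            yield [triangulate_dlt(views) for views in people]
```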
The following subsections describe the implementation details and specific roles of each subsystem.
4.1. Edge AI-Based 3D Motion Capture
To maintain focus on the cultural-heritage application, we provide here only the conceptual overview of the multi-view synchronization, PDJ evaluation, and depth-aided identity matching modules. The complete algorithmic specifications are fully detailed in our prior study [25], which this system directly builds upon.
Each edge device independently performs 2D human pose estimation on its respective image input and forwards the resulting data to a centralized server, as illustrated in Figure 3. The n-th edge device transmits its 2D keypoint coordinates P_n and the corresponding timestamp t_n to the central server, where P_n comprises the keypoint coordinates {u_n, v_n} and the confidence score c_n obtained from the n-th camera. The central server collects {P_1, …, P_N}, a set of 2D keypoint coordinates synchronized by timestamp, forming a synchronized group of 2D poses. It then reconstructs a 3D pose P_w = {x_w, y_w, z_w} with confidence score c_w by applying geometric triangulation to the synchronized set of 2D human poses via the DLT method [6].
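For completeness, the per-joint triangulation can be expressed with the standard DLT linear-least-squares formulation, sketched below under the assumption that calibrated 3×4 projection matrices are available for all views; this is a generic reference implementation, not the exact code of our server module.

```python
import numpy as np

def triangulate_point_dlt(projections, points_2d, confidences=None):
    """Triangulate one 3D joint from two or more calibrated views via DLT.

    projections: list of 3x4 NumPy camera projection matrices
    points_2d:   list of (u, v) pixel coordinates of the same joint in each view
    confidences: optional per-view confidence scores used as row weights
    """
    rows = []
    for i, (P, (u, v)) in enumerate(zip(projections, points_2d)):
        w = 1.0 if confidences is None else confidences[i]
        rows.append(w * (u * P[2] - P[0]))      # standard DLT constraint rows
        rows.append(w * (v * P[2] - P[1]))
    A = np.stack(rows)
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                                  # homogeneous least-squares solution
    return X[:3] / X[3]                         # world coordinates (x_w, y_w, z_w)
```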
Synchronization of 2D poses from multiple edge devices is necessary because perfectly aligned timestamps are generally impossible. To address this, data within an acceptable error range Δt are considered co-occurring. Green in Figure 4 (left) illustrates data sent within the acceptable time window for each device, while red indicates data outside the window. For DLT reconstruction, at least two data points are required; if fewer than two are received, the previous frame's values are used. However, this scenario is unlikely in the proposed distributed system due to the number of edge devices. Performance of the time synchronization algorithm is evaluated using the percentage of detected joints (PDJ) [8]:

PDJ = \frac{1}{N} \sum_{i=1}^{N} B(d_i < \sigma \cdot D) \times 100\%,

where d_i is the Euclidean distance between the i-th predicted keypoint and the corresponding ground-truth keypoint, D denotes the Euclidean distance of the 3D bounding box of the human body, σ is a distance threshold used to verify the estimation accuracy, N is the total number of keypoints, and B(·) is a Boolean function returning 1 if the condition is true and 0 otherwise.
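A direct implementation of this metric takes only a few lines of NumPy, assuming the predicted and ground-truth joints are provided as matched arrays; the default threshold value below is an assumption for illustration.

```python
import numpy as np

def pdj(pred, gt, sigma=0.2):
    """Percentage of Detected Joints (PDJ).

    pred, gt: (N, 3) arrays of predicted and ground-truth 3D joint positions
    sigma:    distance threshold as a fraction of the body bounding box (assumed default)
    """
    d = np.linalg.norm(pred - gt, axis=1)                  # d_i per joint
    D = np.linalg.norm(gt.max(axis=0) - gt.min(axis=0))    # 3D bounding-box diagonal
    return float(np.mean(d < sigma * D) * 100.0)           # share of joints within sigma*D
```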
An appropriate Δt can be determined from the computer simulation results shown on the right side of Figure 4. These results illustrate the PDJ across varying Δt values for eight edge devices, with detection performance adjusted via the threshold σ. The simulations show that setting Δt to approximately 33 ms for a fixed 30 fps configuration achieves a PDJ of around 80%, while for a 60 fps configuration, a Δt of ~20 ms yields a PDJ of nearly 90%. In practical field applications, the output interval Δt should be configured based on factors such as the number of cameras and edge devices, edge processing speed, network conditions, and other relevant system parameters.
To distinguish multiple individuals, the central server matches 2D poses across views using depth-sensing results. Calibrated cameras are aligned to a world plane, and 2D poses are grouped by minimizing Euclidean distances with a clustering algorithm, such as agglomerative clustering, applied to the corresponding depth-sensing results. Each cluster corresponds to the 2D poses of the same individual across multiple views. Person matching is performed for every frame, as illustrated in Figure 5.
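Conceptually, this grouping step can be reproduced with an off-the-shelf agglomerative clustering routine, as in the sketch below; the ground-plane coordinates and the 0.5 m distance threshold are illustrative assumptions rather than the parameters of our deployed system.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def group_poses_by_person(world_xy, distance_threshold=0.5):
    """Group per-view pose locations (projected to the ground plane) by performer.

    world_xy: (M, 2) array with one row per 2D pose detection across all views
    distance_threshold: maximum within-person spread in meters (assumed value)
    """
    clustering = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=distance_threshold,
        linkage="average",
    )
    labels = clustering.fit_predict(world_xy)
    # Map each cluster label to the indices of the detections it contains.
    return {int(k): np.flatnonzero(labels == k) for k in np.unique(labels)}
```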
4.2. 3D Scan-Based Avatar Creation
The 3D reconstruction system, shown in Figure 6, uses multiple 8K DSLR cameras along with handheld scanners (Artec Leo) to capture high-resolution models of the performer, including the rigid body, clothing, and props. This system produces high-quality human model data, ensuring both precision and realism. Each component (body, costume, and props) is scanned individually. The process includes cleanup, retopology, and rigging to maintain data quality comparable to real-life models. Rigging allows the 3D model to integrate seamlessly with motion capture data, enabling dynamic and natural animation.
For soft-body components such as clothing, reconstructed models are simulated over the rigid body base. These non-rigid elements respond to captured motion data and are animated in real time using a physics engine, which calculates realistic fabric behavior within the virtual environment. This approach enhances visual realism and allows lifelike interactions between the avatar and its surroundings. While physical simulation is not the primary focus of this study, our implementation leverages Unity’s built-in physics material system to achieve real-time animation effects.
4.3. Real-Time Motion-to-Avatar Animation
The real-time motion-to-avatar animation module maps 3D skeletal data from the motion capture system onto the pre-modeled and rigged 3D avatars. This process generates natural, real-time animations for multiple performers simultaneously. To enhance realism, an inverse kinematics (IK) library is used. By utilizing the estimated 3D joint positions of key body parts, such as hands, feet, and elbows, the system infers plausible positions and rotations for other joints, enabling smooth and physically accurate movement of multi-joint structures.
To satisfy real-time constraints, the IK solver operates within the rendering loop and is executed immediately after receiving updated 3D joint positions from the motion capture module. We adopt Unity’s lightweight IK library, which performs local joint optimization with a bounded number of iterations (typically 1–2 per frame). This design keeps the computation within a 3–5 ms per-frame budget, ensuring that avatar retargeting consistently runs at 60 fps without interrupting the overall streaming pipeline.
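Although the platform relies on Unity's IK tooling, the underlying idea of inferring intermediate joints from estimated end-effectors can be illustrated with a minimal analytic two-bone solver (e.g., shoulder–elbow–wrist); the fixed bend-axis hint below is an arbitrary assumption made for the sketch.

```python
import numpy as np

def two_bone_ik(root, target, l1, l2):
    """Analytic two-bone IK: given a root joint (e.g., shoulder), a target
    end-effector (e.g., an estimated wrist), and bone lengths l1, l2, return
    plausible mid-joint (elbow) and clamped end-effector positions."""
    root = np.asarray(root, dtype=float)
    target = np.asarray(target, dtype=float)
    delta = target - root
    dist = max(np.linalg.norm(delta), 1e-9)
    dir_rt = delta / dist
    d = np.clip(dist, abs(l1 - l2) + 1e-6, l1 + l2 - 1e-6)  # keep target reachable
    a = (l1**2 - l2**2 + d**2) / (2.0 * d)     # law of cosines: along-axis offset
    h = np.sqrt(max(l1**2 - a**2, 0.0))        # perpendicular (bend) offset
    up = np.array([0.0, 1.0, 0.0])             # assumed bend-plane hint
    bend = np.cross(dir_rt, up)
    if np.linalg.norm(bend) < 1e-6:            # direction parallel to the hint
        bend = np.array([1.0, 0.0, 0.0])
    bend /= np.linalg.norm(bend)
    mid = root + a * dir_rt + h * bend         # inferred elbow position
    end = root + d * dir_rt                    # clamped end-effector position
    return mid, end

# Example: infer an elbow from shoulder/wrist estimates with 30 cm and 28 cm bones.
elbow, wrist = two_bone_ik([0.0, 1.4, 0.0], [0.35, 1.1, 0.2], 0.30, 0.28)
```

A per-limb solve of this kind is computationally trivial, which is why a bounded number of iterations per frame keeps the retargeting step within the per-frame budget noted above.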
To maintain consistency between captured motions and assigned avatars, each 3D joint sequence is assigned a unique ID during the motion capture stage. This ID ensures that motion data is correctly mapped to the corresponding avatar, preventing model-switching artifacts during animation.
As illustrated in Figure 7, virtual backgrounds and additional effects can be integrated during rendering. The final output delivers an immersive and visually rich 3D performance experience from dynamic camera perspectives, allowing end users to engage with the content with enhanced realism and spatial interaction.
5. Experimental Results
5.1. Environment Setup
We developed a real-time demonstration environment to validate the feasibility of the proposed platform. Since the focus of this study is on practical functionality in a live demo, standard algorithm performance datasets were not used. Instead, results were obtained by directly recruiting performers in a real-world setting. As shown in Table 2 and Figure 8, the demonstration setup enabled real-time visualization of 3D pose estimation results on a display screen. Eight RGB cameras were connected to eight edge devices, each performing 2D pose estimation on single-view RGB images and outputting results with corresponding timestamps. All edge devices were linked to a central server via an IP switch over an Ethernet network. Because only pose data, not raw image data, was transmitted, network traffic was limited to a few hundred kbps. The central server aggregated the 2D pose data from all edge devices and reconstructed 3D poses at a constant rate of 60 fps. The process was monitored on-site in real time using GUI software (Unity 2023). Measured latency between the live scene and the GUI display was approximately 200 ms, providing sufficient responsiveness for an interactive experience.
5.2. Real-Time Demonstration
An experiment was conducted using a dancing performer to evaluate the feasibility of reproducing various human actions in real time within the implemented demonstration environment. The goal was to determine whether the pre-modeled virtual avatar could accurately replicate the performer’s movements while providing an interactive experience. The results are shown in Figure 9. The avatar demonstrated high responsiveness and accurately expressed actions such as standing, looking forward, looking back, and sitting, confirming its suitability as part of a real-time live streaming platform.
Additional experiments involved multiple performers, focusing on eight individuals. A key objective was to assess whether the system could effectively manage occlusions between performers. Real-time virtual content was also generated with integrated background elements, illustrating the platform’s potential for expansion into broader content services, as shown in Figure 10.
The developed technology leverages AI-based 3D performance generation by estimating performer poses and synthesizing 3D positions using only multi-camera RGB images. Although the system may be slightly more costly and exhibit marginally lower fidelity than conventional motion capture systems, it offers significant advantages in usability; and although its visual restoration quality may be somewhat lower than that of volumetric systems, it excels in cost efficiency and real-time processing. Considering factors such as mobility, responsiveness, and economic feasibility, this technology demonstrates strong potential for commercialization in real-time virtual performance content creation, including the preservation and digital presentation of traditional performances.
5.3. Performance Analysis
To contextualize the performance of the proposed framework, we compare it against two representative state-of-the-art approaches: ConvNeXtPose [18], a recent single-view 3D human pose estimation model, and GaussianAvatar [21], an advanced avatar-generation pipeline based on Gaussian Splatting. These models represent the dominant paradigms in (1) learning-based 3D pose estimation and (2) high-fidelity digital human reconstruction, respectively.
Table 3 summarizes the key differences in terms of input modality, latency, accuracy, real-time feasibility, and streaming capability.
Our system differs significantly from ConvNeXtPose, which supports real-time inference on a single GPU but lacks a full real-time, multi-view, multi-performer, streaming-capable pipeline. ConvNeXtPose operates on single-view RGB images and provides only inference-level performance; it does not include multi-camera synchronization, triangulation, identity preservation, or avatar retargeting, which are essential for ensemble dance capture. In contrast, the proposed framework performs distributed inference across synchronized edge devices, reconstructs multi-view 3D skeletons, and streams animation-ready motion data to the server in real time.
When compared with GaussianAvatar, our system provides complementary strengths. GaussianAvatar excels at generating high-quality static digital humans but relies on offline reconstruction, with reconstruction times on the order of several minutes. Although its rendering performance is high, the overall pipeline cannot support real-time cultural performance streaming or multi-performer interaction. The proposed system, by contrast, focuses on real-time operation: it achieves approximately 200 ms end-to-end latency, supports up to eight performers simultaneously, and maintains a stable 60 fps streaming rate. The current ~200 ms latency is adequate for real-time streaming, and future work will explore further reduction through network and pipeline optimization.
In terms of accuracy, the proposed system achieves 94.6–96.1% PDJ, which is comparable to modern learning-based approaches. Although ConvNeXtPose provides competitive accuracy in single-view scenarios, it generally underperforms in multi-person or multi-view conditions because it lacks geometric triangulation. Meanwhile, GaussianAvatar does not provide joint-level accuracy metrics because its objective is appearance reconstruction rather than pose accuracy. The accuracy of our method benefits from the multi-view triangulation module developed in prior work, which improves tolerance to occlusion—particularly important in Korean traditional dance, where wide sleeves, props, and overlapping arm motions are common. While commercial MoCap systems may show slightly higher accuracy, our multi-view RGB pipeline demonstrated stable performance for real-time streaming, with minor gaps arising mainly from sleeve occlusion and overlapping silhouettes. The measured PDJ scores indicate that the accuracy is sufficient for ensemble dance reconstruction, and further refinement under heavy occlusion is planned for future work.
Finally, the streaming capability of the proposed pipeline distinguishes it from both baselines. By transmitting only lightweight 2D keypoints from each edge device, the system maintains network bandwidth below 1 Mbps per performer, enabling stable real-time streaming even in multi-performer conditions. Neither ConvNeXtPose nor GaussianAvatar provides an end-to-end streaming architecture, making our method more suitable for live cultural-heritage performances, remote exhibitions, or virtual museum applications.
5.4. Technical Implications for Cultural-Heritage Preservation
The proposed framework offers several technical advantages that contribute directly to the preservation and digital transmission of intangible cultural heritage, particularly in the context of traditional Korean performing arts. Unlike conventional 3D capture pipelines—which primarily focus on single-performer motion acquisition or offline reconstruction—our system integrates real-time multi-view synchronization, multi-performer 3D motion reconstruction, and volumetric avatar–based retargeting to preserve cultural expressions with higher fidelity and contextual richness.
First, the real-time multi-view pipeline enables the preservation of embodied motion semantics, including subtle rhythmic variations, expressive upper-body gestures, and choreographic structures that define traditional dance idioms. Such motion-level information is often lost or simplified in single-view or offline systems, especially in performances involving wide sleeves, long silhouettes, or overlapping arm trajectories. By resolving depth ambiguity through synchronized triangulation, the system captures motion dynamics with greater robustness to occlusion—a critical requirement for documenting dances that incorporate large costume elements or handheld props.
Second, the framework allows the preservation of ensemble interactions, which are central to many traditional dance forms. Multi-performer tracking supports the reconstruction of group spacing, formation changes, and relational body movements, all of which constitute essential components of intangible cultural heritage. Existing pipelines seldom address these multi-person spatiotemporal relationships due to their reliance on offline volumetric processing or monocular pose estimation.
Third, the integration of high-fidelity volumetric avatar scans ensures that culturally meaningful costume characteristics—such as sleeve length, garment flow, and the silhouette of traditional attire—are reflected in the final virtual performance. Through physics-based simulation and avatar retargeting, the system preserves both the motion and the esthetic symbolism conveyed by traditional costumes.
Finally, the lightweight streaming architecture enables low-bandwidth, real-time dissemination of reconstructed performances, making the captured heritage accessible in remote or distributed environments such as museums, education centers, and virtual exhibitions. This contributes not only to archival preservation but also to broader public engagement and re-experiencing of traditional performances in immersive digital formats.
In the demonstrations conducted for this study, eight trained performers from a Korean traditional dance program participated in the evaluation. Their height range (158–176 cm) and experience levels (5–12 years) provided a realistic representation of typical ensemble choreography. The recorded sequences included basic steps, sleeve-driven arm motions, and short excerpts from commonly practiced dance routines. During these sessions, the system maintained stable performance, achieving an ID-matching accuracy of 97.2%, a frame-drop rate below 1.8%, and PDJ scores consistent with those presented in Section 5.3. These results indicate that the proposed pipeline remains reliable even in occlusion-prone multi-performer settings characteristic of traditional dance.
5.5. Cultural Authenticity and Ethical Considerations
To ensure that the proposed framework not only captures motion data but also preserves the cultural authenticity of traditional Korean performances, several measures were incorporated throughout the digitization and animation process. First, costumes and props—key elements that carry symbolic meaning in many traditional dance forms—were digitized using high-resolution 3D scanning. This allowed the system to faithfully reproduce the geometry, texture, and visual characteristics of garments such as long sleeves, layered fabrics, and ornamental accessories, all of which play an essential role in shaping the expressive qualities of Korean dance.
Second, culturally meaningful gestures were reviewed in close collaboration with professional traditional dancers. Particular attention was given to subtle but symbolically significant motions, including sleeve trajectories, fingertip articulation, and ritualized arm poses. During the retargeting stage, these gestures were prioritized to prevent semantic distortion, and motion sequences with cultural importance were archived together with metadata annotations. This ensures that the captured material can serve not only as a visual reproduction but also as a contextually interpretable cultural record for future researchers, curators, and educators.
From an ethical standpoint, all performers provided written informed consent prior to participating in the data acquisition process, and an anonymized version of the consent form has been submitted according to journal requirements. The collected data are used strictly for research and educational purposes, and participants retain the right to request data deletion or restricted access. Data ownership and usage conditions were discussed and agreed upon in advance, ensuring that the digitization of intangible cultural heritage respects both individual rights and the broader cultural value of the archived material.
6. Conclusions
The demand for video live streaming continues to grow, driven by increasing data traffic, rising network costs, and the technical challenges of delivering real-time performance. At the same time, interest in virtual content, such as VFX and motion capture, is expanding, creating new opportunities for immersive entertainment and interactive live experiences. This study presented a real-time live streaming framework that integrates motion capture and 3D reconstruction technologies. The developed platform leverages AI-based 3D performance generation, estimating human poses and synthesizing 3D positions using only multi-camera image input. The system was demonstrated with both single- and multi-performer scenarios, including traditional Korean dance, highlighting its potential for preserving and digitally presenting intangible cultural heritage in an interactive format.
While the system may incur slightly higher implementation costs and marginally lower accuracy compared to conventional commercial motion capture systems, it provides substantial advantages in usability, flexibility, and real-time responsiveness. Similarly, although reconstruction fidelity may not fully match that of volumetric systems, the platform excels in cost efficiency and practical real-time performance. Considering its scalability, responsiveness, and economic feasibility, the proposed framework offers a promising solution for commercializing real-time virtual performance content, with applications ranging from entertainment and education to the preservation and dissemination of traditional performing arts.
Looking ahead, two technical extensions represent promising directions for future work. First, integrating a lightweight real-time feedback interface—allowing performers to monitor their reconstructed motions during capture—could further enhance interactivity and improve motion fidelity. Second, adopting higher-level predictive models such as temporal networks or reinforcement-learning-based motion smoothing may yield more stable trajectories for complex or dynamic choreography, provided such methods can be incorporated without compromising real-time performance. These extensions will be explored as part of the next phase of system refinement.