Article

Kinematic Skeleton Extraction from 3D Model Based on Hierarchical Segmentation

by Nitinan Mata 1 and Sakchai Tangwannawit 2,*

1 Department of Information Technology, Faculty of Information Technology and Digital Innovation, King Mongkut’s University of Technology North Bangkok, Bangkok 10800, Thailand
2 Department of Information Technology Management, Faculty of Information Technology and Digital Innovation, King Mongkut’s University of Technology North Bangkok, Bangkok 10800, Thailand
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(6), 879; https://doi.org/10.3390/sym17060879
Submission received: 12 April 2025 / Revised: 24 May 2025 / Accepted: 27 May 2025 / Published: 4 June 2025

Abstract: A new approach for skeleton extraction has been designed to work directly with 3D point cloud data. It blends hierarchical segmentation with a multi-scale ensemble built on top of modified PointNet models. Outputs from three network variants trained at different spatial resolutions are aggregated using majority voting, unweighted averaging, and adaptive weighting, with the latter yielding the best performance. Each joint is set at the center of its part. A radius-based filter is used to remove any outliers, specifically, points that fall too far from where the joints are expected to be. When evaluated on benchmark datasets such as DFaust, CMU, Kids, and EHF, the model demonstrated strong segmentation accuracy (mIoU = 0.8938) and low joint localization error (MPJPE = 22.82 mm). The method generalizes well to an unseen dataset (DanceDB), maintaining strong performance across diverse body types and poses. Compared to benchmark methods such as L1-Medial, Pinocchio, and MediaPipe, our approach offers greater anatomical symmetry, joint completeness, and robustness in occluded or overlapping regions. Structural integrity is maintained by working directly with 3D data, without the need for 2D projections or medial-axis approximations. The visual assessment of DanceDB results indicates improved anatomical accuracy, even in the absence of quantitative comparison. The outcome supports practical applications in animation, motion tracking, and biomechanics.

1. Introduction

The process of generating high-detail 3D models has become far more efficient in recent years, thanks to significant advances in modeling and scanning technologies [1,2]. These tools are now widely applied across domains—from engineering and healthcare to digital media and scientific visualization—supporting tasks such as fluid dynamics analysis, remote sensing, and immersive virtual environments [3]. In computer graphics and games in particular, 3D models are essential for visual realism. However, static 3D geometry alone is often insufficient for tasks involving motion or interaction. To address this, skeletal representations are commonly employed to abstract structural motion, enabling applications such as character rigging, biomechanical modeling, and motion-driven animation.
Rigging refers to the process of building an internal skeleton beneath a 3D character or object to enable physically plausible and visually convincing motion. While essential to animation and motion capture, creating accurate kinematic skeletons remains labor-intensive and requires expert knowledge [4]. These challenges have motivated recent efforts toward automation, laying the foundation for learning-based approaches to skeleton extraction.
A skeleton—comprising joints and their interconnections—serves as the core framework in 3D modeling and plays a pivotal role in applications requiring motion abstraction, including medical animation, motion modeling, and 3D scanning [5]. However, conventional skeleton extraction methods often struggle with highly articulated poses, self-occlusion, and close limb proximity, scenarios common in complex animation or motion capture. These limitations reduce their effectiveness and restrict their applicability in real-world settings.
Skeleton extraction abstracts a 3D object into a simplified internal structure—referred to as a skeleton—rather than processing its entire surface geometry. This representation enables more efficient motion tracking and structural manipulation, making it especially valuable in animation, 3D editing, and other applications where both spatial coherence and articulation matter.
Skeletons can be broadly categorized into two types: curve skeletons and kinematic skeletons. Curve skeletons trace the medial axis of a 3D model, providing a simplified representation of its topology that supports tasks such as shape analysis, segmentation, and recognition [6]. Techniques such as local separator-based refinement have been proposed to improve robustness under noisy or incomplete data [7]. However, curve skeletons often lack explicit articulation information, limiting their use in applications that require joint-based motion. In contrast, kinematic skeletons represent articulated motion through a hierarchy of joints and bones, making them essential in animation, robotics, and biomechanics. These structures enable physically plausible movement and are critical for simulating joint behavior. Recent research has proposed more adaptive solutions—including automatic rigging and flexible joint path modeling—to improve realism in joint deformation and skin dynamics [8,9]. However, most existing approaches still rely on heuristics or template-based constraints, which can hinder generalization across diverse body types and poses.
Recent advances in deep learning have significantly transformed approaches to kinematic skeleton extraction and analysis. Leveraging architectures such as graph neural networks (GNNs) and convolutional neural networks (CNNs), researchers can now model joint behavior and estimate poses with greater precision than previously achievable [10,11]. To improve robustness under noisy or incomplete input, several studies have incorporated biomechanical constraints into optimization frameworks [12]. The availability of fine-grained geometric input, enabled by modern scanning and depth-sensing technologies, has further strengthened learning-based models. Large-scale datasets such as AMASS [13] offer diverse motion sequences across a range of body types and activities, enhancing generalizability. Despite these advances, unresolved challenges remain, including trade-offs between real-time performance, accuracy, and robustness in dynamic or occluded scenarios. Many existing methods continue to rely on surface simplification, template fitting, or fixed-topology assumptions, limiting their adaptability. For example, L1-Medial, Pinocchio, and MediaPipe depend on medial-axis heuristics, mesh templates, or 2D-to-3D projection. These techniques often fail under self-occlusion, noisy scans, or anatomically flexible structures. This underscores the need for learning-based alternatives that can operate directly on 3D geometry while preserving spatial fidelity.
Several state-of-the-art models have addressed 3D human pose estimation using either image-based or point-based approaches. SPIN [14], for example, estimates joint positions and mesh parameters from single RGB images using regression-based learning but relies heavily on SMPL templates. More recently, models such as PointFormer [15] and PointHPS [16] have advanced the field by directly processing raw point cloud data, improving robustness under occlusion and complex poses. However, many of these methods still depend on predefined joint templates or exhibit sensitivity to noisy input. In contrast, our method operates directly on point cloud geometry without requiring mesh fitting or template alignment, ensuring greater flexibility and anatomical consistency in skeleton extraction.
To overcome the limitations of prior skeleton extraction methods, we propose a novel learning-based framework that directly estimates joint structures from 3D point cloud data using multi-scale segmentation. The core of the system combines a modified PointNet architecture with hierarchical part segmentation, enabling spatially adaptive processing through three network variants trained at distinct resolutions. To leverage their complementary strengths, we introduce a dynamic ensemble weighting strategy that adjusts each model’s contribution based on segmentation performance. Joint positions are then refined using a Limit-Radius centroid method, which enhances robustness under noisy or incomplete input.
The framework was evaluated across multiple standard datasets and further tested on an unseen set (DanceDB) to assess generalization. The results show strong performance under occlusion and limb overlap, with consistent anatomical accuracy across various body types and motion styles. Its modular architecture facilitates integration into real-time pipelines for animation, biomechanics, and robotics. Moreover, the system supports symmetry-aware modeling by maintaining structural balance between left and right body segments, contributing to anatomically faithful skeleton construction.

2. Related Work

Research on skeleton extraction from 3D models has made considerable progress in recent years, and yet it remains technically demanding due to challenges such as noise, fragmentation, and complex topologies. Prior work has addressed these challenges from multiple perspectives, including point cloud preprocessing, deep segmentation networks, ensemble learning, and the extraction of kinematic skeletons via joint estimation. This section reviews relevant approaches in four key areas that inform the proposed method: (1) skeleton extraction techniques, (2) point cloud preprocessing, (3) segmentation networks, and (4) ensemble-based models for structural learning.

2.1. Skeleton Extraction Methods for 3D Models

Extracting a skeleton from a 3D model is a fundamental task in computer vision, shape analysis, and motion modeling. The goal is to generate a simplified representation that captures the model’s internal structure while reducing geometric complexity. However, this task is complicated by common issues such as missing regions, self-intersections, and irregular geometries. Existing techniques can be broadly classified into four main categories: medial axis methods, boundary propagation approaches, geometric heuristics, and learning-based models. Each method offers distinct advantages and trade-offs, particularly in handling flexible anatomy, noisy surfaces, or ambiguous topologies.

2.1.1. Thinning and Boundary Propagation Methods

Thinning-based approaches extract curve skeletons by progressively removing surface voxels while preserving the model’s core geometric and topological structure. Common variants include directional thinning, which proceeds along axis-aligned orientations; subfield sequential thinning, which partitions the model into localized regions to enhance clarity; and fully parallel thinning, which enables simultaneous voxel removal across multiple areas, offering high efficiency for large or dense models. These methods are generally simple and computationally efficient, but their performance degrades in the presence of noise, occlusion, or uneven sampling, often producing fragmented or anatomically inconsistent skeletons [17].

2.1.2. Distance-Field-Based Methods

Distance field methods determine the model’s medial axis by measuring how far each internal point is from the closest boundary, capturing the central structure in the process. These methods are particularly effective for capturing the central axis of elongated or tubular structures, and unlike thinning techniques, they do not rely on explicit surface boundaries to work. Most distance-field-based skeletonization methods follow a three-step process:
(1) Distance Field Computation—The distance field is built by measuring how far each voxel is from the nearest surface point, providing the foundation for the subsequent steps.
(2) Ridge Point Detection and Pruning—Key points are selected to form the skeleton, and spurious branches are removed. This step becomes unreliable when surfaces are uneven or noisy.
(3) Connectivity Restoration—After pruning, disconnected skeleton fragments are reconnected to maintain continuity, though achieving this often requires parameter fine-tuning.
One well-known approach is L1-median extraction [18], which offers a reasonable trade-off between computational cost and editability. In comparison, polygon-based techniques and distance blending often require manual intervention and fine parameter adjustment to generate reliable results.
Despite these advances, distance-field-based methods still exhibit notable limitations in computational cost, parameter sensitivity, and robustness under surface occlusion or irregular sampling. These challenges highlight the need for alternative approaches that can extract skeletal structures more directly and efficiently from raw 3D input data.

2.1.3. Geometric-Based Methods

Geometric-based methods interpret skeletal structures by analyzing mesh or point-based features to infer structural connectivity. Voronoi-diagram-based approaches divide the model into regions centered around each point, enabling skeleton construction by linking centroids, which are well suited for nonlinear geometry [19]. Reeb graphs convert surface topology into hierarchical structures that guide skeleton formation.
Among widely adopted frameworks, Pinocchio [20] constructs skeletons by fitting non-overlapping spheres within a mesh and linking them using graph-based heuristics such as the Gabriel graph. Its template-free design and consistent performance on watertight models have made it a common choice in general-purpose skeleton pipelines.
Other techniques—such as Random Center Shift (RCS) and Oriented Bounding Boxes (OBB)—locate endpoints and align branches with geometry, working reasonably well on imperfect input [1]. More recent studies incorporate machine learning to improve robustness under sparse or uneven data [21]. Despite these advances, many methods still rely on handcrafted rules or rigid assumptions, making them less effective for irregular, noisy, or occluded point clouds. These limitations highlight the need for learning-based frameworks that directly process raw geometric data while preserving structural integrity.

2.1.4. Deep Learning-Based Methods

Deep learning has significantly transformed skeleton extraction by enabling models to learn structural patterns directly from raw 3D data without relying on handcrafted geometric rules. A variety of neural architectures have been explored to improve how point clouds and voxels are processed effectively and robustly.
PointNet was among the earliest models designed to directly handle unordered point clouds, offering flexibility by avoiding reliance on grid structures [22]. VoxelNet, in contrast, converts point clouds into voxel grids to leverage 3D convolution and region proposal mechanisms, making it particularly effective in object detection contexts [23]. PointCNN introduces X-convolution to structurally reorganize point data, enhancing spatial feature learning and improving segmentation accuracy [24]. Similarly, PointConv integrates point density information, allowing the network to maintain consistent performance even when data are unevenly distributed [25].
Existing deep learning approaches for skeleton extraction fall into three main categories. (1) Image-based networks employ 2D or 3D CNNs with heatmap regression to infer joint locations from RGB or depth images [11]. (2) Classification tree methods such as Random Forests use point-wise voting for semantic segmentation and region-based joint inference [26]. (3) Direct 3D-model-based networks work with point clouds or meshes, capturing spatial structure more effectively. For instance, PointSkelCNN [27] segments body parts and infers joint positions, while SUPPLE [28] unwraps 3D meshes into 2D projections and applies heatmap-based detection.
Each category has its trade-offs: 2D methods are computationally efficient but limited in spatial depth awareness, classification tree models are interpretable but sensitive to handcrafted features, and 3D-based networks offer strong spatial reasoning at the cost of computational complexity.
More recently, transformer-based architectures such as Point Transformer [15] and Point-BERT [29] have emerged, utilizing self-attention to better model spatial dependencies. These approaches enhance segmentation and joint prediction but demand extensive training data and high computational resources. Meanwhile, practical pipelines like MediaPipe [30] achieve real-time performance by detecting 2D landmarks and lifting them to 3D. However, such methods are often sensitive to self-contact, occlusion, and depth ambiguity, especially in noisy or sparse inputs.
Collectively, these limitations underscore the need for lightweight learning-based frameworks that can infer anatomically consistent skeletal structures directly from raw geometric input while maintaining robustness across diverse conditions.

2.2. Preprocessing of Point Cloud Data for 3D Models

Point clouds offer detailed geometric representations and are widely used to capture complex 3D structures with high precision. However, their high density and large size often lead to substantial computational and storage demands, posing challenges for real-time processing and deep learning applications. Without optimization, using high-resolution point clouds in such contexts is often impractical. To address this, preprocessing steps are commonly applied to reduce point density while preserving the global shape and structural features. Effective downsampling ensures that the input to learning models remains efficient and structurally representative [31].
Several sampling strategies have been developed to address different trade-offs between fidelity and efficiency. Poisson Disk Sampling enforces a minimum distance between points, yielding uniformly distributed data that improves stability in downstream modeling tasks [32]. VoxelGrid Downsampling aggregates local point clusters into grid-based representatives, sacrificing fine detail for speed [33]. Random Sampling (RS) selects points randomly, offering speed but risking the loss of critical geometry [34]. Cluster-KNN Sampling reduces density via clustering while retaining key structural patterns through local neighborhood preservation [35]. Farthest Point Sampling (FPS) iteratively selects points with maximal separation, resulting in balanced coverage that has proven effective in skeleton-based learning tasks [36].
Each of these strategies contributes to making point cloud data substantially easier to process: they accelerate computation while preserving the overall shape, a property that is critical for skeleton extraction, complex geometries, and real-time applications.
While these preprocessing methods improve efficiency and reduce computational load, the choice of sampling strategy can significantly impact downstream performance. Over-simplification may lead to the loss of anatomical detail, while overly dense representations increase training cost. Hence, selecting a method that preserves structural integrity while maintaining scalability is critical for skeleton extraction tasks.

2.3. Deep Learning Models for Part Segmentation

Part segmentation plays a vital role in interpreting the internal structure of 3D point clouds, forming the basis for tasks such as kinematic skeleton extraction. However, the unstructured, sparse, and high-dimensional nature of point clouds presents challenges for deep learning models. Early solutions like PointNet introduced a per-point feature learning pipeline using MLPs and global max pooling to achieve order invariance. PointNet++ [37] extended this by introducing hierarchical grouping and multi-scale learning, enabling the capture of both fine and global geometric structures, which is particularly beneficial for segmentation.
Building on these foundations, Dynamic Graph CNN (DGCNN) [38] introduced graph-based feature aggregation to capture local neighborhood relationships, dynamically updating edges based on learned features. More recently, Point Transformer applied self-attention to model long-range dependencies, achieving state-of-the-art performance in segmentation tasks by better encoding spatial relationships across complex geometries. These models now form the backbone of automated part segmentation pipelines, supporting robust joint detection, structural adaptation, and motion analysis in skeleton extraction systems [39]. Future directions include hybrid models that combine global transformer-based reasoning with local graph-based representations, as well as self-supervised learning approaches to reduce the dependence on labeled datasets, which is particularly valuable in anatomical or noisy scan scenarios.

2.4. Ensemble Methods for Deep Learning

Ensemble learning improves model consistency and robustness by aggregating predictions from multiple models, effectively mitigating the limitations of any single architecture [40]. In 3D point cloud segmentation, ensemble methods help address challenges from data sparsity, geometric variability, and occlusion [41].
Common ensemble techniques include bagging, which trains models on random subsets to reduce variance; boosting, which focuses on correcting prior errors; and stacking, where a meta-learner integrates predictions from diverse base models. In the segmentation of geometric data such as point clouds, ensemble methods help capture both global and local features across varying spatial resolutions. Techniques such as majority voting, unweighted averaging, and performance-based weighting are frequently used to combine outputs and reinforce consistent feature recognition [42].
Recent research explores more dynamic ensemble architectures, such as adaptive weighting schemes that adjust model contributions based on input complexity. Integrating ensemble learning with self-supervised techniques also offers promising directions, particularly for 3D segmentation scenarios where annotated data are limited. These hybrid approaches are expected to enhance generalization and enable scalable deployment in real-world geometry-centric applications [43]. In particular, frameworks combining self-supervised learning with ensemble pipelines show strong potential in domains such as robotics, medical imaging, and motion analysis.

3. Proposed Framework

This study proposes a two-stage pipeline for 3D kinematic skeleton extraction from point clouds, integrating modified deep learning architectures and ensemble learning strategies to ensure accuracy and robustness while maintaining computational efficiency (Figure 1).
Stage 1 involves preprocessing raw 3D models and segmenting them into 37 structural parts using an ensemble of Modified-PointNet models. Each variant is trained with different downsampling and augmentation settings to improve generalization across diverse input distributions.
Stage 2 focuses on joint estimation from the segmented parts. Spatial features are extracted to define 20 key joints, and a hierarchical structure is constructed to form a kinematic skeleton suitable for motion analysis and animation.
The segmentation-first strategy provides a rich anatomical structure that not only guides joint localization, but also improves interpretability and compatibility with downstream tasks such as pose simulation, biomechanics, and real-time animation. Additionally, centroid-based joint refinement enhances spatial accuracy, especially under noisy or incomplete conditions, offering a significant advantage over traditional curve skeleton and 2D-to-3D projection methods.
The following subsections detail each component of the pipeline, including data collection, ground truth segmentation, model architecture, ensemble strategy, joint estimation method, and evaluation metrics.

3.1. Data Collection

This study employs standard 3D model datasets to ensure diversity and robustness in point-cloud-based skeleton extraction. The selected datasets—CMU, Kids, D-FAUST, and EHF—offer variation in point distribution, model resolution, and structural features, enabling the segmentation model to generalize across a wide range of human body types. These datasets include differences in gender, height, and posture, which enhance the model’s ability to adapt to anatomical variability.
The primary selection criteria include high-resolution scans, complete anatomical structure, and coverage of diverse human poses. Furthermore, the datasets vary significantly in point density—from 6890 to 59,727 points per model—allowing the framework to perform effectively across both low- and high-resolution inputs.
In preparation for training, the raw 3D models are normalized to ensure consistent orientation and scale across datasets. From these curated sources, we construct a benchmark dataset used to train and evaluate the Modified-PointNet segmentation model. Each dataset and its role in this study are summarized in Table 1.

3.2. 3D Model Analysis and Ground Truth Generation for Part Segmentation

Following the collection of 3D models from diverse datasets with variations in point distribution, density, and structural characteristics, a segmentation map serving as ground truth was generated to support accurate kinematic skeleton extraction. Skeletal joint structures were outlined using motion tracking data from the Kinect SDK, with varying joint configurations depending on research context (e.g., 9, 12, 15, 20, 25, and 41 joints) [48]. The 20-joint framework was selected for its balance between motion mapping efficiency and smooth integration into animation systems, while also supporting automated skeletal generation for 3D models [49]. Joint localization and error analysis are anchored in the mapping between each segment and its corresponding joint.
Under the refinement steps, vertex groups were created from each model’s point cloud to align with the defined skeletal structure. Blender was used to generate these groups with high precision. The final segmentation structure comprises 20 segments aligned with skeletal joints and 17 additional structural regions, totaling 37 parts. This configuration is designed to conform to the anatomical hierarchy, facilitating accurate kinematic skeleton extraction. Figure 2a,b illustrate both the skeletal joint framework and the corresponding segmentation result.
In addition to segmentation, an automated labeling pipeline was developed to export vertex groups and assign joint-based labels. This process verifies group associations, resolves inconsistencies, and records outputs—including vertex coordinates and assigned labels—for use in later evaluation. Figure 3a outlines the labeling procedure, while Figure 3b shows the resulting labeled segmentation, comprising 20 joint-linked regions and 17 structural components, providing a solid foundation for joint estimation.

3.3. Point Cloud Preprocessing for 3D Model Segmentation

This section explains how point cloud data are prepared for the deep learning-based segmentation of 3D models. Because point clouds are high-dimensional and computationally intensive, preprocessing is essential to balance data fidelity, efficiency, and learning performance.
Reducing the number of points helps minimize computational overhead and accelerates training. However, aggressive downsampling can lead to the loss of critical geometric features, particularly near joints, extremities, or thin regions. To explore this trade-off, we evaluate how three different point cloud resolutions—512, 1024, and 2048 points—affect segmentation accuracy and geometric preservation. Alongside point count adjustments, this study investigates three distinct sampling strategies:
  • Random Sampling (RS) selects points uniformly at random without regard to spatial arrangement. While computationally simple [50], RS often discards important local features and fails to maintain structural continuity.
  • Farthest Point Sampling (FPS) iteratively selects the point farthest from all previously chosen ones, ensuring an even distribution across the model [51]. FPS is widely adopted in point cloud segmentation for its superior balance between global coverage and local feature retention.
  • Cluster-KNN Sampling clusters spatially adjacent points and samples within each cluster [52]. This method preserves local geometry and boundary integrity better than RS or FPS in low-density conditions, though at higher computational cost.
A comparative assessment of these strategies is conducted across all point resolutions to determine the optimal balance between segmentation accuracy, computational cost, and the preservation of structural detail. The outcomes of this evaluation directly inform the downstream tasks, particularly the quality of joint estimation.
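To make the sampling comparison concrete, the following is a minimal NumPy sketch of the FPS strategy, which the later experiments favor; the function name, the random seed point, and the example point counts are illustrative assumptions rather than details of our implementation.

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, n_samples: int) -> np.ndarray:
    """Iteratively pick the point farthest from all previously chosen points.

    points: (N, 3) array of coordinates; returns indices of the sampled subset.
    """
    n = points.shape[0]
    selected = np.zeros(n_samples, dtype=np.int64)
    selected[0] = np.random.randint(n)          # arbitrary seed point
    # Squared distance from every point to its nearest already-selected point.
    dist = np.full(n, np.inf)
    for i in range(1, n_samples):
        diff = points - points[selected[i - 1]]
        dist = np.minimum(dist, np.einsum("ij,ij->i", diff, diff))
        selected[i] = int(np.argmax(dist))      # farthest remaining point
    return selected

# Example: reduce a dense scan to the 1024-point resolution studied below.
cloud = np.random.rand(50000, 3).astype(np.float32)
sampled = cloud[farthest_point_sampling(cloud, 1024)]
```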

3.4. PointNet-Based Deep Learning Architecture for 3D Model Segmentation

This study employs a PointNet-based deep learning framework for the part-wise segmentation of 3D human models. PointNet is chosen for its capability to process raw, unordered point cloud data directly—without requiring voxelization, gridding, or image projection—making it highly suitable for irregular and complex geometries. Its architecture preserves point-wise features while enabling global shape reasoning, which is essential for segmenting the model into 37 anatomically relevant parts.
To improve segmentation accuracy and computational efficiency, we extend the standard PointNet model with several architectural modifications. These enhancements include residual connections, self-attention mechanisms, and optimized MLP layers. The following sections describe both the original PointNet architecture and the modified variants used in our ensemble.

3.4.1. Baseline Model: Standard PointNet Architecture for 3D Model Segmentation

The architecture is based on PointNet, a baseline model selected for its ability to process raw point cloud data directly without voxel conversion. This design supports the segmentation of highly detailed and structurally diverse 3D models. The architecture comprises the following core components:
(1) Transformation Block 1: This module normalizes raw input by learning transformations related to scale, rotation, and position, enhancing the stability of downstream feature encoding.
(2) Convolutional Blocks: A sequence of three convolutional layers (with filter sizes of 64, 128, and 128) extracts hierarchical point features. This enables the model to capture both local and global geometric structures, facilitating better part differentiation.
(3) Transformation Block 2: Learned features are further refined to improve spatial consistency across diverse 3D inputs, reducing segmentation errors.
(4) Multi-Layer Perceptron (MLP) Layer: This stage captures complex point relationships using ReLU activations and batch normalization for regularization.
(5) Conv1D Layer and Output Scores: The final layer assigns class scores to each point, producing a dense segmentation map of the human model.
Figure 4 illustrates the standard PointNet architecture and the data flow through its key components. These foundational elements support the enhanced models used in subsequent stages of segmentation and skeleton extraction.
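For concreteness, a condensed PyTorch sketch of this baseline is given below. The per-point filter sizes (64, 128, 128), the global max pooling, and the Conv1D output head follow the description above; the two T-Net transformation blocks are omitted for brevity, and the 256-unit head width is an assumption.

```python
import torch
import torch.nn as nn

class PointNetSegSketch(nn.Module):
    """Simplified PointNet segmentation baseline (transformation blocks omitted)."""

    def __init__(self, num_classes: int = 37):
        super().__init__()
        # Convolutional blocks: per-point features with filter sizes 64, 128, 128.
        self.feat = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
        )
        # MLP stage over concatenated local + global features, then class scores.
        self.head = nn.Sequential(
            nn.Conv1d(256, 256, 1), nn.BatchNorm1d(256), nn.ReLU(),
            nn.Conv1d(256, num_classes, 1),     # per-point output scores
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local = self.feat(x)                                   # (B, 128, N)
        global_feat = local.max(dim=2, keepdim=True).values    # global max pool
        fused = torch.cat([local, global_feat.expand(-1, -1, x.shape[2])], dim=1)
        return self.head(fused)                                # (B, classes, N)

scores = PointNetSegSketch()(torch.randn(2, 3, 1024))          # dense seg map
```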

3.4.2. Architectural Enhancements to PointNet

Building upon the standard PointNet structure, this section outlines the key architectural enhancements designed to support more expressive feature learning and spatial adaptability. These modifications address the limitations of the original model in handling complex anatomical geometry, particularly in the dense part-wise segmentation of 3D human models. The major improvements include the following:
Convolutional Layers and Residual Networks: The convolutional layers are optimized for point cloud feature extraction. Residual connections help mitigate vanishing gradient issues, enabling deeper architectures without substantially increasing parameter count. Batch normalization further stabilizes training, and auxiliary structures help preserve fine spatial details during segmentation [53,54].
Multi-Layer Perceptron (MLP) Block: MLP layers capture point-level interactions and non-linear spatial relationships. ReLU activation facilitates complex feature learning, while dropout is applied to reduce overfitting.
Attention Mechanism Integration: To enhance the model’s ability to prioritize relevant spatial features, a self-attention mechanism is incorporated. This enables the network to shift focus dynamically between local and global point relationships, which is especially beneficial in highly variable 3D structures [55].
Transformation Networks: A two-stage transformation network is embedded to align unstructured input data and refine spatial features. This improves consistency across models and supports accurate segmentation and joint alignment.
Together, these enhancements strengthen PointNet’s capability to manage the spatial complexity and anatomical variability of human 3D models, forming the basis for the ensemble segmentation strategy introduced in Section 3.5.
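A minimal PyTorch sketch of two of these enhancements, a residual per-point convolution block and a self-attention block, is shown below; the channel widths and single-head attention are illustrative assumptions, and the exact Conv_block configurations are given in Sections 3.4.3, 3.4.4 and 3.4.5.

```python
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    """Per-point Conv1d layers with a skip connection to ease gradient flow."""

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, 1), nn.BatchNorm1d(channels), nn.ReLU(),
            nn.Conv1d(channels, channels, 1), nn.BatchNorm1d(channels),
        )
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.body(x) + x)       # residual connection

class AttentionBlock(nn.Module):
    """Self-attention over points, re-weighting local features by global context."""

    def __init__(self, channels: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads=1, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, N)
        t = x.transpose(1, 2)                   # (B, N, C) for attention
        out, _ = self.attn(t, t, t)             # query = key = value = points
        return (t + out).transpose(1, 2)        # residual, back to (B, C, N)
```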

3.4.3. Medium-Modified PointNet Architecture

Building on the enhancements described in Section 3.4.2, the Medium-Modified PointNet integrates both a Residual Convolutional Block and an Attention Block to improve segmentation performance. The residual design mitigates vanishing gradients, supporting deeper learning without training instability [56]. The attention mechanism further guides the model to prioritize spatially relevant regions, enhancing feature discrimination and reducing sensitivity to input noise [57]. As shown in Figure 5a, this architecture employs a Residual Convolutional Block (Conv_block 64 × 10) for deep feature extraction, followed by an Attention Block with 512 filters to refine feature maps. Final segmentation is performed through MLP and Conv1D layers that assign per-point class scores.

3.4.4. Small-Modified PointNet Architecture

Optimized for efficiency, Small-Modified PointNet reduces computational overhead while maintaining reasonable segmentation performance. This variant is especially suitable for real-time applications or systems with limited processing capabilities.
To achieve this trade-off, several components were simplified. The transformation blocks were made shallower to reduce spatial alignment complexity. A compact residual block (Conv_block 32 × 10) was employed to minimize computation during feature extraction, and the MLP layer was reduced to 256 units to lower memory usage while preserving essential spatial representation. As shown in Figure 5b, this streamlined design balances segmentation accuracy with lightweight deployment requirements.

3.4.5. Large-Modified PointNet Architecture

Large-Modified PointNet builds upon the Medium-Modified version by deepening its convolutional structure to support more complex and fine-grained feature extraction. This configuration targets high-precision segmentation tasks, particularly for large-scale 3D human models with intricate geometry.
Key upgrades include an expanded convolutional pipeline composed of Conv_block layers with filter sizes of 64, 128, and 256, enhancing the network’s ability to capture subtle geometric variations across multiple spatial resolutions. The Residual Convolutional Block (Conv_block 64 × 10) is retained to stabilize training and reduce gradient degradation. Additionally, the MLP layer is increased to 512 units, enabling richer spatial encoding and high-dimensional feature learning. As shown in Figure 5c, this architecture improves interpretability and adaptability in demanding 3D segmentation workflows.
Table 2 presents a detailed comparison of the three Modified PointNet variants developed in this study. Each architecture incorporates a Residual Block and varies in terms of convolutional depth, MLP size, and other design components to meet specific segmentation and efficiency requirements.
The inclusion of residual connections in all versions supports stable training, while differences in depth and attention handling allow for adaptation to diverse segmentation scenarios.

3.4.6. Data Augmentation for Improved Generalization

To enhance generalization and mitigate overfitting, this study applies data augmentation techniques to 3D point cloud models, increasing the model’s adaptability to real-world variations. The augmentation strategies include random translation [58], which shifts point clouds along different axes to promote positional invariance, and anisotropic scaling [15], which adjusts axis-specific dimensions to simulate shape diversity. Random rotation [37] improves robustness to viewpoint changes, while uniform noise injection [59] introduces controlled distortions that reflect real-world imperfections. Combined, these augmentations boost segmentation accuracy and contribute to more reliable kinematic skeleton extraction.
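The sketch below applies these four augmentations to a single point cloud; the magnitudes (rotation range, scale bounds, shift range, and noise level) are assumptions chosen for illustration rather than the settings used in our training runs.

```python
import numpy as np

def augment(points: np.ndarray, noise: float = 0.005) -> np.ndarray:
    """Apply the four augmentations of Section 3.4.6 to one (N, 3) point cloud."""
    # Random rotation about the vertical axis for viewpoint robustness.
    theta = np.random.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    points = points @ rot.T
    # Anisotropic scaling: a separate factor per axis simulates shape diversity.
    points = points * np.random.uniform(0.9, 1.1, size=(1, 3))
    # Random translation along each axis to promote positional invariance.
    points = points + np.random.uniform(-0.1, 0.1, size=(1, 3))
    # Uniform noise injection to mimic real-world scan imperfections.
    points = points + np.random.uniform(-noise, noise, size=points.shape)
    return points.astype(np.float32)
```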

3.4.7. Optimization Algorithm Selection for Deep Learning

Segmentation accuracy and training stability are significantly influenced by the choice of optimization algorithm and learning rate schedule. To evaluate performance trade-offs, three optimization strategies were explored: constant learning rate, exponential decay, and cyclical schedule with cosine annealing. Among them, Adaptive Moment Estimation (Adam) [60] was selected as the baseline optimizer due to its proven ability to assign parameter-specific learning rates and maintain stable convergence during high-dimensional learning tasks.
To further enhance training dynamics, an exponential decay schedule [61] was applied to gradually reduce the learning rate across epochs, maintaining model momentum while avoiding overshooting, a strategy commonly used in deep point cloud processing. In addition, a cyclical learning rate schedule with cosine annealing [62] was implemented, allowing the model to periodically explore different learning rates and avoid local minima. This technique has shown significant improvements in generalization, especially when paired with ensemble training in 3D vision models.
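In PyTorch, such a training setup might look as follows; the stand-in model, base learning rate, decay factor, and restart period are illustrative assumptions, and only one scheduler would be active per experiment.

```python
import torch
import torch.nn as nn

model = nn.Conv1d(3, 37, 1)                    # stand-in for a Modified PointNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Exponential decay: shrink the learning rate by a fixed factor every epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

# Alternative -- cyclical schedule with cosine annealing and warm restarts,
# letting the model periodically revisit higher rates to escape local minima:
# scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=50)

for epoch in range(300):
    # ... forward pass, segmentation loss, optimizer.step() over all batches ...
    scheduler.step()                           # advance the schedule once per epoch
```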
These adaptive strategies contributed to consistent and effective training across all Modified PointNet variants. The training protocol aligns with best practices established in prior work on 3D segmentation [63] and was validated using three standard metrics: training time, point-wise classification accuracy, and mean Intersection over Union (mIoU). Point-wise accuracy quantifies the proportion of correctly predicted points, while mIoU evaluates spatial overlap between predicted and actual segments, serving as a reliable indicator in segmentation tasks involving noisy or complex data structures [64].

3.5. Ensemble Learning Strategies for Modified PointNet in 3D Segmentation

In the final phase of Stage 1, an ensemble learning framework aggregates the outputs of three Modified PointNet models: Small, Medium, and Large. This framework leverages the complementary strengths of each variant to enhance segmentation robustness, reduce the likelihood of misclassification, and improve performance across diverse spatial configurations.
To encourage diversity among ensemble members, each Modified PointNet variant is trained multiple times using randomized subsets of the dataset. This approach enables individual models to learn distinct feature representations. For computational feasibility, the total number of models is constrained such that it does not exceed three times the size of the largest base model included in the ensemble.
As a means of further enhancing segmentation quality, three ensemble methods are applied, each combining the predictive strengths of the Small-, Medium-, and Large-Modified PointNet architectures. The ensemble techniques include majority voting (unweighted), the unweighted averaging of class probabilities, and adaptive weighting based on performance metrics such as accuracy and mIoU. These strategies are evaluated comparatively to assess their effectiveness in handling noise, pose variation, and model diversity.
(1) Majority Vote (Unweighted Ensemble Method): In this approach, segmentation is determined by selecting the most frequently predicted class label across the ensemble models [40]. Each model contributes equally to the decision, and the final output is based on a simple majority rule. The ensemble prediction $\hat{y}$ is defined as follows:

$$\hat{y} = \mathrm{mode}(\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_k)$$

where $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_k$ represent the predicted class labels at a given point from each of the $k$ ensemble models. The $\mathrm{mode}(\cdot)$ function returns the class label that occurs most frequently across models for each point.
This voting-based method is straightforward and widely used in ensemble classification tasks due to its simplicity and interpretability. While it does not consider prediction confidence or class probability distributions, it remains a practical baseline in segmentation scenarios. In this study, the majority vote strategy was implemented and formally evaluated as part of the ensemble methods.
(2) Unweighted Average (Simple Averaging Ensemble Method): The predicted class is determined by computing the mean of the SoftMax output probabilities from all ensemble members [65]. Equal weighting across models helps mitigate prediction variance, reduce model-specific bias, and enhance segmentation smoothness in complex structures. The ensemble output $\hat{y}$ is defined as follows:

$$\hat{y} = \frac{1}{k}\sum_{i=1}^{k} \hat{y}_i$$

where $\hat{y}_i$ denotes the per-point SoftMax probability vector predicted by the $i$-th model and $k$ is the total number of ensemble models. The resulting $\hat{y}$ is a class probability vector, which is then converted into a predicted label by selecting the class with the highest probability score.
This unweighted averaging strategy, often referred to as Soft Voting, is widely adopted in ensemble learning for both classification and segmentation tasks. In the context of 3D point cloud segmentation, Atik and Duran [59] demonstrated that soft averaging across deep models leads to improved semantic consistency and greater robustness against noisy predictions.
(3) Weighted Average Ensemble Method: Model weights in traditional ensembles are often assigned arbitrarily or held constant. This method replaces that rigidity with a performance-sensitive mechanism that assigns weights dynamically, guided by metrics—specifically, mIoU and accuracy—captured during evaluation. By anchoring the weighting process in measured model behavior, the ensemble reflects actual performance rather than assumptions.

What differentiates this method is its inclusive approach to ensemble construction. Models that exhibit lower performance are retained within the ensemble with proportionally reduced weights. By retaining all models, the system can incorporate additional spatial information that higher-performing models alone may fail to represent. The weight $w_i$ for each model in the ensemble is computed as follows:

$$w_i = \frac{\left(a_i - \min(a)\right)^{\beta}}{\sum_{j=1}^{n}\left(a_j - \min(a)\right)^{\beta}} \times (1 - \alpha) + \frac{\alpha}{n}$$

where $a$ is the vector of accuracy values for all models, $a_i$ represents the accuracy of the $i$-th model, $\min(a)$ denotes the lowest accuracy value in the accuracy vector, $\beta$ is a scaling parameter that controls how accuracy differences influence weight distribution, $\alpha$ ensures that every model retains a minimum non-zero weight, preventing models from being completely excluded, and $n$ is the total number of models in the ensemble.

The term $\left(a_i - \min(a)\right)^{\beta}$ allows for a flexible and dynamic weighting mechanism, ensuring that models with higher accuracy contribute more while still retaining input from all models. The inclusion of $\alpha/n$ prevents models from being entirely ignored, preserving useful segmentation patterns even from models with lower accuracy.

This approach differs from conventional Weighted Average Ensemble Methods in that it incorporates a flexible scaling mechanism $\beta$ and a minimum weight assignment $\alpha/n$, making it adaptable to varying dataset characteristics. This adaptive weighting enhances segmentation accuracy, prediction stability, and model robustness. Compared to majority voting and traditional weighted averaging, it provides a more flexible, data-driven approach that is well suited to high-precision kinematic skeleton extraction in 3D point cloud segmentation. Figure 6 illustrates the effect of varying $\alpha$ and $\beta$ on the distribution of ensemble weights across models.
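To make the three strategies concrete, the NumPy sketch below implements hard voting, unweighted soft voting, and the adaptive weighting defined above; the $\alpha$ and $\beta$ defaults and the mock inputs are illustrative assumptions, not the values used in our experiments.

```python
import numpy as np

def majority_vote(labels: np.ndarray) -> np.ndarray:
    """(1) Hard voting: per-point mode over labels of shape (k, N)."""
    fused = np.empty(labels.shape[1], dtype=labels.dtype)
    for p in range(labels.shape[1]):
        fused[p] = np.argmax(np.bincount(labels[:, p]))  # most frequent class
    return fused

def soft_vote(probs: np.ndarray) -> np.ndarray:
    """(2) Unweighted averaging of SoftMax outputs of shape (k, N, C)."""
    return probs.mean(axis=0).argmax(axis=1)             # equal model weights

def adaptive_weights(acc: np.ndarray, alpha: float = 0.1, beta: float = 2.0) -> np.ndarray:
    """(3) Performance-based weights with a minimum-weight floor of alpha/n."""
    shifted = (acc - acc.min()) ** beta                  # amplify accuracy gaps
    total = shifted.sum()
    main = shifted / total if total > 0 else np.full_like(acc, 1.0 / len(acc))
    return main * (1.0 - alpha) + alpha / len(acc)       # sums to one

def weighted_vote(probs: np.ndarray, acc: np.ndarray) -> np.ndarray:
    """Fuse (k, N, C) probabilities using the adaptive weights."""
    return np.tensordot(adaptive_weights(acc), probs, axes=1).argmax(axis=1)

# Mock example: seven models, 1024 points, 37 part classes.
k, n, c = 7, 1024, 37
probs = np.random.dirichlet(np.ones(c), size=(k, n))     # fake SoftMax outputs
acc = np.array([0.790, 0.801, 0.795, 0.812, 0.787, 0.805, 0.799])
assert np.isclose(adaptive_weights(acc).sum(), 1.0)      # convex combination
fused_labels = weighted_vote(probs, acc)                 # (N,) final labels
```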

3.6. Kinematic Skeleton Extraction from 3D Models Based on Hierarchical Segmentation

Following the segmentation stage using an ensemble deep learning model, this study proceeds to extract kinematic skeletons through a hierarchical segmentation framework. The ensemble-based model first segments the 3D human models into 37 structural components, each corresponding to specific body regions. Based on this segmentation, joint positions are estimated by computing the centroids of 20 predefined skeletal regions. To ensure the anatomical accuracy of the extracted skeletons, these estimated joint locations are validated against ground truth data, supporting reliable motion analysis and downstream applications.

3.6.1. Model Testing on DanceDB Dataset

To evaluate the effectiveness of the proposed segmentation-based skeleton extraction pipeline, we test the trained ensemble model on the independent DanceDB [13] dataset. Since DanceDB was not included in the training phase and provides annotated 3D joint positions, it offers a robust basis for objective validation. Joint localization accuracy is assessed using the Mean Per Joint Position Error (MPJPE), a widely accepted metric that quantifies the average Euclidean distance between the predicted joint positions and their ground truth counterparts. Representative 3D model samples from the DanceDB dataset are shown in Figure 7.

3.6.2. Centroid Calculation for Joint-Specific Point Clouds

Skeletal joint estimation is performed by segmenting the 3D model into 20 regions, each representing a specific joint. The spatial centroid of each segmented region is computed as an approximation of the joint position [66], using the standard centroid formulas as shown in Equations (4)–(6):
$$C_x = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad (4)$$
$$C_y = \frac{1}{N}\sum_{i=1}^{N} y_i \qquad (5)$$
$$C_z = \frac{1}{N}\sum_{i=1}^{N} z_i \qquad (6)$$
where $N$ represents the total number of points in the segmented region and $C_x$, $C_y$, $C_z$ denote the centroid coordinates along each axis.

3.6.3. Centroid Refinement with Radius Limitation

To enhance precision in joint estimation, we propose a refinement technique that addresses limitations in standard centroid computation, especially under noisy or uneven point cloud distributions. Unlike conventional methods that assign equal weight to all points, this approach incorporates radius-based filtering to exclude spatial outliers and prioritize structurally relevant points. The refinement process involves the following steps:
(1) Compute the average radius of each joint-specific point cluster to define a local reference scale.
(2) Set a dynamic threshold radius as twice the average to accommodate variations across joints.
(3) Discard all points beyond this threshold to eliminate outliers and noise.
(4) Recalculate the centroid using only the refined subset of points.
This adaptive refinement improves the anatomical accuracy of joint localization by anchoring the centroid to the most relevant spatial structure. It significantly improves robustness in datasets with irregular geometry and enhances applicability in fields such as biomechanics, motion analysis, and 3D animation.
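The four refinement steps above reduce to a few lines of NumPy; this is a sketch of the procedure as described, with the function name and array conventions being our own.

```python
import numpy as np

def refined_joint_centroid(points: np.ndarray) -> np.ndarray:
    """Limit-Radius centroid for one joint-specific cluster of (N, 3) points."""
    centroid = points.mean(axis=0)                   # Equations (4)-(6)
    radii = np.linalg.norm(points - centroid, axis=1)
    threshold = 2.0 * radii.mean()                   # dynamic threshold radius
    inliers = points[radii <= threshold]             # discard spatial outliers
    return inliers.mean(axis=0)                      # recomputed refined centroid
```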

3.6.4. Evaluation of Extracted Joint Accuracy Using MPJPE

To assess the accuracy of the extracted kinematic skeleton, the predicted joint centroids are compared against ground truth joint positions provided by the DanceDB dataset. The Mean Per Joint Position Error (MPJPE) is employed as the evaluation metric, which computes the average Euclidean distance between predicted and actual joint coordinates. The MPJPE [67] is formally defined as follows:
$$\mathrm{MPJPE} = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{M}\sum_{j=1}^{M} \left\lVert \hat{p}_{i,j} - p_{i,j} \right\rVert_2$$
where $N$ is the total number of samples, $M$ is the number of predicted joint positions, $\hat{p}_{i,j}$ denotes the predicted joint position, and $p_{i,j}$ denotes the corresponding ground truth.
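In NumPy, the metric can be computed directly; the batched array shapes below are our assumption for evaluation over $N$ samples and $M$ joints.

```python
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """MPJPE in the units of the input coordinates (millimeters in Section 4).

    pred, gt: (N, M, 3) predicted and ground truth joint positions.
    """
    # Euclidean error per joint, averaged over joints and then over samples.
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```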

4. Results

4.1. Evaluation of Structure-Preserving Preprocessing Techniques for 3D Model Part Segmentation

This section analyzes how different downsampling strategies influence the preservation of 3D structure in point clouds, an essential factor for successful part segmentation. Specifically, we assess Random Sampling (RS), Farthest Point Sampling (FPS), and Cluster-KNN, each applied at varying densities (512, 1024, and 2048 points) to investigate their structural retention capabilities. The impact of each method is illustrated in Figure 8, Figure 9 and Figure 10.
(1) Effects of Random Sampling (RS) on 3D Model Representation
To evaluate the effect of RS on model fidelity, we progressively reduced point cloud density and analyzed the resulting shapes. At 512 points (Figure 8b), the model shows significant structural degradation: limbs become coarse and facial features blur. With 1024 points (Figure 8c), the overall geometry improves slightly, though distortions remain around curved surfaces. At 2048 points (Figure 8d), the structure appears more cohesive, striking a balance between efficiency and spatial detail.
Because RS ignores geometric context, structurally critical regions may be underrepresented or completely omitted. This non-uniform degradation directly affects the model’s ability to distinguish parts during segmentation, highlighting the importance of selecting sampling techniques that preserve semantic structure.
(2) Effects of Farthest Point Sampling (FPS) on 3D Model Representation
Following the same evaluation procedure, we assessed how FPS impacts structural preservation under varying point densities. At 512 points (Figure 9b), the global structure remains largely intact, although fine-grained details are noticeably reduced. With 1024 points (Figure 9c), both local and global features become more distinguishable. At 2048 points (Figure 9d), the representation closely approximates the original, preserving surface continuity and shape integrity.
Unlike RS, FPS selects points that are maximally distant from previously sampled points, resulting in an even distribution across the model. This uniformity enhances the retention of both geometric form and surface detail, attributes that are critical for accurate part segmentation. As such, FPS is particularly suitable for tasks that demand structural fidelity and balanced point coverage, aligning well with the goals of this segmentation framework.
(3) Effects of Cluster-KNN Sampling on 3D Model Representation
Cluster-KNN Sampling was evaluated through progressive downsampling to assess its ability to preserve model integrity. At 512 points (Figure 10b), the overall structure remains discernible, though fine details—particularly in extremities such as hands and feet—are diminished. With 1024 points (Figure 10c), the model exhibits sharper feature boundaries and improved spatial coherence. At 2048 points (Figure 10d), most of the original detail is preserved, with clearly defined part boundaries and stable geometry.
What distinguishes Cluster-KNN from RS and FPS is its emphasis on local neighborhoods, ensuring that critical regions receive adequate point coverage during sampling. This targeted approach improves structural preservation without incurring significant computational overhead. These characteristics make Cluster-KNN especially effective in segmentation tasks where maintaining detail and boundary clarity is essential for accurate part labeling.

4.2. Comparison of Deep Learning Model Performance and Ensemble Learning for 3D Model Part Segmentation

This section presents a systematic evaluation of the proposed Modified PointNet architectures for 3D model part segmentation. The experiments investigate how architectural complexity, optimization strategies, and data augmentation levels affect segmentation performance. The analysis is grounded in clearly defined metrics, including accuracy, mean Intersection over Union (mIoU), and training time, and is directly tied to the objectives of the framework established in Section 3.

4.2.1. Experimental Setup

The evaluation used a curated dataset of 2104 segmented 3D point clouds spanning 37-part classes. These were split into 1683 samples for training, 210 for validation, and 211 for testing. Each model was trained using point clouds downsampled to 1024 points via Farthest Point Sampling (FPS). All models were trained for 300 epochs with a batch size of 32 using the Adam optimizer. Three learning rate strategies were explored: constant, exponential decay, and cyclical (with cosine annealing). Data augmentation was applied in varying intensities (×1, ×5, ×10) using random translation, anisotropic scaling, rotation, and uniform noise.
Three architectural variants were evaluated. The Small-Modified PointNet (17.81 M parameters) was designed for low-resource applications; the Medium-Modified PointNet (37.91 M parameters) balanced performance and efficiency; and the Large-Modified PointNet (263.63 M parameters) aimed to maximize segmentation accuracy with high model capacity.

4.2.2. Performance Evaluation of Deep Learning Models for 3D Model Part Segmentation

This study presents an optimized PointNet-based deep learning architecture for part segmentation in 3D models, integrating model optimization techniques and data augmentation strategies to enhance segmentation accuracy. The experiment aims to analyze the impact of different optimization algorithms, learning rate schedules, and levels of data augmentation on the model’s performance.
The baseline PointNet (36.81 M parameters) was compared with three variants: Small-Modified PointNet, Medium-Modified PointNet, and Large-Modified PointNet. Key performance indicators included training time, segmentation accuracy, and mean Intersection over Union (mIoU).
Table 3 presents the results without data augmentation. Across all three optimizer configurations (Adam, Adam with Learning Rate Schedule, and Adam with Cyclical Schedule), the Medium-Modified PointNet consistently outperformed the baseline in both accuracy and mIoU. For example, using the cyclical learning rate, it reached 73.21% accuracy and 0.4319 mIoU compared to PointNet’s 59.45% and 0.3156, respectively. The addition of learning rate scheduling clearly enhanced performance, especially when used with architectural improvements.
Table 4 shows performance across different augmentation levels (×1, ×5, ×10) using various PointNet architectures. The highest overall accuracy—78.85%—and mIoU of 0.5016 were achieved by the Large-Modified PointNet when trained with the Adam optimizer and a learning rate schedule under 10× data augmentation. However, this configuration required over 4.5 h of training. In contrast, the Medium-Modified PointNet achieved a slightly lower but still competitive result—77.52% accuracy and 0.4881 mIoU—under the same augmentation level and learning rate schedule, with a significantly shorter training time of just 1 h and 24 min.
These findings indicate that the Medium-Modified PointNet offers the best trade-off between segmentation accuracy and computational efficiency, especially when combined with a learning rate schedule and high-level augmentation. Compared to the baseline PointNet, which achieved only 68.45% accuracy and 0.395 mIoU under the same optimizer and augmentation level, the improvements are substantial. Additionally, the cyclical and decaying schedules were less consistent in performance. This configuration was retained as the experimental baseline for further evaluation in Section 4.2.3, which explores the impact of different point cloud preprocessing methods under fixed model and training conditions.

4.2.3. Performance Evaluation with Different Preprocessing Techniques

Building on the selected baseline configuration, this experiment evaluates the impact of different point cloud preprocessing methods on segmentation performance. Three downsampling techniques were tested: Random Sampling (RS), Farthest Point Sampling (FPS), and Cluster-KNN. Table 5 presents results across Small-, Medium-, and Large-Modified PointNet models. All models were trained under a standardized setup—1000 epochs, a batch size of 32, and using the Adam optimizer with a learning rate schedule—to ensure a fair comparison and enhance convergence across varying model complexities.
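Among the three methods, FPS serves as the reference sampler for the rest of the pipeline. A minimal NumPy sketch of the standard greedy algorithm follows: it repeatedly selects the point farthest from those already chosen, which spreads samples evenly over the surface.

```python
import numpy as np

def farthest_point_sampling(points, k, seed=0):
    """Greedy FPS over an (N, 3) cloud: keep a running distance from each
    point to the selected set and always pick the farthest point next."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    selected = np.empty(k, dtype=np.int64)
    selected[0] = rng.integers(n)
    dist = np.linalg.norm(points - points[selected[0]], axis=1)
    for i in range(1, k):
        selected[i] = int(np.argmax(dist))
        new_dist = np.linalg.norm(points - points[selected[i]], axis=1)
        dist = np.minimum(dist, new_dist)  # distance to nearest selected point
    return points[selected]

cloud = np.random.default_rng(1).random((6890, 3))
sampled = farthest_point_sampling(cloud, 1024)  # 1024 points, as in the experiments
```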
Across all models, FPS delivered the highest accuracy and mIoU while maintaining an acceptable training time. For example, the Medium-Modified model with FPS and 1024 points reached 79.25% accuracy and 0.5159 mIoU, significantly outperforming RS, which scored only 68.42% and 0.3838. Cluster-KNN performed well at lower densities (e.g., Small-Modified model at 512 points: 72.18% accuracy, 0.4746 mIoU) but lost effectiveness at higher densities.
The Large-Modified model with FPS achieved the best raw performance (80.27% accuracy and 0.5178 mIoU at 1024 points) but with extremely high computational cost: over 2.5 h of training. These findings reinforce FPS as the most balanced approach for maintaining structural consistency, segmentation precision, and training efficiency across varying model complexities.
These results highlight that no single preprocessing technique consistently outperforms the others across all model sizes and input densities. While FPS offers the most reliable segmentation accuracy, Cluster-KNN is advantageous in preserving fine-grained structures at lower resolutions. Random Sampling, despite its simplicity, remains the least effective. This variability confirms that model performance is sensitive to preprocessing choices and supports the use of ensemble learning to integrate the strengths of multiple methods.
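Because the exact Cluster-KNN variant is not fully specified here, the sketch below shows one common cluster-based sampler consistent with the approach in [52]: k-means clustering followed by keeping the point nearest each cluster centre, so every local region contributes one representative. Treat the clustering details as assumptions.

```python
import numpy as np

def cluster_sample(points, k, iters=10, seed=0):
    """Cluster-based downsampling sketch (assumed variant): run k-means,
    then snap each cluster centre to the closest real point."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centre.
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each centre to the mean of its assigned points.
        for c in range(k):
            members = points[labels == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    # Keep the real point nearest each centre (duplicates possible in rare ties).
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return points[d.argmin(axis=0)]

cloud = np.random.default_rng(2).random((6890, 3))
sampled = cluster_sample(cloud, 512)
```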

4.2.4. Ensemble Learning for Enhanced 3D Model Segmentation

After identifying FPS and Medium-Modified PointNet as the optimal configuration, ensemble learning was applied using Majority Vote, Unweighted Average, mIoU Weighting, and Accuracy Weighting strategies. Seven ensemble setups were evaluated using Small-, Medium-, or Large-Modified models (Table 6).
This table confirms that the seven-model ensemble using accuracy-based weighting yields the highest segmentation performance. While performance improves with more models, the marginal gain beyond five models is small relative to the added computational cost.
The best performance was achieved using Accuracy Weighting with an ensemble of seven Medium-Modified PointNet models, reaching 81.25% accuracy and 0.5380 mIoU. This exceeded the best single-model configuration by over 2% in accuracy and 0.0221 in mIoU. However, this improvement came at the cost of increased training time and computational resources: training the seven-model ensemble took over 11.5 h.
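As an illustration, accuracy-based weighting can be sketched as a weighted average of per-point class probabilities, with weights derived from each model's validation accuracy. The softmax form and temperature beta below mirror the contrast control shown in Figure 6 but are assumptions, not the paper's exact formula; beta = 0 reduces to an unweighted average.

```python
import numpy as np

def accuracy_weighted_vote(probs_per_model, accuracies, beta=1.0):
    """Fuse per-point class probabilities from M models, weighting each model
    by its validation accuracy. beta sharpens the contrast between models."""
    w = np.exp(beta * np.asarray(accuracies))
    w /= w.sum()
    stacked = np.stack(probs_per_model)       # (M, N, C)
    fused = np.tensordot(w, stacked, axes=1)  # (N, C) weighted average
    return fused.argmax(axis=-1)              # per-point part labels

# Seven models, 1024 points, 37 part classes (accuracies are placeholders).
rng = np.random.default_rng(3)
probs = [rng.random((1024, 37)) for _ in range(7)]
accs = [0.792, 0.788, 0.790, 0.785, 0.791, 0.787, 0.789]
labels = accuracy_weighted_vote(probs, accs)
```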
Beyond five ensemble models, performance improvements diminished; accuracy increased by only 0.16% and mIoU by 0.0019, despite over 3 additional hours of training. This highlights the trade-off between segmentation quality and computational scalability.
Figure 11 illustrates segmentation results from the best-performing ensemble. While the output demonstrated improved part boundary definition, minor misclassifications remained, particularly at joint edges, as indicated by the red highlights in Figure 11c.
Overall, the results confirm that ensemble learning, particularly when weighted by model accuracy, enhances segmentation quality. However, its practical use must consider resource constraints, as improvements in accuracy come with significant increases in training cost.
Based on the evaluation in Section 4.2, the Medium-Modified PointNet—especially in its ensemble form with accuracy-based weighting—offered the most effective balance between segmentation accuracy, computational efficiency, and generalization. These segmentation outcomes provide a solid foundation for the subsequent stage of this study: estimating kinematic joint positions from the segmented parts. In the next section, we assess how well this segmentation framework supports anatomically consistent skeleton reconstruction and precise joint localization.
These results clearly illustrate the effectiveness of the proposed segmentation pipeline in achieving structurally consistent and anatomically aware part labeling. The ensemble approach—particularly when using accuracy-based weighting with the Medium-Modified PointNet—delivered consistent improvements across all metrics, including segmentation accuracy and mIoU. Notably, the accuracy increased by over 2% compared to single-model setups, while mIoU reached 0.5380. This confirms the benefit of leveraging architectural diversity and adaptive weighting in ensemble configurations. Furthermore, the observed diminishing returns beyond five models emphasize the importance of balancing performance gains with computational cost. Overall, the segmentation results strongly validate the robustness and scalability of the proposed method across varying model complexities and input resolutions.

4.3. Development and Performance Evaluation of Kinematic Skeleton Extraction by Joint Position Estimation from 3D Model Part Segmentation

This section presents the kinematic skeleton extraction pipeline based on segmented 3D human models and evaluates its performance across multiple datasets. The evaluation consists of three parts: (1) joint estimation using centroid-based refinement, (2) qualitative skeleton evaluation, and (3) comparative analysis against benchmark methods.

4.3.1. Joint Estimation Using Centroid-Based Refinement

Following the identification of the ensemble of seven Medium-Modified PointNet models as the most effective configuration for 3D part segmentation, this study extends the evaluation to a downstream task: extracting a kinematic skeleton by estimating joint positions from segmented 3D models. To evaluate generalization performance, the model was applied to the DanceDB dataset, which had not been included in any part of the training process. It provides ground truth joint coordinates captured via the Kinect SDK, with annotations for 20 standard skeletal joints.
The segmentation model was initially applied to extract 37 structural components from each 3D input. Joint centers were then estimated by computing the centroid of the point clusters corresponding to each target joint. A Limit-Radius constraint was applied during centroid computation to reduce segmentation-related errors: misclassified or distant points were filtered out, improving the accuracy of joint estimation and reducing noise in regions with overlapping structures.
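A minimal sketch of centroid computation with a Limit-Radius filter is given below. The specific radius rule (a multiple of the median point-to-centroid distance) is an illustrative assumption.

```python
import numpy as np

def limit_radius_centroid(part_points, radius_factor=1.0):
    """Estimate a joint as the centroid of its segmented part, then recompute
    it using only points within a limit radius of the initial estimate.
    The radius rule is an assumption for illustration."""
    center = part_points.mean(axis=0)
    d = np.linalg.norm(part_points - center, axis=1)
    radius = radius_factor * np.median(d)
    inliers = part_points[d <= radius]  # drop misclassified or outlying points
    return inliers.mean(axis=0) if len(inliers) else center

part = np.random.default_rng(4).normal(size=(200, 3))
joint = limit_radius_centroid(part)
```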
Joint estimation accuracy was measured by comparing predicted positions to the ground truth from DanceDB. Table 7 summarizes the comparative breakdown using two centroid computation techniques: (1) Standard Centroid Calculation and (2) Centroid with Limit-Radius, which confines computation to local regions to reduce outlier effects.
Together, model refinement, ensemble learning, and precise centroid estimation significantly improve the accuracy of skeleton extraction. Although the segmentation model achieved an accuracy of 81.25%, residual misclassifications remained; the Limit-Radius method excluded these points during centroid computation, enhancing the spatial reliability of joint placement.
The results in Table 7 demonstrate that both model complexity and ensemble strategy contribute to reducing Mean Per Joint Position Error (MPJPE), but the refinement of centroid computation has the greatest impact. The proposed Limit-Radius method alone reduced MPJPE by over 10 mm in all configurations, confirming its critical role in filtering out segmentation noise and stabilizing joint localization. These findings validate the robustness of the segmentation-to-skeleton pipeline and justify the further qualitative examination of skeleton structure consistency across varied poses.
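For completeness, the MPJPE reported in Table 7 is the mean Euclidean distance between predicted and ground truth joint positions. A direct computation, shown here on synthetic data:

```python
import numpy as np

def mpjpe(predicted, ground_truth):
    """Mean Per Joint Position Error: average Euclidean distance between
    predicted and ground truth joints (here, 20 Kinect-style joints)."""
    return float(np.linalg.norm(predicted - ground_truth, axis=-1).mean())

rng = np.random.default_rng(5)
gt = rng.random((20, 3)) * 1000                      # joint positions in mm
pred = gt + rng.normal(scale=20, size=gt.shape)      # perturbed predictions
print(f"MPJPE = {mpjpe(pred, gt):.2f} mm")
```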

4.3.2. Qualitative Skeleton Structure Assessment

Figure 12 illustrates failure cases where (a) ground truth joint locations are shown, (b) segmentation errors—particularly in overlapping regions such as inner thighs and bent elbows—are highlighted, and (c) the extracted skeleton is presented using the Centroid with Limit-Radius method. Most joints were accurately localized, though slight deviations appeared in occluded or geometrically complex areas.
In contrast, Figure 13 presents successful examples that demonstrate the method’s robustness across varied postures. The extracted joints closely aligned with the original geometry, exhibiting symmetry and structural integrity. The Limit-Radius constraint was particularly effective in removing misclassified points and preserving spatial consistency, especially in challenging poses involving body contact or non-upright limbs.
Although not reflected in MPJPE, these qualitative results reinforce the anatomical plausibility and visual reliability of the extracted skeletons, particularly in configurations where numerical metrics may be insufficient.

4.3.3. Comparative Analysis Against Benchmark Methods

To assess the effectiveness of the proposed method, a comparative analysis was conducted against two traditional algorithms (L1-Medial [18] and Pinocchio [20]) and one modern image-based approach (MediaPipe [30]). Due to fundamental differences in skeleton representation and joint definitions across these methods, a direct numerical comparison was not feasible. Therefore, we adopted a qualitative evaluation based on visual fidelity, structural consistency, and robustness across varied poses.
Figure 14 shows skeletons from point-cloud-based methods. L1-Medial, relying on medial axis and Manhattan distances, performs reasonably on upright poses but fails in overlapping or bent configurations. Pinocchio uses KD-Tree structures to compute distances from points to the surface, then places non-nested spheres inside the model to define its skeletal core. To keep the skeleton intact, the spheres are connected through Gabriel graph constraints. However, this method typically requires the model to have grounded and upright legs (i.e., feet in contact with the base plane and extended downward) to initialize the skeleton correctly. When legs are bent, overlapping, or detached from the ground, the system often fails to generate a connected lower body structure, resulting in missing or misaligned joints.
Figure 15 presents comparisons with MediaPipe. While MediaPipe demonstrates flexibility across poses and input types, its reliance on 2D-to-3D projection often compromises spatial coherence: joint positions occasionally drifted, although the overall structure remained intact and joint locations still followed the original geometry reasonably well. The methods discussed earlier also tended to struggle on dense or disorganized meshes, whereas the proposed method held up more reliably in those regions. Operating fully within the 3D domain, our method produces skeletons with greater anatomical and spatial integrity.
In summary, L1-Medial is computationally simple and performs well on upright poses with spatial clarity but tends to fail in complex or occluded configurations due to its reliance on medial axis assumptions. Pinocchio offers better structural preservation in models with distinct limbs but breaks down in cases of overlap or severe articulation. MediaPipe shows general robustness in diverse poses through deep learning-based landmark detection; yet, its 2D-to-3D lifting step can lead to spatial inaccuracies, especially when parts are occluded. The proposed method outperforms these alternatives by estimating joint positions directly in 3D space, incorporating local refinement through the Limit-Radius constraint. This yields higher anatomical consistency and spatial reliability, even under non-standard postures or dense meshes.
Collectively, the experimental results validate the strength of the proposed skeleton extraction framework, which combines part-wise segmentation and centroid-based joint localization with a novel refinement strategy. The Limit-Radius method proved essential in reducing spatial noise and improving the precision of joint estimation. Furthermore, qualitative comparisons with traditional and state-of-the-art methods—including L1-Medial, Pinocchio, and MediaPipe—demonstrate that the proposed approach maintains higher anatomical accuracy and structural coherence. These outcomes confirm its generalization ability and practical applicability in real-world 3D animation, motion capture, and biomechanical applications.

5. Conclusions

This study presents a 3D skeleton extraction pipeline that integrates hierarchical part segmentation with modified PointNet architectures, followed by centroid-based joint localization enhanced through a Limit-Radius refinement. Unlike conventional 2D-to-3D lifting or medial-axis simplification approaches, our method operates entirely within the native 3D domain, enabling structural consistency across a wide range of poses, including highly articulated or overlapping configurations. However, it is important to note that the segmentation ground truth used in this study was manually generated based on Kinect skeletal mapping. While this allowed for anatomically aligned labels, it also introduces potential scalability limitations and subjective bias. Furthermore, the current model has not yet been thoroughly evaluated on highly deformed or fragmented point cloud data, which may affect robustness in less-structured or noisy scenarios.
To enhance segmentation accuracy and robustness, we implemented an ensemble of Small-, Medium-, and Large-Modified PointNet models. These models were trained with varying architectural complexities and data augmentations, and their outputs were aggregated using adaptive weighting strategies based on segmentation performance. This ensemble approach resulted in cleaner segmentation boundaries, reduced misclassifications in ambiguous regions, and improved downstream joint estimation accuracy. To support reproducibility and facilitate further research, we will make our manually labeled segmentation dataset—derived from standardized Kinect skeletal mapping—available upon request.
Quantitative evaluation—particularly on the unseen DanceDB dataset—demonstrated strong generalization capabilities, with the proposed method achieving a Mean Per Joint Position Error (MPJPE) of 22.82 mm. In challenging configurations involving twisted limbs, dense geometries, or close contact between body parts, the system maintained structural integrity and outperformed baseline methods such as L1-Medial, Pinocchio, and MediaPipe. Unlike these benchmarks, which rely on either medial-axis heuristics or 2D-to-3D projection, our method performs all joint estimation directly in 3D space, avoiding projection-related distortion and ensuring spatial integrity. In addition to MPJPE, qualitative assessments based on visual fidelity and structural symmetry further support the method’s anatomical coherence in ambiguous or occluded postures. While the method shows strong potential for integration into downstream applications in animation, motion tracking, and biomechanics—especially where symmetry and joint accuracy are critical—further investigation is needed to evaluate its robustness on non-anatomically plausible or highly fragmented models, which remains a future research direction. Future work may also explore extending this framework to multi-subject environments, partial scans, or real-time inference pipelines for broader applicability.
Overall, the results emphasize the advantages of operating directly within the 3D domain for skeletal structure extraction. The proposed method consistently delivers precise joint localization and remains robust across a wide range of poses and anatomical variations. Its ability to preserve left–right anatomical symmetry provides a critical advantage for downstream applications that require bilateral consistency, ranging from motion tracking and biomechanical modeling to character rigging in animation. This symmetry-conscious design makes the method particularly well suited for real-world scenarios where structural fidelity is essential, such as physical therapy analysis, ergonomic simulation, and autonomous human modeling in robotics.
While the current approach demonstrates strong performance in both segmentation and skeleton extraction, future work may explore integration with temporal data for motion tracking, as well as testing generalization on noisy or incomplete scans. Further optimization may also target reduced training cost while preserving structural accuracy in complex poses.

Author Contributions

Conceptualization, S.T. and N.M.; Methodology, S.T. and N.M.; Software, N.M.; Validation, S.T. and N.M.; Formal analysis, S.T. and N.M.; Investigation, N.M.; Resources, N.M.; Data curation, N.M.; Writing—original draft preparation, S.T. and N.M.; Writing—review and editing, S.T. and N.M.; Visualization, N.M.; Supervision, N.M.; Project administration, S.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data are available upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Li, C.; Zhou, M.; Geng, G.; Xie, Y.; Zhang, Y.; Liu, Y. EPCS: Endpoint-based part-aware curve skeleton extraction for low-quality point clouds. Comput. Graph. 2023, 117, 209–221.
2. Wen, C.; Yu, B.; Tao, D. Learnable skeleton-aware 3D point cloud sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023.
3. Xu, Z.; Zhou, Y.; Kalogerakis, E.; Landreth, C.; Singh, K. Rignet: Neural rigging for articulated characters. arXiv 2020, arXiv:2005.00559.
4. Fang, Y.; Tang, J.; Shen, W.; Shen, W.; Gu, X.; Song, L.; Zhai, G. Dual attention guided gaze target detection in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021.
5. Hu, H.; Li, Z.; Jin, X.; Deng, Z.; Chen, M.; Shen, Y. Curve skeleton extraction from 3D point clouds through hybrid feature point shifting and clustering. Comput. Graph. Forum 2020, 39, 111–132.
6. Liu, L.; Chen, N.; Ceylan, D.; Theobalt, C.; Wang, W.; Mitra, N.J. CurveFusion: Reconstructing thin structures from RGBD sequences. arXiv 2021, arXiv:2107.05284.
7. Bærentzen, A.; Rotenberg, E. Skeletonization via Local Separators. ACM Trans. Graph. (TOG) 2021, 40, 1–18.
8. Li, Z.; Liu, S.; Bai, J.; Peng, C.; Li, Y.; Du, S. A Novel Skeleton-based Model with Spine for 3D Human Pose Estimation. In Proceedings of the 2022 IEEE 12th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, 26–29 January 2022.
9. Bian, S.; Zheng, A.; Chaudhry, E.; You, L.; Zhang, J.J. Automatic generation of dynamic skin deformation for animated characters. Symmetry 2018, 10, 89.
10. Cai, Y.; Ge, L.; Liu, J.; Cai, J.; Cham, T.-J.; Yuan, J.; Thalmann, N.M. Exploiting spatial-temporal relationships for 3D pose estimation via graph convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019.
11. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019.
12. Li, R.; Si, W.; Weinmann, M.; Klein, R. Constraint-based optimized human skeleton extraction from single-depth camera. Sensors 2019, 19, 2604.
13. Mahmood, N.; Ghorbani, N.; Troje, N.F.; Pons-Moll, G.; Black, M.J. AMASS: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019.
14. Kolotouros, N.; Pavlakos, G.; Black, M.; Daniilidis, K. Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019.
15. Zhao, H.; Jiang, L.; Jia, J.; Torr, P.H.S.; Koltun, V. Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021.
16. Cai, Z.; Pan, L.; Wei, C.; Yin, W.; Hong, F.; Zhang, M.; Loy, C.C.; Yang, L.; Liu, Z. Pointhps: Cascaded 3D human pose and shape estimation from point clouds. arXiv 2023, arXiv:2308.14492.
17. Zhang, F.; Chen, X.; Zhang, X. Parallel thinning and skeletonization algorithm based on cellular automaton. Multimedia Tools Appl. 2020, 79, 33215–33232.
18. Huang, H.; Wu, S.H.; Cohen-Or, D.; Gong, M.L.; Zhang, H.; Li, G.Q.; Chen, B.Q. L1-medial skeleton of point cloud. ACM Trans. Graph. (TOG) 2013, 32, 65:1–65:8.
19. Manolas, I.; Lalos, A.S.; Moustakas, K. Parallel 3D Skeleton Extraction Using Mesh Segmentation. In Proceedings of the 2018 International Conference on Cyberworlds (CW), Singapore, 3–5 October 2018.
20. Baran, I.; Popović, J. Automatic rigging and animation of 3D characters. ACM Trans. Graph. (TOG) 2007, 26, 72-es.
21. Wang, J.-D.; Chao, J.; Chen, Z.-G. Feature-preserving skeleton extraction algorithm for point clouds. J. Graph. 2023, 44, 146.
22. Qi, C.; Su, H.; Mo, K.; Guibas, L. Pointnet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
23. Zhou, Y.; Tuzel, O. Voxelnet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
24. Li, Y.; Bu, R.; Sun, M.; Wu, W.; Di, X.; Chen, B. Pointcnn: Convolution on x-transformed points. Adv. Neural Inf. Process. Syst. 2018, 31, 828–838.
25. Wu, W.; Qi, Z.; Fuxin, L. Pointconv: Deep convolutional networks on 3D point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019.
26. Tian, W.; Gao, Z.; Tan, D. Single-view multi-human pose estimation by attentive cross-dimension matching. Front. Neurosci. 2023, 17, 1201088.
27. Qin, H.; Zhang, S.; Liu, Q.; Chen, L.; Chen, B. PointSkelCNN: Deep learning-based 3D human skeleton extraction from point clouds. Comput. Graph. Forum 2020, 39, 363–374.
28. Lin, C.; Li, C.; Liu, Y.; Chen, N.; Choi, Y.K.; Wang, W. Point2skeleton: Learning skeletal representations from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021.
29. Yu, X.; Tang, L.; Rao, Y.; Huang, T.; Zhou, J.; Lu, J. Point-bert: Pre-training 3D point cloud transformers with masked point modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022.
30. Lugaresi, C.; Tang, J.; Nash, H.; McClanahan, C.; Uboweja, E.; Hays, M.; Zhang, F.; Chang, C.L.; Yong, M.G.; Lee, J.; et al. Mediapipe: A framework for building perception pipelines. arXiv 2019, arXiv:1906.08172.
31. Ma, X.; Qin, C.; You, H.; Ran, H.; Fu, Y. Rethinking network design and local geometry in point cloud: A simple residual MLP framework. arXiv 2022, arXiv:2202.07123.
32. Sakae, Y.; Noda, Y.; Li, L.; Hasegawa, K.; Nakada, S.; Tanaka, S. Realizing Uniformity of 3D Point Clouds Based on Improved Poisson-Disk Sampling. In Methods and Applications for Modeling and Simulation of Complex Systems: 19th Asia Simulation Conference, AsiaSim 2019, Singapore, 30 October–1 November 2019, Proceedings 19; Springer: Singapore, 2019.
33. Chen, X.; Chen, M. Skeleton partition models for 3D printing. In Proceedings of the Second International Conference on Medical Imaging and Additive Manufacturing (ICMIAM 2022), Xiamen, China, 25–27 February 2022.
34. Han, X.F.; Cheng, H.; Jiang, H.; He, D.; Xiao, G. Pcb-randnet: Rethinking random sampling for lidar semantic segmentation in autonomous driving scene. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Kobe, Japan, 13–17 May 2024.
35. Battikh, M.S.; Lensky, A.; Hammill, D.; Cook, M. knn-res: Residual neural network with knn-graph coherence for point cloud registration. In Principle and Practice of Data and Knowledge Acquisition Workshop; Springer: Singapore, 2024.
36. Han, M.; Wang, L.; Xiao, L.; Zhang, H.; Zhang, C.; Xu, X.; Zhu, J. QuickFPS: Architecture and algorithm co-design for farthest point sampling in large-scale point clouds. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2023, 42, 4011–4024.
37. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process. Syst. 2017, 30, 5099–5108.
38. Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S.E.; Bronstein, M.M.; Solomon, J.M. Dynamic graph cnn for learning on point clouds. ACM Trans. Graph. (TOG) 2019, 38, 1–12.
39. Guo, Y.; Wang, H.; Hu, Q.; Liu, H.; Liu, L.; Bennamoun, M. Deep learning for 3D point clouds: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 4338–4364.
40. Ganaie, M.A.; Hu, M.; Malik, A.K.; Tanveer, M.; Suganthan, P.N. Ensemble deep learning: A review. Eng. Appl. Artif. Intell. 2022, 115, 105151.
41. Atik, M.E.; Duran, Z. An Efficient Ensemble Deep Learning Approach for Semantic Point Cloud Segmentation Based on 3D Geometric Features and Range Images. Sensors 2022, 22, 6210.
42. Koguciuk, D.; Chechliński, Ł.; El-Gaaly, T. 3D object recognition with ensemble learning—A study of point cloud-based deep learning models. In International Symposium on Visual Computing; Springer: Cham, Switzerland, 2019.
43. Ruan, Y.; Singh, S.; Morningstar, W.; Alemi, A.A.; Ioffe, S.; Fischer, I.; Dillon, J.V. Weighted ensemble self-supervised learning. arXiv 2022, arXiv:2211.09981.
44. Bogo, F.; Romero, J.; Pons-Moll, G.; Black, M.J. Dynamic FAUST: Registering human bodies in motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
45. Pavlakos, G.; Choutas, V.; Ghorbani, N.; Bolkart, T.; Osman, A.A.; Tzionas, D.; Black, M.J. Expressive body capture: 3D hands, face, and body from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019.
46. Carnegie Mellon University. Carnegie Mellon Motion Capture Database. Available online: http://mocap.cs.cmu.edu (accessed on 21 May 2024).
47. Lähner, Z.; Rodola, E.; Bronstein, M.M.; Cremers, D.; Burghard, O.; Cosmo, L.; Dieckmann, A.; Klein, R.; Sahillioǧlu, Y. SHREC’16: Matching of deformable shapes with topological noise. In Eurographics Workshop on 3D Object Retrieval, EG 3DOR; Eurographics Association: Goslar, Germany, 2016.
48. Wang, Z.; Ma, M.; Feng, X.; Li, X.; Liu, F.; Guo, Y.; Chen, D. Skeleton-based human pose recognition using channel state information: A survey. Sensors 2022, 22, 8738.
49. Tölgyessy, M.; Dekan, M.; Chovanec, Ľ. Skeleton tracking accuracy and precision evaluation of kinect v1, kinect v2, and the azure kinect. Appl. Sci. 2021, 11, 5756.
50. Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. Learning semantic segmentation of large-scale point clouds with random sampling. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 8338–8354.
51. Li, J.; Zhou, J.; Xiong, Y.; Chen, X.; Chakrabarti, C. An adjustable farthest point sampling method for approximately-sorted point cloud data. In Proceedings of the 2022 IEEE Workshop on Signal Processing Systems (SiPS), Rennes, France, 2–4 November 2022.
52. Mahdaoui, A.; Sbai, E.H. 3D Point Cloud Simplification Based on k-Nearest Neighbor and Clustering. Adv. Multimed. 2020, 2020, 8825205.
53. Desai, A.; Parikh, S.; Kumari, S.; Raman, S. PointResNet: Residual network for 3D point cloud segmentation and classification. arXiv 2022, arXiv:2211.11040.
54. Gezawa, A.S.; Liu, C.; Jia, H.; Nanehkaran, Y.A.; Almutairi, M.S.; Chiroma, H. An improved fused feature residual network for 3D point cloud data. Front. Comput. Neurosci. 2023, 17, 1204445.
55. Xie, Y.; Yang, B.; Guan, Q.; Zhang, J.; Wu, Q.; Xia, Y. Attention mechanisms in medical image segmentation: A survey. arXiv 2023, arXiv:2305.17937.
56. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
57. Feng, M.; Zhang, L.; Lin, X.; Gilani, S.Z.; Mian, A. Point attention network for semantic segmentation of 3D point clouds. Pattern Recognit. 2020, 107, 107446.
58. Kim, S.; Lee, S.; Hwang, D.; Lee, J.; Hwang, S.J.; Kim, H.J. Point cloud augmentation with weighted local transformations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021.
59. Zhang, Z.; Hua, B.-S.; Yeung, S.-K. Shellnet: Efficient point cloud convolutional neural networks using concentric shells statistics. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019.
60. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
61. Loshchilov, I.; Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983.
62. Smith, L.N. Cyclical learning rates for training neural networks. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 24–31 March 2017.
63. Thomas, H.; Qi, C.R.; Deschaud, J.E.; Marcotegui, B.; Goulette, F.; Guibas, L.J. Kpconv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019.
64. Engelmann, F.; Kontogianni, T.; Hermans, A.; Leibe, B. Exploring spatial context for 3D semantic segmentation of point clouds. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017.
65. Ju, C.; Bibaut, A.; van der Laan, M. The relative performance of ensemble methods with deep convolutional neural networks for image classification. J. Appl. Stat. 2018, 45, 2800–2818.
66. Yang, Y.; Sun, S.; Huang, T.; Qian, L.; Liu, K. Method for measuring the center of mass and moment of inertia of a model using 3D point clouds. Appl. Opt. 2022, 61, 10329–10336.
67. Pavllo, D.; Feichtenhofer, C.; Grangier, D.; Auli, M. 3D human pose estimation in video with temporal convolutions and semi-supervised training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019.
Figure 1. Workflow of the proposed method for kinematic skeleton extraction from 3D models.
Figure 2. Skeletal joint structure and ground truth segmentation. (a) The 20-joint skeletal structure used for kinematic skeleton extraction, with joints numbered 1–20 for reference. (b) Ground truth segmentation of 3D models, dividing them into 37 distinct parts based on alignment with the skeletal structure.
Figure 3. Automated workflow for 3D point cloud segmentation and labeling: (a) Flowchart depicting the procedure for exporting vertex labels based on pre-defined skeletal groups. (b) Ground truth segmentation result comprising 37 anatomical regions, labeled numerically: 20 joint-based and 17 structural components.
Figure 4. Architecture of the baseline PointNet model for segmentation.
Figure 5. Modified PointNet architectures proposed in this study: (a) Medium-Modified PointNet; (b) Small-Modified PointNet; and (c) Large-Modified PointNet. The versions differ in depth, complexity, and component arrangement to support different trade-offs between segmentation performance and computational efficiency.
Figure 6. Effect of β and α on ensemble weight distribution. Larger β increases contrast between models, while larger α enforces uniformity across weights.
Figure 7. DanceDB 3D models used to evaluate generalization performance in joint localization.
Figure 8. Illustration of the effects of Random Sampling (RS) on 3D model representation at different point cloud densities: (a) original model with full structural details; (b) downsampled to 512 points; (c) downsampled to 1024 points; and (d) downsampled to 2048 points. Higher point densities retain more structural details, while lower densities result in the loss of fine features.
Figure 9. Illustration of the effects of Farthest Point Sampling (FPS) on 3D model representation at different point cloud densities: (a) original model; (b) downsampled to 512 points; (c) downsampled to 1024 points; and (d) downsampled to 2048 points. Higher point densities result in better structural consistency.
Figure 10. Effects of Cluster-KNN Sampling on 3D model representation at different point densities: (a) original model; (b) downsampled to 512 points; (c) downsampled to 1024 points; and (d) downsampled to 2048 points. Increasing point density improves structural continuity and segmentation accuracy.
Figure 11. Example of 3D model part segmentation results using the ensemble of seven Medium-Modified PointNet models: (a) ground truth segmentation; (b) predicted segmentation result; and (c) segmentation errors, with misclassified points in red and correctly classified points in blue.
Figure 12. Kinematic skeleton extraction results using the Centroid with Limit-Radius method: (a) ground truth; (b) segmentation error; and (c) extracted skeleton.
Figure 13. Skeleton extraction results using the proposed method in successful cases. The extracted joints align closely with the original 3D geometry and exhibit structural symmetry.
Figure 14. Skeleton extraction results from 3D model point clouds using L1, Pinocchio, and the proposed method.
Figure 15. Skeleton extraction results using the MediaPipe and proposed methods: (a) MediaPipe output. (b) Proposed method output.
Table 1. Details of 3D models collected from standard datasets.

| Dataset Name | Number of Points | Gender | Model Characteristics | Total Poses |
|---|---|---|---|---|
| DFaust [44] | 6890 | Male, Female | Multi-Person | 100 |
| DFaust [44] | 10,475 | Male, Female | Multi-Person | 673 |
| EHF [45] | 10,475 | Male | Single Person | 100 |
| CMU [46] | 10,475 | Male, Female | Multi-Person | 1199 |
| Kids [47] | 59,727 | Male | Multi-Person | 32 |
| Total | | | | 2104 |
Table 2. Architectural comparison and trade-offs among the Modified PointNet variants.

| Aspect | Small-Modified | Medium-Modified | Large-Modified |
|---|---|---|---|
| Target Use Case | Real-time or low-resource systems | Balanced performance and efficiency | High-precision segmentation tasks |
| Convolutional + Residual Layers | 32 × 10 (residual), 64, 128 | 64 × 10 (residual), 128, 128 | 64 × 10 (residual), 128, 256 |
| Attention Block | 512 filters | 512 filters | 512 filters |
| MLP Size | 256 units | 128 units | 512 units |
| Transformation Blocks | Shallower | Standard | Standard |
| Global Max Pooling | Yes | Yes | Yes |
| Advantages | Lightweight, efficient, low latency | Good accuracy, stable training | Rich features, high spatial resolution |
| Disadvantages | Less accurate in complex geometry | Higher load than small, less detail than large | High computational cost, longer training time |
Table 3. Performance comparison of 3D model segmentation using different optimizers in Modified PointNet without data augmentation.

| Optimization Algorithm | Architecture | Training Time | Accuracy Rate (%) | mIoU |
|---|---|---|---|---|
| Adam | PointNet | 8:53 | 56.88 | 0.2862 |
| Adam | Small-Modified PointNet | 7:13 | 68.84 | 0.3968 |
| Adam | Medium-Modified PointNet | 9:32 | 72.17 | 0.4221 |
| Adam | Large-Modified PointNet | 30:32 | 71.57 | 0.4115 |
| Adam + Learning Rate Schedule | PointNet | 9:48 | 58.98 | 0.3156 |
| Adam + Learning Rate Schedule | Small-Modified PointNet | 7:34 | 70.51 | 0.3998 |
| Adam + Learning Rate Schedule | Medium-Modified PointNet | 9:53 | 72.98 | 0.4338 |
| Adam + Learning Rate Schedule | Large-Modified PointNet | 30:53 | 73.13 | 0.4322 |
| Adam + Cyclical Learning Rate Schedule | PointNet | 8:30 | 59.45 | 0.3156 |
| Adam + Cyclical Learning Rate Schedule | Small-Modified PointNet | 7:55 | 70.69 | 0.4015 |
| Adam + Cyclical Learning Rate Schedule | Medium-Modified PointNet | 10:14 | 73.21 | 0.4319 |
| Adam + Cyclical Learning Rate Schedule | Large-Modified PointNet | 31:14 | 73.71 | 0.4357 |
Table 4. Performance comparison of 3D model segmentation using different optimizers and data augmentation in Modified PointNet. Columns are grouped by augmentation level (×1, ×5, ×10).

| Optimization Algorithm | Architecture | Time (×1) | Acc (%) (×1) | mIoU (×1) | Time (×5) | Acc (%) (×5) | mIoU (×5) | Time (×10) | Acc (%) (×10) | mIoU (×10) |
|---|---|---|---|---|---|---|---|---|---|---|
| Adam | PointNet | 9:32 | 60.33 | 0.3218 | 35:45 | 67.89 | 0.3872 | 1:10:23 | 70.21 | 0.4221 |
| Adam | Small-Modified PointNet | 7:43 | 71.30 | 0.3950 | 32:15 | 75.67 | 0.4544 | 1:03:45 | 77.32 | 0.4823 |
| Adam | Medium-Modified PointNet | 10:02 | 75.21 | 0.4552 | 42:36 | 77.56 | 0.4798 | 1:24:27 | 78.21 | 0.4952 |
| Adam | Large-Modified PointNet | 31:02 | 73.43 | 0.4407 | 2:17:06 | 76.21 | 0.4627 | 4:33:27 | 77.02 | 0.4809 |
| Adam + Learning Rate Schedule | PointNet | 9:56 | 61.25 | 0.3405 | 37:50 | 68.45 | 0.3950 | 1:12:30 | 71.34 | 0.4278 |
| Adam + Learning Rate Schedule | Small-Modified PointNet | 8:04 | 73.77 | 0.4300 | 32:27 | 76.97 | 0.4709 | 1:03:48 | 77.25 | 0.4823 |
| Adam + Learning Rate Schedule | Medium-Modified PointNet | 10:23 | 76.17 | 0.4601 | 42:57 | 77.01 | 0.4677 | 1:24:48 | 77.52 | 0.4881 |
| Adam + Learning Rate Schedule | Large-Modified PointNet | 31:23 | 75.33 | 0.4590 | 2:17:27 | 77.78 | 0.4834 | 4:33:48 | 78.85 | 0.5016 |
| Adam + Cyclical Learning Rate Schedule | PointNet | 9:57 | 64.78 | 0.3554 | 38:45 | 69.52 | 0.3996 | 1:13:07 | 72.89 | 0.4367 |
| Adam + Cyclical Learning Rate Schedule | Small-Modified PointNet | 8:25 | 71.21 | 0.3976 | 32:48 | 75.12 | 0.4518 | 1:04:09 | 77.34 | 0.4889 |
| Adam + Cyclical Learning Rate Schedule | Medium-Modified PointNet | 10:44 | 75.62 | 0.4547 | 43:18 | 76.95 | 0.4678 | 1:25:09 | 77.78 | 0.4872 |
| Adam + Cyclical Learning Rate Schedule | Large-Modified PointNet | 31:44 | 73.65 | 0.4420 | 2:17:48 | 76.32 | 0.4644 | 4:34:09 | 78.23 | 0.4921 |
Table 5. Comparison of part segmentation performance across Modified PointNet architectures with different point cloud preprocessing methods. Training times are in hh:mm:ss; columns are grouped by the number of input points (512, 1024, 2048).

| Architecture | Sampling Method | Time (512) | Acc (%) (512) | mIoU (512) | Time (1024) | Acc (%) (1024) | mIoU (1024) | Time (2048) | Acc (%) (2048) | mIoU (2048) |
|---|---|---|---|---|---|---|---|---|---|---|
| Small-Modified PointNet (17.81 M parameters) | RS | 29:17 | 63.60 | 0.3160 | 55:57 | 64.52 | 0.3157 | 1:19:58 | 65.03 | 0.3398 |
| Small-Modified PointNet (17.81 M parameters) | FPS | 29:41 | 76.54 | 0.4523 | 57:27 | 77.18 | 0.4830 | 1:24:37 | 77.05 | 0.4822 |
| Small-Modified PointNet (17.81 M parameters) | Cluster-KNN | 31:01 | 72.18 | 0.4746 | 1:10:27 | 75.12 | 0.5028 | 2:12:28 | 75.03 | 0.4978 |
| Medium-Modified PointNet (37.91 M parameters) | RS | 52:15 | 65.48 | 0.3474 | 1:39:14 | 68.42 | 0.3838 | 2:33:06 | 68.74 | 0.3857 |
| Medium-Modified PointNet (37.91 M parameters) | FPS | 52:39 | 78.78 | 0.5120 | 1:40:44 | 79.25 | 0.5159 | 2:37:45 | 77.71 | 0.4955 |
| Medium-Modified PointNet (37.91 M parameters) | Cluster-KNN | 53:59 | 73.97 | 0.4957 | 1:53:44 | 75.69 | 0.5151 | 3:25:36 | 74.25 | 0.4624 |
| Large-Modified PointNet (263.63 M parameters) | RS | 2:04:16 | 72.28 | 0.4812 | 2:29:14 | 74.49 | 0.4965 | 5:23:08 | 73.29 | 0.4857 |
| Large-Modified PointNet (263.63 M parameters) | FPS | 2:04:40 | 78.92 | 0.5125 | 2:30:44 | 80.27 | 0.5178 | 5:27:47 | 80.07 | 0.5164 |
| Large-Modified PointNet (263.63 M parameters) | Cluster-KNN | 2:06:00 | 74.81 | 0.4862 | 2:43:44 | 75.63 | 0.4929 | 6:15:38 | 75.14 | 0.4923 |
Table 6. Performance comparison of 3D model part segmentation using ensemble learning with Modified PointNet. MV = Majority Vote; UA = Unweighted Average; mIoU-W = mIoU Weighted; Acc-W = Accuracy Weighted.

| Ensemble Modified PointNet | Parameter Size (M) | Training Time (hh:mm:ss) | MV Acc (%) | MV mIoU | UA Acc (%) | UA mIoU | mIoU-W Acc (%) | mIoU-W mIoU | Acc-W Acc (%) | Acc-W mIoU |
|---|---|---|---|---|---|---|---|---|---|---|
| Ensemble 3 Large | 790.89 | 7:27:48 | 80.69 | 0.5304 | 80.93 | 0.5332 | 80.97 | 0.5343 | 80.96 | 0.5342 |
| Ensemble 3 Medium | 113.73 | 4:57:48 | 80.61 | 0.5281 | 80.79 | 0.5302 | 80.79 | 0.5305 | 80.80 | 0.5308 |
| Ensemble 5 Medium | 189.55 | 8:14:52 | 80.94 | 0.5336 | 81.05 | 0.5355 | 81.06 | 0.5359 | 81.09 | 0.5361 |
| Ensemble 7 Medium | 265.37 | 11:31:56 | 81.15 | 0.5374 | 81.22 | 0.5374 | 81.23 | 0.5379 | 81.25 | 0.5380 |
| Ensemble 5 Small | 89.05 | 4:38:27 | 77.58 | 0.4783 | 77.80 | 0.4806 | 77.87 | 0.4816 | 77.81 | 0.4809 |
| Ensemble 9 Small | 160.29 | 8:19:27 | 77.74 | 0.4809 | 77.85 | 0.4814 | 77.87 | 0.4817 | 77.88 | 0.4816 |
| Ensemble 13 Small | 231.53 | 12:00:27 | 77.97 | 0.4840 | 78.10 | 0.4847 | 78.13 | 0.4853 | 78.12 | 0.4848 |
Table 7. Comparison of kinematic skeleton extraction accuracy using the proposed method.

| Modified PointNet Architecture | MPJPE (mm), Centroid Without Limit-Radius | MPJPE (mm), Centroid with Limit-Radius |
|---|---|---|
| PointNet | 51.81 | 38.87 |
| Small-Modified PointNet | 45.07 | 34.51 |
| Medium-Modified PointNet | 40.38 | 29.57 |
| Large-Modified PointNet | 39.57 | 28.85 |
| AW Ensemble 3 Large-Modified PointNet | 36.21 | 25.40 |
| AW Ensemble 7 Medium-Modified PointNet | 33.39 | 22.82 |
